
A Quick Statistical Dive into Offensive Inconsistency

After the Cubs’ unceremonious exit from the playoffs, a lot of the talk has been about how inconsistent the offense has been, culminating in the replacement of the hitting coach. In my quest to better understand the field of statistics, I thought I would take a quick dive into the numbers and share with you all what I found. Disclaimer: I am not a fully trained statistician, and thus the probability of errors is non-trivial. All data taken from either baseball-reference.com or retrosheet.org.

To get a sense of what the Cubs’ offense has done, let’s take a look at a plot of their runs scored per game:

[Chart: Cubs runs scored per game, 2018]
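(For anyone who wants to build a chart like this at home, here's a minimal sketch; the cubs_runs list is a placeholder, with the real per-game totals coming from the game logs at baseball-reference.com or retrosheet.org.)

```python
# Count how many games a team scored 0 runs, 1 run, 2 runs, etc.
# cubs_runs is a placeholder stand-in for a full 162-game log pulled
# from baseball-reference.com or retrosheet.org.
from collections import Counter

cubs_runs = [1, 3, 0, 5, 1, 2, 7, 1, 4, 3]

counts = Counter(cubs_runs)
for runs in range(max(counts) + 1):
    print(f"{runs:2d} runs | {'#' * counts.get(runs, 0)}")
```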

Well, that second bar (1 run scored) definitely looks ominously high. Just to be sure, let’s compare it to the plot of all of MLB for 2018:

[Chart: runs scored per game, all MLB games, 2018]

There’s a pretty stark difference there - the highest bar is 3 runs scored, with a smooth drop-off on both sides. Clearly, something is up. Perhaps the Cubs just didn’t score a lot of runs? Let’s take a look at their average runs per game, relative to the rest of the league. I added the MLB average and the average of playoff teams.

[Chart: average runs per game by team, 2018, with MLB and playoff-team averages marked]

Team            Runs Per Game
Cubs            4.67 (11th)
MLB Average     4.45
Playoff Teams   4.93
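(If you want to reproduce the averages yourself, the calculation boils down to something like this sketch; the team names and run totals below are placeholders, not the actual 2018 numbers.)

```python
# Runs per game for one team, the whole league, and the playoff field.
# Every list here is a placeholder standing in for a full 162-game log.
def runs_per_game(games):
    return sum(games) / len(games)

team_games = {
    "Cubs":    [1, 5, 3, 0, 7],
    "Brewers": [2, 4, 1, 6, 3],
    "Marlins": [0, 2, 1, 3, 2],
}
playoff_field = {"Cubs", "Brewers"}

league = [r for games in team_games.values() for r in games]
playoffs = [r for team, games in team_games.items() if team in playoff_field for r in games]

print(f"Cubs: {runs_per_game(team_games['Cubs']):.2f} R/G")
print(f"MLB average: {runs_per_game(league):.2f} R/G")
print(f"Playoff teams: {runs_per_game(playoffs):.2f} R/G")
```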

Some things to notice about this graph: all of the playoff teams were in the top 12 in runs scored, which stands to reason. The AL East powerhouses stood head and shoulders above everyone else. The Cubs only outscored one other playoff team (the Brewers). Miami sucks. (Not super relevant to our analysis, but I couldn't resist.) Based on just those points, we could be justified in saying that the Cubs scored just enough to get into the playoffs, but couldn't stand with the big boys once there.

Now, we could leave it there, but where's the fun in that? There are so many other ways we can play around with the numbers, and a few of them might even give us some tiny insight into reality (or baseball, whichever you think is more important). Of course, there are many, many dead ends when using statistics, where the model you generate has little to no bearing on reality. This is why people talk about "lying with statistics". It's not that the math is wrong; it's that the conclusions drawn are not justified by the analysis. This should not stop someone from pushing forward, however, as long as they are cognizant of the pitfalls.

So what if the Cubs were just inconsistent on offense? The most common way we look at "consistency" statistically is the standard deviation (and its cousin variance, which is just the standard deviation squared). If you already know this, feel free to skip to the next chart. Standard deviation is a measure of how "spread out" the data is. The higher it is, the more likely a given point will be far from the average. For example, let's say you have two teams. Team A scores 3 runs a game every game, without fail. Team B scores 0 runs half the time and 6 runs the other half. Both teams score on average 3 runs a game, but Team A will have a standard deviation of 0 while Team B has 3. Which is better? It probably depends on whether you have low or high runs per game.
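(Here's that two-team example worked out with Python's statistics module, using the population standard deviation so the numbers come out exactly as described above.)

```python
# Team A scores 3 runs every game; Team B alternates 0 and 6.
# Same average, very different spread.
from statistics import mean, pstdev

team_a = [3] * 162
team_b = [0, 6] * 81

print(mean(team_a), pstdev(team_a))  # 3, 0.0
print(mean(team_b), pstdev(team_b))  # 3, 3.0
```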

Here is a plot of the standard deviation of runs scored, again with playoff teams and all of MLB added in:

[Chart: standard deviation of runs scored per game by team, 2018, with MLB and playoff-team values marked]

The Cubs rank pretty high, yes, but so do a lot of the other high-scoring teams. What’s going on here? Without doing quite a bit more analysis, I don’t want to say for sure, but my hunch is that all teams will have a decent chunk of low scoring games just due to luck/pitcher/weather/whatever. The high scoring teams will then also have a large number of high-scoring games, while the bad teams won’t, which would mean the spread on high-scoring teams should be higher. Again, this warrants further investigation.
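(One cheap way to poke at that hunch: if per-game runs were roughly Poisson-distributed, which is only a toy assumption, the spread would grow with the average automatically, since a Poisson distribution's standard deviation is the square root of its mean. A quick simulation sketch:)

```python
# Toy simulation only: under a Poisson model, higher-scoring teams show a
# bigger standard deviation automatically. Not a claim about how MLB run
# scoring is actually distributed.
import numpy as np

rng = np.random.default_rng(seed=42)
for avg in (3.5, 4.5, 5.5):
    season = rng.poisson(avg, size=162)
    print(f"average {avg}: mean {season.mean():.2f}, std {season.std():.2f}")
```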

Perhaps standard deviation isn't appropriate? Is there another measure we can try? How about dividing the standard deviation by the mean, which is called the "coefficient of variation"? The advantage of this measure is that it can be more easily compared across different systems. The disadvantage is that it breaks down if the mean is near zero. I also do not know if it is statistically valid to use here, but what the heck, let's take a peek:

[Chart: coefficient of variation of runs scored per game by team, 2018]
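(The measure itself is easy to compute from the same game logs; here's a quick sketch, again with placeholder numbers rather than the real 2018 data.)

```python
# Coefficient of variation = standard deviation divided by the mean,
# with a guard for the near-zero-mean case where the ratio blows up.
from statistics import mean, pstdev

def coefficient_of_variation(runs):
    avg = mean(runs)
    if avg == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(runs) / avg

print(coefficient_of_variation([1, 5, 3, 0, 7]))  # placeholder game log
```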

Now our high-scoring teams are all over the place. On the far right side we have the Red Sox and Yankees, with Minnesota sandwiched in there. Over on the left, we have bad teams like the Orioles, Mets, and Nationals, followed by two playoff teams, the Cubs and Dodgers. One could probably say that this fits the narrative of the Cubs being great in the first half while being inconsistent to bad in the second half. One could also fit this to the Dodgers' slow start, followed by their charge into the postseason. On the other hand, the Red Sox and Yankees were consistently good the whole season. One could say these things, if one were a little less cautious about drawing conclusions from data sets one doesn't fully understand.

To wrap this up, let's compare our numbers against the rest of the Theo years.

[Charts: season-by-season runs per game and consistency measures across the Theo years]

One thing that stood out to me is just how low 2016’s coefficient of variation is. 2017 scored more runs per game, but had a much higher variance. Again, this is not enough information to truly draw conclusions. What it does do is suggest potentially interesting lines of inquiry to pursue. If I get the right combination of bored and motivated, I may do just that.
