Monday, January 4, 2016

Seeing Patterns That Don't Exist: Sports Edition

I found a good example of how not to think about data in Time magazine's 2015 "Answers Issue." Among the many examples of analysis that could have been deeper, one stood out:


"Which team has the best home-field advantage?" is essentially one big graphic illustrating the home-field advantage (the difference between a team's winning percentages at home and away) for every major American sports team. On top of this graphic, they have placed some random observations.  I cannot resist critiquing a few of these before I get to my main point:

  • "Stadiums don't generally have a great influence on win percentage except in baseball, where each stadium is unique."   If they mean only that playing-field peculiarities play no role in sports where all playing fields are identical, then---duh!  If they are saying that peculiarities of the playing field do have a great influence in baseball, then---whoa!   These peculiarities could play a role, but Time hasn't shown any data, or even a quote from a player, to support this.
  • "The Ravens [the team with the best overall home-field advantage, with a 35% difference: 78% at home vs 43% away] play far better when in Baltimore. They lost every 2005 road game but were undefeated at home in 2011."  Why would they compare the road record in 2005 to the home record six years later?  This is a clue that they are "cherry-picking": looking for specifics that support their conclusion rather than looking for the fairest comparison.  I don't follow sports much, but I know six years is enough time to turn over nearly the entire team, thus making this a comparison between the home and road records of essentially different teams (with different coaches).  This is easy enough to look up: the 2005 Ravens were 0-8 on the road and 6-2 at home (a 75% difference with a 6-10 overall record), while the 2011 Ravens were 4-4 on the road and 8-0 at home (a 50% difference with a 12-4 overall record). This suggests the Ravens maintain a substantial home advantage, not only when they are a strong team overall but also when they are a weak team.  Rather than make this "substantial and consistent" point, Time's factoid misleads us into thinking that a single team has an overwhelming home advantage.
  • "Grueling travel---especially in the NHL and NBA, where many road games are back-to-back---can take a toll on visitors."  This may explain why the NBA overall has a 19% home advantage---but why then does the NHL have only a 10% home advantage, nearly the lowest of the four major sports? It seems as if Time's "data-driven journalism" is limited to "explaining" selected facts without a serious attempt to investigate patterns.
Now to the main point.  A skeptical, data-driven person must ask: couldn't many of these numbers have arisen randomly?  The overall home advantage in the NFL is 15%: a 57.5% winning percentage at home, vs. 42.5% on the road. Imagine that each of the 32 teams has a real 15% home advantage.  They play only 8 home and 8 away games each season, so a typical team expects something like a 5-3 record at home and 3-5 on the road. If random events cause them to win just one more home game and lose just one more road game, they now have an apparent 50% home advantage (6-2 or 75% at home, vs 2-6 or 25% on the road).  They could also randomly win one less at home and one more on the road, for an apparent 0% home advantage.  This is roughly equal to Time's "worst" team, the Cowboys (to whom we will return later).  So the observed spread in home-field advantage is plausibly due to randomness, without requiring us to believe that the Cowboys really have no home advantage and that the Ravens really have a huge home advantage.
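The back-of-the-envelope argument above is easy to check directly. Here is a minimal sketch (variable names and the random seed are mine) that simulates one 16-game season for 32 teams, each with the same true 15% home advantage, and shows the spread in apparent advantages:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the result is reproducible
nteams, ngames = 32, 8          # one NFL season: 8 home and 8 away games per team

# every team has the same true 57.5% home / 42.5% away win probability
home_wins = rng.random((nteams, ngames)) < 0.575
away_wins = rng.random((nteams, ngames)) < 0.425

# apparent home advantage: home winning pct minus away winning pct
apparent = home_wins.mean(axis=1) - away_wins.mean(axis=1)
print(np.sort(apparent))
```

Even though every simulated team has an identical true advantage, the apparent advantages span a wide range, because a single flipped game moves a team's home or road percentage by 12.5%.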

In science we have something called Occam's razor: we prefer the simplest model that matches the data.  A complicated model of the NFL is one in which we assign a unique home-field advantage to each team.  A simpler model is that each team has a true 15% home advantage, and that the spread is only in the apparent advantage as measured by the actual won-lost record.  The previous paragraph shows that the simpler model is plausible, at least for a single year.  How do we make this more quantitative and compare to Time's 10 years of data?  Let's flip a coin for the outcome of each game.   This has to be a biased coin, with a 57.5% chance of yielding a win for the home team and 42.5% for the visitors.  We don't need a physical coin; it's easier to use a computer's random number generator.  For each of 32 NFL teams, we flip this "coin" 160 times (for the ten years of games examined by Time) and simply see what the minimum and maximum home-vs-away differences turn out to be.  This takes surprisingly few lines of code in Python:

import numpy
import numpy.random as npr
nteams = 32
ngames = 80  # ten years of home (or away) games in the NFL
# a game is a win when a uniform random draw clears the loss probability
homegames = npr.random(size=(nteams, ngames)) >= 0.425  # home team wins 57.5%
homepct = homegames.sum(axis=1) / float(ngames)
awaygames = npr.random(size=(nteams, ngames)) >= 0.575  # visitor wins 42.5%
awaypct = awaygames.sum(axis=1) / float(ngames)
print(numpy.sort(homepct - awaypct))


This prints out a set of numbers reflecting the apparent 10-year home advantage for each of 32 simulated teams, for example:

[-0.0375 -0.0125  0.      0.0125  0.0625  0.075   0.1     0.1125  0.1125
  0.1125  0.125   0.125   0.125   0.1375  0.1375  0.1375  0.15    0.15
  0.1625  0.175   0.175   0.175   0.175   0.175   0.2     0.2125  0.225
  0.2375  0.25    0.275   0.275   0.35  ]

As you can see, the largest apparent home advantage is 35%, exactly matching the Ravens, and the smallest apparent home advantage is -3.75%, about the same as the Cowboys' -2%.  Time's entire premise is consistent with being a mirage!

This modeling approach is at the heart of science, and is really fun. There are several directions we could take this if we had more time, and they are illustrative of the process of science:

  • making my statement "consistent with a mirage" more precise. I did this by running many simulations like the one above, and found that a number as large as 35% comes up 17% of the time (meaning in 17% of simulated 10-year periods of football). Thus there is no evidence that the Ravens have a greater than 15% home advantage.*  And even if they do, the fact that the largest advantage in an average simulation (31%) comes so close to their 35% means that most of their apparent advantage is likely to be random. The burden of proof is on those who think the effect is real, to tease out what the effect is and show that it can't be random.  If you find something that really doesn't fit the simple model, congratulations---you have made a discovery!  For example, it is plausible that (as Time suggests) the Cowboys do well on the road because they are "America's team."  With 10 years of data, their home vs. road record is still consistent with the NFL average, but if you like the "America's team" hypothesis you may be able to prove it by looking at 30 or more years of data, where random fluctuations will be smaller.
  • making a more sophisticated model.  I have to stress how brain-dead my model is. For example, each simulated team has an expected 50% winning record overall.  This is a really simple model that would be inadequate for predicting, for example, the lengths of winning streaks.  We could make the model more sophisticated by programming in the overall winning percentage of each team. I'm fairly confident this won't affect the home advantage, because most teams have a 10-year winning percentage not too far from 50% (in the 40-60% range, with the Ravens at 60.5%), and the exceptions (the Lions with 30% and Patriots with 77% overall winning percentage) still have home advantages consistent with the typical 15%.  But if you were determined to test the simple home-advantage model, you would want to write the extra code to make sure.  (Note that calling for a more sophisticated model here does not violate Occam's razor.  We know that some teams truly are good and some truly are bad, so we should include this in our model if we want to model the data thoroughly.  It just so happens that overall winning percentage is probably not important in modeling home-field advantage.)
  • modeling additional features of the data.  Upgrading the model as described in the previous paragraph would allow you to have even more fun, because this model would allow you to predict other things like the lengths of winning streaks.  It is truly satisfying to have a relatively simple model that explains a wide variety of data.
  • making your model more universal (in this case, extending it to additional sports). This is actually pretty easy; even Time may be capable of this.  Modifying my Python script to do basketball is trivial: just change the home/road winning percentages to 59.5%/40.5% and the number of games at each venue to 41 per year, or 410 in ten years. Before we do that, let's predict what will happen: random fluctuations will play a smaller role in an 82-game season.  The "best" and "worst" teams in the NBA will therefore show smaller deviations from the NBA average (19%) than we saw in football.  In fact, the Jazz lead the NBA with an apparent 27% advantage and the Nets trail with 12%---both consistent with my simulations. I encourage interested readers to do hockey and baseball for themselves.  
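To make the first bullet's "17% of the time" concrete, here is one way to sketch that calculation (the number of simulated histories and the seed are my choices, so the printed fraction will fluctuate around the quoted 17%): simulate many independent ten-year leagues and ask how often the league-best apparent advantage reaches the Ravens' 35%.

```python
import numpy as np

rng = np.random.default_rng(1)
nteams, ngames, nsims = 32, 80, 2000  # 80 = ten years of home (or away) games

# simulate many independent ten-year NFL "histories" at once
home = rng.random((nsims, nteams, ngames)) < 0.575
away = rng.random((nsims, nteams, ngames)) < 0.425
adv = home.mean(axis=2) - away.mean(axis=2)  # apparent advantage, per team

best = adv.max(axis=1)        # league-best apparent advantage in each history
frac = (best >= 0.35).mean()  # how often the "best" team matches the Ravens
print(frac)
```

A fraction this large says the Ravens' 35% is unremarkable under the simple equal-advantage model.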
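The team-strength upgrade described in the second bullet might look like the following sketch. The uniform 40-60% strength distribution is my assumption (loosely based on the range quoted above), and each team keeps the same true 15% home advantage on top of its strength:

```python
import numpy as np

rng = np.random.default_rng(2)
nteams, ngames = 32, 80  # ten years of home (or away) games

# hypothetical team quality: overall win probability drawn from the 40-60% range
strength = rng.uniform(0.40, 0.60, size=nteams)
p_home = strength + 0.075  # every team keeps the same true 15% home advantage
p_away = strength - 0.075

home = rng.random((nteams, ngames)) < p_home[:, None]
away = rng.random((nteams, ngames)) < p_away[:, None]
adv = home.mean(axis=1) - away.mean(axis=1)
print(np.sort(adv))
```

The spread in apparent home advantages comes out similar to the equal-strength model, consistent with the guess that overall winning percentage matters little here.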
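The basketball modification really is trivial. A sketch (seed and variable names are mine; the NBA has 30 teams):

```python
import numpy as np

rng = np.random.default_rng(3)
nteams, ngames = 30, 410  # 30 NBA teams; ten years of 41 home (or away) games

home = rng.random((nteams, ngames)) < 0.595  # home teams win 59.5% in the NBA
away = rng.random((nteams, ngames)) < 0.405
adv = np.sort(home.mean(axis=1) - away.mean(axis=1))
print(adv[0], adv[-1])  # smallest and largest apparent ten-year advantages
```

With 410 games per venue instead of 80, the simulated extremes cluster much more tightly around the 19% league average, just as predicted.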
I can imagine two types of results from modeling a wide variety of sports, each of which would be rewarding.  First, it could be that randomness explains the variations in all sports.  This would be an impressive achievement for such a simple model.  Second, it could be that randomness explains the variations in most sports, but that there is some interesting exception.  If baseball is an exception then perhaps baseball stadiums do matter.  If Denver is an exception, then perhaps altitude matters.**  

The same thinking tool can be used in many other contexts. The New York Times set a great example with "How Not To Be Misled By The Jobs Report."  They showed how uncertainties in the process of counting jobs could lead from an actual job gain of 150,000 to a wide range of apparent job gains, and thus to misleading conclusions about the economy if people take any one jobs report too seriously.

Summary: whether in science, in data-driven journalism, or just as part of being a thinking person, you should have a model in mind when you look at data or make observations.  This will prevent you from over-interpreting apparent features and help you make true discoveries.

*If you think the 17% indicates something unlikely, consider that it is not much less than the chance of getting two heads in two coin tosses, and no one would suggest that there must be something special about a coin that yields two heads in two tosses.  To even think about investigating something further, you should demand that what you observe would have arisen randomly in less than 5% of simulations.

**Spoiler alert: it turns out that neither baseball nor Denver is an exception.  
