The results are finally in, my spreadsheet has been updated, and the model which best predicted the US election of 2016 was…
…the Princeton Election Consortium’s.
The model that gave Clinton a 99% chance of victory did better than every other poll aggregator out there? What the heck?!
One clue as to why comes courtesy of the New York Times. Their result tracker singles out ten key states, which are difficult to predict because their demographics mirror the US as a whole, and thus they reflect the degree of partisanship nationally.
*(Table: each aggregator’s accuracy in the ten key states versus the other states.)*
Princeton’s model was more bullish on the states that were easy to predict, while Five Thirty Eight’s model did better on the ten key ones. Since most states are fairly certain, though, that alone gave Princeton’s model an edge.
Simply getting states right isn’t enough, though. Princeton, the New York Times, and PredictWise all flubbed five states, while Five Thirty Eight got six predictions wrong. And yet, Five Thirty Eight’s model was second-best of the bunch.
*(Table: the states each model called incorrectly, for NYT, PW, PEC (both), 538 (both), DW, and HuffPo. North Carolina and Maine’s 2nd congressional district appear under most models, and Nebraska’s 2nd district also shows up among the misses.)*
To go deeper than clues, we have to dive into how the Princeton model works. Based on a reading of the source code, it ignores national polls and merely scrapes state polls catalogued by the Huffington Post. For each state, and each day of the election, it creates the largest pool of polls it can from these two criteria:
- The last three polls prior to the date in question.
- Every poll conducted within the week leading up to the most recent poll released before that date.
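The pooling rule can be sketched in Python (the PEC code itself is MATLAB; the function and variable names here are illustrative, not from the source):

```python
from datetime import date, timedelta

def select_poll_pool(polls, as_of):
    """Pick the poll pool for one state on one date: the last three
    polls before `as_of`, or every poll within a week of the most
    recent one, whichever pool is larger. `polls` is a list of
    (end_date, margin) tuples."""
    past = sorted((p for p in polls if p[0] <= as_of), key=lambda p: p[0])
    if not past:
        return []
    last_three = past[-3:]
    newest = past[-1][0]
    window = [p for p in past if p[0] >= newest - timedelta(days=7)]
    # Keep whichever criterion yields the larger pool.
    return window if len(window) > len(last_three) else last_three
```

With sparse polling the pool falls back to the last three polls; in the final, poll-heavy week the seven-day window usually wins out.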
It then calculates the median of this pool and estimates the standard error from the median absolute deviation (colourfully shortened to MAD). Oddly, it doesn’t do a weighted median; a poll with a sample size of 400 is given the same weight as one with 3,000. This is a strange choice, because we know one is less reliable than the other.
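That estimator looks roughly like this in Python; the 0.6745 normalization is the standard MAD-to-sigma conversion for a normal distribution, and whether PEC applies exactly this scaling is my assumption, not a quote from the source:

```python
import math
import statistics

def median_and_sem(margins):
    """Median of the pooled margins, with the standard error of that
    median estimated from the median absolute deviation (MAD).
    Dividing MAD by 0.6745 gives an equivalent standard deviation;
    dividing by sqrt(n) turns it into a standard error."""
    med = statistics.median(margins)
    mad = statistics.median(abs(m - med) for m in margins)
    sem = mad / 0.6745 / math.sqrt(len(margins))
    return med, sem
```

Note there’s nothing in here about sample size: a pool of small polls and a pool of large ones with the same spread get the same standard error.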
Notice too that there’s no incorporation of prior knowledge. We know voters are pretty set in their opinions, so their preferences don’t vary much over time. If a model is accurate, it too should remain pretty stable.
Princeton’s model flits around like a drunk butterfly, making Five Thirty Eight’s “twitchy” models look like rocks in comparison. This sensitivity, though, means Princeton’s state numbers are a lot quicker to respond to sudden changes. So when the FBI director tossed out those gifts to Republicans in the final week of the election, nearly all pollsters under-weighted the quick shift on the premise that voters don’t swing their vote much.
This explains why Princeton’s two entries have different levels of success; the “latest” one was run by me personally, using polls released up until and including election day, while the other’s predictions were taken from the Upshot and likely a few days old. It also explains why Five Thirty Eight did so well, as they explicitly factored in the unusual number of undecided voters and thus could better accommodate a quick shift at the finish line.
That’s all well and good, but we’re still left with that paradox: how can the most accurate model also predict a Clinton victory with 99% certainty?
When Princeton’s model combines the state estimates to calculate an overall certainty, it incrementally marches through the probability space of each possible outcome. That clever trick allows it to do analytically what would normally be done empirically via sampling, getting the same result in less time.
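That march can be sketched as a convolution: keep an array indexed by electoral-vote total and fold in one state at a time. This is a Python sketch of the general technique, not the PEC code itself:

```python
def ev_distribution(win_probs, ev_counts):
    """Exact distribution of a candidate's total electoral votes.
    Each step convolves the running distribution with a two-outcome
    distribution for the next state: lose (0 EV) with probability
    1 - p, or win (its EV) with probability p. This is the analytic
    counterpart of Monte Carlo sampling over state outcomes."""
    total_ev = sum(ev_counts)
    dist = [0.0] * (total_ev + 1)
    dist[0] = 1.0
    for p, ev in zip(win_probs, ev_counts):
        new = [d * (1 - p) for d in dist]      # state lost: total unchanged
        for k in range(total_ev + 1 - ev):      # state won: shift up by ev
            new[k + ev] += dist[k] * p
        dist = new
    return dist  # dist[k] = probability of exactly k electoral votes
```

With the real 51 state probabilities and electoral-vote counts, the overall win probability is just the sum of `dist[270:]`.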
But notice we’re making a critical assumption here: while the estimate of each individual state may be off, we fix the net bias of the system to 0. We know that doesn’t happen, as the polls have been slightly biased in every US election we have on record. If we assume these biases fall into a Gaussian distribution, we can tweak Princeton’s model to incorporate those biases as a prior.
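A minimal sketch of that tweak, assuming a zero-mean Gaussian prior on the net bias (the function and parameter names are mine, not PEC's):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def win_prob_with_bias_prior(margin, sem, bias_sd, n=201):
    """Probability a candidate wins one state, averaging over a
    Gaussian prior on the net polling bias (mean 0, sd bias_sd).
    For each candidate bias b, shift the polled margin by b, get
    P(win | b) from the normal CDF, then weight by the prior density."""
    biases = [-4 * bias_sd + 8 * bias_sd * i / (n - 1) for i in range(n)]
    step = biases[1] - biases[0]
    total = 0.0
    for b in biases:
        prior = math.exp(-b * b / (2 * bias_sd ** 2)) / (bias_sd * math.sqrt(2 * math.pi))
        total += prior * norm_cdf((margin + b) / sem) * step
    return total
```

Because both distributions are Gaussian, the integral collapses to simply widening the standard error to sqrt(sem² + bias_sd²); the numerical version just makes the mechanism explicit.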
While the Princeton model doesn’t explicitly include these priors, it does try to fudge them.
```matlab
% Use statistics from data file
polls.margin = polldata(:,3)';
polls.SEM = polldata(:,4)';
polls.SEM = max(polls.SEM, zeros(1,51)+3);
totalpollsused = sum(polldata(:,1));
```
See the `max(polls.SEM, zeros(1,51)+3)` call? It alters the standard error of the margin so that it’s the maximum of itself and 3. In all but two states, though, the calculated standard error is less than 3!
This artificially boosts the uncertainty of the mean, simulating the fuzzing effect we saw in the prior. Remove this fudge factor, and the model grants Clinton a cartoonishly-high 99.99999999996% chance of winning…
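To see what the clamp buys, here’s a hedged Python sketch; the normal-CDF win probability is my stand-in for the model’s per-state calculation:

```python
import math

def state_win_prob(margin, sem, floor=None):
    """P(leading candidate holds the state), from a normal CDF on the
    polled margin. With `floor` set, the SEM is clamped from below the
    way the MATLAB line above does with max(polls.SEM, 3)."""
    if floor is not None:
        sem = max(sem, floor)
    return 0.5 * (1 + math.erf(margin / sem / math.sqrt(2)))
```

A 4-point lead with a raw SEM of 1 is a near-lock; clamping the SEM to 3 pulls the same lead down to roughly a 91% chance. Multiply that effect across a few dozen states and the cartoonish certainty deflates.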
*(Table: accuracy metrics for 538 (latest), PEC (latest), PEC (fuzz), PEC (SEM), and PEC (fuzz+SEM).)*
We’re on the right track, though: the Princeton model calculates the standard error from the spread of the polls, but ignores the fact that each poll carries its own standard error. That’s typically in the 3-5% range, and the pooled standard error is rarely much lower. If we switched Princeton’s model to use conjugate distributions when combining state polls, we could restore the uncertainty without resorting to ad-hoc fudges.
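One way to do that pooling, sketched in Python: treat each poll as a Gaussian measurement with known variance, in which case the conjugate update reduces to inverse-variance weighting (the names here are mine, not from the PEC source):

```python
import math

def pool_polls(margins_and_sems):
    """Combine polls that each carry their own standard error using
    inverse-variance weighting: the conjugate-prior update for Gaussian
    observations with known variance. Takes (margin, sem) pairs and
    returns the pooled margin and pooled standard error."""
    weights = [1.0 / (s * s) for _, s in margins_and_sems]
    precision = sum(weights)
    pooled_margin = sum(w * m for w, (m, _) in zip(weights, margins_and_sems)) / precision
    pooled_sem = math.sqrt(1.0 / precision)
    return pooled_margin, pooled_sem
```

Pooling n identical polls with SEM 4 gives a pooled SEM of 4/sqrt(n): the uncertainty shrinks with more data, but never below what the per-poll errors justify, which is exactly the floor the `max(…, 3)` fudge was approximating.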
As despised as it was internally, Five Thirty Eight’s “NowCast” was a good idea. It acted like a simplified version of their Polls-Only model, and because of that was influenced by the polls much like Princeton’s model. Peeling back some of the bells-and-whistles on Five Thirty Eight’s models might lead to something that performs as well or better, yet takes less computer horsepower.
Before I go, I’d be remiss if I didn’t mention Buzzfeed’s article on the same topic. They come to similar conclusions, but their numbers don’t quite match mine. Looking at their code, some of the gap comes from their exclusion of some electoral college votes from Maine and Nebraska; some from how extreme values are rounded (I usually turn 0’s into 0.5%, for instance, while they round to 0.01%); some, I suspect, from the choice of date (their Princeton model is dated November 7th, so it may exclude polls released on election day); and some because I’m running the code at full precision while they’re scraping pre-calculated data that’s lost a few decimals to rounding.
Still worth the read, even if they have fewer graphs.