In case you weren’t aware, pollsters are going through a bit of a crisis.
The US polling industry has been suffering a crisis of insight over the past decade or so; its methods have become increasingly bad at telling which way America is leaning. Like nearly everyone who works in politics, [Dan] Wagner and [David] Shor knew the polling establishment was liable to embarrass itself this year. It wasn’t a question of if, but when—and how badly.
It didn’t take long to find out. About 10 days before the Iowa caucuses in February, two major polls came out: One put Hillary Clinton ahead by 29 points; the other, as if it were tracking an entirely different race, showed Bernie Sanders leading by eight. In the Republican contest, Donald Trump topped the state’s final 10 polls and averaged a seven-point advantage. On the night of the caucus itself, the Civis office in Chicago was crowded with staffers gathered around a big flatscreen TV for a viewing party. They all watched as Clinton—and Ted Cruz—won the state.
To be fair, this isn’t just a US thing. The problem seems to be a mix of dying landlines and distrust fostered by phone spam. Whatever the cause, it’s leaving us blind on policy decisions. Fortunately, one solution not only fixes those issues but is also astonishingly accurate.
By night’s end, the analytics team proved to be precisely correct—Obama won by the Cave’s predicted 126 electoral votes. Even more impressive, the Cave was accurate down to individual precincts. In Ohio, for instance, it had forecast Obama would receive 57.68 percent of the vote in Cincinnati’s Hamilton County; the final number was 57.16 percent.
So how did “the Cave” manage this feat?
in 2006, veteran politico Harold Ickes joined forces with one of McAuliffe’s techies, Laura Quinn, to go private. They built an $11 million not-for-profit data warehouse for Democrats called Catalist, recruiting talent from companies like Amazon and assembling more than 450 commercial and private data layers on each adult American.
450 cross-linked databases?! Mention this to a computer scientist, and they’ll either start drooling or fire up Tor. I’m in the latter camp, and my reason can be summed up in a single word: de-anonymization.
We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.
Narayanan, Arvind, and Vitaly Shmatikov. “Robust de-anonymization of large sparse datasets.” 2008 IEEE Symposium on Security and Privacy (sp 2008). IEEE, 2008.
Almost every database contains more than the bare minimum it needs; there’s usually extra demographic information, or even timestamps, along with it. Even if the database is perfectly sanitized, human beings are creatures of habit: they re-use usernames and passwords, have their verbal tics, and so on. By comparing the overlap between two databases, you can merge the information contained in both and build up a richer picture of individuals. The more databases you have on hand, the clearer the picture, and in this case it’s ridiculously sharp.
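To make the overlap idea concrete, here is a minimal sketch of the simplest possible linkage: an exact match on a few shared demographic fields. Everything in it is an assumption for illustration. The records are made up, and the field names and the link_records helper are hypothetical. It is also not the algorithm from the Narayanan and Shmatikov paper, which scores approximate matches across sparse data; this is only a plain join.

```python
from collections import defaultdict

# Toy, made-up records: an "anonymized" dataset with no names...
ratings = [
    {"user_id": "u1", "zip": "60601", "birth_year": 1980, "gender": "F", "likes": "documentaries"},
    {"user_id": "u2", "zip": "60601", "birth_year": 1975, "gender": "M", "likes": "westerns"},
    {"user_id": "u3", "zip": "45202", "birth_year": 1980, "gender": "F", "likes": "thrillers"},
]
# ...and a second, public list that does carry names.
public_list = [
    {"name": "Alice Example", "zip": "60601", "birth_year": 1980, "gender": "F"},
    {"name": "Bob Example", "zip": "60601", "birth_year": 1975, "gender": "M"},
    {"name": "Carol Example", "zip": "45202", "birth_year": 1991, "gender": "F"},
]

# Hypothetical choice of fields that both datasets happen to share.
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def link_records(anonymous, named, keys=QUASI_IDENTIFIERS):
    """Return {anonymous user_id: name} for rows with exactly one match."""
    # Index the named dataset by its combination of shared fields.
    index = defaultdict(list)
    for person in named:
        index[tuple(person[k] for k in keys)].append(person["name"])

    linked = {}
    for row in anonymous:
        candidates = index.get(tuple(row[k] for k in keys), [])
        # Only an unambiguous match re-identifies someone.
        if len(candidates) == 1:
            linked[row["user_id"]] = candidates[0]
    return linked


matches = link_records(ratings, public_list)
print(matches)
```

In this toy data, two of the three "anonymous" rows pick up a name, and their viewing tastes come along for the ride. Each extra shared field or extra database shrinks the pool of candidates, which is why 450 layers is so worrying.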
For the first time, they could link voters to a unique, seven-digit identifier—a kind of lifetime political passport number—that would follow them across the country no matter how many times they moved.
This sort of polling power would be an excellent thing in the hands of Statistics Canada, but imagine if a political party knew how you would vote before you’d consciously chosen, by extrapolating from your demographic data. Actually, don’t imagine that, because for some of you it’s already happened.
Obama’s 2012 presidential campaign crunched poll numbers and voter data to determine a proprietary 0-to-100 “persuadability score” for every voter, which indicated the likelihood that person would choose Obama.
Imagine instead precisely targeted robocalling to spread misinformation about polling places, more efficient district gerrymandering, and policies that subtly disenfranchise or even disadvantage voters who didn’t vote for you. It’s a nightmare, one the techno-utopian magazine Wired completely glosses over in its article.
But it’s one we’re going to have to face, at least in North America, because the data genie is well out of the bottle over here.