No, data is not dead – but we need to be more thoughtful about our statistics

In the wake of the shocking election results this week, I have heard and read that data is dead, since it failed to predict how soundly Donald Trump would defeat Hillary Clinton. This is a tragic conclusion, because data did not fail to predict this election; people did. A more accurate reaction is that this outcome calls for more thoughtful and contextual modeling practices based on real-world characteristics.

First of all, it bears pointing out that the models per se were not “wrong”; rather, statistics is difficult to interpret. When a model said Clinton had, say, an 80% chance of winning, it meant that in some theoretical world where the 2016 US election was repeated over and over again, Clinton would win 80% of the time and Trump would win 20% of the time. The 80% meant that Clinton would win more often, but Trump would win sometimes too. The other aspect is that the 80% estimate is just that, an estimate, accompanied by a confidence interval: the span of values that, to a certain degree of confidence, contains the “true” parameter. For models of complex and dynamic processes built on small sample sizes, the confidence interval can be large. So going back to the theoretical world where the 2016 US election was repeated over and over again, Trump winning more times than Clinton is within the confidence interval for some of the models, meaning that it is actually a possibility for the “true” parameter. I think many saw the models’ highly touted point estimates and felt overly confident, but the confidence interval is the real story there.
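To make the “repeated over and over again” idea concrete, here is a minimal sketch in Python. It is my own illustration, not how any of the actual forecasting models worked: the function names, the fixed 80% win probability, and the textbook normal-approximation interval for a proportion are all simplifying assumptions. The first function replays a hypothetical election many times; the second shows how wide the interval around an 80% estimate becomes when it rests on only a handful of observations.

```python
import math
import random


def simulate_elections(win_probability, n_repeats, seed=0):
    """Replay a hypothetical election n_repeats times.

    Returns the share of replays won by the favored candidate.
    The fixed win_probability is an assumption for illustration only.
    """
    rng = random.Random(seed)
    wins = sum(1 for _ in range(n_repeats) if rng.random() < win_probability)
    return wins / n_repeats


def approximate_interval(p_hat, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion.

    p_hat is the estimated proportion, n the number of observations,
    and z=1.96 corresponds to roughly 95% confidence.
    The result is clipped to the valid range [0, 1].
    """
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


if __name__ == "__main__":
    # The favorite wins most replays, but the underdog still wins some.
    share = simulate_elections(0.8, 100_000)
    print(f"Favorite won {share:.1%} of simulated elections")

    # The same 80% point estimate is far less certain with little data.
    for n in (5, 50, 1000):
        low, high = approximate_interval(0.8, n)
        print(f"n={n:>4}: interval from {low:.2f} to {high:.2f}")
```

With very few observations the lower end of the interval falls below 50%, which is the situation described above: the underdog winning more often than not remains a possibility for the “true” parameter.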

All that being said, some of the surprises (hello Wisconsin, Michigan, and Pennsylvania) and disappointments (hello Florida) point to something beyond a misunderstanding of frequencies and confidence intervals. They speak to an overreliance on statistical regularities, such as historical behavior and demographic shifts predicting future behavior, and not enough reliance on real-time, real-world context. Models by themselves are great for predicting things when the underlying trends stay the same, but elections rarely rest on trends that have held constant over the last 4, 8, or 12+ years. This is compounded by the fact that election data comes from people. People-generated data is messy: people may change their minds, people may envision themselves one way but end up acting another way, people may lie (especially in various social situations), and the list goes on. Perfect example: data from the General Social Survey shows that respondents are statistically significantly more likely to express favorable opinions of Black people when they have a Black interviewer than when they have a white interviewer. This kind of data can be rewarding when you find the quirks and account for them in the model, but ignoring them is not an option. These types of modeling situations make context and contextual modeling all the more important.

The quality of a model is limited by the quality of the data and the validity of the assumptions, and the data and assumptions should be driven by a deep contextual understanding. As the common statistical adage goes, “a model is only as good as its data and underlying assumptions.” In the case of the campaign or public polls, I would expect the assumptions to also drive the assessment of data quality. For example, at least publicly, there were no recent polls in states that were assumed to be blue but ended up being red. In hindsight this assumption seems weak, given that those states housed significant support for the components of Bernie Sanders’s messages that resembled components of Trump’s messages. This should have spurred a closer examination of data coming from those states to test the validity of the blue-state assumption, and the models should have been updated accordingly.

In summary, no, the 2016 election did not kill data or analytics, but it did highlight the absolute and dire need to deeply consider context and how that context affects the data collection decisions and the underlying assumptions. This is all the more necessary when the data comes from people and is used to make predictions about people. When used correctly, data and modeling can still be a powerful tool, even for predicting elections.

