Search Data Help Us Understand Voters and Make Better Forecasts
Where were you on Saturday, November 7, 2020, when Pennsylvania was called for Biden? The race was so close we barely breathed for three days.
Why did we not see this coming? Biden won Georgia by only 0.25%, Arizona by 0.3%, and Wisconsin by 0.6%. It wasn’t supposed to be this way.
Why do pollsters keep getting presidential election forecasts wrong? There are three main reasons.
1. You Miss One State …
The US electoral college makes presidential election forecasting nearly impossible. The states (except Maine and Nebraska) award all of their electoral votes to the state’s popular vote winner. So, even though a state’s popular vote can be very close, the electoral victory is tremendously lopsided.
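To make the winner-take-all effect concrete, here is a minimal Python sketch. The function name and the vote shares are my own illustrative assumptions; the 16 electoral votes and the 0.25-point gap are loosely modeled on a state like Georgia in 2020.

```python
def winner_take_all(electoral_votes, share_a, share_b):
    """Allocate all of a state's electoral votes to the popular vote leader.

    share_a and share_b are popular vote percentages for two candidates.
    Returns a tuple (votes_for_a, votes_for_b).
    """
    if share_a == share_b:
        raise ValueError("Exact tie: this sketch does not model tie-breaking")
    if share_a > share_b:
        return electoral_votes, 0
    return 0, electoral_votes


# Hypothetical shares: a 0.25-point popular vote lead in a 16-vote state.
close_state = winner_take_all(16, 49.50, 49.25)

# Hypothetical shares: a 20-point lead in the same state.
blowout_state = winner_take_all(16, 60.0, 40.0)
```

Both calls return the same 16-to-0 split, which is why a small polling miss in one close state can swing a forecast so sharply.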
FiveThirtyEight is a highly regarded US presidential election forecaster. In 2016, they called 83.8% of the states’ electoral votes correctly. By most forecasting standards, that is a very high level of accuracy. But it is not nearly enough for US presidential elections: Trump won, not Clinton. And Trump won a sizable 57% of the electoral college (306 out of 538).
Had FiveThirtyEight called Florida correctly, they would have at least predicted a much closer race: half of Trump’s victory margin over Clinton was tied to Florida.
2. Systematic Polling Bias
There is clear evidence of systematic bias in election polling. The problem is abundantly clear when analyzing pollster accuracy for 2016 and 2020.
FiveThirtyEight collects and analyzes surveys from pollsters to create their election forecasts. If pollster data is problematic, then their forecasts will be inaccurate.
Victory margin is measured in percentage points of the state’s popular vote. For example, FiveThirtyEight forecasted Trump would beat Clinton in Tennessee by 12 points in 2016. In actuality, he beat Clinton by 26 points, for a victory margin error of 14 points.
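The Tennessee arithmetic can be written as a one-line calculation. This is a small sketch; the function name and the sign convention (positive means the candidate beat the forecast) are my own assumptions.

```python
def margin_error(forecast_margin, actual_margin):
    """Victory margin error in percentage points.

    Positive values mean the winner beat the forecast margin;
    negative values mean the winner fell short of it.
    """
    return actual_margin - forecast_margin


# Tennessee 2016: forecast Trump +12, actual Trump +26.
tennessee_error = margin_error(12, 26)
print(tennessee_error)  # 14
```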
Here are two key facts that demonstrate systematic polling bias:
- In 2016, Trump beat FiveThirtyEight’s victory margin forecast in all 30 states he won.
- In 2020, Trump beat FiveThirtyEight’s victory margin forecast in 24 of the 25 states he won.
A summary of the systematic forecast bias in FiveThirtyEight’s 2020 forecasting is provided below.
Clearly, pollster error is not spread randomly. Otherwise, there would be a similar number of blue bars above and below the line. The same for the red bars. Instead, most blue bars are below the line (Biden’s strength was consistently overestimated), and most of the red bars are above the line (Trump’s strength was consistently underestimated).
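One rough way to quantify "not spread randomly" is a sign test, which is my own addition and not part of the original analysis. The sketch assumes each state's error is equally likely to fall on either side of the forecast and that states are independent. The second assumption is generous, since state polling errors tend to move together, so treat the result as an illustration and not a formal proof.

```python
from math import comb


def sign_test_tail(n_states, n_same_direction):
    """Probability that at least n_same_direction of n_states errors point
    the same given way, if each direction were a fair coin flip."""
    favorable = sum(comb(n_states, k) for k in range(n_same_direction, n_states + 1))
    return favorable / 2 ** n_states


# 2016: Trump beat the forecast margin in all 30 states he won.
p_2016 = sign_test_tail(30, 30)

# 2020: Trump beat the forecast margin in 24 of the 25 states he won.
p_2020 = sign_test_tail(25, 24)

print(f"2016: {p_2016:.2e}, 2020: {p_2020:.2e}")
```

Under these assumptions, both probabilities come out below one in a million, which is what you would expect from a one-directional bias and not from random noise.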
FiveThirtyEight systematically underestimates Republican strength and overestimates Democratic strength. This error pattern gives Democrats a false sense of security. With so much at stake in our presidential elections, high forecast error can dramatically impact confidence in the media, politicians, and society in general. And it can lead to considerable social anxiety and poor decision-making by voters and politicians.
3. Lying to Pollsters
Systematic polling error may be due, in part, to lying. People do not always tell the truth when asked political questions. They might tell pollsters they are …
- Registered to vote when they are not.
- Planning to vote when they actually are not.
- Undecided when they are actually decided.
- Voting for Biden when they actually intend to vote for Trump. This lie was documented multiple times on major news channels during the 2020 election cycle.
Truthfulness is very hard for pollsters to detect and measure. How do you know if someone is telling the truth about their political views and voting intentions?
Trump has made lying normative. Republicans often wage their political wars with lies. One research study found that Republicans are twice as likely as Democrats to hide their true voting intentions. Such a proclivity to lie would indeed contribute to a tremendous level of polling error.
We can no longer assume people tell pollsters the truth. We need to find other, more accurate ways to forecast election outcomes.
Are Pollsters Improving?
FiveThirtyEight hoped pollsters had learned from the 2016 forecasting disaster and were actively improving their methods. Indeed, their accuracy improved. FiveThirtyEight correctly forecasted a Biden victory in 2020. And they only missed two states: Florida and North Carolina.
But Florida and North Carolina represent 8.2% of the electoral vote.
Worse, FiveThirtyEight did not anticipate tight races in Georgia, Arizona, Wisconsin, and Pennsylvania. By election day, they predicted Biden would take 65.1% of the electoral vote! This was terribly inaccurate.
In combination, FiveThirtyEight again gave Democrats a false sense of security and totally missed the closeness of the 2020 election.
Search Data: A Powerful Alternative
I used search data in four different publications to accurately forecast the 2020 election.
August 6, 2020
My first published prediction of the 2020 election was made on August 6, 2020. That forecast used search data for 26 paired comparisons. This forecast indicated Biden would win 51% of the electoral college, and I concluded the race was too close to call.
On the same day, FiveThirtyEight predicted 59.7% of the electoral vote would go to Biden.
August 24, 2020
On August 24, I used search data for 20 variables, weighted by their search popularity across the US, to predict the 2020 election. That model indicated Biden would win 57% of the electoral college.
On the same day, FiveThirtyEight predicted 60.2% of the electoral vote would go to Biden.
August 31, 2020
On August 31, I updated my August 24th forecast and found the 20 weighted variables still predicted Biden would win 57% of the electoral vote.
On that same day, FiveThirtyEight predicted 58.2% for Biden.
October 30, 2020
On October 30, I focused on 13 unweighted variables that were calibrated on the 2016 election (as were all of my 2020 forecasts). This was my final published forecast before the election. I predicted Biden would win 57.6% of the electoral vote.
On that same day, FiveThirtyEight predicted 64.5% for Biden.
Were My Forecasts Accurate?
The accuracy of my published 2020 presidential election forecasts versus FiveThirtyEight’s is summarized below. My last three forecasts were very close to the actual electoral college outcome. And my forecast error was much smaller than FiveThirtyEight’s in three out of four predictions.
My forecasts were more accurate than FiveThirtyEight’s as election day approached.
Throughout the election cycle, FiveThirtyEight never estimated Biden’s electoral vote would be as low as 306. Their lowest forecast was 313 on August 31, 2020. Worse, as election day drew near, their inaccuracy jumped. This finding is consistent with the systematic polling bias problem noted earlier.
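The comparison can be recomputed from the percentages quoted in the four forecast sections above. The sketch below is my own reconstruction and may not match the original summary exactly; it assumes Biden's actual result of 306 of 538 electoral votes and uses absolute error in percentage points.

```python
TOTAL_EV = 538
ACTUAL_BIDEN_EV = 306
actual_pct = 100 * ACTUAL_BIDEN_EV / TOTAL_EV  # about 56.9%

# Forecast date -> (search-data forecast %, FiveThirtyEight forecast %) for Biden.
forecasts = {
    "2020-08-06": (51.0, 59.7),
    "2020-08-24": (57.0, 60.2),
    "2020-08-31": (57.0, 58.2),
    "2020-10-30": (57.6, 64.5),
}

rows = []
for date, (search_pct, fte_pct) in forecasts.items():
    search_err = round(abs(search_pct - actual_pct), 1)
    fte_err = round(abs(fte_pct - actual_pct), 1)
    fte_votes = round(fte_pct / 100 * TOTAL_EV)  # FiveThirtyEight % as electoral votes
    rows.append((date, search_err, fte_err, fte_votes))
    print(f"{date}: search error {search_err}, FiveThirtyEight error {fte_err} "
          f"({fte_votes} electoral votes)")

# Number of dates on which the search-data forecast had the smaller error.
search_wins = sum(1 for _, s_err, f_err, _ in rows if s_err < f_err)
```

On these inputs the search-data forecast has the smaller error on three of the four dates, and FiveThirtyEight's August 31 forecast converts to 313 electoral votes.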
A Better Method
Throughout the 2020 election cycle, I was concerned about the dramatic increase in early voting and its impact on forecast accuracy. After my October 30 forecast, I found a way to address early voting and detect weekly voter sentiment shifts. For example, I wanted sensitivity to early Democratic voting and election day Republican voting.
My previous forecast models used search activity for aggregate time frames (e.g., September 1 to November 8, 2016). I wanted to see if weekly sensitivity would produce better forecast results. The new enhancements involved four changes:
- Track search behavior weekly starting 12 weeks before election day. This captures search behavior before early voting begins and detects weekly changes throughout the voting cycle.
- Use raw scores for the 20 variables. In other words, drop the weighting method used to make the August 24 and 31, 2020, forecasts. This increases model sensitivity.
- Add weekly tracking to the 13-variable model. Now the 13- and 20-variable models work the same way: weekly tracking and no weighting.
- Calculate a final election day forecast for 2016 and 2020 based on a sum of all weekly scores.
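The weekly-sum step can be sketched as follows. The details here are hypothetical, because the scoring formula is not spelled out above: I assume each state gets one raw score per week for 12 weeks, that positive scores lean Democratic, and that the state names and numbers are placeholders.

```python
N_WEEKS = 12  # tracking starts 12 weeks before election day


def final_scores(weekly_scores, n_weeks=N_WEEKS):
    """Sum each state's unweighted weekly scores into one election day score."""
    totals = {}
    for state, scores in weekly_scores.items():
        if len(scores) != n_weeks:
            raise ValueError(f"{state}: expected {n_weeks} weekly scores, got {len(scores)}")
        totals[state] = sum(scores)
    return totals


# Hypothetical raw weekly scores (positive = Democratic lean, by assumption).
weekly_scores = {
    "State A": [2, 1, 3, 2, 2, 1, 0, -1, 1, 2, 1, 0],
    "State B": [-1, -2, 0, -1, -3, -2, -1, 0, -2, -1, -1, -2],
}

totals = final_scores(weekly_scores)
leans = {state: ("D" if total > 0 else "R") for state, total in totals.items()}
```

Keeping the weekly values separate until the final sum is what lets a model like this pick up week-to-week shifts, such as early Democratic voting followed by election day Republican voting.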
All four changes were used to forecast the 2016 election. The same models were then used to forecast the 2020 election, except “coronavirus” was added to the 19-variable model for 2020, for obvious reasons.
The table below compares my forecast accuracy versus FiveThirtyEight using the three tests listed in the first column. The first test is obvious: was the winner predicted correctly? The second test assesses overall electoral vote error: actual total electoral vote percentage for the party forecasted to win minus the forecast. The third test tracks the percent of all state electoral votes that were forecasted correctly.
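The three tests can be expressed as a short scoring function. This is a sketch under my own assumptions: the function and field names are hypothetical, and the example inputs for FiveThirtyEight in 2020 come from figures quoted earlier in this article (a 65.1% election day forecast, 306 actual electoral votes, and misses in Florida and North Carolina, which held 29 and 15 electoral votes). The results may differ slightly from the table.

```python
TOTAL_EV = 538


def score_forecast(forecast_winner, actual_winner, forecast_pct, actual_pct, missed_state_evs):
    """Apply the three accuracy tests to one forecast.

    forecast_pct / actual_pct: electoral vote % for the party forecasted to win.
    missed_state_evs: electoral votes of each state called incorrectly.
    """
    correct_evs = TOTAL_EV - sum(missed_state_evs)
    return {
        # Test 1: was the winner predicted correctly?
        "winner_correct": forecast_winner == actual_winner,
        # Test 2: actual electoral vote % minus forecast electoral vote %.
        "ev_error_pts": round(actual_pct - forecast_pct, 1),
        # Test 3: share of all state electoral votes forecasted correctly.
        "state_ev_correct_pct": round(100 * correct_evs / TOTAL_EV, 1),
    }


fte_2020 = score_forecast(
    forecast_winner="Biden",
    actual_winner="Biden",
    forecast_pct=65.1,
    actual_pct=100 * 306 / TOTAL_EV,
    missed_state_evs=[29, 15],  # Florida, North Carolina
)
print(fte_2020)
```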
The second column shows FiveThirtyEight’s performance in the 2016 election. The next two columns show my new model calibrations for 2016. Both outperform FiveThirtyEight, but this is not an apples-to-apples comparison. These two columns show how well the two new models fit the 2016 election in hindsight; that calibration was done only to prepare my 2020 election forecasts.
The last three columns in the table cover 2020 forecast accuracy, first for FiveThirtyEight and then for my two new models. Three observations are readily made:
- All three approaches correctly predicted Biden’s victory in 2020.
- My 13-variable model was most accurate in predicting the overall electoral vote outcome. Both of my models were more accurate than FiveThirtyEight on this measure.
- FiveThirtyEight was slightly better on the third test, followed closely by my 20-variable model.
Two conclusions follow. (1) My 13-variable model is best for predicting the overall electoral vote outcome. (2) A combination of FiveThirtyEight and my 20-variable model is best for predicting state-by-state outcomes.
What Drove the 2020 Election Outcome?
As we can see, it is possible to approximate or improve on pollster accuracy by using search data. And it is possible to do so with far less expense.
But the advantages go well beyond accuracy and cost. The most important reason to use search data in election forecasting is to determine why people vote the way they do.
In the 20-variable model, the top two search terms tied to Biden’s actual 2020 popular vote percentage are “pussy” and “pornhub.” States with high interest in these two terms tended to vote for Trump. And states with low interest in these two terms tended to vote for Biden.
These two variables measure a peculiar, heightened interest among Republicans in pornography and sexist language. The variables also capture Democratic bewilderment over the lack of consequences for Trump’s frequent and blatant display of misogynistic attitudes and behaviors towards women. These two variables are the most consistent indicators of a Trump or Biden voter.
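For readers who want to see how search terms can be ranked by their tie to vote share, here is a minimal sketch using the Pearson correlation. It is an assumption on my part that a simple correlation like this is a fair stand-in for the ranking described here; the term names and all numbers are hypothetical placeholders, not my actual data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("Need two sequences of equal length with at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("Correlation is undefined when a sequence is constant")
    return cov / (spread_x * spread_y)


# Hypothetical data for five states: Biden's popular vote % and relative
# search interest (0 to 100) for two placeholder terms.
biden_share = [40, 45, 50, 55, 60]
search_interest = {
    "term_a": [90, 80, 70, 60, 50],  # higher interest where Biden did worse
    "term_b": [30, 45, 50, 70, 65],  # higher interest where Biden did better
}

correlations = {term: pearson(values, biden_share) for term, values in search_interest.items()}

# Rank terms by strength of correlation, regardless of direction.
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)
```

A strongly negative value means high interest in the term goes with a lower Biden vote share, which is the pattern described for the top two terms.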
The third-highest correlation with Biden’s popularity was low interest in Facebook. I explored the highly revelatory connection between Facebook and Republican voting patterns in the article: Our Worst Marketing Nightmare. Then, in the article Just Say It: Racist, I uncovered the unsettling correlation in 2016 between searches for Facebook and the term “nigger.”
Facebook claimed it would moderate its role as the leading publisher of violent and political misinformation during the 2020 election cycle. Many observers concluded they did too little too late to deliver on their promises.
The clear conclusion is that Facebook remained a strong, Republican-leaning publisher during the 2020 presidential election cycle. On a relative basis, Democratic-leaning states avoided it, and Republican-leaning states embraced it.
3. Real News and Facts
CNN, Wikipedia, and Fox News ranked fourth through sixth among the search terms most correlated with Biden-leaning versus Trump-leaning states. States that were more inclined to search for CNN and Wikipedia were more likely to vote for Biden.
Conversely, states with low interest in CNN and Wikipedia were more likely to vote for Trump. This finding connects to the earlier issue about lying to pollsters. Republican-leaning states prefer fake news and conspiracy theories on Facebook and Fox News rather than real news and facts from CNN and Wikipedia.
The above research further supports the growing concern about the tremendous ideological divergences among Americans. The typical Republican-leaning state prefers misogyny, Facebook, and Fox News. The typical Democratic-leaning state prefers facts from CNN and Wikipedia and is repulsed by misogyny or fake news from Facebook and Fox News.
The most uncomfortable aspect of the conclusion is that it likely means nothing to a Republican. These findings would likely be assailed as fake and rejected out of hand, just like data about the value of coronavirus testing or mask-wearing during a pandemic.
The good news is this: you can’t solve a problem if you can’t define it accurately. My highest hope is that my forecasting work takes us one step closer toward a clear understanding of why America is in so much trouble.