Social (Media) Data: a gold mine for Digital Humanities?

Lev Manovich identifies two types of data used in social and cultural studies during the twentieth century: “‘surface data’ about lots of people and ‘deep data’ about a few individuals or small groups” (Manovich, 2012). Statistics offers an intermediate method, in which a researcher chooses a sample to represent, for example, an entire country. Comparing this approach to Photoshop, Manovich goes on to say:

A “pixel” that originally represented one person comes to represent one thousand people who are all assumed to behave in exactly the same way. (Manovich, 2012)

This is exactly what happened in the polls preceding the 2016 presidential election in the United States, which predicted that Hillary Clinton would win. Nate Silver’s FiveThirtyEight 2016 general election forecast gave Donald Trump only a 28.2% chance of winning, although it also estimated a 10% chance that Clinton would win the popular vote and lose the Electoral College vote. The model comes in three versions and needs quite a lengthy user guide. The polls-plus version combines polls with an economic index, and each version follows four major steps:

  1. Collect, weight and average polls, based on pollster ratings.
  2. Adjust polls.
  3. Combine polls with demographic and (in the case of polls-plus) economic data.
  4. Account for uncertainty and simulate the election thousands of times.
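
The four steps above can be sketched in miniature. The following is only a toy illustration, not FiveThirtyEight's actual model: the poll margins, the pollster weights, the single `adjustment` term and the `uncertainty` value are all made-up assumptions, and one national margin stands in for the state-by-state forecast.

```python
import random

# Hypothetical polls: (candidate A's lead in points, pollster weight).
# These numbers are invented for illustration; they are not real 2016 polls.
POLLS = [(4.0, 1.0), (2.0, 0.8), (5.0, 0.5), (1.0, 1.2)]


def weighted_average(polls):
    """Step 1: average the poll margins, giving higher-rated pollsters more weight."""
    total_weight = sum(weight for _, weight in polls)
    return sum(margin * weight for margin, weight in polls) / total_weight


def simulate(polls, adjustment=0.0, uncertainty=4.0, runs=10_000, seed=0):
    """Steps 2 to 4 in miniature: shift the average, then draw many outcomes.

    `adjustment` is a stand-in for the poll adjustments and the demographic
    or economic data; `uncertainty` is an assumed standard deviation in points.
    Returns the share of simulated elections that candidate A wins.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    expected_margin = weighted_average(polls) + adjustment
    wins = sum(rng.gauss(expected_margin, uncertainty) > 0 for _ in range(runs))
    return wins / runs


if __name__ == "__main__":
    print("weighted average lead:", round(weighted_average(POLLS), 2))
    print("share of simulations won by A:", simulate(POLLS))
```

The point of the sketch is that the output is a probability rather than a single predicted winner: widening `uncertainty` pulls the win share toward 50%, even when the weighted average itself does not change.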

Nate Silver later defended his model in “Why FiveThirtyEight Gave Trump A Better Chance Than Almost Anyone Else,” saying that:

People mistake having a large volume of polling data for eliminating uncertainty. […] the polls sometimes suffer from systematic error: Almost all of them are off in the same direction. (Silver, 2016)
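
Silver's distinction can be made concrete with a small simulation. This is a sketch under assumed numbers, not his method: every simulated poll's error is the sum of a `shared_bias` common to all polls and its own independent sampling noise, and both values are hypothetical.

```python
import random


def poll_average_error(n_polls, shared_bias, sampling_sd=3.0, seed=1):
    """Return the error (in points) of an average of `n_polls` simulated polls.

    Each poll's error is the shared bias, which is identical for every poll,
    plus independent sampling noise with standard deviation `sampling_sd`.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    errors = [shared_bias + rng.gauss(0.0, sampling_sd) for _ in range(n_polls)]
    return sum(errors) / n_polls


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(n, "polls, no shared bias:", round(poll_average_error(n, 0.0), 2))
        print(n, "polls, 2-point shared bias:", round(poll_average_error(n, 2.0), 2))
```

Adding more polls averages away the independent noise, but the error of the average settles near the shared bias instead of near zero. A large volume of polling data therefore reduces one kind of uncertainty and leaves the other untouched.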

The four objections Manovich states with regard to the rise of social media and new computational tools that can process massive amounts of data (Manovich, 2012) can help explain why The Polls Missed Trump.

The first objection Manovich describes is the lack of availability of data outside the social media companies, specifically transactional data (Manovich, 2012). Election polls face a comparable gap in the data they can reach: one of their recurrent errors is nonresponse bias, or “failing to get supporters of one candidate to respond with the same enthusiasm as supporters of his opponent,” as Carl Bialik and Harry Enten put it in “The Polls Missed Trump. We Asked Pollsters Why” (Bialik and Enten, 2016).
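
The arithmetic behind nonresponse bias is simple enough to write down. The function below is an illustrative sketch; its name and the response rates in the example are assumptions, not figures from the article.

```python
def observed_share(true_share, response_rate_a, response_rate_b):
    """Share of respondents backing candidate A when supporters of A and B
    answer the poll at different rates.

    `true_share` is A's real share of the electorate (between 0 and 1);
    the response rates are the fractions of each camp that actually respond.
    """
    responders_a = true_share * response_rate_a
    responders_b = (1.0 - true_share) * response_rate_b
    return responders_a / (responders_a + responders_b)


if __name__ == "__main__":
    # Hypothetical case: an evenly split electorate in which 10% of A's
    # supporters respond but only 8% of B's supporters do.
    print(observed_share(0.5, 0.10, 0.08))
```

In this hypothetical case the poll shows candidate A at 5/9 of respondents, or about 55.6%, even though the electorate is split evenly. A larger sample does not help, because the gap comes from who answers, not from how many are asked.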

The second objection Manovich formulates is the lack of authenticity, since communications over social media and digital footprints are often carefully curated and systematically managed (Manovich, 2012). The polling equivalent would be voters hiding their true preference, but “several pollsters rejected the idea that Trump voters were too shy to tells [sic] pollsters whom they were supporting” (Bialik and Enten, 2016). However, automated-dialer calls, which used a recorded voice, registered more Trump voters than live interviews did (Bialik and Enten, 2016).

Manovich also raises a third objection, the issue of size versus depth: different data lead to different questions, patterns, and insights (Manovich, 2012). In the aftermath of the election, many explanations for why the polls were off appeared in a range of articles. Even on the FiveThirtyEight website, I found at least three articles offering different points of view, some even contradicting each other: Jed Kolko explained that “Trump Was Stronger Where The Economy Is Weaker”; Carl Bialik stated that “Voter Turnout Fell, Especially In States That Clinton Won,” but also claimed “No, Voter Turnout Wasn’t Way Down From 2012”; and Clare Malone attributed the outcome to the sentiment that “Americans Don’t Trust Their Institutions Anymore.” The differing approaches, even within the same editorial team, show how some point to individuals’ emotions, while others cite voter turnout or a weak economy.

Finally, Manovich’s fourth objection points out the need for specialized expertise, especially in computer science, statistics and data mining, to work on large data sets and above all to combine data as Nate Silver did for his general election forecast. Even though he clearly has a well-defined method, adding several factors and ranking pollsters on their historical accuracy, polling needs to “get more comfortable with uncertainty” (Bialik and Enten, 2016). One of the people they interviewed even went as far as to state that “the incentives now favor offering a single number that looks similar to other polls instead of really trying to report on the many possible campaign elements that could affect the outcome. Certainty is rewarded, it seems” (Bialik and Enten, 2016).

If Digital Humanists want to make sure that:

The rise of social media, along with new computational tools that can process massive amounts of data, makes possible a fundamentally new approach to the study of human beings and society. (Manovich, 2012)

we need to change how humanities students are educated, something the Advanced Master in Digital Humanities at KU Leuven is certainly trying to achieve.

Bibliography

Manovich, Lev. “Trending: The Promises and the Challenges of Big Social Data.” Debates in the Digital Humanities. Minneapolis, MN: University of Minnesota Press, 2012.

Silver, Nate. “2016 Election Forecast.” FiveThirtyEight, November 8, 2016. https://projects.fivethirtyeight.com/2016-election-forecast/#plus.

Bialik, Carl, and Harry Enten. “The Polls Missed Trump. We Asked Pollsters Why.” FiveThirtyEight, November 9, 2016. http://fivethirtyeight.com/features/the-polls-missed-trump-we-asked-pollsters-why/.