Representation in Digital Traces

Jana Lasser

TU Graz

Foundations of Computational Social Systems

Who is included in the research?

  • Empirical social science wants to find generalizable patterns.
  • Convenience and social norms lead us to exclude many.
  • Large swathes of published research are not as generally applicable as claimed.

    Milgram 1974: The Dilemma of Obedience

    Computational Social Science to the rescue!

    Digital traces ...

  • Twitter
  • Search histories
  • Digitized books
  • ... What else?
    ... are great

  • millions of observations
  • from free services
  • everybody is included
  • ... right?

    Whom trace data measure

    Digital trace data are subject to a different kind of sampling

    Who uses a service at all?

    Who uses a service when?

    Who uses a service how much?

    Who discloses which kind of information?

    How do users behave on different platforms?

    Who uses a service at all?

    Sloan et al. 2015: Who Tweets? Deriving the Demographic Characteristics of Age, Occupation and Social Class from Twitter User Meta-Data

    Demographic biases in digital platforms and services

    Can you think of services that are used more by ...

  • ... older people?
  • ... conservative people?
  • ... non-Americans?
  • ... poorer people?
  • And what about non-internet users?

    Systematic biases in the Google Books Corpus

    Pechenick et al. 2015: Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution

    Who uses a service when?

    Gender of crowdworkers on Prolific

    The Verge: A Teenager on TikTok disrupted thousands of scientific studies with a single video

    Who uses a service how much?

    Who would be represented in a uniform sample of tweets?

    Zarrinkalam et al. 2015: Semantics-Enabled User Interest Detection from Twitter
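
    The question has a quantitative core: if tweets are sampled uniformly, an account's chance of appearing grows with how much it tweets, so prolific users dominate the sample. Below is a minimal simulation sketch; the activity distribution and the "heavy user" cutoff are illustrative assumptions, not figures from Zarrinkalam et al.

        import numpy as np

        rng = np.random.default_rng(42)

        # Assumed heavy-tailed activity: most accounts tweet rarely, a few tweet a lot.
        n_users = 100_000
        tweets_per_user = rng.zipf(a=2.0, size=n_users)

        heavy = tweets_per_user >= 50  # arbitrary "heavy user" cutoff for illustration

        # Uniform sample of USERS: every account is equally likely to be drawn.
        share_heavy_among_users = heavy.mean()

        # Uniform sample of TWEETS: an account appears in proportion to its activity.
        share_heavy_among_tweets = tweets_per_user[heavy].sum() / tweets_per_user.sum()

        print(f"Heavy users as a share of all accounts: {share_heavy_among_users:.1%}")
        print(f"Share of all tweets written by them:    {share_heavy_among_tweets:.1%}")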

    Who discloses which kind of information?

    Different kinds of analyses might require you to include or exclude users who disclose different kinds of information:

  • Users who disclose locations (Grinberg et al. 2019)
  • Users who disclose their name
  • Users who disclose party affiliation (Bakshy et al. 2015)
  • How could such filtering bias the sample? (See the sketch below.)
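
    One way to make the question concrete is to compare the composition of all users with the composition of the subset that discloses a given field. Below is a minimal sketch with made-up data; the age groups, disclosure rates and column names are assumptions for illustration only.

        import pandas as pd

        # Made-up users: younger accounts disclose their location more often (assumption).
        users = pd.DataFrame({
            "age_group": ["18-29"] * 50 + ["30-49"] * 30 + ["50+"] * 20,
            "discloses_location": (
                [True] * 40 + [False] * 10    # 18-29: most disclose
                + [True] * 15 + [False] * 15  # 30-49: half disclose
                + [True] * 4 + [False] * 16   # 50+:   few disclose
            ),
        })

        all_users = users["age_group"].value_counts(normalize=True)
        only_disclosers = users.loc[users["discloses_location"], "age_group"].value_counts(normalize=True)

        # Filtering on disclosure shifts the age composition of the sample.
        print(pd.DataFrame({"all users": all_users, "location disclosed": only_disclosers}).round(2))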

    How do users behave on different platforms?

    Lim et al. 2015: #mytweet via Instagram: Exploring User Behaviour across Multiple Social Networks

    Filtering out bots?

    Filtering out bots, for example with Botometer, is common practice in many analysis pipelines. But there is a false-positive problem!

    Rauchfleisch & Kaiser 2020: The False positive problem of automatic bot detection in social science research
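
    The underlying issue is a base-rate effect: if genuine bots are rare, even a modest false-positive rate means that a large share of the accounts removed as "bots" are real people. A back-of-the-envelope sketch; the prevalence, sensitivity and false-positive rate below are assumed numbers for illustration, not values measured for Botometer or reported by Rauchfleisch & Kaiser.

        # All rates are illustrative assumptions, not measured properties of Botometer.
        n_accounts = 100_000
        bot_prevalence = 0.05        # assumed share of genuine bots in the collection
        sensitivity = 0.80           # assumed share of bots correctly flagged
        false_positive_rate = 0.15   # assumed share of humans wrongly flagged as bots

        bots = n_accounts * bot_prevalence
        humans = n_accounts - bots

        flagged_bots = bots * sensitivity               # correctly removed
        flagged_humans = humans * false_positive_rate   # wrongly removed

        precision = flagged_bots / (flagged_bots + flagged_humans)
        print(f"Accounts flagged as bots: {flagged_bots + flagged_humans:,.0f}")
        print(f"Real users among them:    {flagged_humans:,.0f} ({1 - precision:.0%} of all flagged accounts)")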

    Summary

  • Platforms and systems that we collect data from were generally not designed with research in mind.
  • Our data collection systems are biased away from minority and marginalized populations.
  • Platforms and usage contexts are constantly changing and so are the digital traces left on them.

    What can we do?

  • Don't force generality: A study of Turkish immigrants who use Twitter and live in Vienna is fine – as long as you don't sell it as something else.
  • Think clearly about systematic biases: Many platforms disclose demographic information. If the platform fits the context of the research question, there is no shame in using a biased dataset.
  • Rebalancing data can help generalizability. See for example Wang et al. 2015: Forecasting elections with non-representative polls.
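
    Wang et al. correct a heavily skewed Xbox user sample with multilevel regression and post-stratification; the simplest version of the underlying idea is to reweight each stratum from its share in the sample to its share in the target population. A minimal sketch with made-up strata and numbers (not the actual Xbox data or their model):

        import pandas as pd

        # Made-up example: the platform sample skews young, the target population does not.
        strata = pd.DataFrame({
            "age_group": ["18-29", "30-49", "50+"],
            "sample_share": [0.60, 0.30, 0.10],  # composition of the (biased) platform sample
            "pop_share": [0.20, 0.35, 0.45],     # composition of the target population, e.g. from a census
            "support": [0.70, 0.50, 0.30],       # outcome measured within each stratum of the sample
        })

        naive = (strata["sample_share"] * strata["support"]).sum()
        post_stratified = (strata["pop_share"] * strata["support"]).sum()

        print(f"Naive estimate from the raw sample: {naive:.2f}")
        print(f"Post-stratified estimate:           {post_stratified:.2f}")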