Representation in Digital Traces

Jana Lasser

TU Graz

Foundations of Computational Social Systems

Who is included in the research?

  • Empirical social science wants to find generalizable patterns.
  • Convenience and social norms lead us to exclude many.
  • Large swathes of published research are not as generally applicable as claimed.

    Milgram 1974: The Dilemma of Obedience

    Computational Social Science to the rescue!

    Digital traces ...

  • Twitter
  • Search histories
  • Digitized books
  • ... What else?
    ... are great

  • millions of observations
  • from free services
  • everybody is included
  • ... right?

    Whom trace data measure

    Digital trace data are subject to a different kind of sampling

    Who uses a service at all?

    Who uses a service when?

    Who uses a service how much?

    Who discloses which kind of information?

    How do users behave on different platforms?

    Who uses a service at all?

    Sloan et al. 2015: Who Tweets? Deriving the Demographic Characteristics of Age, Occupation and Social Class from Twitter User Meta-Data

    Demographic biases in digital platforms and services

    Can you think of services that are used more by ...

  • ... older people?
  • ... conservative people?
  • ... non-Americans?
  • ... poorer people?
  • And what about non-internet users?

    Systematic biases in the Google Books Corpus

    Pechenick et al. 2015: Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution

    Who uses a service when?

    Gender of crowdworkers on Prolific

    The Verge: A Teenager on TikTok disrupted thousands of scientific studies with a single video

    Who uses a service how much?

    Who would be represented in a uniform sample of tweets?

    Zarrinkalam et al. 2015: Semantics-Enabled User Interest Detection from Twitter
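
    The question has a quantitative core: if tweets are sampled uniformly, an account's chance of appearing grows with how much it tweets, so prolific users dominate the sample. Below is a minimal simulation sketch; the activity distribution and the "heavy user" cutoff are illustrative assumptions, not figures from Zarrinkalam et al.

        import numpy as np

        rng = np.random.default_rng(42)

        # Assumed heavy-tailed activity: most accounts tweet rarely, a few tweet a lot.
        n_users = 100_000
        tweets_per_user = rng.zipf(a=2.0, size=n_users)

        heavy = tweets_per_user >= 50  # arbitrary "heavy user" cutoff for illustration

        # Uniform sample of USERS: every account is equally likely to be drawn.
        share_heavy_among_users = heavy.mean()

        # Uniform sample of TWEETS: an account appears in proportion to its activity.
        share_heavy_among_tweets = tweets_per_user[heavy].sum() / tweets_per_user.sum()

        print(f"Heavy users as a share of all accounts: {share_heavy_among_users:.1%}")
        print(f"Share of all tweets written by them:    {share_heavy_among_tweets:.1%}")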

    Who discloses which kind of information?

    Different kinds of analyses might require you to include or exclude users who disclose different kinds of information:

  • Users who disclose locations (Grinberg et al. 2019)
  • Users who disclose their name
  • Users who disclose party affiliation (Bakshy et al. 2015)
  • How could such filtering bias the sample? (See the sketch below.)
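
    One way to make the question concrete is to compare the composition of all users with the composition of the subset that discloses a given field. Below is a minimal sketch with made-up data; the age groups, disclosure rates and column names are assumptions for illustration only.

        import pandas as pd

        # Made-up users: younger accounts disclose their location more often (assumption).
        users = pd.DataFrame({
            "age_group": ["18-29"] * 50 + ["30-49"] * 30 + ["50+"] * 20,
            "discloses_location": (
                [True] * 40 + [False] * 10    # 18-29: most disclose
                + [True] * 15 + [False] * 15  # 30-49: half disclose
                + [True] * 4 + [False] * 16   # 50+:   few disclose
            ),
        })

        all_users = users["age_group"].value_counts(normalize=True)
        only_disclosers = users.loc[users["discloses_location"], "age_group"].value_counts(normalize=True)

        # Filtering on disclosure shifts the age composition of the sample.
        print(pd.DataFrame({"all users": all_users, "location disclosed": only_disclosers}).round(2))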

    How do users behave on different platforms?

    Lim et al. 2015: #mytweet via Instagram: Exploring User Behaviour across Multiple Social Networks

    Filtering out bots?

    Filtering out bots, for example with Botometer, is common practice in many analysis pipelines. But there is a false-positive problem!

    Rauchfleisch & Kaiser 2020: The False positive problem of automatic bot detection in social science research
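
    The underlying issue is a base-rate effect: if genuine bots are rare, even a modest false-positive rate means that a large share of the accounts removed as "bots" are real people. A back-of-the-envelope sketch; the prevalence, sensitivity and false-positive rate below are assumed numbers for illustration, not values measured for Botometer or reported by Rauchfleisch & Kaiser.

        # All rates are illustrative assumptions, not measured properties of Botometer.
        n_accounts = 100_000
        bot_prevalence = 0.05        # assumed share of genuine bots in the collection
        sensitivity = 0.80           # assumed share of bots correctly flagged
        false_positive_rate = 0.15   # assumed share of humans wrongly flagged as bots

        bots = n_accounts * bot_prevalence
        humans = n_accounts - bots

        flagged_bots = bots * sensitivity               # correctly removed
        flagged_humans = humans * false_positive_rate   # wrongly removed

        precision = flagged_bots / (flagged_bots + flagged_humans)
        print(f"Accounts flagged as bots: {flagged_bots + flagged_humans:,.0f}")
        print(f"Real users among them:    {flagged_humans:,.0f} ({1 - precision:.0%} of all flagged accounts)")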

    Summary

  • Platforms and systems that we collect data from were generally not designed with research in mind.
  • Our data collection systems are biased away from minority and marginalized populations.
  • Platforms and usage contexts are constantly changing and so are the digital traces left on them.

    What can we do?

  • Don't force generality: A study of Turkish immigrants who use Twitter and live in Vienna is fine – as long as you don't sell it as something else.
  • Think clearly about systematic biases: Many platforms disclose demographic information. If the platform fits the context of the research question, there is no shame in using a biased dataset.
  • Rebalancing data can help generalizability. See for example Wang et al. 2015: Forecasting elections with non-representative polls.
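
    Wang et al. correct a heavily skewed Xbox user sample with multilevel regression and post-stratification; the simplest version of the underlying idea is to reweight each stratum from its share in the sample to its share in the target population. A minimal sketch with made-up strata and numbers (not the actual Xbox data or their model):

        import pandas as pd

        # Made-up example: the platform sample skews young, the target population does not.
        strata = pd.DataFrame({
            "age_group": ["18-29", "30-49", "50+"],
            "sample_share": [0.60, 0.30, 0.10],  # composition of the (biased) platform sample
            "pop_share": [0.20, 0.35, 0.45],     # composition of the target population, e.g. from a census
            "support": [0.70, 0.50, 0.30],       # outcome measured within each stratum of the sample
        })

        naive = (strata["sample_share"] * strata["support"]).sum()
        post_stratified = (strata["pop_share"] * strata["support"]).sum()

        print(f"Naive estimate from the raw sample: {naive:.2f}")
        print(f"Post-stratified estimate:           {post_stratified:.2f}")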