A number of recent studies have demonstrated the utility of social media data for inferring societal attributes such as public opinion and health. A commonly cited limitation of this methodology is its inherent selection bias: social media users are a non-representative sample of the population. This bias is exacerbated by filtering steps that further restrict the sample in biased ways. Building on recent work in computational linguistics that infers demographic attributes of people from their communications, we investigate methods to quantify and control for selection bias in social media studies. We present results estimating several county-level health statistics (e.g., obesity, diabetes, access to healthy foods) based on the Twitter activity of the top 100 counties in the U.S., and we compare strategies for reducing selection bias.


  author =       {Aron Culotta},
  title =        {Reducing Sampling Bias in Social Media Data for County Health Inference},
  booktitle =    {JSM Proceedings},
  year =         2014