Social media are increasingly being used to complement traditional survey methods in health, politics, and marketing. However, little has been done to adjust for the sampling bias inherent in this approach. Inferring demographic attributes of social media users is thus a critical step to improving the validity of such studies. While there have been a number of supervised machine learning approaches to this problem, these rely on a training set of users annotated with attributes, which can be difficult to obtain. We instead propose training a demographic attribute classifiers that uses county-level supervision. By pairing geolocated social media with county demographics, we build a regression model mapping text to demographics. We then adopt this model to make predictions at the user level. Our experiments using Twitter data show that this approach is surprisingly competitive with a fully supervised approach, estimating the race of a user with 80% accuracy.


  author =       {Ehsan Mohammady and Aron Culotta},
  title =        {Using county demographics to infer attributes of Twitter users},
  booktitle = {ACL Joint Workshop on Social Dynamics and Personal Attributes in Social Media},
  year =         2014