Abstract

Annotating data is a common bottleneck in building text classifiers. This is particularly problematic in social media domains, where data drift requires frequent retraining to maintain high accuracy. In this paper, we propose and evaluate a text classification method for Twitter data whose only required human input is a single keyword per class. The algorithm proceeds by identifying exemplar Twitter accounts that are representative of each class by analyzing Twitter Lists (human-curated collections of related Twitter accounts). A classifier is then fit to the exemplar accounts and used to predict labels of new tweets and users. We develop domain adaptation methods to address the noise and selection bias inherent to this approach, which we find to be critical to classification accuracy. Across a diverse set of tasks (topic, gender, and political affiliation classification), we find that the resulting classifier is competitive with a fully supervised baseline, achieving superior accuracy on four of six datasets despite using no manually labeled data.

Citation

@article{culotta2016training,
  author =       {Aron Culotta},
  title =        {Training a text classifier with a single word using Twitter Lists and domain adaptation},
  journal = {Social Network Mining and Analysis},
  volume = 6,
  number = 1,
  pages = {1--15},
  year =         2016,
}