As text classifiers become increasingly used in observational studies, it is critical to consider not only their accuracy but also their robustness to changes in the data distribution. In this paper, we consider the case where there is a confounding variable $Z$ that influences both the text features $W$ and the class variable $Y$. For example, a classifier trained to predict the health status of a user based on their online communications may be confounded by socioeconomic variables. When the influence of $Z$ changes from training to testing data, we find that classifier accuracy can degrade rapidly. Our approach, based on Pearl’s back-door adjustment, estimates the underlying effect of a text variable on the class variable while controlling for the confounding variable. We conduct an observational study to estimate the effect of location on dispositional affect, with gender as a confounder. We find that our adjustment results in more accurate estimates of effect sizes over a range of possible confounding strengths.


  author =       {Virgile Landeiro and Aron Culotta},
  title =        {Reducing confounding bias in observational studies that use text classification},
  booktitle = {AAAI Spring Symposium on Observational Studies through Social Media and Other Human-Generated Content},
  year =         2016