Abstract

Author disambiguation is the problem of determining whether records in a publications database refer to the same person. A common supervised machine learning approach is to build a classifier to predict whether a pair of records is coreferent, followed by a clustering step to enforce transitivity. However, this approach ignores powerful evidence obtainable by examining sets (rather than pairs) of records, such as the number of publications or co-authors an author has. In this paper we propose a representation that enables these first-order features over sets of records. We then propose a training algorithm well-suited to this representation that is (1) error-driven in that training examples are generated from incorrect predictions on the training data, and (2) rank-based in that the classifier induces a ranking over candidate predictions. We evaluate our algorithms on three author disambiguation datasets and demonstrate error reductions of up to 60% over the standard binary classification approach.

Citation

@inproceedings{culotta07author,
  author = {Aron Culotta and Pallika Kanani and Robert Hall and Michael Wick and Andrew McCallum},
  title = {Author Disambiguation using Error-driven Machine Learning with a Ranking Loss Function},
  booktitle = {Sixth International Workshop on Information Integration on the Web (IIWeb-07)},
  	  year = {2007},
  address = {Vancouver, Canada},
}