Author Disambiguation using Error-driven Machine Learning with a Ranking Loss Function
Aron Culotta, Pallika Kanani, Robert Hall, Michael Wick, and Andrew McCallum, AAAI 2007 Workshop
Abstract
Author disambiguation is the problem of determining whether records in a publications database refer to the same person. A common supervised machine learning approach is to build a classifier to predict whether a pair of records is coreferent, followed by a clustering step to enforce transitivity. However, this approach ignores powerful evidence obtainable by examining sets (rather than pairs) of records, such as the number of publications or co-authors an author has. In this paper we propose a representation that enables these first-order features over sets of records. We then propose a training algorithm well-suited to this representation that is (1) error-driven in that training examples are generated from incorrect predictions on the training data, and (2) rank-based in that the classifier induces a ranking over candidate predictions. We evaluate our algorithms on three author disambiguation datasets and demonstrate error reductions of up to 60% over the standard binary classification approach.
Citation
@inproceedings{culotta07author,
author = {Aron Culotta and Pallika Kanani and Robert Hall and Michael Wick and Andrew McCallum},
title = {Author Disambiguation using Error-driven Machine Learning with a Ranking Loss Function},
booktitle = {Sixth International Workshop on Information Integration on the Web (IIWeb-07)},
year = {2007},
address = {Vancouver, Canada},
}