Pallika H. Kanani, Andrew McCallum
Entity resolution in the domain of research paper authors is an important but difficult problem. It suffers from insufficient contextual information, so adding information from the web can significantly improve performance. We formulate the author coreference problem as one of graph partitioning with discriminatively trained edge weights. Building on our previous work, this paper presents improved and more comprehensive results for the method in which we incorporate web documents as additional nodes in the graph. We also propose efficient strategies to select a subset of nodes to add to the graph and to select a subset of queries to gather additional nodes, without significantly reducing the performance gain. We extend the classic set-cover problem to develop a node selection criterion, which opens up interesting theoretical possibilities. Finally, we propose a hybrid approach that achieves 74.3% of the total improvement using only 18.3% of all additional mentions.
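For readers unfamiliar with the set-cover problem mentioned above, the following is a minimal sketch of the classic greedy approximation, not the extended selection criterion developed in the paper, which the abstract does not spell out. The function name, the document labels, and the mention identifiers are all hypothetical and chosen only for illustration. The sketch assumes each candidate web document is represented by the set of author mentions it relates to.

```python
def greedy_set_cover(universe, candidates):
    """Classic greedy set cover (illustrative only, not the paper's criterion).

    universe:   iterable of elements to cover (here: hypothetical mention ids)
    candidates: dict mapping a candidate name to the set of elements it covers
    Returns the chosen candidate names and any elements left uncovered.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the candidate that covers the most still-uncovered elements.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            break  # no candidate covers any of the remaining elements
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical data: five author mentions and four candidate web documents.
mentions = {"m1", "m2", "m3", "m4", "m5"}
web_docs = {
    "doc_a": {"m1", "m2", "m3"},
    "doc_b": {"m3", "m4"},
    "doc_c": {"m4", "m5"},
    "doc_d": {"m1"},
}
chosen, uncovered = greedy_set_cover(mentions, web_docs)
```

In this toy instance the greedy rule selects two of the four documents and still touches every mention, which illustrates the general idea of adding only a subset of nodes to the graph.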
Subjects: 12. Machine Learning and Discovery; 10. Knowledge Acquisition
Submitted: May 15, 2007