AAAI Publications, The Thirtieth International Flairs Conference

An Ensemble Blocking Approach for Entity Resolution of Heterogeneous Datasets
Janani Balaji, Faizan Javed, Chris Min, Sam Sander

Last modified: 2017-05-08

Abstract


Entity resolution, also called record linkage or deduplication, is the process of identifying duplicate versions of the same entity and merging them into a unified representation. The standard practice is to use a rule-based or machine-learning-based model that compares pairs of records and assigns each pair a score representing its match/non-match status. However, exhaustively comparing all pairs of records gives the matcher quadratic complexity, so a blocking step is performed before matching to group similar entities into smaller blocks that the matcher can then examine exhaustively. Several blocking schemes have been developed to partition the input dataset into manageable groups efficiently and effectively. At our organization, we perform deduplication on massive datasets of people profiles collected from disparate sources with varying informational content. We observed that no single blocking technique covered all possible scenarios because of the high heterogeneity of our data sources. In this paper, we describe our ensemble approach to blocking, which combines two different blocking techniques to leverage their respective strengths.
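
To make the idea of blocking concrete, the following is a minimal sketch rather than the method described in the paper. The abstract does not name the two blocking techniques that are combined, so the sketch assumes two simple key-based schemes (a name-prefix key and an exact email key) and forms the ensemble by taking the union of their candidate pairs. The record fields and the functions `block_pairs`, `name_key`, `email_key`, and `ensemble_pairs` are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_pairs(records, key_fn):
    """Group records by a blocking key and return the candidate pairs in each block."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        key = key_fn(rec)
        if key is not None:  # a record missing the field is invisible to this scheme
            blocks[key].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        # Exhaustive comparison happens only inside a block, not across the dataset.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def name_key(rec):
    """Hypothetical scheme 1: first three letters of the lowercased name."""
    name = rec.get("name")
    return name.strip().lower()[:3] if name else None


def email_key(rec):
    """Hypothetical scheme 2: exact lowercased email address."""
    email = rec.get("email")
    return email.strip().lower() if email else None


def ensemble_pairs(records, key_fns):
    """Assumed ensemble rule: union of the candidate pairs from every scheme."""
    pairs = set()
    for key_fn in key_fns:
        pairs |= block_pairs(records, key_fn)
    return pairs


# Toy heterogeneous profiles: some sources lack a name, others lack an email.
records = {
    1: {"name": "Jonathan Smith", "email": "jsmith@example.com"},
    2: {"name": "Jon Smith", "email": None},
    3: {"name": None, "email": "jsmith@example.com"},
    4: {"name": "Maria Lopez", "email": "mlopez@example.com"},
}

candidates = ensemble_pairs(records, [name_key, email_key])
all_pairs = len(records) * (len(records) - 1) // 2  # n(n-1)/2 exhaustive pairs
print(sorted(candidates), "out of", all_pairs, "possible pairs")
```

In this toy data, neither assumed scheme alone links record 1 to both of its likely duplicates: the name key misses record 3 (no name) and the email key misses record 2 (no email). The union recovers both candidate pairs while still passing the matcher fewer pairs than the exhaustive comparison would.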

Keywords


blocking; entity resolution; deduplication; record linkage; entity matching; data heterogeneity
