Efficient Name Variation Detection

Authors

Luther Branting

Track:

Contents

Downloads:

Abstract:

Semantic integration, link analysis and other forms of evidence detection often require recognition of multiple occurrences of a single name. However, names frequently occur in orthographic variations resulting from phonetic variations and transcription errors. The computational expense of similarity assessment algorithms usually precludes their application to all pairs of strings. Instead, it is typically necessary to use a high-recall, low-precision index to retrieve a smaller set of candidate matches to which the similarity assessment algorithm is then applied. This paper describes five algorithms for efficient candidate retrieval: Burkhart-Keller trees (BKT); filtered Burkhart-Keller trees (FBKT); partition filtering; ngrams; and Soundex. An empirical evaluation showed that no single algorithm performed best under all circumstances. When the source of name variations was purely orthographic, partition filtering generally performed best. When similarity assessment was based on phonetic similarity and the phonetic model was available, BKT and FBKT performed best. When the pronunciation model was unavailable, Soundex was best for k=0 (homonyms), and partition filtering and BKT were best for k>0. Unfortunately, the high-recall retrieval algorithms were multiple orders of magnitude more costly than the low-recall algorithms.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.