Automatically Utilizing Secondary Sources to Align Information Across Sources

Martin Michalowski, Snehal Thakkar, Craig A. Knoblock

Abstract


XML, web services, and the semantic web have opened the door for new and exciting information-integration applications. Information sources on the web are controlled by different organizations or people, utilize different text formats, and have varying inconsistencies. Therefore, any system that integrates information from different data sources must identify common entities from these sources. Data from many data sources on the web does not contain enough information to link the records accurately using state-of-the-art record-linkage systems. However, it is possible to exploit secondary data sources on the web to improve the record-linkage process. We present an approach to accurately and automatically match entities from various data sources by utilizing a state-of-the-art record-linkage system in conjunction with a data-integration system. The data-integration system is able to automatically determine which secondary sources need to be queried when linking records from various data sources. In turn, the record-linkage system is able to utilize this additional information to improve the accuracy of the linkage between datasets.
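As a rough illustration of the idea, and not the authors' system, the following Python sketch shows how information from a secondary source might adjust a record-linkage score. Everything in it is an assumption made for the example: the `GEOCODER` lookup table stands in for a secondary web source, the records and field names are invented, the string similarity comes from Python's `difflib`, and the 0.3 adjustment is an arbitrary weight.

```python
from difflib import SequenceMatcher

# Hypothetical secondary source: maps an address string to coordinates.
# A real system would query a web source chosen by the integration layer.
GEOCODER = {
    "12 Main St": (34.0201, -118.2851),
    "12 Main Street, Suite A": (34.0201, -118.2851),
    "980 Ocean Ave": (33.9850, -118.4695),
}


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_location(addr_a, addr_b, tolerance=1e-3):
    """Return True/False if both addresses geocode, or None if either is unknown."""
    pa, pb = GEOCODER.get(addr_a), GEOCODER.get(addr_b)
    if pa is None or pb is None:
        return None
    return abs(pa[0] - pb[0]) <= tolerance and abs(pa[1] - pb[1]) <= tolerance


def match_score(rec_a, rec_b, use_secondary):
    """Score a candidate pair, optionally adjusted by the secondary source."""
    score = name_similarity(rec_a["name"], rec_b["name"])
    if use_secondary:
        loc = same_location(rec_a["address"], rec_b["address"])
        if loc is True:
            score = min(1.0, score + 0.3)   # arbitrary illustrative boost
        elif loc is False:
            score = max(0.0, score - 0.3)   # arbitrary illustrative penalty
    return score


# Invented example records from two primary sources.
rec_a = {"name": "Joe's Diner", "address": "12 Main St"}
rec_b = {"name": "Joes Diner and Grill", "address": "12 Main Street, Suite A"}
rec_c = {"name": "Joe's Diner", "address": "980 Ocean Ave"}

print(match_score(rec_a, rec_b, use_secondary=False))
print(match_score(rec_a, rec_b, use_secondary=True))
print(match_score(rec_a, rec_c, use_secondary=False))
print(match_score(rec_a, rec_c, use_secondary=True))
```

In this sketch the secondary lookup raises the score for the pair whose names differ but whose addresses resolve to the same place, and lowers it for the pair with identical names at different locations.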


DOI: http://dx.doi.org/10.1609/aimag.v26i1.1797

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.