The World Wide Web provides a wealth of data that can be harnessed to improve information retrieval and increase understanding of the relationships between different entities. In many cases, we are interested in determining how similar two entities are to each other, where the entities may be pieces of text, descriptions of some object, or even the preferences of a group of people. In this work, we examine several instances of this problem and show how they can be addressed by applying data mining techniques to large web-based data sets. Specifically, we examine the problems of: (1) determining the similarity of short texts, even those that share no terms in common, (2) learning similarity functions for semi-structured data to address tasks such as record linkage between objects, and (3) measuring the similarity between on-line communities of users as part of a recommendation system. While we present rather different techniques for each problem, we show how measuring similarity between entities in all these domains applies directly to the overarching goal of improving information access for users of web-based systems.
Subjects: 12. Machine Learning and Discovery; 1.10 Information Retrieval
Submitted: Feb 7, 2006