Conor Hayes, Paolo Avesani, Sriharsha Veeramachaneni
The Web is experiencing an exponential growth in the use of weblogs or blogs, websites containing dated journal-style entries. Blog entries are generally organised using informally defined labels known as tags. Increasingly, tags are being proposed as a 'grassroots' alternative to Semantic Web standards. We demonstrate that tags by themselves are weak at partitioning blog data. We then show how tags may contribute useful, discriminating information. Using content-based clustering, we observe that frequently occurring tags in each cluster are usually good meta-labels for the cluster concept. We then introduce the Tr score, a score based on the proportion of high-frequency tags in a cluster, and demonstrate that it is strongly correlated with cluster strength. We demonstrate how the Tr score enables the detection and removal of weak clusters. As such, the Tr score can be used as an independent means of verifying topic integrity in a cluster-based recommender system.
Subjects: 1. Applications; 1.10 Information Retrieval
Submitted: Oct 12, 2006