Coping With Noise in a Real-World Weblog Crawler and Retrieval System

Authors

James Lanagan, Paul Ferguson, Neil O'Hare, Alan Smeaton

Clarity: Centre For Sensor Web Technologies, Clarity: Centre For Sensor Web Technologies, Clarity: Centre For Sensor Web Technologies, Clarity: Centre For Sensor Web Technologies

Proceedings:

Proceedings of the International AAAI Conference on Web and Social Media, 4

Issue:

Vol. 4 No. 1 (2010): Fourth International AAAI Conference on Weblogs and Social Media

Track:

Poster Papers

Abstract:

In this paper we examine the effects of noise when creating a real-world weblog corpus for information retrieval. We focus on the DiffPost (Lee et al. 2008) approach to removing noise from blog pages and examine the difficulties encountered when crawling the blogosphere to build such a corpus. We introduce and evaluate a number of enhancements to the original DiffPost approach that increase the robustness of the algorithm. We then extend DiffPost by considering the ratio of anchor text to text, and find that the time interval between crawls matters more to the successful application of noise-removal algorithms in the blog context than any additional improvement to the removal algorithm itself.
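
The two ideas named in the abstract can be illustrated with a minimal Python sketch. It assumes, as a simplification, that template noise shows up as lines shared between two pages of the same blog, and that the anchor-text-to-text ratio is the share of visible characters that sit inside `<a>` elements. Neither function reproduces the published DiffPost algorithm or the authors' implementation; the names `lines_unique_to` and `anchor_text_ratio` are hypothetical.

```python
from html.parser import HTMLParser


class _AnchorTextCounter(HTMLParser):
    """Counts text characters overall and those nested inside <a> elements.

    Simplification: script and style contents are counted as text too.
    """

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.anchor_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.anchor_depth > 0:
            self.anchor_chars += n


def anchor_text_ratio(html_fragment):
    """Return anchor-text characters divided by all text characters (0.0 if empty)."""
    parser = _AnchorTextCounter()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.anchor_chars / parser.total_chars


def lines_unique_to(page, other_page):
    """Keep non-empty lines of `page` that do not also occur in `other_page`.

    Assumption: lines shared by two pages of the same blog are template noise.
    """
    other = {line.strip() for line in other_page.splitlines()}
    return [
        line for line in page.splitlines()
        if line.strip() and line.strip() not in other
    ]
```

Under these assumptions, a block made up mostly of links (a blogroll or navigation bar, say) would score close to 1.0, while a post body would score close to 0.0. The differencing step only works if the two crawled pages still share the same template, which is one way the interval between crawls could affect the outcome.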

DOI:

10.1609/icwsm.v4i1.14040