Genre Classification of Web Documents

Elizabeth S Boese, Adele Howe

Retrieving relevant documents over the Web is an overwhelming task when search engines return thousands of Web documents. Sifting through these documents is time-consuming and sometimes leads to an unsuccessful search. One problem is that most search engines rely on matching a query to documents based solely on topical keywords. However, many users of search engines have a particular genre in mind for the desired documents. The genre of a document concerns aspects of the document such as the style or readability, presentation layout, and meta-content such as words in the title or the existence of graphs or photos. By including genre in Web searches, we hypothesize that Web document retrieval could greatly improve accuracy by better matching documents to the user’s information needs. Before implementing a search engine capable of discriminating on both genre and topic, a feasibility analysis of genre classification is needed. Our previous research achieved 91% classification accuracy across ten genres, whereas similar research range between 60 and 85% accuracy. However, the ten genres used in our research were mostly distinct and only exemplar Web documents (consisting of only one genre) were chosen. This paper discusses our current work which involves an in-depth analysis of maintaining high accuracy rates among genres that are very similar.

Subjects: 1.10 Information Retrieval; 12. Machine Learning and Discovery

Submitted: Apr 5, 2005

This page is copyrighted by AAAI. All rights reserved. Your use of this site constitutes acceptance of all of AAAI's terms and conditions and privacy policy.