Semi-Supervised Learning for Blog Classification

Daisuke Ikeda, Hiroya Takamura, Manabu Okumura

Blog classification (e.g., identifying bloggers' gender or age) is one of the most interesting current problems in blog analysis. Although this problem is usually solved by applying supervised learning techniques, the large labeled dataset required for training is not always available. In contrast, unlabeled blogs can easily be collected from the web. Therefore, a semi-supervised learning method for blog classification, effectively using unlabeled data, is proposed. In this method, entries from the same blog are assumed to have the same characteristics. With this assumption, the proposed method captures the characteristics of each blog, such as writing style and topic, and uses these characteristics to improve the classification accuracy.

Subjects: 13. Natural Language Processing; 12. Machine Learning and Discovery

Submitted: Apr 15, 2008

This page is copyrighted by AAAI. All rights reserved. Your use of this site constitutes acceptance of all of AAAI's terms and conditions and privacy policy.