AAAI Publications, 2010 AAAI Spring Symposium Series

Document Classification for Focused Topics
Russell Power, Jay Chen, Trishank Karthik, Lakshminarayanan Subramanian

Last modified: 2010-03-16


Feature extraction is one of the fundamental challenges in improving the accuracy of document classification. Although there is a large body of research on document classification, most existing approaches either fall short of high classification accuracy or require massive training sets. In this paper, we propose a simple feature extraction algorithm that can achieve high document classification accuracy in the context of development-centric topics. Our feature extraction algorithm exploits two distinct aspects of development-centric topics:

  • most of these topics tend to be very focused (unlike semantically hard classification topics such as chemistry or banks);
  • because of the local language and cultural underpinnings of these topics, authentic pages tend to use several region-specific features.

Our algorithm uses popularity and rarity as two separate metrics to extract features that describe a topic. Given a topic, our output feature set comprises: (i) a list of popular keywords closely related to the topic; (ii) a list of rare keywords closely related to the topic. We show that a simple joint classifier based on these two feature sets can achieve high classification accuracy, whereas each feature subset on its own is insufficient. We have tested our algorithm across a wide range of development-centric topics.
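
To make the popular/rare split concrete, the following is a minimal illustrative sketch in Python. It is not the authors' implementation: the abstract does not say how popularity or rarity is measured, so the sketch assumes document frequency within a small topic corpus as the popularity measure, absence from a background corpus as the rarity measure, and a joint rule that accepts a page only when it matches both keyword lists. The function names, thresholds, and sample documents are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


def extract_features(topic_docs, background_docs,
                     popular_min=0.75, background_max=0.5, rare_min_count=2):
    """Return (popular, rare) keyword sets for a topic.

    Assumed definitions (not taken from the paper):
      popular: in at least `popular_min` of the topic documents and in at
               most `background_max` of the background documents.
      rare:    in at least `rare_min_count` topic documents, below the
               popularity threshold, and absent from the background.
    """
    topic_df = document_frequency(topic_docs)
    background_df = document_frequency(background_docs)
    n_topic, n_background = len(topic_docs), len(background_docs)

    popular, rare = set(), set()
    for term, count in topic_df.items():
        topic_share = count / n_topic
        background_share = background_df[term] / n_background
        if topic_share >= popular_min and background_share <= background_max:
            popular.add(term)
        elif count >= rare_min_count and background_df[term] == 0:
            rare.add(term)
    return popular, rare


def is_on_topic(text, popular, rare):
    """Assumed joint rule: require a match in both keyword sets."""
    tokens = set(tokenize(text))
    return bool(tokens & popular) and bool(tokens & rare)


# Hypothetical toy corpora, invented purely for illustration.
topic_docs = [
    "malaria treatment in kisumu clinics",
    "malaria treatment and bed nets",
    "malaria treatment with artemisinin in kisumu",
    "artemisinin resistance and malaria",
]
background_docs = [
    "stock markets in asia and europe",
    "football results in europe and injuries",
    "travel guide with clinics and markets",
]

popular, rare = extract_features(topic_docs, background_docs)
print(sorted(popular))  # ['malaria', 'treatment']
print(sorted(rare))     # ['artemisinin', 'kisumu']
print(is_on_topic("new malaria ward opens in kisumu", popular, rare))  # True
print(is_on_topic("malaria is a word in the dictionary", popular, rare))  # False
```

In this toy setting, a page that matches only the popular list or only the rare list is rejected. This mirrors the abstract's claim that neither feature subset is sufficient by itself, although a real classifier would need far larger corpora and tuned thresholds.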


Keywords: classification; development; focused topics
