Document Classification for Focused Topics
Russell Power, Jay Chen, Trishank Karthik, Lakshminarayanan Subramanian

Feature extraction is one of the fundamental challenges in improving the accuracy of document classification. Although there is a large body of research on document classification, most existing approaches either fall short of high classification accuracy or require massive training sets. In this paper, we propose a simple feature extraction algorithm that achieves high document classification accuracy in the context of development-centric topics. Our feature extraction algorithm exploits two distinct aspects of development-centric topics:

  • most of these topics tend to be very focused (unlike semantically hard classification topics such as chemistry or banks)
  • due to the local language and cultural underpinnings of these topics, the authentic pages tend to use several region-specific features.

Our algorithm uses popularity and rarity as two separate metrics to extract features that describe a topic. Given a topic, our output feature set comprises: (i) a list of popular keywords closely related to the topic; (ii) a list of rare keywords closely related to the topic. We show that a simple joint classifier based on these two feature sets can achieve high classification accuracy, whereas each feature subset on its own is insufficient. We have tested our algorithm across a wide range of development-centric topics.
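
The idea of two keyword lists feeding a joint classifier can be illustrated with a minimal sketch. The abstract does not define the popularity or rarity metrics, nor how the classifier combines them, so everything below is an assumption made for illustration: the function names (`extract_features`, `is_on_topic`), the document-frequency scoring, the rule that a rare keyword never appears in a background corpus, and the requirement that a page match both lists.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequency(docs):
    """Map each term to the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


def extract_features(topic_docs, background_docs, n_popular=1, n_rare=2):
    """Return (popular, rare) keyword sets; both corpora must be non-empty."""
    topic_df = document_frequency(topic_docs)
    background_df = document_frequency(background_docs)

    # Assumed popularity score: share of topic documents containing the term
    # minus the share of background documents containing it.
    def popularity(term):
        return (topic_df[term] / len(topic_docs)
                - background_df[term] / len(background_docs))

    ranked = sorted(topic_df, key=lambda t: (-popularity(t), t))
    popular = set(ranked[:n_popular])

    # Assumed rarity rule: the term never occurs in the background corpus.
    candidates = [t for t in topic_df
                  if background_df[t] == 0 and t not in popular]
    candidates.sort(key=lambda t: (-topic_df[t], t))
    rare = set(candidates[:n_rare])
    return popular, rare


def is_on_topic(doc, popular, rare):
    """Assumed joint rule: require a hit in BOTH keyword sets."""
    tokens = set(tokenize(doc))
    return bool(tokens & popular) and bool(tokens & rare)


if __name__ == "__main__":
    # Toy corpora, invented for this example only.
    topic = [
        "malaria treatment with artemisinin in rural clinics",
        "malaria prevention using bednets and artemisinin",
        "malaria outbreaks and bednets distribution",
    ]
    background = [
        "treatment of injuries in sports clinics",
        "distribution of prevention funds and rural roads",
    ]
    popular, rare = extract_features(topic, background)
    print(sorted(popular), sorted(rare))
    print(is_on_topic("new malaria study on artemisinin resistance", popular, rare))
```

Requiring a match in both sets is one simple way to reflect the claim that neither list suffices alone: a page that mentions only a popular keyword, or only a rare one, is rejected under this assumed rule.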


classification; development; focused topics

