Proceedings: Machine Learning in Information Access: Papers from the 1996 AAAI Spring Symposium
Abstract:
This paper studies training-set sampling strategies in the context of statistical learning for text categorization. It is argued that sampling strategies favoring common categories are superior to uniform coverage or mistake-driven approaches when performance is measured by globally assessed precision and recall. The hypothesis is validated empirically by examining the performance of a nearest-neighbor classifier on training samples drawn from a pool of 235,401 training texts with 29,741 distinct categories. The learning curves of the classifier are analyzed with respect to the choice of training resources, the sampling method, the size, vocabulary, and category coverage of a sample, and the category distribution over the texts in the sample. Nearly optimal categorization performance is achieved with a relatively small training sample, showing that statistical learning can be applied successfully to very large text categorization problems with affordable computation.
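To make the contrast concrete, below is a minimal Python sketch of the two sampling regimes the abstract compares: a uniform-coverage baseline versus sampling biased toward common categories. The pool representation, function names, and frequency-weighting scheme are illustrative assumptions, not the paper's exact procedure.

    import random
    from collections import Counter

    def sample_favoring_common_categories(pool, sample_size, seed=0):
        """Draw a training sample biased toward frequent categories.

        pool: list of (text, categories) pairs, where categories is a
        list of labels. Documents labeled with common categories are
        proportionally more likely to be drawn. (Hypothetical sketch,
        not the paper's procedure.)
        """
        rng = random.Random(seed)
        # Count how often each category occurs across the whole pool.
        freq = Counter(c for _, cats in pool for c in cats)
        # Weight each document by the combined frequency of its categories.
        weights = [sum(freq[c] for c in cats) for _, cats in pool]
        # Sample with replacement, in proportion to those weights.
        return rng.choices(pool, weights=weights, k=sample_size)

    def sample_uniformly(pool, sample_size, seed=0):
        """Uniform-coverage baseline: every document equally likely."""
        return random.Random(seed).sample(pool, sample_size)

    # Toy usage: documents in common categories dominate the biased sample.
    pool = [
        ("oil prices rise", ["crude"]),
        ("crude futures fall", ["crude"]),
        ("grain exports up", ["grain"]),
        ("rare topic note", ["obscure"]),
    ]
    biased = sample_favoring_common_categories(pool, 3)

The intuition behind the biased strategy is that globally assessed precision and recall are dominated by the common categories, so spending the training budget there pays off more than covering every rare category once.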