Delip Rao, Deepak P, Deepak Khemani
Text categorization involves mapping documents to a fixed set of labels. A related and equally important problem is that of assigning labels to large corpora. With a deluge of documents from sources like the World Wide Web, manual labeling by domain experts is prohibitively expensive, which motivates the need for automatic methods of annotating documents with labels. Reducing the effort of labeling documents has received considerable attention in the past, but most of this work involved some form of supervised or semi-supervised learning. In this work we explore a novel method of assigning labels to documents without using any training data. The proposed method uses clustering to build semantically related sets that serve as candidate labels for documents. This technique could be used to label large corpora in an unattended fashion.
Subjects: 12. Machine Learning and Discovery; 13. Natural Language Processing
Submitted: Feb 13, 2006
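
The abstract does not specify the clustering algorithm or how candidate labels are matched to documents, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes terms are clustered by single-link grouping on the Jaccard overlap of the documents they occur in, and that each document receives the cluster sharing the most terms with it. The names `tokenize`, `cluster_terms`, `label_documents`, the `STOPWORDS` set, and the `threshold` parameter are all hypothetical, introduced here for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Assumption: a tiny illustrative stopword list, not taken from the paper.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is", "for", "on"}


def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic non-stopword tokens."""
    return [w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS]


def cluster_terms(docs, threshold=0.5):
    """Group terms into semantically related sets without any training data.

    Assumption: two terms are related when the Jaccard similarity of the
    document sets they occur in is at least `threshold`; related terms are
    merged with a union-find structure (single-link clustering).
    """
    postings = defaultdict(set)  # term -> indices of documents containing it
    for i, doc in enumerate(docs):
        for w in set(tokenize(doc)):
            postings[w].add(i)

    parent = {w: w for w in postings}

    def find(w):
        # Follow parent links to the cluster representative (path halving).
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(sorted(postings), 2):
        inter = len(postings[a] & postings[b])
        union = len(postings[a] | postings[b])
        if inter / union >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for w in postings:
        clusters[find(w)].add(w)
    # Singleton clusters carry no relatedness information, so drop them.
    return [c for c in clusters.values() if len(c) > 1]


def label_documents(docs, clusters):
    """Assign each document the term cluster that overlaps it the most."""
    labels = []
    for doc in docs:
        words = set(tokenize(doc))
        best = max(clusters, key=lambda c: len(c & words), default=set())
        labels.append(sorted(best))
    return labels


docs = [
    "stocks market shares trading",
    "market stocks investors",
    "football match goal",
    "goal football league",
]
clusters = cluster_terms(docs)
labels = label_documents(docs, clusters)
for doc, label in zip(docs, labels):
    print(doc, "->", label)
```

In this toy corpus the finance documents and the sports documents end up with different candidate label sets. A realistic system would likely need weighted term statistics and a more robust clustering step than the all-pairs comparison used here, which is quadratic in vocabulary size.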