Abstract:
With the advent of large distributed and dynamic document collections (such as are on the World Wide Web), it is becoming increasingly important to automate the task of text categ~zation. The use of machine learning in text categorization is difficult due to characteristics of the domain~ including a very large number of input features, noise, and the problems associated with semantic analysis of text. As a result, the use of mpervised learning requires a relatively large number of labeled examples. We explore the possibility of using (almost) unsupervised learning and propose some novel approaches to using machine learning in this domain.