This study borrowed sequence analysis techniques from the genetic sciences and applied them to a similar problem in email filtering and web searching. Genre identification is the process of determining the type or family of a given document. For example, is the document a letter, a news story, a horoscope, a joke, or an advertisement? Genre identification allows a computer user to further filter email and web sites in a way that is entirely different from topic-based methods. This study presents original research in an application of neural networks to the genre identification problem. The data for the study came from a database constructed by the author and his colleagues. The data consisted of descriptive features and the genre classification, as judged by a human, for more than 5,000 different documents. Ten different genres were represented. The descriptive features consisted of 89 different measurements of each document, such as average word length, the number of numeric terms, and the proportion of present-tense verbs. The data were divided into two sets, with 75% for training and 25% for testing. The first neural network applied was a very basic single-layer network that achieved 79% correct classifications on the testing data. This performance was equivalent to that of the previous best method on the problem, decision trees. When more complex neural networks were applied to the problem, performance increased significantly. The best performance, 86% correct classifications, was achieved by a network with a single hidden layer of 300 units. Neither increasing the number of hidden layers nor changing the number of hidden units improved performance. A weight decay process also did not improve performance. The analysis of the features indicated that the networks were exploiting second-order information to achieve better performance. This implies that neural networks will outperform statistical models or other methods that use only first-order information.
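The setup described above can be pictured with a minimal sketch, which is not the author's implementation. The sizes (89 features, 10 genres, 300 hidden units, a 75/25 split) come from the abstract. Everything else is an assumption made for illustration: the tanh hidden activation, the softmax output, the random initialization, the synthetic data, and all function names. The network is untrained, so the sketch shows only the shape of the computation, not the reported accuracy.

```python
import math
import random

N_FEATURES = 89   # descriptive measurements per document (from the abstract)
N_GENRES = 10     # genres represented (from the abstract)
N_HIDDEN = 300    # hidden units in the best-performing network (from the abstract)


def split_train_test(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of the rows and split it into training and testing sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer (assumed init)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases):
    """Weighted sum of the inputs plus a bias, for each output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_genre_probs(features, hidden_layer, output_layer):
    """Forward pass: features -> tanh hidden layer -> softmax over genres."""
    hidden = [math.tanh(h) for h in dense(features, *hidden_layer)]
    return softmax(dense(hidden, *output_layer))


rng = random.Random(0)
# Synthetic stand-in data: (feature vector, genre index) pairs, not the study's database.
documents = [([rng.random() for _ in range(N_FEATURES)], rng.randrange(N_GENRES))
             for _ in range(40)]
train, test = split_train_test(documents)

hidden_layer = init_layer(N_FEATURES, N_HIDDEN, rng)
output_layer = init_layer(N_HIDDEN, N_GENRES, rng)
probs = predict_genre_probs(test[0][0], hidden_layer, output_layer)
predicted = max(range(N_GENRES), key=probs.__getitem__)
```

The nonlinear hidden layer is what lets such a network respond to combinations of features rather than to each feature separately, which is one way to read the abstract's remark about second-order information.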
Published Date: May 2004
Registration: ISBN 978-1-57735-201-3
Copyright: Published by The AAAI Press, Menlo Park, California.