An Application of Neural Networks to Sequence Analysis and Genre Identification

Authors

David Bisant

Published:

May 2004

Proceedings:

Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2004)

Track:

All Papers

Abstract:

This study borrowed sequence analysis techniques from the genetic sciences and applied them to a similar problem in email filtering and web searching. Genre identification is the process of determining the type or family of a given document: for example, whether the document is a letter, a news story, a horoscope, a joke, or an advertisement. Genre identification allows a computer user to filter email and web sites in a way that is entirely different from topic-based methods. This study presents original research in an application of neural networks to the genre identification problem. The data for the study came from a database constructed by the author and his colleagues. The data consisted of descriptive features and the genre classification, as judged by a human, for over 5000 different documents. Ten different genres were represented. The descriptive features consisted of 89 different measurements of each document, such as average word length, the number of numeric terms, and the proportion of present tense verbs. The data was divided into two sets, with 75% for training and 25% for testing. The first neural network applied was a very basic single-layer network that achieved 79% correct classifications on the testing data. This performance was equivalent to that of the previous best method on the problem, decision trees. When more complex neural networks were applied to the problem, performance increased significantly. The best performance of 86% correct classifications was achieved by a network with a single hidden layer of 300 units. Neither increasing the number of hidden layers nor changing the number of hidden units improved performance. A weight decay process also did not improve performance. The analysis of the features indicated that second-order information was being exploited by the networks for better performance. This means that neural networks will outperform statistical models or other methods that only utilize first-order information.
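To make the setup described in the abstract more concrete, the following is a minimal sketch of a 75%/25% train/test split and the forward pass of a network with 89 input features, one hidden layer of 300 units, and 10 genre outputs. Only those sizes and the split ratio come from the abstract. The tanh hidden activation, the softmax output, the weight initialization, the function names, and the random stand-in data are all assumptions made for illustration; the paper's actual training procedure is not described here and is not reproduced.

```python
import math
import random

# Sizes taken from the abstract: 89 features, 300 hidden units, 10 genres.
N_FEATURES, N_HIDDEN, N_GENRES = 89, 300, 10


def split_train_test(rows, train_fraction=0.75, seed=0):
    """Shuffle the rows and split them into training and testing lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (initialization is an assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(weights, biases, x):
    """Compute W x + b for one input vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, hidden, output):
    """Single hidden layer with tanh units, then a softmax over genres (assumed)."""
    h = [math.tanh(v) for v in affine(*hidden, x)]
    return softmax(affine(*output, h))


rng = random.Random(0)
hidden = init_layer(N_FEATURES, N_HIDDEN, rng)
output = init_layer(N_HIDDEN, N_GENRES, rng)

# Synthetic stand-in documents: random feature vectors, not the study's data.
documents = [[rng.random() for _ in range(N_FEATURES)] for _ in range(40)]
train, test = split_train_test(documents)

probs = forward(test[0], hidden, output)
predicted_genre = max(range(N_GENRES), key=probs.__getitem__)
```

Because the weights here are untrained, `predicted_genre` is arbitrary; the sketch only shows the shape of the computation. The nonlinear hidden layer is the part that lets such a network combine pairs of features, which is one way to read the abstract's remark about second-order information.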

ISBN 978-1-57735-201-3

Published by The AAAI Press, Menlo Park, California.
