Mega-Classification: Discovering Motifs in Massive Datastreams

Authors

Nomi L. Harris

Lawrence Hunter

David J. States

Proceedings:

Book One

Volume

Issue:

Proceedings of the AAAI Conference on Artificial Intelligence, 10

Track:

Scaling Up

Downloads:

Download PDF

Abstract:

We report on the development and application of an efficient unsupervised learning procedure for the classification of an unsegmented datastream, given a set of probabilistic binary similarity judgments between regions in the stream. Our method is effective on very large databases, and tolerates the presence of noise in the similarity judgements and in the extents of similar regions. We applied this method to the problem of finding the sequence-level building blocks of proteins. After verifying the effectiveuess of the clusterer by testing it on synthetic protein data with known evolutionary history, we applied the method to a large protein sequence database (a datastream of more than IO^7 elements) and found about 10,000 protein sequence classes. The motifs defined by these classes are of biological interest, and have the potential to supplement or replace the existing manual annotation of protein sequence databases.

AAAI

Proceedings of the AAAI Conference on Artificial Intelligence, 10

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.