Learning to Recognize Promoter Sequences in E. coli by Modeling Uncertainty in the Training Data

Steven W. Norton

Automatic recognition of promoter sequences is an important open problem in molecular biology. Unfortunately, the usual machine learning version of this problem is critically flawed. In particular, the dataset available from the Irvine repository was drawn from a compilation of promoter sequences that were preprocessed to conform to the biologists’ related notion of the consensus sequence, a first-order approximation with a number of shortcomings that are well known in molecular biology. Although concept descriptions learned from the Irvine data may represent the consensus sequence, they do not represent promoters. More generally, imperfections in preprocessed data and statistical variations in the locations of biologically meaningful features within the raw data invalidate standard attribute-based approaches. I suggest a dataset, a concept-description language, and a model of uncertainty in the promoter data that are all biologically justified, then address the learning problem with incremental probabilistic evidence combination. This knowledge-based approach yields a more accurate and more credible solution than more conventional machine learning systems.

