Calculating the Exact Probability of Language-Like Patterns in Biomolecular Sequences

Authors

Kevin Atteson

Proceedings:

Proceedings Of The Sixth International Conference On Intelligent Systems For Molecular Biology

Abstract:

We present algorithms for the exact computation of the probability that a random string of a certain length matches a given regular expression. These algorithms can be used to determine statistical significance in a variety of pattern searches such as motif searches and gene-finding. This work improves upon work of Kleffe and Langebacker and of Sewell and Durbin in several ways. First, in many cases of interest, the algorithms presented here are faster. In addition, the type of pattern considered here strictly includes those of both previous works but also allows, for instance, arbitrary-length gaps. Also, the type of probability model which can be used is more general than that of Sewell and Durbin, allowing for Markov chains. The problem solved in this work is in fact in the class of NP-hard problems, which are believed to be intractable. However, the problem is fixed-parameter tractable, meaning that it is tractable for small patterns. The problem is also computationally feasible for many patterns which occur in practice. As a sample application, we consider calculating the statistical significance of most of the PROSITE patterns as in Sewell and Durbin. Whereas their method was only fast enough to exactly compute the probabilities for sequences of length 13 larger than the pattern length, we calculate these probabilities for sequences of up to length 2000. In addition, we calculate most of these probabilities using a first-order Markov chain. Most of the PROSITE patterns have high significance at length 2000 under both the i.i.d. and Markov chain models. For further applications, we demonstrate the calculation of the probability of a PROSITE pattern occurring on either strand of a random DNA sequence of up to 500 kilobases and the probability of a simple gene model occurring in a random sequence of up to 1 megabase.
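
To make the quantity in the abstract concrete, the following is a minimal sketch of one standard way to compute such a probability exactly. It is not the paper's algorithm, which the abstract does not describe in detail. The sketch assumes the pattern has already been compiled into a deterministic finite automaton (DFA) and that symbols are drawn i.i.d.; the names `match_probability`, `contains_aa_dfa`, and `UNIFORM_DNA` are hypothetical and chosen only for illustration. It propagates a probability distribution over DFA states one symbol at a time, using exact rational arithmetic.

```python
from fractions import Fraction


def match_probability(transitions, start, accepting, symbol_probs, length):
    """Exact probability that an i.i.d. random string of the given length
    ends in an accepting state of the DFA (illustrative sketch only)."""
    # dist maps each reachable state to the probability of being in it.
    dist = {start: Fraction(1)}
    for _ in range(length):
        nxt = {}
        for state, p in dist.items():
            for sym, q in symbol_probs.items():
                target = transitions[state][sym]
                nxt[target] = nxt.get(target, Fraction(0)) + p * q
        dist = nxt
    return sum((p for s, p in dist.items() if s in accepting), Fraction(0))


def contains_aa_dfa():
    """Hypothetical example DFA: accepts DNA strings containing 'AA'.
    State 0: no progress; state 1: last symbol was 'A';
    state 2: 'AA' already seen (absorbing, accepting)."""
    others = "CGT"
    transitions = {
        0: {"A": 1, **{c: 0 for c in others}},
        1: {"A": 2, **{c: 0 for c in others}},
        2: {c: 2 for c in "ACGT"},
    }
    return transitions, 0, {2}


# Assumed symbol model: uniform i.i.d. nucleotides.
UNIFORM_DNA = {c: Fraction(1, 4) for c in "ACGT"}
```

For example, with `transitions, start, accepting = contains_aa_dfa()`, the call `match_probability(transitions, start, accepting, UNIFORM_DNA, 3)` returns `Fraction(7, 64)`, matching a direct count of the 7 length-3 strings that contain `AA`. The cost of this sketch grows with the number of DFA states, which can be very large for a general regular expression; that is consistent with the abstract's remark that the problem is hard in general but tractable for small patterns. Handling a first-order Markov chain would additionally require tracking the previous symbol alongside the DFA state, which this sketch does not do.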