Efficiently Enumerating Substrings with Statistically Significant Frequencies of Locally Optimal Occurrences in Gigantic String

Authors

  • Atsuyoshi Nakamura Hokkaido University
  • Ichigaku Takigawa RIKEN
  • Hiroshi Mamitsuka Kyoto University

DOI:

https://doi.org/10.1609/aaai.v34i04.5969

Abstract

We propose new frequent substring pattern mining which can enumerate all substrings with statistically significant frequencies of their locally optimal occurrences from a given single sequence. Our target application is genome sequences, around a half being said to be covered by interspersed and consecutive (tandem) repeats, and detecting these repeats is an important task in molecular life sciences. We evaluate the statistical significance of frequent substrings by using a string generation model with a memoryless stationary information source. We combine this idea with an existing algorithm, ESFLOO-0G.C (Nakamura et al. 2016), to enumerate all statistically significant substrings with locally optimal occurrences. We further develop a parallelized version of our algorithm. Experimental results using synthetic datasets showed the proposed algorithm achieved far higher F-measure in extracting substrings (with various lengths and frequencies) embedded in a randomly generated string with noise, than conventional algorithms. The large-scale experiment using the whole human genome sequence with 3,095,677,412 bases (letters) showed that our parallel algorithm covers 75% of the whole positions analyzed, around 4% and 24% higher than the recent report and the current cutting-edge knowledge, implying a biologically unique finding.

Downloads

Published

2020-04-03

How to Cite

Nakamura, A., Takigawa, I., & Mamitsuka, H. (2020). Efficiently Enumerating Substrings with Statistically Significant Frequencies of Locally Optimal Occurrences in Gigantic String. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04), 5240-5247. https://doi.org/10.1609/aaai.v34i04.5969

Issue

Section

AAAI Technical Track: Machine Learning