A New Burrows Wheeler Transform Markov Distance

Authors

  • Edward Raff, Booz Allen Hamilton
  • Charles Nicholas, University of Maryland, Baltimore County
  • Mark McLean, Laboratory for Physical Sciences

DOI:

https://doi.org/10.1609/aaai.v34i04.5994

Abstract

Prior work inspired by compression algorithms has described how the Burrows Wheeler Transform can be used to create a distance measure for bioinformatics problems. We describe issues with this approach that were not widely known, and introduce our new Burrows Wheeler Markov Distance (BWMD) as an alternative. The BWMD avoids the shortcomings of earlier efforts, and allows us to tackle problems in variable length DNA sequence clustering. BWMD is also more adaptable to other domains, which we demonstrate on malware classification tasks. Unlike other compression-based distance metrics known to us, BWMD works by embedding sequences into a fixed-length feature vector. This allows us to provide significantly improved clustering performance on larger malware corpora, a weakness of prior methods.
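The idea of embedding a sequence into a fixed-length vector by way of the Burrows Wheeler Transform can be illustrated with a small sketch. The abstract does not spell out the construction, so everything below is an assumption made for illustration and not the authors' implementation: the function names `bwt`, `markov_embedding`, and `distance` are hypothetical, the transform is the naive sorted-rotations version without a sentinel symbol, the embedding is a row-normalized count of adjacent-symbol transitions in the transformed output, and the distance is plain Euclidean distance between the two vectors.

```python
import math


def bwt(s: bytes) -> bytes:
    """Naive Burrows Wheeler Transform: last column of the sorted rotations.

    Illustrative only; quadratic memory and no sentinel symbol.
    """
    n = len(s)
    if n == 0:
        return b""
    # Sort rotation start offsets by the rotation they produce.
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    # The last symbol of the rotation starting at i is s[i - 1].
    return bytes(s[(i - 1) % n] for i in order)


def markov_embedding(seq: bytes, alphabet: bytes) -> list:
    """Assumed embedding: first-order transition frequencies over bwt(seq).

    The result always has len(alphabet) ** 2 entries, whatever len(seq) is.
    """
    t = bwt(seq)
    index = {sym: k for k, sym in enumerate(alphabet)}
    m = len(alphabet)
    counts = [[0.0] * m for _ in range(m)]
    # Count adjacent symbol pairs in the transformed sequence.
    for a, b in zip(t, t[1:]):
        if a in index and b in index:
            counts[index[a]][index[b]] += 1.0
    vec = []
    for row in counts:
        total = sum(row)
        # Normalize each row to transition frequencies; empty rows stay zero.
        vec.extend(c / total if total else 0.0 for c in row)
    return vec


def distance(x: bytes, y: bytes, alphabet: bytes = b"ACGT") -> float:
    """Euclidean distance between the two fixed-length embeddings."""
    return math.dist(markov_embedding(x, alphabet), markov_embedding(y, alphabet))


if __name__ == "__main__":
    # Sequences of different lengths map to vectors of the same size.
    print(len(markov_embedding(b"ACGTTGCA", b"ACGT")))
    print(len(markov_embedding(b"ACGTACGTACGTAAAC", b"ACGT")))
    print(distance(b"ACGTTGCA", b"ACGTACGTACGTAAAC"))
```

Because every sequence maps to a vector of the same size, inputs of different lengths can be compared directly and handed to standard vector-based clustering tools, which is the property the abstract highlights. A practical version would replace the naive transform with a suffix-array construction.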

Published

2020-04-03

How to Cite

Raff, E., Nicholas, C., & McLean, M. (2020). A New Burrows Wheeler Transform Markov Distance. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04), 5444-5453. https://doi.org/10.1609/aaai.v34i04.5994

Section

AAAI Technical Track: Machine Learning