Substructure Discovery Using Minimum Description Length Principle and Background Knowledge

Surnjani Djoko

Discovering conceptually interesting and repetitive substructures in structural data improves the ability to interpret and compress the data. Substructures are evaluated by their ability to describe and compress the original data set, using the domain’s background knowledge and the minimum description length (MDL) of the data. Once discovered, a substructure concept is used to simplify the data by replacing each instance of the substructure with a pointer to the newly discovered concept. The discovered substructure concepts allow abstraction over detailed structure in the original data. Iterating the substructure discovery and replacement process constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying levels of interpretation that can be accessed based on the goals of the data analysis.
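The discover-and-replace loop described above can be illustrated with a toy sketch. It is not the method of this work: it operates on a flat token sequence rather than on graph-structured data, considers only adjacent pairs as candidate substructures (in the spirit of pair-replacement grammar compression), uses a crude symbol count as a stand-in for description length, and ignores background knowledge. All function names and the `S1`, `S2`, ... pointer names are hypothetical.

```python
def description_length(seq, rules):
    # Crude proxy (an assumption, not a true MDL encoding): symbols in the
    # data plus, per rule, one symbol for its name and one per body element.
    return len(seq) + sum(len(body) + 1 for body in rules.values())


def replace_pair(seq, pair, name):
    # Replace non-overlapping occurrences of `pair`, left to right,
    # with the pointer symbol `name`.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(name)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def discover_hierarchy(seq):
    # Repeatedly pick the pair whose replacement most reduces the proxy
    # description length; stop when no candidate reduces it.
    # Assumes the input contains no tokens named "S1", "S2", ...
    seq, rules = list(seq), {}
    while True:
        current = description_length(seq, rules)
        best = None
        for pair in sorted(set(zip(seq, seq[1:]))):
            name = f"S{len(rules) + 1}"
            candidate = replace_pair(seq, pair, name)
            trial_rules = {**rules, name: pair}
            dl = description_length(candidate, trial_rules)
            if dl < current and (best is None or dl < best[0]):
                best = (dl, candidate, trial_rules)
        if best is None:
            return seq, rules
        _, seq, rules = best


def expand(seq, rules):
    # Follow pointers recursively to recover the original data.
    out = []
    for sym in seq:
        if sym in rules:
            out.extend(expand(rules[sym], rules))
        else:
            out.append(sym)
    return out


if __name__ == "__main__":
    data = list("abcabcabcabc")
    compressed, rules = discover_hierarchy(data)
    print(compressed, rules)
    print(expand(compressed, rules) == data)
```

In this sketch, later rules may refer to earlier ones, which is what yields a hierarchy: expanding pointers one level at a time gives progressively more detailed views of the same data.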

