Discovery of Dependencies and Models
Explora supports Discovery in Databases through large-scale search for interesting instances of statistical patterns. Because of the variety of patterns and the immense number of combinatorial possibilities in studying relations between variables in subsets of data, at least two implementation problems arise. First, the user must be protected from being overwhelmed by a deluge of findings. This can be achieved by building some basic organizing principles into the search. One principle is to organize the search hierarchically and to study the strongest hypotheses first (in most cases, the most general ones). Weaker hypotheses are then eliminated from further search. But even in moderately sized data, this approach alone usually does not prevent large sets of findings. Therefore, in a second evaluation phase, a refinement strategy selects the most interesting verified statements and also treats the overlapping problem (caused by correlations between independent variables). Further, the user can focus a discovery task by specifying the analysis problem in more detail. Second, a discovery system must manage the efficiency problem. Each hypothesis evaluated while processing the large search space refers to subsets of cases stored in a database. These subsets correspond to combinations of variables and their (taxonomical) values. In principle, each subset requires random access to many cases, which takes much computation time. We describe solutions implemented in the discovery system Explora to deal with these two problems. In an appendix, results of a discovery session in Explora are presented, and the need to introduce more statistical strategies at a "higher" discovery level is discussed. At this level, instances of patterns verified during basic search are selected, refined, and combined to achieve a higher quality of presented findings, including greater potential for interpretation.
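
The hierarchical principle described above (examine the most general hypotheses first, then drop weaker ones from further search) can be illustrated with a small Python sketch. This is not Explora's implementation. The toy data, the attribute names, the function names, and the thresholds `min_size` and `min_gain` are all assumptions made for illustration, and a plain difference of shares stands in for whatever statistical test a real system would apply.

```python
from itertools import combinations


def target_share(rows, target):
    """Fraction of rows in which the (boolean) target holds."""
    return sum(1 for r in rows if r[target]) / len(rows)


def covers(conditions, row):
    """True if the row satisfies every (attribute, value) condition."""
    return all(row[attr] == value for attr, value in conditions)


def general_to_specific_search(rows, attributes, target,
                               max_depth=2, min_size=2, min_gain=0.2):
    """Search subgroups from general (few conditions) to specific (many).

    A subgroup counts as verified when its target share exceeds the
    overall share by at least min_gain (hypothetical criterion).
    Specializations of an already verified subgroup are skipped.
    """
    base = target_share(rows, target)
    # Every single (attribute, value) condition occurring in the data.
    selectors = sorted({(a, r[a]) for r in rows for a in attributes})
    verified = []
    for depth in range(1, max_depth + 1):
        for combo in combinations(selectors, depth):
            # A subgroup may constrain each attribute at most once.
            if len({a for a, _ in combo}) < depth:
                continue
            conditions = frozenset(combo)
            # Prune: a more general verified statement subsumes this one.
            if any(found <= conditions for found, _ in verified):
                continue
            subset = [r for r in rows if covers(conditions, r)]
            if len(subset) < min_size:
                continue
            share = target_share(subset, target)
            if share - base >= min_gain:
                verified.append((conditions, share))
    return verified


# Hypothetical toy data: does a customer buy, by region and age group?
rows = [
    {"region": "north", "age": "young", "buyer": True},
    {"region": "north", "age": "old", "buyer": True},
    {"region": "north", "age": "young", "buyer": True},
    {"region": "south", "age": "young", "buyer": False},
    {"region": "south", "age": "old", "buyer": False},
    {"region": "south", "age": "old", "buyer": True},
    {"region": "south", "age": "young", "buyer": False},
    {"region": "north", "age": "old", "buyer": False},
]

findings = general_to_specific_search(rows, ["region", "age"], "buyer")
for conditions, share in findings:
    print(sorted(conditions), share)
```

On this toy data the search reports only the general statement `region = north` (share 0.75 against an overall share of 0.5). The more specific subgroup `region = north, age = young` would also pass the threshold, but it is never evaluated because it specializes a statement that is already verified. That is the kind of redundancy the hierarchical organization is meant to suppress; the refinement phase and the treatment of overlapping findings mentioned in the text are not modeled here.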