Track:
Statistical and Probabilistic Models
Abstract:
Knowledge discovery in databases [1] often involves the identification of statistical relationships in data as a first step towards the discovery of knowledge. The main problem with such identification is that a large data set usually contains very many statistical relationships that exist but are quite useless in the context of knowledge discovery. For instance, some of those relationships will be well known and hence not meaningful in the context of knowledge discovery. Those relationships can be screened for and removed by incorporating domain knowledge in the knowledge discovery system. Other relationships will reflect purely statistical relationships, i.e., relationships that occur by chance because of random effects in the underlying processes used to generate and collect the data. Such chance relationships are undesirable as they may lead to the discovery of spurious knowledge. Those relationships can be screened for and removed by using methods of statistical validation. The identification of relationships that occur by chance has always been a concern in statistical analysis, specifically in the arena of hypothesis testing ([3], Chapter). The approach used to eliminate such relationships is based on determining the extent to which an observed relationship was produced by chance. One uses that extent to determine whether to accept or to reject that relationship. Clearly, a similar strategy can be used in knowledge discovery to remove chance relationships, and it is often recommended in the literature. This paper suggests that such a strategy be used carefully, since the circumstances under which hypothesis testing is done in the context of knowledge discovery may be different from conventional applications of such testing. Hence, it may not be advisable to follow the dictates of conventional wisdom.
In particular, we study the use of statistical significance in the context of the application of an approach to knowledge discovery called Attribute Focusing to software engineering.
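As an illustration of how chance relationships arise in a large data set, the following minimal sketch generates purely random binary attributes and tests every pair for association. It is not the method studied in this paper: the chi-square independence test, the 0.05 threshold, the data sizes, and the function names `chi_square_p_value` and `count_chance_relationships` are all assumptions chosen for the example.

```python
import itertools
import math
import random


def chi_square_p_value(xs, ys):
    """p-value of a chi-square independence test on two binary columns (1 d.o.f.)."""
    n = len(xs)
    # Counts for the 2x2 contingency table.
    n11 = sum(1 for x, y in zip(xs, ys) if x and y)
    row1 = sum(xs)
    col1 = sum(ys)
    n10 = row1 - n11
    n01 = col1 - n11
    n00 = n - row1 - col1 + n11
    denom = row1 * (n - row1) * col1 * (n - col1)
    if denom == 0:
        return 1.0  # a constant column carries no evidence of association
    stat = n * (n11 * n00 - n10 * n01) ** 2 / denom
    # For 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def count_chance_relationships(n_rows=500, n_attrs=40, alpha=0.05, seed=0):
    """Count attribute pairs flagged as 'significant' in purely random data."""
    rng = random.Random(seed)
    # Every attribute is an independent coin flip, so no real relationship exists.
    columns = [[rng.randint(0, 1) for _ in range(n_rows)] for _ in range(n_attrs)]
    pairs = list(itertools.combinations(range(n_attrs), 2))
    flagged = [
        (i, j) for i, j in pairs
        if chi_square_p_value(columns[i], columns[j]) < alpha
    ]
    return len(flagged), len(pairs)


if __name__ == "__main__":
    flagged, total = count_chance_relationships()
    print(f"{flagged} of {total} attribute pairs pass the significance threshold")
```

Because the attributes are generated independently, every flagged pair is a chance relationship. With a threshold of 0.05, roughly 5% of the tested pairs can be expected to pass, so the number of spurious relationships grows with the number of relationships examined.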