On the Role of Statistical Significance in Exploratory Data Analysis

Authors

Inderpal Bhandari

Shriram Biyani

Track:

Statistical and Probabilistic Models

Abstract:

Knowledge discovery in databases [1] often involves the identification of statistical relationships in data as a first step towards the discovery of knowledge. The main problem with such identification is that in a large data set there usually are very many statistical relationships that exist but which are quite useless in the context of knowledge discovery. For instance, some of those relationships will be well known, and hence, not be meaningful in the context of knowledge discovery. Those relationships can be screened for and removed by incorporating domain knowledge in the knowledge discovery system. Other relationships will reflect purely statistical relationships, i.e., relationships that occur by chance because of random effects in the underlying processes used to generate and collect the data. Such chance relationships are undesirable as they may lead to the discovery of spurious knowledge. Those relationships can be screened for and removed by using methods of statistical validation. The identification of relationships that occur by chance has always been a concern in statistical analysis, specifically, in the arena of hypothesis testing ([3], Chapter ...). The approach used to eliminate such relationships is based on determining the extent to which an observed relationship was produced by chance. One uses that extent to determine whether to accept or to reject that relationship. Clearly, a similar strategy can be used in knowledge discovery to remove chance relationships, and it often is recommended in the literature. This paper suggests that such a strategy be used carefully since the circumstances under which hypothesis testing is done in the context of knowledge discovery may be different from conventional applications of such testing. Hence, it may not be advisable to follow the dictates of conventional wisdom.
In particular, we study the use of statistical significance in the context of the application of an approach to knowledge discovery called Attribute Focusing to software engineering.
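
The concern about chance relationships in large data sets can be illustrated with a small simulation. The sketch below is not the paper's Attribute Focusing method; it is a minimal illustration under assumed settings (40 independent binary attributes, 500 records, a Pearson chi-square test at the 5% level, and the hypothetical helper names `chi2_2x2` and `count_chance_relationships`). Because every attribute is generated independently, any pair flagged as significant is a chance relationship by construction.

```python
import random
from itertools import combinations

# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CHI2_CRIT_DF1_ALPHA_05 = 3.841


def chi2_2x2(xs, ys):
    """Pearson chi-square statistic for two binary (0/1) attributes."""
    n = len(xs)
    counts = [[0, 0], [0, 0]]
    for x, y in zip(xs, ys):
        counts[x][y] += 1
    row = [sum(counts[0]), sum(counts[1])]
    col = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if expected == 0:
                # Degenerate table (a constant attribute): no evidence either way.
                return 0.0
            stat += (counts[i][j] - expected) ** 2 / expected
    return stat


def count_chance_relationships(n_records=500, n_attributes=40, seed=0):
    """Test every pair of independent random attributes; return (flagged, total)."""
    rng = random.Random(seed)
    # Each attribute is an independent fair coin flip per record (assumed setup).
    data = [[rng.randint(0, 1) for _ in range(n_records)]
            for _ in range(n_attributes)]
    pairs = list(combinations(range(n_attributes), 2))
    flagged = sum(
        1 for a, b in pairs
        if chi2_2x2(data[a], data[b]) > CHI2_CRIT_DF1_ALPHA_05
    )
    return flagged, len(pairs)


if __name__ == "__main__":
    flagged, total = count_chance_relationships()
    print(f"{flagged} of {total} independent pairs pass the 5% significance test")
```

With 780 attribute pairs and a 5% threshold, roughly 5% of the pairs would be expected to pass the test even though no real relationship exists. This is the sense in which a conventional significance threshold, applied to very many candidate relationships at once, behaves differently than it does for a single pre-specified hypothesis.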
