AAAI Publications, The Twenty-Seventh International FLAIRS Conference

Optimizing Wrapper-Based Feature Selection for Use on Bioinformatics Data
Randall Wald, Taghi M. Khoshgoftaar, Amri Napolitano

Last modified: 2014-05-03


High dimensionality (having a large number of independent attributes) is a major problem for bioinformatics datasets such as gene microarray datasets. Feature selection algorithms are necessary to remove irrelevant features (those that are not useful) and redundant features (those that duplicate information found elsewhere). One approach to this problem is wrapper-based subset evaluation, which builds classification models on different feature subsets to discover which subset performs best. Although the computational cost of this technique means it is rarely used in bioinformatics, its ability to find the features that give the best model makes it important in this domain. However, when using wrapper-based feature selection, it is not obvious whether the learner used within the wrapper should match the learner used to build the final classification model. Furthermore, the answer may depend on other properties of the dataset, such as difficulty of learning (general performance without feature selection) and dataset balance (the ratio of minority to majority instances). To study this, we use nine datasets with varying levels of difficulty and balance. We find that across all datasets, the best strategy is to use one learner (Naïve Bayes) inside the wrapper regardless of which learner will be used outside it. However, when the results are broken down by difficulty and balance level, the more balanced and less difficult datasets work best when the learners inside and outside the wrapper match. Thus, the answer to this question depends on the properties of the dataset.
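
To make the idea of wrapper-based subset evaluation concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes a greedy forward search, leave-one-out accuracy as the subset score, and a nearest-centroid classifier as the learner inside the wrapper; none of these choices comes from the abstract, and every function name and the toy data are hypothetical.

```python
# Illustrative sketch only: search strategy, inner learner, scoring
# method, and toy data are assumptions, not details from the paper.

def nearest_centroid_predict(train_X, train_y, x, features):
    """Predict the class whose centroid (on the chosen features) is closest to x."""
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += row[f]
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, float("inf")
    for label in sorted(sums):
        dist = sum((x[f] - sums[label][j] / counts[label]) ** 2
                   for j, f in enumerate(features))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of the inner learner on a feature subset."""
    correct = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i], features) == y[i]:
            correct += 1
    return correct / len(X)

def wrapper_forward_selection(X, y, max_features):
    """Greedily add the feature that most improves the inner learner's score."""
    selected, best_score = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        scored = [(loo_accuracy(X, y, selected + [f]), f) for f in remaining]
        # Ties in score are broken in favour of the lowest feature index.
        score, feature = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best_score:
            break  # no candidate improves the model; stop searching
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

# Hypothetical toy data: feature 0 separates the classes, 1 and 2 are noise.
X = [[0.1, 5.0, 1.0], [0.2, 1.0, 9.0], [0.0, 9.0, 4.0], [0.3, 3.0, 7.0],
     [1.0, 4.0, 2.0], [0.9, 8.0, 8.0], [1.1, 2.0, 5.0], [0.8, 6.0, 3.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

selected, score = wrapper_forward_selection(X, y, max_features=2)
print(selected, score)
```

In this sketch the learner inside the wrapper is fixed inside `loo_accuracy`. The question studied in the paper corresponds to whether that inner learner should be the same one later trained on the selected features to build the final model.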


Wrapper-Based Feature Selection; Bioinformatics Data; High dimensionality
