Evaluating classifiers with increased confidence can significantly impact the success of many machine learning applications. However, traditional machine learning evaluation measures provide no level of confidence in their results. In this paper, we motivate the need for confidence in classifier evaluation at a level suitable for medical studies. We draw a parallel between case-control medical studies and classification in machine learning. We propose the use of Tango's biostatistical test to compute consistent confidence intervals on the difference in classification errors on both classes. Our experiments compare Tango's confidence intervals to accuracy, recall, precision, and the F-measure. Our results show that Tango's test provides a statistically sound notion of confidence and is more consistent and reliable than these four measures.