Designing a hypothesis test to determine the best of two machine learning algorithms with only a small data set available is not a simple task. Many popular tests suffer from low power (5x2 cv ), or high Type I error (Weka’s 10x10 cross validation ). Furthermore, many tests show a low level of replicability, so that tests performed by different scientists with the same pair of algorithms, the same data sets and the same hypothesis test still may present different results. We show that 5x2 cv, resampling and 10 fold cv suffer from low replicability. The main complication is due to the need to use the data multiple times. As a consequence, independence assumptions for most hypothesis tests are violated. In this paper, we pose the case that reuse of the same data causes the effective degrees of freedom to be much lower than theoretically expected. We show how to calibrate the effective degrees of freedom empirically for various tests. Some tests are not calibratable, indicating another flaw in the design. However the ones that are calibratable all show very similar behavior. Moreover, the Type I error of those tests is on the mark for a wide range of circumstances, while they show a power and replicability that is a considerably higher than currently popular hypothesis tests.