Abstract:
Classifiers deployed in the field can be used and evaluated in ways that were not anticipated when the model was trained. The ultimate evaluation metric may not have been known to the modeler at training time, additional performance criteria may have been added, the evaluation metric may have changed over time, or the real-world evaluation procedure may have been impossible to simulate. Such unforeseen ways of measuring model utility can degrade performance. Our objective is to provide experimental support for modelers who face potential "cross-metric" performance deterioration. First, to identify model-selection metrics that lead to stronger cross-metric performance, we characterize the expected loss when the selection metric is held fixed and the evaluation metric is varied. Second, we show that the number of data points available to the selection metric has a substantial effect on the optimal choice of selection metric. We hypothesize that whether a classifier is calibrated to output probabilities influences both of these effects. In examining the role of calibration, our experiments show that cross-entropy is the highest-performing selection metric when little data is available for selection. With these results, modelers may be better positioned to choose selection metrics that remain robust when it is uncertain which evaluation metric will be applied.
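
To make the "cross-metric" setup concrete, the following sketch selects among candidate classifiers using one metric (cross-entropy, i.e. log-loss, on a small selection set) and then evaluates the chosen model with different metrics. This is an illustrative example assuming scikit-learn and a synthetic dataset, not the authors' experimental code; the candidate models, split sizes, and metrics are placeholders.

```python
# Minimal sketch of the cross-metric setup: select a model with one metric
# (cross-entropy / log-loss) and evaluate it with others (F1, accuracy).
# Illustrative only; dataset, models, and split sizes are assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, f1_score, accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
# A small selection set mimics the low-data selection regime discussed above.
X_select, X_eval, y_select, y_eval = train_test_split(X_rest, y_rest, test_size=0.8, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Model selection: pick the candidate with the lowest cross-entropy on the selection set.
selection_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    selection_scores[name] = log_loss(y_select, model.predict_proba(X_select))
best_name = min(selection_scores, key=selection_scores.get)
best = candidates[best_name]

# Evaluation: measure the chosen model with metrics that were not used for selection.
print(f"selected by log-loss: {best_name}")
print(f"eval F1:       {f1_score(y_eval, best.predict(X_eval)):.3f}")
print(f"eval accuracy: {accuracy_score(y_eval, best.predict(X_eval)):.3f}")
```

Varying the evaluation metrics in the last step while holding the selection metric fixed corresponds to the first experiment described in the abstract; shrinking the selection set corresponds to the second.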