Play trace dissimilarity metrics compare two plays of a game and quantify how different they are from each other. But how can we evaluate these metrics? Are some more accurate than others for a particular game, or in general? If so, why? Is the appropriate metric for a given game determined by certain characteristics of the game's design? This work provides an experimental methodology for validating play trace dissimilarity metrics against game designers' perception of play trace difference. We apply this methodology to a game-independent metric called Gamalyzer and compare it against three baselines that are representative of techniques commonly used in game analytics. We find that Gamalyzer, with an appropriate input encoding, is more accurate than the baseline metrics for the specific game under consideration, although simpler metrics based on event counting perform nearly as well for this game.