Abstract:
UCT, a state-of-the-art algorithm for Monte Carlo tree search (MCTS), is based on UCB, a policy for the multi-armed bandit problem (MAB) that minimizes the cumulative regret. However, search differs from MAB in that, in MCTS, usually only the final "arm pull" collects a reward, rather than every "arm pull". Therefore, it makes more sense to minimize the simple regret rather than the cumulative regret. We introduce policies for multi-armed bandits with lower simple regret than UCB and develop a two-stage scheme (SR+CR) for MCTS that empirically outperforms UCT. We also propose a sampling scheme based on the value of information (VOI), achieving an algorithm that empirically outperforms the other proposed algorithms.
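As a rough illustration of the two-stage SR+CR idea mentioned above, the sketch below pairs a simple-regret-oriented policy at the root (here a 0.5-greedy rule, one possible choice) with standard UCB1 selection at deeper nodes. The Node class, function names, and the specific 0.5-greedy rule are illustrative assumptions for this sketch, not the paper's exact formulation.

import math
import random

class Node:
    """Minimal MCTS tree node (hypothetical interface, for illustration only)."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

    def mean_reward(self):
        return self.total_reward / self.visits if self.visits else 0.0

def ucb1_child(node, c=math.sqrt(2)):
    """Cumulative-regret (UCB1) selection, as in plain UCT: used below the root."""
    def score(ch):
        if ch.visits == 0:
            return float("inf")
        return ch.mean_reward() + c * math.sqrt(math.log(node.visits) / ch.visits)
    return max(node.children, key=score)

def simple_regret_child(node, eps=0.5):
    """Simple-regret-oriented selection (here: 0.5-greedy), used at the root,
    where only the finally recommended move collects a reward."""
    if random.random() < eps or all(ch.visits == 0 for ch in node.children):
        return random.choice(node.children)
    return max(node.children, key=lambda ch: ch.mean_reward())

def sr_cr_select(root):
    """One selection descent of the two-stage SR+CR scheme:
    simple-regret policy at the root, UCB1 everywhere else."""
    node, is_root = root, True
    while node.children:
        node = simple_regret_child(node) if is_root else ucb1_child(node)
        is_root = False
    return node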
DOI:
10.1609/socs.v3i1.18221