Abstract:
Inspired by recent results on polynomial-time reinforcement learning algorithms that accumulate near-optimal rewards, we study the related problem of quickly learning near-optimal policies. The new problem is closely related to the previous one, but differs in important ways. We provide simple algorithms for MDPs, zero-sum and common-payoff Stochastic Games, together with a uniform framework for proving their polynomial complexity. Unlike bounds for the previously studied problem, our bounds depend on the minimum of the mixing time and a new quantity, the spectral radius. Moreover, unlike previous results, ours apply uniformly to both the average-reward and discounted-reward cases.