Proceedings:
No. 8: AAAI-22 Technical Tracks 8
Volume
Issue:
Proceedings of the AAAI Conference on Artificial Intelligence, 36
Track:
AAAI Technical Track on Machine Learning III
Downloads:
Abstract:
We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed as R(MA)^2B. The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state and action of the corresponding MDP. Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy entitled Occupancy-Measured-Reward Index Policy for the finite-horizon R(MA)^2B. Our index policy is well-defined without the requirement of indexability condition and is provably asymptotically optimal as the number of arms tends to infinity. We then adopt a learning perspective where the system parameters are unknown, and propose R(MA)^2B-UCB, a generative model based reinforcement learning augmented algorithm that can fully exploit the structure of Occupancy-Measured-Reward Index Policy. Compared to existing algorithms, R(MA)^2B-UCB performs close to offline optimum, and achieves a sub-linear regret and a low computational complexity all at once. Experimental results show that R(MA)^2B-UCB outperforms existing algorithms in both regret and running time.
DOI:
10.1609/aaai.v36i8.20852
AAAI
Proceedings of the AAAI Conference on Artificial Intelligence, 36