Font Size:

Efficient Exploration for Constrained MDPs

Last modified: 2018-03-15

#### Abstract

Given a Markov Decision Process (MDP) defined by a simulator, a designated starting state $s_0$, and a downside risk constraint defined as the probability of reaching catastrophic states, our goal is to find a stationary deterministic policy $\pi$ that with probability $1-\delta$ achieves a value $V^\pi(s_0)$ that is within $\epsilon$ of the value of the optimal stationary deterministic $\nu$-feasible policy, $V^*(s_0)$, while economizing on the number of calls to the simulator. This paper presents the first {\bf PAC-Safe-RL} algorithm for this purpose. The algorithm extends PAC-RL algorithms for efficient exploration while providing guarantees that the downside constraint is satisfied. Experiments comparing our {\sc ConstrainedDDV} algorithm to baselines show substantial reductions in the number of simulator calls required to find a feasible policy.

#### Keywords

Reinforcement learning; Markov decision processes; Constrained MDP Planning; PAC-RL; PAC-Safe-RL

Full Text:
PDF