Dynamics of Temporal Difference Learning

Andreas Wendemuth

In the behavioural sciences, one considers the problem of a sequence of stimuli followed by a sequence of rewards r(t). The subject is to learn the full sequence of rewards from the stimuli, where the prediction is modelled by the Sutton-Barto rule. Over a sequence of n trials, this prediction rule is learned iteratively by temporal difference learning. We present a closed formula for the prediction of rewards at trial time t within trial n. From that formula, we show directly that as n tends to infinity, the predictions converge to the true rewards. In this approach, a new property of correlation-type Toeplitz matrices is proven. We also give learning rates which optimally speed up the learning process.
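The iterative trial-by-trial learning described above can be sketched with a minimal temporal-difference update. This is an illustrative TD(0)-style example, not the paper's exact Sutton-Barto formulation: the reward sequence, trial count, and learning rate below are hypothetical, and the predictions V(t) here converge to the total future reward from each trial time t.

```python
import numpy as np

# Illustrative sketch (hypothetical setup): over repeated trials, the
# prediction V[t] at trial time t is nudged toward the reward r(t) plus
# the next prediction V[t+1] -- a temporal-difference update.

T = 5                                            # time steps per trial
rewards = np.array([0.0, 0.0, 1.0, 0.0, 0.5])    # fixed reward sequence r(t)
V = np.zeros(T + 1)                              # predictions; V[T] = 0 (terminal)
alpha = 0.1                                      # learning rate

for trial in range(5000):                        # n trials
    for t in range(T):
        # TD error: discrepancy between prediction and reward-backed target
        td_error = rewards[t] + V[t + 1] - V[t]
        V[t] += alpha * td_error

# As the number of trials grows, V[t] approaches the cumulative
# future reward from time t onward.
print(np.round(V[:T], 3))  # → [1.5 1.5 1.5 0.5 0.5]
```

With a fixed learning rate the error shrinks geometrically per trial, which mirrors the paper's theme that the choice of learning rate governs how fast the predictions approach the true rewards.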

Subjects: 12.1 Reinforcement Learning; 15.9 Theorem Proving

Submitted: Oct 13, 2006

This page is copyrighted by AAAI. All rights reserved.