Pathologies of Temporal Difference Methods in Approximate Dynamic Programming
Approximate policy iteration methods based on temporal differences are popular in practice and have been tested extensively, dating to the early nineties, but the associated convergence behavior is complex and not well understood at present. An important question is whether the policy iteration process is seriously hampered by oscillations between poor policies, roughly similar to the attraction of gradient methods to poor local minima. There has been little apparent concern in the approximate DP/reinforcement learning literature about this possibility, even though it has been documented with several simple examples. Recent computational experimentation with the game of tetris, a popular testbed for approximate DP algorithms over a 15-year period, has brought the issue into sharp focus. In particular, using a standard set of 22 features and temporal difference methods, an average score of a few thousand was achieved. Using the same features and a random search method, an overwhelmingly better average score was achieved (600,000-900,000). The paper explains the likely mechanism of this phenomenon and derives conditions under which it will not occur.
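The abstract refers to the approximate policy iteration loop whose policy sequence may oscillate rather than converge. As a rough illustration only (not taken from the paper), the sketch below alternates approximate policy evaluation and greedy improvement on a small randomly generated MDP, using least-squares projected evaluation in place of a simulation-based temporal difference method, and checks whether the policy sequence converges or revisits an earlier policy. The toy MDP, the features, and all names are illustrative assumptions.

```python
# Minimal sketch of approximate policy iteration with linear value-function
# approximation; whether the policy sequence converges or cycles depends on
# the (here randomly generated) MDP and features.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_features, gamma = 6, 2, 2, 0.9

# Random transition matrices P[a][s, s'] and rewards R[a][s] for a toy MDP.
P = [rng.dirichlet(np.ones(n_states), size=n_states) for _ in range(n_actions)]
R = [rng.uniform(0.0, 1.0, n_states) for _ in range(n_actions)]
Phi = rng.uniform(-1.0, 1.0, (n_states, n_features))  # features: fewer than states

def evaluate(policy):
    # Least-squares (projected) evaluation of a deterministic policy:
    # find w so that Phi @ w approximately satisfies V = R_pi + gamma * P_pi @ V.
    P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
    R_pi = np.array([R[policy[s]][s] for s in range(n_states)])
    A = Phi - gamma * P_pi @ Phi
    w, *_ = np.linalg.lstsq(A, R_pi, rcond=None)
    return Phi @ w

def greedy(V):
    # One-step lookahead improvement with respect to the approximate values.
    Q = np.array([[R[a][s] + gamma * P[a][s] @ V for a in range(n_actions)]
                  for s in range(n_states)])
    return tuple(int(a) for a in Q.argmax(axis=1))

policy = tuple(int(a) for a in rng.integers(n_actions, size=n_states))
history = [policy]
for k in range(1, 51):
    policy = greedy(evaluate(policy))
    if policy == history[-1]:
        print(f"converged after {k} iterations to policy {policy}")
        break
    if policy in history:
        print(f"oscillation: iteration {k} revisits an earlier policy {policy}")
        break
    history.append(policy)
else:
    print("no convergence or detected cycle within 50 iterations")
```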
Year of publication: 2010-12
Authors: Bertsekas, Dimitri P.
Publisher: Institute of Electrical and Electronics Engineers
freely available
Similar items by person
- Stochastic optimal control : the discrete time case. Bertsekas, Dimitri P. (1996)
- Linear network optimization : algorithms and codes. Bertsekas, Dimitri P. (1991)
- Stochastic optimal control : the discrete time case. Bertsekas, Dimitri P. (1978)