Pathologies of Temporal Difference Methods in Approximate Dynamic Programming
Approximate policy iteration methods based on temporal differences are popular in practice and have been tested extensively, dating to the early nineties, but the associated convergence behavior is complex and not well understood at present. An important question is whether the policy iteration process is seriously hampered by oscillations between poor policies, roughly similar to the attraction of gradient methods to poor local minima. There has been little apparent concern in the approximate DP/reinforcement learning literature about this possibility, even though it has been documented with several simple examples. Recent computational experimentation with the game of tetris, a popular testbed for approximate DP algorithms over a 15-year period, has brought the issue into sharp focus. In particular, using a standard set of 22 features and temporal difference methods, an average score of a few thousand was achieved. Using the same features and a random search method, an overwhelmingly better average score was achieved (600,000-900,000). The paper explains the likely mechanism of this phenomenon and derives conditions under which it will not occur.
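The abstract refers to the approximate policy iteration loop whose policy sequence may oscillate rather than converge. As a rough illustration only (not taken from the paper), the sketch below alternates approximate policy evaluation and greedy improvement on a small randomly generated MDP, using least-squares projected evaluation in place of a simulation-based temporal difference method, and checks whether the policy sequence converges or revisits an earlier policy. The toy MDP, the features, and all names are illustrative assumptions.

```python
# Minimal sketch of approximate policy iteration with linear value-function
# approximation; whether the policy sequence converges or cycles depends on
# the (here randomly generated) MDP and features.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_features, gamma = 6, 2, 2, 0.9

# Random transition matrices P[a][s, s'] and rewards R[a][s] for a toy MDP.
P = [rng.dirichlet(np.ones(n_states), size=n_states) for _ in range(n_actions)]
R = [rng.uniform(0.0, 1.0, n_states) for _ in range(n_actions)]
Phi = rng.uniform(-1.0, 1.0, (n_states, n_features))  # features: fewer than states

def evaluate(policy):
    # Least-squares (projected) evaluation of a deterministic policy:
    # find w so that Phi @ w approximately satisfies V = R_pi + gamma * P_pi @ V.
    P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
    R_pi = np.array([R[policy[s]][s] for s in range(n_states)])
    A = Phi - gamma * P_pi @ Phi
    w, *_ = np.linalg.lstsq(A, R_pi, rcond=None)
    return Phi @ w

def greedy(V):
    # One-step lookahead improvement with respect to the approximate values.
    Q = np.array([[R[a][s] + gamma * P[a][s] @ V for a in range(n_actions)]
                  for s in range(n_states)])
    return tuple(int(a) for a in Q.argmax(axis=1))

policy = tuple(int(a) for a in rng.integers(n_actions, size=n_states))
history = [policy]
for k in range(1, 51):
    policy = greedy(evaluate(policy))
    if policy == history[-1]:
        print(f"converged after {k} iterations to policy {policy}")
        break
    if policy in history:
        print(f"oscillation: iteration {k} revisits an earlier policy {policy}")
        break
    history.append(policy)
else:
    print("no convergence or detected cycle within 50 iterations")
```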
Year of publication: 2010-12
Authors: Bertsekas, Dimitri P.
Publisher: Institute of Electrical and Electronics Engineers
freely available
Similar items by person
- Stochastic optimal control : the discrete time case. Bertsekas, Dimitri P. (1996)
- Linear network optimization : algorithms and codes. Bertsekas, Dimitri P. (1991)
- Stochastic optimal control : the discrete time case. Bertsekas, Dimitri P. (1978)