Infinite-horizon policy-gradient estimation

Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter β ∈ [0,1) (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter β is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward. ©2001 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.
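
As a rough illustration of the kind of estimator the abstract describes, the Python sketch below keeps one eligibility trace z and one running gradient estimate, each the size of the parameter vector, which is where the "twice the number of policy parameters" storage figure comes from. The tabular softmax policy, the `step` simulator interface, the names `gpomdp` and `toy_step`, and the one-observation toy problem are all assumptions made for this example and are not taken from the record.

```python
import math
import random


def softmax(prefs):
    """Turn a list of action preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def gpomdp(step, theta, beta, T, seed=0):
    """Running estimate of the average-reward gradient from one simulated path.

    Assumptions for this sketch: theta[y][u] is the preference for action u
    after observation y (tabular softmax policy), and step(action, rng)
    returns (reward, next_observation). Any hidden state lives inside `step`;
    the estimator never sees it.
    """
    rng = random.Random(seed)
    n_obs, n_act = len(theta), len(theta[0])
    z = [[0.0] * n_act for _ in range(n_obs)]      # eligibility trace
    delta = [[0.0] * n_act for _ in range(n_obs)]  # gradient estimate
    y = 0  # assumed initial observation
    for t in range(T):
        probs = softmax(theta[y])
        u = rng.choices(range(n_act), weights=probs)[0]
        reward, y_next = step(u, rng)
        for i in range(n_obs):
            for j in range(n_act):
                # Derivative of log mu(u | theta, y) with respect to theta[i][j].
                if i == y:
                    score = (1.0 if j == u else 0.0) - probs[j]
                else:
                    score = 0.0
                # Discounted trace, then running average of reward * trace.
                z[i][j] = beta * z[i][j] + score
                delta[i][j] += (reward * z[i][j] - delta[i][j]) / (t + 1)
        y = y_next
    return delta


def toy_step(action, rng):
    """Hypothetical one-observation problem: action 0 pays 1, action 1 pays 0."""
    return (1.0 if action == 0 else 0.0), 0


estimate = gpomdp(toy_step, [[0.0, 0.0]], beta=0.0, T=20000, seed=1)
```

In this toy problem the average reward equals the probability of action 0, so at equal preferences the exact gradient is (0.25, -0.25) and the estimate should land near it. Raising β toward 1 lengthens the trace's memory, which in problems with delayed reward reduces bias at the cost of higher variance.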


Year of publication:	2001
Authors:	Baxter, J. ; Bartlett, P. L.
Publisher:	AI Access Foundation
Subject:	APPLIED MATHEMATICS | ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING | Algorithms | Computational methods | Markov processes | Multi agent systems | Problem solving | Random processes | Gradient-based approaches | Policy parameters | Value-function methods | Learning systems | OAVJ


Type of publication:	Article
Notes:	DOI: 10.1613/jair.806. Baxter, J. & Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319-350. Faculty of Science and Technology; Mathematical Sciences
Source:	BASE

Persistent link: https://www.econbiz.de/10009438377