Regret bound for Narendra-Shapiro bandit algorithms

Narendra-Shapiro (NS) algorithms are bandit-type algorithms that were introduced in the 1960s (with a view to applications in psychology and learning automata) and whose convergence has been studied intensively in the stochastic-algorithms literature. In this paper, we address the following question: are NS bandit algorithms competitive from a regret point of view? Our main result shows that competitive bounds can be obtained for such algorithms in their penalized version (introduced by Lamberton and Pagès). More precisely, up to an over-penalization modification, the pseudo-regret R_n of the penalized two-armed bandit algorithm is uniformly bounded by C n^{1/2}, where the constant C is made explicit in the paper. We also extend existing results on convergence and rates of convergence to the multi-armed case of the over-penalized bandit algorithm, including convergence, after a suitable renormalization, toward the invariant measure of a Piecewise Deterministic Markov Process (PDMP). Finally, we establish ergodic properties of this PDMP in the multi-armed case.
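For illustration only, the penalized two-armed NS update (in the form usually attributed to Lamberton and Pagès) can be sketched as a short simulation. Here x is the probability of pulling arm A; a success on the pulled arm reinforces it with step gamma_n, and a failure penalizes it with the smaller step gamma_n * rho_n. The function name `penalized_ns_two_armed`, the step sequences gamma_n = gamma/sqrt(n) and rho_n = rho/n^{1/4}, and the starting point x = 0.5 are hypothetical choices, not the tuned sequences of the paper, and the over-penalization modification analyzed there is not reproduced. Pseudo-regret is taken in its standard form R_n = n max(p_A, p_B) - E[sum of rewards]; the function returns the path-wise sum whose expectation is R_n.

```python
import math
import random


def penalized_ns_two_armed(p_a, p_b, n_steps, gamma=0.5, rho=1.0, seed=0):
    """Sketch of a penalized Narendra-Shapiro two-armed bandit.

    x = probability of pulling arm A. The step sizes gamma_n = gamma/sqrt(n)
    and penalty weights rho_n = rho/n**0.25 are illustrative assumptions.
    Returns (final x, path-wise regret whose expectation is R_n).
    """
    # gamma_n <= 1 and gamma_n * rho_n <= 1 keep x inside [0, 1]
    if not (0.0 < gamma <= 1.0 and 0.0 <= gamma * rho <= 1.0):
        raise ValueError("need 0 < gamma <= 1 and 0 <= gamma * rho <= 1")
    rng = random.Random(seed)
    x = 0.5  # hypothetical start: no preference between the arms
    best, worst = max(p_a, p_b), min(p_a, p_b)
    regret = 0.0
    for n in range(1, n_steps + 1):
        g = gamma / math.sqrt(n)          # reward step gamma_n
        pen = g * rho / n ** 0.25         # penalty step gamma_n * rho_n
        # conditional expected loss of this round vs. always playing the best arm
        prob_best = x if p_a >= p_b else 1.0 - x
        regret += (best - worst) * (1.0 - prob_best)
        if rng.random() < x:              # pull arm A
            if rng.random() < p_a:
                x += g * (1.0 - x)        # success: reinforce A
            else:
                x -= pen * x              # failure: penalize A
        else:                             # pull arm B
            if rng.random() < p_b:
                x -= g * x                # success: reinforce B
            else:
                x += pen * (1.0 - x)      # failure: penalize B
    return x, regret
```

Averaging the returned regret over many seeds, e.g. `sum(penalized_ns_two_armed(0.7, 0.4, 2000, seed=s)[1] for s in range(200)) / 200`, gives a Monte Carlo estimate of R_n that can be compared with n^{1/2} growth as n varies.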


Year of publication:	2015-02
Authors:	Gadat, Sébastien ; Panloup, F. ; Saadane, Sofiane
Institutions:	Toulouse School of Economics (TSE)
Subject:	Regret | Stochastic Bandit Algorithms | Piecewise Deterministic Markov Processes


Extent:	application/pdf
Series:	TSE Working Papers.
Type of publication:	Book / Working Paper
Notes:	The text is part of the series TSE Working Papers, number 15-556
Source:	RePEc - Research Papers in Economics

Persistent link: https://www.econbiz.de/10011189153