Restless Bandits with Bursty Rewards
Time: September 10, 2007, 1pm - 2pm
Place: D344, LSRC
Speaker:Kamesh Munagala
We consider a variant of the classic multi-armed bandit problem (MAB)
from stochastic control theory. We call this variant "Feedback MAB".
Here, the reward obtained by playing each of n independent arms varies
according to an underlying "on/off" Markov process with known
parameters. The evolution of the Markov chain happens irrespective of
whether the arm is played, and furthermore, the exact state of the
Markov chain is only revealed to the player when the arm is played and
the reward observed. At most one arm (or in general, M arms) can be
played any time step. The goal is to design a policy for playing the
arms in order to maximize the infinite horizon time average expected
reward.
This problem is an instance of a Partially Observable Markov Decision
Process (POMDP), and a special case of the notoriously intractable
"restless bandit" problem in control theory. Unlike the classic
Stochastic MAB problem, the Feedback MAB problem does not admit to
greedy index-based optimal policies. The state of the system at any
time step encodes the beliefs about the states of different arms, and
the policy decisions change these beliefs – this aspect complicates
the design and analysis of simple algorithms.
We design a constant factor approximation to the Feedback MAB problem
by solving and rounding a natural LP relaxation to this problem. The
resulting policy is simple to implement, intuitive, and crucially
exploits interesting structure in the LP solution. As far as we are
aware, this is the first (multiplicative) approximation algorithm for
a POMDP problem.
To appear in FOCS 2007. Joint work with Sudipto Guha, UPenn.