We present ongoing work on best arm identification in multi-armed bandit problems with bounded reward distributions. Although boundedness is a standard assumption in this setting, state-of-the-art methods (such as the lil-UCB algorithm) rely on sub-Gaussian concentration bounds for the mean rewards.
However, one can exploit the boundedness of the rewards to derive stronger concentration inequalities for the empirical means, expressed in terms of the Kullback-Leibler divergence between certain Bernoulli distributions. Although methods based on such bounds exist in the literature, they exhibit sub-optimal dependence on other problem parameters (such as the number of arms). Our goal is to combine ideas from the lil-UCB algorithm with the KL concentration bounds to obtain methods with improved performance.
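To illustrate the kind of KL-based confidence bound referred to above: for rewards in [0, 1], a Chernoff-style inequality yields an upper confidence bound defined as the largest mean q whose Bernoulli KL divergence from the empirical mean stays below a threshold. The sketch below computes such a bound by bisection; the function names and the choice of threshold are illustrative, not taken from the talk.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # clamp away from 0 and 1 to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_upper_confidence(p_hat, n, level):
    """Largest q >= p_hat with n * kl(p_hat, q) <= level, via bisection.

    p_hat: empirical mean of n samples in [0, 1]
    level: confidence threshold, e.g. log(1/delta) for risk delta
    """
    lo, hi = p_hat, 1.0
    for _ in range(50):  # 50 bisection steps give ~1e-15 precision
        mid = (lo + hi) / 2
        if n * bernoulli_kl(p_hat, mid) <= level:
            lo = mid  # mid is still inside the confidence region
        else:
            hi = mid
    return lo

# Example: the bound tightens toward the empirical mean as n grows.
b_small = kl_upper_confidence(0.5, 100, math.log(100))
b_large = kl_upper_confidence(0.5, 1000, math.log(100))
```

For means near 0 or 1, this KL bound is strictly tighter than the sub-Gaussian (Hoeffding-style) bound with variance proxy 1/4, which is the advantage being leveraged here.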
This is joint work with Robert Nowak and Kevin Jamieson.
Discovery Building, Orchard View Room