Regular bandit problems assume that we have a fixed reward vector and that we need to find the optimal arm for that one vector. In practice, people might have different taste profiles and websites propose more than one arm at a time (or ad, or sale item, etc) for a user to see. This leads to a new formulation of a bandit problem where the optimal arm is drawn from a distribution and the game is to play more than one arm (a fixed number K, K=2 for example) at a time and optimize the rewards for the distribution that the hidden reward vector is coming from.
I will present these ideas and also pontificate on what current linear bandit algorithms do.
August 27 @ 16:00
4:00 pm (1h)
Mezzanine Room A/B, Red Gym