The multi-armed bandit problem is a common sequential decision-making framework where at each time step a player selects an action and receives some reward from selecting that action. The aim is to select actions to maximize the total reward. Commonly it is assumed that the (expected) reward of each action is constant and does not depend on the actions that the player has previously taken. However, in many practical settings, this is not realistic. For example in web-advertising, the benefit from showing an advert is likely to depend on the number of times the user has seen it in the past, and in product recommendation, the reward of suggesting an item will depend on the time since it was last suggested to the customer. In this talk, we will consider several variants of the multi-armed bandit problem where the reward depends on the history of the players’ actions. For each problem, we will discuss whether learning is possible and if so provide algorithms that perform well.
April 1 @ 12:30
12:30 pm (1h)
Discovery Building, Orchard View Room