Learning From Sub-Optimal Data

Learning algorithms typically assume that their input data is well behaved: if one trains an agent on this data, then the agent should, given enough time and compute, eventually learn to solve the intended task. But this is not always a realistic expectation. Sometimes the data given to an agent is flawed or fails to fully convey the intended problem. In other words, the input data is sub-optimal. In this talk, we will discuss two recent advances for overcoming sub-optimal data.

First, we consider the problem of imitation learning from sub-optimal demonstrations. In this setting, a robot receives only failed or flawed demonstrations of a task. It must infer, and subsequently complete, the intended task from these demonstrations alone. Results are presented on a variety of robotics problems, such as door opening and pick-and-place.

Second, we consider the problem of learning from sub-optimal reward functions. Often, the reward functions provided to reinforcement learning agents are derived by combining low-level primitives such as agent position and velocity. For example, the reward for a robot learning to walk might be its forward velocity plus the position of its head. These reward functions are first and foremost intended for human consumption, not for consumption by an RL algorithm. Consequently, it might be possible to learn a better intrinsic reward function that is easier for the RL algorithm to optimize against. We provide a new algorithm for learning such intrinsic reward functions. Optimizing against these learned intrinsic rewards leads to better overall agent performance than optimizing against the raw hand-designed reward function. Crucially, these reward functions can be learned on the fly without significant extra computational cost. Results are presented on a variety of MuJoCo tasks and some hard robotics problems such as block stacking.
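To make the walking example concrete, a hand-designed reward of this kind could be sketched as below. This is only an illustration, not the reward used in the talk: the function name, the reading of "position of its head" as head height, and the `head_weight` parameter are all assumptions.

```python
def hand_designed_walking_reward(forward_velocity, head_height, head_weight=1.0):
    """Hypothetical hand-designed reward for a walking robot.

    Combines two low-level primitives: forward velocity and head position.
    Interpreting head position as head height, and weighting it with
    head_weight, are assumptions made for this sketch.
    """
    return forward_velocity + head_weight * head_height


if __name__ == "__main__":
    # Example call with made-up state values.
    print(hand_designed_walking_reward(forward_velocity=1.5, head_height=0.5))
```

A reward like this is easy for a person to read and write down, which is the sense in which it is designed for human consumption rather than for the optimizer.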

March 27, 2019

12:30 pm (1h)

Discovery Building, Orchard View Room

Bradly Stadie