First, we consider the problem of imitation learning from sub-optimal demonstrations. In this setting, a robot receives failed or flawed demonstrations of a task. It must infer the intended task, and subsequently complete it, from only these failed demonstrations. Results are presented on a variety of robotics problems such as door opening and pick-and-place.
Second, we consider the problem of learning from sub-optimal reward functions. Often, the reward functions provided to reinforcement learning (RL) agents are derived by combining low-level primitives such as agent position and velocity. For example, the reward for a robot learning to walk might be its forward velocity plus the position of its head. These reward functions are designed first and foremost for human consumption, not for consumption by an RL algorithm. Consequently, it might be possible to learn a better intrinsic reward function that is easier for the RL algorithm to optimize. We provide a new algorithm for learning such intrinsic reward functions. Optimizing against these learned intrinsic rewards leads to better overall agent performance than optimizing against the raw hand-designed reward function. Crucially, these reward functions can be learned on the fly without significant extra computational cost. Results are presented on a variety of MuJoCo tasks and some hard robotics problems such as block stacking.
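As a rough illustration of the hand-designed walking reward mentioned above, the sketch below sums forward velocity and head position. The names `WalkerState` and `hand_designed_reward`, the reading of "head position" as head height, and the two weights are assumptions made for this example; it does not show the learned intrinsic reward or the algorithm presented in the talk.

```python
from dataclasses import dataclass


@dataclass
class WalkerState:
    """Hypothetical low-level primitives exposed by a walking simulator."""
    forward_velocity: float  # velocity along the walking direction
    head_height: float       # assumed reading of "position of its head"


def hand_designed_reward(state: WalkerState,
                         velocity_weight: float = 1.0,
                         height_weight: float = 1.0) -> float:
    """Weighted sum of primitives; the weights are illustrative assumptions."""
    return (velocity_weight * state.forward_velocity
            + height_weight * state.head_height)


if __name__ == "__main__":
    # A walker moving forward with its head held up earns more reward
    # than one that has fallen over and stopped.
    upright = WalkerState(forward_velocity=1.5, head_height=0.5)
    fallen = WalkerState(forward_velocity=0.0, head_height=0.25)
    print(hand_designed_reward(upright), hand_designed_reward(fallen))
```

A reward of this form is easy for a person to read and tune, which is the sense in which it is intended for human consumption rather than for the optimizer.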
Discovery Building, Orchard View Room