Beyond the standard benchmarks: on the importance of robust models and where to find them. Jong Wook Kim, OpenAI