SILO: Data for training and evaluating agents: OpenThoughts and Terminal-Bench

Abstract

Over the past year, language models have transitioned from chat interfaces to agentic systems like Claude Code or Codex. In this talk, I will give an overview of two projects to build and understand agentic models in the open. OpenThoughts is a dataset for training reasoning models via supervised fine-tuning. Drawing on over 1,000 experiments with our pipeline, we assemble a training set of 1M examples that yields state-of-the-art performance. Next, I will cover Terminal-Bench, a benchmark for measuring agent performance in terminal environments, which has become an industry standard and illustrates the growing capabilities of agentic models.

Bio

Ludwig Schmidt is an assistant professor at Stanford University and a member of the technical staff at Anthropic. Ludwig’s research interests revolve around the empirical foundations of machine learning, often with a focus on datasets, reliable generalization, and language models. Ludwig’s research group has contributed to open-source machine learning by creating OpenCLIP, LAION-5B, DCLM, OpenThoughts, and Terminal-Bench. Ludwig completed his PhD at MIT and was a postdoc at UC Berkeley. Ludwig’s research has received best paper awards at ICML and NeurIPS, a best paper finalist selection at CVPR, and the Sprowls dissertation award from MIT.

May 13, 2026

12:30 pm (1h)

Orchard View Room

Ludwig Schmidt, Stanford University