I'm a researcher at METR, where I build agent benchmarks, measure the impact of AI tools on software developer productivity, and model trends in AI capabilities.
I think there's a very good chance that humanity is on track to create AI systems substantially smarter and more capable than us. We need to be *really damn careful* about this, because it's an open scientific question whether we can control such systems.
I'm best known for creating GPQA, one of the most widely used AI capability benchmarks (as of mid-2025). I've also done research on scalable oversight (i.e. trying to figure out how to supervise a system that's smarter or more capable than its supervisor), and I was one of the first few employees at Cohere, where I trained embedding models for semantic search.
Other writing
- 2025 — Research Update: Algorithmic vs. Holistic Evaluation — A comparison of automatic scoring vs. human rubric reviews on real repos.
- 2025 — Science Comms at METR — Notes on how we communicate nuanced results at METR without sacrificing clarity.
- 2024 — NYU Code Debates Update/Postmortem — A write-up of what we learned from running AI-assisted debates with non-programmer judges on simple coding tasks.
- 2024 — Can Good Benchmarks Contain Mistakes? — Why benchmarks can still be useful even when they aren't perfect, and how I think about their limits.