About
I'm a researcher at METR, trying to shine a flashlight to help us navigate the cliff's edge we're walking along with AI.
I think there's a very good chance that humanity is on track to create AI systems substantially smarter and more capable than us. We need to be really careful about this, because it's an open scientific question whether we can control such systems.
I'm best known for creating GPQA, a widely used AI capability benchmark. I've also done research on scalable oversight (i.e. trying to figure out how to supervise a system that's smarter or more capable than the supervisor), and I was one of the first few employees at Cohere, where I trained embedding models for semantic search.
Research highlights
- 2025 (METR) HCAST: Human-Calibrated Autonomy Software Tasks
- 2025 (METR) Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
- 2025 (METR) Measuring AI Ability to Complete Long Tasks
- 2023 (NYU) GPQA: A Graduate-Level Google-Proof Q&A Benchmark
- 2023 (NYU) Debate Helps Supervise Unreliable Experts
Other writing
- A conversation about some of the subtle, in-the-weeds details of METR's time horizon methodology.
- A comparison of automatic scoring vs. human rubric reviews on real repos.
- Some notes on how we communicate nuanced results at METR while preserving clarity.
- A write-up of what we learned from running AI-assisted debates with non-programmer judges on simple coding tasks.
- Spotlight presentation for GPQA at the first Conference on Language Modeling.
- Why benchmarks can still be useful even when they aren't perfect, and how I think about their limits.