I'm a researcher at METR, where I build agent benchmarks, measure the impact of AI tools on software developer productivity, and model trends in AI capabilities.
I think there's a very good chance that humanity is on track to create AI systems substantially smarter and more capable than us. We need to be *really damn careful* about this, because it's an open scientific question whether we can control such systems.
I'm best known for creating GPQA, one of the most widely used AI capability benchmarks (as of mid-2025). I've also done research on scalable oversight (i.e. trying to figure out how to supervise a system that's smarter or more capable than its supervisor), and I was one of the first few employees at Cohere, where I trained embedding models for semantic search.
Other writing
- 2025 — Research Update: Algorithmic vs. Holistic Evaluation — A comparison of automatic scoring vs. human rubric reviews on real repos.
- 2025 — Science Comms at METR — Notes on how we communicate nuanced results at METR without sacrificing clarity.
- 2024 — NYU Code Debates Update/Postmortem — A write-up of what we learned from running AI-assisted debates with non-programmer judges on simple coding tasks.
- 2024 — Can Good Benchmarks Contain Mistakes? — Why benchmarks can still be useful even when they aren't perfect, and how I think about their limits.