Humanity’s Last Exam exposes gaps in leading AI models
As AI systems outgrew older academic benchmarks such as MMLU, a worldwide team of nearly 1,000 researchers built Humanity’s Last Exam to probe expert-level knowledge. The 2,500-question assessment spans mathematics, humanities, natural sciences, ancient languages and specialized fields, with each problem designed to have one clear, verifiable answer and to resist simple internet searches.
Researchers screened every question against leading models and removed any item a model could answer correctly, leaving a benchmark just beyond current systems. Early results were low: GPT-4o scored 2.7 percent, Claude 3.5 Sonnet reached 4.1 percent and OpenAI’s o1 model scored 8 percent. The strongest systems tested so far, including Gemini 3.1 Pro and Claude Opus 4.6, have reached accuracy of roughly 40 percent to 50 percent.
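The screening step described above can be illustrated with a minimal Python sketch. It is not the researchers' actual pipeline: the names Question, screen_questions and accuracy are hypothetical, each model is assumed to be a plain function that maps a prompt to an answer string, and grading is assumed to be a normalized exact match against the single reference answer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Question:
    prompt: str
    answer: str  # the single verifiable reference answer


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def screen_questions(
    candidates: List[Question],
    models: Dict[str, Callable[[str], str]],
) -> List[Question]:
    """Keep only the questions that none of the given models answers correctly.

    Assumption: exact match after normalization stands in for real grading.
    """
    kept = []
    for q in candidates:
        solved = any(
            normalize(ask(q.prompt)) == normalize(q.answer)
            for ask in models.values()
        )
        if not solved:
            kept.append(q)
    return kept


def accuracy(questions: List[Question], ask: Callable[[str], str]) -> float:
    """Fraction of questions a single model answers correctly (0.0 for an empty list)."""
    if not questions:
        return 0.0
    correct = sum(normalize(ask(q.prompt)) == normalize(q.answer) for q in questions)
    return correct / len(questions)
```

By construction, every model used in the screening scores zero on the questions that survive it under this sketch's grading, which is one way to read why scores from later models are the meaningful measure of progress.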
Texas A&M’s Dr. Tung Nguyen, who contributed 73 of the 2,500 publicly available questions, said stronger benchmarks are needed so policymakers, developers and users do not overread AI performance on tests designed for people. Most questions remain hidden to limit memorization, while public examples make the benchmark transparent for tracking future progress.