Humanity’s Last Exam exposes gaps in leading AI models

22 June 2026, 14:31

As AI systems outgrew older academic benchmarks such as MMLU, a worldwide team of nearly 1,000 researchers built Humanity’s Last Exam to probe expert-level knowledge. The 2,500-question assessment spans mathematics, humanities, natural sciences, ancient languages and specialized fields, with each problem designed to have one clear, verifiable answer and to resist simple internet searches.

Researchers screened every question against leading models and removed any item a model could answer correctly, leaving a benchmark just beyond the reach of the systems available at the time. Early results were low: GPT-4o scored 2.7 percent, Claude 3.5 Sonnet reached 4.1 percent and OpenAI’s o1 model scored 8 percent. The strongest systems tested so far, including Gemini 3.1 Pro and Claude Opus 4.6, have reached accuracy of roughly 40 to 50 percent.

Texas A&M’s Dr. Tung Nguyen, who contributed 73 of the 2,500 publicly available questions, said stronger benchmarks are needed so policymakers, developers and users do not overread AI performance on tests designed for people. Most questions remain hidden to limit memorization, while public examples make the benchmark transparent for tracking future progress.

Originally reported by sciencedaily.com
