Humanity’s Last Exam exposes gaps in leading AI models

As AI systems outgrew older academic benchmarks such as MMLU, a worldwide team of nearly 1,000 researchers built Humanity’s Last Exam to probe expert-level knowledge. The 2,500-question assessment spans mathematics, humanities, natural sciences, ancient languages and specialized fields, with each problem designed to have one clear, verifiable answer and to resist simple internet searches.

Researchers screened every question against leading models and removed any item a model could answer correctly, leaving a benchmark just beyond current systems. Early results were low: GPT-4o scored 2.7 percent, Claude 3.5 Sonnet reached 4.1 percent and OpenAI’s o1 model scored 8 percent. The strongest systems tested so far, including Gemini 3.1 Pro and Claude Opus 4.6, have reached roughly 40 to 50 percent accuracy.

Texas A&M’s Dr. Tung Nguyen, who contributed 73 of the 2,500 publicly available questions, said stronger benchmarks are needed so policymakers, developers and users do not read too much into AI performance on tests designed for people. Most questions remain hidden to limit memorization, while public examples make the benchmark transparent for tracking future progress.

Originally reported by sciencedaily.com