MGB benchmark tests clinical AI on real-world data
Mass General Brigham’s (MGB) Clinical LLM Benchmark, known as BRIDGE, is designed to test medical language models against messy, real-world clinical material rather than polished exam-style questions. The dataset draws from 59 clinical sources spanning 14 specialties, with tasks such as triage prioritization, procedure coding and discharge instruction generation. Samples come from real electronic health records (EHRs) or peer-reviewed case reports, and the benchmark covers nine languages, including Arabic, Spanish and Chinese.
BRIDGE includes 87 tasks grouped into eight functional categories and standardizes zero-shot, chain-of-thought and few-shot inference modes. Experiments log hyperparameters, compute cost and token usage, while contamination checks are intended to reduce the risk of data leakage. A public leaderboard on Hugging Face Spaces listed 107 models at press time, including OpenAI GPT-4o, Google Gemini and DeepSeek entries.
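To illustrate what standardizing the three inference modes can look like in practice, here is a minimal Python sketch. It is not BRIDGE's actual harness: the function name `build_prompt`, the mode labels and the prompt wording are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical mode labels; BRIDGE's own identifiers may differ.
MODES = ("zero_shot", "cot", "few_shot")


def build_prompt(
    instruction: str,
    text: str,
    mode: str = "zero_shot",
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble one prompt for a clinical task under a given inference mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    parts = [instruction.strip()]

    if mode == "few_shot":
        if not examples:
            raise ValueError("few_shot mode needs at least one worked example")
        # Each worked example is shown as an input/output pair before the real case.
        for example_input, example_output in examples:
            parts.append(f"Input: {example_input}\nOutput: {example_output}")

    parts.append(f"Input: {text}")

    if mode == "cot":
        # Chain-of-thought: ask for reasoning before the final answer.
        parts.append("Think through the case step by step, then give the final answer.")

    parts.append("Output:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        "Assign a triage level.",
        "58-year-old man with chest pain.",
        mode="few_shot",
        examples=[("Ankle sprain, stable vitals.", "non-urgent")],
    ))
```

Keeping the instruction and input identical across modes means any score difference can be attributed to the inference mode rather than to prompt wording.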
Researchers ran 13,572 experiments and found that leading generalist models performed far worse on BRIDGE than on conventional medical exams. Top models scored roughly 45 percent on the benchmark, while some open-source systems fell below 20 percent on multilingual tasks. GPT-4o posted an average BRIDGE macro F1 of 44.8, compared with an average MedQA accuracy of 92.0, a gap of 47.2 points that amounts to a realism penalty.
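For readers unfamiliar with the metric, macro F1 averages the per-class F1 scores so that rare labels count as much as common ones. The Python sketch below shows the calculation on a toy triage example and reproduces the 47.2-point gap from the two reported figures. The labels and predictions are invented for illustration and are not BRIDGE data, and the function is a plain reimplementation rather than the benchmark's scoring code.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every label seen in truth or predictions."""
    labels = sorted(set(y_true) | set(y_pred), key=str)
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (invented labels, not benchmark data).
truth = ["urgent", "urgent", "routine", "routine"]
preds = ["urgent", "routine", "routine", "routine"]
toy_score = macro_f1(truth, preds)  # mean of 2/3 (urgent) and 4/5 (routine)

# The gap reported in the article, from the two published averages.
medqa_accuracy = 92.0
bridge_macro_f1 = 44.8
gap = round(medqa_accuracy - bridge_macro_f1, 1)

if __name__ == "__main__":
    print(f"toy macro F1: {toy_score:.3f}")
    print(f"gap: {gap} points")
```

Note that the gap compares an accuracy figure with a macro F1 figure, so it is best read as a rough indicator rather than a like-for-like difference.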
Hospital technology leaders are using the benchmark to inform vendor selection, risk assessments and specialty-specific thresholds for tools such as documentation assistants, scribes and coding bots. Planned additions include imaging-text fusion tasks within six months, adversarial prompting tests, uncertainty intervals and expanded multilingual coverage.