Large language models require a new form of oversight: capability-based monitoring

Katherine Kellogg and coauthors from the Massachusetts Institute of Technology, Harvard, and Northeastern present an 18-page paper proposing capability-based monitoring for large language models used in healthcare. Written 5 November 2025 and posted 14 November 2025, the paper critiques existing monitoring approaches, which are task-based and inherited from traditional machine learning. The authors note that task-based monitoring assumes performance degradation is driven by dataset drift, an assumption that does not reliably hold for generalist large language models that were not trained for specific tasks or populations.

Capability-based monitoring reframes oversight around overlapping internal capabilities that models reuse across many downstream tasks. The paper highlights examples of such capabilities, including summarization, reasoning, translation, and safety guardrails, and argues that organizing monitoring around these shared abilities enables cross-task detection of systemic weaknesses, long-tail errors, and emergent behaviors that task-based approaches may miss. By focusing on capabilities, organizations can detect issues that propagate across multiple applications rather than evaluating each downstream task independently.
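To make the cross-task idea concrete, here is a minimal sketch of how reviewed outputs from several applications might be pooled by capability rather than by task. It is not taken from the paper: the task names, the review record layout, the function names, and the 0.2 threshold are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical review records: (task, capability exercised, passed review?).
# None of these names or values come from the paper.
sample_reviews = [
    ("discharge_summary", "summarization", True),
    ("discharge_summary", "summarization", False),
    ("patient_message", "summarization", False),
    ("patient_message", "translation", True),
    ("triage_support", "reasoning", True),
    ("triage_support", "reasoning", True),
]


def capability_error_rates(reviews):
    """Pool review outcomes by capability across all downstream tasks."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    tasks = defaultdict(set)
    for task, capability, passed in reviews:
        totals[capability] += 1
        tasks[capability].add(task)
        if not passed:
            errors[capability] += 1
    return {
        cap: {
            "error_rate": errors[cap] / totals[cap],
            "tasks": sorted(tasks[cap]),
        }
        for cap in totals
    }


def flag_capabilities(rates, threshold=0.2):
    """Return capabilities whose pooled error rate exceeds the threshold.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    return sorted(cap for cap, stats in rates.items()
                  if stats["error_rate"] > threshold)


rates = capability_error_rates(sample_reviews)
flagged = flag_capabilities(rates)
```

In this toy data, the summarization failures are spread across two different applications. A per-task view would see one or two failures in each, while the pooled view attributes them to a single shared capability, which is the kind of cross-task signal the paper argues task-based monitoring may miss.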

The authors describe implementation considerations aimed at developers, organizational leaders, and professional societies. They propose that capability-based monitoring offers a scalable foundation for safe, adaptive, and collaborative oversight of large language models and future generalist artificial intelligence models in healthcare. The paper positions this approach as an organizing principle grounded in how these models are developed and used in practice, and as a way to enable broader detection and mitigation of failures that affect multiple clinical and operational use cases.

Originally reported by papers.ssrn.com