Private evals become a strategic edge in AI

Satya Nadella’s recent warning that “A frontier without an ecosystem is not stable” centers enterprise AI strategy on a compounding loop: turning workflows, domain knowledge, and institutional judgment into systems that improve with each use. Private evals, reinforcement learning environments, and queryable knowledge bases can shift value away from rented frontier-model capability and toward business-specific learning that competitors cannot easily copy.

Fin illustrates the approach: it replaced frontier lab models with a system trained on proprietary data. Its Apex model now handles ~100% of English-language chat and email customer conversations and reportedly raised one large gaming customer’s resolution rate from 68% to 75%, cutting unresolved conversations by about 22%. The same logic helps explain why Salesforce’s acquisition of Fin is framed as strategically meaningful.
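The 22% figure is consistent with the two resolution rates quoted for the gaming customer. A minimal sketch of that arithmetic, assuming "unresolved" simply means one minus the resolution rate (the function name is illustrative, not from the source):

```python
def unresolved_reduction(before_rate: float, after_rate: float) -> float:
    """Relative drop in the unresolved share when the resolution rate rises.

    Assumes every conversation is either resolved or unresolved.
    """
    unresolved_before = 1.0 - before_rate
    unresolved_after = 1.0 - after_rate
    return (unresolved_before - unresolved_after) / unresolved_before


# Figures quoted in the article: resolution rate moved from 68% to 75%.
reduction = unresolved_reduction(0.68, 0.75)
print(f"Unresolved conversations fell by {reduction:.1%}")
```

A seven-point gain in resolution rate looks modest, but measured against the 32% of conversations that were previously unresolved it is a reduction of roughly a fifth.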

Cursor has moved in a similar direction with CursorBench, a private benchmark built from real user requests and used to train Composer models against codebase-specific practices. The benchmark suggests efficiency can matter as much as raw score: Composer 2.5 reached 63.2% for just 55 cents and ~15,000 tokens, while Opus 4.8 at full effort barely beat it despite using ~5x the tokens and costing ~14x more.
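To put the multiples in absolute terms, the sketch below derives the implied Opus 4.8 cost and token count from the Composer 2.5 figures. All inputs are the approximate values quoted above; the variable names are illustrative, and the Opus score is not computed because the article does not state it.

```python
# Composer 2.5 figures as quoted (approximate).
composer_cost_usd = 0.55
composer_tokens = 15_000

# Multiples quoted for Opus 4.8 at full effort (approximate).
cost_multiple = 14
token_multiple = 5

# Implied absolute figures for Opus 4.8; rough estimates, not reported values.
implied_opus_cost_usd = composer_cost_usd * cost_multiple
implied_opus_tokens = composer_tokens * token_multiple

print(f"Implied Opus 4.8 cost: ~${implied_opus_cost_usd:.2f}")
print(f"Implied Opus 4.8 tokens: ~{implied_opus_tokens:,}")
```

Under these assumptions the comparison is roughly 55 cents against several dollars for a near-identical score, which is the efficiency gap the benchmark is meant to expose.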

The broader risk is incentive misalignment between AI providers and customers. Without private evals, companies may struggle to know whether higher test-time compute is improving outcomes or simply increasing vendor revenue, leaving them dependent on external labs rather than building their own compounding advantage.

Originally reported by mbi-deepdives.com