Hidden reasoning tokens can erase AI model savings

15 June 2026, 10:38·2 min read

A test against Claude Haiku and Gemini 2.5 Flash exposed a hidden cost problem in model routing. Flash has the lower per-token price, but before it answered “Paris,” it spent a few dozen tokens reasoning, and reasoning is billed as output. Haiku answered in 4 tokens. Flash spent about 28, which meant a lower unit price, far more units, and ~8.6× the bill per request. The discrepancy became visible only because every call was instrumented for tokens, cost, and latency, then written to Postgres.

The experience reframed Artificial Intelligence infrastructure as a continuation of backend systems engineering rather than a wholly new discipline. A model API behaves like any expensive and unreliable downstream dependency: it can be slow, unavailable, rate-limited, and billed per call. The same practices used for payment processors, KYC providers, and partner banks apply to model providers, including circuit breakers, rate limiters, retries, fail-fast behavior, and audit logs.

The gateway used a circuit breaker per provider, modeled on payment infrastructure where partner integrations had their own SLA and their own definition of “up.” When a partner started failing, the wrong response was to keep sending traffic while worker pools filled with stuck calls. The gateway applied the same CLOSED, OPEN, HALF_OPEN state pattern around Anthropic and Gemini, changing the downstream label from partner bank to model provider.

Cost metering followed the same payment-system discipline. Across 20+ Go services, idempotency keys prevented retried requests from double-counting; the gateway uses request_id for the same reason. Cost is stored as NUMERIC rather than a float because money and spending should not be approximated. The familiar distinction between transient and sustained failure also carried over: one timeout can justify an idempotent retry, while sustained failure should stop traffic.

Some problems remain specific to model systems. Token economics is a real discipline, and the ~8.6× surprise does not resemble ordinary database billing. Models are also non-deterministic, so testing cannot rely on exact string matches and shifts toward evals and distributions. The broader lesson is that distributed-systems experience remains central to production Artificial Intelligence work, especially when the goal is to make model calls reliable, observable, and cost-aware.

Originally reported by dev.toRead the source →

Related coverage

Policy

Hidden reasoning tokens can erase AI model savings

Anthropic feud tests US AI controls

Europe pushes for AI autonomy as US lead widens

EU rejects security-risk label after US order on Anthropic models

US Anthropic curbs unsettle G7 AI talks