Satori pushes large language model reasoning with chain-of-action-thought and reinforcement learning

2 July 2025, 09:53

Large language models have shown impressive reasoning skills across different disciplines, but many advances rely on complex systems where an external verifier oversees inference. This approach involves significant test-time computation and frequently splits reasoning into a two-player scenario: the model and an evaluator. Despite this, evidence continues to mount that a single, well-trained language model could handle complex problem solving unaided, provided its reasoning abilities are sufficiently strengthened.

To address this, researchers introduce Satori, a new 7B-parameter large language model built on the principle of internalizing advanced search and self-reflection. The work presents "Chain-of-Action-Thought" (COAT) reasoning, which extends the model's ability not just to think step by step, but to iteratively explore, reflect, and adjust its strategies internally. Training unfolds in two stages: an initial small-scale format-tuning stage to internalize the COAT reasoning style, and a large-scale reinforcement learning stage in which the model iteratively improves itself through self-guided exploration.
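As a rough illustration of what a COAT-style reasoning trace might look like, the sketch below tags each reasoning step with the kind of action the model takes. It is a minimal toy under stated assumptions: the action names (`CONTINUE`, `REFLECT`, `EXPLORE`), the `Step` type, and the `render` helper are hypothetical, chosen only to mirror the "explore, reflect, and adjust" description, and are not taken from the Satori code or data format.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # Hypothetical action labels, not the model's actual tokens.
    CONTINUE = "continue"  # carry on with the current line of reasoning
    REFLECT = "reflect"    # pause and check earlier steps
    EXPLORE = "explore"    # try an alternative approach


@dataclass
class Step:
    action: Action
    thought: str


def render(trace: list[Step]) -> str:
    """Flatten a trace into text, one action-tagged step per line."""
    return "\n".join(f"[{s.action.value}] {s.thought}" for s in trace)


# A made-up trace: an arithmetic slip, a check that catches it, and a retry.
trace = [
    Step(Action.CONTINUE, "Compute 17 * 6 = 112."),
    Step(Action.REFLECT, "Check: 17 * 6 = 102, so the previous step is wrong."),
    Step(Action.EXPLORE, "Redo it as (10 + 7) * 6 = 60 + 42 = 102."),
]
print(render(trace))
```

The point of the sketch is that reflection and exploration sit inside a single generated sequence, rather than being supplied by a separate verifier.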

Satori sets new state-of-the-art results on mathematical reasoning benchmarks and generalizes robustly to tasks beyond its training distribution. By training exclusively on open-source data and models, and committing to open-source the full suite of code, data, and models, the team aims to accelerate community-driven progress in AI reasoning and autonomy. The focus on making sophisticated autoregressive search a native part of model reasoning marks a significant shift away from reliance on external evaluation, paving the way for more autonomous and adaptable language models.

Originally reported by research.ibm.com
