Reasoning Models Explained: From Chain-of-Thought to o3
A technical primer on how reasoning models work — from basic chain-of-thought prompting to OpenAI's o3 and DeepSeek R1. Understanding the inference-time compute revolution.
The Evolution of AI Reasoning
The journey from basic language model outputs to genuine multi-step reasoning represents one of the most significant advances in AI. Understanding this evolution — from simple chain-of-thought prompting to dedicated reasoning models like o3 and DeepSeek R1 — is essential for any developer working with LLMs.
Level 1: Chain-of-Thought Prompting (2022)
The story begins with Google's chain-of-thought (CoT) paper in January 2022. The insight was deceptively simple: if you ask a model to "think step by step," it performs dramatically better on reasoning tasks.
# Without CoT
Q: If a store has 42 apples and sells 3/7 of them, how many remain?
A: 18 ← WRONG
# With CoT
Q: If a store has 42 apples and sells 3/7 of them, how many remain?
A: Let me think step by step.
3/7 of 42 = 42 × 3/7 = 126/7 = 18 apples sold
42 - 18 = 24 apples remain
So 24 apples remain. ← CORRECT
Why it works: By generating intermediate steps, the model creates a "scratchpad" that keeps partial results in context. Without CoT, the model must compute multi-step answers in a single forward pass through its weights — effectively doing mental arithmetic without paper.
Limitation: The model does not actually reason differently. It generates text that looks like reasoning, and this text happens to improve accuracy by keeping intermediate results in the context window.
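The technique above is purely a prompting change. A minimal sketch of what "adding CoT" means in practice, where `build_cot_prompt` is a hypothetical helper (not part of any real SDK) that appends the step-by-step cue before the prompt is sent to whatever chat-completion client you use:

```python
# Hypothetical helper: turn a bare question into a CoT-style prompt.
# The suffix cue nudges the model to emit intermediate steps, so partial
# results stay in the context window instead of being computed implicitly.
COT_SUFFIX = "\n\nA: Let me think step by step."

def build_cot_prompt(question: str) -> str:
    """Wrap a question with a chain-of-thought cue."""
    return f"Q: {question}{COT_SUFFIX}"

prompt = build_cot_prompt(
    "If a store has 42 apples and sells 3/7 of them, how many remain?"
)
```

The resulting string is then sent as the user message; no model- or API-specific changes are needed, which is why prompted CoT works across providers.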
Level 2: Self-Consistency and Verification (2023)
Researchers improved on basic CoT with techniques that generate multiple reasoning chains and select the best:
- Self-Consistency: Generate N different reasoning chains for the same problem, then take the majority vote on the final answer
- Tree of Thoughts: Explore multiple reasoning paths as a tree, evaluating and pruning branches
- Self-Verification: After generating an answer, ask the model to verify its own reasoning and correct errors
These techniques improved accuracy, but inference cost scales linearly with the number of chains generated.
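Self-consistency is simple enough to sketch end to end. In this illustration, `sample_chain` stands in for a temperature > 0 LLM call that returns a (reasoning, final answer) pair; the toy sampler replays canned results so the voting logic is runnable:

```python
from collections import Counter

def self_consistency(sample_chain, question, n=5):
    """Sample n independent reasoning chains, then majority-vote
    on the final answers. Returns (answer, agreement ratio)."""
    answers = [sample_chain(question)[1] for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n

# Toy sampler: three of five chains reach 24, two slip to 18.
fake_chains = iter([("...", 24), ("...", 18), ("...", 24),
                    ("...", 18), ("...", 24)])
answer, agreement = self_consistency(lambda q: next(fake_chains),
                                     "apples?", n=5)
# answer == 24, agreement == 0.6
```

The agreement ratio doubles as a cheap confidence signal: low agreement across chains is a useful trigger for escalating to a stronger model or a human.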
Level 3: Trained Reasoning — o1 and o3 (2024-2025)
OpenAI's o1 (September 2024) and o3 (announced December 2024) represented a paradigm shift: instead of prompting a general-purpose model to reason, these are models trained specifically to reason.
Key differences from prompted CoT:
Internal chain of thought: o1/o3 generate hidden reasoning tokens that are not shown to the user. The model "thinks" in an internal monologue before producing a response.
Reinforcement learning from reasoning: These models are trained using reinforcement learning (RL) where the reward signal is based on reaching correct answers through valid reasoning chains. The model learns which reasoning strategies work and which fail.
Compute allocation: The model dynamically allocates more "thinking" tokens to harder problems. A simple factual question might use 50 internal tokens; a complex math proof might use 10,000+.
Deliberative alignment: The model actively reasons about safety policies and constraints within its chain of thought, rather than relying solely on RLHF-trained instincts.
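The compute-allocation idea can be made concrete with a toy budget function. Real reasoning models learn this allocation implicitly during RL training; the geometric interpolation below is purely illustrative, using the 50-to-10,000-token range mentioned above:

```python
def thinking_budget(difficulty: float, base: int = 50,
                    max_tokens: int = 10_000) -> int:
    """Illustrative only: scale a hidden reasoning-token budget with an
    estimated difficulty in [0, 1]. Trained models learn this allocation;
    this geometric interpolation just shows the shape of the idea."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    # Interpolate geometrically between the floor and the cap.
    return round(base * (max_tokens / base) ** difficulty)

# thinking_budget(0.0) -> 50 tokens (simple factual question)
# thinking_budget(1.0) -> 10000 tokens (hard proof)
```

The geometric (rather than linear) interpolation reflects that token budgets for hard problems tend to grow by orders of magnitude, not by constant increments.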
Level 4: Open Reasoning Models — DeepSeek R1 (2025)
DeepSeek R1, released in January 2025, demonstrated that reasoning capabilities could be achieved through a surprisingly elegant training process:
- Cold start: Basic supervised fine-tuning on a small set of reasoning examples
- Pure RL training: Large-scale reinforcement learning where the model is rewarded only for correct final answers — no human-written reasoning chains required
- Emergent behaviors: The model spontaneously developed reasoning strategies including self-verification, backtracking, and multi-approach problem-solving
The remarkable finding: reasoning capability emerged from RL training alone, without requiring explicit reasoning demonstrations.
# DeepSeek R1's emergent reasoning pattern
<think>
Let me approach this problem step by step.
First, I will try direct calculation...
Wait, that gives 17, which seems wrong because...
Let me try a different approach using modular arithmetic...
Yes, this confirms the answer is 23.
</think>
The answer is 23.
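The training signal behind this behavior is worth sketching. The reward in R1-style RL depends only on whether the visible final answer is correct; the `<think>` monologue is never graded directly. A minimal sketch, assuming the `<think>...</think>` delimiters shown above (the exact-match-by-substring check is a simplification of real answer verification):

```python
import re

def split_think(output: str):
    """Separate the <think>...</think> monologue from the visible reply."""
    m = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    reasoning = m.group(1).strip() if m else ""
    visible = re.sub(r"<think>.*?</think>", "", output,
                     flags=re.DOTALL).strip()
    return reasoning, visible

def outcome_reward(output: str, gold: str) -> float:
    """Outcome-only reward: 1.0 iff the visible answer contains the gold
    answer. The reasoning text itself is never scored, which is what lets
    strategies like backtracking emerge rather than being imitated."""
    _, visible = split_think(output)
    return 1.0 if gold in visible else 0.0

sample = ("<think>Try modular arithmetic... confirms 23.</think>"
          "The answer is 23.")
# outcome_reward(sample, "23") -> 1.0
```

Because only the outcome is rewarded, the model is free to fill the `<think>` span with whatever strategy raises its hit rate, which is how self-verification and backtracking emerge.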
How Reasoning Models Differ Technically
| Aspect | Standard LLM | CoT Prompting | o3 / R1 |
|---|---|---|---|
| Reasoning method | Implicit (single pass) | Explicit (prompted) | Trained (RL-optimized) |
| Token overhead | None | 2-5x | 5-100x |
| Training cost | Standard | None (prompt-only) | Significant RL training |
| Reasoning quality | Low on hard problems | Medium | High |
| Consistency | Variable | Improved with SC | Strong |
| Self-correction | Rare | Occasional | Systematic |
When to Use Reasoning Models
Use reasoning models (o3, R1) for:
- Mathematical proofs and competition-level problems
- Complex code generation requiring architectural planning
- Multi-step logical reasoning with constraints
- Scientific analysis requiring hypothesis evaluation
- Tasks where accuracy matters more than speed or cost
Use standard models with CoT for:
- Most production applications where reasoning complexity is moderate
- Latency-sensitive applications
- High-volume workloads where reasoning model costs are prohibitive
- Tasks where approximate reasoning is sufficient
The Inference-Time Compute Revolution
The core insight behind reasoning models is a new scaling axis: inference-time compute. Traditional scaling focused on training — more data, more parameters, more GPU-hours during training. Reasoning models scale at inference time — more thinking per query, dynamically allocated based on problem difficulty.
This has profound implications for AI system design. Rather than deploying the largest model for every query, systems can route simple questions to fast, cheap models and reserve reasoning models for genuinely hard problems. The cost per token matters less when the model uses 10x more tokens but gets the answer right the first time instead of requiring multiple retries.
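The routing and retry economics described above can be sketched with toy numbers. Everything here is illustrative: the difficulty score is assumed to come from some upstream classifier or cheap LLM judge, and the prices and success rates are invented for the arithmetic, not real model pricing:

```python
def route(query_difficulty: float, threshold: float = 0.7) -> str:
    """Send only genuinely hard queries to the expensive reasoning model.
    The difficulty score is assumed to exist upstream; names are
    illustrative, not real model identifiers."""
    return "reasoning-model" if query_difficulty >= threshold else "fast-model"

def expected_cost(tokens: int, price_per_token: float,
                  success_rate: float) -> float:
    """Expected cost per *solved* task: with independent retries, the
    expected number of attempts is 1 / success_rate."""
    return tokens * price_per_token / success_rate

# Invented numbers: on a hard problem, a cheap model that succeeds 1 time
# in 20 can cost more per solved task than a reasoning model burning
# 10x the tokens but succeeding almost every time.
cheap = expected_cost(tokens=1_000, price_per_token=1e-6, success_rate=0.05)
deep = expected_cost(tokens=10_000, price_per_token=1e-6, success_rate=0.95)
```

With these assumed numbers the reasoning model comes out cheaper per solved task, which is the sense in which per-token price "matters less" than first-attempt accuracy.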
Sources: OpenAI — Learning to Reason with LLMs, DeepSeek — DeepSeek R1 Technical Report, Google Research — Chain-of-Thought Prompting