How Human Reasoning Inspired Machines
We often think of intelligence as being right or knowing the right answer. But real intelligence is more than that. It’s the ability to reason.
Reasoning is the mental process of forming judgments, drawing conclusions, or making inferences by analyzing information, connecting ideas, and applying logic. Consciously or subconsciously, we humans engage in reasoning almost all the time. You might be reasoning while answering a question or while crafting a detailed strategy for your business.
Even as you read this, you're likely reasoning. You may be deciding whether you agree, challenging certain points, or forming new ideas. Every decision we make is shaped by this mental process. Reasoning is not just a skill—it is a defining aspect of human intelligence, the force behind our history, our progress, and our innovations.
But that may no longer be uniquely human. Now, something new is emerging: models that don’t just give answers, but think through them.
What Reasoning Looks Like in AI
Imagine you ask a model:
“A principal of $10,000 is invested at a 5% annual interest rate compounded annually. What will be the total amount after 3 years?”
Older models might jump straight to an answer, sometimes correct, sometimes not. That shortcut works fine when you're asking, say, “What’s the capital of France?” But what if I asked you to calculate compound interest for a given principal, at a certain rate, over time? That’s not something most humans can answer in a blink. We pause. We reason.
And now, so do machines.
Have you ever tried asking an advanced AI model a fairly complex question? What does it do? It plans. It reasons. And the most sophisticated models can go further—they continue, reflect, and even explore.
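To make the example concrete, here is the arithmetic the model ultimately has to reason its way to: a quick Python sketch of the standard compound-interest formula, assuming annual compounding as stated in the prompt.

```python
# amount = principal * (1 + rate) ** years, compounded annually
principal = 10_000
rate = 0.05
years = 3

amount = principal * (1 + rate) ** years
print(f"Total after {years} years: ${amount:,.2f}")  # Total after 3 years: $11,576.25
```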
From Answers to Thought Processes: A Token Revolution
In the world of large language models, everything is a token. But the way those tokens are used has evolved. A new kind of token has emerged: one that signals intent or cognition. It’s a shift from text prediction to thought simulation.
Unlike basic models that rely only on input and output tokens, reasoning models work with an expanded vocabulary of tokens:
Planning tokens <plan>: Help the model structure its thoughts—breaking down complex problems into smaller, manageable tasks with logical execution order and dependencies.
Reasoning tokens <reason>: Allow the model to infer, hypothesize, and draw logical conclusions step by step.
Continue <continue>, Reflect <reflect>, and Explore <explore> tokens (in advanced models): Enable iterative thinking, where the model doesn’t just respond but revisits assumptions, adjusts its approach, or considers alternative outcomes.
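If you want a feel for how such tokens could live in a model's vocabulary, here is a minimal sketch using the Hugging Face Transformers library. The tag names and the base model (gpt2) are illustrative assumptions; production reasoning models define their own internal tokens, which are usually hidden from users.

```python
# Registering illustrative reasoning tokens in a small causal language model.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

new_tokens = ["<plan>", "</plan>", "<reason>", "</reason>",
              "<continue>", "<reflect>", "<explore>"]
tokenizer.add_special_tokens({"additional_special_tokens": new_tokens})

# Grow the embedding matrix so the new tokens get their own vectors.
model.resize_token_embeddings(len(tokenizer))
```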
In case you haven’t experimented with any of the reasoning models yet, here’s a snapshot of Perplexity reasoning through an input prompt. While the intermediate steps shown aren’t the exact “reasoning tokens” or other specific token types we discussed earlier, they still offer valuable insight into how the model is approaching the problem. It gives the user a clearer sense of the model’s thought process, even if the internal mechanics are abstracted away.
After reading all this, you might be wondering—this is great, but how exactly do models reason? Or how do you teach a machine to reason?
Enhancing a model’s ability to reason can happen in two phases:
During training, known as Train-Time Scaling
During inference, known as Test-Time Scaling
In the rest of this article, I’ll walk you through key techniques under each of these two umbrellas, starting with Train-Time Scaling.
Train-Time Scaling: Teaching Models to Reason During Training
Train-time scaling refers to methods applied during the training phase of a model to enhance its capabilities, including reasoning.
The first major technique in this category is:
1. Supervised Fine-Tuning (SFT): Learning from Demonstrations
Let’s go back to a human analogy. Suppose you are learning to solve a complex math problem where reasoning through multiple steps is crucial. What do you typically do? You study similar solved problems, observing how each step is broken down to reach the solution.
The more examples you see—covering different nuances—the better you become at solving not just that specific problem, but similar types of problems.
This approach applies broadly across many areas of reasoning.
Let’s take an example that combines multiple layers of reasoning: solving a quadratic-linear system of equations.
Solving a Quadratic-Linear System of Equations
System:
3x² - 2y = 12
2x + y = 7
Step 1: Identify the approach
Since one equation is linear, solve for y in terms of x and substitute.
Step 2: Solve for y in the linear equation
From 2x + y = 7:
→ y = 7 - 2x
Step 3: Substitute into the quadratic equation
Replace y in 3x² - 2y = 12 with (7 - 2x):
→ 3x² - 2(7 - 2x) = 12
Step 4: Expand and simplify
→ 3x² - 14 + 4x = 12
→ 3x² + 4x - 14 = 12
→ 3x² + 4x - 26 = 0
Step 5: Solve the quadratic equation using the quadratic formula
x = (-b ± √(b² - 4ac)) / (2a)
Where a = 3, b = 4, and c = -26.
Step 6: Compute the result
→ x = (-4 ± √(16 + 312)) / 6 = (-4 ± √328) / 6
In the example above, notice how there’s a consistent structure to approaching the problem:
Identify, Substitute, Simplify, and Solve.
Even when faced with an entirely different set of numbers, you can still solve the problem because you have mastered the reasoning process behind it, not just the specific example.
Similarly, when you provide AI models with numerous examples of complex reasoning tasks, complete with detailed, step-by-step solutions, the model learns to recognize and internalize the underlying patterns.
Here is what it could look like:
<plan> First, solve the linear equation for one variable. </plan>
<reason> We solve 2x + y = 7 for y, since it’s easy to isolate. </reason>
<calculate> y = 7 - 2x </calculate>
<plan> Substitute this expression into the quadratic equation. </plan>
<calculate> 3x² - 2(7 - 2x) = 12 </calculate>
<calculate> 3x² - 14 + 4x = 12 </calculate>
<calculate> 3x² + 4x - 26 = 0 </calculate>
<plan> Use the quadratic formula to solve for x. </plan>
<calculate> x = (-4 ± √(4² - 4×3×(-26))) / (2×3) </calculate>
<answer> x = (-4 ± √(16 + 312)) / 6 = (-4 ± √328) / 6 </answer>
It’s called supervised fine-tuning because you give the model the problem, the step-by-step breakdown, and the final answer; you are supervising its learning by showing it exactly how that type of problem is solved.
Over time, the model learns not just the answers but the process: how to break down a complex problem, reason through each part, and apply similar methods to new problems during inference. In more technical terms, it learns to generate the kinds of intermediate tokens we discussed earlier, the ones that carry this process.
In essence, Supervised Fine-Tuning teaches a model not just what to think, but how to think systematically through a problem.
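To ground this, here is a minimal, hedged sketch of what a single SFT step on such a reasoning trace could look like with a small Hugging Face causal language model. The base model, the example text, and the learning rate are placeholders, not the recipe of any particular lab; the point is simply that the whole trace, intermediate tokens included, becomes the training target.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One training example: the problem plus the full annotated reasoning trace.
example = (
    "Solve: 3x^2 - 2y = 12 and 2x + y = 7\n"
    "<plan> First, solve the linear equation for one variable. </plan>\n"
    "<reason> We solve 2x + y = 7 for y, since it's easy to isolate. </reason>\n"
    "<calculate> y = 7 - 2x </calculate>\n"
    "<calculate> 3x^2 + 4x - 26 = 0 </calculate>\n"
    "<answer> x = (-4 ± √328) / 6 </answer>"
)

# Standard causal-LM objective: the model learns to reproduce the whole trace,
# intermediate planning and reasoning tokens included.
batch = tokenizer(example, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```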
2. Reinforcement Learning (RL): Learning Through Rewards
Giving examples and breaking down step-by-step solutions is the most basic way to help AI models learn how to reason. This is what Supervised Fine-Tuning (SFT) does best—it teaches the model a structured way to approach problems by mimicking human examples. But let’s be honest: that alone can only take you so far.
Models can’t match human reasoning by simply memorizing examples and approaches. And frankly, neither can humans. The real world is messy, full of edge cases, and constantly evolving. Sometimes, you need to learn something entirely new—or creatively adapt what you already know—to succeed in a specific context. That’s where Reinforcement Learning (RL) becomes essential.
This technique has played a critical role in helping modern AI models reach their current level of reasoning ability. Without it, the performance of these models would plateau quickly.
So, What Makes Reinforcement Learning Different?
Let’s go back to the human analogy.
Think about when you first learned to ride a bike. You didn’t sit down and read a step-by-step guide on balance. Instead, you got on the bike with very little knowledge. You pedaled. You leaned too far one way. You fell. That was negative feedback. Then you found the right balance and moved forward—that was positive feedback.
Over time, your body and brain, consciously or subconsciously, started making adjustments. You learned what worked and what didn’t, and eventually, riding a bike became second nature. Reinforcement learning in machines works in a very similar way.
How Reinforcement Learning Works in Language Models
Just like a child learning to balance on a bike, a model typically starts with some basic reasoning ability—often gained through supervised fine-tuning. Without this base, it might not even produce coherent outputs (issues like language mixing could occur, though we won’t dive into that here).
Once the model has this foundational ability, it’s put into a reinforcement learning loop. Here’s how it works:
The model is given a task and generates a response.
That response is then evaluated using a reward function—a kind of feedback mechanism.
If the response is high-quality, the model gets a positive reward (say +1). It learns that the context, steps, or reasoning it used were effective and should be reinforced.
If the response is poor, it gets low or negative rewards (say 0 or -1). The model then learns to avoid those patterns in future attempts.
Over thousands (or millions) of iterations, the model fine-tunes its reasoning strategy, learning which steps and contextual cues lead to better outcomes. It becomes more adept not just at providing answers—but at reasoning its way to them.
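For intuition, here is a deliberately simplified, REINFORCE-style sketch of that loop: generate a response, score it with a reward function, and nudge the model toward higher-reward outputs. Real systems use more sophisticated algorithms such as PPO or GRPO, and the model, prompt, and reward function below are placeholder assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def reward_fn(text: str) -> float:
    # Placeholder reward: +1 if the known correct answer appears, else -1.
    return 1.0 if "11,576.25" in text or "11576.25" in text else -1.0

prompt = "What is $10,000 at 5% interest, compounded annually, after 3 years?"
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

for step in range(100):
    # 1. The model generates a response (sampling, so attempts vary).
    generated = model.generate(**inputs, do_sample=True, max_new_tokens=64,
                               pad_token_id=tokenizer.eos_token_id)
    response = tokenizer.decode(generated[0, prompt_len:], skip_special_tokens=True)

    # 2. The response is scored by the reward function.
    reward = reward_fn(response)

    # 3. Log-probabilities of the sampled response tokens are scaled by the
    #    reward: good attempts are reinforced, bad ones discouraged.
    logits = model(generated).logits[:, :-1, :]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, generated[:, 1:].unsqueeze(-1)).squeeze(-1)
    loss = -(reward * token_logp[:, prompt_len - 1:]).mean()

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```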
What Are These Rewards Based On?
In the context of Large Language Models (LLMs), rewards can be based on several criteria. Here are a few common ones (though this is by no means a comprehensive list):
Rule-Based Rewards: For domains with clear correctness (e.g., math problems or code), the model is rewarded if the answer is correct or if the code passes test cases.
Format Rewards: If an answer follows a desired format—like placing reasoning steps between special tags—the model is rewarded for adherence.
Language Consistency: To prevent the model from mixing languages mid-output, it may be rewarded for maintaining output in the target language throughout its reasoning process.
Each time the model successfully reaches a reward, it's a form of validation—proof that it’s on the right track. These signals are used to adjust internal weights and embeddings, helping the model shape intermediary tokens and future outputs more effectively and reliably.
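As a concrete illustration, the criteria above can be folded into a single scoring function. The tag names, weights, and the crude language check below are illustrative assumptions, not any lab's published reward design.

```python
import re

def score_response(response: str, expected_answer: str) -> float:
    reward = 0.0

    # Rule-based reward: the final answer inside <answer> tags is correct.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match and expected_answer in match.group(1):
        reward += 1.0

    # Format reward: reasoning is enclosed in the expected tags.
    if "<reason>" in response and "</reason>" in response:
        reward += 0.5

    # Language-consistency reward: a crude check that the output stays in one
    # script (here, plain ASCII for an English-only setting).
    if response.isascii():
        reward += 0.25

    return reward

print(score_response(
    "<reason> 10000 * 1.05 ** 3 </reason><answer> 11576.25 </answer>",
    "11576.25",
))  # 1.75
```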
3. Model Distillation: Learning from a Mentor
You might’ve noticed something surprising: some smaller models are just as good, or even better, at reasoning than their larger counterparts. They’re faster, more efficient, and still deliver impressive results. A great example is OpenAI’s o3-mini, which delivers much of the reasoning strength of the larger o3, but it’s not the only one. This growing trend raises an important question: how are these smaller models getting so good at reasoning?
The answer lies in a technique called Model Distillation.
At its core, model distillation is the process of transferring the knowledge and capabilities of a larger, more advanced model into a smaller one. Think of it as condensing years of your tutor’s experience into a crash course for students, without losing the essence of what makes that experience valuable.
A Real-World Analogy: Learning from the Expert
Let’s bring this down to everyday life.
Say you’ve spent years mastering a skill—maybe solving complex math problems, building business strategies, or navigating difficult decisions. You’ve tried and failed, experimented with different approaches, and finally figured out what works best. If someone asked you to teach them, you wouldn’t walk them through every misstep you made. You’d give them the best version of your process—the distilled essence of your learning.
That’s exactly what model distillation is in the world of AI.
A large model—one that’s been through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL)—learns through countless iterations how to break down problems, structure reasoning, and generate high-quality solutions. Once this expertise is solidified, a smaller model is trained on the outputs of the larger model: the prompts, the structured reasoning steps, and the final answers.
In short, the smaller model learns how to think by observing how the bigger model thinks.
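Here is a minimal sketch of that idea, sometimes called sequence-level distillation: the teacher writes out full reasoning traces, and the student is fine-tuned on them exactly like the SFT step shown earlier. The model names and the single prompt are placeholder assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Teacher: a larger model assumed to already reason well after SFT and RL.
teacher_tok = AutoTokenizer.from_pretrained("gpt2-large")
teacher = AutoModelForCausalLM.from_pretrained("gpt2-large")

# Student: a smaller model that will imitate the teacher's traces.
student_tok = AutoTokenizer.from_pretrained("gpt2")
student = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

prompts = ["Solve 2x + y = 7 and 3x^2 - 2y = 12. Show your reasoning step by step."]

for prompt in prompts:
    # 1. The teacher writes out a full reasoning trace for the prompt.
    ids = teacher_tok(prompt, return_tensors="pt")
    trace_ids = teacher.generate(**ids, max_new_tokens=256,
                                 pad_token_id=teacher_tok.eos_token_id)
    trace = teacher_tok.decode(trace_ids[0], skip_special_tokens=True)

    # 2. The student is trained to reproduce that trace (the same SFT loss as before).
    batch = student_tok(trace, return_tensors="pt", truncation=True)
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```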
Why Does Distillation Work So Well?
Smaller models don’t need to go through the same exhaustive training process as their larger counterparts. Instead, they’re trained on high-quality data—already curated and refined by the larger model’s reasoning engine. And as we all know: better data means better models. Garbage in, garbage out—but in this case, it’s the opposite. The distilled data is full of structured, thoughtful, and effective reasoning patterns.
This makes the smaller models incredibly effective with far less compute and time. They inherit the skills and problem-solving capabilities of the original model, almost like an apprentice learning from a master. That’s how you get models like o3-mini, which can mirror the reasoning strength of o3 in a leaner, faster form.
So far, we have seen a few core techniques that help models reason under the umbrella of train-time scaling. A training pipeline might use any combination of these techniques, and even apply them in multiple stages, to build a model’s reasoning capability. Now let’s move to the next phase: test-time scaling. While train-time methods build a strong reasoning foundation into the model’s weights, inference-time techniques exploit that foundation by allocating extra compute to guide, verify, or expand reasoning chains.
Test-Time Scaling: Helping Models Reason During Inference
Test-time scaling, also known as inference-time scaling, refers to strategies applied after a model has been trained. These techniques don't change the model’s underlying capabilities—instead, they enhance its reasoning during inference by giving it more “thinking time” or computational resources. Think of it as helping the model perform better on the spot without retraining it from scratch.
So how exactly do we help a trained model reason more effectively in real-time?
That’s where the family of “X of Thought” reasoning techniques comes into play.
1. Chain of Thought (CoT): Step-by-Step Thinking
One of the foundational techniques in this family is Chain of Thought (CoT) reasoning. While a model may internalize the general structure of CoT during training, the more clearly the reasoning is broken down into detailed, step-by-step logic at inference time, the better the model performs.
For example, earlier models like GPT-3.5 showed notable improvements simply by being prompted to “think step by step.” This approach gave the model room to unpack complex problems methodically rather than leaping to conclusions.
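In practice, this can be as simple as wrapping the question in a step-by-step instruction. In the sketch below, call_llm is a placeholder for whatever client you use to query a model; it is not a real API.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wire this up to your model provider of choice.")

def chain_of_thought(question: str) -> str:
    # Zero-shot CoT: nudge the model to reason before answering.
    prompt = (
        f"{question}\n\n"
        "Let's think step by step, then state the final answer on the last line."
    )
    return call_llm(prompt)

# Example usage:
# chain_of_thought("A principal of $10,000 is invested at a 5% annual interest "
#                  "rate compounded annually. What will the total be after 3 years?")
```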
Another breakthrough came with increasing the CoT window length, as seen in models like O1. By allowing more space for the model to articulate its thinking, it could break down problems into deeper and more structured reasoning chains, leading to more accurate answers.
But CoT isn’t without its limitations. It works best when the problem follows a mostly linear structure. In reality, many problems are not so straightforward—they’re messy, multidimensional, and require evaluating multiple options in parallel.
2. Tree of Thought (ToT): Exploring Multiple Paths
That’s where Tree of Thought (ToT) comes in.
If Chain of Thought is like solving a problem by walking a single road (linear path), Tree of Thought is like navigating a trail map with multiple branching paths. Instead of committing to one line of reasoning, the model explores several possible routes simultaneously—each representing a different way to solve the problem.
At every decision point, it assesses the branches and chooses the one that appears most promising, based on expected outcomes or "rewards."
Imagine you’re planning a vacation. You consider multiple destinations—Italy, Japan, Peru. For each option, you weigh trade-offs: budget, weather, travel time, cultural appeal. After exploring the pros and cons, you settle on the destination that aligns best with your goals.
That’s exactly how Tree of Thought works: by comparing alternative paths and choosing the most rewarding one. Related techniques include Graph of Thought (GoT), Step Back to Leap Forward, and others.
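A minimal sketch of that branching-and-pruning idea, essentially a small beam search over partial reasoning paths, is shown below. The propose_thoughts and score_thought helpers stand in for model calls, and the branching factor, depth, and beam width are arbitrary assumptions.

```python
from typing import List

def propose_thoughts(partial_solution: str, k: int = 3) -> List[str]:
    raise NotImplementedError("Ask the model for k candidate next steps.")

def score_thought(partial_solution: str) -> float:
    raise NotImplementedError("Ask the model (or a verifier) to rate this path.")

def tree_of_thought(problem: str, depth: int = 3, beam_width: int = 2) -> str:
    beam = [problem]
    for _ in range(depth):
        # Branch: extend every surviving path with several candidate thoughts.
        candidates = [path + "\n" + thought
                      for path in beam
                      for thought in propose_thoughts(path)]
        # Prune: keep only the most promising branches.
        candidates.sort(key=score_thought, reverse=True)
        beam = candidates[:beam_width]
    return beam[0]
```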
3. Reflection and Self-Evaluation: Rethinking the Answer
Another powerful inference-time technique is self-reflection, often working hand-in-hand with Tree of Thought. Remember those specialized tokens we explored earlier—Continue, Reflect, and Explore? Models equipped with these capabilities embody exactly this kind of reasoning. They don’t just think—they pause, evaluate, and course-correct.
Think of it like a model stopping mid-task, reviewing what it has done so far, and asking itself: “Does this make sense?”
Let’s return to a human analogy.
Imagine you're crafting a strategy to grow revenue. You brainstorm a few options: acquire a smaller company, launch a new product, expand into a new market, or form a strategic partnership. Then, you evaluate each path against your company’s strengths, resources, and long-term goals. After reflecting, you discard the less viable ideas and double down on the one with the highest potential.
AI models can now do something surprisingly similar.
The Reflexion technique enables language models to generate verbal feedback on their own outputs, saying things like, “This step contradicts an earlier point...” or “This conclusion doesn’t logically follow.” The model can then revise its answer based on that internal critique.
Similarly, Self-Refine pushes this even further. It allows the model to iteratively critique and improve its responses, gradually aligning its output with the intended objective.
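Here is a minimal sketch of such a critique-and-revise loop in the spirit of Reflexion and Self-Refine. As before, call_llm is a placeholder for your model client, and the stopping rule (the critique replies 'OK') is an illustrative assumption.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wire this up to your model provider of choice.")

def self_refine(task: str, max_rounds: int = 3) -> str:
    answer = call_llm(f"Task: {task}\nGive your best answer.")
    for _ in range(max_rounds):
        critique = call_llm(
            f"Task: {task}\nAnswer: {answer}\n"
            "Point out any logical errors or gaps. Reply with just 'OK' if there are none."
        )
        if critique.strip().upper() == "OK":
            break  # The model is satisfied with its own answer.
        answer = call_llm(
            f"Task: {task}\nPrevious answer: {answer}\nCritique: {critique}\n"
            "Rewrite the answer, fixing the issues raised in the critique."
        )
    return answer
```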
This is a form of meta-reasoning: the ability to reason about its own reasoning. Just as we double-check ourselves, these models can now do the same. If you use the Deep Research feature in any of today’s state-of-the-art models, or ask a complex question, this kind of self-evaluation is likely happening in the background.
These are just a few of the many techniques that enhance a model’s ability to reason. But the field is evolving rapidly. Even as you read this, researchers around the world are working on more advanced architectures and refining the reasoning strategies we've explored—whether it's Chain of Thought, Tree of Thought, or self-reflection methods like Reflexion and Self-Refine.
If you haven’t consciously noticed these techniques in action before, try paying closer attention the next time you interact with advanced AI tools like Gemini, ChatGPT, or Perplexity. You’ll often find the model breaking down a problem step by step, pausing to evaluate its reasoning, or exploring alternative paths before arriving at a final answer.
Final Thoughts
Reasoning models represent a significant step toward superintelligence, or AGI. The current era of agentic AI, where autonomous agents perform complex tasks, is enabled by breakthroughs in reasoning. This core capability has unlocked a level of intelligence once considered uniquely human. Whether you're building with AI, choosing the right tools, or simply curious about how it all works, understanding this foundation of the intelligence stack will help you navigate what's coming next and, of course, quench your curiosity. What we've covered here is a collection of insights from various research papers, and only the basics at that; more advanced techniques for improving models' reasoning have appeared just in the past quarter. Stay tuned for more.