Forget "Think step by step", Here's How to Actually Improve LLM Accuracy
- James Wilkins
And What Happened to CoT Prompting?

"Think step by step"
...was once great prompt engineering advice, but now seems to have little to no effect.
In fact, what if I told you that this technique, at best, has little effect on output quality, and at worst, increases costs, latency, and may even reduce the accuracy of your response?
It decreases perfect accuracy through added variation,
Generates post-hoc rationalisations that don't reflect the model's thought process,
Results in overly verbose responses that are less effective for getting answers.
Now, these are some bold claims with a lot to unpack, but we're going to break them down and take a look at some of the research from the past few years to see what happened to "let's think step by step".
We'll also discover some alternative methods we can use to actually improve the accuracy of our LLM responses in 2026.
Chain of Thought (CoT) Prompting
If you've spent any time learning about how to use AI more effectively, then you've probably heard that adding "think step by step" and similar trigger phrases to your prompts can increase the accuracy of responses.
It's the idea of "Chain of Thought" (CoT) prompting, where we ask a model to show its thinking to help it provide better, more accurate results, particularly on logical reasoning tasks such as math and logic problems; programming and bug fixing; and scientific reasoning and explanations.
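To make this concrete, here's a minimal sketch of zero-shot CoT prompting using the OpenAI Python client; the model name is a placeholder, and the appended trigger phrase is the only "engineering" involved:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

# Classic zero-shot CoT: append the trigger phrase to the question.
cot_prompt = question + "\n\nLet's think step by step."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)
```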
What you'll notice, however, when you try this advice out, is that it seems to make no difference whatsoever.
In both cases, the model provides a correct answer and a valid justification. So what's going on here? If anything, the first response is better; even if it had gotten the answer wrong, we could at least have worked out where the error was introduced.
AI models have changed a lot over the past few years. While this advice was super beneficial a couple of years ago, its magic is disappearing.
A Brief History of CoT
"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". This was the 2022 paper that kickstarted it all. The authors present a few-shot prompting framework that dramatically improved the accuracy of results from Large Language Models (LLMs). By providing examples of intermediate reasoning steps as a guide, researchers were able to get a model to explain its thinking before reaching an answer. This unlocked far greater reasoning ability in models that 'just make the model bigger' had failed to accomplish.
Suddenly the way we used models had far more of an impact on the results we were able to get.
Shortly after, researchers proposed a zero-shot approach, introducing the famous "Think step by step" trigger. This revealed that models could generate their own reasoning paths without needing to be supplied with examples. This effectively elicited what researchers call "multi-hop reasoning" - the ability to infer a solution through a series of connected logical jumps rather than a single direct answer.
This era established that "thinking" via natural language allowed models to decompose problems and allocate more compute to difficult tasks by generating more tokens to support their answer.
The Disappearing Magic of "Think step by step."
As we saw in the example above, this trigger seems to have lost its power. Modern AI models now reason naturally without needing to be asked. Through training on massive datasets, these models learned to show their thinking automatically. The phrase has become redundant.
In fact, continuing to add such trigger phrases to prompts (in newer, more capable models) can sometimes lead to poorer results:
Diminishing Returns: For dedicated reasoning models, explicit CoT prompting often yields negligible gains in accuracy while drastically increasing response time and token costs.
Detrimental Effects: In non-reasoning models, CoT can introduce increased variability, sometimes causing the model to fail on "easy" questions it would have answered correctly with a direct response.
Context Window Variability: Forcing a model to produce a long reasoning trace can lead to "unproductive overthinking" where models waste thousands of tokens on trivial intermediate steps.
It's not that a model will likely get the answer wrong if you ask it to think step by step; it can still correctly answer most questions we throw at it. Rather, the negative impacts showed up mainly in perfect accuracy, while overall accuracy saw minimal change.
Perfect accuracy: The strictest metric. A question only counts as correct if the model answers it correctly on every trial (e.g. 25/25), making it suitable for tasks with zero tolerance for error.
Overall accuracy: The average correctness across all trials. Each correct response contributes to the score, even if the model fails other attempts at the same question.
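To make the distinction concrete, here's a small illustrative sketch of how the two metrics differ over repeated trials (the trial results are made up):

```python
# Each question is answered on several independent trials;
# True means the model answered that trial correctly.
trials = {
    "q1": [True, True, True, True, True],    # correct on every trial
    "q2": [True, True, False, True, True],   # one slip-up
    "q3": [False, False, True, False, False],
}

# Overall accuracy: average correctness across all trials.
all_results = [r for runs in trials.values() for r in runs]
overall_accuracy = sum(all_results) / len(all_results)

# Perfect accuracy: a question only counts if every one of its trials is correct.
perfect_accuracy = sum(all(runs) for runs in trials.values()) / len(trials)

print(f"Overall accuracy: {overall_accuracy:.2f}")  # 0.67
print(f"Perfect accuracy: {perfect_accuracy:.2f}")  # 0.33
```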
An Important Distinction
I think it's important to make the distinction between Reasoning and Non-Reasoning models here, as they interact with CoT quite differently.
Non-reasoning models, or 'System 1' models, rely on intuition, knowledge learned during pre-training, and pattern matching to provide fast responses.
Reasoning models, or 'System 2' models (OpenAI o1/o3 and DeepSeek-R1), are trained via reinforcement learning to perform deliberate, slow thinking, involving backtracking and self-correction. These are bigger models that are more expensive to run and so generally have stricter usage limits, but their answers and trains of thought are far more accurate.
Non-reasoning models can also be 'taught' to reason by being fine-tuned (sometimes also known as instruction-tuned) on the step-by-step reasoning traces of bigger, more accurate reasoning models. This improves their accuracy without increasing the size of the model, but also comes with some caveats:
Post-Hoc Rationalisation: Research shows that for many instruction-tuned models, the generated CoT acts merely as a justification for a pre-determined answer rather than active guidance.
Unfaithful Shortcuts: Even when models reach the correct conclusion, they may use "Unfaithful Illogical Shortcuts" - subtly flawed reasoning to make a speculative guess look like fact. This sometimes includes changing previously stated facts to make it fit a logical argument.
This suggests reasoning chains from non-reasoning models may be emulation without understanding rather than insight into the model's internal decision making.
NoThinking and the Illusion of Reasoning
One of the more surprising findings from recent research is that explicit reasoning isn’t always necessary at all.
Studies have explored what happens when we deliberately suppress visible reasoning, making the model think it's already finished its reasoning steps through prompt augmentation. It works by pre-filling the AI’s response with a dummy or fabricated thinking block (`<think> Okay, I think I have finished thinking. </think>`), which forces the model to jump directly to the final solution.
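As a rough sketch of what that prefill looks like with an open reasoning model via Hugging Face transformers (the model name is a placeholder; the dummy `<think>` block is the part doing the work):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # placeholder reasoning model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

question = "How many prime numbers are there between 10 and 30?"

# Build the chat prompt, then pre-fill the assistant turn with a dummy
# "finished thinking" block so the model skips straight to the answer.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)
prompt += "<think> Okay, I think I have finished thinking. </think>"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```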
Counterintuitively, this approach (dubbed NoThinking) performs just as well. It is particularly effective in low-budget settings, often outperforming standard thinking models when token usage is controlled.
Another approach, known as NOWAIT, involves suppressing "reflection tokens" (“Let’s think”, “Actually…”, “Wait a second”, "Hmm") within the model itself. This technique has been shown to reduce reasoning chain lengths by up to half without compromising the overall utility of the model.
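NOWAIT itself works inside the decoding loop, but a crude approximation with Hugging Face transformers is to ban a handful of reflection cues at generation time via `bad_words_ids`; a simplified sketch, with an illustrative word list and placeholder model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # placeholder reasoning model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Reflection cues to discourage during generation (illustrative, not the paper's exact list).
reflection_words = ["Wait", " Wait", "Hmm", " Hmm", "Actually", " Actually"]
bad_words_ids = [
    tokenizer(word, add_special_tokens=False).input_ids for word in reflection_words
]

inputs = tokenizer("What is 17 * 24?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, bad_words_ids=bad_words_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```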
One explanation lies in what researchers call the “Aha Moment” paradox. Tokens that appear to signal internal reflection feel meaningful to us as users, but they don’t necessarily correspond to better internal computation.
This suggests that much of what we interpret as “thinking” is simply verbalisation, not computation.
More advanced frameworks push this idea further. Techniques like CoLaR (Compressed Latent Reasoning) allow models to reason internally rather than out loud. In experimental settings, this reduced reasoning token length by up to 83%, while preserving problem-solving performance.
As it turns out, reasoning can exist without explanation, and forcing models to narrate every step may be inefficient - or even harmful.
Note. It's also important not to overcorrect here. While some reasoning traces are unfaithful or post-hoc, visible chains of thought can still be extremely valuable for humans. They help with debugging, error localisation, teaching, and trust calibration. This is especially important in educational or high-stakes settings where understanding why a model failed matters more than raw accuracy.
The issue is not that step-by-step explanations are useless, but that we’ve conflated useful explanations with necessary computation. Modern models often perform the computation internally, and the reasoning we see is best understood as an interface for human consumption rather than a faithful window into the model’s internal process.
Prompt Repetition
Building on the idea that "thinking" can happen silently, Google Research recently discovered that simply repeating your request twice can dramatically boost accuracy.
This would transform your prompt from something like:
A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?
...to:
A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?
A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?
It's really that simple.
This technique works because it gives the AI a "second look" at your instructions; since modern AI models process text in a sequence, repeating the prompt allows the model to attend to the entire context more effectively. Unlike "think step by step," this method generally does not increase the time you wait for an answer or the number of tokens the model generates.
In one dramatic example from the paper, this simple change helped a non-reasoning model jump from 21% accuracy to over 97% on a complex task. This reinforces the shift toward "silent reasoning," suggesting that sometimes the best way to help an AI "think" isn't by forcing it to explain itself, but by giving it the structural room to process your input more deeply.
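In code, this is about as simple as prompting tricks get; a minimal sketch with a placeholder model name:

```python
from openai import OpenAI

client = OpenAI()

def repeat_prompt(prompt: str, times: int = 2) -> str:
    """Concatenate the same prompt so the model gets a 'second look' at it."""
    return "\n".join([prompt] * times)

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder non-reasoning model
    messages=[{"role": "user", "content": repeat_prompt(question)}],
)
print(response.choices[0].message.content)
```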
Ensembles, Parallelism, and the Future of Reasoning Efficiency
If long chains of thought aren’t always the answer, what does improve accuracy?
One promising direction is ensemble methods. Instead of asking the model once, you sample multiple responses to the same prompt, then select the most common or most consistent answer.
The overall idea is that the most common answer is quite often the right one: there can be multiple valid routes to the correct answer, but it's much harder for several independent reasoning paths to converge on the same wrong answer (a code sketch of this follows below).
There are lots of ways we can adapt this method, too:
Using different temperature settings to allow models to explore different reasoning paths to see if they can come up with the same answer
Using an ensemble of different models with different strengths to see where they agree
This can also be combined with NoThinking to quickly generate a large amount of 'gut instinct' answers in parallel, then aggregate them to decide on a final answer. Research shows that this can outperform a single 'thinking' response while having much faster run time.
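Here's a minimal sketch of that sample-and-vote loop (often called self-consistency); the model name, temperature, and sample count are illustrative choices:

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

question = (
    "A farmer has 17 sheep. All but 9 run away. "
    "How many are left? Answer with a number only."
)

def sample_answers(prompt: str, n: int = 5, temperature: float = 0.8) -> list[str]:
    """Sample several independent answers to the same prompt."""
    answers = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            temperature=temperature,  # higher temperature encourages varied reasoning paths
            messages=[{"role": "user", "content": prompt}],
        )
        answers.append(response.choices[0].message.content.strip())
    return answers

# Majority vote: the most common answer wins.
answers = sample_answers(question)
final_answer, votes = Counter(answers).most_common(1)[0]
print(f"{final_answer} ({votes}/{len(answers)} votes)")
```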
A recent approach called Universal Self-Consistency (USC) replaces strict aggregation of answers with something more flexible: the LLM itself evaluates and selects the most consistent output from several reasoning paths. This makes ensembling viable not just for structured problems like math, but also for open-ended tasks such as coding, summarisation, and analysis.
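A rough sketch of the USC selection step, reusing candidates from a sampling loop like the one above; the wording of the selection prompt is my own paraphrase, not the exact prompt from the paper:

```python
from openai import OpenAI

client = OpenAI()

def usc_select(task: str, candidates: list[str]) -> str:
    """Ask the model itself to pick the most consistent of its own candidate responses."""
    numbered = "\n\n".join(f"Response {i + 1}:\n{c}" for i, c in enumerate(candidates))
    selection_prompt = (
        f"Task: {task}\n\n{numbered}\n\n"
        "Evaluate these responses and select the one that is most consistent "
        "with the others. Reply with the number of the chosen response only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": selection_prompt}],
    )
    choice = int(response.choices[0].message.content.strip())
    return candidates[choice - 1]
```

Because the selection is free-form rather than an exact-match vote, the same loop works for answers that never repeat verbatim, such as summaries or code.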
That said, this strategy isn’t universally optimal. For modern reasoning models with very high baseline accuracy, parallel sampling can become computational overkill, offering diminishing returns as models approach their performance ceiling.
Finally, it’s worth addressing the role of tools. Much of the perceived “thinking illusion” stems from token constraints rather than reasoning ability. When reasoning models are augmented with external tools (Python interpreters, scratchpads, or symbolic solvers), they consistently outperform non-reasoning models on complex tasks. In these cases, the key improvement comes not from longer chains of thought, but from offloading computation to the right medium.
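As a stripped-down sketch of that offloading pattern, here's a single round of tool calling with the OpenAI client, handing arithmetic to Python; the tool name and schema are my own, and a real agent loop would feed the result back to the model and handle errors:

```python
import json
from openai import OpenAI

client = OpenAI()

# A single tool that lets the model offload arithmetic instead of "reasoning" it out in tokens.
tools = [{
    "type": "function",
    "function": {
        "name": "run_python",  # hypothetical tool name
        "description": "Evaluate a Python expression and return the result.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "What is 48,391 * 729?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    # The model chose to delegate the computation; run it and (in a full loop)
    # send the result back in a follow-up "tool" message for the final answer.
    expression = json.loads(message.tool_calls[0].function.arguments)["expression"]
    print(eval(expression))  # eval is unsafe outside a sandbox; illustration only
```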
Closing thoughts
The era of blindly telling models to “think step by step” is ending.
In 2026, improving accuracy is less about forcing verbose reasoning and more about choosing the right model, the right abstraction level, and the right computational strategy, whether that’s silent reasoning, parallel sampling, or tool-assisted problem solving.
Thinking still matters. We just don’t always need to see it.
References
Wei, J. et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," presented at the 36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Kojima, T. et al. (2022). "Large Language Models are Zero-Shot Reasoners," published in Advances in Neural Information Processing Systems.
DeepSeek-AI. (2025). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning".
Alomrani, M. A. et al. (2025). "Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs," published by Huawei Noah’s Ark Lab and McGill University.
Lewis-Lim, S. et al. (2025). "Analysing Chain of Thought Dynamics: Active Guidance or Unfaithful Post-hoc Rationalisation?" from the University of Sheffield.
Tan, W. et al. (2025). "Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains," MiLM Plus, Xiaomi Inc. and Renmin University of China.
Arcuschin, I. et al. (2025). "Chain-of-Thought Reasoning In The Wild is not always faithful".
Ma, W. et al. (2025). "Reasoning Models Can Be Effective Without Thinking," UC Berkeley Sky Computing Lab.
Meincke, L. et al. (2025). "Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting," Generative AI Labs, The Wharton School.
Baek, D. D., & Tegmark, M. (2025). "Towards Understanding Distilled Reasoning Models: A Representational Approach," Massachusetts Institute of Technology.
Liu, H. et al. (2023). "LogiCoT: Logical Chain-of-Thought Instruction Tuning," Zhejiang University and Westlake University.
Chen, X. et al. (2024). "Universal Self-Consistency for Large Language Models," Google LLC.
Song, et al. (2025). "Thinking Isn’t an Illusion: Overcoming the Limitations of Reasoning Models via Tool Augmentations".
Wang, Y. et al. (2025). "Wait, We Don’t Need to “Wait”!" [NOWAIT method].
Naminas, K. (2025). "LLM Reasoning: Fixing Generalization Gaps," Label Your Data.
Leviathan, Y., Kalman, M., & Matias, Y. (2025). "Prompt Repetition Improves Non-Reasoning LLMs," Google Research (preprint).