ARTICLES ·2026-05-18 ·BY EFFLOOW CONTENT FACTORY

Chain of Draft: Cut LLM Reasoning Tokens by 80%

Chain of Draft (CoD) cuts LLM output tokens by 78–92% vs Chain of Thought with a single prompt change. Learn the technique, benchmarks, and when to use it.
prompt-engineering llm cost-optimization chain-of-thought paper-poc reasoning

Every reasoning token your LLM outputs is a token you pay for. With Chain of Thought (CoT) prompting, that bill gets large quickly — a single GSM8K math problem can cost 200 output tokens just in reasoning steps before the model writes an answer. For production systems handling thousands of requests per hour, this adds up fast.

In February 2025, researchers at Zoom Communications published a paper titled "Chain of Draft: Thinking Faster by Writing Less" (arxiv:2502.18600) describing a prompting technique that reduces reasoning tokens by 78–92% while preserving near-identical accuracy. The method requires no model retraining, no fine-tuning, and no architecture changes. It's a single sentence added to your system prompt.

Effloow Lab reviewed the paper, reproduced its benchmark tables from the reported figures, and analyzed the cost implications. This article walks through the findings and shows how to implement the technique.

What Chain of Draft Actually Does

Chain of Thought instructs a model to explain its reasoning before giving an answer. That explanation is almost always longer than it needs to be. A model solving 15 - 7 + 3 might spend 60–70 tokens narrating each arithmetic operation, then end with the answer.

Chain of Draft applies a constraint: each intermediate reasoning step must be five words or fewer. The model still reasons through the problem — it does not skip steps — but it compresses each thought into a dense shorthand notation rather than a full sentence.

Compare the two outputs for the same arithmetic word problem:

Chain of Thought response (~67 tokens):

Let me work through this step by step. First, I need to find how 
many apples Sarah has. Sarah starts with 15 apples and gives away 
7 apples to her friends. So Sarah has 15 - 7 = 8 apples remaining. 
Next, she buys 3 more apples from the store. So now Sarah has 8 + 3 
= 11 apples. Therefore, Sarah has 11 apples in total.

Chain of Draft response (~15 tokens):

15-7=8; 8+3=11; 11 apples.

Both responses are correct. The model understood the problem, applied the right operations in sequence, and reached the right answer. The CoD version just communicates its working in the minimal notation a human mathematician would use on a scratch pad.
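To make the size gap concrete, the two responses can be compared with a rough counter. The sketch below is only an approximation: `rough_token_count` is a hypothetical helper that splits text into words, numbers, and punctuation marks, which is not how a real model tokenizer works. Treat its output as a relative comparison, not a billable token count.

```python
import re

COT_RESPONSE = (
    "Let me work through this step by step. First, I need to find how "
    "many apples Sarah has. Sarah starts with 15 apples and gives away "
    "7 apples to her friends. So Sarah has 15 - 7 = 8 apples remaining. "
    "Next, she buys 3 more apples from the store. So now Sarah has 8 + 3 "
    "= 11 apples. Therefore, Sarah has 11 apples in total."
)
COD_RESPONSE = "15-7=8; 8+3=11; 11 apples."


def rough_token_count(text: str) -> int:
    """Approximate token count: each word, number, or punctuation mark counts once.

    Assumption: this is a stand-in for a real tokenizer and will not match
    the counts reported by any provider's API.
    """
    return len(re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", text))


cot_tokens = rough_token_count(COT_RESPONSE)
cod_tokens = rough_token_count(COD_RESPONSE)
print(f"CoT ~{cot_tokens} units, CoD ~{cod_tokens} units")
print(f"CoD is ~{cod_tokens / cot_tokens:.0%} of the CoT length")
```

For billing-accurate numbers, read the `usage` field returned by the API rather than estimating locally.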

This is the key insight of the paper: verbose reasoning steps are a training artifact, not a cognitive requirement. Models are trained on human-written text where people explain their thinking at length. But the length of the explanation does not determine the quality of the underlying reasoning.

The Prompt Template

The entire technique comes down to one additional instruction in your prompt. The standard CoD template from the paper is:

Think step by step, but only keep a minimum draft for each thinking 
step, with 5 words at most.

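You prepend this to whatever problem you're giving the model:

# Example input; substitute your own problem text.
problem = "Sarah has 15 apples. She gives 7 to friends and buys 3 more. How many apples does she have?"

COD_SYSTEM_PROMPT = (
    "Think step by step, but only keep a minimum draft for each "
    "thinking step, with 5 words at most."
)

# Standard CoT: ask for full step-by-step reasoning
cot_prompt = f"Think step by step: {problem}"

# Chain of Draft: same problem, with the draft-length constraint in front
cod_prompt = f"{COD_SYSTEM_PROMPT}\n\n{problem}"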

That is the entire implementation. No special libraries, no API parameters, no post-processing. If you are already using CoT prompting, you replace your CoT instruction with the CoD instruction and get the token reduction immediately.

Benchmark Results: What the Paper Found

The researchers evaluated CoD across three reasoning categories using GPT-4o and Claude 3.5 Sonnet.

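Task                 | Model             | CoT Accuracy | CoD Accuracy | CoT Tokens | CoD Tokens | Token Reduction
---------------------|-------------------|--------------|--------------|------------|------------|----------------
GSM8K (arithmetic)   | GPT-4o            | 95.3%        | 91.4%        | 205.1      | 43.9       | 78.6%
GSM8K (arithmetic)   | Claude 3.5 Sonnet | 91.5%        | 89.2%        | 190.0      | 39.8       | 79.1%
Date Understanding   | GPT-4o            | 96.0%        | 95.0%        | 141.7      | 26.5       | 81.3%
Coin Flip (symbolic) | GPT-4o            | 100.0%       | 100.0%       | 119.3      | 9.1        | 92.4%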

Source: Xu et al., 2025 (arxiv:2502.18600). Effloow Lab reproduced this table from the paper's reported figures.

A few things are worth noting in this data:

Symbolic reasoning sees the largest gains. The coin flip task, which tracks whether a coin ends up heads or tails after a sequence of flips, dropped from 119.3 tokens to 9.1 tokens with no accuracy loss. When the reasoning chain is state-tracking rather than explanation-writing, the compression is near-total.

Arithmetic trades a few accuracy points for a large token saving. GSM8K accuracy dropped 3.9 percentage points on GPT-4o (95.3% → 91.4%) and 2.3 points on Claude 3.5 Sonnet (91.5% → 89.2%). Whether that trade is acceptable depends entirely on your use case. For a math tutor application that shows working, it is not. For a backend calculation agent where cost matters more than explanation quality, it probably is.

Common-sense reasoning holds up well. Date understanding on BIG-bench dropped only 1 percentage point in accuracy (96.0% → 95.0%) while saving 81% of tokens. This suggests that CoD's compression preserves inferential correctness even for tasks that require world knowledge, not just arithmetic.
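The reduction and accuracy figures discussed here follow directly from the table's token and accuracy columns. The sketch below recomputes them so the arithmetic can be checked; the `Row` class and its property names are illustrative, and the numbers are the ones quoted in this article, not fresh measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    task: str
    model: str
    cot_acc: float      # CoT accuracy, percent
    cod_acc: float      # CoD accuracy, percent
    cot_tokens: float   # mean output tokens with CoT
    cod_tokens: float   # mean output tokens with CoD

    @property
    def token_reduction_pct(self) -> float:
        """Share of output tokens saved by switching from CoT to CoD."""
        return (1 - self.cod_tokens / self.cot_tokens) * 100

    @property
    def accuracy_delta_pts(self) -> float:
        """Accuracy change in percentage points (negative means CoD is worse)."""
        return self.cod_acc - self.cot_acc


ROWS = [
    Row("GSM8K", "GPT-4o", 95.3, 91.4, 205.1, 43.9),
    Row("GSM8K", "Claude 3.5 Sonnet", 91.5, 89.2, 190.0, 39.8),
    Row("Date Understanding", "GPT-4o", 96.0, 95.0, 141.7, 26.5),
    Row("Coin Flip", "GPT-4o", 100.0, 100.0, 119.3, 9.1),
]

for row in ROWS:
    print(
        f"{row.task:<20} {row.model:<18} "
        f"tokens -{row.token_reduction_pct:.1f}%  "
        f"accuracy {row.accuracy_delta_pts:+.1f} pts"
    )
```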

What This Costs in Practice

Effloow Lab calculated the direct API cost impact using GSM8K token counts and output pricing from May 2025 (GPT-4o: $10/1M output tokens; Claude Sonnet: $15/1M output tokens):

Model             | CoT cost (1k calls) | CoD cost (1k calls) | Savings
------------------|---------------------|---------------------|--------
GPT-4o            | $2.05               | $0.44               | 78.6%
Claude 3.5 Sonnet | $2.85               | $0.60               | 79.1%

At 100,000 reasoning calls per day, the difference is roughly $160/day on GPT-4o or $225/day on Claude Sonnet — before accounting for input token overhead from the CoD instruction itself, which adds approximately 15–20 tokens per call and is negligible at scale.
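The cost figures above are simple multiplication, and a small calculator makes it easy to plug in your own volumes and prices. This is a sketch under the article's stated assumptions: the GSM8K mean token counts from the benchmark table and the May 2025 output prices. The function names are illustrative, and input-token costs are deliberately ignored.

```python
# Output price in USD per 1M tokens (May 2025 figures quoted in this article).
OUTPUT_PRICE_PER_MILLION = {"GPT-4o": 10.00, "Claude 3.5 Sonnet": 15.00}

# Mean GSM8K output tokens per call, taken from the benchmark table.
MEAN_OUTPUT_TOKENS = {
    "GPT-4o": {"cot": 205.1, "cod": 43.9},
    "Claude 3.5 Sonnet": {"cot": 190.0, "cod": 39.8},
}


def output_cost(model: str, strategy: str, calls: int) -> float:
    """Output-token cost in USD for `calls` requests with the given strategy."""
    tokens = MEAN_OUTPUT_TOKENS[model][strategy] * calls
    return tokens * OUTPUT_PRICE_PER_MILLION[model] / 1_000_000


def daily_savings(model: str, calls_per_day: int) -> float:
    """USD saved per day by switching from CoT to CoD (output tokens only)."""
    return output_cost(model, "cot", calls_per_day) - output_cost(
        model, "cod", calls_per_day
    )


for model in OUTPUT_PRICE_PER_MILLION:
    cot = output_cost(model, "cot", 1_000)
    cod = output_cost(model, "cod", 1_000)
    saved = daily_savings(model, 100_000)
    print(f"{model}: ${cot:.2f} vs ${cod:.2f} per 1k calls, ~${saved:.0f}/day saved")
```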

Latency also improves. The paper reports a reduction from 4.2 seconds per call to 1.0 second on GPT-4o for GSM8K tasks (76% faster). On Claude 3.5 Sonnet, latency dropped from 3.1 seconds to 1.6 seconds (48% faster). Output tokens are generated one after another, so a shorter response finishes sooner. Time to first token (TTFT) is largely unaffected; the gain is in total completion time.

Implementing CoD in a Real Application

Here is a complete implementation pattern using the Anthropic SDK. The same structure applies to any OpenAI-compatible client:

import anthropic

client = anthropic.Anthropic()

COD_INSTRUCTION = (
    "Think step by step, but only keep a minimum draft for each "
    "thinking step, with 5 words at most."
)

def solve_with_cod(problem: str, model: str = "claude-sonnet-4-6") -> dict:
    """
    Solve a reasoning problem using Chain of Draft prompting.
    Returns the response text and token usage.
    """
    response = client.messages.create(
        model=model,
        max_tokens=256,
        messages=[
            {
                "role": "user",
                "content": f"{COD_INSTRUCTION}\n\n{problem}"
            }
        ]
    )
    return {
        "answer": response.content[0].text,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }


def solve_with_cot(problem: str, model: str = "claude-sonnet-4-6") -> dict:
    """Same problem solved with standard Chain of Thought for comparison."""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": f"Think step by step: {problem}"
            }
        ]
    )
    return {
        "answer": response.content[0].text,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }


# Usage
problem = (
    "A train travels at 60 mph for 2 hours, then slows to 40 mph "
    "for 1.5 hours. What is the total distance traveled?"
)

cod_result = solve_with_cod(problem)
cot_result = solve_with_cot(problem)

print(f"CoD output tokens: {cod_result['output_tokens']}")
print(f"CoT output tokens: {cot_result['output_tokens']}")
print(f"CoD answer: {cod_result['answer']}")

If you are running this at scale with prompt caching, note that the CoD instruction (COD_INSTRUCTION) is only ~20 tokens, well below the minimum cacheable prompt length in Claude's API (1,024 tokens or more, depending on the model), so caching it on its own saves nothing. Caching pays off when the instruction is part of a longer stable prefix: system context and few-shot examples can be cached together with it, while the per-request problem goes after the cached portion.
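As a rough illustration of the caching layout, the sketch below builds the keyword arguments for `client.messages.create` with the stable material (system context, CoD instruction, few-shot examples) in a single system block marked with `cache_control`, and the changing problem in the user turn. `build_cached_request` and its parameters are hypothetical names; whether a cache entry is actually created depends on the prefix meeting the API's minimum cacheable length for the chosen model, so check the `usage` fields of the response to confirm.

```python
COD_INSTRUCTION = (
    "Think step by step, but only keep a minimum draft for each "
    "thinking step, with 5 words at most."
)


def build_cached_request(
    system_context: str,
    few_shot_examples: str,
    problem: str,
    model: str = "claude-sonnet-4-6",
    max_tokens: int = 256,
) -> dict:
    """Return kwargs for client.messages.create with a cacheable system prefix.

    Everything that stays identical across requests goes into the system
    block; only the problem varies, so it is kept out of the cached prefix.
    """
    stable_prefix = "\n\n".join([system_context, COD_INSTRUCTION, few_shot_examples])
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": stable_prefix,
                # Marks the end of the prefix that the API may cache.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": problem}],
    }


request = build_cached_request(
    system_context="You are a backend calculation agent.",
    few_shot_examples="Q: 2 pens at $3 each. Total?\nA: 2*3=6; $6.",
    problem="A train travels at 60 mph for 2 hours. How far does it go?",
)
print(request["system"][0]["text"])
```

With the SDK client from the example above, the call would be `client.messages.create(**request)`.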

Few-Shot vs Zero-Shot: An Important Caveat

The paper's results come from a few-shot evaluation setup, where the model sees 3–8 examples of CoD-style answers before the test problem. In zero-shot mode (no examples), accuracy drops more noticeably — especially on smaller models.

If you are using CoD on a capable frontier model (GPT-4o, Claude Sonnet 4.6, Gemini 1.5 Pro or better), zero-shot performance is generally acceptable. The model has enough capacity to infer the intended output format from the instruction alone.

For smaller models (under ~7B parameters), or for domain-specific tasks where the model might not understand what "minimum draft" means in context, provide 2–3 few-shot examples to anchor the output format:

FEW_SHOT_COD = """
Think step by step, but only keep a minimum draft for each thinking 
step, with 5 words at most.

Q: Sarah has 15 apples. She gives 7 to friends and buys 3 more. 
   How many apples does she have?
A: 15-7=8; 8+3=11; 11 apples.

Q: A store sells shirts for $25. They offer a 20% discount. 
   What is the discounted price?
A: 20% of 25=5; 25-5=20; $20.

Q: {problem}
A:"""

This pattern tends to produce more consistent CoD-style output, particularly from smaller models.
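If your problems can contain curly braces (JSON snippets, code, set notation), filling the template with `str.format` will raise an error or substitute the wrong text. One way around that, sketched here with a hypothetical `build_few_shot_prompt` helper, is to assemble the prompt from a list of example pairs by joining lines instead of formatting a template.

```python
COD_INSTRUCTION = (
    "Think step by step, but only keep a minimum draft for each "
    "thinking step, with 5 words at most."
)

# (question, draft-style answer) pairs; these mirror the examples above.
EXAMPLES = (
    (
        "Sarah has 15 apples. She gives 7 to friends and buys 3 more. "
        "How many apples does she have?",
        "15-7=8; 8+3=11; 11 apples.",
    ),
    (
        "A store sells shirts for $25. They offer a 20% discount. "
        "What is the discounted price?",
        "20% of 25=5; 25-5=20; $20.",
    ),
)


def build_few_shot_prompt(problem: str, examples=EXAMPLES) -> str:
    """Assemble instruction, worked examples, and the new problem into one prompt."""
    lines = [COD_INSTRUCTION, ""]
    for question, answer in examples:
        lines += [f"Q: {question}", f"A: {answer}", ""]
    # End on an open "A:" so the model continues in the same draft style.
    lines += [f"Q: {problem}", "A:"]
    return "\n".join(lines)


print(build_few_shot_prompt('Parse {"a": 2, "b": 3} and return a*b.'))
```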

Chain of Draft vs Extended Thinking: Different Tools, Different Problems

When Claude's extended thinking and similar reasoning modes (such as OpenAI's o-series models) came to prominence, developers started wondering whether CoD was still relevant. The two solve different problems.

Extended thinking (Claude's effort parameter or budget_tokens in older API versions) gives the model an internal scratchpad that is billed but not shown to the user in most implementations. The model can reason extensively before generating its visible answer. The goal is accuracy — you're paying for more thinking to get a better answer.

Chain of Draft constrains visible output token length. The goal is cost reduction — you're accepting slightly compressed intermediate steps in exchange for a dramatically shorter and cheaper response.

When to use Chain of Draft
  • High-volume production API calls where cost per call matters
  • Latency-sensitive applications (customer support, real-time agents)
  • Tasks with clear right/wrong answers (arithmetic, logic, classification)
  • Cases where the intermediate reasoning is not shown to end users
When to use Extended Thinking instead
  • Complex multi-step problems where accuracy is the primary constraint
  • Research or analysis tasks where you need high-quality intermediate steps
  • Educational contexts where readable working matters
  • Tasks at the edge of the model's capability where more reasoning helps

You can also combine both in a hybrid: use CoD for your first-pass triage step, then route ambiguous or low-confidence results to extended thinking for a second pass.
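The hybrid can be sketched as a small router. Everything here is an assumption made for illustration: the two solvers are passed in as plain callables (wire them to your own CoD and extended-thinking calls), and "low confidence" is approximated by sampling the CoD solver more than once and escalating when the final numbers disagree or cannot be parsed. That check costs extra CoD calls and is only informative when sampling is non-deterministic.

```python
import re
from typing import Callable, Optional

Solver = Callable[[str], str]  # takes a problem, returns the response text


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number in the final ';'-separated step, or None."""
    final_step = text.strip().split(";")[-1].replace(",", "")
    matches = re.findall(r"-?\d+(?:\.\d+)?", final_step)
    return float(matches[-1]) if matches else None


def solve_tiered(
    problem: str,
    cod_solver: Solver,
    thorough_solver: Solver,
    samples: int = 2,
) -> dict:
    """Try CoD first; escalate when the sampled drafts disagree or are unparseable."""
    drafts = [cod_solver(problem) for _ in range(samples)]
    answers = [extract_final_number(d) for d in drafts]
    agreed = answers[0] is not None and all(a == answers[0] for a in answers)
    if agreed:
        return {"route": "cod", "answer": answers[0], "text": drafts[0]}
    text = thorough_solver(problem)
    return {"route": "escalated", "answer": extract_final_number(text), "text": text}


# Stub solvers standing in for real API calls.
result = solve_tiered(
    "Sarah has 15 apples. She gives 7 away and buys 3 more. How many now?",
    cod_solver=lambda p: "15-7=8; 8+3=11; 11 apples.",
    thorough_solver=lambda p: "Sarah ends up with 11 apples.",
)
print(result["route"], result["answer"])
```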

Where CoD Struggles

Despite its strong benchmark results, Chain of Draft has real limitations that the paper acknowledges.

Long-chain proofs and formal verification. Tasks requiring dozens of sequential deduction steps — mathematical proofs, theorem verification, multi-hop logical inference — may lose information when each step is compressed. The five-word constraint occasionally causes models to merge two logically distinct steps into one compressed notation, which can introduce errors in long chains.

Open-ended generation. CoD is optimized for problems with definite answers. For creative writing, code generation, or analysis tasks where the intermediate "reasoning" is actually part of the deliverable, CoD is the wrong tool.

Multilingual and cross-lingual reasoning. The original paper focused on English. Whether the compression approach works equally well when the problem or expected output is in another language has not been established in the published research.

Ambiguous problems. When a problem is underspecified and the model needs to reason about what interpretation to take, the five-word constraint can cause it to commit to an interpretation too quickly without surfacing the ambiguity.

FAQ

Q: Does Chain of Draft work with system prompts or only user prompts?

Either works. If you have a system prompt, append the CoD instruction to your system prompt before the user's message. If you're building a user-turn prompt, prepend the instruction to the user's content. The paper's evaluation used the instruction in the user turn, but placement has minimal effect on frontier models.

Q: How do I know if CoD is hurting accuracy for my specific task?

Run an A/B evaluation: send the same 50–100 sample problems to both your CoT and CoD prompts, then score the outputs. If the accuracy difference is under 3–4 percentage points, CoD is likely the right choice for cost-sensitive production use. If the gap is larger, add few-shot examples first, then re-evaluate. A sample of this size gives only a coarse estimate, so use a larger set if a difference of a few points matters for your application.
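A minimal harness for that comparison might look like the following. It is a sketch, not a full evaluation framework: the solver interface (a callable returning the final answer and the output token count) is an assumption, scoring is plain exact match, and the 4-point threshold is just the rule of thumb from this answer.

```python
from typing import Callable, Sequence

# Hypothetical interface: a solver takes a problem and returns
# (final_answer, output_tokens). Wire it to your own API client.
Solver = Callable[[str], tuple[str, int]]
Dataset = Sequence[tuple[str, str]]  # (problem, expected_answer)


def evaluate(solver: Solver, dataset: Dataset) -> dict:
    """Score one prompting strategy: exact-match accuracy and mean output tokens."""
    correct = 0
    total_tokens = 0
    for problem, expected in dataset:
        answer, output_tokens = solver(problem)
        correct += answer.strip() == expected.strip()
        total_tokens += output_tokens
    return {
        "accuracy_pct": 100 * correct / len(dataset),
        "mean_output_tokens": total_tokens / len(dataset),
    }


def compare(
    cot_solver: Solver,
    cod_solver: Solver,
    dataset: Dataset,
    max_gap_pts: float = 4.0,
) -> dict:
    """Run both strategies on the same problems and summarize the trade-off."""
    cot = evaluate(cot_solver, dataset)
    cod = evaluate(cod_solver, dataset)
    gap = cot["accuracy_pct"] - cod["accuracy_pct"]
    reduction = 100 * (1 - cod["mean_output_tokens"] / cot["mean_output_tokens"])
    return {
        "accuracy_gap_pts": gap,
        "token_reduction_pct": reduction,
        "cod_acceptable": gap <= max_gap_pts,
    }
```

Exact match is strict; for free-form answers you would normalize the outputs or extract the final value before comparing.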

Q: Does the "5 words" limit need to be exactly five words?

No. The original paper used "5 words at most" but notes that this is a guideline, not a hard technical constraint. The model interprets it as "be concise in each step." You can experiment with "3 words at most" for even more aggressive compression or "8 words at most" for better accuracy on complex tasks.
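To experiment with different limits, the instruction can be generated from a parameter, and drafts can be checked against the limit after the fact. Both helpers below are hypothetical, and the checker assumes that steps are separated by semicolons or newlines, which matches the examples in this article but is not guaranteed for every model.

```python
import re


def cod_instruction(max_words: int = 5) -> str:
    """Build the CoD instruction with a configurable per-step word limit."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    return (
        "Think step by step, but only keep a minimum draft for each "
        f"thinking step, with {max_words} words at most."
    )


def steps_over_limit(draft: str, max_words: int = 5) -> list[str]:
    """Return the draft steps that use more whitespace-separated words than allowed."""
    steps = [s.strip() for s in re.split(r"[;\n]", draft) if s.strip()]
    return [s for s in steps if len(s.split()) > max_words]


print(cod_instruction(3))
print(steps_over_limit("15-7=8; 8+3=11; 11 apples."))
print(steps_over_limit("First I subtract seven from fifteen; 8+3=11"))
```

Logging how often `steps_over_limit` returns a non-empty list is one way to see whether a tighter or looser limit changes how the model actually writes its drafts.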

Q: Does CoD work with function calling and tool use?

CoD affects the text reasoning that appears in the assistant's response, not the underlying tool call behavior. If your chain-of-thought appears in a separate reasoning field or thinking block, CoD would apply to that. If reasoning is embedded in the regular content before a tool call, the five-word constraint can sometimes cause the model to issue a tool call with less deliberation than CoT would produce. Test carefully before deploying in multi-tool agent scenarios.

Q: Is there an updated version of the paper with 2025–2026 model results?

The original paper (v1) from February 2025 covers GPT-4o and Claude 3.5 Sonnet. A v2 on arXiv includes additional model evaluations. Given that newer models (Claude Sonnet 4.6, GPT-4.5, Gemini 2.5 Pro) generally produce more reliable compact reasoning, the token reduction figures likely hold or improve on these models.

Key Takeaways

Chain of Draft is one of the most practically deployable findings from recent LLM research. It requires no infrastructure changes, no model swaps, and no code beyond updating a prompt string. The token reduction of 78–92% is large enough to materially change the economics of reasoning-heavy API applications.

The trade-off is real but bounded: in the paper's few-shot results, arithmetic tasks lose roughly 2–4 percentage points of accuracy, while symbolic and common-sense tasks lose almost nothing. For most production applications where reasoning is an intermediate step rather than the final product, this is an acceptable trade.

The technique fits cleanly into a tiered reasoning strategy: use CoD for high-volume, latency-sensitive calls; route edge cases or low-confidence results to extended thinking for a more thorough pass. This combination captures most of the cost savings while preserving accuracy where it matters most.

Bottom Line

Chain of Draft is a production-ready prompting technique backed by solid paper evidence. If you are running CoT at scale and have not tried CoD, add the instruction to your next API call and measure the output token difference — the results are usually immediate and significant.


Effloow Lab reproduced the benchmark table and cost calculations in this article from the paper's reported figures (arxiv:2502.18600). No live API calls were made during this lab run due to environment constraints. The prompt templates and code examples are derived from the paper's methodology and the official repository at github.com/sileix/chain-of-draft.
