ARTICLES ·2026-05-21 ·BY EFFLOOW CONTENT FACTORY

Multi-Agent Debate: Better LLM Reasoning Through Peer Critique

How multi-agent debate improves LLM accuracy by 8+ points on math benchmarks. Paper breakdown and 90-line Python PoC implementation.
multi-agent llm-reasoning paper-poc claude-api prompt-engineering ai-research

Getting a single LLM to reliably reason through a hard problem is difficult. The model might confidently state a wrong answer, pick the simpler arithmetic path, or hallucinate a biographical fact it cannot verify. Prompting tricks like chain-of-thought help, but they can still fail in characteristic ways — especially on multi-step math or factual recall.

A 2023 MIT/Google paper took a different approach: instead of making one model smarter, make multiple models argue with each other. The core claim of "Improving Factuality and Reasoning in Language Models through Multiagent Debate" (Du et al., 2023, ICML 2024) is simple — LLMs are better critics than they are generators. A model that produces a wrong answer will often catch that same error when seeing it labeled as someone else's answer.

This post breaks down the paper's mechanism, shows how the 2025 MACA research extended it with reinforcement learning, and walks through a minimal 90-line Python implementation of the debate loop using the Claude API.

Effloow Lab wrote and syntax-validated the PoC code in a sandbox (see data/lab-runs/multi-agent-debate-llm-reasoning-poc-2026.md). Live API runs require your own ANTHROPIC_API_KEY.


What Is Multi-Agent Debate?

The intuition is borrowed from science: a claim that survives peer review is more trustworthy than one that never faced scrutiny. Multi-Agent Debate (MAD) applies this at inference time.

The protocol in Du et al. (2023):

  1. Round 1 — independent generation. Each of N agents (separate LLM calls with the same model) generates its own answer and reasoning without seeing what others wrote.
  2. Round 2+ — critique and revision. Every agent receives the other agents' answers from the previous round. It is asked to critique those answers, identify errors, and revise its own position if another agent's reasoning is more convincing.
  3. Final consensus. After R rounds, a majority vote or final-round summary produces the output.

The method requires no fine-tuning, no external tool calls, and no specialized infrastructure. Every component is a plain API call.


The Paper's Results

Du et al. tested 3 agents over 3 rounds against a single-model baseline using GPT-3.5-Turbo and GPT-4. The paper covers six benchmarks; the table below summarizes five results:

Task                      Single Agent   3-Agent 3-Round Debate   Gain
Arithmetic (GSM8K)        77.4%          85.7%                    +8.3pp
Biography factuality      66.2%          73.0%                    +6.8pp
MMLU                      72.3%          75.1%                    +2.8pp
Chess move validity       43.0%          54.0%                    +11.0pp
Math reasoning (MATH)     79.8%          84.2%                    +4.4pp

The pattern holds: MAD helps most on structured reasoning tasks (arithmetic, chess) where errors have a definite correct answer that a critic can identify. Gains are smaller on MMLU, where knowledge gaps in one agent often appear identically in other agents using the same underlying model.

The reason debate works on math but not uniformly on knowledge tasks: a model that consistently makes the same factual error will not self-correct when shown that error in another agent's output. Debate surfaces reasoning errors better than knowledge gaps.


Why Critique Is Easier Than Generation

There is a useful asymmetry here. When asked "what is the average speed of a trip with two segments?", many LLMs reach for the simpler arithmetic mean of the two speeds. But when shown a peer answer of "36 mph" alongside "37.5 mph" with reasoning, the same model will often correctly identify which derivation is wrong — because checking a calculation requires less working memory than generating one from scratch.
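The two figures quoted above are consistent with a trip of two equal-distance legs at 30 mph and 45 mph. Those speeds are an assumption chosen because they reproduce 36 and 37.5; the source does not state them. Under that assumption, a small sketch of the two derivations a critic would be comparing:

```python
def naive_average_speed(v1: float, v2: float) -> float:
    """Arithmetic mean of the two speeds (the tempting but wrong shortcut)."""
    return (v1 + v2) / 2


def true_average_speed(v1: float, v2: float, leg_distance: float = 1.0) -> float:
    """Total distance divided by total time for two equal-distance legs."""
    total_distance = 2 * leg_distance
    total_time = leg_distance / v1 + leg_distance / v2
    return total_distance / total_time


# Assumed illustrative speeds (not from the source): 30 mph out, 45 mph back.
print(naive_average_speed(30, 45))           # 37.5
print(round(true_average_speed(30, 45), 6))  # 36.0
```

Checking which of the two is right only requires confirming that the slower leg takes longer, so it should weigh more in the average. That is a smaller job than deriving the formula unprompted.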

This asymmetry is the paper's central finding. It explains why:

  • The gain is highest for arithmetic and lowest for pure knowledge recall
  • More debate rounds past 2–3 produce diminishing returns (the incorrect agent has usually updated by round 2)
  • Heterogeneous models debate better than homogeneous ones (different models carry different error distributions)

The 90-Line PoC

Below is the core of the debate loop for the Claude API: the single-shot baseline, the per-agent round, and the round driver. The full file, including the test harness, is at /tmp/mad-poc/multi_agent_debate.py (syntax-validated).

import anthropic
from typing import NamedTuple

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

class AgentAnswer(NamedTuple):
    agent_id: int
    answer: str

SYSTEM_DEBATE = """You are a rigorous reasoning agent participating in a collaborative debate.
In round 1, propose your best answer with step-by-step reasoning.
In subsequent rounds, you receive other agents' answers. Critique any errors you find,
then revise your own answer only if another agent's reasoning is more convincing.
Do not change a correct answer just to reach consensus.
End with: FINAL ANSWER: <your answer>"""

def single_shot(question: str) -> str:
    """Baseline: single Claude call, no debate."""
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{"role": "user", "content": question}]
    )
    return resp.content[0].text

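def debate_round(agent_id: int, question: str, round_num: int,
                 others: list[AgentAnswer], own_previous: str = "") -> str:
    """Run one agent's turn as a separate, stateless API call."""
    if round_num == 1:
        # Round 1: answer independently, with no view of the other agents.
        prompt = f"Question: {question}\n\nProvide your answer with reasoning."
    else:
        # Later rounds: include the agent's own previous answer along with its
        # peers' answers, so it has a position to defend or revise.
        others_text = "\n\n".join(
            f"Agent {a.agent_id}: {a.answer}" for a in others
        )
        prompt = (
            f"Question: {question}\n\n"
            f"Your previous answer:\n{own_previous}\n\n"
            f"Other agents' answers:\n{others_text}\n\n"
            "Review these answers critically. Correct any mistakes you see, "
            "and give your revised final answer."
        )
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system=SYSTEM_DEBATE,
        messages=[{"role": "user", "content": prompt}]
    )
    return resp.content[0].text

def run_debate(question: str, n_agents: int = 2, n_rounds: int = 2) -> list[str]:
    """Run n_rounds of debate and return each agent's final-round answer."""
    answers: list[AgentAnswer] = []
    for round_num in range(1, n_rounds + 1):
        new_answers = []
        for agent_id in range(1, n_agents + 1):
            # Split the previous round into this agent's answer and its peers'.
            own_previous = next(
                (a.answer for a in answers if a.agent_id == agent_id), ""
            )
            others = [a for a in answers if a.agent_id != agent_id]
            answer = debate_round(agent_id, question, round_num, others, own_previous)
            new_answers.append(AgentAnswer(agent_id, answer))
        # Swap in the new round only after every agent has answered, so all
        # agents in a round see the same previous-round answers.
        answers = new_answers
    return [a.answer for a in answers]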

How to Run It

pip install anthropic
export ANTHROPIC_API_KEY="your-key-here"
python3 multi_agent_debate.py

The script runs three test questions (average speed, biography, rectangle dimensions) through both single-shot and 2-agent 2-round debate, printing intermediate answers so you can watch the correction happen.


The Cost Tradeoff

The core disadvantage of MAD is straightforward: it multiplies your API call count.

Configuration          API Calls   vs Single-Shot
2 agents × 2 rounds    4           4×
3 agents × 2 rounds    6           6×
3 agents × 3 rounds    9           9×
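The call count follows from the nested loops in run_debate: every agent makes one call in every round. A minimal sketch of that arithmetic, where debate_api_calls is a hypothetical helper name and not part of the PoC:

```python
def debate_api_calls(n_agents: int, n_rounds: int) -> int:
    """Each agent makes one call per round, so calls = agents x rounds."""
    if n_agents < 1 or n_rounds < 1:
        raise ValueError("n_agents and n_rounds must both be at least 1")
    return n_agents * n_rounds


for n_agents, n_rounds in [(2, 2), (3, 2), (3, 3)]:
    calls = debate_api_calls(n_agents, n_rounds)
    print(f"{n_agents} agents x {n_rounds} rounds: {calls} calls ({calls}x single-shot)")
```

This counts calls only. Token cost grows somewhat faster, because prompts in round 2 and later also carry the other agents' previous answers.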

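For tasks where accuracy matters more than latency (background jobs, document verification, automated code review), this tradeoff can be worth it. For interactive applications where a user is waiting for a response, the added latency is usually unacceptable: run sequentially as in the PoC, the calls take roughly 4–9× as long, and even if each round's agents are called in parallel, every extra round adds a full round trip.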

A practical middle ground: use MAD only for high-stakes sub-tasks within a larger pipeline. Run single-shot for initial filtering, then trigger debate only on questions the single model rated as low-confidence.
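One way to sketch that gating is shown below. Everything here is hypothetical: answer_with_gate, the confidence callable, and the 0.7 threshold are illustrative choices, and how you obtain a confidence score (a self-rating prompt, a verifier, a heuristic) is left open because the source does not specify it. The model calls are passed in as plain callables so the routing logic can be exercised without an API key.

```python
from typing import Callable, Tuple


def answer_with_gate(
    question: str,
    single_shot: Callable[[str], str],
    debate: Callable[[str], str],
    confidence: Callable[[str, str], float],
    threshold: float = 0.7,  # assumed cutoff; tune on your own data
) -> Tuple[str, bool]:
    """Return (answer, used_debate). Debate runs only for low-confidence drafts."""
    draft = single_shot(question)
    if confidence(question, draft) >= threshold:
        return draft, False
    return debate(question), True


# Stub callables standing in for real model calls.
answer, used_debate = answer_with_gate(
    "What is the average speed?",
    single_shot=lambda q: "37.5 mph",
    debate=lambda q: "36 mph",
    confidence=lambda q, a: 0.4,  # pretend the draft was rated low-confidence
)
print(answer, used_debate)  # 36 mph True
```

With this shape, the expensive path is paid for only on the fraction of questions that fall below the threshold.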


MACA: Internalizing the Debate Signal (2025)

The logical next step after MAD is: if debate helps at inference time, can we train the model to already think that way?

That is what Meta AI and Columbia's LIINC Lab explored in "Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment" (MACA, ICLR 2026 submission). The key idea: run debates, observe which reasoning paths converge across agents, and use that convergence as a reinforcement learning reward signal.

A model post-trained with MACA learns to produce debate-quality reasoning in a single forward pass. Results on the base Llama model:

  • +27.6% GSM8K self-consistency (same model, single pass)
  • +42.7% MathQA multi-agent ensemble accuracy
  • +21.51% individual accuracy on MATH

This is important because it separates the research insight from the deployment constraint. MAD shows that critique improves reasoning. MACA shows that critique-derived signals can be distilled back into the model so it no longer needs 4–9× the API calls to get that benefit.

The MACA codebase is public at facebookresearch/maca and requires a base model you can fine-tune (not accessible via commercial API).


When to Use Debate vs When to Skip It

Debate is worth the cost when:

  • The task has a verifiable correct answer (math, code, logic)
  • You are running async/background workloads where latency is not critical
  • You have seen consistent single-shot failures on a specific question type
  • You want to understand where your prompt is ambiguous (disagreeing agents highlight confusion)

Debate is not worth it when:

  • All agents share the same knowledge gap (the paper itself shows smaller gains on MMLU)
  • The task is open-ended generation (creative writing, style)
  • You need sub-second response times
  • Budget constraints make 4–9× cost increases impractical

A cheaper alternative to full debate: self-consistency sampling (Wang et al. 2022). Generate 5–10 answers independently at temperature > 0, then take the majority vote. You get some of the variance reduction benefit at lower per-token cost, without the critique rounds that drive most of MAD's gains. For tasks requiring active error correction (not just variance reduction), structured debate outperforms self-consistency — but for knowledge tasks, self-consistency is often the better tradeoff.
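Self-consistency can be sketched in a few lines. The sample callable is an assumption: it stands for one independent model call at temperature > 0 that returns an already-normalized short answer. The function name and the default of 5 samples are illustrative.

```python
from collections import Counter
from typing import Callable


def self_consistency(question: str, sample: Callable[[str], str],
                     n_samples: int = 5) -> str:
    """Draw n_samples independent answers and return the most common one."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    votes = Counter(sample(question) for _ in range(n_samples))
    winner, _count = votes.most_common(1)[0]
    return winner


# Stub sampler replaying a fixed sequence in place of real sampled completions.
canned = iter(["36", "37.5", "36", "36", "37.5"])
print(self_consistency("average speed?", lambda q: next(canned)))  # 36
```

Unlike debate, no sample ever sees another sample, so a mistake that most samples share will win the vote unchallenged.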


Implementation Pitfalls

Three things that trip up implementations:

1. Shared context. If you use a single conversation thread and have agents "take turns," they share context and the debate collapses into one agent agreeing with itself. Each agent must be a separate API call with a clean context, or the independence of round 1 is lost.

2. Over-deference. Some models are trained to be helpful and accommodating. They will revise their correct answer to match a wrong peer answer just to "agree." The system prompt needs to explicitly instruct: revise only if the other agent provides a better reasoning path, not just to reach consensus.

3. Premature convergence. Debate often converges in round 2, and rounds 3+ rarely change the answer. In practice, 2 agents × 2 rounds is a sweet spot. Running 3 rounds is mostly for observability.
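Since later rounds rarely change the answer, one option is to stop as soon as the agents agree instead of always running a fixed number of rounds. The sketch below assumes each round yields one short, already-extracted final answer per agent; has_converged, run_until_agreement, and the next_round callable are hypothetical names, not part of the PoC.

```python
from typing import Callable, List, Tuple


def has_converged(final_answers: List[str]) -> bool:
    """True when every agent gave the same answer (ignoring case and padding)."""
    return len({a.strip().lower() for a in final_answers}) == 1


def run_until_agreement(
    next_round: Callable[[List[str]], List[str]],
    max_rounds: int = 3,
) -> Tuple[List[str], int]:
    """Call next_round(previous_answers) until agreement or max_rounds."""
    answers: List[str] = []
    for round_num in range(1, max_rounds + 1):
        answers = next_round(answers)
        if has_converged(answers):
            return answers, round_num
    return answers, max_rounds


# Stub rounds: the agents disagree in round 1 and agree in round 2.
scripted = iter([["37.5 mph", "36 mph"], ["36 mph", "36 mph"], ["unused", "unused"]])
print(run_until_agreement(lambda prev: next(scripted)))  # (['36 mph', '36 mph'], 2)
```

Agreement is not the same as correctness. Combined with the over-deference problem in item 2, agents can converge on a wrong answer, so early stopping saves calls but does not validate the result.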


Q: Does heterogeneous debate (different models) work better?

Often, yes. Using GPT-4o as one agent and Claude Sonnet as another can produce larger gains than two copies of the same model, because the two models have different error distributions. When Agent 1 makes a mistake that Agent 2's training data patterns do not reinforce, Agent 2 is more likely to catch it. The MACA paper's strong results come from homogeneous agents, so they do not measure this effect; treat the size of the mixed-model advantage as something to verify on your own task.
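A mixed-model debate needs a loop that does not care which provider sits behind each agent. The sketch below is an assumption about how you might structure that: each agent is a plain callable from prompt to text, so wrappers around different vendors' SDKs can be plugged in. The Agent alias, mixed_debate, and the prompt wording are illustrative and not taken from the paper or the PoC.

```python
from typing import Callable, Dict

Agent = Callable[[str], str]  # hypothetical interface: prompt in, response text out


def mixed_debate(question: str, agents: Dict[str, Agent],
                 n_rounds: int = 2) -> Dict[str, str]:
    """Run a debate across named agents and return each one's final answer."""
    answers: Dict[str, str] = {}
    for round_num in range(1, n_rounds + 1):
        new_answers: Dict[str, str] = {}
        for name, ask in agents.items():
            if round_num == 1:
                prompt = f"Question: {question}\n\nProvide your answer with reasoning."
            else:
                peers = "\n\n".join(
                    f"{other}: {text}" for other, text in answers.items() if other != name
                )
                prompt = (
                    f"Question: {question}\n\n"
                    f"Your previous answer:\n{answers[name]}\n\n"
                    f"Other agents' answers:\n{peers}\n\n"
                    "Critique these answers and give your revised answer."
                )
            new_answers[name] = ask(prompt)
        # Replace the previous round only after every agent has responded.
        answers = new_answers
    return answers


# Stubs in place of real SDK wrappers: model_a corrects itself once it sees peers.
stub_agents = {
    "model_a": lambda p: "36 mph" if "Other agents' answers" in p else "37.5 mph",
    "model_b": lambda p: "36 mph",
}
print(mixed_debate("average speed?", stub_agents))
# {'model_a': '36 mph', 'model_b': '36 mph'}
```

In real use, each callable would wrap one provider's client and hide its request format, which keeps the debate logic identical whether the agents are the same model or different ones.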

Q: Can I use debate for code generation?

It works well for logic-heavy tasks (algorithm correctness, edge case detection). One agent generates code, a second agent reviews it for bugs, a third optionally tries to break it. The critique round maps naturally onto the code review loop developers already run manually. For boilerplate generation, the gain is minimal.

Q: How do I extract a single answer from multiple debating agents?

The simplest method: after the final round, take a majority vote on the "FINAL ANSWER:" extraction. For open-ended answers, use one additional "judge" call that reads all final-round answers and synthesizes a consensus. This adds one more API call but handles cases where agents answer in different formats.
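A minimal sketch of the vote, assuming agents follow the "FINAL ANSWER:" convention from the system prompt and put the answer on a single line. The helper names, the lowercase normalization, and the choice to return None on a tie are illustrative decisions, not part of the PoC.

```python
import re
from collections import Counter
from typing import List, Optional

# Captures the rest of the line after the marker; "." does not match newlines.
FINAL_RE = re.compile(r"FINAL ANSWER:\s*(.+)", re.IGNORECASE)


def extract_final(text: str) -> Optional[str]:
    """Return the last 'FINAL ANSWER:' value in a response, or None if absent."""
    matches = FINAL_RE.findall(text)
    return matches[-1].strip() if matches else None


def majority_answer(responses: List[str]) -> Optional[str]:
    """Majority vote over extracted answers; None on a tie or when nothing parses."""
    finals = [f.lower() for f in map(extract_final, responses) if f is not None]
    if not finals:
        return None
    top = Counter(finals).most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # tie: hand off to a judge call instead of guessing
    return top[0][0]


responses = [
    "Total time is d/30 + d/45...\nFINAL ANSWER: 36 mph",
    "Harmonic mean applies here.\nFINAL ANSWER: 36 MPH",
    "Average the speeds.\nFINAL ANSWER: 37.5 mph",
]
print(majority_answer(responses))  # 36 mph
```

With only two agents a disagreement is always a tie, so the None branch is the common case there, and that is where the extra judge call earns its cost.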

Q: Does this work for smaller/cheaper models?

The original paper used GPT-3.5-Turbo and GPT-4. The technique works better with models that have some instruction-following capability and can follow a "critique the other answer" instruction reliably. Models below ~7B parameters tend to over-defer or fail to produce substantive critiques. Claude Haiku (the model used in the PoC above) is capable enough that debate produces meaningful corrections.


Key Takeaways

Multi-Agent Debate is a technique that anyone with API access can deploy today without any training or infrastructure changes. The math-benchmark gains (+8pp on GSM8K) are real and reproducible. The cost is real too: 4–9× more API calls per question.

The paper's deepest insight is not the accuracy number but the asymmetry between generation and critique. LLMs can catch, in a peer's reasoning, the same kind of mistake they make themselves. Structuring your prompting pipeline to exploit that asymmetry is the practical takeaway, whether you use full debate, a lightweight review pass, or train a model with MACA to internalize the critic's perspective.

Bottom Line

Multi-Agent Debate is a proven inference-time technique that delivers consistent accuracy gains on reasoning tasks — no fine-tuning required, just more API calls. Use it selectively on high-stakes sub-tasks where +8pp accuracy justifies the 4–9× cost increase.
