ARTICLES · 2026-05-17 · BY EFFLOOW CONTENT FACTORY

Sakana AI RL Conductor: 7B Orchestrator for Multi-Agent AI

How Sakana AI's 7B RL Conductor beats GPT-5 by routing tasks across frontier models via natural-language workflows, plus a proof-of-concept reproduction of the ICLR 2026 paper.
multi-agent reinforcement-learning orchestration sakana-ai llm-routing paper-poc grpo agent-architecture

A 7-billion-parameter model beats GPT-5 — not by being smarter, but by knowing when to ask for help and who to ask. That's the core bet behind Sakana AI's RL Conductor, accepted at ICLR 2026. This article breaks down the paper, explains the architecture, and walks through a minimal Python reproduction of the orchestration concept.

Why This Matters: The Orchestration Problem

Most multi-agent frameworks still rely on human-written routing rules. You look at the task, decide which model handles which step, hard-code the topology in Python or YAML, and call it a system. The problem: task complexity is unpredictable. A simple question and a multi-step algorithmic problem need completely different coordination strategies, and no static rule captures that.

Sakana AI's answer is to train a model to learn routing from scratch — using reinforcement learning to discover which topologies work, with only task-correctness as the reward signal. The result, the RL Conductor, is a Qwen2.5-7B model that outperforms every frontier model in its own worker pool on coding and graduate-level reasoning benchmarks.

That's the core claim, and it has significant implications for how developers build multi-model pipelines.

The RL Conductor: Paper Overview

The paper, titled "Learning to Orchestrate Agents in Natural Language with the Conductor", was published on April 27, 2026 and accepted at ICLR 2026, one of the top peer-reviewed ML conferences. The research team trained a relatively compact model to act as a manager that delegates to much larger, more capable workers.

Architecture: Three Roles

The framework separates responsibilities cleanly:

  1. The Conductor — a small model that designs the workflow but never directly solves tasks
  2. Worker agents — frontier LLMs (GPT-5, Claude Sonnet 4, Gemini 2.5 Pro) and open-source models (DeepSeek-R1-Distill-Qwen-32B, Gemma3-27B, Qwen3-32B) that execute subtasks
  3. The orchestration format — natural language instructions with explicit access lists, not code

The key design choice: the Conductor outputs a structured workflow where each step contains:

  • A natural language instruction targeting a specific aspect of the problem
  • An assigned agent chosen from the worker pool
  • An access list — an explicit set of prior step indices whose results this step can read

The access list is the mechanism that makes this work. Instead of passing every prior output to every subsequent step (expensive, noisy), each worker only receives context from steps it actually needs. A verifier step sees the plan and the code. A code-writing step sees the plan but not any other parallel attempts.
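To make the mechanism concrete, here is a hypothetical three-step plan expressed as plain Python data. The field names and dict layout are our illustration (the paper describes the format in prose, not a literal schema):

```python
# Hypothetical Conductor plan for a coding task. Field names are
# illustrative; the paper describes the format in prose, not a schema.
plan = [
    {"instruction": "Outline an algorithm for the problem.",
     "agent": "gpt-5", "access_list": []},                  # sees nothing
    {"instruction": "Implement the outlined algorithm.",
     "agent": "deepseek-r1-qwen-32b", "access_list": [0]},  # sees the plan
    {"instruction": "Verify the code against the outline.",
     "agent": "claude-sonnet-4", "access_list": [0, 1]},    # sees plan + code
]

# The verifier reads exactly two prior results, not the full transcript.
verifier_context = [plan[i] for i in plan[2]["access_list"]]
assert len(verifier_context) == 2
```

The gating is the point: the implementation step never sees a sibling attempt, and the verifier sees only what it needs to judge.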

Training: GRPO on 960 Problems

Training used Group Relative Policy Optimization (GRPO), a variant of PPO adapted for LLM fine-tuning. Key parameters:

  • Base model: Qwen2.5-7B
  • Training iterations: 200
  • Batch size: 256
  • Rollouts per question: 64 at temperature 1.0
  • Dataset: 960 problems from MATH, MMLU, RLPR, and LiveCodeBench
  • Hardware: 2× NVIDIA H100 80GB GPUs

The training dataset is deliberately small (960 problems) because the RL objective doesn't require a large labeled corpus: the model explores its own routing strategies, and only those that earn task-correctness rewards are reinforced. By the end of training, the model has discovered coordination patterns that human engineers didn't design.
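The "group-relative" part of GRPO is simple to sketch: the rewards from one question's 64 rollouts are normalized against that group's own mean and standard deviation, so no learned value network is needed. A minimal illustration with binary correctness rewards (the full objective also includes the PPO-style clipped policy ratio and a KL penalty, omitted here):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = stdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Suppose 8 of 64 rollouts produced a correct answer (reward 1), the rest 0.
rewards = [1.0] * 8 + [0.0] * 56
advantages = group_relative_advantages(rewards)

# Correct rollouts get a positive advantage, incorrect ones a negative one.
assert advantages[0] > 0 > advantages[-1]
```

Because the baseline is the group itself, a rollout is only reinforced for routing better than the Conductor's other attempts on the same question.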

Performance Results

| Model | GPQA-Diamond | LiveCodeBench | Role |
|---|---|---|---|
| RL Conductor | 87.5% | 83.9% | Orchestrator |
| GPT-5 (worker) | ~84% | ~79% | Worker in pool |
| Claude Sonnet 4 (worker) | ~83% | ~77% | Worker in pool |
| Gemini 2.5 Pro (worker) | ~82% | ~76% | Worker in pool |
| Conductor gain | +3-5% over best worker | SOTA at time of publication | |

The Conductor's GPQA-Diamond score of 87.5% and LiveCodeBench score of 83.9% were state-of-the-art at time of publication, surpassing every individual worker in its pool, including GPT-5. (OpenAI's o-series models weren't in the pool due to cost.) The roughly 3-point gain on GPQA-Diamond is comparable to the entire generational jump from o3 to GPT-5.

How the Orchestration Works in Practice

The Conductor doesn't make one routing decision — it generates a full workflow before any worker is called. For a given task, it decides:

  1. How many steps are needed
  2. Which model handles each step
  3. What instruction each model receives
  4. Which prior outputs each model can read

The critical insight: this is all in natural language, not code. The Conductor isn't writing Python to coordinate workers — it's generating a JSON-like plan with prose instructions. This is what makes it generalizable. The routing logic isn't a handcrafted decision tree; it emerged from reward maximization.

Adaptive Topology: Three Observed Patterns

After training, the Conductor developed three distinct coordination patterns based on task difficulty:

Pattern 1 — 1-Shot (simple tasks)
For a factual question like "What is the capital of Japan?", the Conductor routes directly to one model with a single instruction. No coordination overhead.

Pattern 2 — Parallel Fan-Out (analytical tasks)
For a comparison or analysis task, the Conductor spawns multiple workers in parallel, each receiving an independent instruction. A final aggregator step sees all outputs via its access list and synthesizes the result.

Pattern 3 — Planner → Executor → Verifier (complex tasks)
For hard coding or math problems, the Conductor builds a linear pipeline: one model plans the approach, a second implements it with access to the plan, and a third verifies with access to both plan and code. The model learned this pattern without being told it exists.

Recursive Topologies

One discovery that surprised the researchers: the Conductor can assign itself as a worker. This creates recursive call graphs where a sub-orchestration runs nested within the main workflow. The recursive Conductor call receives its parent's output as additional context, which allows it to either spawn a new workflow or short-circuit and return a direct answer. This is a form of test-time compute scaling that emerges from the RL formulation rather than being designed in.
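One way this recursion can be dispatched is sketched below. This is our own illustration, not the paper's implementation: if a step's assigned agent is the Conductor itself, the executor runs a nested plan-then-solve workflow instead of a single model call (worker replies are stubs here, and the depth cap is our safeguard, not the paper's):

```python
def run_step(instruction: str, agent: str, context: str, depth: int = 0) -> str:
    """Dispatch one step; 'conductor' as the agent triggers a nested workflow."""
    if agent == "conductor" and depth < 2:  # depth cap: illustrative safeguard
        # Nested orchestration: plan the sub-problem, then solve it.
        sub_plan = run_step(f"Plan: {instruction}", "worker-a", context, depth + 1)
        return run_step(f"Solve using plan: {instruction}", "worker-b",
                        sub_plan, depth + 1)
    # Mock worker call; a real system would call the model's API here.
    return f"[{agent} @ depth {depth}] answered: {instruction}"

result = run_step("Prove the lemma, then apply it.", "conductor", "")
```

At the depth cap the recursive call degrades gracefully into a direct answer, mirroring the short-circuit behavior the paper describes.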

Effloow Lab PoC: Reproducing the Core Concept

Effloow Lab built a minimal Python reproduction of the Conductor architecture to validate the data structures and execution logic described in Section 3 of the paper. The actual 7B model weights are not publicly released (the research paper and commercial Fugu product are separate). The PoC uses mock workers to demonstrate the routing concept.

The PoC is in data/lab-runs/sakana-ai-rl-conductor-multi-agent-orchestration-2026.md.

Core Data Structure

The paper's orchestration format maps cleanly to a Python dataclass:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SubTask:
    instruction: str       # natural language task for this worker
    assigned_agent: str    # which model handles this step
    access_list: list[int] # indices of prior steps this worker can read
    result: Optional[str] = None

@dataclass
class ConductorWorkflow:
    task: str
    steps: list[SubTask] = field(default_factory=list)

    def add_step(self, instruction: str, agent: str, access_list: list[int]) -> int:
        idx = len(self.steps)
        self.steps.append(SubTask(instruction, agent, access_list))
        return idx

The access_list is the central mechanism: it's a gating layer that prevents workers from receiving context they don't need. In real deployments, this reduces API costs and avoids long-context noise.

Context Routing

def build_context(workflow: ConductorWorkflow, access_list: list[int]) -> str:
    parts = []
    for idx in access_list:
        if 0 <= idx < len(workflow.steps) and workflow.steps[idx].result:
            parts.append(f"Step {idx} ({workflow.steps[idx].assigned_agent}): {workflow.steps[idx].result}")
    return "\n".join(parts) if parts else ""

This replicates the paper's context mechanism. Each worker only sees what the Conductor explicitly granted access to. For a verifier step with access_list=[0, 1], it sees the plan (step 0) and the implementation (step 1), but nothing else.

Adaptive Plan Building

def classify_task_complexity(task: str) -> str:
    # Mock stand-in: the real Conductor has no such function (the 7B model
    # emits the whole workflow in one pass, routing included). This keyword
    # heuristic only drives the PoC's branching.
    lowered = task.lower()
    if any(k in lowered for k in ("implement", "algorithm", "prove", "debug")):
        return "complex"
    if any(k in lowered for k in ("compare", "analyze", "evaluate")):
        return "medium"
    return "simple"

def build_conductor_plan(task: str) -> ConductorWorkflow:
    workflow = ConductorWorkflow(task=task)
    complexity = classify_task_complexity(task)  # RL-learned in real Conductor

    if complexity == "simple":
        workflow.add_step(f"Answer directly: {task}", "claude-sonnet-4", [])

    elif complexity == "medium":
        step_a = workflow.add_step(f"Analyze technically: {task}", "gpt-5", [])
        step_b = workflow.add_step(f"Analyze practically: {task}", "gemini-2.5-pro", [])
        workflow.add_step("Synthesize both analyses", "claude-sonnet-4", [step_a, step_b])

    else:  # complex
        plan = workflow.add_step(f"Create implementation plan: {task}", "gpt-5", [])
        code = workflow.add_step("Implement the plan", "deepseek-r1-qwen-32b", [plan])
        workflow.add_step("Verify and refine", "claude-sonnet-4", [plan, code])

    return workflow

In the real Conductor, classify_task_complexity doesn't exist as a separate function — the Qwen2.5-7B model generates the full workflow in one forward pass, routing decisions included. The adaptive behavior emerged from RL training, not explicit branching logic.
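Tying the pieces together, the PoC's execution loop can be condensed into a self-contained sketch using plain dicts (worker replies are stubs, not real API calls; the gating logic mirrors the access-list mechanism above):

```python
def mock_worker(agent: str, instruction: str, context: str) -> str:
    # Stub standing in for a real model API call.
    return f"{agent} result for: {instruction}"

def execute(steps: list[dict]) -> list[str]:
    """Run steps in order, giving each worker only its access-listed context."""
    results: list[str] = []
    for step in steps:
        context = "\n".join(results[i] for i in step["access_list"])
        results.append(mock_worker(step["agent"], step["instruction"], context))
    return results

steps = [
    {"instruction": "Plan the solution", "agent": "gpt-5", "access_list": []},
    {"instruction": "Implement the plan", "agent": "deepseek-r1-qwen-32b",
     "access_list": [0]},
    {"instruction": "Verify plan and code", "agent": "claude-sonnet-4",
     "access_list": [0, 1]},
]
results = execute(steps)
```

Because steps run strictly in order and access lists only reference earlier indices, the workflow is a DAG by construction: a later step can never read a result that doesn't yet exist.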

What the PoC Validates

The PoC confirms that the Conductor's output format is well-specified in the paper and implementable. The SubTask data structure, access-list mechanism, and workflow execution model all map cleanly to code. The gap between the PoC and the real system is the routing intelligence itself — 200 GRPO iterations on two H100s to learn what topology works for what task.

From Research to Commercial: Sakana Fugu

Sakana AI productized the Conductor research into Fugu, their commercial multi-agent orchestration platform. Fugu entered public beta on April 24, 2026 and is accessible via an OpenAI-compatible API — meaning developers already using GPT, Claude, or Gemini via API can drop Fugu in with minimal changes.

Fugu ships in two configurations:

  • Fugu Mini — optimized for low-latency orchestration, suitable for interactive applications
  • Fugu Ultra — full model pool utilization for complex reasoning tasks

Benchmark claims for the commercial Fugu include SOTA results on SWE-Pro, GPQA-D, and ALE-Bench. Pricing is not yet publicly disclosed (beta access is by application). The API is OpenAI-compatible, which means standard client libraries work without modification.

The gap between the research Conductor and Fugu is likely significant: Fugu incorporates improvements from the Trinity paper (another Sakana ICLR 2026 paper) and commercial engineering for reliability, latency, and cost control that the research prototype doesn't address.

Practical Implications for Developers

1. The Access List Pattern Is Worth Stealing

Even if you're not training your own Conductor, the access-list pattern for context routing is immediately applicable. Instead of concatenating all previous agent outputs into every subsequent call, explicitly define which outputs each step receives. This:

  • Reduces token costs for long pipelines
  • Prevents context pollution (a verifier doesn't need to see unrelated parallel attempts)
  • Makes the pipeline logic auditable — you can trace exactly what each step could see

2. RL-Based Routing Is Not Yet DIY

The training setup — 2× H100s, GRPO, 960 curated problems — is accessible by research standards but not typical startup infrastructure. The more actionable takeaway is the architectural pattern: separate the orchestration logic from the execution logic, and design workflows as data structures, not hard-coded call chains.

3. Small Orchestrators May Beat Large Direct Solvers

The Conductor demonstrates a principle that's becoming increasingly relevant: at sufficient scale, meta-cognitive coordination (which model, which subtask, which context) may be more valuable than raw parameter count. A 7B model making good routing decisions outperforms 100B+ models solving tasks directly.

This has cost implications: if a small, cheap orchestrator can route to cheaper workers for simple subtasks and only invoke expensive frontier models when necessary, the total cost per solved problem drops substantially.

4. Recursive Self-Delegation Changes Test-Time Compute

The recursive topology — where the Conductor calls itself as a worker — is a practical form of test-time scaling that doesn't require beam search or chain-of-thought length control. The model simply decides whether a subtask warrants another orchestration layer or a direct answer. If this pattern generalizes, it suggests that orchestration models might improve monotonically with additional "thinking budget" in a way that's more controllable than raw token generation.

Common Questions

Q: Is the Conductor model available as open weights?

No. As of May 2026, the research Qwen2.5-7B Conductor model from the paper is not publicly released. The commercial Fugu product is available via API in beta. The paper provides the training methodology in detail, so teams with H100 access could reproduce training.

Q: How does this compare to LangGraph or AutoGen orchestration?

LangGraph and AutoGen use statically defined graphs or hand-written coordination logic. The Conductor generates the workflow dynamically per task. The access-list mechanism is also more explicit than typical message-passing in multi-agent frameworks, which often give agents full conversation history.

Q: What happens when the Conductor makes a bad routing decision?

The paper doesn't provide extensive ablations on failure modes. In the mock PoC, routing errors are deterministic (wrong keyword match). In the real system, the RL training minimizes incorrect routing over 64 rollouts per training example, but no routing system achieves 100% accuracy. The verifier step in complex pipelines partially mitigates this by catching execution errors.

Q: Can I use Fugu with Claude or local models?

Fugu's OpenAI-compatible API abstracts the worker pool selection. According to the Fugu beta announcement, it coordinates across OpenAI, Anthropic, and Google frontier models. Local models (DeepSeek, Gemma) were in the research Conductor's pool but are not confirmed in Fugu's commercial configuration.

Q: Is training reproducible without H100s?

Potentially with A100s or equivalent. The bottleneck is the 64 rollouts per training step (each rollout involves calling multiple frontier model workers), not raw GPU compute for the 7B base model itself. Training cost is dominated by API calls to GPT-5, Claude, and Gemini during GRPO rollouts.
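To make the rollout bottleneck concrete: multiplying out the paper's reported hyperparameters (and assuming each iteration consumes one fresh batch, which the paper doesn't spell out) shows why worker API traffic, not GPU compute, dominates cost:

```python
iterations = 200   # training iterations
batch_size = 256   # questions per iteration
rollouts = 64      # workflows sampled per question

total_workflows = iterations * batch_size * rollouts
# Each sampled workflow triggers one or more frontier-model calls,
# so total API traffic is at least this large.
print(total_workflows)  # -> 3276800
```

Over three million sampled workflows, each fanning out to frontier workers, is the real budget line; the 7B policy update itself fits comfortably on the two H100s.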

Key Takeaways

  • Sakana AI's RL Conductor trains a 7B model to route tasks across frontier models via natural language workflows — outperforming every model in its own pool on GPQA-Diamond (87.5%) and LiveCodeBench (83.9%)
  • The core mechanism is an access list per subtask: each worker only receives context from explicitly listed prior steps, preventing noise and reducing token costs
  • The Conductor learns three topology patterns (1-shot, parallel fan-out, planner-executor-verifier) through RL reward maximization — no human-designed routing rules
  • Recursive self-delegation allows test-time scaling without beam search: the Conductor can call itself as a worker for sub-problems
  • The commercial product Sakana Fugu (beta) packages this research into an OpenAI-compatible API, available by application
  • The access-list pattern and workflow-as-data-structure approach are immediately applicable to developer-built multi-agent pipelines, regardless of whether you use Fugu

Bottom Line

The RL Conductor is the most rigorous demonstration to date that orchestration intelligence — not model size — can be the decisive factor in multi-agent performance. The access-list routing pattern and natural language workflow format are practical ideas that don't require a Conductor model to adopt. Watch Fugu's beta rollout for early pricing and API access.
