Sakana AI RL Conductor Multi-Agent Orchestration 2026

Date: 2026-05-17
Track: paper-poc
Slug: sakana-ai-rl-conductor-multi-agent-orchestration-2026
Paper: "Learning to Orchestrate Agents in Natural Language with the Conductor" (ICLR 2026)
ArXiv: https://arxiv.org/abs/2512.04388

Objective

Reproduce the core Conductor orchestration concept in minimal Python: the natural-language workflow generation, access-list-based context routing, and adaptive topology selection (chain vs. parallel vs. planner-executor-verifier).

The actual Conductor model (Qwen2.5-7B fine-tuned with GRPO) has not been released as open weights. This PoC reproduces the architecture concept with mock workers to illustrate how the routing decision logic works.

Environment

Python 3.11 (system)
No external packages required for mock PoC

PoC Code

# conductor_poc.py — Minimal reproduction of the RL Conductor orchestration concept
# Based on: arxiv.org/abs/2512.04388

from dataclasses import dataclass, field
from typing import Optional
import json

@dataclass
class SubTask:
    """A single step in the Conductor's natural language workflow."""
    instruction: str
    assigned_agent: str
    access_list: list[int]  # indices of prior subtasks whose output this step can see
    result: Optional[str] = None

@dataclass
class ConductorWorkflow:
    """The orchestration plan the Conductor emits before delegating to workers."""
    task: str
    steps: list[SubTask] = field(default_factory=list)

    def add_step(self, instruction: str, agent: str, access_list: list[int]) -> int:
        idx = len(self.steps)
        self.steps.append(SubTask(instruction=instruction, assigned_agent=agent, access_list=access_list))
        return idx

    def to_dict(self) -> dict:
        return {
            "task": self.task,
            "steps": [
                {
                    "step": i,
                    "instruction": s.instruction,
                    "agent": s.assigned_agent,
                    "access_list": s.access_list,
                    "result": s.result
                }
                for i, s in enumerate(self.steps)
            ]
        }

# --- Mock Worker Agents ---
WORKERS = {
    "gpt-5": lambda instruction, context: f"[GPT-5] {instruction[:60]}... → synthetic answer A",
    "claude-sonnet-4": lambda instruction, context: f"[Claude-Sonnet-4] {instruction[:60]}... → synthetic answer B",
    "gemini-2.5-pro": lambda instruction, context: f"[Gemini-2.5-Pro] {instruction[:60]}... → synthetic answer C",
    "deepseek-r1-qwen-32b": lambda instruction, context: f"[DeepSeek-R1] {instruction[:60]}... → open-source answer",
    "conductor": None,  # recursive self-call placeholder
}

def build_context(workflow: ConductorWorkflow, access_list: list[int]) -> str:
    """Build context string from prior step results, gated by the access list."""
    parts = []
    for idx in access_list:
        if 0 <= idx < len(workflow.steps) and workflow.steps[idx].result:
            parts.append(f"Step {idx} ({workflow.steps[idx].assigned_agent}): {workflow.steps[idx].result}")
    return "\n".join(parts) if parts else ""

def execute_workflow(workflow: ConductorWorkflow) -> str:
    """Execute each step, passing only the access-listed context."""
    for i, step in enumerate(workflow.steps):
        context = build_context(workflow, step.access_list)
        worker_fn = WORKERS.get(step.assigned_agent)
        if worker_fn is None:
            step.result = "[Conductor recursive call not implemented in mock]"
        else:
            step.result = worker_fn(step.instruction, context)
        print(f"  Step {i} → {step.assigned_agent}: {step.result}")
    return workflow.steps[-1].result if workflow.steps else ""

# --- Adaptive Topology Selection (the key insight from the paper) ---

def classify_task_complexity(task: str) -> str:
    """
    The real Conductor uses RL to learn this implicitly.
    Here we hard-code simple heuristics to demonstrate the three topology modes.
    """
    task_lower = task.lower()
    if any(w in task_lower for w in ["solve", "implement", "write code", "algorithm", "optimize"]):
        return "complex"
    if any(w in task_lower for w in ["compare", "analyze", "evaluate"]):
        return "medium"
    return "simple"

def build_conductor_plan(task: str) -> ConductorWorkflow:
    """
    Simulates what the RL-trained Conductor would generate.
    Simple → single agent.
    Medium → parallel agents + aggregator.
    Complex → planner → coder → verifier pipeline.
    """
    workflow = ConductorWorkflow(task=task)
    complexity = classify_task_complexity(task)

    if complexity == "simple":
        # 1-shot: route to one agent
        workflow.add_step(
            instruction=f"Answer this directly: {task}",
            agent="claude-sonnet-4",
            access_list=[]
        )

    elif complexity == "medium":
        # Parallel fan-out: two agents work independently, third aggregates
        step_a = workflow.add_step(
            instruction=f"Analyze from a technical perspective: {task}",
            agent="gpt-5",
            access_list=[]
        )
        step_b = workflow.add_step(
            instruction=f"Analyze from a practical/user perspective: {task}",
            agent="gemini-2.5-pro",
            access_list=[]
        )
        workflow.add_step(
            instruction="Synthesize both analyses into a final recommendation",
            agent="claude-sonnet-4",
            access_list=[step_a, step_b]
        )

    else:
        # Planner → Executor → Verifier pipeline
        plan_step = workflow.add_step(
            instruction=f"Break this into a step-by-step implementation plan: {task}",
            agent="gpt-5",
            access_list=[]
        )
        code_step = workflow.add_step(
            instruction="Implement each step of the plan in working code",
            agent="deepseek-r1-qwen-32b",
            access_list=[plan_step]
        )
        workflow.add_step(
            instruction="Verify correctness, identify edge cases, and refine the implementation",
            agent="claude-sonnet-4",
            access_list=[plan_step, code_step]
        )

    return workflow

# --- Demo Run ---

test_tasks = [
    "What is the capital of Japan?",
    "Compare GPT-5 and Claude Sonnet 4 for code generation tasks",
    "Implement an efficient algorithm to find all prime numbers up to N using a segmented sieve",
]

for task in test_tasks:
    print(f"\n{'='*60}")
    print(f"TASK: {task}")
    workflow = build_conductor_plan(task)
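    n_steps = len(workflow.steps)
    # Pluralize so a single-step plan prints "1 step", matching the sample output
    print(f"TOPOLOGY: {classify_task_complexity(task)} ({n_steps} step{'' if n_steps == 1 else 's'})")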
    print(json.dumps(workflow.to_dict(), indent=2))
    print("\nEXECUTION:")
    result = execute_workflow(workflow)
    print(f"\nFINAL OUTPUT: {result}")

Mock Output (conceptual and abridged; not from real Conductor weights)

============================================================
TASK: What is the capital of Japan?
TOPOLOGY: simple (1 step)
{
  "task": "What is the capital of Japan?",
  "steps": [
    {
      "step": 0,
      "instruction": "Answer this directly: What is the capital of Japan?",
      "agent": "claude-sonnet-4",
      "access_list": [],
      "result": null
    }
  ]
}

EXECUTION:
  Step 0 → claude-sonnet-4: [Claude-Sonnet-4] Answer this directly: What is the capital of Japan?... → synthetic answer B

FINAL OUTPUT: [Claude-Sonnet-4] Answer this directly: What is the capital of Japan?... → synthetic answer B

============================================================
TASK: Compare GPT-5 and Claude Sonnet 4 for code generation tasks
TOPOLOGY: medium (3 steps)

EXECUTION:
  Step 0 → gpt-5: [GPT-5] Analyze from a technical perspective: Compare GPT-5... → synthetic answer A
  Step 1 → gemini-2.5-pro: [Gemini-2.5-Pro] Analyze from a practical/user perspective... → synthetic answer C
  Step 2 → claude-sonnet-4: [Claude-Sonnet-4] Synthesize both analyses... → synthetic answer B

============================================================
TASK: Implement an efficient algorithm to find all primes...
TOPOLOGY: complex (3 steps — planner → coder → verifier)

EXECUTION:
  Step 0 → gpt-5: [GPT-5] Break this into a step-by-step implementation plan... → synthetic answer A
  Step 1 → deepseek-r1-qwen-32b: [DeepSeek-R1] Implement each step... → open-source answer
  Step 2 → claude-sonnet-4: [Claude-Sonnet-4] Verify correctness... → synthetic answer B

Key Findings

What the PoC reproduces:

The core data structure: SubTask with instruction, assigned_agent, access_list
Adaptive topology selection: 1-shot → parallel fan-out → planner-executor-verifier
Access-list-based context gating (only referenced prior steps are visible to each worker)
The multi-model pool architecture (mock workers for GPT-5, Claude, Gemini, DeepSeek)
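
The fan-out and pipeline topologies listed above differ only in their access lists, so the available parallelism can be read directly off a plan. The following is a minimal sketch, not part of the paper or the PoC: `Step` and `execution_waves` are hypothetical names, and `Step` is a trimmed stand-in for `SubTask`. It groups steps into waves, where every step in a wave depends only on steps in earlier waves and could in principle be dispatched concurrently.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical stand-in for the PoC's SubTask (only the fields needed here)."""
    agent: str
    access_list: list[int]


def execution_waves(steps: list[Step]) -> list[list[int]]:
    """Group step indices into waves; a step lands one wave after its deepest dependency."""
    depth: list[int] = []
    for i, step in enumerate(steps):
        # Only earlier steps count as dependencies, mirroring the sequential mock executor.
        deps = [d for d in step.access_list if 0 <= d < i]
        depth.append(1 + max((depth[d] for d in deps), default=-1))
    waves: list[list[int]] = [[] for _ in range(max(depth, default=-1) + 1)]
    for i, d in enumerate(depth):
        waves[d].append(i)
    return waves


# Same access lists as the "medium" and "complex" plans in the PoC.
fan_out = [Step("gpt-5", []), Step("gemini-2.5-pro", []), Step("claude-sonnet-4", [0, 1])]
pipeline = [Step("gpt-5", []), Step("deepseek-r1-qwen-32b", [0]), Step("claude-sonnet-4", [0, 1])]

print(execution_waves(fan_out))   # [[0, 1], [2]]
print(execution_waves(pipeline))  # [[0], [1], [2]]
```

The PoC's `execute_workflow` runs steps strictly in order, so this grouping only illustrates what a concurrent executor could exploit.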

What the real Conductor adds (NOT reproduced):

The 7B Qwen2.5 model weights trained with GRPO to learn these routing decisions
GRPO reward signal: task correctness + output format adherence
64 rollouts per training question for exploration-exploitation balance
Recursive self-calling (Conductor as its own worker)
Real API calls to frontier models
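
As a rough illustration of the training signal listed above, GRPO scores each rollout relative to the other rollouts sampled for the same question rather than against a learned value function. The sketch below assumes a reward of 1.0 for a correct answer plus a small format bonus. The `format_weight` value, the function names, and the use of the population standard deviation are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def rollout_reward(correct: bool, well_formatted: bool, format_weight: float = 0.1) -> float:
    """Assumed reward shape: 1.0 for a correct answer plus a small bonus for valid format."""
    return float(correct) + format_weight * float(well_formatted)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one question's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of 4 rollouts for one question (the notes above list 64 per question).
group = [
    rollout_reward(correct=True, well_formatted=True),
    rollout_reward(correct=False, well_formatted=True),
    rollout_reward(correct=False, well_formatted=False),
    rollout_reward(correct=True, well_formatted=True),
]
advantages = group_relative_advantages(group)
for reward, adv in zip(group, advantages):
    print(f"reward={reward:.2f}  advantage={adv:+.3f}")
```

Rollouts that beat their group's average get a positive advantage and the rest a negative one. When all rollouts in a group score the same, every advantage is zero and that question contributes no gradient signal.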

Limitations:

Routing heuristic is keyword-based, not RL-learned
No actual model calls — all outputs are mocked
Real training required 2× H100 80GB GPUs, 200 GRPO iterations
Qwen2.5-7B base weights required for actual training reproduction

Architecture validation: The data structures and execution logic match the paper's description in Section 3 (Orchestration Format) and Figure 2 (Topology Examples).
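
One way to check a plan against this structure before executing it is a small validator over the `steps` list produced by `to_dict()`. This is a hypothetical helper (`validate_plan` is not in the PoC or the paper), and it assumes that an access list may reference only earlier steps, which is how the sequential mock executor behaves. The PoC's `build_context` silently skips out-of-range or not-yet-run references, so a check like this would surface them instead.

```python
def validate_plan(steps: list[dict], known_agents: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the plan looks well-formed."""
    problems: list[str] = []
    for i, step in enumerate(steps):
        if step["agent"] not in known_agents:
            problems.append(f"step {i}: unknown agent {step['agent']!r}")
        for ref in step["access_list"]:
            # A step may only read outputs of steps that ran before it.
            if not (0 <= ref < i):
                problems.append(f"step {i}: access_list entry {ref} is not an earlier step")
    return problems


agents = {"gpt-5", "claude-sonnet-4", "gemini-2.5-pro", "deepseek-r1-qwen-32b"}

good_plan = [
    {"agent": "gpt-5", "access_list": []},
    {"agent": "deepseek-r1-qwen-32b", "access_list": [0]},
    {"agent": "claude-sonnet-4", "access_list": [0, 1]},
]
bad_plan = [
    {"agent": "gpt-5", "access_list": [1]},        # forward reference
    {"agent": "unknown-model", "access_list": [0]},  # agent not in the pool
]

print(validate_plan(good_plan, agents))
for problem in validate_plan(bad_plan, agents):
    print(problem)
```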

Sources