ARTICLES ·2026-06-01 ·BY EFFLOOW CONTENT FACTORY

MARLIN: Multi-Agent RL Cuts LLM Inference Carbon by 33%

MARLIN (arXiv:2605.13496) uses multi-agent RL to co-optimize LLM inference latency, carbon emissions, water use, and cost across geo-distributed datacenters.
reinforcement-learning sustainability llm-inference multi-agent cloud-infrastructure paper-poc

If your team runs LLM inference at any meaningful scale, you already know the compute bill. What you may not have budgeted for is the regulatory bill coming behind it.

The EU's Energy Efficiency Directive already requires annual PUE and water-usage reporting for datacenters above 500 kW IT power demand. Germany's EnEfG mandates a PUE ceiling of 1.2 for new datacenter builds from July 2026. A broader EU Datacenter Energy Efficiency Package was planned for adoption on June 3, 2026, introducing a formal rating scheme and a roadmap toward minimum performance standards. The EU AI Act layers voluntary codes of conduct on energy-efficient inference on top of all that.

Meanwhile, the hardware picture keeps getting harder. Global datacenter electricity consumption hit approximately 485 TWh in 2025, up 50% from the prior year according to IEA data. AI-focused compute demand is projected to more than quadruple by 2030. A single 100-word AI prompt consumes roughly 519 ml of water in cooling load. US datacenters directly consumed 17.4 billion gallons of water in 2023; projections for 2028 run as high as 73 billion gallons.

The pressure is real, and it's arriving on a known schedule. That's the backdrop for a paper that landed on arXiv in May 2026 and deserves more attention than it's gotten.

What MARLIN Is

MARLIN (Multi-Agent Game-Theoretic Reinforcement Learning for Sustainable LLM Inference in Cloud Datacenters, arXiv:2605.13496, submitted May 13, 2026) comes out of Colorado State University and Hewlett Packard Labs. The core claim: you can jointly optimize time-to-first-token (TTFT), carbon emissions, water usage, and energy cost for LLM inference across geo-distributed datacenters by framing the scheduling problem as a multi-agent game and training specialized RL agents to negotiate toward a stable allocation.

The paper benchmarks against eight prior frameworks: Helix, Splitwise, NSGA-II, PerLLM, and SLIT as heuristic baselines, plus Q-Learning, DDQN, and Actor-Critic as RL baselines. The experimental setup used real-world Azure traces for GPT-3 and GPT-4, plus Llama-7B and Llama-70B workloads, across 8 geo-distributed datacenters with 1,000 compute nodes each, running heterogeneous NVIDIA A100 and H100 configurations.

The headline results for MARLIN-Balanced versus the best-performing baseline across all frameworks:

  • TTFT: at least 18% reduction
  • Carbon emissions: at least 33% reduction
  • Water usage: at least 43% reduction
  • Energy costs: at least 11% reduction

Against the SLIT baseline specifically, the numbers are more dramatic: 61% carbon reduction, 81% water reduction, 61% cost reduction, with a 2.3% TTFT improvement. At 12 datacenters in the scalability test, carbon drops 65% and water drops 82% versus SLIT. Pareto hypervolume (a standard metric for multi-objective optimizer coverage) comes in at 1.1251 for MARLIN versus 0.7681 for DDQN, the strongest RL baseline.
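To make the hypervolume metric concrete, here is a minimal sketch for the two-objective case only. It assumes both objectives are maximized and measures the area a front dominates relative to a reference point. The function name, the reference point, and the sample fronts are illustrative assumptions, not values or code from the paper, whose four-objective computation will differ.

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by `points` above `ref`, with both objectives maximized."""
    # Ignore points that do not beat the reference point on both axes.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest x to the smallest; add a strip whenever y improves.
    pts.sort(key=lambda p: p[0], reverse=True)
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


# Two made-up fronts: front_a reaches further on both axes than front_b.
front_a = [(1.0, 0.3), (0.6, 0.8), (0.2, 1.0)]
front_b = [(0.9, 0.2), (0.5, 0.6)]

hv_a = hypervolume_2d(front_a)
hv_b = hypervolume_2d(front_b)
print(f"front_a: {hv_a:.2f}, front_b: {hv_b:.2f}")
```

A larger hypervolume means the optimizer's solutions cover more of the objective space, which is why a single number can rank two multi-objective schedulers.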

One number worth anchoring everything else to: inference now accounts for up to 90% of total LLM lifecycle energy use, dwarfing training. That figure comes from the MARLIN abstract itself, and it matches the broader literature. Optimizing training efficiency is a one-time gain. Optimizing inference scheduling is a continuous lever on every request.

The Architecture: Two Phases, Four Agents

MARLIN uses Soft Actor-Critic (SAC) as its RL algorithm, augmented with Feature-wise Linear Modulation (FiLM) layers and Hindsight Experience Replay (HER). The architecture splits into two phases.

Phase 1: Independent specialization. Each agent owns one objective (latency, carbon, water, or cost) and generates a scheduling plan that maximizes its own metric without regard for the others. These agents train separately and develop domain expertise: the carbon agent learns which datacenter regions run on greener grids, the latency agent learns to route to low-queue-depth nodes with fast prefill capacity.

Phase 2: Game-theoretic negotiation. The four proposals go into a negotiation layer that blends them using utility-scaled weighted voting. Each agent holds a capital reserve. If a blended plan degrades that agent's objective beyond a threshold, the agent can exercise a veto that pulls the final allocation back toward its own proposal (the paper's settings are 150 capital units and a pull strength of 0.5). The paper grounds this in individual rationality theory: the system converges to an allocation that no single agent prefers to abandon unilaterally. That is a Nash equilibrium.
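The negotiation step can be sketched roughly as follows. This is one possible reading of the description above, not the paper's implementation. Treating 150 as a starting reserve, the veto price (VETO_COST), the loss tolerance (TOLERANCE), weighting each proposal by its owner's own utility, and the two example agents are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

PULL_STRENGTH = 0.5      # value quoted in the article
INITIAL_CAPITAL = 150.0  # value quoted in the article; using it as a starting reserve is an assumption
VETO_COST = 50.0         # hypothetical price of one veto
TOLERANCE = 0.2          # hypothetical utility loss an agent accepts before vetoing


@dataclass
class Agent:
    name: str
    scores: list[float]    # per-datacenter desirability for this objective, in [0, 1]
    proposal: list[float]  # traffic share per datacenter this agent would pick alone
    capital: float = INITIAL_CAPITAL

    def utility(self, allocation):
        return sum(s * a for s, a in zip(self.scores, allocation))


def blend(agents):
    """Weighted vote: each proposal is weighted by its owner's utility for it."""
    weights = [ag.utility(ag.proposal) for ag in agents]
    total = sum(weights)
    n = len(agents[0].proposal)
    return [sum(w * ag.proposal[i] for w, ag in zip(weights, agents)) / total
            for i in range(n)]


def negotiate(agents):
    allocation = blend(agents)
    for ag in agents:
        loss = ag.utility(ag.proposal) - ag.utility(allocation)
        if loss > TOLERANCE and ag.capital >= VETO_COST:
            ag.capital -= VETO_COST
            # The veto pulls the plan part of the way back toward this agent's proposal.
            allocation = [(1 - PULL_STRENGTH) * a + PULL_STRENGTH * p
                          for a, p in zip(allocation, ag.proposal)]
    return allocation


latency = Agent("latency", scores=[1.0, 0.6, 0.2], proposal=[1.0, 0.0, 0.0])
carbon = Agent("carbon", scores=[0.1, 0.5, 1.0], proposal=[0.0, 0.0, 1.0])
final = negotiate([latency, carbon])
print(final, latency.capital, carbon.capital)
```

In this toy run both agents veto once, and because vetoes are applied in sequence the later one has more influence on the result. A single pass like this does not demonstrate convergence to an equilibrium; it only shows how capital, a loss tolerance, and a pull strength could interact.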

A regression-based workload predictor feeds state inputs to all agents, achieving over 90% accuracy at microsecond latency. The predictor is what makes real-time scheduling practical: agents need to act on incoming inference requests without waiting for heavyweight inference from the predictor itself.

The key insight is that these objectives trade off in time and geography simultaneously, not just in the abstract. A datacenter in eu-green might have near-zero marginal carbon intensity at 2 AM when the grid is running on wind, but its transatlantic latency makes it a bad TTFT choice for a US user at peak hours. The game-theoretic layer lets agents express those region- and time-specific preferences through their capital reserves rather than forcing the system designer to hand-tune a static weight vector.

The Nash Equilibrium in Plain English

Most multi-objective schedulers solve this with a fixed weight vector: something like "70% TTFT, 20% carbon, 10% cost." The weights encode someone's prior judgment about what matters. That judgment goes stale as grid conditions shift, GPU availability changes, or traffic patterns move across time zones.
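A fixed-weight scheduler of this kind reduces to a few lines. The sketch below uses three hypothetical options with illustrative normalized scores (higher is better) and shows how the chosen region depends entirely on the weight vector someone picked in advance.

```python
# (name, speed_score, carbon_score, cost_score); illustrative values, higher is better.
OPTIONS = [
    ("us-west",  1.00, 0.30, 0.55),
    ("eu-green", 0.62, 1.00, 0.00),
    ("us-east",  0.00, 0.14, 1.00),
]


def pick(weights):
    """Return the name of the option with the highest weighted-sum score."""
    best = max(OPTIONS, key=lambda o: sum(w * s for w, s in zip(weights, o[1:])))
    return best[0]


latency_first = pick((0.7, 0.2, 0.1))  # the "70% TTFT, 20% carbon, 10% cost" vector
carbon_first = pick((0.2, 0.7, 0.1))   # same options, different prior judgment
print(latency_first, carbon_first)
```

Nothing in the scores changed between the two calls. Only the weights did, which is the sense in which a static vector encodes a judgment that can go stale.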

MARLIN's approach is different. Rather than one policy optimizing a weighted sum, you have four agents negotiating. The equilibrium the system converges to has a specific property: given what the other agents are doing, no single agent can improve its own outcome by switching to a different allocation. In game theory, that's a Nash equilibrium, and it's stable in a way that a fixed-weight scalarization is not.
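The equilibrium property can be checked mechanically in a small game. The sketch below uses a hypothetical two-player game (a latency-minded player and a carbon-minded player, each choosing "near" or "green") with made-up payoffs. It tests every action profile for the condition described above: no player gains by deviating alone.

```python
from itertools import product

ACTIONS = ("near", "green")

# PAYOFFS[(a0, a1)] = (payoff to player 0, payoff to player 1); made-up numbers.
PAYOFFS = {
    ("near", "near"): (3, 1),
    ("near", "green"): (1, 1),
    ("green", "near"): (0, 0),
    ("green", "green"): (2, 3),
}


def is_nash(profile):
    """True if neither player can raise its own payoff by switching action alone."""
    for player in (0, 1):
        current = PAYOFFS[profile][player]
        for alt in ACTIONS:
            deviated = list(profile)
            deviated[player] = alt
            if PAYOFFS[tuple(deviated)][player] > current:
                return False
    return True


equilibria = [p for p in product(ACTIONS, repeat=2) if is_nash(p)]
print(equilibria)
```

This toy game has two equilibria, a reminder that a game can have more than one stable outcome, so which one a negotiation reaches depends on the dynamics and not only on the payoffs.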

The capital reserve mechanic is what keeps the equilibrium from being trivially dominated by the latency agent. Left unchecked, a speed-only objective would always win, since it's the most immediately measurable user-facing metric. Carbon and water agents accumulate capital over time by having their preferences satisfied. That capital funds future veto rights, giving sustainability objectives real leverage in the negotiation rather than a token weight in a sum.

Pareto Tradeoff Geometry: What the PoC Showed

Effloow Lab ran a simplified Nash equilibrium simulation inspired by the MARLIN scheduling concept. The full evidence note is at data/lab-runs/marlin-multi-agent-rl-sustainable-llm-inference-poc-2026.md. The PoC hand-coded 8 datacenter scheduling scenarios across four regions (us-east, us-west, eu-green, and asia-spot) with fixed TTFT, carbon, and cost values drawn from plausible inference infrastructure ranges. Three agents (speed, carbon, cost) scored every scenario via linear normalization. A Pareto frontier was computed, then a best-response iteration loop simulated a simplified version of the capital-reserve mechanic. The PoC ran on Python 3.12.8 using only the standard library.

No ML training occurred. No real inference workloads ran. This does not reproduce MARLIN's actual system. What it does illustrate is the tradeoff geometry.

# Eight synthetic datacenter scenarios, one tuple each:
# (region, ttft_s, carbon_kgco2, cost_usd, gpu_util, batch_size)
SCENARIOS = [
    ("us-west",   0.78, 0.72, 0.09, 90, 32),
    ("us-east",   0.82, 0.91, 0.08, 85, 28),
    ("us-west",   0.95, 0.68, 0.10, 80, 24),
    ("eu-green",  1.05, 0.28, 0.15, 75, 20),
    ("eu-green",  1.08, 0.29, 0.14, 72, 22),
    ("eu-green",  1.10, 0.31, 0.13, 70, 18),
    ("asia-spot", 1.35, 0.55, 0.05, 65, 16),
    ("us-east",   1.50, 0.82, 0.04, 60, 12),
]

def normalize(values):
    mn, mx = min(values), max(values)
    return [(mx - v) / (mx - mn) if mx != mn else 1.0 for v in values]

speed_scores  = normalize([s[1] for s in SCENARIOS])   # lower TTFT is better
carbon_scores = normalize([s[2] for s in SCENARIOS])   # lower carbon is better
cost_scores   = normalize([s[3] for s in SCENARIOS])   # lower cost is better

def is_dominated(i, score_matrix):
    for j in range(len(SCENARIOS)):
        if j == i:
            continue
        if all(score_matrix[obj][j] >= score_matrix[obj][i] for obj in range(3)) and \
           any(score_matrix[obj][j] >  score_matrix[obj][i] for obj in range(3)):
            return True
    return False

score_matrix = [speed_scores, carbon_scores, cost_scores]
pareto_front = [i for i in range(len(SCENARIOS))
                if not is_dominated(i, score_matrix)]

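# Each agent's utility for every scenario (higher is better).
agent_utils = {"speed": speed_scores, "carbon": carbon_scores, "cost": cost_scores}
# A move is blocked if it costs any other agent this much normalized utility or more.
VETO_THRESHOLD = 0.25
current = pareto_front[0]

# Best-response iteration: each agent in turn tries to move to its favorite scenario.
for _ in range(20):
    moved = False
    for agent, scores in agent_utils.items():
        best_for_agent = max(pareto_front, key=lambda i: scores[i])
        if best_for_agent != current:
            # Utility every other agent would give up if the move went through.
            other_losses = [
                agent_utils[other][current] - agent_utils[other][best_for_agent]
                for other in agent_utils if other != agent
            ]
            if all(loss < VETO_THRESHOLD for loss in other_losses):
                current = best_for_agent
                moved = True
    # A full round with no accepted move is a fixed point, so stop early.
    if not moved:
        break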

r, ttft, carbon, cost, util, batch = SCENARIOS[current]
print(f"Nash equilibrium: {r}, TTFT={ttft}s, Carbon={carbon}kgCO2, Cost=${cost}")
print(f"Agent scores: speed={speed_scores[current]:.3f}, "
      f"carbon={carbon_scores[current]:.3f}, cost={cost_scores[current]:.3f}")

Output from the PoC run:

Nash equilibrium (stable allocation):
  Region:  us-west
  TTFT:    0.78s
  Carbon:  0.72 kgCO2
  Cost:    $0.09
  GPU util: 90%  Batch size: 32

  Agent scores at equilibrium:
    speed  : 1.000
    carbon : 0.302
    cost   : 0.545

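All 8 synthetic scenarios landed on the Pareto frontier; no scenario was dominated. The equilibrium settled on us-west, the fastest allocation. The eu-green scenarios achieved near-perfect carbon scores (0.95 to 1.00), but the carbon agent could not pull the allocation toward them: in the simplified loop, moving from us-west to the best eu-green scenario would cost the speed agent 0.375 of normalized score, above the 0.25 veto threshold, so the move was blocked.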

That tension is exactly what MARLIN claims to resolve through learned capital accumulation. With sufficient training on real traffic, the carbon agent builds reserves during low-load windows and exercises pull strength during high-load periods when eu-green infrastructure has spare capacity. The PoC demonstrates the geometry is structurally coherent; it does not validate the paper's specific quantitative claims.
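One way to probe that tension is to vary the veto threshold in the same best-response loop. The sketch below was not part of the Effloow Lab run; it reuses the eight scenarios (TTFT, carbon, and cost only) and the same normalization, wrapped in a hypothetical `best_response` helper, and relies on the finding above that every scenario is Pareto-optimal.

```python
# (region, ttft_s, carbon_kgco2, cost_usd), copied from the PoC scenarios.
SCENARIOS = [
    ("us-west",   0.78, 0.72, 0.09),
    ("us-east",   0.82, 0.91, 0.08),
    ("us-west",   0.95, 0.68, 0.10),
    ("eu-green",  1.05, 0.28, 0.15),
    ("eu-green",  1.08, 0.29, 0.14),
    ("eu-green",  1.10, 0.31, 0.13),
    ("asia-spot", 1.35, 0.55, 0.05),
    ("us-east",   1.50, 0.82, 0.04),
]


def normalize(values):
    mn, mx = min(values), max(values)
    return [(mx - v) / (mx - mn) if mx != mn else 1.0 for v in values]


UTILS = {
    "speed":  normalize([s[1] for s in SCENARIOS]),
    "carbon": normalize([s[2] for s in SCENARIOS]),
    "cost":   normalize([s[3] for s in SCENARIOS]),
}


def best_response(threshold, max_rounds=20):
    """Return (final scenario index, True if the loop reached a fixed point)."""
    candidates = range(len(SCENARIOS))  # all scenarios are Pareto-optimal here
    current = 0
    for _ in range(max_rounds):
        moved = False
        for agent, scores in UTILS.items():
            target = max(candidates, key=lambda i: scores[i])
            if target == current:
                continue
            losses = [UTILS[o][current] - UTILS[o][target]
                      for o in UTILS if o != agent]
            if all(loss < threshold for loss in losses):
                current, moved = target, True
        if not moved:
            return current, True
    return current, False


strict = best_response(0.25)  # the PoC's threshold
lax = best_response(2.0)      # no move is ever blocked
print(strict, lax)
```

With the PoC's threshold the loop stops immediately at the fastest us-west scenario. With vetoes effectively disabled, the agents keep overriding one another and the loop exhausts its rounds without settling, which suggests that some blocking mechanism is what makes an allocation stable in this toy setup.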

How MARLIN Relates to Current Inference Infrastructure

MARLIN operates at the routing layer, above the inference engine rather than inside it. It decides which datacenter handles a request before vLLM, SGLang, or TensorRT-LLM ever sees it. This means it's compatible with any inference backend and doesn't require changes to the serving stack itself.

The framework's inputs are: TTFT measurements from each region, real-time carbon intensity data per grid zone, water usage estimates per kWh per region, and current queue depths. All of these are observable without deep integration with the inference engine. Carbon intensity APIs (Electricity Maps, WattTime) are already in production use for carbon-aware compute scheduling. Queue depth monitoring is standard in any serious serving deployment.

There's a connection here to the prefill-decode disaggregation work happening inside inference frameworks. MARLIN routes across datacenters; P/D disaggregation routes across GPU partitions within a cluster. The same logic applies at both levels: the prefill phase is compute-bound, the decode phase is memory-bandwidth-bound, and they compete for the same resources when colocated. vLLM, SGLang, and TensorRT-LLM all support P/D disaggregation as of early 2026. Combining geo-distributed routing with intra-cluster disaggregation is the natural next step, and it's where MARLIN's architecture fits into the broader stack.

What Infrastructure Teams Should Watch

The EU regulatory timeline creates a concrete forcing function. The EU Energy Efficiency Directive is already active. Germany's PUE 1.2 mandate hits new builds in July 2026. EU AI Act enforcement begins in August 2026. For teams operating in EU jurisdictions or serving EU users from EU datacenters, this is now a compliance question with dates attached.

Three things worth doing now, before production-grade MARL schedulers are available:

First, add carbon intensity as a monitoring signal for your inference clusters. If you don't measure it, you can't optimize it, and you definitely can't report it. Electricity Maps and WattTime both have APIs that return real-time carbon intensity per grid zone.

Second, separate your TTFT SLO from your cost SLO in your serving architecture. The MARLIN architecture only works because each objective has a dedicated reward signal. If your current system treats "be fast and cheap" as a single metric, you've already lost the ability to make the trade-offs that carbon-aware routing requires.

Third, if your serving fleet spans multiple cloud regions, review what signal your current load balancer uses for routing decisions. Latency and cost are standard. Carbon intensity almost certainly isn't. That's a relatively cheap add that gets you some fraction of the MARLIN benefit without any RL training.
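A minimal version of that load-balancer change might look like the sketch below: among regions that meet the TTFT SLO, prefer the one with the lowest grid carbon intensity. The `RegionStatus` type, the `choose_region` function, and every number here are hypothetical. In practice the intensity values would come from a carbon-intensity data source and the TTFT values from your own monitoring.

```python
from dataclasses import dataclass


@dataclass
class RegionStatus:
    name: str
    ttft_s: float            # recent measured time-to-first-token
    carbon_g_per_kwh: float  # current grid carbon intensity for the region


def choose_region(regions, ttft_slo_s):
    """Lowest-carbon region that meets the TTFT SLO; fastest region if none does."""
    eligible = [r for r in regions if r.ttft_s <= ttft_slo_s]
    if not eligible:
        return min(regions, key=lambda r: r.ttft_s)
    return min(eligible, key=lambda r: r.carbon_g_per_kwh)


# Made-up snapshot of three regions.
regions = [
    RegionStatus("us-east", 0.82, 410.0),
    RegionStatus("us-west", 0.78, 290.0),
    RegionStatus("eu-green", 1.05, 45.0),
]

interactive = choose_region(regions, ttft_slo_s=0.9)  # tight SLO for chat traffic
batch = choose_region(regions, ttft_slo_s=2.0)        # relaxed SLO for batch jobs
print(interactive.name, batch.name)
```

Keeping latency as a hard constraint and carbon as the tiebreaker means user-facing SLOs are never traded away, while latency-tolerant traffic drifts toward greener regions.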

Limitations and Open Questions

The paper's most impressive numbers come from the comparison against SLIT (a heuristic). Against DDQN, the strongest RL baseline, the reported improvement is Pareto hypervolume coverage (1.1251 vs. 0.7681) rather than per-metric percentage reductions. Teams already running an RL-based scheduler should therefore expect less dramatic gains than the SLIT comparison implies.

The water reduction claim (43-81%) is notable but depends on the datacenter's cooling infrastructure. Evaporative cooling at a datacenter in an arid climate has a different water profile than liquid cooling in a temperate one. MARLIN models water consumption as a function of workload placement, which is valid, but a team in a datacenter without workload-responsive cooling has less leverage on that dimension than the model assumes.
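To see why cooling infrastructure matters, consider a common way of decomposing operational water use: on-site cooling water (IT energy times the site's WUE) plus the water embedded in grid electricity (IT energy times PUE times the grid's water intensity). This is an assumption for illustration, not necessarily how MARLIN models water, and all site parameters below are made up.

```python
def water_liters(it_energy_kwh, wue_onsite, pue, grid_water_l_per_kwh):
    """On-site cooling water plus water embedded in the electricity drawn."""
    onsite = it_energy_kwh * wue_onsite
    offsite = it_energy_kwh * pue * grid_water_l_per_kwh
    return onsite + offsite


# The same 100 kWh of inference work placed at two hypothetical sites.
arid_evaporative = water_liters(100.0, wue_onsite=1.8, pue=1.2, grid_water_l_per_kwh=2.0)
temperate_liquid = water_liters(100.0, wue_onsite=0.2, pue=1.1, grid_water_l_per_kwh=1.0)
print(f"{arid_evaporative:.0f} L vs {temperate_liquid:.0f} L")
```

Under this decomposition a scheduler can only reduce water by moving work between sites with different parameters. If a fleet's sites all have similar WUE and grid water intensity, placement has little to work with.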

The capital-reserve hyperparameters (150 capital units, 0.5 pull strength) are central to whether sustainability agents have real influence or get overridden. The paper doesn't derive these from first principles; they appear to be empirically tuned. That's a calibration burden for any team deploying a MARLIN-style system in a new environment.

When to Expect This in Cloud APIs

MARLIN is a research paper, not a product. The authors have not announced a public implementation or a cloud provider integration. The paper was submitted May 13, 2026 and has not yet completed peer review at the time of writing.

The realistic deployment path runs through managed inference services. A framework that requires geo-distributed RL training on live inference traffic with measured carbon intensity data per region is more likely to appear inside a cloud provider's managed LLM serving product than as a self-hosted tool. Azure's carbon-aware workload shifting and Google Cloud's carbon-intelligent computing platform are the clearest prior examples of this pattern.

If the paper clears peer review and the authors ship an open implementation, integration into a serving framework like vLLM or SGLang could follow within 12 to 18 months. Managed cloud API exposure would lag another 12 to 24 months after that. For teams making infrastructure decisions today, the actionable version is simpler: measure carbon intensity, route batch workloads to greener windows, and build SLO frameworks that can express sustainability objectives separately from latency objectives. The multi-agent optimization layer is coming; the infrastructure prerequisites are already buildable.
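Routing batch workloads to greener windows needs no RL at all. As a sketch, given an hourly carbon-intensity forecast for one region, the function below finds the contiguous window of a given length with the lowest total intensity. The function name and the forecast values are hypothetical.

```python
def greenest_window(forecast, duration_h):
    """Start index of the contiguous window with the lowest total carbon intensity."""
    if not 0 < duration_h <= len(forecast):
        raise ValueError("duration_h must be between 1 and len(forecast)")
    starts = range(len(forecast) - duration_h + 1)
    return min(starts, key=lambda s: sum(forecast[s:s + duration_h]))


# Made-up hourly forecast in gCO2/kWh for the next eight hours.
hourly_forecast = [420, 400, 310, 180, 150, 170, 260, 390]
start = greenest_window(hourly_forecast, duration_h=3)
print(f"run the 3-hour batch job starting at hour {start}")
```

A scheduler that defers a deadline-tolerant job to that window captures part of the temporal carbon benefit with a forecast and a cron entry.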

Summary

MARLIN (arXiv:2605.13496) proposes a multi-agent game-theoretic RL framework for LLM inference scheduling that co-optimizes TTFT, carbon emissions, water usage, and energy cost across geo-distributed datacenters. Against the best-performing prior baselines, the paper reports at least 18% TTFT reduction, 33% carbon reduction, 43% water reduction, and 11% cost reduction. The architecture uses specialized SAC agents negotiating through a capital-reserve veto mechanism that converges to a Nash equilibrium: a stable allocation where no agent can unilaterally improve its objective.

The timing matters because EU regulatory pressure on datacenter sustainability is arriving on a concrete schedule, and inference energy is now the dominant component of LLM lifecycle footprint. The problem MARLIN addresses is real, growing, and has a regulatory deadline attached to it.

The paper is not a production tool today. Effloow Lab's simplified PoC confirmed that the Pareto tradeoff geometry is structurally coherent under synthetic conditions. The quantitative claims require validation against real infrastructure. The approach is worth tracking closely over the next 18 months.


Paper: arXiv:2605.13496. Authors: Hayden Moore, Sirui Qi, Sudeep Pasricha (Colorado State University); Dejan Milojicic, Cullen Bash (Hewlett Packard Labs). Submitted May 13, 2026.

Lab note: data/lab-runs/marlin-multi-agent-rl-sustainable-llm-inference-poc-2026.md
