ARTICLES ·2026-05-24 ·BY EFFLOOW CONTENT FACTORY

AutoExperiment: Testing AI Agents on Research Replication

How CMU's AutoExperiment benchmark uses progressive code masking to measure AI agents' ability to replicate ML research from paper descriptions alone.

ai-agents benchmarks research-replication paper-poc llm-evaluation code-generation ai-research


When a developer asks an AI agent to "implement this function from the paper," success with a single missing function (n=1) feels routine. The agent reads the description, generates Python, tests pass. But what happens when two interdependent functions are missing? Or five? A new benchmark from Carnegie Mellon University answers this question with enough precision to shift how we think about AI research assistants.

AutoExperiment (arXiv 2506.19724), published in June 2025, introduces a systematic evaluation framework that turns this intuition into measurable data. The findings are striking: the best available agents pass about a third of single-function tasks, but pass rates fall by roughly 70-78% in relative terms once two or more functions are masked.

What AutoExperiment Actually Measures

The benchmark distinguishes two concepts that practitioners often conflate:

Reproduction — given a paper, its original codebase, and one masked function, can an agent re-implement that function from the paper's natural language description alone?

Replication — given a paper and a codebase with multiple masked functions (n ≥ 2), can an agent implement all of them and reproduce the experimental results end-to-end?

Raising n one step at a time is the "progressive code masking" mechanism. Each benchmark task is identified by a combined_id of the form {paper_id}_{function_id}. For a single mask, a task might be 2205.00048_0. For two masks, it becomes 2205.00048_0,1. The number of masked functions n is the primary difficulty dial.
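
To make the ID convention concrete, here is a minimal sketch of a parser for the combined_id strings shown above. The helper name parse_combined_id is hypothetical, and the code assumes the function indices always follow the last underscore; the official repository may handle IDs differently.

```python
def parse_combined_id(combined_id: str) -> tuple[str, list[int]]:
    """Split '2205.00048_0,1' into ('2205.00048', [0, 1]).

    Hypothetical helper: assumes the function indices follow the LAST underscore.
    """
    paper_id, _, function_part = combined_id.rpartition("_")
    if not paper_id or not function_part:
        raise ValueError(f"not a combined_id: {combined_id!r}")
    function_ids = [int(part) for part in function_part.split(",")]
    return paper_id, function_ids


paper_id, function_ids = parse_combined_id("2205.00048_0,1")
n = len(function_ids)  # number of masked functions, the difficulty dial
print(paper_id, function_ids, n)  # 2205.00048 [0, 1] 2
```

Deriving n from the ID rather than storing it separately keeps the difficulty level and the task definition from drifting apart.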

In each task, the agent receives:

The full research paper PDF
The codebase with n target functions replaced by empty stubs (signature + docstring only)
The exact command to run the experiment
A sandboxed execution environment

Success means the agent's generated code produces results within an acceptable numerical tolerance of the original published outputs.
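
The "signature + docstring only" stubs described above can be illustrated with Python's standard ast module. This is a sketch of the idea, not the benchmark's actual masking tooling: mask_functions and the SOURCE snippet are made up for the example.

```python
import ast

SOURCE = '''
def compute_ema(value, prev_ema, beta=0.9):
    """Exponential moving average update."""
    return beta * prev_ema + (1 - beta) * value

def helper(x):
    return x + 1
'''


def mask_functions(source: str, targets: set[str]) -> str:
    """Replace the body of each target function with docstring + raise."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name in targets:
            stub = ast.parse('raise NotImplementedError("masked")').body
            if ast.get_docstring(node) is not None:
                # Keep the original docstring expression as the first statement.
                stub = [node.body[0]] + stub
            node.body = stub
    return ast.unparse(tree)  # requires Python 3.9+


masked = mask_functions(SOURCE, {"compute_ema"})
print(masked)
```

The masked source keeps the signature and docstring of compute_ema, drops its implementation, and leaves helper untouched, which is the view of the codebase an agent would start from in this toy setup.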

The benchmark covers 18 machine learning papers drawn from recent conference proceedings. The papers span areas including optimization algorithms, training procedures, and model evaluation methods — domains where a competent researcher would be expected to implement functions from a written description. The tasks are not cherry-picked for difficulty; they reflect the realistic diversity of ML research codebases, with some functions that are self-contained and some that rely on state passed across multiple calls.

One important design choice: the benchmark provides the exact run command. Agents do not need to figure out how to invoke the experiment — only how to fill in the missing implementations. This isolates the "understand the paper and generate correct code" challenge from the "figure out how the codebase is organized" challenge. The former is the focus; the latter would add noise without additional signal about the agent's core capability.

The Performance Numbers

The benchmark was run against four major frontier models. At n=1 (single function), results look reasonable:

Model	Pass Rate (n=1)	Pass Rate (n≥2)	Drop
Claude-3.7-sonnet	36.5%	~9.6%	−74%
GPT-4o	35.3%	~8.5%	−76%
Claude-3.5-sonnet	31.8%	~9.6%	−70%
GPT-4o-mini	27.1%	~6.0%	−78%

Two patterns stand out. First, the absolute numbers at n=1 are modest — the best agent succeeds on fewer than 4 in 10 tasks even when only a single function is missing. Second, and more importantly, the drop to n≥2 is severe and consistent across all models. This is not a gap between good and bad models; it is a structural limitation that affects all current frontier LLMs.
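
Note that the Drop column is a relative change, not a difference in percentage points. The arithmetic can be checked directly from the two pass-rate columns:

```python
# Pass rates in percent, copied from the table above.
results = {
    "Claude-3.7-sonnet": (36.5, 9.6),
    "GPT-4o": (35.3, 8.5),
    "Claude-3.5-sonnet": (31.8, 9.6),
    "GPT-4o-mini": (27.1, 6.0),
}


def relative_drop(n1: float, n2: float) -> float:
    """Relative drop in percent when going from n=1 to n>=2."""
    return (n1 - n2) / n1 * 100


drops = {model: round(relative_drop(*rates)) for model, rates in results.items()}
print(drops)
# {'Claude-3.7-sonnet': 74, 'GPT-4o': 76, 'Claude-3.5-sonnet': 70, 'GPT-4o-mini': 78}
```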

Why n=2 Is So Hard: Cross-Function Dependencies

The paper's analysis points to a specific failure mode that our Effloow Lab PoC reproduced. When two functions are masked, the agent must simultaneously:

Infer the correct implementation of function A
Infer the correct implementation of function B
Get the data flow between A and B right — even when the paper's text makes it implicit

Consider a simplified adaptive learning rate system (our PoC, not an AutoExperiment task):

import numpy as np

# Function A: compute EMA of squared gradients
def compute_ema(value: float, prev_ema: float, beta: float = 0.9) -> float:
    return beta * prev_ema + (1 - beta) * value

# Function B: compute adaptive learning rate
# (grad is unused here; it is kept to match the original signature)
def compute_adaptive_lr(grad: float, squared_grad_ema: float,
                        base_lr: float = 0.01, eps: float = 1e-8) -> float:
    return base_lr / (np.sqrt(squared_grad_ema) + eps)

# Pipeline: gradient → squared → EMA → adaptive LR
def pipeline(grads):
    ema = 0.0
    lrs = []
    for g in grads:
        ema = compute_ema(g**2, ema)  # <-- squaring happens HERE
        lrs.append(compute_adaptive_lr(g, ema, base_lr=0.01))
    return lrs  # learning rate used at each step

When both functions are masked, an agent reading the paper might correctly reconstruct the EMA formula and the adaptive LR formula in isolation. But if the paper describes "we apply EMA to gradient statistics" without explicitly stating g**2, the agent may pass g instead of g**2 into the EMA update. Both functions look correct in isolation. The pipeline fails.

Running this scenario in our PoC produced a pass rate of 20% at n=2, compared to 100% at n=1. That is the same qualitative pattern the paper reports, although on a simplified setup rather than on the benchmark itself.
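
The failure mode described above can be reproduced in a few lines. The sketch below is a standalone illustration with made-up gradient values, not the PoC code itself; the square_before_ema flag is a hypothetical switch that stands in for the detail the paper text leaves implicit.

```python
import math


def compute_ema(value, prev_ema, beta=0.9):
    return beta * prev_ema + (1 - beta) * value


def compute_adaptive_lr(squared_grad_ema, base_lr=0.01, eps=1e-8):
    return base_lr / (math.sqrt(squared_grad_ema) + eps)


def run_pipeline(grads, square_before_ema):
    """Return the learning rate after each step."""
    ema, lrs = 0.0, []
    for g in grads:
        stat = g ** 2 if square_before_ema else g  # the implicit detail
        ema = compute_ema(stat, ema)
        lrs.append(compute_adaptive_lr(ema))
    return lrs


grads = [0.5, 0.4, 0.3]  # made-up positive gradients for illustration
reference = run_pipeline(grads, square_before_ema=True)
candidate = run_pipeline(grads, square_before_ema=False)

matches = all(math.isclose(r, c, rel_tol=1e-5) for r, c in zip(reference, candidate))
print(matches)  # False: both runs finish without error, but the values differ
```

Neither helper function changes between the two runs. Only the value handed from one to the other does, which is why inspecting each function in isolation does not reveal the problem.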

Agentic vs. Agentless: The Execution Gap

The benchmark tests agents in two modes:

Agentless mode: the agent generates code in a single pass, then the code runs. No feedback loop. This mirrors how a developer might paste a function signature into a coding assistant and copy-paste the result.

Agentic mode: the agent can observe execution errors, print statements, and intermediate results, then revise its implementation iteratively. Each iteration is a full cycle: generate → execute in sandbox → observe output → revise.

The results show that agentic execution consistently outperforms agentless execution, but with an important nuance. The improvement is larger at n=1 than at n≥2. When dependencies span multiple functions, iteration helps — but only up to a point. The agent can debug syntax errors and obvious runtime failures, but inferring the wrong data flow is harder to self-correct because the intermediate results may look plausible even when they are wrong.
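
The generate → execute → observe → revise cycle can be sketched as a small loop. Everything here is an assumption made for illustration: fake_generate stands in for an LLM call, and exec() is used in-process only to keep the example short, where a real setup would run candidates in an isolated sandbox.

```python
from typing import Callable, Optional


def run_candidate(source: str) -> tuple[bool, str]:
    """Execute candidate code and report (ok, feedback).

    Uses exec() in-process purely for illustration; a real setup would
    run untrusted code in an isolated sandbox.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        value = namespace["ema"](1.0, 0.0)
        return True, f"ema(1.0, 0.0) = {value}"
    except Exception as exc:  # feed any failure back to the generator
        return False, f"{type(exc).__name__}: {exc}"


def agentic_loop(generate: Callable[[Optional[str]], str], max_iters: int = 3):
    """generate -> execute -> observe -> revise, up to max_iters times."""
    feedback = None
    for iteration in range(1, max_iters + 1):
        source = generate(feedback)
        ok, feedback = run_candidate(source)
        if ok:
            return iteration, feedback
    return None, feedback


# Stand-in for an LLM call: the first draft has a typo, the revision fixes it.
drafts = iter([
    "def ema(v, prev, beta=0.9):\n    return beta * prv + (1 - beta) * v\n",
    "def ema(v, prev, beta=0.9):\n    return beta * prev + (1 - beta) * v\n",
])


def fake_generate(feedback):
    return next(drafts)


iterations, observation = agentic_loop(fake_generate)
print(iterations, observation)
```

The loop stops as soon as a candidate runs without raising. That is enough to recover from the NameError in the first draft, but "ran without error" says nothing about whether the numbers are right, which is the limit the paragraph above describes.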

Consider the n=2 case from our PoC: an agent that forgets to square the gradient before the EMA update will observe learning rates that are different from expected — but unless it knows what the expected values are, it cannot tell whether its implementation is wrong or whether the paper's description was ambiguous. This is the fundamental limit of self-correction without a ground truth oracle.

For developers building research pipelines, this suggests a two-tier approach. Use agentic execution (run-observe-revise) for the generation phase, but add a separate verification step that compares output shapes, ranges, and key numerical values against paper-reported figures. The agent provides the code; the verifier provides the signal about whether it is correct.

This distinction matters for practitioners building research assistant agents: a "run and observe" loop improves results, and a verifier that compares output distributions against expected ranges would improve them further. The benchmark's authors explicitly frame this as the "verifier approach" direction, in which a separate agent or evaluation function grades the output without relying on the generating agent to self-assess.

Pass@1 vs. Pass@k: The Reliability Gap

The benchmark also measures Pass@k — the probability of getting at least one passing implementation in k attempts. The gap between Pass@1 and Pass@5 is large across all models.
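
One common way to estimate Pass@k from a fixed number of sampled attempts is the combinatorial estimator popularized by code-generation benchmarks such as HumanEval. Whether AutoExperiment uses this exact estimator is an assumption here, and the counts below are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled attempts, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 - P(all k attempts drawn without replacement are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up counts for illustration: 10 attempts on one task, 3 passed.
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
print(round(pass_at_k(10, 3, 5), 3))  # 0.917
```

With the same underlying attempts, the estimate moves from 0.3 at k=1 to about 0.92 at k=5, which shows how a modest single-shot rate can still leave a lot of room for multi-attempt strategies.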

A single-shot agent at n=1 passes ~36% of tasks. If you run five attempts and take the best, success rates jump meaningfully. This tells you that the agent understands the task more often than it executes it cleanly on the first try. There is latent capability that does not always surface on the first generation due to sampling randomness, prompt interpretation variance, or small implementation details that the agent handles differently across attempts.

This gap is significant for architectural decisions. If Pass@1 and Pass@5 were close, it would suggest the agent is consistently either succeeding or failing — retrying would not help. The large gap means retrying does help, which means the agent's "understanding" of the task is probabilistic rather than deterministic.

There are two practical implications. First, for any pipeline where correctness is required (not just plausible-looking output), the data directly motivates a multi-attempt system with output verification. Second, Pass@k is an upper bound on what selecting among k samples can achieve: even a perfect verifier cannot exceed it without a better underlying model or prompt.

For practitioners, this suggests a verifier-based architecture: run the agent multiple times, execute all candidate implementations, and use numerical output comparison to select the passing one automatically. The paper frames this as the "Pass@k motivates verifier approaches" direction. The verifier does not need to be sophisticated — for many ML experiments, comparing scalar loss values or accuracy figures within a tolerance band is sufficient to identify the correct implementation among a pool of candidates.
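
A verifier of the kind described above can be very small. The sketch below assumes each candidate is wrapped as a zero-argument callable that runs the experiment and returns one scalar; select_passing, the candidate values, and the reference loss are all hypothetical.

```python
import math
from typing import Callable, Optional, Sequence


def select_passing(
    candidates: Sequence[Callable[[], float]],
    reference: float,
    rel_tol: float = 1e-3,
) -> Optional[int]:
    """Return the index of the first candidate whose result matches reference."""
    for index, run in enumerate(candidates):
        try:
            result = run()
        except Exception:
            continue  # a crashing candidate simply does not pass
        if math.isclose(result, reference, rel_tol=rel_tol):
            return index
    return None


# Three hypothetical candidate experiments, each returning a final loss.
candidates = [
    lambda: 0.4312,  # plausible but wrong
    lambda: 1 / 0,   # crashes
    lambda: 0.3501,  # within tolerance of the reference
]
paper_reported_loss = 0.35  # made-up reference value

print(select_passing(candidates, paper_reported_loss))  # 2
```

Returning None when nothing passes is deliberate: it gives the pipeline an explicit signal to escalate to human review instead of silently accepting the least-bad candidate.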

How AutoExperiment Compares to Other Benchmarks

Research agent evaluation is a crowded space in 2025-2026. A few reference points:

PaperBench (OpenAI, April 2025) — agents replicate 20 ICML 2024 papers from scratch, building entire codebases. Best agent score: 21.0% (Claude 3.5 Sonnet). Human ML PhDs achieved 41.4% in 48 hours. PaperBench is harder by design: no codebase scaffold, no masking — write everything from zero.

LMR-Bench (2506.17335) — 28 code reproduction tasks focused on language modeling research specifically. Narrower domain, similar difficulty tier.

ReplicatorBench (2602.11354) — 19 tasks in social and behavioral sciences, emphasizing data collection and statistical analysis over ML code. Adds the challenge of finding new data for replication, not just re-running existing experiments.

AutoExperiment fills a specific niche: controllable difficulty via n, with a codebase scaffold provided (making it easier than PaperBench) but requiring correct function re-implementation (making it harder than pure code-running tasks).

Benchmark	Domain	Scaffold Provided	Difficulty Dial	Best Agent Score
AutoExperiment	ML papers	Yes (partial)	n masked functions	36.5% (n=1)
PaperBench	AI/ML papers	No	Paper complexity	21.0%
LMR-Bench	LM research	Partial	Task category	[DATA NOT AVAILABLE]
ReplicatorBench	Social sciences	No	Domain + data access	[DATA NOT AVAILABLE]

What This Means for Developers Building Research Agents

The AutoExperiment results are a useful design specification for anyone building an AI research assistant or an automated paper-to-code pipeline. Rather than treating the benchmark numbers as discouraging, treat them as a calibration tool — they tell you where your agent will succeed without human oversight and where it will need additional support.

Design for n=1 first. If your use case is "implement this one described function," current agents handle roughly a third of cases without intervention. That is useful, but it is not solved. For a research workflow that involves implementing dozens of small helper functions, a ~35% automation rate means significant time savings even with human review on the failures.

Expect n=2 to require intervention. Cross-function dependency inference is the current hard wall. If your pipeline needs to fill in two or more interdependent functions, plan for a human review step or a multi-agent verification pass where a second agent checks the data flow between implementations before running the full experiment.

Use agentic execution. In the benchmark, a "generate and observe" loop consistently outperforms single-shot generation. The overhead of a sandbox run is almost always worth it. Even a lightweight Docker container that executes the function on a small input and checks the output type and range catches a meaningful fraction of failures before they propagate.

Consider a verifier architecture. Pass@k results strongly suggest that running 3-5 candidates and selecting the passing one is more reliable than optimizing a single generation. For automated pipelines, build output comparison into your evaluation loop. The verifier does not need to be an LLM — a simple numerical comparison against paper-reported reference values often suffices.

Instrument result tolerance carefully. The benchmark uses numerical comparison with tolerances. For your domain, define what "close enough" means before you build the evaluator. Different ML metrics have very different scale sensitivities. A 1% error in a loss value might be acceptable; a 1% error in a precision-recall cutoff might not be.
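
One way to pin down "close enough" before building the evaluator is a per-metric tolerance table. The metric names and thresholds below are hypothetical placeholders, not values from the benchmark; the comparison itself uses the standard library's math.isclose.

```python
import math

# Hypothetical per-metric tolerances; the right values depend on your domain.
TOLERANCES = {
    "loss": {"rel_tol": 1e-2, "abs_tol": 0.0},
    "accuracy": {"rel_tol": 0.0, "abs_tol": 0.005},  # half a percentage point
}


def within_tolerance(metric: str, observed: float, expected: float) -> bool:
    """Compare one metric against its reference using that metric's tolerance."""
    return math.isclose(observed, expected, **TOLERANCES[metric])


print(within_tolerance("loss", 0.3530, 0.35))     # True: within 1% relative
print(within_tolerance("accuracy", 0.912, 0.92))  # False: off by 0.008
```

Using a relative tolerance for unbounded quantities like loss and an absolute one for bounded quantities like accuracy is a design choice in this sketch, made because a fixed relative error means very different things at different scales.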

Track your own n-distribution. If you are building a domain-specific research agent, measure how many of your typical tasks involve single-function versus multi-function gaps. If most of your tasks are n=1, the benchmark numbers suggest you can automate meaningfully today. If your tasks are typically n≥3, you are in territory where even the best frontier models currently struggle without architectural support.

Running a Minimal PoC

Effloow Lab ran a simplified reproduction of the core mechanism (see data/lab-runs/autoexperiment-research-agent-replication-poc-2026.md). The full AutoExperiment benchmark requires Docker, CUDA access, and the 18-paper task set from the official repository. A minimal demonstration of the masking mechanism in pure Python looks like this:

import numpy as np

# Step 1: Original function (what the paper published)
def adaptive_lr(grad, ema_sq_grad, base_lr=0.01, eps=1e-8):
    return base_lr / (np.sqrt(ema_sq_grad) + eps)

# Step 2: Masked version (what the agent sees)
def adaptive_lr_masked(grad, ema_sq_grad, base_lr=0.01, eps=1e-8):
    """
    MASKED. Paper says: "divide base_lr by square root of EMA of
    squared gradients plus epsilon for numerical stability."
    """
    raise NotImplementedError("Implement from paper description")

# Step 3: Agent re-implementation
def adaptive_lr_agent(grad, ema_sq_grad, base_lr=0.01, eps=1e-8):
    return base_lr / (ema_sq_grad ** 0.5 + eps)  # from description

# Step 4: Numerical evaluation
original = adaptive_lr(0.5, 0.25)
agent    = adaptive_lr_agent(0.5, 0.25)
passed   = abs(original - agent) / abs(original) < 1e-5

print(f"Original: {original:.6f}, Agent: {agent:.6f}, Pass: {bool(passed)}")
# Original: 0.020000, Agent: 0.020000, Pass: True

At n=1 with a precise description, the agent succeeds. The challenge is that real benchmark descriptions are often less precise — and at n=2, the interdependencies compound every ambiguity.

The Broader Context: Why Research Replication Matters Now

The AutoExperiment paper arrives at a moment when the research community is seriously examining whether AI agents can accelerate scientific progress — not just as writing assistants, but as active participants in the experimental cycle. Several forces make this question timely.

The volume of published ML papers has grown faster than the research community's capacity to verify results. Reproducibility crises in machine learning have been documented: papers that pass peer review but cannot be reproduced by independent groups are a real and recurring problem. An agent that can automatically attempt replication at scale would change the economics of reproducibility checking from "rarely done" to "routinely expected."

The tooling for running AI agents in sandboxed environments has also matured. Docker, cloud GPU spot instances, and framework-specific execution environments make it practical to run thousands of small experiments programmatically. The infrastructure is ready; the agent capability is the current bottleneck. AutoExperiment gives the research community a concrete number to track — 36.5% at n=1 today, and a target to beat as models improve.

Finally, benchmarks like AutoExperiment, PaperBench, and LMR-Bench provide shared evaluation language that the field needs. Without common benchmarks, claims about "AI research agents" are hard to compare. With measurable numbers, progress is visible and architectural decisions become evidence-based rather than anecdotal.

FAQ

Q: How is AutoExperiment different from just running GitHub Copilot on a paper?

Copilot and similar autocomplete tools do not run the code — they generate tokens. AutoExperiment requires the generated code to execute correctly in a sandboxed environment and produce results within numerical tolerance of the published paper. The evaluation is end-to-end, not syntax-level. A function that compiles and looks plausible but produces subtly wrong outputs fails the benchmark. This matters because many research implementations contain mathematical details that are easy to get almost right but hard to get exactly right — squaring before averaging, eps placement, beta initialization — and these details determine whether the experiment reproduces.

Q: Does the benchmark include non-ML papers?

The current dataset focuses on ML experimentation papers. Related benchmarks extend to other fields, such as the social sciences (ReplicatorBench, for instance) and astrophysics. AutoExperiment's authors have open-sourced the benchmark construction pipeline, which in principle allows extending to other domains. The core requirement is that the paper has an accompanying codebase and specific functions that can be meaningfully masked without making the task trivially unsolvable from context alone.

Q: What does "Pass@k" mean in practice?

Pass@k is the probability that at least one of k independent agent runs produces a passing implementation. Pass@1 is a single attempt; Pass@5 means you run five independent attempts and report success if any of them pass. The gap between Pass@1 and Pass@5 quantifies how much consistency matters beyond raw capability. A large gap means the agent can solve the task but does not do so reliably — which is useful for automated pipelines that can afford multiple attempts, but problematic for any workflow that needs a single reliable answer without verification.

Q: Why is the n=1 pass rate only ~36% if descriptions are usually clear?

AutoExperiment's task set uses real published papers, and research paper descriptions of implementation details are often imprecise, assume domain knowledge, or describe the algorithm at a level of abstraction that is several steps above the actual code. The agent must bridge that gap without the original author's intuition. Additionally, some functions interact with global state, use conventions specific to a particular codebase, or depend on hyperparameters whose values are mentioned only once in the paper. These contextual details are where agents most commonly fail at n=1.

Q: Should I worry about these results if I am just using AI to help write research code?

Not directly. The benchmark evaluates a fully automated task without human feedback. When you use an AI assistant interactively, you provide clarification, verify outputs, and catch obvious mistakes as you go. That interactive loop is more forgiving than the benchmark's automated evaluation. The results matter most if you are building a pipeline that runs without human oversight — where the agent must succeed or fail entirely on its own.

Key Takeaways

The AutoExperiment benchmark offers a concrete, measurable picture of where AI research agents currently stand:

n=1: current frontier agents succeed on ~27-37% of single-function tasks
n≥2: performance drops to ~6-10%, driven by cross-function dependency errors
Agentic execution (run-observe-fix loop) outperforms single-shot generation
Pass@k gap motivates verifier-based architectures where multiple candidates are tested and the passing one is selected

For developers building research assistant tools, this is less a warning than a design brief. The capability is real. The boundary conditions are now quantified. Build your agent pipelines around n=1 tasks, add verification loops, and treat n≥2 replication as requiring human oversight until the next generation of benchmarks tells a different story.

Bottom Line

AutoExperiment is the most practically useful research-agent benchmark to land in 2025: controllable difficulty, executable evaluation, and results that directly inform how to architect AI research pipelines. The n=2 collapse is the number every research agent developer should have in their head.
