LMR-BENCH: Can LLM Agents Reproduce NLP Research Code? (EMNLP 2025)
A research team from the University of Texas at Dallas published LMR-BENCH at EMNLP 2025, asking a specific question: can LLM agents reproduce the core implementation from an NLP research paper when given the paper, a partially masked codebase, and explicit instructions?
This is harder than it sounds, and the benchmark's design is worth understanding in detail.
Sources: arXiv 2506.17335 | ACL Anthology | GitHub
What the benchmark actually tests
LMR-BENCH contains 28 reproduction tasks drawn from 23 NLP papers published in ACL, EMNLP, NAACL, and AAAI over the past five years. Each task follows the same structure:
- Paper: the full PDF
- Masked repository: a real codebase from the paper, but with one or more critical functions replaced by `# TODO: implement` stubs
- Implementation instruction: a natural language description of what the masked function should do, including cross-file dependencies and design intent
The agent's job is to generate patch code that fills the stubs correctly.
This tests something distinct from "can an LLM write a function from a docstring." The function body has to match what the paper describes, use the surrounding codebase's conventions, and pass unit tests against the paper's reference implementation.
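The task structure above can be pictured as a small record plus a scan for stub markers. The sketch below is a minimal illustration, not the benchmark's own loader: `ReproductionTask`, `find_stubs`, and the field names are hypothetical, and the pattern assumes stubs are marked with `# TODO: implement` as described above.

```python
import re
from dataclasses import dataclass, field

# Hypothetical container for one task; field names are illustrative,
# not the benchmark's actual schema.
@dataclass
class ReproductionTask:
    paper_path: str
    instruction: str
    repo_files: dict[str, str] = field(default_factory=dict)  # path -> source text

STUB_MARKER = re.compile(r"#\s*TODO:\s*implement")
DEF_LINE = re.compile(r"^\s*def\s+(\w+)\s*\(")

def find_stubs(task: ReproductionTask) -> list[tuple[str, str]]:
    """Return (file, function_name) pairs whose body contains a TODO stub."""
    stubs = []
    for path, source in task.repo_files.items():
        current_def = None
        for line in source.splitlines():
            match = DEF_LINE.match(line)
            if match:
                # Remember the most recent function header we passed.
                current_def = match.group(1)
            elif STUB_MARKER.search(line) and current_def is not None:
                stubs.append((path, current_def))
    return stubs

task = ReproductionTask(
    paper_path="paper.pdf",
    instruction="Apply rotary position embeddings to xq and xk.",
    repo_files={
        "rotary_embedding.py": (
            "def apply_rotary_emb(xq, xk, freqs_cis):\n"
            "    # TODO: implement\n"
            "    raise NotImplementedError\n"
        ),
        "utils.py": "def reshape_for_broadcast(freqs_cis, x):\n    return freqs_cis\n",
    },
)
print(find_stubs(task))
```

An agent that only sees `rotary_embedding.py` would miss `utils.py`, which is why the instruction text is expected to call out cross-file dependencies.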
The nine task categories
| Category | What gets masked |
|---|---|
| Tokenization | Custom tokenizer logic |
| Attention mechanism | Scaled dot-product or custom attention |
| Positional encoding | RoPE, ALiBi, learned variants |
| Loss function | Custom training objectives |
| Data preprocessing | Dataset-specific transforms |
| Model architecture | Layer definitions, custom blocks |
| Training procedure | Optimizer steps, gradient modifications |
| Decoding strategy | Beam search variants, constrained decoding |
| Evaluation metric | BLEU variants, task-specific metrics |
The hardest category is model architecture: reproducing a custom layer requires reading across multiple files to understand tensor shapes, class inheritance, and forward pass conventions — exactly the kind of multi-file reasoning that current LLMs struggle with.
The easiest is evaluation metric: formulas are usually self-contained, well-documented in the paper, and don't require deep codebase knowledge.
How masking works in practice
Here's what a masked task looks like (synthetic example based on paper methodology):
# Original in paper's codebase: rotary_embedding.py
import torch

def apply_rotary_emb(xq, xk, freqs_cis):
    """Apply rotary embeddings to query and key tensors."""
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)

# Masked version (what the agent receives):
def apply_rotary_emb(xq, xk, freqs_cis):
    """Apply rotary embeddings to query and key tensors."""
    # TODO: implement
    # Instruction: Apply rotary position embeddings to xq and xk.
    # Use torch.view_as_complex for complex number representation.
    # freqs_cis shape must be broadcast-compatible with xq_.
    # Return float tensors matching input dtype.
    raise NotImplementedError
The info.json for this task would also specify which files the agent should read (reshape_for_broadcast definition lives in utils.py, for example).
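To make the role of that metadata concrete, here is a minimal sketch of how a harness might turn it into model context. The key names (`target_file`, `target_function`, `context_files`) and the `build_context` helper are assumptions made for illustration; the real `info.json` schema may differ.

```python
import json

# Hypothetical info.json content; the key names are assumptions for illustration.
INFO_JSON = """
{
  "target_file": "rotary_embedding.py",
  "target_function": "apply_rotary_emb",
  "context_files": ["utils.py"]
}
"""

def build_context(info: dict, repo_files: dict[str, str]) -> str:
    """Concatenate the target file and its listed dependencies into one prompt section."""
    ordered = [info["target_file"], *info.get("context_files", [])]
    sections = []
    for path in ordered:
        if path not in repo_files:
            raise FileNotFoundError(f"{path} is listed in info.json but missing from the repo")
        sections.append(f"### {path}\n{repo_files[path]}")
    return "\n\n".join(sections)

# Toy repository contents standing in for the masked codebase.
repo_files = {
    "rotary_embedding.py": "def apply_rotary_emb(xq, xk, freqs_cis):\n    raise NotImplementedError\n",
    "utils.py": "def reshape_for_broadcast(freqs_cis, x):\n    ...\n",
}

info = json.loads(INFO_JSON)
context = build_context(info, repo_files)
print(context)
```

Putting the target file first and the dependencies after it keeps the stub visible at the top of the context, which is one reasonable ordering among several.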
Dual evaluation: unit tests + LLM-as-judge
LMR-BENCH scores agents on two axes:
Axis 1: Functional correctness (unit tests). Numerical equivalence against the reference implementation: the agent's patch must produce the same tensor outputs as the original function.
Axis 2: Implementation fidelity (LLM-as-judge). GPT-4o reads the paper's algorithm description and the agent's code, then scores whether the implementation actually follows the described method, even if it passes unit tests through an equivalent but differently structured approach.
This dual axis matters because:
- A function can pass unit tests but use a different algorithm (memorized shortcut)
- A function can fail unit tests due to floating-point differences but be conceptually correct
Both axes tell you different things about the agent's reasoning.
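The functional-correctness axis can be illustrated with a tolerance-based comparison. The sketch below is not the benchmark's test code. It uses a plain-Python stand-in for the rotary rotation (pairs of features treated as complex numbers), and the function names and tolerances are assumptions. It shows why a check should allow small floating-point differences between two correct formulations while still rejecting a wrong one.

```python
import cmath
import math

def rotate_reference(pairs, angles):
    """Reference: treat each (a, b) pair as a + bi and multiply by e^(i * angle)."""
    out = []
    for (a, b), theta in zip(pairs, angles):
        z = complex(a, b) * cmath.exp(1j * theta)
        out.append((z.real, z.imag))
    return out

def rotate_candidate(pairs, angles):
    """Candidate: the same rotation written with explicit sin/cos terms."""
    out = []
    for (a, b), theta in zip(pairs, angles):
        c, s = math.cos(theta), math.sin(theta)
        out.append((a * c - b * s, a * s + b * c))
    return out

def rotate_wrong(pairs, angles):
    """A buggy candidate that rotates in the opposite direction."""
    return rotate_candidate(pairs, [-theta for theta in angles])

def numerically_equivalent(x, y, rel_tol=1e-9, abs_tol=1e-12):
    """Compare two lists of pairs element-wise within a tolerance."""
    flat_x = [v for pair in x for v in pair]
    flat_y = [v for pair in y for v in pair]
    return len(flat_x) == len(flat_y) and all(
        math.isclose(p, q, rel_tol=rel_tol, abs_tol=abs_tol)
        for p, q in zip(flat_x, flat_y)
    )

pairs = [(1.0, 0.0), (0.5, -2.0), (3.0, 4.0)]
angles = [0.0, 0.7, 2.1]
reference = rotate_reference(pairs, angles)

print(numerically_equivalent(reference, rotate_candidate(pairs, angles)))
print(numerically_equivalent(reference, rotate_wrong(pairs, angles)))
```

A check like this covers only the first axis. Whether the candidate follows the paper's described method is a separate question that output comparison cannot answer.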
What the results show
The paper doesn't release a full leaderboard in the public arXiv version, but the findings indicate:
- o3-mini (high compute) was the best-performing model tested
- Pass@1 rates ranged roughly from 20% to 60% across task categories
- Multi-file reasoning was the single biggest differentiator: models that could trace function calls across 3+ files significantly outperformed those that stayed in the target file
- Giving the model only the paper PDF, without the masked code, performed worse than giving both, which suggests the code context carries more of the signal than the paper text for reproduction tasks
The last point is counterintuitive. You'd expect the paper's equations to be the key signal. But the surrounding codebase (tensor shapes, variable naming, utility functions) constrains the solution space more tightly than the abstract algorithm description.
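For readers unfamiliar with the pass@1 figures quoted above, the following sketch shows one common way such a number is computed: the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021). Whether LMR-BENCH uses this exact estimator is an assumption, and the sample counts below are made up purely to exercise the function.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them passed the tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up sample counts for illustration only; these are NOT results from the paper.
toy_results = {
    "task_a": (5, 2),  # (samples drawn, samples that passed)
    "task_b": (5, 0),
    "task_c": (5, 5),
}

per_task = {name: pass_at_k(n, c, k=1) for name, (n, c) in toy_results.items()}
mean_pass_at_1 = sum(per_task.values()) / len(per_task)
print(per_task, mean_pass_at_1)
```

With `k=1` the estimator reduces to the fraction of samples that passed, so a category-level pass@1 is simply the average of that fraction over the tasks in the category.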
Why this benchmark matters for developers
If you're building an AI-assisted research coding tool (or evaluating whether an agent can help you implement a paper), LMR-BENCH is one of the more realistic evaluation frameworks available. The alternatives:
- HumanEval / MBPP: function-level, no paper context, no cross-file reasoning
- SWE-bench: bug fixing in large codebases, different skill set from paper reproduction
- APPS: competitive programming, not research implementation
LMR-BENCH specifically targets the "I read a paper, now implement it" workflow — which is what most ML engineers actually do.
Running the benchmark yourself
The benchmark repo requires Python ≥ 3.12 and supports any LLM backend through its evaluation harness:
git clone https://github.com/du-nlp-lab/LMR-Bench
cd LMR-Bench
pip install -r requirements.txt
# Run a single task with Claude
python evaluate.py \
  --task benchmark/rotary_emb_task/ \
  --model claude-opus-4-7-20251001 \
  --api-key "$ANTHROPIC_API_KEY"
The evaluation harness handles: sending the paper + masked code to the model, collecting the patch, running unit tests, and recording fidelity scores.
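The harness steps listed above can be sketched as a small loop. This is an illustrative outline under stated assumptions, not the repository's code: `fake_model` stands in for a real LLM call, the `scale` function is a toy masked target, and fidelity scoring by a judge model is left out.

```python
from typing import Callable

# Toy masked file: one function whose body has been replaced by a stub.
MASKED_SOURCE = '''
def scale(values, factor):
    # TODO: implement
    raise NotImplementedError
'''

def fake_model(paper_text: str, masked_source: str, instruction: str) -> str:
    """Stand-in for an LLM call; returns a full replacement for the masked file."""
    return "def scale(values, factor):\n    return [v * factor for v in values]\n"

def run_unit_tests(patched_source: str) -> bool:
    """Execute the patched file in a fresh namespace and check one expected output."""
    namespace: dict = {}
    try:
        exec(patched_source, namespace)
        return namespace["scale"]([1, 2, 3], 2) == [2, 4, 6]
    except Exception:
        # Syntax errors, missing names, and unimplemented stubs all count as failures.
        return False

def evaluate_task(model: Callable[[str, str, str], str]) -> dict:
    """Send the inputs to the model, collect its patch, and record the test outcome."""
    patch = model("<paper text>", MASKED_SOURCE, "Multiply each value by factor.")
    return {"patch": patch, "tests_passed": run_unit_tests(patch)}

result = evaluate_task(fake_model)
print(result["tests_passed"])
```

Running model-generated code with `exec` is acceptable only in a toy like this. A real harness would isolate execution in a sandbox or container.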
What to expect if you run it
Based on the paper's findings, expect:
- Evaluation metric tasks: 50–60% pass@1 with a capable model
- Model architecture tasks: 20–30% pass@1, sometimes lower
- Most failures: not wrong algorithm, but wrong tensor handling — shape mismatches from not reading the surrounding code carefully enough
If you're using this to evaluate your own agent, the architecture tasks are the most informative discriminator between models.
The broader picture
LMR-BENCH reveals a gap that matters: LLMs can explain papers well and can write code well, but the intersection — implement exactly what this paper describes, in this codebase, with these constraints — is still hard. The benchmark gives that gap a number.
For the AI research community, this is also a forcing function: if you want your paper to be reproducible by an LLM agent, write clearer implementation instructions and reduce cross-file dependencies in your codebase.
Paper: Shuo Yan et al., "LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research," EMNLP 2025. arXiv:2506.17335. All results cited from the published paper.