EFFLOOW LAB LAB-RUN

LMR-Bench LLM Reproduce NLP Research Code Paper PoC 2026

Evidence notes document the bounded local or source-based checks behind an Effloow article. They are not product endorsements, legal advice, or benchmark claims.

Date: 2026-05-22
Content track: paper-poc
Slug: lmr-bench-llm-reproduce-nlp-research-code-paper-poc-2026

Source

Paper summary

LMR-BENCH introduces a benchmark of 28 code-reproduction tasks derived from 23 NLP papers published in top-tier venues (ACL, EMNLP, NAACL, AAAI) over the past five years.

Task format: Given (paper PDF, code repo with masked methods, implementation instruction) → agent must generate patch code that fills in the missing methods correctly.

Evaluation: two-part. Unit tests check functional correctness; an LLM-as-judge checks implementation fidelity.

Nine categories: tokenization, attention mechanism, positional encoding, loss function, data preprocessing, model architecture, training procedure, decoding strategy, evaluation metric.

PoC reproduction approach

This is a paper-poc: we reproduce the benchmark's core methodology conceptually, not by re-running the full benchmark (which requires API keys and compute).

Step 1: Confirm repo exists and structure is public

Repository: github.com/du-nlp-lab/LMR-Bench. Confirmed public and active; the repo states a Python ≥ 3.12 requirement.

Repo structure, as described in the paper and in search results (not verified by a local checkout):

LMR-Bench/
  benchmark/
    {project_name}/
      repository_folder/     # masked code repo
      unit_tests/
      info.json              # metadata: paper, masked methods, instructions
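
To make the layout above concrete, here is a minimal sketch of how a harness might enumerate tasks. The function name load_tasks and the returned dict keys are our own illustration, not part of the LMR-Bench codebase; the folder and file names are taken from the structure listed above, which we did not verify against a local checkout.

```python
import json
from pathlib import Path


def load_tasks(benchmark_root):
    """Return one dict per task folder that contains an info.json file.

    Hypothetical helper: the folder names follow the layout in this note.
    """
    tasks = []
    for project_dir in sorted(Path(benchmark_root).iterdir()):
        info_path = project_dir / "info.json"
        if not info_path.is_file():
            continue  # skip stray files and incomplete task folders
        with info_path.open(encoding="utf-8") as fh:
            info = json.load(fh)
        tasks.append({
            "name": project_dir.name,
            "info": info,
            "repo": project_dir / "repository_folder",  # masked code repo
            "tests": project_dir / "unit_tests",
        })
    return tasks
```

Calling load_tasks("LMR-Bench/benchmark") on a local clone would return a list of task records, assuming the clone matches the layout described here.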

Step 2: Reproduce the masking pattern

The benchmark creates tasks by:

  1. Selecting a critical function in an NLP paper's codebase
  2. Replacing the function body with a # TODO: implement stub
  3. Writing an info.json with: paper_title, arxiv_id, masked_function, implementation_instruction, dependencies
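
The three steps above can be sketched with the standard-library ast module. This is our own illustration of the masking idea, not the benchmark authors' tooling: mask_function and build_info are hypothetical names, the toy function scale and the arXiv ID are placeholders, and the sketch assumes a plain (non-async) function whose body starts on its own line.

```python
import ast


def mask_function(source, func_name):
    """Return source with func_name's body replaced by a TODO stub."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            break
    else:
        raise ValueError(f"function {func_name!r} not found")

    first = node.body[0]
    has_docstring = (
        isinstance(first, ast.Expr)
        and isinstance(first.value, ast.Constant)
        and isinstance(first.value.value, str)
    )
    # Keep the signature (and the docstring, if any); drop the rest of the body.
    keep_until = first.end_lineno if has_docstring else first.lineno - 1
    indent = " " * first.col_offset
    stub = [indent + "# TODO: implement", indent + "raise NotImplementedError"]
    lines = source.splitlines()
    return "\n".join(lines[:keep_until] + stub + lines[node.end_lineno:]) + "\n"


def build_info(paper_title, arxiv_id, masked_function, instruction, dependencies):
    """Assemble task metadata using the field names listed in this note."""
    return {
        "paper_title": paper_title,
        "arxiv_id": arxiv_id,
        "masked_function": masked_function,
        "implementation_instruction": instruction,
        "dependencies": dependencies,
    }


# Toy input, invented for this sketch.
original = '''def scale(x, factor):
    """Multiply x by factor."""
    result = x * factor
    return result
'''
masked = mask_function(original, "scale")
info = build_info(
    "Placeholder title", "0000.00000", "scale",
    "Implement scale() as described in the paper.", [],
)
```

Working on line ranges rather than regenerating code from the tree keeps the rest of the file byte-for-byte intact, which matters if unit tests or the judge compare against the original repository.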

Example (synthetic, based on paper description):

# Original: attention_score() in a Transformer implementation
# Masked version (what agents see):
def attention_score(query, key, scale=True):
    """Compute scaled dot-product attention scores."""
    # TODO: implement
    raise NotImplementedError

# Agent's task: reproduce from paper equations + instruction text
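
For orientation, a patch that fills in this synthetic stub might look like the sketch below. It is our own pure-Python illustration, assuming that query and key are lists of equal-length rows and that "scores" means the pre-softmax values q·k / sqrt(d_k); a real task would likely use tensors and follow whatever the paper and instruction text specify.

```python
import math


def attention_score(query, key, scale=True):
    """Compute scaled dot-product attention scores (pre-softmax).

    Assumes query and key are lists of rows with the same width d_k.
    """
    d_k = len(key[0])
    factor = 1.0 / math.sqrt(d_k) if scale else 1.0
    # scores[i][j] is the (optionally scaled) dot product of query row i and key row j.
    return [
        [factor * sum(q * k for q, k in zip(q_row, k_row)) for k_row in key]
        for q_row in query
    ]
```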

Step 3: Understand the dual-axis evaluation

Axis        Method                  What it checks
Functional  Unit tests (pytest)     Output matches the reference numerically
Fidelity    LLM-as-judge (GPT-4o)   Implementation matches the algorithm described in the paper
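
The two axes can disagree, which is the reason for checking both. The sketch below is our own illustration, not the benchmark's harness: functional_pass, evaluate, and toy_judge are hypothetical names, and toy_judge is a trivial keyword check standing in for the LLM call, which we did not run.

```python
import math


def functional_pass(candidate, reference, inputs, tol=1e-6):
    """Axis 1: candidate output matches the reference numerically on every input."""
    for args in inputs:
        try:
            got = candidate(*args)
        except Exception:
            return False  # a crash counts as a functional failure
        if not math.isclose(got, reference(*args), rel_tol=tol, abs_tol=tol):
            return False
    return True


def evaluate(candidate, reference, inputs, candidate_source, judge):
    """Report both axes; judge is any callable that returns a truthy verdict."""
    return {
        "functional": functional_pass(candidate, reference, inputs),
        "fidelity": bool(judge(candidate_source)),
    }


def l2_reference(x, y):
    """Toy reference: Euclidean norm written out as in a paper's formula."""
    return math.sqrt(x * x + y * y)


def toy_judge(source_text):
    """Placeholder for an LLM judge: only checks that sqrt appears in the code."""
    return "sqrt" in source_text


result = evaluate(
    math.hypot, l2_reference, [(3.0, 4.0), (1.0, 1.0)],
    "return math.hypot(x, y)", toy_judge,
)
```

Here result reports a functional pass but a fidelity failure: the candidate returns the right numbers through a shortcut that the placeholder judge does not accept as matching the written formula.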

Key finding from paper

  • Best-performing model on LMR-BENCH: o3-mini (OpenAI, high-compute mode)
  • Pass@1 range across 28 tasks: 20–60% depending on category
  • Hardest category: "model architecture" (requires multi-file cross-referencing)
  • Easiest: "evaluation metric" (self-contained, well-defined formulas)
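
For readers unfamiliar with the metric: with a single attempt per task, Pass@1 for a category is the fraction of its tasks whose attempt passed. The sketch below shows that arithmetic only. The function name is hypothetical and the sample outcomes are invented placeholders, not results from the paper.

```python
from collections import defaultdict


def pass_at_1_by_category(results):
    """results: iterable of (category, passed) pairs, one attempt per task."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, attempted]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {category: p / n for category, (p, n) in totals.items()}


# Placeholder outcomes, invented purely to exercise the function.
sample = [
    ("loss function", True),
    ("loss function", False),
    ("evaluation metric", True),
]
rates = pass_at_1_by_category(sample)
```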

Evidence quality

This is a source-verified PoC. The paper is peer-reviewed (EMNLP 2025). The GitHub repo is public. No local execution was performed, because evaluation requires API keys. The synthetic masking example above illustrates the methodology based on the paper's description; it is not taken from the benchmark.

Verdict

Paper: real, peer-reviewed. Methodology: reproducible with API keys. Article can explain the benchmark approach, show the masking pattern, and discuss implications for AI-assisted research reproduction. No fabricated metrics — all numbers cited from the paper.

Read the article

This note supports the public article and records what was actually checked.
