SciAgentGYM: 1,780 Scientific Tools, One Hard Benchmark
Every week a new LLM claims to be "state-of-the-art on scientific tasks." Those claims usually rest on multiple-choice chemistry questions or single-step math proofs — tasks that a well-trained language model can pattern-match from training data alone.
Real scientific work looks nothing like that. A chemist computing molecular properties calls a SMILES parser, feeds the output into a molecular geometry optimizer, runs a density functional theory calculation on the result, and extracts energy values from the DFT output. That's four sequential tool calls with strict dependency ordering. If any step fails, the whole workflow collapses.
SciAgentGYM (arXiv:2602.12984), published by Fudan NLP researchers in February 2026, is the first benchmark environment built specifically for this kind of evaluation: multi-step scientific tool use in LLM agents. The results are sobering.
What SciAgentGYM Is — and Why It's Different
Most LLM benchmarks test a model's knowledge. SciAgentGYM tests whether an agent can operate in a scientific environment — selecting, sequencing, and executing domain-specific computational tools to reach a verifiable answer.
The system has three tightly coupled components:
SciAgentGym (the environment) provides 1,780 domain-specific scientific tools spanning four natural science disciplines: Physics, Chemistry, Biology (Life Sciences), and Materials Science. The runtime also includes a filesystem for artifact management between tool calls, scientific databases for knowledge retrieval, and a Python interpreter for custom computation. Agents interact with this environment the same way a research software stack works: outputs from one tool become inputs to the next.
SciAgentBench (the evaluation suite) contains 259 tasks and 1,134 sub-questions built through a four-stage quality pipeline. The authors aggregated roughly 5,000 candidate tasks from existing benchmarks, filtered out any task where four frontier LLMs averaged above 50% accuracy (keeping only genuinely hard ones), executed each retained task inside SciAgentGym to verify it was actually solvable, and had domain experts validate that solutions genuinely require multi-step reasoning rather than direct recall.
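The difficulty filter in that pipeline is simple to picture in code. Here is a minimal sketch, assuming each candidate task carries the accuracy of four reference models; the record layout, task IDs, and the `keep_hard_tasks` name are hypothetical and not taken from the paper or its repository.

```python
from statistics import mean

# Hypothetical candidate records: per-task accuracy of four reference models.
candidates = [
    {"id": "phys-001", "model_accuracies": [0.9, 0.8, 0.7, 0.95]},
    {"id": "chem-014", "model_accuracies": [0.2, 0.5, 0.1, 0.4]},
    {"id": "bio-007", "model_accuracies": [0.5, 0.5, 0.5, 0.5]},
]

def keep_hard_tasks(tasks, threshold=0.5):
    """Drop tasks whose mean accuracy across the reference models exceeds the threshold."""
    return [t for t in tasks if mean(t["model_accuracies"]) <= threshold]

hard = keep_hard_tasks(candidates)
print([t["id"] for t in hard])  # ['chem-014', 'bio-007']
```

A task averaging exactly 50% is kept here because the text describes removing tasks that average above 50%; how the authors treat that boundary is an assumption.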
The task difficulty is stratified into three levels:
- L1 — up to 3 tool-call steps
- L2 — 4 to 7 steps
- L3 — 8 or more steps
Notably, 79% of the benchmark falls into L2 or L3. Short, easy tasks aren't the point.
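The three bands translate directly into a small helper. This is a sketch of the thresholds listed above; the function name is made up for illustration, and rejecting negative step counts is an assumption.

```python
def difficulty_level(num_steps: int) -> str:
    """Map a task's tool-call step count to the L1/L2/L3 bands."""
    if num_steps < 0:
        raise ValueError("step count cannot be negative")
    if num_steps <= 3:
        return "L1"  # up to 3 steps
    if num_steps <= 7:
        return "L2"  # 4 to 7 steps
    return "L3"      # 8 or more steps

print([difficulty_level(n) for n in (2, 3, 4, 7, 8, 12)])
# ['L1', 'L1', 'L2', 'L2', 'L3', 'L3']
```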
SciForge (the data synthesis method) is a training approach that models the tool action space as a dependency graph and generates logic-aware training trajectories from it. It's described further below.
The Domain Breakdown
SciAgentBench's 259 tasks split across disciplines as follows:
| Domain | Tasks | Share | Tool-Use Benefit (avg) |
|---|---|---|---|
| Physics | 109 | 42% | +2.5% |
| Chemistry | 81 | 31% | +7.0% |
| Materials Science | 37 | 14% | +3.7% |
| Life Sciences | 32 | 12% | +8.4% ← highest gain |
The "tool-use benefit" column is telling. In Physics, agents already have strong parametric knowledge from training data, so tools add only 2.5 percentage points. In Chemistry and Life Sciences, where calculations are more procedural and outputs depend heavily on molecular data that can't be memorized, using the correct tools lifts performance by 7.0 and 8.4 percentage points respectively. This suggests the benchmark captures where tool use actually matters.
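As a quick sanity check, the Share column can be recomputed from the task counts in the table. The snippet uses only the numbers shown above; the variable names are illustrative.

```python
# Task counts per domain, copied from the table.
task_counts = {
    "Physics": 109,
    "Chemistry": 81,
    "Materials Science": 37,
    "Life Sciences": 32,
}

total = sum(task_counts.values())
shares = {domain: round(100 * n / total) for domain, n in task_counts.items()}

print(total)   # 259
print(shares)  # {'Physics': 42, 'Chemistry': 31, 'Materials Science': 14, 'Life Sciences': 12}
```

The rounded shares sum to 99 rather than 100, which is a rounding effect and not a missing domain.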
The Core Finding: Long-Horizon Performance Collapse
The most striking result in the paper is this: GPT-5 achieves a 60.6% success rate on L1 tasks but drops to 30.9% on L3 tasks — nearly halving its performance as interaction horizons extend. The authors attribute this primarily to failures in multi-step workflow execution: errors in intermediate steps cascade, and the model fails to recover or retry correctly.
The paper evaluated four frontier models — Claude-Sonnet-4.5, DeepSeek-R1, Qwen3-235B, and GPT-5 — and found the same sharp degradation pattern across all of them. No frontier model escaped the performance collapse on long-horizon tasks.
There's a straightforward lesson here for developers building scientific agents: scores on short tasks don't predict performance on real workflows. A model that scores 60% on L1 may be averaging below 31% on the long-horizon tasks your pipeline actually needs.
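To put the "nearly halving" claim in numbers, here is the arithmetic on the two GPT-5 figures quoted above. Nothing else from the paper is assumed.

```python
l1_success, l3_success = 60.6, 30.9  # GPT-5 success rates (%) quoted above

absolute_drop = l1_success - l3_success   # percentage points lost
relative_drop = absolute_drop / l1_success  # fraction of L1 performance lost

print(f"{absolute_drop:.1f} points, {relative_drop:.0%} relative")
# 29.7 points, 49% relative
```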
Why This Matters: The Tool-Dependency Structure
To understand what makes longer tasks hard, consider a Chemistry task that asks an agent to identify the most stable isomer of a given organic compound. The required tool chain looks something like this:
- Parse the SMILES string into an internal molecule object
- Enumerate possible isomers using the stereoisomer generator
- Optimize 3D geometry for each candidate
- Run DFT calculations on each optimized structure
- Extract total energies from each DFT output
- Compare and return the minimum-energy isomer
That's a six-step chain. Any misordering — say, trying to run DFT before geometry optimization completes — produces a hard failure. Any incorrect tool selection — using a 2D descriptor calculator instead of the 3D optimizer — produces silent errors that propagate downstream.
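The ordering constraint on that chain can be sketched with Python's standard-library `graphlib` (Python 3.9+). The tool names below are hypothetical labels for the six steps, not identifiers from the SciAgentGYM tool suite.

```python
from graphlib import TopologicalSorter

# Hypothetical tool names; each maps to the tools whose output it needs.
DEPENDS_ON = {
    "parse_smiles": set(),
    "enumerate_isomers": {"parse_smiles"},
    "optimize_geometry": {"enumerate_isomers"},
    "run_dft": {"optimize_geometry"},
    "extract_energy": {"run_dft"},
    "select_min_energy": {"extract_energy"},
}

def valid_order(calls):
    """Return True if every call appears after all of its prerequisites."""
    seen = set()
    for tool in calls:
        if not DEPENDS_ON[tool] <= seen:
            return False
        seen.add(tool)
    return True

# A dependency-respecting plan, derived from the graph itself.
plan = list(TopologicalSorter(DEPENDS_ON).static_order())

# The misordering described above: DFT before geometry optimization.
bad_plan = ["parse_smiles", "enumerate_isomers", "run_dft",
            "optimize_geometry", "extract_energy", "select_min_energy"]

print(valid_order(plan), valid_order(bad_plan))  # True False
```

A check like `valid_order` catches misordering before any expensive call runs. It does nothing for the second failure mode, wrong tool selection, which still needs validation of each tool's output.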
Effloow Lab reproduced this dependency structure in a minimal Python simulation (stdlib only, no API keys). Building a seven-node Chemistry tool graph with BFS traversal for transitive dependency resolution, the PoC confirmed that the L1/L2/L3 classification boundaries closely mirror real scientific workflow complexity. See data/lab-runs/sciagentgym-scientific-tool-use-llm-benchmark-poc-2026.md for the full run log.
The key structural insight the PoC reinforces: difficulty in scientific tool use compounds multiplicatively rather than additively. A six-step task isn't merely twice as hard as a three-step task. The chain succeeds only if every intermediate step succeeds, so per-step failure probabilities compound and end-to-end success decays roughly geometrically with chain length.
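A toy model makes the compounding concrete. Assume, purely for illustration, that every step succeeds independently with the same probability; the 0.9 figure is hypothetical and not a number from the paper.

```python
def chain_success(per_step_success: float, num_steps: int) -> float:
    """End-to-end success of a chain whose steps succeed independently."""
    return per_step_success ** num_steps

for n in (3, 6, 8):
    print(n, round(chain_success(0.9, n), 3))
# 3 0.729
# 6 0.531
# 8 0.43
```

Real agents can retry and recover, and step failures are rarely independent, so this is an intuition pump and not a prediction of benchmark scores.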
SciForge: Teaching Smaller Models the Structure
The most practically interesting finding in the paper is that you don't need a frontier-scale model to perform well on SciAgentBench. You need a model that has been trained to understand tool dependency structure.
SciForge achieves this by treating the tool action space as a directed acyclic graph. Instead of collecting training trajectories as flat sequences of tool calls, SciForge generates trajectories that respect and encode the dependency relationships between tools. The result is that fine-tuned models learn not just which tools to call, but in what order and why.
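To illustrate the general idea of dependency-respecting trajectories, here is a minimal sketch that samples random tool orderings from a small DAG so that every tool follows its prerequisites. This is not SciForge's actual algorithm, which this article does not detail; the graph, tool names, and `sample_trajectory` are all hypothetical.

```python
import random

# Hypothetical tool graph: tool -> prerequisite tools.
DEPENDS_ON = {
    "parse_smiles": set(),
    "compute_descriptors": {"parse_smiles"},
    "optimize_geometry": {"parse_smiles"},
    "run_dft": {"optimize_geometry"},
    "extract_energy": {"run_dft"},
}

def sample_trajectory(graph, rng):
    """Sample one random ordering in which every tool follows its prerequisites."""
    done, order = set(), []
    while len(order) < len(graph):
        # Tools whose prerequisites have all been executed already.
        ready = sorted(t for t, deps in graph.items()
                       if t not in done and deps <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        choice = rng.choice(ready)
        order.append(choice)
        done.add(choice)
    return order

rng = random.Random(0)  # seeded for reproducibility
trajectories = [sample_trajectory(DEPENDS_ON, rng) for _ in range(5)]
for traj in trajectories:
    print(" -> ".join(traj))
```

Every sampled sequence is a valid execution order, so a model trained on such data never sees `run_dft` before `optimize_geometry`, while still seeing the legitimate freedom in where `compute_descriptors` can go.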
The numbers make the point: fine-tuning an 8B model on SciForge-generated trajectories produces SciAgent-8B, which:
- Achieves a +6.7% improvement over its base model's score
- Outperforms Qwen3-VL-235B-Instruct, a model roughly 29x larger by parameter count
- Shows positive cross-domain transfer: gains in Chemistry generalize to Physics and Materials Science tasks without domain-specific fine-tuning
SciAgent-4B (the smaller variant) achieves +5.5%, also competitive with models many times its size.
This isn't a fluke of scale. The paper's interpretation is that scientific tool-use capability is learnable and transferable as a structural skill, independent of raw domain knowledge. A model trained to reason about tool dependencies in one scientific domain can apply that structural reasoning in another.
Scale does not solve multi-step scientific tool use. Dependency-aware training does. An 8B model fine-tuned on SciForge trajectories beats a 235B model on the same benchmark — not because it knows more chemistry, but because it understands how tools chain together.
How It Compares to Existing Scientific Benchmarks
SciAgentBench isn't the first attempt to evaluate LLMs on scientific tasks. But it occupies a distinct niche:
ScienceAgentBench (OSU NLP, ICLR 2025) focuses on data-driven scientific discovery workflows — primarily Python-based analysis pipelines. It's strong on computational workflows but lighter on the domain-specific tool ecosystems that characterize wet-lab and simulation-heavy science.
FrontierMath and GPQA evaluate scientific knowledge through question answering. No tool interaction is required or measured.
SciAgentGYM's differentiation is the combination of: (1) interactive, closed-loop tool execution — not just producing code, but running it and observing outputs — and (2) 1,780 domain-specific tools that model the actual software stacks scientists use, rather than a generic Python environment.
The closest architectural comparison is to SWE-bench for software engineering: both run agents inside real execution environments, score outcomes rather than output text, and reward correct multi-step planning over single-shot reasoning.
What Developers Should Take Away
If you're building a scientific agent or workflow — drug discovery pipelines, materials screening, biological pathway analysis — several things follow directly from this benchmark:
Don't evaluate with L1-equivalent tasks. A success rate of 60% on tasks of three or fewer steps is a ceiling, not a floor. Measure the workflows your production system actually runs: if they have 6+ interdependent tool calls, test them explicitly.
Dependency order matters as much as tool selection. Most agent frameworks (LangGraph, AutoGen, OpenAI Agents SDK, PydanticAI) can invoke tools in the right sequence if instructed correctly — but this requires that the model actually understands which tool outputs are prerequisites for which tool inputs. System prompt engineering alone isn't sufficient for complex dependency chains.
Fine-tuning on structured trajectories is underexplored. The SciForge result suggests that tool-sequencing is a teachable skill. If you're building domain-specific agents at scale, generating dependency-graph-aware training data and fine-tuning a smaller model may produce more reliable workflows than prompting a frontier model with instructions.
Track intermediate failures, not just terminal outcomes. The paper's finding that cascading step failures cause the L1→L3 drop means that coarse-grained end-task metrics hide where your agent actually breaks. Instrument each tool call separately.
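Per-step instrumentation can be as light as a wrapper that records the outcome of each call. Below is a minimal stdlib sketch; `StepRecord`, `run_instrumented`, and the toy steps are hypothetical and stand in for whatever tool-calling layer your agent framework provides.

```python
import time
from dataclasses import dataclass

@dataclass
class StepRecord:
    step: int
    tool: str
    ok: bool
    seconds: float
    error: str = ""

def run_instrumented(steps, initial):
    """Run (name, fn) steps in order, feeding each output to the next and logging each call."""
    records, value = [], initial
    for i, (name, fn) in enumerate(steps, start=1):
        start = time.perf_counter()
        try:
            value = fn(value)
        except Exception as exc:
            elapsed = time.perf_counter() - start
            records.append(StepRecord(i, name, False, elapsed, repr(exc)))
            return None, records  # stop at the first failing step
        records.append(StepRecord(i, name, True, time.perf_counter() - start))
    return value, records

# Toy stand-ins for real tools.
steps = [
    ("parse", lambda s: s.strip()),
    ("to_number", float),  # raises ValueError on non-numeric input
    ("double", lambda x: 2 * x),
]

result, log = run_instrumented(steps, " not-a-number ")
first_failure = next((r for r in log if not r.ok), None)
print(result, first_failure.step, first_failure.tool)  # None 2 to_number
```

Aggregating these records by step index or tool name shows where chains break, which an end-task pass/fail number cannot.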
Getting Started with SciAgentGYM
The benchmark environment is open source at github.com/CMarsRover/SciAgentGYM. The repository includes the full tool suite, the benchmark task set, and the evaluation harness.
To run your own model against SciAgentBench, the general setup involves:
```shell
git clone https://github.com/CMarsRover/SciAgentGYM
cd SciAgentGYM
pip install -r requirements.txt
```
The benchmark requires domain-specific Python packages (RDKit for Chemistry, PySCF or equivalent for Physics, pymatgen for Materials Science) alongside an LLM API key. The README documents which tools map to which packages. Running a full evaluation sweep across all 259 tasks against a frontier model incurs real API costs — the paper's evaluation used GPT-5, Claude-Sonnet-4.5, DeepSeek-R1, and Qwen3-235B.
For development and debugging, the SciAgentBench tasks include L1 subsets that run on shorter tool chains — a reasonable starting point before scaling to full L2/L3 evaluation.
FAQ
Q: Is SciAgentGYM only relevant for actual science applications?
No. The benchmark is a proxy for any workflow where tool calls have strict dependency ordering and intermediate outputs are consumed by downstream steps. Financial modeling pipelines, data engineering workflows, and complex DevOps automation all exhibit the same structural challenge that makes L3 science tasks hard.
Q: How does SciForge compare to standard instruction fine-tuning?
Standard instruction fine-tuning teaches a model "here's a task, here's the output." SciForge fine-tuning teaches a model "here's the tool dependency graph, here's how trajectories should flow through it." The dependency-aware approach produces significantly better performance on long-horizon tasks because the model learns causal ordering, not just output format.
Q: Which model performed best overall on SciAgentBench?
The paper evaluated GPT-5, Claude-Sonnet-4.5, DeepSeek-R1, and Qwen3-235B. Among frontier models, GPT-5 achieved a 60.6% success rate on L1 tasks — but even that best-in-class performance fell to 30.9% on L3. SciAgent-8B (fine-tuned via SciForge) showed notably better long-horizon resilience than the frontier models in the paper's comparisons.
Q: Can I add my own tools to the environment?
Yes. SciAgentGYM's design allows domain-specific tool registration. The evaluation infrastructure routes tool calls through a standardized interface, so new tools that follow the input/output schema can be added without modifying the core framework.
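To show what "a standardized interface" can mean in practice, here is a generic registry pattern. It is explicitly not the SciAgentGYM API: the class, decorator, and tool below are hypothetical, so check the repository's README for the real registration mechanism.

```python
class ToolRegistry:
    """Hypothetical registry; not the interface used by the SciAgentGYM repository."""

    def __init__(self):
        self._tools = {}

    def register(self, name, required_args):
        """Decorator that registers a function under a tool name with its required arguments."""
        def decorator(fn):
            if name in self._tools:
                raise ValueError(f"tool already registered: {name}")
            self._tools[name] = (fn, frozenset(required_args))
            return fn
        return decorator

    def call(self, name, **kwargs):
        """Validate arguments against the declared schema, then dispatch."""
        fn, required = self._tools[name]
        missing = required.difference(kwargs)
        if missing:
            raise TypeError(f"{name} missing arguments: {sorted(missing)}")
        return fn(**kwargs)

registry = ToolRegistry()

@registry.register("water_mass_grams", required_args=["n_moles"])
def water_mass_grams(n_moles: float) -> float:
    # 18.015 g/mol is the molar mass of water.
    return 18.015 * n_moles

print(round(registry.call("water_mass_grams", n_moles=2.0), 2))  # 36.03
```

The point of the pattern is that the agent only ever sees `call(name, **kwargs)`, so adding a tool does not touch the dispatch code.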
Q: Is 259 tasks enough to be statistically meaningful?
For tool-use benchmarks that require closed-loop execution, 259 tasks is actually substantial — each task requires multiple execution steps and domain-expert validation. SWE-bench Verified (the gold standard for coding agents) has 500 tasks; SciAgentBench's 259 tasks with 1,134 sub-questions provide granular scoring at the sub-question level that single-outcome benchmarks don't.
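The difference between sub-question scoring and single-outcome scoring is easy to see on toy data. The results below are invented for illustration, and counting a task as passed only when all of its sub-questions are correct is an assumption, not the paper's stated aggregation rule.

```python
# Hypothetical per-task results: one boolean per sub-question.
results = {
    "task_a": [True, True, False, True],
    "task_b": [True, True],
    "task_c": [False, False, True],
}

total_sub_questions = sum(len(v) for v in results.values())
correct_sub_questions = sum(sum(v) for v in results.values())

sub_question_accuracy = correct_sub_questions / total_sub_questions
# Strict task-level score: a task passes only if every sub-question is correct.
task_accuracy = sum(all(v) for v in results.values()) / len(results)

print(f"{sub_question_accuracy:.3f} {task_accuracy:.3f}")  # 0.667 0.333
```

Here the same runs yield 67% at the sub-question level but 33% at the task level, which is why the granularity of the metric matters when comparing benchmarks.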
Key Takeaways
- SciAgentGYM (arXiv:2602.12984) is the first benchmark to evaluate LLMs on multi-step scientific tool-use through closed-loop interaction, using 1,780 real domain-specific tools across Physics, Chemistry, Materials Science, and Life Sciences.
- Even GPT-5 drops from 60.6% on simple tasks (L1) to 30.9% on long-horizon tasks (L3) — a degradation pattern shared by all tested frontier models.
- Tool use benefits Chemistry (+7.0%) and Life Sciences (+8.4%) more than Physics (+2.5%), reflecting where parametric knowledge falls short.
- SciForge, a dependency-graph-based data synthesis method, enables an 8B fine-tuned model (SciAgent-8B) to outperform the much larger Qwen3-VL-235B-Instruct, with a +6.7% improvement over its base model and cross-domain transfer.
- For developers: measure tool-call success at each intermediate step, not just end-task outcomes; fine-tuning on dependency-structured trajectories is an underused lever for scientific agents.
The benchmark and environment are open at github.com/CMarsRover/SciAgentGYM. If your agent needs to navigate a real scientific tool chain, this is the evaluation suite to run it against before claiming production readiness.