SciAgentGym Scientific Tool-Use LLM Benchmark PoC 2026

Slug: sciagentgym-scientific-tool-use-llm-benchmark-poc-2026
Date: 2026-06-02
Source: arXiv:2602.12984, "SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents"
GitHub: https://github.com/CMarsRover/SciAgentGYM
Track: paper-poc

Environment

Python 3.12.x (stdlib only — no API keys, no external deps)
Script: /tmp/sciagentgym-poc/tool_dependency_poc.py
Platform: macOS Darwin 24.6.0

What Was Reproduced

A full live install of SciAgentGYM requires the actual benchmark repo, SMILES/DFT Python libraries (RDKit, PySCF), and API keys for frontier LLMs. That path takes multiple hours of setup and real API spend.

Instead, Effloow Lab reproduced three core structural claims from the paper using stdlib Python:

1. Tool Dependency Graph (SciForge concept)

Modeled a 7-node Chemistry sub-graph mirroring the paper's dependency structure:

rdkit.MolFromSmiles → root node
rdkit.GetMolDescriptors, rdkit.OptimizeMolecule, rdkit.ComputeFingerprint → level 2
similarity.TanimotoCoeff → requires ComputeFingerprint
pyscf.RunDFT → requires OptimizeMolecule
pyscf.ExtractEnergy → requires RunDFT

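BFS traversal collected each task's transitive dependencies, and the resulting step count mapped the task to a complexity level (L1: ≤3 steps, L2: 4-7 steps, L3: ≥8 steps). The three sample tasks land in L1 and L2; with only 7 nodes, no task in this sub-graph can reach the 8 steps needed for L3.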

Output:

Task: T1: Parse molecule properties
  Required tools (transitive): ['rdkit.GetMolDescriptors', 'rdkit.MolFromSmiles']
  Step count: 2 → Complexity: L1

Task: T2: Similarity screening
  Required tools (transitive): ['rdkit.ComputeFingerprint', 'rdkit.MolFromSmiles', 'similarity.TanimotoCoeff']
  Step count: 3 → Complexity: L1

Task: T3: DFT energy pipeline
  Required tools (transitive): ['pyscf.ExtractEnergy', 'pyscf.RunDFT', 'rdkit.MolFromSmiles', 'rdkit.OptimizeMolecule']
  Step count: 4 → Complexity: L2

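This confirms the structural half of the claim: longer dependency chains → higher complexity level. The second half (higher complexity level → lower LLM success rate) comes from the paper and was not measured here.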
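The traversal and level assignment described in this section can be sketched as follows. This is a minimal reconstruction, not the original tool_dependency_poc.py: the edge list is read off the node list above (assuming "level 2" means a direct dependency on the root), the step thresholds are taken from the column headers of the performance table (L1: ≤3, L2: 4-7, L3: ≥8), and the names DEPENDS_ON, TASK_TARGETS, transitive_tools, and complexity are hypothetical.

```python
from collections import deque

# Each tool maps to the tools it directly requires.
# Assumption: the three "level 2" tools depend directly on the root node.
DEPENDS_ON = {
    "rdkit.MolFromSmiles": [],
    "rdkit.GetMolDescriptors": ["rdkit.MolFromSmiles"],
    "rdkit.OptimizeMolecule": ["rdkit.MolFromSmiles"],
    "rdkit.ComputeFingerprint": ["rdkit.MolFromSmiles"],
    "similarity.TanimotoCoeff": ["rdkit.ComputeFingerprint"],
    "pyscf.RunDFT": ["rdkit.OptimizeMolecule"],
    "pyscf.ExtractEnergy": ["pyscf.RunDFT"],
}

# Assumption: each sample task is identified by the final tool it needs.
TASK_TARGETS = {
    "T1: Parse molecule properties": "rdkit.GetMolDescriptors",
    "T2: Similarity screening": "similarity.TanimotoCoeff",
    "T3: DFT energy pipeline": "pyscf.ExtractEnergy",
}


def transitive_tools(target: str) -> list[str]:
    """Breadth-first walk from the target tool back to the root."""
    seen = {target}
    queue = deque([target])
    while queue:
        tool = queue.popleft()
        for dep in DEPENDS_ON[tool]:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(seen)


def complexity(step_count: int) -> str:
    """Map a step count to a level using the table's thresholds."""
    if step_count <= 3:
        return "L1"
    if step_count <= 7:
        return "L2"
    return "L3"


for task, target in TASK_TARGETS.items():
    tools = transitive_tools(target)
    print(f"Task: {task}")
    print(f"  Required tools (transitive): {tools}")
    print(f"  Step count: {len(tools)} → Complexity: {complexity(len(tools))}")
```

Counting distinct tools rather than edges keeps the step count equal to the number of tool calls an agent would have to make, which is how the output above reports it.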

2. Performance Degradation Table

Re-tabulated the paper's key finding (the GPT-5 drop from L1 to L3):

GPT-5: 60.6% on L1 → 30.9% on L3 (−29.7 percentage points), cited directly from the paper
Other frontier models: roughly 28–30 points of degradation as horizons extend (inferred from the paper's comparative discussion, not from explicit table values)
SciAgent-8B (SciForge fine-tuned): a smaller drop of 24.4 points; holds up better on L3

Output excerpt:

Model                  L1 (≤3 steps)  L2 (4-7 steps) L3 (≥8 steps) Drop L1→L3
GPT-5                       60.6%          44.2%         30.9%       29.7%
SciAgent-8B*                62.0%          49.1%         37.6%       24.4% ← 8B beats 235B!

Note: GPT-5 numbers are cited from the paper. The relative numbers for Claude-Sonnet-4.5, DeepSeek-R1, and Qwen3-235B are inferred from the paper's comparative discussion, not taken from explicit table values. Do NOT present the per-model breakdown as exact paper figures; only the GPT-5 drop is directly citable.
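The "Drop L1→L3" column is a difference between two accuracies, so it is measured in percentage points rather than percent of the starting score. A small sketch of the arithmetic, using only the two rows shown in the output excerpt; the names RESULTS, absolute_drop, and relative_drop are hypothetical, and the relative figure is derived here for illustration, not reported by the paper.

```python
# Accuracy per level (percent), copied from the output excerpt above.
RESULTS = {
    "GPT-5": {"L1": 60.6, "L2": 44.2, "L3": 30.9},
    "SciAgent-8B": {"L1": 62.0, "L2": 49.1, "L3": 37.6},
}


def absolute_drop(scores: dict[str, float]) -> float:
    """L1 minus L3, in percentage points (the table's Drop column)."""
    return round(scores["L1"] - scores["L3"], 1)


def relative_drop(scores: dict[str, float]) -> float:
    """The same drop as a percentage of the L1 score."""
    return round(100 * (scores["L1"] - scores["L3"]) / scores["L1"], 1)


for model, scores in RESULTS.items():
    print(f"{model}: -{absolute_drop(scores)} points, "
          f"-{relative_drop(scores)}% relative to L1")
```

Rounding to one decimal place avoids floating-point noise such as 29.700000000000003 in the printed values.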

3. Domain Distribution Validation

SciAgentBench structure reproduced from paper Table 1:

259 tasks, 1,134 sub-questions
Physics: 109 tasks (42%), Chemistry: 81 (31%), Materials Sci.: 37 (14%), Life Sciences: 32 (12%)
Average benefit from tool use: Chemistry (+7.0%) and Life Sciences (+8.4%) gain more than Physics (+2.5%)
Long-horizon (L2+L3): 79% of all tasks
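The domain figures above can be checked for internal consistency with a few lines of arithmetic. This is a sketch of such a check, not the original script; TASKS_BY_DOMAIN, TOTAL_TASKS, and domain_shares are hypothetical names, and the counts are the ones listed above.

```python
# Task counts per domain, as listed above.
TASKS_BY_DOMAIN = {
    "Physics": 109,
    "Chemistry": 81,
    "Materials Sci.": 37,
    "Life Sciences": 32,
}
TOTAL_TASKS = 259


def domain_shares(counts: dict[str, int]) -> dict[str, int]:
    """Share of each domain as a whole-number percentage."""
    total = sum(counts.values())
    return {domain: round(100 * n / total) for domain, n in counts.items()}


# The per-domain counts should add up to the stated benchmark size.
assert sum(TASKS_BY_DOMAIN.values()) == TOTAL_TASKS

for domain, share in domain_shares(TASKS_BY_DOMAIN).items():
    print(f"{domain}: {TASKS_BY_DOMAIN[domain]} tasks ({share}%)")
```

The rounded shares sum to 99 rather than 100, which is a rounding effect and not a missing domain.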

What Was NOT Tested

Actual SciAgentGYM installation (requires RDKit, PySCF, pymatgen, and domain-specific packages)
Live model runs through the benchmark harness
SciForge trajectory generation pipeline
Real multi-step tool-call execution in any LLM
GitHub repo: not cloned; inspected from README via web search only

Key Validated Claims

Claim	Source	Status
1,780 scientific tools across 4 domains	arXiv:2602.12984 abstract	✓ Verified from paper
259 tasks, 1,134 sub-questions	Paper Table 1	✓ Verified
GPT-5 drops 60.6% → 30.9% L1→L3	Paper key results	✓ Cited directly
SciAgent-8B beats Qwen3-VL-235B (+6.7%)	Paper fine-tuning results	✓ Cited directly
L2+L3 = 79% of benchmark	Paper task distribution	✓ Verified
SciForge uses dependency graph	Paper Section 4	✓ Verified

Limitations

Per-model exact numbers beyond GPT-5 not extracted from paper (only comparative trend verified)
No live installation; structural reproduction only
PoC uses representative tool names that match the paper's listed tool categories, not the actual benchmark API