OpenAI Agents SDK Flagship
Evidence notes document the bounded local or source-based checks behind an Effloow article. They are not product endorsements, legal advice, or benchmark claims.
Proof Studio Evidence Note: OpenAI Agents SDK Tool-Failure Recovery
- Report: /articles/openai-agents-sdk-tool-failure-recovery-proof-2026
- Claim: An OpenAI Agents SDK agent on `gpt-5.5-2026-04-23` completes a three-step tool workflow and recovers from an injected transient tool failure without human intervention.
- Claim scope: agent-reliability (bound to the exact model above; no inference to other models)
- Test date: 2026-06-13 (UTC)
- Environment: `openai-agents` 0.17.5, `openai` 2.41.1, Python 3.12, `max_turns=10`, default tool-error handling
- Evidence level: sandbox-executed
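For orientation, an injected transient failure of the kind the claim describes can be sketched in plain Python. This is an illustrative assumption, not the harness used for these runs: `FlakySaveTool`, `TransientToolError`, `fail_first_n`, and `run_save_with_retries` are hypothetical names, and the retry loop only stands in for a model that sees the tool's error message and chooses to call the tool again.

```python
from dataclasses import dataclass, field


class TransientToolError(RuntimeError):
    """Hypothetical error type standing in for a retryable tool failure."""


@dataclass
class FlakySaveTool:
    """Save tool that fails its first `fail_first_n` calls, then succeeds."""

    fail_first_n: int = 1
    calls: int = 0
    saved: list = field(default_factory=list)

    def __call__(self, summary: dict) -> str:
        self.calls += 1
        if self.calls <= self.fail_first_n:
            raise TransientToolError("temporary storage error, please retry")
        self.saved.append(dict(summary))
        return "saved"


def run_save_with_retries(tool: FlakySaveTool, summary: dict, max_attempts: int = 3) -> list:
    """Stand-in for the agent loop: returns the tool sequence that was attempted."""
    sequence = ["read"]  # the read step is assumed to have succeeded already
    for _ in range(max_attempts):
        sequence.append("save")
        try:
            tool(summary)
            return sequence
        except TransientToolError:
            continue  # an agent would see the error text and may call again
    raise RuntimeError("no successful save within max_attempts")


# Ground-truth values quoted in the run record of this note.
summary = {"valid_count": 5, "invalid_count": 3, "total_amount": 994.49}
tool = FlakySaveTool(fail_first_n=1)
sequence = run_save_with_retries(tool, summary)
print(" → ".join(sequence))
```

With one injected failure, the sketch produces the same read, save, save shape that the transient-failure runs report. In a real agent the retry decision belongs to the model rather than to a loop, which is why the note counts recoveries instead of assuming them.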
Run record
8 runs, 2 scenarios. Every run saved exactly the ground-truth values (valid_count: 5, invalid_count: 3, total_amount: 994.49).
| Run | Scenario | Tool sequence | Correct | Latency | Tokens |
|---|---|---|---|---|---|
| 001 | clean | read → save | yes | 7.80s | 1,076 |
| 002 | clean | read → save | yes | 7.11s | 1,076 |
| 003 | clean | read → save | yes | 5.83s | 1,076 |
| 004 | clean | read → save | yes | 4.39s | 1,077 |
| 005 | transient-failure | read → save → save | yes (recovered) | 5.83s | 1,632 |
| 006 | transient-failure | read → save → save | yes (recovered) | 5.96s | 1,636 |
| 007 | transient-failure | read → save → save | yes (recovered) | 6.12s | 1,632 |
| 008 | transient-failure | read → save → save | yes (recovered) | 5.82s | 1,633 |
Totals: 10,837 tokens (10,156 input / 681 output). Transient-failure runs used a mean of 1,633 tokens versus 1,076 for clean runs, a 52% overhead per recovered tool failure.
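The per-scenario means and the overhead figure can be recomputed from the token column of the run table. The following is a minimal sketch; the variable names are illustrative, and the token counts are copied from the table rows.

```python
from statistics import mean

# Token counts per run, copied from the run table.
clean_tokens = [1076, 1076, 1076, 1077]      # runs 001-004
failure_tokens = [1632, 1636, 1632, 1633]    # runs 005-008

mean_clean = mean(clean_tokens)
mean_failure = mean(failure_tokens)

# Relative extra cost of a run that hit and recovered from the injected failure.
overhead = mean_failure / mean_clean - 1

print(f"clean mean:   {mean_clean:.0f}")
print(f"failure mean: {mean_failure:.0f}")
print(f"overhead:     {overhead:.0%}")
```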
Artifact manifest (SHA-256)
Each run is an append-only JSON artifact recorded at execution time. Hashes below are from the manifest ledger; failed runs would remain in this record.
| Artifact | SHA-256 (first 16) | Recorded at |
|---|---|---|
| run-001.json | abf0a4ab72107501… | 2026-06-13T00:01:48Z |
| run-002.json | a19bed90a2418b3e… | 2026-06-13T00:01:55Z |
| run-003.json | c381e559db6b62cf… | 2026-06-13T00:02:01Z |
| run-004.json | 222b8d1b2890adee… | 2026-06-13T00:02:05Z |
| run-005.json | 23bcf3fa7ebc27eb… | 2026-06-13T00:02:11Z |
| run-006.json | a3379cf8a275744a… | 2026-06-13T00:02:17Z |
| run-007.json | 4451ca7ace244811… | 2026-06-13T00:02:23Z |
| run-008.json | c99b6cf83593e636… | 2026-06-13T00:02:29Z |
Manifest integrity warnings at time of publication: none.
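A reader holding copies of the artifacts could compare them against the truncated hashes in the manifest with a few lines of standard-library Python. This is a sketch under assumptions: the helper names `sha256_prefix` and `verify_artifact` are hypothetical, and the location of the artifact files is not stated in this note.

```python
import hashlib
from pathlib import Path


def sha256_prefix(path: Path, length: int = 16) -> str:
    """Return the first `length` hex characters of the file's SHA-256 digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large artifacts do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


def verify_artifact(path: Path, expected_prefix: str) -> bool:
    """Compare a file against a truncated hash such as those in the manifest table."""
    expected = expected_prefix.rstrip("…").lower()
    return sha256_prefix(path, len(expected)) == expected
```

A call such as `verify_artifact(Path("run-001.json"), "abf0a4ab72107501…")` would then return `True` only when the local file's digest starts with the listed prefix. A 16-character prefix covers 64 bits of the digest, so a full integrity check would need the complete hashes from the manifest ledger.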
Limitations
N=8 runs on a simple two-tool task with one clean, retryable failure mode, one model snapshot, and default SDK error handling. The results are counts, not rates; see the full report for the complete limitations list.