OpenAI Agents SDK Flagship

Report: /articles/openai-agents-sdk-tool-failure-recovery-proof-2026
Claim: An OpenAI Agents SDK agent on gpt-5.5-2026-04-23 completes a three-step tool workflow and recovers from an injected transient tool failure without human intervention.
Claim scope: agent-reliability (bound to the exact model above — no inference to other models)
Test date: 2026-06-13 (UTC)
Environment: openai-agents 0.17.5, openai 2.41.1, Python 3.12, max_turns=10, default tool-error handling
Evidence level: sandbox-executed

Run record

8 runs across 2 scenarios (4 clean, 4 transient-failure). Every run saved exactly the ground-truth values (valid_count: 5, invalid_count: 3, total_amount: 994.49).

Run	Scenario	Tool sequence	Correct	Latency	Tokens
001	clean	read → save	yes	7.80s	1,076
002	clean	read → save	yes	7.11s	1,076
003	clean	read → save	yes	5.83s	1,076
004	clean	read → save	yes	4.39s	1,077
005	transient-failure	read → save → save	yes (recovered)	5.83s	1,632
006	transient-failure	read → save → save	yes (recovered)	5.96s	1,636
007	transient-failure	read → save → save	yes (recovered)	6.12s	1,632
008	transient-failure	read → save → save	yes (recovered)	5.82s	1,633

Totals: 10,837 tokens (10,156 input / 681 output). Transient-failure runs used a mean of 1,633 tokens versus 1,076 for clean runs, about 52% more tokens per run with one recovered tool failure.
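
The per-scenario means and the overhead figure can be recomputed from the Tokens column of the run table. A minimal sketch; the variable names are illustrative, and the token counts are copied from the table above.

```python
from statistics import mean

# Token counts per run, copied from the run record table.
clean_tokens = [1076, 1076, 1076, 1077]      # runs 001-004
failure_tokens = [1632, 1636, 1632, 1633]    # runs 005-008

clean_mean = mean(clean_tokens)
failure_mean = mean(failure_tokens)

# Relative extra tokens for a run that includes one recovered failure.
overhead = failure_mean / clean_mean - 1

print(f"clean mean:   {clean_mean:.2f}")    # 1076.25
print(f"failure mean: {failure_mean:.2f}")  # 1633.25
print(f"overhead:     {overhead:.1%}")      # 51.8%
```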

Artifact manifest (SHA-256)

Each run is stored as an append-only JSON artifact recorded at execution time. The hashes below come from the manifest ledger; a failed run would also remain in this record.

Artifact	SHA-256 (first 16 hex characters)	Recorded at (UTC)
run-001.json	`abf0a4ab72107501…`	2026-06-13T00:01:48Z
run-002.json	`a19bed90a2418b3e…`	2026-06-13T00:01:55Z
run-003.json	`c381e559db6b62cf…`	2026-06-13T00:02:01Z
run-004.json	`222b8d1b2890adee…`	2026-06-13T00:02:05Z
run-005.json	`23bcf3fa7ebc27eb…`	2026-06-13T00:02:11Z
run-006.json	`a3379cf8a275744a…`	2026-06-13T00:02:17Z
run-007.json	`4451ca7ace244811…`	2026-06-13T00:02:23Z
run-008.json	`c99b6cf83593e636…`	2026-06-13T00:02:29Z

Manifest integrity warnings at time of publication: none.
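
A reader holding the artifact files could check them against the truncated digests listed above roughly as follows. This is a sketch under assumptions: the function names and the artifact directory are hypothetical, and a 16-character prefix match is weaker than comparing the full 64-character digest from the ledger.

```python
import hashlib
from pathlib import Path

# Truncated digests as listed in the manifest table.
MANIFEST = {
    "run-001.json": "abf0a4ab72107501",
    "run-002.json": "a19bed90a2418b3e",
    "run-003.json": "c381e559db6b62cf",
    "run-004.json": "222b8d1b2890adee",
    "run-005.json": "23bcf3fa7ebc27eb",
    "run-006.json": "a3379cf8a275744a",
    "run-007.json": "4451ca7ace244811",
    "run-008.json": "c99b6cf83593e636",
}


def sha256_hex(path: Path) -> str:
    """Return the full SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(path: Path, expected_prefix: str) -> bool:
    """Check a file's digest against a truncated manifest entry."""
    return sha256_hex(path).startswith(expected_prefix.lower())


def verify(artifact_dir: Path) -> dict[str, bool]:
    """Map each artifact name to True if present and matching, else False."""
    results = {}
    for name, prefix in MANIFEST.items():
        path = artifact_dir / name
        results[name] = path.is_file() and matches_prefix(path, prefix)
    return results
```

Calling `verify(Path("artifacts"))` (the directory name is an assumption) returns one boolean per artifact; any `False` would correspond to an integrity warning.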

Limitations

N=8 on a simple two-tool task with one clean, retryable failure mode, one model snapshot, and default SDK error handling. The results are counts, not rates; see the full report for the complete limitations list.
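
To illustrate why these results are reported as counts, the sketch below computes the one-sided exact (Clopper-Pearson) lower bound on the success probability when every trial succeeds. It assumes independent, identically distributed runs, which the report does not claim; the function name and the 95% confidence level are illustrative choices.

```python
def lower_bound_all_successes(n: int, confidence: float = 0.95) -> float:
    """One-sided exact lower bound on success probability after n of n successes.

    With zero failures the bound p satisfies p**n = 1 - confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    alpha = 1.0 - confidence
    return alpha ** (1.0 / n)


for label, n in [("all runs", 8), ("per scenario", 4)]:
    bound = lower_bound_all_successes(n)
    print(f"{label}: {n}/{n} succeeded, 95% lower bound on success rate = {bound:.1%}")
```

Under these assumptions, 8 of 8 successes is compatible with a true success rate of roughly 69%, and 4 of 4 with roughly 47%, so the sample is too small to support a reliability rate.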