Can an AI Agent Finish Orders When a Tool Fails?
If your AI agent updates orders, tickets, invoices, or CRM records, the expensive failure is not a dramatic model mistake. It is a boring system hiccup: storage is briefly unavailable, the agent must decide whether to retry, and your customer still expects the job to finish.
So instead of writing another opinion piece about whether AI agents are "ready for business," we built one, gave it a real job, broke its tools on purpose, and recorded every run. This page is the record — written so you don't need an engineering background to follow it, with the full technical detail at the bottom for your team.
The job we gave the agent
Picture a small but real back-office task. The agent receives 8 order records. Three of them are bad data — a missing amount, a negative amount, and a zero amount. The agent has to:
- Read the orders,
- Work out which are valid and what the correct total is,
- Save the summary to a storage system.
There is exactly one right answer (5 valid orders, 3 invalid, total of 994.49), so there's no room for "close enough." The agent either gets it exactly right or it doesn't.
Then we made it harder. In half of the test runs, we sabotaged the storage system: the first time the agent tried to save, storage replied with a temporary failure — the kind of brief outage every real system has. We never told the agent this might happen, and we never told it to retry. We wanted to see what it would do on its own.
What actually happened
We ran the task 8 times: 4 normal runs and 4 runs with the sabotaged storage.
| Run | Conditions | What the agent did | Right answer? | Time |
|---|---|---|---|---|
| 1 | normal | read → save | yes | 7.8s |
| 2 | normal | read → save | yes | 7.1s |
| 3 | normal | read → save | yes | 5.8s |
| 4 | normal | read → save | yes | 4.4s |
| 5 | storage breaks once | read → save fails → tried again, saved | yes | 5.8s |
| 6 | storage breaks once | read → save fails → tried again, saved | yes | 6.0s |
| 7 | storage breaks once | read → save fails → tried again, saved | yes | 6.1s |
| 8 | storage breaks once | read → save fails → tried again, saved | yes | 5.8s |
Three things worth noticing, in plain terms:
It got the answer exactly right, every time. All 8 runs saved precisely 5 valid, 3 invalid, 994.49 — including correctly rejecting the missing, negative, and zero amounts. Not approximately right. Exactly right.
When storage broke, it just tried again — sensibly. In all 4 sabotaged runs, the agent saw the failure and immediately retried the save with the same correct numbers. It didn't panic, didn't start the task over, didn't ask a human for help, and didn't do anything weird. One failure, one clean retry, done.
Recovery isn't free, and now we know what it costs. The runs where storage broke consumed 52% more AI processing (1,633 tokens vs 1,076 on average). In everyday terms: one brief system hiccup added about half of a normal run's cost on top of the run itself (roughly 557 extra tokens). If several of your systems are flaky, every failure adds a charge like this one, and the charges add up. That's a number to bring to your automation budget conversation, and it's the kind of number you only get by actually running things.
Can this survive your workflow?
The task above is small on purpose, but the failure pattern is the one your business actually faces. The same "tool briefly fails mid-task" moment shows up in:
- Order processing — payment or inventory systems time out mid-update
- Support tickets — the helpdesk API rejects a write during peak hours
- CRM updates — a record is locked when the agent tries to save
- Billing and admin — an invoice system returns a temporary error
- Internal workflows — any automation that touches more than one system
Whether an agent survives these moments is exactly the kind of claim that sounds good in a vendor demo and falls apart — or doesn't — in your real workflow. You don't have to guess. Send us the one claim you or your buyers doubt most, and we'll run it in a sandbox the same way: real executions, failures included, and a record you can quote. That's what a Proof Sprint is.
What this test does NOT prove
We keep this section in plain view because it's what makes the rest believable.
- 8 runs show the behavior exists and is consistent — they don't make it a guarantee. We report what happened, not a reliability percentage.
- We broke storage politely. The error message said it was temporary and safe to retry. Vague errors, hard crashes, or half-finished saves are different tests (and good candidates for a follow-up).
- This was a simple task. Two systems, one decision rule. Your workflow chains more steps, and longer chains have more ways to fail.
- One specific AI model. The results cover the exact model we ran (named below) and say nothing about others.
- The agent's retry instinct comes from a default setting. Change that setting carelessly and one hiccup kills the whole job instead — your engineers will want the appendix for this one.
For your engineers
Everything needed to verify or reproduce this proof.
Environment: model gpt-5.5-2026-04-23 (OpenAI API), openai-agents 0.17.5, openai 2.41.1, Python 3.12, max_turns=10, default tool-error handling. Test date 2026-06-13 (UTC).
Setup: Two function tools — read_orders() returns a fixed JSON fixture of 8 orders; save_result(valid_count, invalid_count, total_amount) validates against ground truth (5 / 3 / 994.49) at save time. In the failure scenario, save_result raises RuntimeError("storage backend temporarily unavailable (transient), please retry") on its first invocation per run, then works. The SDK's default failure_error_function surfaces the error text to the model as tool output — that default is what enables unprompted retry; replacing it with a raising handler turns one transient error into a dead run. Agent instructions are two sentences with no retry coaching.
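The setup above can be sketched without the SDK or an API key. This is a minimal illustration, not the published harness: the order amounts are hypothetical (only the counts and the 994.49 total are stated in this report), the Storage class and the summarize and run_once helpers are names invented here, and run_once is a deterministic stand-in for the model's own decision to retry.

```python
from decimal import Decimal

# Ground truth from the report: 5 valid, 3 invalid, total 994.49.
EXPECTED = (5, 3, Decimal("994.49"))

# Hypothetical fixture: the real amounts are not published, only counts and total.
ORDERS = [
    {"id": "A1", "amount": "120.00"},
    {"id": "A2", "amount": "250.50"},
    {"id": "A3", "amount": None},      # missing amount
    {"id": "A4", "amount": "89.99"},
    {"id": "A5", "amount": "-15.00"},  # negative amount
    {"id": "A6", "amount": "310.00"},
    {"id": "A7", "amount": "0.00"},    # zero amount
    {"id": "A8", "amount": "224.00"},
]


def read_orders():
    """Return the fixed fixture, as the read tool does."""
    return list(ORDERS)


def summarize(orders):
    """Decision rule: an order is valid only if its amount is present and > 0."""
    valid = [
        Decimal(o["amount"])
        for o in orders
        if o["amount"] is not None and Decimal(o["amount"]) > 0
    ]
    return len(valid), len(orders) - len(valid), sum(valid, Decimal("0"))


class Storage:
    """Save tool whose first `fail_first` calls raise a transient error."""

    def __init__(self, fail_first=0):
        self.fail_first = fail_first
        self.calls = 0
        self.saved = None

    def save_result(self, valid_count, invalid_count, total_amount):
        self.calls += 1
        if self.calls <= self.fail_first:
            # Error text quoted from the report's failure scenario.
            raise RuntimeError(
                "storage backend temporarily unavailable (transient), please retry"
            )
        self.saved = (valid_count, invalid_count, total_amount)
        return self.saved == EXPECTED  # validate against ground truth at save time


def run_once(storage, max_attempts=2):
    """Stand-in for the agent: read, summarize, save, retry after an error."""
    summary = summarize(read_orders())
    trace = ["read"]
    for _ in range(max_attempts):
        try:
            ok = storage.save_result(*summary)
            trace.append("save")
            return ok, trace
        except RuntimeError:
            trace.append("save fails")
    return False, trace
```

Calling run_once(Storage(fail_first=1)) reproduces the sabotaged scenario's shape: one failed save followed by one successful save with the same numbers. Amounts are held as Decimal so the total compares exactly against 994.49 instead of depending on float rounding.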
Operational notes that transfer: write tool error messages for the model, not just logs (an opaque Error 500 gives it little to reason with); size max_turns for the happy path plus at least one retry per fragile tool; meter usage per run with the tool-call sequence attached, so a flaky tool shows up as a named cost (~550 tokens per recovery here) instead of an anonymous spend increase.
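The metering note can be sketched as a small grouping step. This is an illustration under assumptions: the record fields (sequence, tokens) and both function names are hypothetical, and the per-run token values below are the published scenario averages (1,076 and 1,633) used as placeholders, because per-run token counts are not listed in this report.

```python
from collections import defaultdict
from statistics import mean


def mean_tokens_by_sequence(runs):
    """Group runs by their tool-call sequence and average token usage per group."""
    groups = defaultdict(list)
    for run in runs:
        groups[tuple(run["sequence"])].append(run["tokens"])
    return {seq: mean(tokens) for seq, tokens in groups.items()}


def recovery_cost(baseline_tokens, recovery_tokens):
    """Return (extra tokens, extra percent rounded to a whole number)."""
    extra = recovery_tokens - baseline_tokens
    return extra, round(100 * extra / baseline_tokens)


HAPPY = ("read", "save")
RECOVERED = ("read", "save fails", "save")

# Placeholder records: scenario averages stand in for the real per-run counts.
runs = (
    [{"sequence": HAPPY, "tokens": 1076} for _ in range(4)]
    + [{"sequence": RECOVERED, "tokens": 1633} for _ in range(4)]
)

by_sequence = mean_tokens_by_sequence(runs)
extra, percent = recovery_cost(by_sequence[HAPPY], by_sequence[RECOVERED])
```

With the published averages, extra comes out at 557 tokens and percent at 52, which matches the "52% more" and "~550 tokens per recovery" figures. Keying the report on the tool-call sequence is what turns a flaky tool into a named line item.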
Reproduce:
python3 -m venv venv
./venv/bin/pip install openai-agents==0.17.5
OPENAI_API_KEY=... ./venv/bin/python harness.py --runs-per-scenario 4
Evidence record: every run is an append-only JSON artifact (tool-call sequence, saved values vs ground truth, token usage, latency) hashed into a SHA-256 manifest at execution time. Failed runs stay in the ledger by design — success-only reporting would be detectable. Totals: 10,837 tokens across 8 runs (10,156 input / 681 output). The public evidence note with the full run table and artifact hashes is at /lab-runs/openai-agents-sdk-flagship.
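One way to picture the manifest step is the sketch below. It is an assumption-laden illustration, not the studio's actual format: the artifact fields, the run_id key, and the canonical-JSON choice (sorted keys, compact separators) are all hypothetical. Only the use of SHA-256 over per-run JSON artifacts comes from the report.

```python
import hashlib
import json


def artifact_digest(artifact):
    """Hash a run artifact using a canonical JSON encoding (sorted keys, no spaces)."""
    payload = json.dumps(artifact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_manifest(artifacts):
    """Map each run id to the digest of its artifact at execution time."""
    return {a["run_id"]: artifact_digest(a) for a in artifacts}


def changed_runs(artifacts, manifest):
    """Return the ids of artifacts whose current digest no longer matches the manifest."""
    return [
        a["run_id"]
        for a in artifacts
        if manifest.get(a["run_id"]) != artifact_digest(a)
    ]


# Hypothetical artifacts shaped after the fields the report lists.
artifacts = [
    {"run_id": "run-1", "sequence": ["read", "save"], "saved": [5, 3, "994.49"]},
    {"run_id": "run-5", "sequence": ["read", "save fails", "save"],
     "saved": [5, 3, "994.49"]},
]
manifest = build_manifest(artifacts)
```

A reader holding the manifest can rerun changed_runs against the published artifacts. Any edit made after the fact, such as trimming a failed save out of a sequence, changes that run's digest and shows up in the returned list.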
This is a flagship report from Effloow Proof Studio. We run AI tools' claims in controlled sandboxes for vendors who need proof their buyers can check: reproducible harness, full run record including failures, and a claim table your sales team can quote. Founding sprints: $1,000 for the first 3 vendors.