AI DEVELOPMENT ARTICLES · 2026-06-13 · UPDATED 2026-06-19 · BY EFFLOOW EDITORIAL · 8 MIN READ

Can an AI Agent Finish Orders When a Tool Fails?

We gave an AI agent an order-processing job, broke its storage mid-task on purpose, and recorded what happened. 8 runs, all on the record.

agent-reliability business-automation proof-studio


If your AI agent updates orders, tickets, invoices, or CRM records, the expensive failure is not a dramatic model mistake. It is a boring system hiccup: storage is briefly unavailable, the agent must decide whether to retry, and your customer still expects the job to finish.

So instead of writing another opinion piece about whether AI agents are "ready for business," we built one, gave it a real job, broke its tools on purpose, and recorded every run. This page is the record — written so you don't need an engineering background to follow it, with the full technical detail at the bottom for your team.

The job we gave the agent

Picture a small but real back-office task. The agent receives 8 order records. Three of them are bad data — a missing amount, a negative amount, and a zero amount. The agent has to:

Read the orders,
Work out which are valid and what the correct total is,
Save the summary to a storage system.

There is exactly one right answer (5 valid orders, 3 invalid, total of 994.49), so there's no room for "close enough." The agent either gets it exactly right or it doesn't.
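For readers who want the rule spelled out, the validation can be sketched in a few lines of Python. This is a minimal sketch, not the fixture used in the test: the order IDs and amounts below are invented for illustration, chosen only so that the valid ones add up to the stated total of 994.49, and the function names are hypothetical.

```python
from decimal import Decimal

# Hypothetical fixture: IDs and amounts are invented for illustration and are
# chosen only so the valid amounts sum to the article's stated total, 994.49.
ORDERS = [
    {"id": "A1", "amount": "120.00"},
    {"id": "A2", "amount": "249.99"},
    {"id": "A3", "amount": None},      # missing amount -> invalid
    {"id": "A4", "amount": "75.50"},
    {"id": "A5", "amount": "-20.00"},  # negative amount -> invalid
    {"id": "A6", "amount": "399.00"},
    {"id": "A7", "amount": "0.00"},    # zero amount -> invalid
    {"id": "A8", "amount": "150.00"},
]


def is_valid(order: dict) -> bool:
    """An order is valid only if it has a strictly positive amount."""
    amount = order.get("amount")
    return amount is not None and Decimal(amount) > 0


def summarize(orders: list[dict]) -> dict:
    """Count valid and invalid orders and total the valid amounts exactly."""
    valid = [o for o in orders if is_valid(o)]
    total = sum((Decimal(o["amount"]) for o in valid), Decimal("0"))
    return {
        "valid_count": len(valid),
        "invalid_count": len(orders) - len(valid),
        "total_amount": total,
    }


summary = summarize(ORDERS)
print(summary["valid_count"], summary["invalid_count"], summary["total_amount"])
```

Using `Decimal` rather than floats keeps the total exact to the cent, which matters when the pass condition is an exact match.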

Then we made it harder. In half of the test runs, we sabotaged the storage system: the first time the agent tried to save, storage replied with a temporary failure — the kind of brief outage every real system has. We never told the agent this might happen, and we never told it to retry. We wanted to see what it would do on its own.

What actually happened

We ran the task 8 times: 4 normal runs and 4 runs with the sabotaged storage.

Run	Conditions	What the agent did	Right answer?	Time
1	normal	read → save	yes	7.8s
2	normal	read → save	yes	7.1s
3	normal	read → save	yes	5.8s
4	normal	read → save	yes	4.4s
5	storage breaks once	read → save fails → tried again, saved	yes	5.8s
6	storage breaks once	read → save fails → tried again, saved	yes	6.0s
7	storage breaks once	read → save fails → tried again, saved	yes	6.1s
8	storage breaks once	read → save fails → tried again, saved	yes	5.8s

Three things worth noticing, in plain terms:

It got the answer exactly right, every time. All 8 runs saved precisely 5 valid, 3 invalid, 994.49 — including correctly rejecting the missing, negative, and zero amounts. Not approximately right. Exactly right.

When storage broke, it just tried again — sensibly. In all 4 sabotaged runs, the agent saw the failure and immediately retried the save with the same correct numbers. It didn't panic, didn't start the task over, didn't ask a human for help, and didn't do anything weird. One failure, one clean retry, done.

Recovery isn't free, and now we know what it costs. The runs where storage broke consumed 52% more tokens, the unit AI processing is billed in (1,633 vs 1,076 on average). In everyday terms: one brief storage hiccup added about half of a normal run's cost on top of the run. If several of your systems are flaky, each extra recovery adds to the bill. That's a number to bring to your automation budget conversation, and it's the kind of number you only get by actually running things.
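The arithmetic behind that figure is short enough to check by hand. The sketch below uses the two reported averages; the `projected_tokens` helper is a hypothetical linear estimate that assumes every recovery costs the same as the one measured here, which this test did not verify.

```python
# Average tokens per run, as reported above.
NORMAL_AVG = 1076
FAILURE_AVG = 1633

extra_tokens = FAILURE_AVG - NORMAL_AVG          # tokens added by one recovery
overhead_pct = extra_tokens / NORMAL_AVG * 100   # relative to a normal run


def projected_tokens(recoveries: int) -> int:
    """Hypothetical estimate: assumes each recovery costs the same amount."""
    return NORMAL_AVG + recoveries * extra_tokens


print(f"extra tokens per recovery: {extra_tokens}")   # 557
print(f"overhead: {overhead_pct:.0f}%")               # 52%
print(f"run with 3 recoveries: {projected_tokens(3)} tokens")
```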

Failure-mode decision table

The useful output is not "agents work." The useful output is knowing which failure pattern was actually covered and which pattern still needs its own proof run.

Failure pattern	Covered by this run?	Observed result	What to do differently before production
Clean order summary with fixed validation rules	Yes	4 of 4 clean runs saved the exact expected count and total	Keep deterministic validation in the write tool, not only in the prompt.
One clearly retryable storage failure before save	Yes	4 of 4 failure-injected runs retried once and saved the exact expected count and total	Return model-readable transient error text and leave enough turns for one retry.
Vague tool error such as `Error 500`	No	Not tested	Test opaque errors separately; do not assume the agent will infer safe retry behavior.
Partial write where storage saves some fields before failing	No	Not tested	Add idempotency keys, write-audit records, and a reconciliation step before using an agent in live operations.
Multiple downstream systems failing in sequence	No	Not tested	Run a chained workflow proof; one successful retry does not prove multi-system resilience.
Human approval after failed save	No	Not tested	Define escalation rules in the harness and test whether the agent stops rather than improvising.

That table is the adoption value of the article. A team can copy the column headings, replace the rows with its own risky failure modes, and decide what must be proven before an agent is allowed to touch production records.

Can this survive your workflow?

The task above is small on purpose, but the failure pattern is the one your business actually faces. The same "tool briefly fails mid-task" moment shows up in:

Order processing — payment or inventory systems time out mid-update
Support tickets — the helpdesk API rejects a write during peak hours
CRM updates — a record is locked when the agent tries to save
Billing and admin — an invoice system returns a temporary error
Internal workflows — any automation that touches more than one system

Whether an agent survives these moments is exactly the kind of claim that sounds good in a vendor demo and falls apart — or doesn't — in your real workflow. You don't have to guess. Send us the one claim you or your buyers doubt most, and we'll run it in a sandbox the same way: real executions, failures included, and a record you can quote. That's what a Proof Sprint is.

When to use / when to skip this pattern

Use this pattern when the workflow has a small number of high-value tool calls, clear validation rules, and failures that can be described safely to the model. It is a good fit for order checks, ticket triage, CRM field updates, and admin tasks where the system can verify the final saved values before accepting them.

Skip this pattern when the workflow cannot tolerate duplicate writes, when the tool error may expose private data, when a failed write can leave external systems in an unknown state, or when the agent has permission to take irreversible action. Those cases need stricter orchestration: idempotency, compensating transactions, human approval, and a separate failure proof.
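One of those safeguards, idempotency, can be illustrated with a small sketch. Everything here is hypothetical and was not part of the test: the in-memory `IdempotentStore`, the `key_for` helper, and the `orders-batch-1` task ID stand in for whatever your real storage system offers. The point is only that a retried save with the same key does not produce a second write.

```python
import hashlib
import json


class IdempotentStore:
    """In-memory stand-in for a storage system that deduplicates writes."""

    def __init__(self) -> None:
        self._by_key: dict[str, dict] = {}
        self.write_count = 0  # number of writes actually applied

    def save(self, idempotency_key: str, payload: dict) -> dict:
        # A repeated key returns the stored record instead of writing again.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        self._by_key[idempotency_key] = payload
        self.write_count += 1
        return payload


def key_for(task_id: str, payload: dict) -> str:
    """Derive a stable key from the task and the exact values being saved."""
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(f"{task_id}:{canonical}".encode()).hexdigest()


store = IdempotentStore()
payload = {"valid_count": 5, "invalid_count": 3, "total_amount": "994.49"}
key = key_for("orders-batch-1", payload)

store.save(key, payload)  # first attempt
store.save(key, payload)  # retry after an ambiguous failure
print(store.write_count)  # 1
```

Because the key in this sketch covers the payload, a retry that sends different values counts as a new write; whether that is the right behavior depends on the workflow.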

Also skip any rollout that treats this 8-run result as a universal reliability percentage. This proof says the exact harness survived the exact injected failure. It does not say a different model, SDK setting, prompt, tool schema, or business system will behave the same way.

What this test does NOT prove

We keep this section in plain view because it's what makes the rest believable.

8 runs show the behavior exists and is consistent — they don't make it a guarantee. We report what happened, not a reliability percentage.
We broke storage politely. The error message said it was temporary and safe to retry. Vague errors, hard crashes, or half-finished saves are different tests. We ran that kind of follow-up in our ToolMisuseBench recovery reproduction, which probes schema drift, rate limits, and fake-success replies instead.
This was a simple task. Two systems, one decision rule. Your workflow chains more steps, and longer chains have more ways to fail.
One specific AI model. The results cover the exact model we ran (named below) and say nothing about others. If you are still choosing a framework, our AI agent frameworks comparison covers the options.
The agent's retry instinct comes from a default setting. Change that setting carelessly and one hiccup kills the whole job instead — your engineers will want the appendix for this one.

The same restraint applies before a team spends budget validating a product idea. If the question is market demand rather than tool reliability, use AI output to decide what humans should validate first, not as market truth; our AI consumer research framework explains that workflow.

Decision checklist before you copy it

Before putting this retry behavior into a real workflow, answer these checks in writing:

Check	Pass condition	If it fails
Is the final answer machine-checkable?	The save tool can reject wrong counts, totals, states, or IDs.	Add validation outside the model before rollout.
Is the failure message safe to show the model?	It explains the retryable condition without exposing secrets, private records, or stack traces.	Replace raw exceptions with a sanitized `failure_error_function`.
Is the write idempotent?	A repeated save cannot duplicate orders, invoices, messages, or charges.	Add idempotency keys or block autonomous retry.
Is there a turn budget for recovery?	`max_turns` covers the happy path plus at least one retry for each fragile tool.	Raise the budget or route failure to a human.
Is the proof bound to the exact model and SDK version?	The production setup matches the tested model, SDK, tool schema, and error-handling mode.	Rerun the proof before citing this result.
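The second check, a sanitized error message, might look something like the sketch below. It is plain Python and does not import the SDK. The `(ctx, error) -> str` shape is intended to match what the SDK accepts for `failure_error_function`, but confirm that against the SDK version you run. The redaction patterns and the set of retryable exception types are hypothetical placeholders, not rules used in this test.

```python
import re

# Hypothetical patterns for text that should never reach the model.
_SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|password|token)\s*[=:]\s*\S+"),
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),  # email addresses
]

# Hypothetical choice of exception types treated as safe to retry.
RETRYABLE = (TimeoutError, ConnectionError)


def sanitized_tool_error(ctx, error: Exception) -> str:
    """Turn a raw exception into short, model-readable text.

    `ctx` is unused here; it is kept so the function has the two-argument
    shape an error handler is expected to have.
    """
    raw = str(error)
    # Keep only the first line, capped, so stack traces never leak through.
    message = raw.splitlines()[0][:200] if raw else ""
    for pattern in _SECRET_PATTERNS:
        message = pattern.sub("[redacted]", message)
    if isinstance(error, RETRYABLE):
        return f"Transient tool failure, safe to retry: {message}"
    return f"Tool failed, do not retry, escalate to a human: {message}"


# With the SDK installed, this would be passed as the tool's
# failure_error_function; that wiring is not shown or tested here.
example = sanitized_tool_error(None, TimeoutError("storage timeout password=hunter2"))
print(example)
```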

What Effloow Added

Most "are AI agents ready for business" articles are opinion. This one is an execution. The added value is the evidence, not a take:

We ran the failure, we didn't describe it — a real order-processing task with storage broken on purpose, repeated across 8 runs with the results on the record.
The evidence is append-only and checkable — every run (including failures) is kept as a hashed artifact, so the table can't be quietly success-filtered. The full record is at /lab-runs/openai-agents-sdk-flagship.
We publish the method, not just the win — exact model, SDK version, commands, token cost, and the limits this test does not cover, so you can reproduce or refute it.
We turn the run into a production checklist — the failure-mode table and rollout checklist show which safeguards need to exist before a retrying agent touches live business records.

The value is a claim you can verify against your own workflow, which is the opposite of a vendor demo.

For your engineers

Everything needed to verify or reproduce this proof.

Environment: model gpt-5.5-2026-04-23 (OpenAI API), openai-agents 0.17.5, openai 2.41.1, Python 3.12, max_turns=10, default tool-error handling. Test date 2026-06-13 (UTC).

Setup: Two function tools — read_orders() returns a fixed JSON fixture of 8 orders; save_result(valid_count, invalid_count, total_amount) validates against ground truth (5 / 3 / 994.49) at save time. In the failure scenario, save_result raises RuntimeError("storage backend temporarily unavailable (transient), please retry") on its first invocation per run, then works. The SDK's default failure_error_function surfaces the error text to the model as tool output — that default is what enables unprompted retry; replacing it with a raising handler turns one transient error into a dead run. Agent instructions are two sentences with no retry coaching.
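The failure injection described above can be approximated without the SDK. The sketch below is not the harness: `FlakyOnceStorage` and `call_tool` are hypothetical names, the rejection message on a mismatch is an assumption, and `call_tool` only imitates the idea of handing the exception text back as tool output instead of raising. The error string and the expected values are taken from the setup description.

```python
class FlakyOnceStorage:
    """Stand-in storage: optionally fails the first save, then validates."""

    EXPECTED = (5, 3, 994.49)  # ground truth stated in the article

    def __init__(self, inject_failure: bool) -> None:
        self.inject_failure = inject_failure
        self.calls = 0
        self.saved = None

    def save_result(self, valid_count: int, invalid_count: int,
                    total_amount: float) -> str:
        self.calls += 1
        if self.inject_failure and self.calls == 1:
            # Same message text the article reports for the injected failure.
            raise RuntimeError(
                "storage backend temporarily unavailable (transient), "
                "please retry"
            )
        submitted = (valid_count, invalid_count, round(total_amount, 2))
        if submitted != self.EXPECTED:
            # Assumption: a mismatch is rejected rather than stored.
            return f"rejected: {submitted} does not match expected summary"
        self.saved = submitted
        return "saved"


def call_tool(storage: FlakyOnceStorage, **kwargs) -> str:
    """Return exception text as tool output instead of raising."""
    try:
        return storage.save_result(**kwargs)
    except Exception as exc:
        return f"An error occurred while running the tool: {exc}"


storage = FlakyOnceStorage(inject_failure=True)
args = {"valid_count": 5, "invalid_count": 3, "total_amount": 994.49}
first = call_tool(storage, **args)   # transient error text, no exception
second = call_tool(storage, **args)  # identical retry succeeds
print(first)
print(second)
```

If `call_tool` let the exception propagate instead, the first call would end the run, which is the "dead run" case described above.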

Operational notes that transfer: write tool error messages for the model, not just logs (an opaque Error 500 gives it little to reason with); size max_turns for the happy path plus at least one retry per fragile tool; meter usage per run with the tool-call sequence attached, so a flaky tool shows up as a named cost (~550 tokens per recovery here) instead of an anonymous spend increase. If you are building a multi-agent system on this SDK from scratch, our OpenAI Agents SDK multi-agent tutorial walks through the setup.
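The metering note can be sketched as a grouping step: attach the tool-call sequence to each run, then average tokens per sequence. The record layout and the `cost_by_sequence` helper are hypothetical, and the per-run token values below are placeholders set to the reported scenario averages, not the real per-run figures (those are in the evidence record).

```python
from collections import defaultdict
from statistics import mean

NORMAL = ["read_orders", "save_result"]
RECOVERED = ["read_orders", "save_result", "save_result"]

# Placeholder records: token values are the reported scenario averages,
# repeated four times each, not the actual per-run numbers.
RUNS = (
    [{"tools": NORMAL, "tokens": 1076} for _ in range(4)]
    + [{"tools": RECOVERED, "tokens": 1633} for _ in range(4)]
)


def cost_by_sequence(runs: list[dict]) -> dict:
    """Average token usage for each distinct tool-call sequence."""
    groups = defaultdict(list)
    for run in runs:
        groups[" -> ".join(run["tools"])].append(run["tokens"])
    return {sequence: mean(tokens) for sequence, tokens in groups.items()}


costs = cost_by_sequence(RUNS)
baseline = costs[" -> ".join(NORMAL)]
for sequence, average in costs.items():
    # A repeated tool name in the sequence marks a recovery and its cost.
    print(f"{sequence}: {average} tokens avg ({average - baseline:+} vs baseline)")
```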

Reproduce:

python3 -m venv venv
./venv/bin/pip install openai-agents==0.17.5
OPENAI_API_KEY=... ./venv/bin/python harness.py --runs-per-scenario 4

Evidence record: every run is an append-only JSON artifact (tool-call sequence, saved values vs ground truth, token usage, latency) hashed into a SHA-256 manifest at execution time. Failed runs stay in the ledger by design — success-only reporting would be detectable. Totals: 10,837 tokens across 8 runs (10,156 input / 681 output). The public evidence note with the full run table and artifact hashes is at /lab-runs/openai-agents-sdk-flagship.
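To show why a hashed, append-only record makes success-filtering detectable, here is a minimal sketch. It is not the manifest format used for this report: the hash chaining (each entry also covers the previous entry's hash) is a technique swapped in for illustration, and the artifact fields and helper names are hypothetical.

```python
import hashlib
import json


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def artifact_hash(artifact: dict) -> str:
    """Hash a canonical JSON form so key order cannot change the digest."""
    return sha256_hex(json.dumps(artifact, sort_keys=True, separators=(",", ":")))


def append_entry(manifest: list[dict], artifact: dict) -> None:
    """Append an entry whose hash also covers the previous entry."""
    previous = manifest[-1]["entry_hash"] if manifest else ""
    digest = artifact_hash(artifact)
    manifest.append({
        "run": artifact["run"],
        "artifact_hash": digest,
        "entry_hash": sha256_hex(previous + digest),
    })


def verify(manifest: list[dict], artifacts: list[dict]) -> bool:
    """Rebuild the manifest from the artifacts and compare."""
    rebuilt: list[dict] = []
    for artifact in artifacts:
        append_entry(rebuilt, artifact)
    return rebuilt == manifest


# Hypothetical artifacts; field names and values are illustrative only.
artifacts = [
    {"run": 1, "scenario": "normal", "correct": True},
    {"run": 2, "scenario": "storage_fails_once", "correct": True},
    {"run": 3, "scenario": "storage_fails_once", "correct": False},
]
manifest: list[dict] = []
for artifact in artifacts:
    append_entry(manifest, artifact)

print(verify(manifest, artifacts))      # True
print(verify(manifest, artifacts[:2]))  # False: a run was dropped
```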

This is a flagship report from Effloow Proof Studio. We run AI tools' claims in controlled sandboxes for vendors who need proof their buyers can check: reproducible harness, full run record including failures, and a claim table your sales team can quote. Founding sprints: $1,000 for the first 3 vendors.

