OpenAI Agents SDK Guardrails: Local Sandbox PoC
OpenAI Agents SDK guardrails are worth testing before they sit in front of a real model. They are the code path that decides whether a user input is allowed to reach an agent, whether a final answer is allowed to leave the system, and whether a tripwire should stop the run.
The useful discovery from this sandbox is simple: you can test a meaningful part of that behavior locally without OpenAI API credits. Effloow Lab installed openai-agents==0.17.4, created input and output guardrails, ran direct guardrail checks, and confirmed that Runner.run_sync() can raise InputGuardrailTripwireTriggered for unsafe input while OPENAI_API_KEY is absent. The lab did not run a clean prompt through a model, did not test a hosted sandbox, and did not claim production-grade PII detection.
Test environment: openai-agents==0.17.4 and no OPENAI_API_KEY. Lab notes: data/lab-runs/openai-agents-sdk-guardrails-local-sandbox-poc-2026.md.
Why Guardrails Deserve Unit Tests
The OpenAI Agents SDK guardrails docs describe three relevant layers: input guardrails, output guardrails, and tool guardrails. Input guardrails inspect incoming user input. Output guardrails inspect the final agent output. Tool guardrails wrap custom function-tool calls, which matters when a workflow has managers, handoffs, or delegated specialists.
That separation is practical. A prompt-injection defense that belongs at the input boundary should not be buried inside a tool function. A JSON schema check that belongs at the output boundary should not be phrased as a vague instruction in the system prompt. A policy that controls whether a tool may receive customer data should be attached to the tool path, not only to the top-level agent.
The official guardrail reference exposes the core contract: a guardrail returns GuardrailFunctionOutput with output_info and tripwire_triggered. When the tripwire is triggered, agent execution is halted. That return shape is small enough to test like normal application logic.
This is the engineering point: guardrails should not be treated as decorative safety copy. If a guardrail protects private data, payment actions, production deploys, or destructive tools, it needs tests that prove the tripwire fires on known-bad examples and stays quiet on known-good examples.
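A minimal sketch of that known-bad and known-good discipline, using only the standard library. The detector `contains_pii` and the fixture lists are illustrative assumptions, not part of the SDK; the regex mirrors the deliberately simple one used in this PoC. In a real suite the same fixtures would feed the SDK-wrapped guardrail.

```python
import re

# Illustrative detector: an email address or an sk- shaped token.
PII_PATTERN = re.compile(r"[\w.+-]+@[\w.-]+\.\w+|sk-[A-Za-z0-9_-]{8,}")


def contains_pii(text: str) -> bool:
    """Return True when the text should trip the guardrail."""
    return bool(PII_PATTERN.search(text))


# Hypothetical fixtures; extend these lists as reviews and incidents add cases.
KNOWN_BAD = [
    "Email jane@example.com the report.",
    "My key is sk-abcDEF123456",
]
KNOWN_GOOD = [
    "Summarize the public changelog.",
    "The report is ready.",
]


def run_fixture_checks() -> list[str]:
    """Return human-readable failures; an empty list means every fixture passed."""
    failures = []
    for text in KNOWN_BAD:
        if not contains_pii(text):
            failures.append(f"missed known-bad input: {text!r}")
    for text in KNOWN_GOOD:
        if contains_pii(text):
            failures.append(f"false positive on known-good input: {text!r}")
    return failures
```

Keeping the decision in a plain function means the fixture table can run in any test runner, with or without the SDK installed.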
What Changed in the Current SDK
The current sandbox installed openai-agents==0.17.4, released on May 26, 2026 according to the OpenAI Agents SDK GitHub releases page. The package metadata in the project's pyproject.toml lists Python >=3.10, MIT licensing, and dependencies including openai>=2.36.0,<3, pydantic, mcp, requests, and websockets.
That version detail matters because older examples can drift. In this run, the installed package exposed:
    from agents import (
        Agent,
        GuardrailFunctionOutput,
        InputGuardrailTripwireTriggered,
        Runner,
        RunContextWrapper,
        input_guardrail,
        output_guardrail,
        set_tracing_disabled,
    )
The OpenAI Agents SDK quickstart still covers the basic Agent and Runner pattern. The agent reference shows that Agent accepts input_guardrails and output_guardrails lists. The tracing docs also matter for local tests: tracing is enabled by default, and can be disabled globally with OPENAI_AGENTS_DISABLE_TRACING=1 or in code with set_tracing_disabled(True).
For this PoC, tracing was disabled in code so the test remained credential-free and quiet.
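The environment-variable route can also be wrapped in a small test helper. The sketch below uses only the standard library; the name `credential_free_env` is hypothetical, and it assumes the SDK honors OPENAI_AGENTS_DISABLE_TRACING when the variable is set before a run starts.

```python
import os
from contextlib import contextmanager

_NAMES = ("OPENAI_API_KEY", "OPENAI_AGENTS_DISABLE_TRACING")


@contextmanager
def credential_free_env():
    """Temporarily drop OPENAI_API_KEY and set OPENAI_AGENTS_DISABLE_TRACING=1."""
    saved = {name: os.environ.get(name) for name in _NAMES}
    os.environ.pop("OPENAI_API_KEY", None)
    os.environ["OPENAI_AGENTS_DISABLE_TRACING"] = "1"
    try:
        yield
    finally:
        # Restore whatever the developer's shell had before the test.
        for name, value in saved.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value
```

Wrapping a test body in `with credential_free_env():` keeps the scrubbed state scoped to that test, so a developer's own shell configuration is left as it was afterwards.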
What the Sandbox Proved
The sandbox created a minimal PII guardrail that looks for an email address or an sk- shaped token. That detector is intentionally simple. The goal was not to build a production PII classifier; the goal was to validate the SDK tripwire path.
The input guardrail used run_in_parallel=False:
    @input_guardrail(run_in_parallel=False)
    async def reject_pii_input(ctx, agent, user_input):
        text = user_input if isinstance(user_input, str) else json.dumps(user_input)
        found = bool(PII_PATTERN.search(text))
        return GuardrailFunctionOutput(
            output_info={"contains_pii": found, "agent": agent.name},
            tripwire_triggered=found,
        )
That setting is important for a pre-model policy gate. The reference says input guardrails can run either in parallel with the agent or before it starts. For a hard privacy block, the before-start behavior is easier to reason about because the unsafe input should not race a model call.
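The difference is easier to see in a toy simulation. The sketch below does not call the SDK and does not claim to reproduce its scheduling; `check_input`, `fake_model_call`, `run_blocking`, and `run_parallel` are hypothetical stand-ins that only illustrate why a before-start gate is simpler to audit than a concurrent one.

```python
import asyncio


async def check_input(text: str) -> bool:
    """Stand-in guardrail: trips on anything containing '@'."""
    await asyncio.sleep(0)  # stand-in for real async work; yields to the event loop once
    return "@" in text


async def fake_model_call(text: str, log: list) -> str:
    """Stand-in for a model request; records when it starts and finishes."""
    log.append("model_started")
    await asyncio.sleep(0.05)
    log.append("model_finished")
    return "ok"


async def run_blocking(text: str) -> list:
    """Gate first: the model call is only created if the check passes."""
    log = []
    if await check_input(text):
        log.append("blocked")
        return log
    await fake_model_call(text, log)
    return log


async def run_parallel(text: str) -> list:
    """Start both: the model call may already be in flight when the check trips."""
    log = []
    model_task = asyncio.create_task(fake_model_call(text, log))
    if await check_input(text):
        model_task.cancel()
        try:
            await model_task
        except asyncio.CancelledError:
            pass
        log.append("blocked")
    else:
        await model_task
    return log


if __name__ == "__main__":
    unsafe = "Send jane@example.com a copy."
    print(asyncio.run(run_blocking(unsafe)))  # ['blocked']
    print(asyncio.run(run_parallel(unsafe)))  # ['model_started', 'blocked']
```

In the parallel variant the unsafe text has already been handed to the stand-in model by the time the check trips, which is exactly the race a hard privacy gate wants to rule out.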
The output guardrail used the same regex on the final text:
    @output_guardrail
    async def reject_unredacted_output(ctx, agent, output):
        text = str(output)
        found = bool(PII_PATTERN.search(text))
        return GuardrailFunctionOutput(
            output_info={"contains_unredacted_pii": found, "agent": agent.name},
            tripwire_triggered=found,
        )
Direct guardrail checks produced the expected result:
    {
      "safe_input_tripwire": false,
      "unsafe_input_tripwire": true,
      "safe_output_tripwire": false,
      "unsafe_output_tripwire": true
    }
Then the lab attached the input and output guardrails to an Agent and ran an unsafe prompt through Runner.run_sync() with no API key in the environment. The runner raised InputGuardrailTripwireTriggered and reported that the input contained PII:
    {
      "runner_blocked_before_model": true,
      "exception": "InputGuardrailTripwireTriggered",
      "output_info": {
        "agent": "policy-demo",
        "contains_pii": true
      }
    }
That is the safe local testing pattern: use direct guardrail checks for true unit tests, then add one runner-level blocking test for inputs that should never reach a model.
Reproduce the Local PoC
Create an isolated virtualenv:
rm -rf /tmp/effloow-openai-agents-guardrails-poc
mkdir -p /tmp/effloow-openai-agents-guardrails-poc
python3 -m venv /tmp/effloow-openai-agents-guardrails-poc/.venv
/tmp/effloow-openai-agents-guardrails-poc/.venv/bin/python -V
Install the SDK:
/tmp/effloow-openai-agents-guardrails-poc/.venv/bin/python -m pip install --upgrade pip
/tmp/effloow-openai-agents-guardrails-poc/.venv/bin/python -m pip install openai-agents
Check the installed versions:
/tmp/effloow-openai-agents-guardrails-poc/.venv/bin/python -m pip show openai-agents openai
Expected package versions from this run:
openai-agents: 0.17.4
openai: 2.38.0
Save this as guardrail_poc.py:
    import asyncio
    import json
    import os
    import re

    import agents
    from agents import (
        Agent,
        GuardrailFunctionOutput,
        InputGuardrailTripwireTriggered,
        Runner,
        RunContextWrapper,
        input_guardrail,
        output_guardrail,
        set_tracing_disabled,
    )

    # Deliberately simple detector: an email address or an sk- shaped token.
    PII_PATTERN = re.compile(r"[\w.+-]+@[\w.-]+\.\w+|sk-[A-Za-z0-9_-]{8,}")


    @input_guardrail(run_in_parallel=False)
    async def reject_pii_input(ctx, agent, user_input):
        # Input may be a plain string or a list of input items; serialize the latter.
        text = user_input if isinstance(user_input, str) else json.dumps(user_input)
        found = bool(PII_PATTERN.search(text))
        return GuardrailFunctionOutput(
            output_info={"contains_pii": found, "agent": agent.name},
            tripwire_triggered=found,
        )


    @output_guardrail
    async def reject_unredacted_output(ctx, agent, output):
        text = str(output)
        found = bool(PII_PATTERN.search(text))
        return GuardrailFunctionOutput(
            output_info={"contains_unredacted_pii": found, "agent": agent.name},
            tripwire_triggered=found,
        )


    async def manual_guardrail_checks(agent):
        """Call each guardrail directly, without a runner or a model."""
        ctx = RunContextWrapper(context=None)
        # Argument order as used in this run: input guardrails take
        # (agent, input, context); output guardrails take (context, agent, output).
        safe_input = await reject_pii_input.run(agent, "Summarize the public changelog.", ctx)
        unsafe_input = await reject_pii_input.run(agent, "Email jane@example.com the report.", ctx)
        safe_output = await reject_unredacted_output.run(ctx, agent, "The report is ready.")
        unsafe_output = await reject_unredacted_output.run(ctx, agent, "Send it to jane@example.com.")
        return {
            "safe_input_tripwire": safe_input.output.tripwire_triggered,
            "unsafe_input_tripwire": unsafe_input.output.tripwire_triggered,
            "safe_output_tripwire": safe_output.output.tripwire_triggered,
            "unsafe_output_tripwire": unsafe_output.output.tripwire_triggered,
        }


    def runner_blocking_check(agent):
        """Run an unsafe prompt through the runner and expect the input tripwire."""
        try:
            Runner.run_sync(agent, "Send jane@example.com a copy of the report.")
        except InputGuardrailTripwireTriggered as exc:
            return {
                "runner_blocked_before_model": True,
                "exception": type(exc).__name__,
                "output_info": exc.guardrail_result.output.output_info,
            }
        return {"runner_blocked_before_model": False}


    def main():
        # Make the run credential-free and quiet.
        os.environ.pop("OPENAI_API_KEY", None)
        set_tracing_disabled(True)
        agent = Agent(
            name="policy-demo",
            instructions="Return concise compliance summaries.",
            input_guardrails=[reject_pii_input],
            output_guardrails=[reject_unredacted_output],
        )
        payload = {
            "agents_version": agents.__version__,
            "openai_api_key_present": "OPENAI_API_KEY" in os.environ,
            "manual_checks": asyncio.run(manual_guardrail_checks(agent)),
            "runner_check": runner_blocking_check(agent),
        }
        print(json.dumps(payload, indent=2, sort_keys=True))


    if __name__ == "__main__":
        main()
Run it:
/tmp/effloow-openai-agents-guardrails-poc/.venv/bin/python guardrail_poc.py
The script should show openai_api_key_present: false, unsafe_input_tripwire: true, unsafe_output_tripwire: true, and runner_blocked_before_model: true.
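For CI, eyeballing that output can be replaced with a small checker. The sketch below is an assumption about how one might do it, not part of the lab run: `verify_payload` and `EXPECTED` are hypothetical names, and the key paths follow the payload structure built in `main()`.

```python
import json

# Flags a successful run should report, keyed by their path in the payload.
EXPECTED = {
    ("openai_api_key_present",): False,
    ("manual_checks", "unsafe_input_tripwire"): True,
    ("manual_checks", "unsafe_output_tripwire"): True,
    ("runner_check", "runner_blocked_before_model"): True,
}


def verify_payload(payload_text: str) -> list[str]:
    """Return a list of mismatches; an empty list means the run looks as expected."""
    payload = json.loads(payload_text)
    problems = []
    for path, expected in EXPECTED.items():
        node = payload
        for key in path:
            # Walk down the nested dicts; a missing key yields None.
            node = node.get(key) if isinstance(node, dict) else None
        if node is not expected:
            problems.append(f"{'.'.join(path)}: expected {expected!r}, got {node!r}")
    return problems
```

A CI step could feed the script's stdout to `verify_payload` and fail the job whenever the returned list is non-empty.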
Where This Pattern Fits
This pattern is useful for three kinds of agent projects.
First, it helps teams write regression tests for safety policy. If a future refactor weakens the regex, changes the guardrail return type, or forgets to attach the guardrail to the agent, the local test should fail before any API request is made.
Second, it gives developers a cheap way to document boundary behavior. A README can say, "PII input is blocked before model execution," but captured test output shows that the code path actually did so in at least one run.
Third, it separates deterministic checks from model-backed checks. Your local suite can test the guardrail function contract, the runner exception path, and fixture coverage. A smaller integration suite can then test clean prompts, model responses, output guardrails, traces, and tool behavior with real credentials.
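The wiring regression mentioned above, a guardrail that exists but is no longer attached to the agent, can be caught with a structural check. This is a sketch under an assumption: that an agent object exposes the `input_guardrails` and `output_guardrails` lists it was constructed with as attributes of the same names. The helper `missing_guardrails` is hypothetical, and a `SimpleNamespace` stands in for the agent so the example runs without the SDK.

```python
from types import SimpleNamespace


def missing_guardrails(agent, required_inputs=(), required_outputs=()):
    """Return (boundary, guardrail) pairs that are required but not attached."""
    attached_in = list(getattr(agent, "input_guardrails", None) or [])
    attached_out = list(getattr(agent, "output_guardrails", None) or [])
    missing = []
    for guardrail in required_inputs:
        # Compare by identity: the exact guardrail object must be attached.
        if not any(guardrail is attached for attached in attached_in):
            missing.append(("input", guardrail))
    for guardrail in required_outputs:
        if not any(guardrail is attached for attached in attached_out):
            missing.append(("output", guardrail))
    return missing


# Stand-in objects for illustration only.
pii_input_guardrail = object()
pii_output_guardrail = object()
agent_stub = SimpleNamespace(
    input_guardrails=[],  # simulates a refactor that dropped the input guardrail
    output_guardrails=[pii_output_guardrail],
)
print(len(missing_guardrails(agent_stub, [pii_input_guardrail], [pii_output_guardrail])))  # 1
```

Asserting that the returned list is empty turns "we forgot to attach it" into a failing test instead of a silent gap.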
This also pairs naturally with broader agent hardening work. Effloow's guide to OpenAI Agents SDK sandbox, memory, and MCP covers the wider platform surface. This article focuses only on the guardrail slice that can be validated locally.
What This Does Not Prove
The PoC is deliberately narrow.
It does not prove that the regex catches all PII. It does not prove that output guardrails catch a live model response, because no model response was generated. It does not test hosted sandbox agents, MCP servers, handoffs, streaming, tracing export, or deployment behavior. It also does not measure false positives, false negatives, latency, model quality, or cost.
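The first limitation is easy to demonstrate with the same pattern. The sample strings below are illustrative assumptions chosen to show the gap; they are not results from the lab run.

```python
import re

# Same deliberately simple pattern as the PoC.
PII_PATTERN = re.compile(r"[\w.+-]+@[\w.-]+\.\w+|sk-[A-Za-z0-9_-]{8,}")

# Sensitive-looking text the pattern does not catch.
MISSED = [
    "Call me at 555-0100.",            # phone number: the pattern has no rule for it
    "Reach jane at example dot com.",  # obfuscated email: no literal '@'
]

# Harmless text the pattern flags anyway.
FALSE_POSITIVES = [
    "See task-12345678 in the tracker.",  # 'sk-12345678' appears inside 'task-12345678'
]


def audit(samples: list[str]) -> dict[str, bool]:
    """Map each sample to whether the pattern would trip on it."""
    return {text: bool(PII_PATTERN.search(text)) for text in samples}
```

Fixtures like these belong in the test suite as documented gaps, so that a stronger detector can later be measured against them.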
Those limitations are not failures; they are the boundary of a responsible sandbox article. The local test proves the shape of the guardrail contract and the blocking input tripwire. Production readiness still needs stronger detectors, structured output validation, real model integration tests, trace review, tool-level guardrails, and red-team fixtures.
FAQ
Q: Can OpenAI Agents SDK guardrails be tested without an API key?
Yes, partially. Direct guardrail functions can be tested locally because they are normal Python functions wrapped by SDK decorators. A blocking input tripwire can also be exercised through Runner.run_sync() if the unsafe input halts execution before a model call. Clean end-to-end runs still need a model provider.
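One way to keep both kinds of run in a single suite is to skip model-backed tests when no key is present. The sketch uses the standard library's unittest; the class names and the placeholder test bodies are hypothetical and would be replaced by real guardrail and runner calls.

```python
import os
import unittest

# Evaluated once at import time; model-backed tests are skipped without a key.
HAS_MODEL_KEY = bool(os.environ.get("OPENAI_API_KEY"))


class LocalGuardrailTests(unittest.TestCase):
    """Deterministic checks: always run, no credentials required."""

    def test_policy_function(self):
        # Placeholder policy; substitute the real guardrail logic under test.
        def flag(text: str) -> bool:
            return "@" in text

        self.assertTrue(flag("Email jane@example.com the report."))
        self.assertFalse(flag("Summarize the public changelog."))


@unittest.skipUnless(HAS_MODEL_KEY, "needs OPENAI_API_KEY for model-backed runs")
class ModelBackedTests(unittest.TestCase):
    """Integration checks: only meaningful with a real model provider."""

    def test_clean_prompt_round_trip(self):
        # Placeholder: a real suite would run a clean prompt through the runner here.
        self.skipTest("placeholder for a real model-backed assertion")
```

The local class gives fast feedback on every commit, while the skipped class documents what still needs credentials.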
Q: Should input guardrails run in parallel?
It depends on the policy. The SDK runs input guardrails in parallel with the agent by default, but this PoC used run_in_parallel=False for a hard privacy gate. If the input must not reach the model, a before-start guardrail is easier to audit.
Q: Are output guardrails enough to prevent data leakage?
No. Output guardrails are useful as a final boundary, but they run after the agent has produced an answer. Sensitive systems should combine input guardrails, tool guardrails, retrieval filtering, structured output validation, and least-privilege tool design.
Q: Why disable tracing in a local guardrail test?
The OpenAI tracing docs say tracing is enabled by default and can be disabled with set_tracing_disabled(True) or OPENAI_AGENTS_DISABLE_TRACING=1. For credential-free sandbox tests, disabling tracing avoids trace export attempts and keeps the run independent from OpenAI account configuration.
Key Takeaways
OpenAI Agents SDK guardrails are testable enough to deserve normal CI coverage. The local contract is small: return GuardrailFunctionOutput, set tripwire_triggered, attach the guardrail to the right agent boundary, and assert that unsafe fixtures trip.
Use this pattern for cheap confidence, not as a complete safety program. A local PoC can prove that the tripwire code path works without API credits. Production systems still need real model integration tests, stronger classifiers, tool-level policy, and operational trace review before guardrails become a release gate.
Start by testing guardrails as ordinary Python policy code. It is fast, credential-free, and catches wiring mistakes before an agent spends tokens or touches tools.