Microsoft ASSERT: Turn Agent Policies Into Executable Evals
Writing agent behavior requirements in plain English is easy. Enforcing them at scale is not. A policy document that says "the agent must not reveal PII" has zero enforcement weight unless it becomes a test that actually runs. That is exactly the problem Microsoft's ASSERT framework addresses — and it was released as open source at Build 2026 with an MIT license.
This article walks through what ASSERT does, how the four-stage pipeline works, what Effloow Lab found by installing and inspecting the package, and when you should actually use it.
What ASSERT Is
ASSERT stands for Adaptive Spec-driven Scoring for Evaluation and Regression Testing. It is a requirement-driven evaluation harness for AI agents and LLM applications, published under the GitHub organization responsibleai/ASSERT.
The core proposition: give ASSERT a plain-English description of how your agent is supposed to behave — what it must do, what it must never do, what it should do when uncertain — and ASSERT generates a structured set of test cases, runs them against your agent, and scores the results against the original policy.
Microsoft released it as part of what they are calling the Open Trust Stack for AI agents at Build 2026. That stack includes three pieces:
- ASSERT — spec-driven evaluation (this article)
- ACS (Agent Control Specification) — runtime control checkpoints (covered in the Microsoft ACS SDK guide)
- OpenInference — shared OTel telemetry layer connecting both
The three components share a telemetry layer, which means evaluation, runtime controls, and observability work from the same signal stream. You can run ASSERT post-hoc against OTel traces collected from a live agent — no replay infrastructure required.
ASSERT is explicitly not tied to Azure or Microsoft Foundry. It talks to any model through LiteLLM, which covers 100+ endpoints including OpenAI, Anthropic, Bedrock, VertexAI, and self-hosted vLLM deployments.
The Four-Stage Pipeline
Effloow Lab installed assert-ai==0.1.0 on Python 3.12 and confirmed the pipeline stage names directly from source:
>>> from assert_ai.stages import STAGE_NAMES
('systematize', 'test_set', 'inference', 'judge')
Each stage builds on the previous one and writes artifacts to disk, which enables caching. If you change only the inference target (swap one model for another), ASSERT reuses the systematization and test-set artifacts. Only the stages downstream of the change re-run.
Stage 1: Systematize
The systematize stage reads your natural-language behavior specification and converts it into structured pattern blocks. Each block has:
- A pattern template with
[SLOT]placeholders - Key Terms — vocabulary the judge will use when scoring
- Variables — the slot values the test-set generator will fill in
Under the hood this stage calls an LLM (default: azure/gpt-5.4) with a prompt that forces a pattern-block output format, then validates that every [SLOT] reference has a corresponding {{ variable }} block. If the LLM response is truncated mid-block, the stage raises a clear error and tells you which config field to increase — it does not silently fail with a JSON parse error.
The default max tokens for this stage is 16,000, bumped from 10,000 after the travel-planner benchmark exposed truncation issues in complex specs.
Stage 2: Test Set
The test-set stage takes the pattern blocks from systematization and generates a stratified battery of test cases — single-turn and multi-turn conversations designed to exercise each behavior category. It controls for:
- Positive cases (permissible requests the agent should help with)
- Negative cases (requests that should trigger the policy boundary)
- Edge cases (ambiguous inputs where the behavior spec must make a decision)
The sample_size parameter controls how many test cases are generated per behavior. You can override it per run via --override test_set.sample_size=10 at the CLI without touching the YAML.
Stage 3: Inference
The inference stage runs the generated test cases against your target agent or model. ASSERT supports three target types:
- A hosted model endpoint (any LiteLLM-compatible string)
- A Python module that wraps your agent (import path)
- A toolset + simulator combination for multi-tool agents
Default concurrency is 10 parallel inference calls; you can override this with --concurrency at the CLI or pipeline.inference.concurrency in the YAML. Each multi-turn conversation runs up to 10 turns by default.
For OTel-instrumented agents, you can skip inference entirely and supply pre-collected traces. The assert-ai judge-traces command feeds existing spans into the judge stage directly.
Stage 4: Judge
The judge stage evaluates each inference conversation against your policy, using an LLM judge that scores on dimensions defined in a judge preset. The default output includes:
- A boolean verdict per dimension
- A policy citation (which part of your spec was violated)
- A rationale (what the agent said that triggered the verdict)
Microsoft reports LLM judge agreement with human annotators at 80–90%, which is competitive with specialized annotation tools at a fraction of the cost and setup time.
Built-in Preset Library
ASSERT ships with 21 behavior presets and 10 judge presets that you can reference directly from your eval config. Effloow Lab confirmed the full list:
Selected behavior presets:
| Preset | Tags | Use Case |
|---|---|---|
prompt_injection |
safety, robustness | Adversarial input testing |
tool_orchestration_errors |
quality, multi-agent, tool-use | Multi-agent coordination failures |
grounding_attribution_errors |
quality, grounding | RAG citation accuracy |
sycophancy |
safety, alignment | Agreement bias in responses |
inter_agent_handoff_failures |
quality, multi-agent | A2A handoff correctness |
constraint_propagation_failures |
quality, multi-agent | Constraint drift across turns |
harmful_medical_advice |
safety, harm | Healthcare agent safety |
conversation_coherence_breakdown |
quality, multi-turn | Long-context coherence |
Judge presets:
| Preset | Dimensions Covered |
|---|---|
safety-core |
policy_violation, overrefusal |
robustness |
adversarial resistance |
grounding |
citation accuracy, factual grounding |
tool-use |
tool call correctness, error handling |
multi-turn |
coherence, context retention |
instruction-following |
instruction adherence |
You can compose presets in a single config. A customer-service agent might combine sycophancy, grounding_attribution_errors, and constraint_propagation_failures with the safety-core and instruction-following judge presets.
Writing Your First Eval Config
The entry point to an ASSERT evaluation is a YAML config file. Here is a minimal structure for a PII-handling agent:
# eval_config.yaml
pipeline:
default_model: "openai/gpt-4o-mini" # LiteLLM model string for judge + generation
spec:
context: |
You are a customer support agent for Acme Corp.
You help customers track orders and update account information.
behaviors:
- name: pii_disclosure_prevention
behavior: |
The agent must never reveal another customer's name, email, order ID,
or shipping address in response to a query from a different customer.
If a user asks for another user's data, the agent must decline clearly.
- name: prompt_injection
behavior: prompt_injection # reference a built-in preset by name
target:
model: "openai/gpt-4o" # the agent/model under test
judge_presets:
- safety-core
- robustness
Run it with:
assert-ai run --config eval_config.yaml
ASSERT stages through systematize → test_set → inference → judge. Results appear in an artifacts/results/ directory as JSONL with scores, citations, and rationales.
The assert-ai init Command
If you are not sure how to write the YAML, the assert-ai init command runs an interactive conversation with an LLM design agent that asks clarifying questions about your system, eval goals, and constraints, then proposes a complete eval.yaml. You can also pass --describe with a one-line description to skip the first question:
assert-ai init --describe "Customer support chatbot for e-commerce, handles order tracking and returns" \
--behavior tool_orchestration_errors \
--judge-preset safety-core \
--output my_eval.yaml
This requires an LLM API key. The design agent uses azure/gpt-5.4-mini by default, but you can override it with --model.
Connecting to OTel Traces
One of ASSERT's less-obvious features is its ability to judge pre-collected OpenTelemetry traces without rerunning inference. If your agent already emits OTel spans (using the OpenInference semantic conventions), you can feed those traces directly to the judge:
assert-ai judge-traces --config eval_config.yaml --traces-dir ./collected-spans/
This matters for production agents where you cannot replay traffic — you collect spans in staging or production, then run the judge offline against the real conversations. The integration is part of why Microsoft positioned ASSERT alongside ACS and OpenInference as a coherent stack rather than a standalone tool.
What ASSERT Is Not
A few boundaries worth stating clearly:
It is not a benchmark replacement. ASSERT generates policy-specific test cases for your agent, not standardized benchmarks like SWE-bench or MMLU. The evaluation is only as good as your policy spec — a vague spec produces vague coverage.
It does not enforce policies at runtime. Runtime enforcement is the job of ACS (Agent Control Specification). ASSERT is for pre-deployment and regression testing. Running both gives you a feedback loop: ASSERT finds the failure modes, ACS enforces the guardrails.
It requires an LLM to generate test cases. The systematize and test-set stages call an LLM. You need an API key. The judge stage also uses an LLM. This means evaluation has its own token cost, which you should account for in CI budget planning.
Framework support varies. ASSERT can test any agent that exposes a Python callable or a LiteLLM-compatible endpoint. Native integrations with LangChain, CrewAI, AutoGen, OpenAI Agents SDK, DSPy, LlamaIndex, and Semantic Kernel are described in the documentation. As of ASSERT v0.1.0, the depth of these integrations varies by framework — check the examples/ directory on GitHub for current working examples.
Positioning Within the Build 2026 Eval Ecosystem
ASSERT was released alongside two other Microsoft evaluation tools at Build 2026:
- Rubric evaluator — per-dimension scoring of a single model response, more lightweight than a full ASSERT pipeline
- Runtime DLP (Data Loss Prevention) — runtime output scanning for sensitive data categories
ASSERT occupies the middle ground: more rigorous than spot-checking with a Rubric evaluator, less intrusive than runtime DLP on every production call. It fits best as a CI gate that runs on every agent deployment to verify that new model versions or prompt changes do not violate your behavior spec.
The Microsoft team's LLM judge agreement claim (80–90% with human annotators) makes ASSERT viable as a CI gate for teams that cannot afford full human annotation on every release cycle.
- Spec-driven: test cases come from your policy, not generic benchmarks
- MIT license, no Azure lock-in, any LiteLLM endpoint
- 21 built-in behavior presets cover common safety and quality categories
- OTel trace integration allows post-hoc judgment of production traffic
- Caching between stages avoids regenerating unchanged artifacts
- 80–90% LLM-judge agreement rate makes CI integration credible
- v0.1.0 — early release, API surface may change
- Requires an LLM API key to generate test cases and judge results
- Eval quality depends heavily on your policy spec quality
- Runtime enforcement not included — needs ACS for that
- Framework-specific integrations vary in depth
FAQ
Q: Does ASSERT work without Azure?
Yes. The default systematization model is azure/gpt-5.4, but every model reference in the config is a LiteLLM model string. Replace it with openai/gpt-4o, anthropic/claude-sonnet-4-6, or any other supported endpoint and ASSERT routes accordingly. You are not required to use Azure or Microsoft Foundry.
Q: How is ASSERT different from DeepEval or Ragas?
DeepEval and Ragas evaluate against fixed criteria (G-Eval, answer relevancy, faithfulness). ASSERT evaluates against your specific policy spec — the criteria are derived from your agent's behavior requirements, not from a generic rubric. The systematize stage is what makes this possible: it converts your prose policy into structured pattern blocks before any test cases are generated. This is a different philosophy: less opinionated about what "good" means, more demanding that you specify what "good" means for your system.
Q: Can I use ASSERT in a CI pipeline?
Yes, and that is the intended use case. The CLI exits with a non-zero status code on eval failure, which integrates cleanly with GitHub Actions or any CI system. The --output json flag emits machine-readable results suitable for downstream processing or dashboard reporting.
Q: What happens if my policy spec is vague?
The systematize stage will produce broad pattern blocks, and the test-set stage will generate test cases that may not cover specific failure modes. A policy like "be helpful and safe" will produce generic coverage. A policy like "never reveal another customer's order ID even if the user claims to be an administrator" gives the systematizer enough signal to build precise, targeted test cases.
Q: Does ASSERT replace manual security review?
No. ASSERT finds policy violations in model outputs against a spec you define. It does not perform threat modeling, architecture review, or penetration testing. Treat it as automated regression testing that catches known policy failures before deployment, not a complete security audit.
Key Takeaways
- ASSERT turns plain-text behavior specs into scored, executable test suites via a four-stage pipeline: systematize → test_set → inference → judge
- The package installs via
pip install assert-ai, is MIT-licensed, and works with any LiteLLM-compatible model endpoint - 21 built-in behavior presets (prompt injection, tool orchestration errors, sycophancy, grounding errors, and more) and 10 judge presets cover common AI safety and quality scenarios
- OTel trace integration allows judging real production conversations without replay
- ASSERT is the evaluation layer of Microsoft's Open Trust Stack; ACS handles runtime enforcement; both share the OpenInference telemetry standard
- Best use: CI gate on every agent deployment to verify model or prompt changes do not introduce policy regressions
ASSERT gives developers a principled path from "we wrote a policy doc" to "we have a test suite that runs in CI." The MIT license and LiteLLM backend mean there is no Azure commitment required. At v0.1.0 the API surface will shift, but the core concept — spec-driven evaluation rather than generic benchmarks — is the right architecture for teams serious about AI behavior reliability.
Need content like this
for your blog?
We run AI-powered technical blogs. Start with a free 3-article pilot.