Braintrust LLM Eval Autoevals CI Sandbox PoC 2026
Date: 2026-05-20
Track: sandbox-poc
Slug: braintrust-llm-eval-autoevals-ci-sandbox-poc-2026
Goal
Evaluate whether Braintrust Autoevals can support a small regression-safe LLM evaluation gate that a developer could run locally and in CI.
The sandbox stayed credential-free:
- Install pinned Braintrust and Autoevals packages in a temporary virtualenv.
- Build a deterministic JSON-output eval script.
- Score outputs with ExactMatch and ValidJSON.
- Run a passing baseline.
- Trigger an intentional regression and confirm the script exits nonzero.
- Draft a minimal GitHub Actions workflow shape.
No Braintrust cloud project, API key, Claude API call, OpenAI API call, paid model, or production credential was used.
Environment
Host: macOS
Working directory: /tmp/effloow-braintrust-autoevals-poc
Python: Python 3.12.8
autoevals: 0.1.0
braintrust: 0.19.0
Date: 2026-05-20
Commands and Outputs
1. Prepare isolated sandbox
rm -rf /tmp/effloow-braintrust-autoevals-poc
mkdir -p /tmp/effloow-braintrust-autoevals-poc
cd /tmp/effloow-braintrust-autoevals-poc
python3 -m venv .venv
2. Pin dependencies
File: /tmp/effloow-braintrust-autoevals-poc/requirements.txt
autoevals==0.1.0
braintrust==0.19.0
Install:
.venv/bin/python -m pip install -r requirements.txt
Relevant output:
Successfully installed ... autoevals-0.1.0 braintrust-0.19.0 ...
Version check:
.venv/bin/python -m pip show autoevals braintrust | sed -n '1,80p'
Relevant output:
Name: autoevals
Version: 0.1.0
Summary: Universal library for evaluating AI models
---
Name: braintrust
Version: 0.19.0
Summary: SDK for integrating Braintrust
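The same version check can be sketched in Python with the standard library's importlib.metadata, which reads installed package metadata without importing the packages. The helper name check_pins and the PINS mapping are illustrative and not part of the PoC; the pinned values are copied from requirements.txt above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from requirements.txt in this note.
PINS = {"autoevals": "0.1.0", "braintrust": "0.19.0"}


def check_pins(pins, get_version=version):
    """Return a list of mismatch messages; an empty list means every pin matches.

    get_version is injectable so the helper can be exercised without
    installing anything (hypothetical convenience, not a library feature).
    """
    problems = []
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed (wanted {wanted})")
            continue
        if found != wanted:
            problems.append(f"{name}: found {found}, wanted {wanted}")
    return problems
```

Calling check_pins(PINS) inside the virtualenv and exiting nonzero on a non-empty result would turn this into a preflight step before the eval runs.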
3. Build the local eval gate
File: /tmp/effloow-braintrust-autoevals-poc/eval_guardrail.py
import json
import sys

from autoevals import ExactMatch, ValidJSON

# Two deterministic cases; "expected" is the exact JSON object the app must return.
DATASET = [
    {
        "id": "refund-status",
        "input": "Return a compact JSON refund status.",
        "expected": {"status": "pending", "action": "collect_receipt"},
    },
    {
        "id": "shipping-status",
        "input": "Return a compact JSON shipping status.",
        "expected": {"status": "ready", "action": "notify_customer"},
    },
]


def app(case, mode):
    # Stand-in for the LLM call: baseline echoes the expected output,
    # any other mode changes the action for the shipping case only.
    if mode == "baseline":
        return json.dumps(case["expected"], separators=(",", ":"))
    if case["id"] == "shipping-status":
        return json.dumps({"status": "ready", "action": "email_customer"}, separators=(",", ":"))
    return json.dumps(case["expected"], separators=(",", ":"))


def numeric_score(result):
    # Accept a score object, a dict with a "score" key, or a bare number.
    if hasattr(result, "score"):
        return float(result.score)
    if isinstance(result, dict) and "score" in result:
        return float(result["score"])
    return float(result)


def run(mode):
    exact = ExactMatch()
    valid_json = ValidJSON()
    rows = []
    for case in DATASET:
        output = app(case, mode)
        expected = json.dumps(case["expected"], separators=(",", ":"))
        exact_score = numeric_score(exact(output=output, expected=expected))
        json_score = numeric_score(valid_json(output=output))
        combined = (exact_score + json_score) / 2
        rows.append(
            {
                "id": case["id"],
                "output": output,
                "exact": exact_score,
                "valid_json": json_score,
                "combined": combined,
            }
        )
    average = sum(row["combined"] for row in rows) / len(rows)
    print(json.dumps({"mode": mode, "average": average, "rows": rows}, indent=2))
    # Exit code 0 only when every case scores 1.0 on both checks.
    return 0 if average >= 1.0 else 1


if __name__ == "__main__":
    selected_mode = sys.argv[1] if len(sys.argv) > 1 else "baseline"
    raise SystemExit(run(selected_mode))
4. Run passing baseline
.venv/bin/python eval_guardrail.py baseline
Output:
{
  "mode": "baseline",
  "average": 1.0,
  "rows": [
    {
      "id": "refund-status",
      "output": "{\"status\":\"pending\",\"action\":\"collect_receipt\"}",
      "exact": 1.0,
      "valid_json": 1.0,
      "combined": 1.0
    },
    {
      "id": "shipping-status",
      "output": "{\"status\":\"ready\",\"action\":\"notify_customer\"}",
      "exact": 1.0,
      "valid_json": 1.0,
      "combined": 1.0
    }
  ]
}
5. Trigger intentional regression
.venv/bin/python eval_guardrail.py regression
echo "exit_code=$?"
Output:
{
  "mode": "regression",
  "average": 0.75,
  "rows": [
    {
      "id": "refund-status",
      "output": "{\"status\":\"pending\",\"action\":\"collect_receipt\"}",
      "exact": 1.0,
      "valid_json": 1.0,
      "combined": 1.0
    },
    {
      "id": "shipping-status",
      "output": "{\"status\":\"ready\",\"action\":\"email_customer\"}",
      "exact": 0.0,
      "valid_json": 1.0,
      "combined": 0.5
    }
  ]
}
exit_code=1
The regression remained valid JSON, but it changed the required action from notify_customer to email_customer. ValidJSON still scored 1.0, while ExactMatch caught the behavioral drift.
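The split between the two scores can be reproduced with standard-library stand-ins. This is a minimal sketch of the idea only: is_valid_json and is_exact are hypothetical helpers written for this note, not the Autoevals scorers, and they assume the same compact JSON strings the eval script produces.

```python
import json


def is_valid_json(text):
    """Stand-in format check: 1.0 if the text parses as JSON, else 0.0."""
    try:
        json.loads(text)
    except (TypeError, ValueError):
        return 0.0
    return 1.0


def is_exact(text, expected):
    """Stand-in content check: 1.0 only when the strings are identical."""
    return 1.0 if text == expected else 0.0


expected = json.dumps({"status": "ready", "action": "notify_customer"}, separators=(",", ":"))
regressed = json.dumps({"status": "ready", "action": "email_customer"}, separators=(",", ":"))

format_score = is_valid_json(regressed)       # well-formed JSON, so the format check passes
content_score = is_exact(regressed, expected)  # the action changed, so the content check fails
combined = (format_score + content_score) / 2
print(format_score, content_score, combined)   # prints: 1.0 0.0 0.5
```

A format-only check would have let this regression through, which is why the gate pairs it with a content check.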
6. CI workflow shape
File: /tmp/effloow-braintrust-autoevals-poc/github-actions-llm-eval.yml
name: llm-eval
on:
  pull_request:
  push:
    branches: [main]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python eval_guardrail.py baseline
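If the workflow were run on GitHub Actions, the per-case scores could also be surfaced in the job summary. The sketch below is untested in this PoC and rests on one assumption about the runner: that it exposes the summary file path in the GITHUB_STEP_SUMMARY environment variable. The function names render_summary and write_summary are illustrative.

```python
import os


def render_summary(mode, average, rows):
    """Render eval rows as a small Markdown table for a job summary."""
    lines = [
        f"### LLM eval ({mode})",
        "",
        f"Average: {average:.2f}",
        "",
        "| id | exact | valid_json | combined |",
        "| --- | --- | --- | --- |",
    ]
    for row in rows:
        lines.append(
            f"| {row['id']} | {row['exact']} | {row['valid_json']} | {row['combined']} |"
        )
    return "\n".join(lines) + "\n"


def write_summary(text, env=os.environ):
    """Append text to the job summary file; return False when not running in CI."""
    path = env.get("GITHUB_STEP_SUMMARY")  # assumed to be set by the Actions runner
    if not path:
        return False
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(text)
    return True
```

Outside CI the variable is unset, so write_summary returns False and the script's behavior is unchanged locally.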
What Worked
- autoevals==0.1.0 and braintrust==0.19.0 installed in a clean virtualenv.
- ExactMatch and ValidJSON ran locally without provider API keys.
- The baseline scored average: 1.0 and exited successfully.
- The intentional regression scored average: 0.75 and exited with code 1.
- The script can be used as a simple CI gate because score failure maps to process failure.
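The score-to-exit-code mapping can be isolated into one small function. This is a sketch under assumptions: gate_exit_code and its threshold parameter are hypothetical, and the float tolerance is an addition that only matters once scores are fractional rather than the 0.0 and 1.0 values seen in this PoC.

```python
import math


def gate_exit_code(average, threshold=1.0):
    """Return 0 when average meets the threshold (within float tolerance), else 1."""
    passed = average > threshold or math.isclose(average, threshold, rel_tol=1e-9)
    return 0 if passed else 1


print(gate_exit_code(1.0))   # baseline average from this note
print(gate_exit_code(0.75))  # regression average from this note
```

With the default threshold, the baseline average maps to 0 and the regression average maps to 1, matching the exit codes recorded above.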
What Did Not Work
- No Braintrust cloud experiment was created because no Braintrust API key was used.
- No LLM-as-judge scorer was run because that would require a model provider key.
- No Claude/OpenAI call was made.
- No GitHub Actions job was executed remotely; only the workflow file shape was drafted.
Limitations
This PoC proves the local scorer-and-exit-code pattern, not full hosted Braintrust observability. A production implementation should add a real dataset, provider-backed task calls, cloud experiment logging, baseline comparison, secret management, and pull-request reporting.
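Baseline comparison, one of the missing pieces listed above, could start as a per-case diff against stored scores. The sketch below is illustrative only: find_regressions and the tolerance parameter are hypothetical, and the score dictionaries reuse the combined values from the two runs recorded in this note rather than a real stored baseline.

```python
def find_regressions(baseline, current, tolerance=0.0):
    """Return case ids whose score dropped by more than tolerance.

    A case that is present in the baseline but missing from the current
    run is also reported, since a silently dropped case hides regressions.
    """
    regressions = []
    for case_id, base_score in baseline.items():
        new_score = current.get(case_id)
        if new_score is None or new_score < base_score - tolerance:
            regressions.append(case_id)
    return regressions


baseline_scores = {"refund-status": 1.0, "shipping-status": 1.0}
current_scores = {"refund-status": 1.0, "shipping-status": 0.5}
print(find_regressions(baseline_scores, current_scores))  # prints: ['shipping-status']
```

Comparing per case rather than on the average keeps one improved case from masking another that got worse.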
The deterministic sample also uses exact JSON equality, which is intentionally strict. Real LLM applications usually need a mix of exact checks, schema checks, semantic scorers, LLM-as-judge scoring, human review, and production trace sampling.
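Two looser checks that sit between string equality and semantic scoring can be sketched with the standard library alone. Both helpers are hypothetical and not part of Autoevals; REQUIRED_KEYS is an assumption drawn from the two sample cases in this note, not a real schema.

```python
import json

# Assumed from the sample dataset: every expected object has exactly these keys.
REQUIRED_KEYS = {"status", "action"}


def semantic_json_equal(output, expected):
    """Compare parsed JSON values, ignoring key order and whitespace."""
    try:
        return json.loads(output) == json.loads(expected)
    except (TypeError, ValueError):
        return False


def has_required_shape(output):
    """Check that output is a JSON object with exactly the required string fields."""
    try:
        data = json.loads(output)
    except (TypeError, ValueError):
        return False
    return (
        isinstance(data, dict)
        and set(data) == REQUIRED_KEYS
        and all(isinstance(value, str) for value in data.values())
    )


compact = '{"status":"ready","action":"notify_customer"}'
reordered = '{"action": "notify_customer", "status": "ready"}'
print(compact == reordered)                     # prints: False
print(semantic_json_equal(compact, reordered))  # prints: True
```

String equality rejects the reordered output even though it carries the same data, which is the strictness the paragraph above refers to.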
Sources Checked
- https://www.braintrust.dev/docs/reference/autoevals/python/0.1.0/python
- https://www.braintrust.dev/docs/reference/sdks/python/0.19.0/python
- https://www.braintrust.dev/docs/evaluate
- https://www.braintrust.dev/docs/evaluate/run-evaluations
- https://www.braintrust.dev/pricing
- https://www.braintrust.dev/docs/plans-and-limits
- https://github.com/braintrustdata/autoevals
- https://github.com/braintrustdata/braintrust-sdk-python
Read the article
This note supports the public article and records what was actually checked.