Braintrust LLM Eval Autoevals CI Sandbox PoC 2026

Date: 2026-05-20
Track: sandbox-poc
Slug: braintrust-llm-eval-autoevals-ci-sandbox-poc-2026

Goal

Evaluate whether Braintrust Autoevals can support a small regression-safe LLM evaluation gate that a developer could run locally and in CI.

The sandbox stayed credential-free and covered these steps:

Install pinned Braintrust and Autoevals packages in a temporary virtualenv.
Build a deterministic JSON-output eval script.
Score outputs with ExactMatch and ValidJSON.
Run a passing baseline.
Trigger an intentional regression and confirm the script exits nonzero.
Draft a minimal GitHub Actions workflow shape.

No Braintrust cloud project, API key, Claude API call, OpenAI API call, paid model, or production credential was used.

Environment

Host: macOS
Working directory: /tmp/effloow-braintrust-autoevals-poc
Python: Python 3.12.8
autoevals: 0.1.0
braintrust: 0.19.0
Date: 2026-05-20

Commands and Outputs

1. Prepare isolated sandbox

rm -rf /tmp/effloow-braintrust-autoevals-poc
mkdir -p /tmp/effloow-braintrust-autoevals-poc
cd /tmp/effloow-braintrust-autoevals-poc
python3 -m venv .venv

2. Pin dependencies

File: /tmp/effloow-braintrust-autoevals-poc/requirements.txt

autoevals==0.1.0
braintrust==0.19.0

Install:

.venv/bin/python -m pip install -r requirements.txt

Relevant output:

Successfully installed ... autoevals-0.1.0 braintrust-0.19.0 ...

Version check:

.venv/bin/python -m pip show autoevals braintrust | sed -n '1,80p'

Relevant output:

Name: autoevals
Version: 0.1.0
Summary: Universal library for evaluating AI models
---
Name: braintrust
Version: 0.19.0
Summary: SDK for integrating Braintrust
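
The same pin check could be scripted so that a CI job fails fast when an installed version drifts from requirements.txt. The following is a minimal sketch using only importlib.metadata from the standard library. The helper names parse_pins and find_mismatches are hypothetical, and the parser only handles simple name==version lines.

```python
from importlib import metadata


def parse_pins(text):
    """Parse simple `name==version` lines; skip blanks, comments, and other specifiers."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {name: (pinned, installed_or_None)} for every pin that does not match."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

In the sandbox, calling find_mismatches(parse_pins(open("requirements.txt").read())) would be expected to return an empty dict when both installed versions match their pins.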

3. Build the local eval gate

File: /tmp/effloow-braintrust-autoevals-poc/eval_guardrail.py

import json
import sys

from autoevals import ExactMatch, ValidJSON


DATASET = [
    {
        "id": "refund-status",
        "input": "Return a compact JSON refund status.",
        "expected": {"status": "pending", "action": "collect_receipt"},
    },
    {
        "id": "shipping-status",
        "input": "Return a compact JSON shipping status.",
        "expected": {"status": "ready", "action": "notify_customer"},
    },
]


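def app(case, mode):
    # Stand-in for the LLM call: returns canned output so the eval is deterministic.
    if mode == "baseline":
        return json.dumps(case["expected"], separators=(",", ":"))
    # Any mode other than "baseline" injects one regression: the wrong action for shipping-status.
    if case["id"] == "shipping-status":
        return json.dumps({"status": "ready", "action": "email_customer"}, separators=(",", ":"))
    return json.dumps(case["expected"], separators=(",", ":"))


def numeric_score(result):
    # Accept a score object, a dict, or a bare number so the gate does not depend on one return type.
    if hasattr(result, "score"):
        return float(result.score)
    if isinstance(result, dict) and "score" in result:
        return float(result["score"])
    return float(result)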


def run(mode):
    exact = ExactMatch()
    valid_json = ValidJSON()
    rows = []

    for case in DATASET:
        output = app(case, mode)
        expected = json.dumps(case["expected"], separators=(",", ":"))
        exact_score = numeric_score(exact(output=output, expected=expected))
        json_score = numeric_score(valid_json(output=output))
        combined = (exact_score + json_score) / 2
        rows.append(
            {
                "id": case["id"],
                "output": output,
                "exact": exact_score,
                "valid_json": json_score,
                "combined": combined,
            }
        )

    average = sum(row["combined"] for row in rows) / len(rows)
    print(json.dumps({"mode": mode, "average": average, "rows": rows}, indent=2))
    # Gate: exit code 0 only when every scorer returned 1.0 on every case.
    return 0 if average >= 1.0 else 1


if __name__ == "__main__":
    selected_mode = sys.argv[1] if len(sys.argv) > 1 else "baseline"
    raise SystemExit(run(selected_mode))
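
One property of the script as written: any argument other than baseline, including a typo, takes the regression branch. A sketch of stricter argument handling with argparse follows. The parse_args helper and the --threshold flag are hypothetical additions and are not part of the file above.

```python
import argparse


def parse_args(argv=None):
    """Parse CLI arguments; unknown modes are rejected instead of being treated as a regression."""
    parser = argparse.ArgumentParser(description="Local LLM eval gate")
    parser.add_argument(
        "mode",
        nargs="?",
        default="baseline",
        choices=["baseline", "regression"],
        help="which canned app behavior to score",
    )
    parser.add_argument(
        "--threshold",
        type=float,
        default=1.0,
        help="minimum average score required for exit code 0 (hypothetical flag)",
    )
    return parser.parse_args(argv)
```

Wiring this in would mean passing the parsed threshold into run and comparing the average against it, which is not shown here. With this parser a misspelled mode makes argparse exit with code 2, so a typo cannot be mistaken for a scored regression.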

4. Run passing baseline

.venv/bin/python eval_guardrail.py baseline

Output:

{
  "mode": "baseline",
  "average": 1.0,
  "rows": [
    {
      "id": "refund-status",
      "output": "{\"status\":\"pending\",\"action\":\"collect_receipt\"}",
      "exact": 1.0,
      "valid_json": 1.0,
      "combined": 1.0
    },
    {
      "id": "shipping-status",
      "output": "{\"status\":\"ready\",\"action\":\"notify_customer\"}",
      "exact": 1.0,
      "valid_json": 1.0,
      "combined": 1.0
    }
  ]
}

5. Trigger intentional regression

.venv/bin/python eval_guardrail.py regression
echo "exit_code=$?"

Output:

{
  "mode": "regression",
  "average": 0.75,
  "rows": [
    {
      "id": "refund-status",
      "output": "{\"status\":\"pending\",\"action\":\"collect_receipt\"}",
      "exact": 1.0,
      "valid_json": 1.0,
      "combined": 1.0
    },
    {
      "id": "shipping-status",
      "output": "{\"status\":\"ready\",\"action\":\"email_customer\"}",
      "exact": 0.0,
      "valid_json": 1.0,
      "combined": 0.5
    }
  ]
}
exit_code=1

The regression remained valid JSON, but it changed the required action from notify_customer to email_customer. ValidJSON still scored 1.0, while ExactMatch caught the behavioral drift.
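
The score says that the output differs but not which field moved. A small standard-library sketch that names the drifting keys is shown below, assuming flat JSON objects like the ones in DATASET. The diff_fields helper is hypothetical and was not part of the PoC run.

```python
import json


def diff_fields(output_json, expected):
    """Return {key: (expected_value, actual_value)} for each top-level key that differs.

    Assumes `output_json` is a JSON object string and `expected` is a flat dict.
    A key missing on one side is reported with None on that side.
    """
    actual = json.loads(output_json)
    diffs = {}
    for key in sorted(set(expected) | set(actual)):
        if expected.get(key) != actual.get(key):
            diffs[key] = (expected.get(key), actual.get(key))
    return diffs


regressed = '{"status":"ready","action":"email_customer"}'
expected = {"status": "ready", "action": "notify_customer"}
print(diff_fields(regressed, expected))
# prints {'action': ('notify_customer', 'email_customer')}
```

Attaching a diff like this to each failing row would make a red CI run easier to triage than a bare 0.0.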

6. CI workflow shape

File: /tmp/effloow-braintrust-autoevals-poc/github-actions-llm-eval.yml

name: llm-eval

on:
  pull_request:
  push:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"  # matches the Python 3.12.8 used in the local sandbox
      - run: pip install -r requirements.txt
      - run: python eval_guardrail.py baseline
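
Because the drafted workflow only runs the baseline, it never demonstrates in CI that the gate can fail. One option is a self-test that runs each mode and checks for the expected exit code. The following is a sketch using subprocess. The names expect_exit and gate_self_test are hypothetical, and this was not executed as part of the PoC.

```python
import subprocess
import sys


def expect_exit(command, expected_code):
    """Run `command` and return True when its exit code equals `expected_code`."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == expected_code


def gate_self_test(script="eval_guardrail.py"):
    """Return the commands whose exit code did not match expectations.

    Assumes `script` accepts the baseline and regression modes used in the sandbox.
    """
    checks = [
        ([sys.executable, script, "baseline"], 0),
        ([sys.executable, script, "regression"], 1),
    ]
    return [command for command, code in checks if not expect_exit(command, code)]
```

An empty list from gate_self_test() would indicate that the baseline passes and the injected regression still fails, which is the behavior observed locally in steps 4 and 5.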

What Worked

autoevals==0.1.0 and braintrust==0.19.0 installed in a clean virtualenv.
ExactMatch and ValidJSON ran locally without provider API keys.
The baseline scored average: 1.0 and exited successfully.
The intentional regression scored average: 0.75 and exited with code 1.
The script can serve as a simple CI gate because an average below 1.0 maps to a nonzero exit code.
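
The two averages follow directly from the per-case scores. The arithmetic from run can be replayed with the score values copied from the outputs in steps 4 and 5, without importing autoevals. The average_combined helper is a hypothetical name used only for this illustration.

```python
def average_combined(rows):
    """Mean of per-case combined scores, where combined = (exact + valid_json) / 2."""
    combined = [(row["exact"] + row["valid_json"]) / 2 for row in rows]
    return sum(combined) / len(combined)


# Score values copied from the recorded outputs.
baseline_rows = [
    {"exact": 1.0, "valid_json": 1.0},
    {"exact": 1.0, "valid_json": 1.0},
]
regression_rows = [
    {"exact": 1.0, "valid_json": 1.0},
    {"exact": 0.0, "valid_json": 1.0},
]

for name, rows in [("baseline", baseline_rows), ("regression", regression_rows)]:
    average = average_combined(rows)
    exit_code = 0 if average >= 1.0 else 1
    print(name, average, exit_code)
# prints "baseline 1.0 0" then "regression 0.75 1"
```

One failed ExactMatch across two cases lowers the average by 0.25, since each scorer carries half of a case's weight. With the threshold at 1.0, any single failing score is enough to fail the gate.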

What Was Not Tested

No Braintrust cloud experiment was created because no Braintrust API key was used.
No LLM-as-judge scorer was run because that would require a model provider key.
No Claude/OpenAI call was made.
No GitHub Actions job was executed remotely; only the workflow file shape was drafted.

Limitations

This PoC proves the local scorer-and-exit-code pattern, not full hosted Braintrust observability. A production implementation should add a real dataset, provider-backed task calls, cloud experiment logging, baseline comparison, secret management, and pull-request reporting.
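
Of the production additions listed, baseline comparison is the easiest to prototype locally. The sketch below assumes per-case scores are stored as a JSON mapping of case id to combined score. The storage format, the find_regressions helper, and the tolerance parameter are all assumptions for illustration, not Braintrust features.

```python
import json


def find_regressions(baseline, current, tolerance=0.0):
    """Return {case_id: (baseline_score, current_score)} for cases that got worse.

    A case missing from `current` counts as a regression with current score None.
    """
    regressions = {}
    for case_id, old in baseline.items():
        new = current.get(case_id)
        if new is None or new < old - tolerance:
            regressions[case_id] = (old, new)
    return regressions


# Hypothetical stored baseline, using the combined scores from this PoC.
baseline = json.loads('{"refund-status": 1.0, "shipping-status": 1.0}')
current = {"refund-status": 1.0, "shipping-status": 0.5}
print(find_regressions(baseline, current))
# prints {'shipping-status': (1.0, 0.5)}
```

Comparing against a stored baseline rather than a fixed 1.0 threshold would let the gate tolerate scorers that never reach a perfect score while still flagging cases that get worse.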

The deterministic sample also compares compact serialized JSON strings for exact equality, which is intentionally strict. Real LLM applications usually need a mix of exact checks, schema checks, semantic scorers, LLM-as-judge scoring, human review, and production trace sampling.
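
As one example of a looser check that could sit beside exact equality, a schema check validates shape and allowed values rather than the full string. The following is a standard-library sketch. The ALLOWED table and the schema_score function are hypothetical, with values taken from the sample outputs in this PoC.

```python
import json

# Hypothetical allow-list; the values are the ones that appear in this PoC's expected outputs.
ALLOWED = {
    "status": {"pending", "ready"},
    "action": {"collect_receipt", "notify_customer"},
}


def schema_score(output_json, allowed=ALLOWED):
    """Return 1.0 when the output is a JSON object with exactly the allowed keys and values, else 0.0."""
    try:
        data = json.loads(output_json)
    except (TypeError, ValueError):
        return 0.0  # not parseable as JSON
    if not isinstance(data, dict) or set(data) != set(allowed):
        return 0.0  # wrong top-level type, or missing/extra keys
    for key, value in data.items():
        if not isinstance(value, str) or value not in allowed[key]:
            return 0.0  # value outside the allow-list for this key
    return 1.0
```

A check like this would accept reordered keys or extra whitespace, yet it would still reject the email_customer regression from step 5 because that value is outside the allow-list. It does not verify that a given case received the right allowed value, so it complements an exact check rather than replacing it.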