Constraint Decay: Why LLM Agents Fail at Real Backend Code
Your AI coding agent just built a REST API endpoint. It passes all unit tests. The code looks clean. Then you add an ORM constraint, an architectural pattern requirement, and an auth middleware spec — and the next three tasks start failing in ways that are hard to explain. That sequence has a name now: constraint decay.
A May 2026 paper from arXiv (2605.06445) titled "Constraint Decay: The Fragility of LLM Agents in Backend Code Generation" puts hard numbers on something many developers have noticed informally. This article walks through what the paper found, why it matters for teams shipping production code with AI agents, and how Effloow Lab reproduced the decay curve from the paper using a pure-Python PoC.
Why This Matters
Benchmark scores for LLM coding agents have climbed fast. Models like Qwen3-Coder-Next, MiniMax-M2.5, and Kimi-K2.5 now exceed 85% on assertion pass rates when tasks are given full architectural freedom — no prescribed database schema, no forced ORM, no required architectural pattern. Those numbers get cited in model release announcements and leaderboards.
The problem is that unconstrained freedom describes almost none of your real backend work.
Production code operates inside a web of structural requirements: a specific ORM, an existing auth middleware pattern, a database schema your team maintains, an architectural convention from a decision three years ago. The paper tests what happens when agents face those constraints, and the results are harder to dismiss than a blog post hot take. This is an empirical study: 80 greenfield generation tasks and 20 feature-implementation tasks, eight web frameworks (Flask, FastAPI, Django, aiohttp, Express, Fastify, Hono, Koa), evaluated with end-to-end behavioral tests and static verifiers.
The headline finding: assertion pass rates drop by an average of 30 percentage points from baseline to fully constrained scenarios — a 40% relative loss of baseline performance. That is not a marginal degradation. It is a collapse.
For developers evaluating whether to trust an AI agent with backend code, understanding why this happens is more useful than knowing the number. That is what this article focuses on.
Core Concepts
What "Constraint Decay" Means
The term is precise. "Decay" is not a metaphor here — the paper fits the performance drop to an exponential model. As the number of structural constraints increases from zero (bare task, architectural freedom) to five (ORM layer, architectural pattern, DB schema, auth middleware, full API contract), pass rates fall along a curve that looks like radioactive decay: steep early, flattening later, but always lower.
Effloow Lab ran a sandbox PoC to reproduce this numerically. Using the paper's reported summary statistics (~50% baseline, ~20% at full constraints for minimal frameworks), the lab fitted an exponential decay model:
pass_rate = baseline × exp(−0.1888 × n_constraints)
The fitted decay rate of 0.1888 means each additional structural constraint multiplies the remaining pass rate by roughly 0.83. Add five constraints and you are at about 39% of your starting performance.
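The arithmetic in that paragraph can be checked in a few lines. This is a minimal sketch, not the lab's PoC code: the function name `pass_rate` and the constant name `DECAY_RATE` are chosen here for illustration, and only the 0.1888 figure comes from the text above.

```python
import math

DECAY_RATE = 0.1888  # fitted rate quoted in the article


def pass_rate(baseline: float, n_constraints: int, k: float = DECAY_RATE) -> float:
    """Exponential decay model: baseline * exp(-k * n_constraints)."""
    return baseline * math.exp(-k * n_constraints)


# Factor applied to the pass rate by each additional constraint.
per_constraint_multiplier = math.exp(-DECAY_RATE)

# Fraction of baseline performance left after five constraints.
remaining_after_five = pass_rate(1.0, 5)

print(f"{per_constraint_multiplier:.3f}")  # 0.828
print(f"{remaining_after_five:.3f}")       # 0.389
```

Because the model is multiplicative, the loss per constraint is a constant fraction of whatever pass rate remains, not a constant number of percentage points.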
Here is the PoC's output table across three framework profiles:
Constraints              Flask/Koa (minimal)   FastAPI (moderate)   Django (convention-heavy)
----------------------------------------------------------------------------------------------
None (baseline)          50.0%                 45.0%                22.0%
ORM layer                41.4%                 36.2%                17.2%
Arch pattern + ORM       34.3%                 29.1%                13.5%
DB schema + Arch + ORM   28.4%                 23.5%                10.5%
Auth middleware added    23.5%                 18.9%                8.2%
Full API contract spec   19.4%                 15.2%                6.4%
The numbers are reconstructed from the paper's aggregate statistics, not a raw replay of the evaluation pipeline. What they demonstrate is that the decay shape is consistent with an exponential model across all three framework tiers.
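As a rough cross-check, the decay rate each column implies can be back-solved from its first and last rows. The sketch below assumes the single-exponential model from earlier and uses only the rounded endpoints printed in the table, so the results are approximate. The names `PROFILES` and `implied_decay_rate` are illustrative and do not come from the PoC.

```python
import math

# Endpoints copied from the table above: (baseline %, full-constraint %).
PROFILES = {
    "Flask/Koa": (50.0, 19.4),
    "FastAPI": (45.0, 15.2),
    "Django": (22.0, 6.4),
}
N_CONSTRAINTS = 5


def implied_decay_rate(baseline: float, final: float, n: int = N_CONSTRAINTS) -> float:
    """Solve final = baseline * exp(-k * n) for k."""
    return math.log(baseline / final) / n


rates = {name: implied_decay_rate(b, f) for name, (b, f) in PROFILES.items()}

for name, k in rates.items():
    print(f"{name:10s} k={k:.3f}  per-constraint multiplier={math.exp(-k):.3f}")
```

Under these assumptions the implied rates come out near 0.19, 0.22, and 0.25 for the three columns. That suggests the modeled convention-heavy profiles decay somewhat faster than the minimal-framework profile, on top of starting from a lower baseline.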
The Framework-Tier Gap
The second major finding is the baseline gap between minimal and convention-heavy frameworks. Flask and Koa start around 49–51% assertion pass rate. Django and FastAPI trail by 25–32 percentage points at baseline — before any additional constraints are layered on.
The reason is structural. Flask and Koa are explicit about almost everything: routing, ORM choice, middleware order. An LLM agent building a Flask endpoint must make concrete, visible decisions. Those decisions show up in code that is easy to test.
Django and FastAPI impose conventions. Django's ORM, its admin interface, its migration system, its signal architecture — these are not visible in a task prompt. They live in the framework's implicit contract with the developer. When an LLM agent generates code for a Django project, it needs to know which conventions apply, which ones the project has overridden, and how the framework's default behaviors interact with the task at hand. The paper's data suggests agents are much worse at navigating that implicit contract than they are at following explicit specifications.
FastAPI occupies a middle position. It is explicit in its HTTP routing (Pythonic type annotations drive a lot of behavior), but its dependency injection system and SQLAlchemy integration patterns carry real convention overhead. The paper's data and the PoC's modeled results put FastAPI between Flask and Django in baseline performance.
Data-Layer Defects as the Root Cause
The paper's error analysis identifies data-layer defects as the leading root cause of failures across all tested configurations. Two specific failure modes dominate:
- Incorrect query composition. Agents generate queries that are syntactically valid and pass simple mocks but fail under real data conditions: missing joins, wrong filter logic, or subquery structure that works in isolation but not against the schema.
- ORM runtime violations. Agents produce code that violates ORM usage rules at runtime. These often pass static analysis (the code is valid Python or JavaScript) but raise exceptions when the ORM tries to execute the generated query plan against the database.
Both categories share a common pattern: the agent generates code that looks correct at the level of syntax and surface behavior but fails at the boundary between application logic and the persistence layer. This is where structural constraints bite hardest, because ORM behavior is exactly the kind of implicit convention that does not show up clearly in a task prompt.
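To make the query-composition failure concrete, here is a small hypothetical illustration using Python's standard-library sqlite3 module. It is not taken from the paper's task set: the schema, data, and function names are invented. The naive query uses an inner join, so it agrees with the correct query on a thin fixture but silently drops users who have no orders once the data looks more realistic.

```python
import sqlite3


def make_db(users, orders):
    """Build an in-memory database with a users table and an orders table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);"
        "CREATE TABLE orders (id INTEGER PRIMARY KEY,"
        " user_id INTEGER REFERENCES users(id));"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?)", users)
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    return conn


def order_counts_naive(conn):
    # Inner join: users without any orders vanish from the result.
    return conn.execute(
        "SELECT u.name, COUNT(o.id) FROM users u "
        "JOIN orders o ON o.user_id = u.id GROUP BY u.id ORDER BY u.id"
    ).fetchall()


def order_counts_correct(conn):
    # Left join: every user appears, with a count of 0 when there are no orders.
    return conn.execute(
        "SELECT u.name, COUNT(o.id) FROM users u "
        "LEFT JOIN orders o ON o.user_id = u.id GROUP BY u.id ORDER BY u.id"
    ).fetchall()


# Thin fixture: the only user has an order, so both queries agree.
thin = make_db([(1, "ada")], [(1, 1)])

# More realistic data: bob exists but has not ordered anything yet.
real = make_db([(1, "ada"), (2, "bob")], [(1, 1), (2, 1)])

print(order_counts_naive(thin) == order_counts_correct(thin))  # True
print(order_counts_naive(real))    # [('ada', 2)]
print(order_counts_correct(real))  # [('ada', 2), ('bob', 0)]
```

A test written only against the thin fixture would pass for both versions, which is the pattern described above: correct-looking surface behavior that breaks at the persistence boundary.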
What Existing Benchmarks Miss
SWE-bench tests whether an agent can resolve real GitHub issues. HumanEval tests isolated function completion. Neither benchmark systematically measures whether the generated code satisfies non-functional structural requirements: "use the project's ORM", "follow the repository's auth middleware pattern", "match this DB schema". Existing benchmarks reward functional correctness while being blind to structural compliance.
The constraint decay paper argues this gap is not incidental. Benchmarks are designed to be automatable, and structural compliance checks require knowledge of the project's conventions — which means they require per-project setup that is expensive to scale. The result is a systematic bias: models optimize for benchmark tasks that do not test the property that matters most in production environments. You can read more about the general limits of coding benchmarks in our guide to AI coding market share and agent evaluation.
Practical Application
Designing Tasks to Reduce Constraint Pressure
The paper's findings suggest a practical heuristic: if you are delegating a backend task to an AI agent, make every structural constraint explicit in the prompt.
"Build a user authentication endpoint" is a minimal-constraint task. The agent will make reasonable choices about ORM, schema, and middleware — choices that may conflict with the rest of your codebase.
A better prompt makes the constraints explicit:
Build a POST /auth/login endpoint using:
- SQLAlchemy ORM (Session pattern, not async)
- User model defined in app/models/user.py
- Password verification via the existing verify_password() in app/utils/auth.py
- Return a JSON response with {token: str, expires_at: ISO8601}
- No new dependencies
That prompt states five requirements explicitly, most of them structural. The paper's data says you will still see degraded performance compared to an unconstrained task, but the agent is at least working from the right specification rather than inferring conventions it may not know.
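One way to keep prompts like this consistent across tasks is to hold the constraints as data and render them into text. The sketch below is only an illustration: `TaskSpec` and `render` are hypothetical names, and the constraint strings are copied from the example prompt above.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """A task goal plus the structural constraints the agent must respect."""

    goal: str
    constraints: list[str] = field(default_factory=list)

    def render(self) -> str:
        # One header line, then one bullet per explicit constraint.
        lines = [f"{self.goal} using:"]
        lines += [f"- {c}" for c in self.constraints]
        return "\n".join(lines)


login_spec = TaskSpec(
    goal="Build a POST /auth/login endpoint",
    constraints=[
        "SQLAlchemy ORM (Session pattern, not async)",
        "User model defined in app/models/user.py",
        "Password verification via the existing verify_password() in app/utils/auth.py",
        "Return a JSON response with {token: str, expires_at: ISO8601}",
        "No new dependencies",
    ],
)

prompt = login_spec.render()
print(prompt)
```

Keeping the constraints in a list also makes it easy to see how many requirements a single task carries before it is handed to an agent.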
Using Minimal Frameworks Strategically
The framework-tier gap the paper documents has a concrete implication: if your team is choosing a framework for a new service and plans to use AI agents heavily in development, minimal frameworks (Flask, Express, Koa, Hono) produce significantly better agent performance at baseline than convention-heavy ones.
This does not mean avoid Django or FastAPI — those frameworks carry real productivity advantages for humans. But the tradeoff is real. Teams that use AI agents for high-volume boilerplate generation on convention-heavy stacks will see lower pass rates and more manual correction work.
Testing for Structural Compliance, Not Just Functional Correctness
The paper's evaluation methodology is itself a pattern worth adopting. The authors pair static verifiers with behavioral tests, checking that code satisfies structural requirements (imports, ORM usage patterns, architectural conventions) rather than only testing whether the endpoint returns the right HTTP response.
Adding a structural compliance check to your CI pipeline for agent-generated code costs real setup time, but it catches the ORM violations and incorrect query composition that functional tests miss. For a team running agent-generated code through automated review, this is the most direct mitigation the paper's findings suggest.
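A structural check does not have to be elaborate to be useful. The following is a minimal sketch of one possible check, built on Python's standard-library ast module. It is not the paper's verifier. The rule sets `FORBIDDEN_MODULES` and `REQUIRED_MODULES` are hypothetical project policies, and a real check would likely cover more than imports.

```python
import ast

# Hypothetical project policy: data access must go through the ORM.
FORBIDDEN_MODULES = {"sqlite3", "psycopg2"}  # raw drivers that bypass the ORM
REQUIRED_MODULES = {"sqlalchemy"}            # the ORM the project standardizes on


def imported_modules(source: str) -> set[str]:
    """Return the top-level names of all absolute imports in the source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def check_structure(source: str) -> list[str]:
    """List policy violations; an empty list means the file complies."""
    mods = imported_modules(source)
    problems = [f"forbidden import: {m}" for m in sorted(mods & FORBIDDEN_MODULES)]
    problems += [f"missing required import: {m}" for m in sorted(REQUIRED_MODULES - mods)]
    return problems


compliant = "from sqlalchemy.orm import Session\n"
violating = "import sqlite3\n"

print(check_structure(compliant))  # []
print(check_structure(violating))
```

Because the check only parses the source and never executes it, it can run in CI without installing the project's dependencies. It catches structural drift, such as an agent reaching for a raw driver, that a passing endpoint test would not reveal.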
For a deeper look at how AI code review tools approach similar problems, see our roundup of the best AI code review tools in 2026.
Common Mistakes
Treating Benchmark Scores as Production Predictors
The most common mistake when evaluating AI coding agents is reading a benchmark score and projecting it onto your production codebase. An agent scoring 85%+ on unconstrained generation tasks may score 20–30% on your fully specified backend tasks. The paper makes this quantitative: a 40% relative performance loss from benchmark to production-like conditions is the paper's central finding, not an edge case.
Assuming "Passes Tests" Means "Structurally Correct"
A generated endpoint that passes your unit tests may still contain ORM usage violations that only surface under production load, or query composition errors that appear when the data gets large enough. "Green tests" is a necessary but not sufficient condition for structurally correct agent-generated backend code.
Using a Single Prompt to Load All Constraints
A related failure mode: developers pack every structural constraint into a single, complex prompt and wonder why agent performance drops. The constraint decay model suggests that accumulation is the problem. Splitting complex tasks into smaller steps — each with fewer simultaneous constraints — should reduce the compounding decay effect, even if total task count increases.
Not Accounting for the Framework's Implicit Contract
Assigning Django tasks to agents without providing explicit documentation of the project's ORM patterns, migration conventions, and signal usage is asking the agent to infer that implicit contract from context. Some models are better at this than others, but the paper's data shows that even the best-performing models suffer significant degradation on convention-heavy stacks.
FAQ
Does constraint decay affect all LLMs equally?
The paper tested multiple capable models, including Qwen3-Coder-Next (80B), MiniMax-M2.5, Kimi-K2.5, and GPT-5.2. The decay pattern appears across all of them — no model is immune. The best-performing models under unconstrained conditions (85%+ baseline) still lose roughly 30 percentage points when all structural constraints are applied. The relative ranking of models may shift under constraint pressure, but the decay itself is universal in the paper's data.
Is constraint decay the same as context window degradation?
No, though the two can interact. Context window degradation (also called "lost in the middle" failure) refers to models losing attention to information placed in the middle of long prompts. Constraint decay is a different phenomenon: it measures performance loss as the number of structural requirements increases, independent of prompt length. A fully constrained task specification can be shorter than an unconstrained one if the constraints are explicit. Constraint decay is about the cognitive complexity of satisfying multiple structural requirements simultaneously, not about prompt length or token position.
Why do minimal frameworks like Flask outperform Django at baseline?
The paper frames this as a convention overhead problem. Flask is explicit by design — almost everything that happens in a Flask application is written in the application code. There is no hidden ORM layer, no admin interface convention, no magic migration system. An LLM agent generating Flask code makes visible, auditable decisions. Django's conventions are not written in the application code; they live in the framework's documentation and the project's accumulated patterns. Agents that have not internalized the specific project's Django conventions generate code that is structurally incorrect even when it is functionally reasonable. FastAPI occupies a middle position because its HTTP routing is explicit (type annotations are visible) but its dependency injection and ORM integration patterns carry convention overhead comparable to Django.
What does this mean for AI coding agents in production deployments?
The practical implication is that AI coding agents in their current state should not be trusted as autonomous backend generators for constrained, production-grade tasks without structural compliance checks in the review pipeline. The paper is not arguing that AI agents are useless for backend development: unconstrained generation at 85%+ is genuinely useful for scaffolding and boilerplate. The argument is that the last mile, making generated code conform to your project's structural requirements, is where current agents fail most and where current benchmarks provide the least signal.
Key Takeaways
The constraint decay paper is notable because it quantifies a failure mode that practitioners have observed informally for the past two years. The key numbers to keep in mind:
- 30 percentage point average drop in assertion pass rates from baseline to fully constrained tasks (40% relative performance loss)
- 25–32 point baseline gap between minimal frameworks (Flask, Koa) and convention-heavy ones (Django, FastAPI)
- Data-layer defects — bad query composition and ORM violations — are the leading root cause across all frameworks and models
- Existing benchmarks (HumanEval, SWE-bench) do not measure non-functional structural compliance, which means they systematically overstate agent readiness for production-constrained tasks
For teams actively using AI coding agents on backend work, the immediate practical actions are: make structural constraints explicit in every prompt, add structural compliance verification to the CI pipeline, and avoid projecting unconstrained benchmark scores onto constrained production tasks.
The PoC Effloow Lab ran shows that an exponential decay shape fits the paper's reported summary statistics. With a fitted decay rate of ~0.19, each new structural constraint multiplies the remaining pass rate by roughly 0.83, which compounds quickly across the five constraint levels the paper tests. In the paper's data this is not a quirk of a specific model or framework. It looks like a structural property of the problem, and nothing in the reported results suggests that larger models alone will make it disappear.
Source: Constraint Decay: The Fragility of LLM Agents in Backend Code Generation — arXiv 2605.06445 (May 2026)