Context Rot: Keep LLM Agents Sharp Past 256K Tokens
You bought the million-token context window. You assumed your agent could run for hours without degrading. Then your ticket triage agent started misclassifying bugs it would have caught in the first 20 minutes.
That is context rot — and it is the most common production failure mode in long-running LLM agents in 2026.
Research published in early 2026 found that even models with verified million-token context windows show measurable accuracy degradation starting around 32,000 tokens, with accuracy dropping roughly 0.6 percentage points for every 10,000 tokens beyond that threshold. At that rate, an agent with a 92% baseline is down to about 78.6% accuracy by 256K tokens, and the simple model used below reaches its 55% floor well before 1M tokens.
The good news: context rot is a solved problem at the engineering level. This guide covers what causes it, how to measure it, and four production techniques that can recover 80–90% of the lost performance.
What Context Rot Is and When It Starts
Context rot is not a bug — it is an emergent property of how transformer attention works at scale.
When a context window fills up, the model must attend over a longer sequence to find relevant information. Two things break down as sequence length grows:
- Context distraction: Irrelevant tokens dilute the attention signal. The model loses track of goals stated early in the context.
- Position encoding drift: Many position encoding schemes (RoPE variants, ALiBi) degrade outside their training range. The model was trained on 128K-token examples; at 800K tokens it is in uncharted territory.
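Context distraction can be illustrated with a toy softmax calculation. The sketch below assumes a single relevant token whose attention logit sits a fixed margin above every distractor, and that all distractors score the same; real attention heads are far less uniform, so `relevant_token_weight` and `logit_margin` are illustrative names, not measurements of any model.

```python
import math


def relevant_token_weight(n_tokens: int, logit_margin: float = 5.0) -> float:
    """Softmax weight on one relevant token among n_tokens - 1 equal distractors.

    Toy assumption: the relevant token scores `logit_margin` above every
    distractor, and all distractors share the same score.
    """
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    boost = math.exp(logit_margin)
    # Each distractor contributes exp(0) = 1 to the softmax denominator.
    return boost / (boost + (n_tokens - 1))


for n in (1_000, 32_000, 256_000, 1_000_000):
    weight = relevant_token_weight(n)
    print(f"{n:>9,} tokens -> weight on the relevant token: {weight:.5f}")
```

Even with a fixed margin, the share of attention landing on the relevant token shrinks as distractors accumulate, which is the dilution effect described in the first bullet.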
Effloow Lab ran a Python simulation based on the degradation curves documented at morphllm.com/context-rot and in the UMA paper (arXiv:2602.18493):
def simulate_context_rot(context_sizes, base_accuracy=0.92, rot_threshold=32000):
    """Linear decay approximation of empirical context rot findings."""
    results = []
    for size in context_sizes:
        if size <= rot_threshold:
            acc = base_accuracy
        else:
            excess = size - rot_threshold
            decay = (excess / 10000) * 0.006  # ~0.6 points per 10K excess tokens
            acc = max(base_accuracy - decay, 0.55)  # clamp at a 55% floor
        results.append((size, round(acc, 3)))
    return results
The output:
| Context Size | Accuracy | Degradation (percentage points) |
|---|---|---|
| 8,000 tokens | 92.0% | None |
| 32,000 tokens | 92.0% | None |
| 64,000 tokens | 90.1% | -1.9 |
| 128,000 tokens | 86.2% | -5.8 |
| 256,000 tokens | 78.6% | -13.4 |
| 512,000 tokens | 63.2% | -28.8 |
| 1,000,000 tokens | 55.0% (floor) | -37.0 |
A 37-point accuracy drop is not subtle. For a code review agent running across a large codebase, it means roughly 40% of the issues it would catch in a fresh session (37 of every 92) slip through by the time it reaches the end of the context.
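The same linear model can be inverted to turn an accuracy target into a token budget. This is a sketch under the simulation's own assumptions (92% baseline, 32K onset, 0.6 points per 10K tokens); `max_context_for_accuracy` is a hypothetical helper, and the numbers are planning estimates rather than measurements.

```python
def max_context_for_accuracy(
    target_accuracy: float,
    base_accuracy: float = 0.92,
    rot_threshold: int = 32_000,
    decay_per_10k: float = 0.006,
) -> int:
    """Largest context size (in tokens) that keeps the linear model at or above the target."""
    if target_accuracy > base_accuracy:
        raise ValueError("target cannot exceed the baseline accuracy")
    allowed_drop = base_accuracy - target_accuracy
    # Each 10K tokens past the threshold costs `decay_per_10k` of accuracy.
    return rot_threshold + int(allowed_drop / decay_per_10k * 10_000)


for target in (0.90, 0.85, 0.80):
    budget = max_context_for_accuracy(target)
    print(f"target {target:.0%}: keep context under ~{budget:,} tokens")
```

Under these assumptions, holding 85% accuracy means staying below roughly 149K tokens, regardless of how large the advertised window is.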
Why Big Context Windows Don't Fix This
Model vendors advertise 1M token context windows, and those windows are technically accurate. The model can process 1M tokens. It will not crash or truncate.
But "can process" and "reasons accurately about" are different claims.
The fundamental problem: attention is $O(n^2)$ in sequence length. To make long contexts tractable, modern models use sparse attention, sliding window attention, or chunked processing. These optimizations preserve throughput but sacrifice the dense cross-token attention that makes accurate reasoning possible.
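To make the quadratic growth concrete, the short calculation below counts the query-key score entries that dense self-attention would compute for a single head in a single layer. It ignores causal masking (which roughly halves the count without changing the growth rate) and every kernel-level optimization, so treat it as arithmetic, not a performance model.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key score entries in dense, unmasked self-attention."""
    return n_tokens * n_tokens


baseline = attention_pairs(32_000)
for n in (32_000, 128_000, 256_000, 1_000_000):
    pairs = attention_pairs(n)
    ratio = pairs / baseline
    print(f"{n:>9,} tokens: {pairs:.2e} score entries ({ratio:,.1f}x the 32K cost)")
```

Doubling the context quadruples the work, which is why long-context models lean on sparse or windowed attention.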
Additionally, the models themselves were predominantly trained on sequences far shorter than their advertised context limit. A model trained on mostly 8K–32K examples but capable of 1M-token inference will still reason most accurately in the 8K–32K range.
The practical advice: treat the context window limit as a technical ceiling, not an engineering target. Your agent should aim to stay well below it.
Technique 1: Rolling Summary
Rolling summary is the most straightforward mitigation. Instead of keeping every prior turn in context, you maintain a compact summary that captures the essential information.
The UMA paper (arXiv:2602.18493) found that a dual memory representation — a compact core summary plus a structured Memory Bank — reduces token count by approximately 82% while maintaining or improving task accuracy.
class RollingSummaryAgent:
    def __init__(self, max_summary_tokens=4000):
        self.summary = ""
        self.recent_turns = []
        self.max_summary_tokens = max_summary_tokens
        self.total_turns = 0

    def add_turn(self, role: str, content: str):
        self.recent_turns.append({"role": role, "content": content})
        self.total_turns += 1
        # Rotate when recent turns accumulate
        if len(self.recent_turns) >= 6:
            self._compress_turns()

    def _compress_turns(self):
        """Compress oldest 4 turns into the rolling summary."""
        to_compress = self.recent_turns[:4]
        self.recent_turns = self.recent_turns[4:]
        # In production: call a cheap model (Haiku, GPT-4o-mini) to summarize
        # This example shows the pattern without live API calls
        compressed = "\n".join(
            f"[{t['role']}]: {t['content'][:100]}..."
            for t in to_compress
        )
        # Cap the summary at ~4 characters per token, keeping the newest text
        self.summary = f"{self.summary}\n{compressed}"[-self.max_summary_tokens * 4:]

    def get_context(self) -> list[dict]:
        """Build context: system summary + recent turns."""
        context = []
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Conversation history (summarized):\n{self.summary}"
            })
        context.extend(self.recent_turns)
        return context

# Usage
agent = RollingSummaryAgent(max_summary_tokens=4000)
for i in range(20):
    agent.add_turn("user", f"Tool result {i}: found 3 issues in file_{i}.py")
    agent.add_turn("assistant", f"Noted. Flagging issues in file_{i}.py for review.")
context = agent.get_context()
print(f"Context turns: {len(context)} (from {agent.total_turns} total)")
# Context turns: 5 (from 40 total): 1 summary message + 4 recent turns
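The class above truncates each turn to 100 characters instead of summarizing it. One way to swap in a real summarizer without tying the code to a vendor SDK is to inject it as a callable. In the sketch below, `compress_with_summarizer`, the prompt wording, and `fake_summarize` are all hypothetical; in production `summarize` would wrap whatever cheap model you use.

```python
from typing import Callable

Turn = dict[str, str]


def compress_with_summarizer(
    previous_summary: str,
    turns: list[Turn],
    summarize: Callable[[str], str],
    max_summary_chars: int = 16_000,
) -> str:
    """Fold `turns` into `previous_summary` using a caller-supplied summarizer."""
    transcript = "\n".join(f"[{t['role']}]: {t['content']}" for t in turns)
    prompt = (
        "Update the running summary with the new turns. "
        "Keep goals, decisions, and open issues.\n\n"
        f"Current summary:\n{previous_summary or '(empty)'}\n\n"
        f"New turns:\n{transcript}"
    )
    new_summary = summarize(prompt).strip()
    # Hard cap as a safety net in case the summarizer ignores length guidance.
    return new_summary[:max_summary_chars]


def fake_summarize(prompt: str) -> str:
    """Stand-in for a real model call: reports how many turn lines it saw."""
    turn_lines = [line for line in prompt.splitlines() if line.startswith("[")]
    return f"Summary covering {len(turn_lines)} turns."


turns = [
    {"role": "user", "content": "Tool result 1: found 3 issues in file_1.py"},
    {"role": "assistant", "content": "Noted. Flagging issues in file_1.py."},
]
print(compress_with_summarizer("", turns, fake_summarize))
```

Passing the previous summary back into the prompt lets the model rewrite it rather than append to it, which keeps the summary from growing without bound.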
Token savings using compression ratio of 0.18 (82% reduction):
| Original | After Summary | Saved |
|---|---|---|
| 64,000 tokens | ~11,520 | 82% |
| 128,000 tokens | ~23,040 | 82% |
| 256,000 tokens | ~46,080 | 82% |
| 512,000 tokens | ~92,160 | 82% |
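Combining the 0.18 compression ratio with the linear decay model from earlier gives a rough sense of what the savings buy. This sketch assumes the summary keeps everything the task needs, which the decay model cannot verify; `projected_accuracy` simply restates the simulation's parameters, and the printed figures are estimates, not benchmark results.

```python
def projected_accuracy(
    tokens: int,
    base: float = 0.92,
    threshold: int = 32_000,
    decay_per_10k: float = 0.006,
    floor: float = 0.55,
) -> float:
    """Accuracy under the article's linear decay model, clamped at the floor."""
    excess = max(0, tokens - threshold)
    return max(base - excess / 10_000 * decay_per_10k, floor)


COMPRESSION_RATIO = 0.18  # share of tokens kept after summarization

for original in (64_000, 128_000, 256_000, 512_000):
    compressed = round(original * COMPRESSION_RATIO)
    before = projected_accuracy(original)
    after = projected_accuracy(compressed)
    print(f"{original:>7,} -> {compressed:>6,} tokens: {before:.1%} -> {after:.1%}")
```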
Technique 2: Observation Masking
Rolling summary rewrites history — which requires an extra LLM call. Observation masking is cheaper: it drops or truncates the full content of older tool outputs, replacing them with a placeholder.
Research comparing observation masking to LLM summarization found masking matched or exceeded summarization in agent task solve rate while being 52% cheaper.
from dataclasses import dataclass, field

@dataclass
class ObservationMaskingContext:
    max_recent_observations: int = 3
    observation_history: list = field(default_factory=list)

    def add_observation(self, tool_name: str, full_output: str):
        self.observation_history.append({
            "tool": tool_name,
            "content": full_output,
            "masked": False
        })
        # Mask older observations beyond the window
        if len(self.observation_history) > self.max_recent_observations:
            for obs in self.observation_history[:-self.max_recent_observations]:
                obs["masked"] = True

    def build_context_messages(self) -> list[dict]:
        messages = []
        for obs in self.observation_history:
            if obs["masked"]:
                content = f"[MASKED] {obs['tool']} output - superseded by later results"
            else:
                content = f"{obs['tool']} output:\n{obs['content']}"
            messages.append({"role": "tool", "content": content})
        return messages

    def token_estimate(self) -> int:
        """Rough token estimate: 1 token ≈ 4 characters."""
        total = 0
        for obs in self.build_context_messages():
            total += len(obs["content"]) // 4
        return total

# Usage
ctx = ObservationMaskingContext(max_recent_observations=3)
for i in range(10):
    ctx.add_observation("file_reader", f"Read file_{i}.py: " + "x" * 2000)
print(f"Token estimate (masked): ~{ctx.token_estimate():,}")
# Token estimate (masked): ~1,625 (vs ~5,090 without masking)
The key design choice is max_recent_observations. Three to five recent observations is usually sufficient; older outputs are either already acted on or superseded by newer ones.
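Masking does not have to be all-or-nothing. Since the technique covers truncating as well as dropping old outputs, a middle option is to keep the head and tail of an older observation and note how much was removed. `truncate_observation` and its default sizes are hypothetical choices for illustration.

```python
def truncate_observation(text: str, head_chars: int = 200, tail_chars: int = 100) -> str:
    """Keep the start and end of an old tool output and note how much was cut."""
    if len(text) <= head_chars + tail_chars:
        return text  # Short outputs are cheaper to keep than to annotate.
    omitted = len(text) - head_chars - tail_chars
    head = text[:head_chars]
    # Slice from an explicit index so tail_chars=0 yields an empty tail.
    tail = text[len(text) - tail_chars:]
    return f"{head}\n[... {omitted:,} characters omitted ...]\n{tail}"


long_output = "Read file_0.py: " + "x" * 2000
shortened = truncate_observation(long_output)
print(len(long_output), "->", len(shortened), "characters")
```

Keeping the head usually preserves the command or file name, and the tail often holds the exit status or final error, which are the parts an agent is most likely to need again.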
Technique 3: Selective Retrieval
Rather than compressing or masking prior context, selective retrieval avoids loading it in the first place. Instead of maintaining a full conversation history, you store tool outputs and reasoning traces in a vector store and retrieve only what's relevant to the current step.
This is the production pattern behind A-MEM and similar agent memory systems.
# Conceptual selective retrieval pattern
# In production: use mem0, Chroma, Pinecone, or similar
from typing import Optional

class SelectiveRetrievalContext:
    def __init__(self):
        self.memory_store = {}  # key → {"content": ..., "tags": [...]}
        self.current_step_context = []

    def store(self, key: str, content: str, tags: Optional[list[str]] = None):
        """Store a result for later retrieval."""
        self.memory_store[key] = {
            "content": content,
            "tags": tags or []
        }

    def retrieve(self, query_tags: list[str], limit: int = 3) -> list[dict]:
        """Retrieve most relevant memories by tag overlap."""
        scored = []
        for key, entry in self.memory_store.items():
            overlap = len(set(query_tags) & set(entry["tags"]))
            if overlap > 0:
                scored.append((overlap, key, entry))
        # Sort on the overlap count only, so entry dicts are never compared
        scored.sort(key=lambda item: item[0], reverse=True)
        return [{"key": k, "content": e["content"]}
                for _, k, e in scored[:limit]]

    def build_context_for_step(self, current_query: str, tags: list[str]) -> list[dict]:
        """Build minimal context: only what this step needs."""
        relevant = self.retrieve(tags)
        messages = [{"role": "system",
                     "content": f"Current task: {current_query}"}]
        for mem in relevant:
            messages.append({"role": "user",
                             "content": f"Prior finding ({mem['key']}): {mem['content']}"})
        return messages
Selective retrieval is most effective for multi-session agents or agents with large tool-output histories. The tradeoff: retrieval adds latency and requires embedding infrastructure.
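Tag overlap stands in for similarity search in the class above. To show the ranking step a vector store performs, the sketch below scores memories by cosine similarity over word counts. The bag-of-words vector is a deliberately crude substitute for a real embedding model, and `retrieve_similar` is a hypothetical helper, not part of any memory library.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_similar(query: str, memories: dict[str, str], limit: int = 3) -> list[tuple[str, float]]:
    """Return up to `limit` (key, score) pairs, highest similarity first."""
    query_vec = bag_of_words(query)
    scored = [(key, cosine_similarity(query_vec, bag_of_words(text)))
              for key, text in memories.items()]
    scored = [(key, score) for key, score in scored if score > 0]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


memories = {
    "auth": "null pointer bug in auth login handler",
    "css": "button color mismatch in stylesheet",
    "db": "slow query on the orders table",
}
for key, score in retrieve_similar("login bug in auth", memories):
    print(f"{key}: {score:.2f}")
```

A real deployment would replace `bag_of_words` with embedding calls and the linear scan with an index lookup; the ranking logic stays the same.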
Technique 4: KV Cache Eviction
KV cache eviction works at the inference engine level rather than the application level. Frameworks like vLLM and SGLang support prefix caching; some support selective eviction policies that drop low-attention keys from older positions.
This is transparent to application code but requires the right inference setup:
# vLLM configuration for KV cache management
# Ref: docs.vllm.ai/en/latest/serving/engine_args
# Flag names change between releases; confirm with `vllm serve --help`.
vllm_args = {
    "--model": "Qwen/Qwen3-8B",
    "--max-model-len": 131072,          # 128K effective context
    "--gpu-memory-utilization": 0.90,
    "--enable-prefix-caching": True,    # Reuse KV cache for repeated prefixes
    # The attention sliding window is read from the model's own config;
    # vLLM documents --disable-sliding-window rather than a flag that sets one.
}
# For SGLang with RadixAttention (prefix-aware KV cache)
# Confirm flags with `python -m sglang.launch_server --help`.
sglang_args = {
    "--model-path": "Qwen/Qwen3-8B",
    "--context-length": 131072,
    "--mem-fraction-static": 0.85,
    "--attention-backend": "flashinfer",  # FlashInfer attention (older releases used --enable-flashinfer)
}
KV cache eviction does not reduce the number of tokens in your prompt — it reduces the compute cost of attending over them. The accuracy improvement is modest (eviction can cause minor recall loss), but the latency and cost improvement can be significant.
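The idea of dropping low-attention keys from older positions can be sketched in a few lines. This is a toy selection policy, not how vLLM or SGLang manage their caches: it assumes you already have a cumulative attention score per cached position, and `select_positions_to_keep`, `recent_window`, and `heavy_budget` are hypothetical names.

```python
def select_positions_to_keep(
    attention_mass: list[float],
    recent_window: int,
    heavy_budget: int,
) -> list[int]:
    """Return sorted cache positions to keep.

    Keeps the `recent_window` newest positions unconditionally, plus the
    `heavy_budget` older positions with the highest cumulative attention.
    """
    n = len(attention_mass)
    recent_start = max(0, n - recent_window)
    recent = set(range(recent_start, n))
    # Rank the older positions by how much attention they have received so far.
    older_ranked = sorted(range(recent_start), key=lambda i: attention_mass[i], reverse=True)
    heavy = set(older_ranked[:heavy_budget])
    return sorted(recent | heavy)


# Cumulative attention received by each of 8 cached positions (made-up values).
mass = [0.9, 0.1, 0.05, 0.7, 0.02, 0.3, 0.2, 0.4]
print(select_positions_to_keep(mass, recent_window=3, heavy_budget=2))
# [0, 3, 5, 6, 7]
```

Positions 1, 2, and 4 are evicted: they are old and have drawn little attention. Anything the model later needs from them is gone, which is the recall risk noted above.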
Production Token Budget Management
Set monitoring thresholds before your agent hits context rot:
| Context Window | Warning Threshold (60%) | Rotation Threshold (80%) |
|---|---|---|
| 128,000 tokens | 76,800 tokens | 102,400 tokens |
| 200,000 tokens | 120,000 tokens | 160,000 tokens |
| 1,000,000 tokens | 600,000 tokens | 800,000 tokens |
class ContextBudgetMonitor:
    def __init__(self, max_tokens: int, warn_pct: float = 0.60, rotate_pct: float = 0.80):
        self.max_tokens = max_tokens
        self.warn_threshold = int(max_tokens * warn_pct)
        self.rotate_threshold = int(max_tokens * rotate_pct)

    def check(self, current_tokens: int) -> str:
        if current_tokens >= self.rotate_threshold:
            return "ROTATE"  # Apply rolling summary or restart context
        elif current_tokens >= self.warn_threshold:
            return "WARN"  # Prepare compression before next turn
        else:
            return "OK"

    def report(self, current_tokens: int) -> str:
        status = self.check(current_tokens)
        pct = current_tokens / self.max_tokens * 100
        return f"{status}: {current_tokens:,}/{self.max_tokens:,} ({pct:.1f}%)"

monitor = ContextBudgetMonitor(max_tokens=200000)
print(monitor.report(100000))  # OK: 100,000/200,000 (50.0%)
print(monitor.report(130000))  # WARN: 130,000/200,000 (65.0%)
print(monitor.report(170000))  # ROTATE: 170,000/200,000 (85.0%)
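Rotate at or before the 80% mark. Waiting until the window is full means the agent has spent the last 20% of its context in the most degraded regime, and it leaves no room for the next turn. Set the rotation threshold so there is still headroom for one or two more turns. For very large windows, apply these percentages to the working budget you have chosen rather than to the advertised maximum: 80% of a 1M-token window is 800,000 tokens, far past the 32K onset described earlier.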
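The headroom rule can be folded into the threshold calculation. The sketch below takes the stricter of the percentage rule and a turn-based headroom rule; `rotation_threshold` and `avg_turn_tokens` are hypothetical, and the average turn size is something you would measure from your own agent's logs.

```python
def rotation_threshold(
    max_tokens: int,
    avg_turn_tokens: int,
    headroom_turns: int = 2,
    rotate_pct: float = 0.80,
) -> int:
    """Rotation point that honors both the percentage rule and the turn headroom."""
    by_percentage = int(max_tokens * rotate_pct)
    by_headroom = max_tokens - headroom_turns * avg_turn_tokens
    # Whichever rule triggers earlier wins; never return a negative budget.
    return max(0, min(by_percentage, by_headroom))


print(rotation_threshold(200_000, avg_turn_tokens=5_000))   # percentage rule is stricter
print(rotation_threshold(200_000, avg_turn_tokens=30_000))  # headroom rule is stricter
```

With small turns the 80% rule dominates; with large tool outputs the headroom rule pulls the rotation point earlier so the next turn still fits.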
Which Technique to Use
| Technique | Token Savings | Accuracy Recovery | Added Latency | Infrastructure |
|---|---|---|---|---|
| Rolling summary | ~82% | High | +1 LLM call/rotation | LLM access required |
| Observation masking | ~60–80% | High | None | None |
| Selective retrieval | ~90%+ | High (query-dependent) | +embedding lookup | Vector store required |
| KV cache eviction | None (compute only) | Marginal | Lower (faster inference) | Inference engine config |
For most applications, start with observation masking — it is zero-infrastructure and delivers strong results. Add rolling summary when you need the agent to remember reasoning from early in the conversation, not just raw tool outputs. Use selective retrieval for multi-session agents or large persistent knowledge bases.
What This Means for Agent Builders
Context rot is not a property of a specific model or vendor. It applies to all transformer-based LLMs at scale, regardless of advertised context length. The practical implication for production agents:
- Set context budgets at design time. Decide your warning and rotation thresholds before deploying, not after you start seeing accuracy regressions.
- Observation masking first. It costs nothing and recovers most of the lost performance.
- Do not trust the advertised context window. A 1M-token window does not mean accurate reasoning at 800K tokens. Design your agent to stay comfortably below the theoretical maximum.
- Monitor token counts in production. Add a context budget monitor to every long-running agent. Log when rotations trigger; a rotation spike is a signal that your task complexity is growing faster than your context management can handle.
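Spotting a rotation spike only requires tracking how often recent turns triggered a rotation. The sketch below keeps a sliding window of turn outcomes; `RotationTracker`, the window size, and the 20% alert level are hypothetical defaults to tune against your own traffic.

```python
from collections import deque


class RotationTracker:
    """Tracks the share of recent turns that triggered a context rotation."""

    def __init__(self, window_turns: int = 50, spike_share: float = 0.2):
        # Each entry is True if that turn rotated; old entries fall off the left.
        self.events = deque(maxlen=window_turns)
        self.spike_share = spike_share

    def record_turn(self, rotated: bool) -> None:
        self.events.append(rotated)

    def rotation_share(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def is_spiking(self) -> bool:
        return self.rotation_share() > self.spike_share


tracker = RotationTracker(window_turns=10, spike_share=0.2)
for rotated in [True] + [False] * 9:
    tracker.record_turn(rotated)
print(tracker.rotation_share(), tracker.is_spiking())
```

Feeding `record_turn` from the same place that logs the monitor status keeps the alert aligned with what the agent actually did.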
The research on UMA and context window management points the same way: the gap between "can process" and "reasons accurately about" widens as sequence length grows. Engineering the context is not optional; it is the primary reliability lever for production agents in 2026.
Frequently Asked Questions
Q: Does rolling summary work with tool-use agents?
Rolling summary works well when the agent's tool outputs are verbose (file contents, API responses, search results). It is less effective when the agent needs to re-examine raw tool outputs — in that case, observation masking with a longer recent window (5–8 turns) is better.
Q: How do I decide the compression ratio for rolling summary?
The UMA paper found 0.18 (keeping 18% of original tokens) worked well for conversational agents. For agents doing factual lookup or code analysis, a higher ratio (0.25–0.35) may be needed to preserve important details.
Q: Does this apply to Claude, GPT-5, and Gemini equally?
Context rot is a universal property of transformer attention at scale. The specific degradation curve varies by model and task type. Claude and Gemini have been optimized for long-context performance relative to some other models, but the fundamental mechanism still applies.
Q: What is the overhead of running a rolling summary?
One rolling summary call per rotation. If you rotate every 40 turns, that is roughly 2.5% overhead in LLM calls. For a cheap model like Claude Haiku 4.5 or GPT-4o-mini, the cost is negligible relative to the accuracy benefit.
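The overhead figure is simple arithmetic: one summary call per rotation, divided by the turns between rotations. The helper below is a hypothetical one-liner for checking your own rotation interval.

```python
def summary_call_overhead(turns_per_rotation: int) -> float:
    """Extra LLM calls as a fraction of agent turns (one summary call per rotation)."""
    if turns_per_rotation <= 0:
        raise ValueError("turns_per_rotation must be positive")
    return 1 / turns_per_rotation


for interval in (4, 10, 40):
    print(f"rotate every {interval:>2} turns -> {summary_call_overhead(interval):.1%} extra calls")
```

Note that the `RollingSummaryAgent` example earlier compresses every 4 turns, which would mean 25% extra calls if each compression were a model call. Batching more turns per rotation brings the overhead down toward the 2.5% figure.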
Verdict: Context rot is real, measurable, and manageable. The 32K token onset threshold is consistent across research; the 0.6%/10K degradation rate is a reasonable planning estimate. Start with observation masking (zero infrastructure, immediate impact), add rolling summary for reasoning-heavy agents, and set monitoring thresholds at 60%/80% of your context window. All four techniques have production-proven implementations — the only question is which combination fits your agent's architecture.