ARTICLES ·2026-05-19 ·BY EFFLOOW CONTENT FACTORY

Atomic Facts Fix LLM Agent Planning: ICML 2025 Paper PoC

How atomic fact accumulation + lookahead search fixes long-horizon LLM agent failures — no fine-tuning required. ICML 2025 paper PoC with measured results.

llm-agents planning in-context-learning icml-2025 paper-poc agent-memory langgraph

Your LangGraph agent is 12 steps into a multi-step task and it forgets why it started. It picks an action that made sense two steps ago but now contradicts what it just learned. It falls into the same error it made in episode 3 because it has no memory of episode 3.

This is not an LLM intelligence problem. It is a planning architecture problem, and a Cambridge research team presented a solution at ICML 2025 that requires zero fine-tuning.

The technique is called Atomic Fact Augmentation with Lookahead Search. Effloow Lab reproduced the core pattern in a minimal Python PoC and measured task completion rising from 0% (baseline) to 50% (fact-augmented) over 10 episodes on a TextFrozenLake environment, with no model weight updates and no external memory database. The PoC itself makes no API calls; in a real agent, the extra cost is the LLM calls used for lookahead and post-episode fact extraction.

Why LLM Agents Fail at Long-Horizon Planning

Before looking at the fix, it helps to be precise about the failure mode.

LLM agents using ReAct, chain-of-thought, or similar patterns execute step-by-step using a greedy local policy: at each state, pick the action that looks best right now. This works fine for short tasks. It breaks down when:

Early mistakes compound: an agent that takes a wrong turn at step 3 may not realize the error until step 15, by which point recovery is expensive or impossible.
Context drift: over many steps, the original goal gets buried under accumulated interaction history. The model starts optimizing for the most recent context, not the task objective.
No cross-episode learning: each new task run starts cold. The agent re-discovers the same dead ends it found last time.

Research on long-horizon decision making (see arXiv:2601.22311, "Why Reasoning Fails to Plan") argues this is structural — chain-of-thought is a step-wise greedy policy that selects locally plausible actions but cannot reshape early decisions based on their long-term consequences. The gap between "reasoning" and "planning" is not a model capability gap; it is an architectural one.

The Paper: arXiv:2506.09171 (ICML 2025)

Samuel Holt, Max Ruiz Luyten, Thomas Pouplin, and Mihaela van der Schaar from the University of Cambridge published "Improving LLM Agent Planning with In-Context Learning via Atomic Fact Augmentation and Lookahead Search" on arXiv in June 2025. It was accepted for an oral presentation at the First Workshop on Computer Use Agents at ICML 2025 in Vancouver.

The paper proposes LWM-Planner (Latent World Model Planner), a framework with three LLM-driven components that all share one central resource: a growing set of atomic facts extracted from past experience.

What is an Atomic Fact?

An atomic fact is a minimal, precise textual statement extracted from an agent's interaction trajectory. Examples from the paper:

"object X is in receptacle_Y"
"action Z leads_to_failure_condition"
"cell (1,3) is impassable — moving right from (1,2) ends the episode"

Each fact is small enough to fit naturally in a prompt, specific enough to change decision behavior, and general enough to apply across multiple future steps. The key constraint is "atomic": one claim per fact, no compound statements.

The Three Components

1. Action Proposal — given the current state and accumulated atomic facts, the LLM proposes a set of candidate actions. The facts condition which actions are worth considering. An action that the fact set marks as leading to failure gets lower probability.

2. Latent World Model (simulate_step) — the LLM simulates what would happen if each candidate action were taken. This is not a real environment step; it is the LLM predicting the next state. The atomic facts ground this simulation: the model knows which states are terminal, which transitions are dangerous, and what the goal configuration looks like.

3. State-Value Estimation — the LLM scores each simulated next-state, estimating how close it is to the goal. Combined with the world model output, this gives the planner enough signal to select actions with longer-horizon reasoning rather than greedy local scoring.

These three components run inside a depth-limited lookahead search (temperature=0 for deterministic planning). At each real step, the agent expands a small search tree using the world model, evaluates the leaves with the value estimator, and picks the action at the root of the best subtree.

Fact Extraction and Lifecycle

After each episode, a fourth LLM call runs extract_facts on the full trajectory. The output is a list of atomic fact candidates. These go through a predictive-consistency filter: a candidate fact is retained only if it would have correctly predicted an observed outcome in the same trajectory. Optionally, a compression step summarizes redundant facts for large fact sets.
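The predictive-consistency filter can be sketched without an LLM if facts are given a structured form. The representation below is an assumption made for illustration: `CandidateFact` and `Transition` are hypothetical types, and the rule used here (a fact must match at least one observed transition and contradict none) is slightly stricter than the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    state: str
    action: str
    outcome: str            # e.g. "ok", "hole", "goal"


@dataclass(frozen=True)
class CandidateFact:
    # Hypothetical structured form of an atomic fact:
    # "taking `action` in `state` leads to `predicted_outcome`".
    state: str
    action: str
    predicted_outcome: str


def filter_consistent(
    candidates: list[CandidateFact],
    trajectory: list[Transition],
) -> list[CandidateFact]:
    """Keep candidates that predict at least one observed transition and contradict none."""
    kept = []
    for fact in candidates:
        relevant = [t for t in trajectory
                    if t.state == fact.state and t.action == fact.action]
        if relevant and all(t.outcome == fact.predicted_outcome for t in relevant):
            kept.append(fact)
    return kept
```

A candidate about a state the trajectory never visited is dropped because there is no evidence for it; a candidate that disagrees with an observed outcome is dropped as inconsistent.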

The resulting fact store persists across episodes and is injected into all three component prompts at the start of the next run. The agent learns online, purely in-context — no gradient updates, no fine-tuning.

Effloow Lab PoC: Measured Results

Effloow Lab ran a minimal reproduction of this pattern. Full evidence in data/lab-runs/atomic-fact-lookahead-llm-agent-planning-paper-poc-2026.md.

Environment: TextFrozenLake 5×5, 5 hidden holes, goal at (4,4)
Setup: Python 3.12, stdlib only (no API calls — actions and fact extraction are deterministic heuristics simulating the LLM roles)
Episodes: 10, max 30 steps each, fixed seed per episode

Baseline agent:       Success 0/10, avg steps 3.3
Fact-augmented agent: Success 5/10, avg steps 9.2
Facts accumulated by ep 10: 6 unique facts

The per-episode breakdown shows the learning curve clearly:

Episode	Baseline	Fact Agent	Facts known
Ep 1	hole (6 steps)	hole (6 steps)	0
Ep 2	hole (5 steps)	hole (5 steps)	1
Ep 3	hole (2 steps)	hole (2 steps)	2
Ep 4	hole (2 steps)	hole (12 steps)	3
Ep 5	hole (2 steps)	goal (12 steps)	4
Ep 6	hole (2 steps)	hole (3 steps)	4
Ep 7	hole (2 steps)	goal (24 steps)	4
Ep 8	hole (5 steps)	goal (10 steps)	5
Ep 9	hole (5 steps)	goal (10 steps)	5
Ep 10	hole (2 steps)	goal (8 steps)	6

Three things stand out:

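Episodes 1–3 are identical: with at most two facts known, both agents take the same route and fall into the same hole. Episode 1, run with an empty fact store, matches the baseline exactly, so the technique does not change behavior when there is nothing to inject.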
Episode 4 divergence: the fact agent starts taking longer routes (12 vs 2 steps). It is avoiding known holes at the cost of more steps — trading step count for survival. The baseline takes 2 steps because it charges straight at a hole.
Late-episode consistency (Ep8–10): goal reached 3 times in a row as the hole map becomes nearly complete. The baseline never adapts.

The higher average step count for the fact agent (9.2 vs 3.3) is not inefficiency — it reflects safer route selection. The baseline's low step count comes entirely from hitting obstacles immediately.
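The summary figures can be recomputed from the per-episode table. The snippet below simply transcribes the table rows; `summarize` is a helper name introduced here for illustration.

```python
# Per-episode results transcribed from the table above: (outcome, steps).
baseline = [("hole", 6), ("hole", 5), ("hole", 2), ("hole", 2), ("hole", 2),
            ("hole", 2), ("hole", 2), ("hole", 5), ("hole", 5), ("hole", 2)]
fact_agent = [("hole", 6), ("hole", 5), ("hole", 2), ("hole", 12), ("goal", 12),
              ("hole", 3), ("goal", 24), ("goal", 10), ("goal", 10), ("goal", 8)]


def summarize(runs: list[tuple[str, int]]) -> tuple[int, float]:
    """Return (number of successful episodes, average steps per episode)."""
    successes = sum(1 for outcome, _ in runs if outcome == "goal")
    avg_steps = sum(steps for _, steps in runs) / len(runs)
    return successes, avg_steps


print(summarize(baseline))    # (0, 3.3)
print(summarize(fact_agent))  # (5, 9.2)
```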

PoC Limitations

This PoC simulates the LLM components with deterministic heuristics. The actual paper uses GPT-4-class models for action proposal, world model simulation, and value estimation. The lookahead in this PoC is depth=1 only; the paper uses depth-limited tree search. Results should be treated as a structural confirmation of the pattern, not a quantitative replication of the paper's numbers.

How to Apply This to Your Agent

The technique maps directly onto frameworks like LangGraph or Pydantic AI. No new infrastructure is required. Here is the minimal implementation pattern:

Step 1: Define the Fact Store

from dataclasses import dataclass
from typing import Literal

@dataclass
class AtomicFact:
    fact_type: Literal["obstacle", "path", "state", "rule"]
    description: str  # e.g. "calling tool X with empty input raises ValueError"
    source_step: int
    confidence: float = 1.0

Step 2: Inject Facts into the System Prompt

def build_system_prompt(base_prompt: str, facts: list[AtomicFact]) -> str:
    if not facts:
        return base_prompt
    
    fact_block = "\n".join(
        f"- [{f.fact_type.upper()}] {f.description}"
        for f in facts
    )
    return f"""{base_prompt}

## Known facts from prior experience
{fact_block}

Apply these facts when proposing actions and evaluating outcomes.
"""

Step 3: Extract Facts After Each Episode

import json

EXTRACT_PROMPT = """
Review this agent trajectory and extract atomic facts — minimal, precise statements 
that would help a future agent avoid failures or reach the goal faster.

Trajectory:
{trajectory_text}

Output a JSON list of {{"fact_type": ..., "description": ...}} objects.
Only include facts that are clearly supported by the trajectory.
"""

async def extract_facts_from_trajectory(
    trajectory: list[dict],
    llm_client,
) -> list[AtomicFact]:
    text = "\n".join(
        f"Step {i}: action={t['action']}, outcome={t['outcome']}"
        for i, t in enumerate(trajectory)
    )
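    # llm_client is a placeholder: any async client exposing complete() whose
    # response carries the model output in a .content string will do.
    response = await llm_client.complete(
        EXTRACT_PROMPT.format(trajectory_text=text)
    )
    raw = json.loads(response.content)  # expects a JSON list, as the prompt requests
    return [AtomicFact(**r, source_step=len(trajectory)) for r in raw]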

Step 4: Depth-1 Lookahead Filter (Optional but Effective)

Before committing to an action, simulate it with the world model and discard candidates the model predicts will fail:

async def filter_actions_with_lookahead(
    state: str,
    candidate_actions: list[str],
    facts: list[AtomicFact],
    llm_client,
) -> list[str]:
    """Remove actions predicted to fail based on current facts + world model."""
    safe = []
    for action in candidate_actions:
        prediction = await llm_client.complete(
            f"Given state: {state}\nFacts: {[f.description for f in facts]}\n"
            f"Predict outcome of action '{action}'. Reply: safe / failure / goal"
        )
        if "failure" not in prediction.content.lower():
            safe.append(action)
    return safe if safe else candidate_actions  # fallback to all actions

In LangGraph, these steps plug into the graph node that runs before action selection. In Pydantic AI, the facts go into the agent's system_prompt parameter at construction time (or dynamically via a deps object updated between runs). See our LangGraph agent tutorial and Pydantic AI graph agent guide for integration patterns.

Comparison with Related Approaches

Approach	Fine-tuning?	Cross-episode?	Structured facts?	Lookahead?
ReAct (baseline)	No	No	No	No
A-Mem (agentic memory)	No	Yes	Partial	No
MemMachine	No	Yes	Yes (ground-truth)	No
Chain of Draft	No	No	No	No
LWM-Planner (this paper)	No	Yes	Yes (atomic)	Yes

A-Mem (covered in our A-Mem guide) builds a memory graph from agent notes but doesn't combine it with lookahead search. MemMachine (MemMachine guide) focuses on preserving ground truth in long-term memory but does not model future states. Chain of Draft (Chain of Draft guide) reduces token usage in reasoning but doesn't address cross-episode learning.

LWM-Planner's distinguishing feature is the combination: facts carry forward across episodes and they directly condition a lookahead search at each decision point. The two mechanisms are synergistic — facts make the world model more accurate, and the world model validates which facts matter.

What the Paper Does Not Claim

Reading the paper carefully, a few caveats are worth noting before putting this in production:

Benchmarks are TextFrozenLake and ALFWorld, both household/navigation tasks with clear success criteria. Complex knowledge-work tasks (coding, document editing, research) were not tested.
Performance improvement grows with episode count: the technique requires multiple runs over the same or similar environment to build a useful fact store. A one-shot agent gets no benefit.
Fact quality depends on the extractor LLM: if the LLM hallucinates facts or extracts overly specific statements ("use action X in state Y at step 3"), the store degrades rather than improves.
No formal guarantee on fact set size: without active compression, long-running agents accumulate facts indefinitely, eventually overwhelming the context window.

For production use, adding a maximum fact store size (evict by confidence or recency) and periodic compression via summarization are the most important engineering additions beyond the paper's baseline.
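A size cap with eviction might look like the sketch below. It is one possible design, not something the paper specifies: `BoundedFactStore`, `max_size`, and `last_seen_episode` are hypothetical names, and the policy (evict lowest confidence first, then oldest sighting) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    description: str
    confidence: float = 1.0
    last_seen_episode: int = 0


@dataclass
class BoundedFactStore:
    max_size: int = 50
    # Keyed by description so a re-observed fact is updated, not duplicated.
    facts: dict[str, Fact] = field(default_factory=dict)

    def add(self, description: str, confidence: float, episode: int) -> None:
        existing = self.facts.get(description)
        if existing is not None:
            # Re-observed fact: keep the higher confidence and refresh recency.
            existing.confidence = max(existing.confidence, confidence)
            existing.last_seen_episode = episode
        else:
            self.facts[description] = Fact(description, confidence, episode)
        self._evict()

    def _evict(self) -> None:
        # Drop the lowest-confidence fact, breaking ties by oldest sighting.
        while len(self.facts) > self.max_size:
            victim = min(self.facts.values(),
                         key=lambda f: (f.confidence, f.last_seen_episode))
            del self.facts[victim.description]
```

Keying the store by description keeps deduplication cheap, but it only catches exact repeats; near-duplicate wordings would still need the summarization step.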

FAQ

Q: Does this work with any LLM, or only GPT-4-class models?

The paper uses GPT-4-class models but the mechanism is model-agnostic — it relies on prompting, not specific model capabilities. Smaller models (7B–13B) would need clearer extraction prompts and may produce noisier facts. Claude Sonnet 4.5 and Claude Opus 4.7 are both good candidates given their instruction-following consistency on long prompts.

Q: How many episodes does it take before facts start helping?

In the PoC, meaningful improvement appeared at episode 4–5, when 3–4 facts had accumulated. In the paper's ALFWorld results, performance improved steadily across the first ~15 episodes. The warmup period depends on environment size and how quickly the agent encounters critical states.

Q: Can this work for single-session, long-horizon tasks?

Yes, with a modification: instead of extracting facts after each episode, extract after each completed sub-task within the same session. For a coding agent, each tool call that returns an error is a fact extraction trigger. For a document editing agent, each section completion is a natural checkpoint.

Q: Does this conflict with other memory systems like A-Mem or MemMachine?

They are complementary. Atomic facts are small, precise, task-specific statements best used as short-term working knowledge within an agent run or task family. A-Mem and MemMachine are better for long-term semantic memory across many different tasks. A hybrid architecture — atomic facts for current-task planning, a semantic memory system for cross-task knowledge — is a sensible production design.

Key Takeaways

The root cause of long-horizon planning failure is greedy local policy, not model intelligence. Atomic fact augmentation addresses this structurally.
No fine-tuning required: the entire learning loop runs in-context through prompt injection. The technique works with any LLM API.
Three components, one shared resource: action proposal, world model simulation, and value estimation all improve when conditioned on the same atomic fact store.
The PoC measured 0% → 50% task success over 10 episodes on TextFrozenLake with zero weight updates.
Practical integration requires four additions to an existing agent: an AtomicFact dataclass, a prompt injection function, a post-episode extractor prompt, and optionally a depth-1 lookahead filter before action commitment.
Engineering additions for production: fact store size cap, compression via summarization, and a consistency filter to prevent hallucinated facts from degrading performance.

Bottom Line

Atomic fact augmentation is one of the most practical planning improvements available to developers building LangGraph or Pydantic AI agents today. It requires no model changes and shows measurable results within a handful of episodes. Fact injection only lengthens the prompt; the optional lookahead filter adds one LLM call per candidate action, so its latency cost scales with the number of candidates. The ICML 2025 paper (arXiv:2506.09171) is worth reading in full; the core extraction and injection loop can be added to an existing agent in under 100 lines.
