Agentic Code Reasoning: Semi-Formal Prompting Reaches 93% Patch Accuracy
Code agents navigate large repositories, collect context, and produce patches — but how do you verify a patch is correct without running the test suite every time? Meta researchers at ICML 2026 asked exactly this question and published an answer: semi-formal structured reasoning.
The paper "Agentic Code Reasoning" (arXiv:2603.01896, Ugare & Chandra, Meta) reports 93% accuracy at verifying the equivalence of real-world agent-generated patches without executing them. Effloow Lab read the paper, reconstructed the core prompting template, and explains below where the technique fits in a modern coding agent pipeline.
Why Standard Chain-of-Thought Fails on Code
Chain-of-thought (CoT) prompting asks a model to "show its work." For code reasoning tasks, this produces a stream of claims — but the model can skip edge cases, mix up variable scopes, or cite evidence that does not exist in the actual code.
The paper frames this clearly: CoT does not act as a certificate. An agent can produce a plausible-sounding reasoning chain while making an unsupported jump in the middle.
This matters most for patch equivalence. Determining whether two patches produce identical behavior requires enumerating every affected execution path. A model that summarizes "both patches handle the null case" may have skipped the case where n=0 takes a different branch.
What Semi-Formal Reasoning Does Differently
Semi-formal reasoning imposes three mandatory phases before the agent can state a conclusion:
- Premises — enumerate every observable fact about the relevant code
- Execution paths — trace every control flow path through the affected code
- Formal conclusion — state the claim, citing only premises already listed
The structure is a constraint, not a style guide. The model cannot derive a conclusion from an unstated premise. If it tries to, the structured format makes the gap visible.
This is why the authors call it a certificate: the reasoning chain either contains the evidence or it does not.
The Prompting Template
Effloow Lab reconstructed this template from the paper's methodology section:
You are a code analysis agent. Apply semi-formal reasoning.
PREMISES:
List every verifiable fact you extract from the code.
P1: [observation — e.g., "function foo() reads self.x before checking for None"]
P2: [observation — e.g., "the early-return path at line 14 does not call cleanup()"]
P3: [add as many premises as needed]
EXECUTION PATHS:
Enumerate every distinct control flow path through the code under analysis.
Path A: [condition that activates this path] → [sequence of operations] → [observable outcome]
Path B: [different condition] → [different sequence] → [outcome]
FORMAL CONCLUSION:
State your finding. Each claim must cite a premise by number (P1, P2...).
Unsupported claims (those not derivable from a listed premise) are not permitted.
---
TASK: [your analysis task — patch equivalence, fault localization, or code QA]
The key discipline: if a path is not listed under EXECUTION PATHS, it cannot appear in the FORMAL CONCLUSION. If a claim cites a premise that is not in the PREMISES list, that is a malformed response.
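The citation rule above lends itself to a mechanical check. Below is a minimal sketch in Python, assuming the model reproduces the three section headers exactly as in the template and cites premises as P1, P2, and so on. The function names `split_sections` and `find_unlisted_citations` are illustrative and do not come from the paper.

```python
import re

SECTION_HEADERS = ("PREMISES:", "EXECUTION PATHS:", "FORMAL CONCLUSION:")


def split_sections(response: str) -> dict[str, str]:
    """Split a model response into the three template sections.

    Assumes each header sits on its own line, as in the template.
    """
    sections: dict[str, list[str]] = {}
    current = None
    for line in response.splitlines():
        stripped = line.strip()
        if stripped in SECTION_HEADERS:
            current = stripped
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body) for name, body in sections.items()}


def find_unlisted_citations(response: str) -> set[str]:
    """Return premise IDs cited in the conclusion but never declared."""
    sections = split_sections(response)
    missing = [h for h in SECTION_HEADERS if h not in sections]
    if missing:
        raise ValueError(f"malformed response, missing sections: {missing}")
    # A premise counts as declared when a line starts with "P<n>:".
    declared = set(re.findall(r"^\s*(P\d+):", sections["PREMISES:"], re.MULTILINE))
    cited = set(re.findall(r"\bP\d+\b", sections["FORMAL CONCLUSION:"]))
    return cited - declared
```

An empty result means every cited premise was declared. It does not mean the premises are true or the paths complete; the check only catches the structural violation.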
Benchmark Results
The paper evaluates on three tasks.
| Task | Dataset | Standard Reasoning | Semi-Formal | Improvement |
|---|---|---|---|---|
| Patch equivalence | Curated examples | 78% | 88% | +10 pp |
| Patch equivalence | Real-world agent patches | — | 93% | — |
| Code QA | RubberDuckBench | — | 87% | — |
| Fault localization (Top-5) | Defects4J | — | — | +5 pp |
The most significant result is the 93% on real-world agent-generated patches. This is the condition that actually matters for production coding agents: verifying patches that your own agent produced, not curated academic examples.
Why 93% Enables Execution-Free RL
Training code agents with reinforcement learning currently requires a ground-truth signal: does the patch actually fix the bug? The standard signal is running the test suite — expensive, slow, and sometimes flaky.
If a verifier can determine patch equivalence at 93% accuracy without execution, that verifier can serve as a reward signal for RL training. The agent proposes a patch; the verifier scores it; the RL loop updates the agent's policy. No execution environment needed per step.
The paper frames this as the primary application: semi-formal reasoning approaches the reliability threshold needed for execution-free RL reward signals in coding agent training.
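To get a feel for how noisy such a reward would be, here is a back-of-the-envelope sketch. It rests on two assumptions that are ours, not the paper's: the verifier is equally accurate on correct and incorrect patches, and `base_rate` (the share of agent patches that are actually correct) is an illustrative parameter.

```python
def reward_precision(accuracy: float, base_rate: float) -> float:
    """P(patch is truly correct | verifier says correct), via Bayes' rule.

    Assumes symmetric accuracy: the verifier is right with probability
    `accuracy` on both correct and incorrect patches. This is an
    illustrative assumption, not a figure from the paper.
    """
    if not (0.0 < accuracy <= 1.0 and 0.0 < base_rate <= 1.0):
        raise ValueError("accuracy and base_rate must be in (0, 1]")
    true_pos = accuracy * base_rate
    false_pos = (1.0 - accuracy) * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)
```

Under these assumptions, `reward_precision(0.93, 0.5)` is 0.93, while `reward_precision(0.93, 0.2)` drops to roughly 0.77. The fewer correct patches the agent produces, the larger the share of positive rewards that are false positives.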
Three Concrete Applications
1. Automated Patch Review
Before merging an agent-generated patch, run the semi-formal verifier against the original code and the patched version. The structured output makes the review auditable — a human can check the premises list rather than re-reading all the affected code.
patch_review_prompt = """
PREMISES: [enumerate facts about original_code and patched_code]
EXECUTION PATHS: [trace paths through both versions]
FORMAL CONCLUSION: State whether both versions are equivalent or identify the divergence.
Cite only premises already listed.
Original code:
{original_code}
Patched code:
{patched_code}
"""
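To show how the template might be filled in, here is a minimal sketch. It repeats the template so the snippet runs on its own. `build_patch_review_prompt` is a hypothetical helper name, and the step of sending the prompt to a model is omitted because the paper does not prescribe a client API.

```python
PATCH_REVIEW_PROMPT = """\
PREMISES: [enumerate facts about original_code and patched_code]
EXECUTION PATHS: [trace paths through both versions]
FORMAL CONCLUSION: State whether both versions are equivalent or identify the divergence.
Cite only premises already listed.
Original code:
{original_code}
Patched code:
{patched_code}
"""


def build_patch_review_prompt(original_code: str, patched_code: str) -> str:
    """Fill the template with the two code versions.

    str.format parses only the template itself, so braces inside the
    inserted code (dict literals, f-strings) are left untouched.
    """
    return PATCH_REVIEW_PROMPT.format(
        original_code=original_code, patched_code=patched_code
    )


# Toy example: the two snippets are made up for illustration.
prompt = build_patch_review_prompt(
    original_code="def size(xs):\n    return len(xs)",
    patched_code="def size(xs):\n    return len(xs) if xs else 0",
)
print(prompt)
```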
2. Fault Localization
For bug reports, feed the semi-formal template the failing function and ask the agent to identify which execution path leads to the observed failure. The structured premises list forces the agent to catalog all relevant variables and branches before guessing the fault location.
The paper reports +5pp on Defects4J Top-5 accuracy compared to standard chain-of-thought.
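For readers unfamiliar with the metric: Top-5 accuracy counts a bug as localized when a true faulty location appears among the first five ranked candidates. A minimal sketch follows. The data layout (one ranked list of location names per bug, one set of true faulty locations per bug) is an assumption for illustration, not the paper's evaluation harness.

```python
def top_k_accuracy(
    ranked: list[list[str]], faulty: list[set[str]], k: int = 5
) -> float:
    """Fraction of bugs whose top-k candidates include a true faulty location."""
    if len(ranked) != len(faulty):
        raise ValueError("ranked and faulty must have one entry per bug")
    if not ranked:
        return 0.0
    hits = sum(
        1
        for candidates, truth in zip(ranked, faulty)
        if truth.intersection(candidates[:k])
    )
    return hits / len(ranked)
```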
3. Code Question Answering (Without Running Code)
When a developer asks "what happens when this function receives a null input?", the semi-formal template produces a traceable answer. The EXECUTION PATHS section forces coverage of the null case explicitly. The paper reports 87% accuracy on RubberDuckBench.
Limitations of This Approach
The technique is not free.
- Token cost — enforcing all three phases produces longer reasoning chains than CoT. Expect 1.5×–2× the tokens per query.
- Model capability floor — the paper uses GPT-4 class models. Smaller models that cannot reliably follow multi-phase structured formats will not see the same gains.
- Path enumeration completeness — the agent still decides which paths to list. For code with many branches, it may miss obscure paths. The structure prevents unsupported jumps but does not guarantee completeness.
- RubberDuckBench dataset — not publicly released, so the code QA number cannot be independently verified.
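The path-completeness limitation can be partly mitigated with a cheap sanity check. The sketch below is our own heuristic, not something the paper proposes, and it only handles Python source: it counts decision points with the standard `ast` module and flags a response that lists fewer paths than decision points plus one.

```python
import ast

# Node types treated as decision points. Boolean operators and
# comprehensions are ignored, so this undercounts on purpose.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def min_expected_paths(source: str) -> int:
    """Rough floor on the paths worth listing: decision points + 1.

    A cyclomatic-complexity-style count; treat it as a heuristic only.
    """
    tree = ast.parse(source)
    decisions = sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))
    return decisions + 1


def paths_look_incomplete(source: str, listed_paths: int) -> bool:
    """Flag a response that lists fewer paths than the heuristic expects."""
    return listed_paths < min_expected_paths(source)
```

Passing this check does not prove the enumeration is complete. It only catches responses that are obviously too short for the amount of branching in the code.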
How to Adopt This in Your Agent
If you are building a code agent today, adding semi-formal reasoning is mostly a prompt change, plus a lightweight check on the output:
- Add the three-phase template to your system prompt for analysis tasks
- Add a validation step that checks: does the conclusion cite only listed premises?
- If the validation fails, retry with a note: "Your conclusion cited [X], which was not in your PREMISES list"
- Log the premises and paths alongside the output for auditability
You do not need a new model or a new framework. The technique works with any model that can follow structured formatting.
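The validate-and-retry steps above can be sketched as a small loop. Both hooks are hypothetical and supplied by the caller: `call_model` is whatever function sends a prompt to your model, and `find_unlisted` is any checker that returns the premise IDs a response cites without declaring.

```python
from typing import Callable


def analyze_with_retry(
    call_model: Callable[[str], str],
    find_unlisted: Callable[[str], set[str]],
    prompt: str,
    max_attempts: int = 3,
) -> tuple[str, list[str]]:
    """Query the model, re-prompting when the conclusion cites unlisted premises.

    Returns the accepted response plus a log of every attempt, so the
    premises and paths can be stored alongside the output.
    """
    log: list[str] = []
    current_prompt = prompt
    for _ in range(max_attempts):
        response = call_model(current_prompt)
        log.append(response)
        unlisted = find_unlisted(response)
        if not unlisted:
            return response, log
        note = (
            f"Your conclusion cited {', '.join(sorted(unlisted))}, "
            "which was not in your PREMISES list. Revise your answer."
        )
        # Retry with the original prompt plus the correction note.
        current_prompt = f"{prompt}\n\n{note}"
    raise RuntimeError(f"no valid response after {max_attempts} attempts")
```

Capping the attempts keeps token cost bounded. What to do when the cap is hit (reject the patch, fall back to running tests) is a pipeline decision this sketch leaves open.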
FAQ
What model was used in the paper?
The paper uses GPT-4 class models. The authors are from Meta, but the evaluation uses publicly available frontier models, not Meta's internal models.
Is the code from the paper available?
The paper does not link a public GitHub repository. The prompting templates are described in the methodology section.
How does this compare to using an execution sandbox?
An execution sandbox is more reliable for equivalence checking, but it requires setting up a working build environment for each codebase. Semi-formal reasoning requires only the source code and a capable LLM. The two approaches are complementary: use the verifier to filter low-confidence patches before running the expensive sandbox.
Does this work for compiled languages (Java, C++)?
Yes for Java: the Defects4J benchmark used for the fault localization results consists of Java projects. The technique has no language-specific component, so it should apply to any language whose source code the agent can read.
Can I use this with Claude or Gemini?
Yes. The technique is model-agnostic. Any frontier model that can reliably follow a structured three-part prompt should see improvements over unstructured CoT.