EFFLOOW LAB LAB-RUN ·1778803200

LangGraph 1.2 Fault Tolerance Sandbox PoC

Evidence notes document the bounded local or source-based checks behind an Effloow article. They are not product endorsements, legal advice, or benchmark claims.

Goal

Verify the current LangGraph fault-tolerance APIs that matter for production agent workflows:

  • langgraph==1.2.0 installability.
  • TimeoutPolicy and NodeTimeoutError import path.
  • Per-node timeout recovery through error_handler=.
  • Documented limitation that sync nodes cannot safely use timeout=.

No production credentials, model API keys, database credentials, or external LLM calls were used.

Commands Run

rm -rf /tmp/effloow-langgraph-timeout-poc
mkdir -p /tmp/effloow-langgraph-timeout-poc
python3 -m venv /tmp/effloow-langgraph-timeout-poc/.venv
/tmp/effloow-langgraph-timeout-poc/.venv/bin/python -m pip install --upgrade pip
/tmp/effloow-langgraph-timeout-poc/.venv/bin/python -m pip install 'langgraph==1.2.0'

Relevant install output:

Successfully installed ... langchain-core-1.4.0 langgraph-1.2.0 langgraph-checkpoint-4.1.0 langgraph-prebuilt-1.1.0 langgraph-sdk-0.3.14 ...

Version/import check:

/tmp/effloow-langgraph-timeout-poc/.venv/bin/python - <<'PY'
import sys, importlib.metadata
print('python', sys.version.split()[0])
print('langgraph', importlib.metadata.version('langgraph'))
print('langgraph-checkpoint', importlib.metadata.version('langgraph-checkpoint'))
print('langchain-core', importlib.metadata.version('langchain-core'))
from langgraph.types import RetryPolicy, TimeoutPolicy, Command
from langgraph.errors import NodeTimeoutError, NodeError, GraphDrained
from langgraph.graph import StateGraph, START, END
print('imports ok', RetryPolicy.__name__, TimeoutPolicy.__name__, Command.__name__, NodeTimeoutError.__name__, NodeError.__name__, GraphDrained.__name__)
PY

Output:

python 3.12.8
langgraph 1.2.0
langgraph-checkpoint 4.1.0
langchain-core 1.4.0
imports ok RetryPolicy TimeoutPolicy Command NodeTimeoutError NodeError GraphDrained
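
The version check above assumes every distribution is present. A slightly more defensive variant, sketched here with only the standard library, reports a missing package instead of raising. The helper name installed_versions is hypothetical and is not part of LangGraph.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment; record that instead of raising.
            versions[name] = None
    return versions


report = installed_versions(["langgraph", "langgraph-checkpoint", "langchain-core"])
for name, version in report.items():
    print(name, version or "NOT INSTALLED")
```

Run inside the PoC virtual environment, this should print the same three versions as the check above; in any other environment it lists whatever is missing.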

Timeout Recovery PoC

The graph below defines an async node that sleeps for 0.2 s, four times longer than its 0.05 s run_timeout. The node has a retry policy capped at one attempt (max_attempts=1) and an error handler. The handler receives a NodeError that wraps the original exception, writes a recovery status, and routes to finalize.

import asyncio
import time
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.types import RetryPolicy, TimeoutPolicy, Command
from langgraph.errors import NodeError, NodeTimeoutError

class State(TypedDict, total=False):
    attempts: int
    status: str
    error: str
    elapsed: float

async def slow_vendor_api(state: State) -> State:
    # Sleep 0.2 s, well past the 0.05 s run_timeout, to simulate a stalled vendor call.
    await asyncio.sleep(0.2)
    return {"status": "vendor_finished"}

async def finalize(state: State) -> State:
    return state

def timeout_handler(state: State, error: NodeError) -> Command:
    # error.error holds the original exception (here, a NodeTimeoutError).
    return Command(
        update={
            "status": "recovered_by_error_handler",
            "error": type(error.error).__name__,
            "attempts": state.get("attempts", 0) + 1,
        },
        goto="finalize",
    )

builder = StateGraph(State)
builder.add_node(
    "slow_vendor_api",
    slow_vendor_api,
    timeout=TimeoutPolicy(run_timeout=0.05),
    retry_policy=RetryPolicy(max_attempts=1, retry_on=NodeTimeoutError),
    error_handler=timeout_handler,
)
builder.add_node("finalize", finalize)
builder.add_edge(START, "slow_vendor_api")
builder.add_edge("finalize", END)
graph = builder.compile()

async def main():
    started = time.perf_counter()
    result = await graph.ainvoke({"attempts": 0})
    result["elapsed"] = round(time.perf_counter() - started, 3)
    print(result)

asyncio.run(main())

Output:

{'attempts': 1, 'status': 'recovered_by_error_handler', 'error': 'NodeTimeoutError', 'elapsed': 0.055}
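
To make the control flow concrete without LangGraph, the same timeout-then-recover pattern can be approximated with asyncio.wait_for. This is a standard-library analogy only, not LangGraph's implementation: run_node_with_timeout and recover are hypothetical names, and the state dict merely mirrors the PoC's fields.

```python
import asyncio


async def slow_vendor_api(state):
    # Simulates a stalled external call, as in the PoC above.
    await asyncio.sleep(0.2)
    return {**state, "status": "vendor_finished"}


def recover(state, error):
    # Hypothetical handler: record the failure type and count the attempt.
    return {
        **state,
        "status": "recovered_by_error_handler",
        "error": type(error).__name__,
        "attempts": state.get("attempts", 0) + 1,
    }


async def run_node_with_timeout(node, state, run_timeout, handler):
    try:
        # wait_for cancels the awaited coroutine once run_timeout elapses.
        return await asyncio.wait_for(node(state), timeout=run_timeout)
    except asyncio.TimeoutError as exc:
        return handler(state, exc)


result = asyncio.run(
    run_node_with_timeout(slow_vendor_api, {"attempts": 0}, 0.05, recover)
)
print(result["status"], result["attempts"])
```

Because the node is a coroutine, cancellation takes effect at the pending await, which is why the caller regains control shortly after the timeout rather than after the full 0.2 s sleep.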

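Sync Node Limitation Check

A second graph attaches the same timeout= argument to a sync node to confirm the documented async-only limitation.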

from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.types import TimeoutPolicy

class State(TypedDict, total=False):
    status: str

def sync_node(state: State) -> State:
    return {"status": "sync"}

builder = StateGraph(State)
builder.add_node("sync_node", sync_node, timeout=TimeoutPolicy(run_timeout=0.05))
builder.add_edge(START, "sync_node")
builder.add_edge("sync_node", END)
# Raised the ValueError shown below: timeout= is rejected for sync nodes.
builder.compile()

Output:

ValueError: Node timeouts are only supported for async nodes because sync Python execution cannot be safely cancelled in-process. Node 'sync_node' is sync.
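
The reasoning in that error message can be reproduced with the standard library alone. In the sketch below (no LangGraph involved; blocking_work and the events list are hypothetical), the caller stops waiting after the timeout, but the worker thread running the sync function carries on to completion because nothing can safely interrupt it from outside.

```python
import asyncio
import time

events = []


def blocking_work():
    # Sync code: once started in a thread, it cannot be cancelled from outside.
    time.sleep(0.2)
    events.append("blocking_work finished")


async def demo():
    try:
        await asyncio.wait_for(asyncio.to_thread(blocking_work), timeout=0.05)
    except asyncio.TimeoutError:
        events.append("caller timed out")
    # Give the abandoned thread time to finish its work anyway.
    await asyncio.sleep(0.3)
    return list(events)


outcome = asyncio.run(demo())
print(outcome)
```

The timeout only abandons the sync work; it does not stop it. Any side effects the function performs after the deadline still happen, which is presumably the hazard the compile-time check is guarding against.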

What Worked

  • langgraph==1.2.0 installed successfully in a clean temporary virtual environment.
  • TimeoutPolicy, RetryPolicy, Command, NodeError, NodeTimeoutError, and GraphDrained were available from the documented import paths.
  • A timed-out async node produced NodeTimeoutError and reached the configured recovery handler.
  • error_handler= returned a Command that updated state and routed the graph to a final node.
  • A sync node with timeout= failed at compile time with a clear ValueError, matching the documentation's async-only limitation.

What Failed or Was Not Tested

  • No LLM provider call was made. The slow node used asyncio.sleep() to simulate an external API stall.
  • Graceful shutdown was import-checked but not fully exercised with an OS-level SIGTERM process supervisor.
  • DeltaChannel was researched from official LangChain release/blog material but not benchmarked locally.
  • Type-safe streaming v3 was researched from the GitHub release notes but not reproduced in the sandbox.
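
One way the untested SIGTERM path might be exercised is with a small parent/child harness like the one below. It is a POSIX-only, standard-library sketch: the child is a placeholder loop rather than a LangGraph run, and CHILD_SOURCE, run_sigterm_check, and the printed markers are all hypothetical. A real check would replace the child's loop with the graph invocation under test.

```python
import signal
import subprocess
import sys

# Placeholder child: idles in a loop until SIGTERM asks it to stop.
CHILD_SOURCE = """
import signal, sys, time
stop = False
def on_term(signum, frame):
    global stop
    stop = True
signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while not stop:
    time.sleep(0.01)
print("shutdown_clean", flush=True)
"""


def run_sigterm_check(timeout=5.0):
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdout=subprocess.PIPE,
        text=True,
    )
    # Wait until the child has installed its handler before signalling.
    assert child.stdout.readline().strip() == "ready"
    child.send_signal(signal.SIGTERM)
    remaining, _ = child.communicate(timeout=timeout)
    return child.returncode, remaining.strip()


print(run_sigterm_check())
```

The "ready" handshake matters: sending SIGTERM before the handler is registered would kill the child with the default action and make the check look like a failed shutdown.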

Sources Checked

  • GitHub releases: https://github.com/langchain-ai/langgraph/releases
  • PyPI package page: https://pypi.org/project/langgraph/
  • Fault tolerance docs: https://docs.langchain.com/oss/python/langgraph/fault-tolerance
  • Durable execution docs: https://docs.langchain.com/oss/python/langgraph/durable-execution
  • Delta Channels blog: https://www.langchain.com/blog/delta-channels-evolving-agent-runtime
  • LangChain/LangGraph 1.0 milestone blog: https://www.langchain.com/blog/langchain-langgraph-1dot0

This note supports the public article and records what was actually checked.
