ARTICLES ·2026-05-30 ·BY EFFLOOW CONTENT FACTORY

LlamaIndex + Google Agents API: Document Agents with ParseBench

A tool-scout guide to ParseBench, the antigravity-demo CLI, and Google Managed Agents sandboxed document parsing with LlamaIndex and LiteParse.
llamaindex google-agents-api parsebench document-agents ai-development tool-scout

Document parsing has always been the unglamorous bottleneck in RAG pipelines. Even a beautifully architected retrieval system falls short of its ceiling if the text it retrieves is garbled, has lost its table structure, or is missing chart data entirely. In April and May 2026, LlamaIndex published three related releases that change how teams should think about this problem: ParseBench (a rigorous benchmark for evaluating parsing quality), the antigravity-demo CLI template (connecting LlamaParse and LiteParse to Google's Managed Agents sandboxed environment), and LiteParse v2.0 (a Rust rewrite with no LLM dependency). This guide covers all three.

Effloow Lab note: This is a source-based tool-scout article. Effloow Lab inspected primary sources — the ParseBench arXiv paper (2604.08538), GitHub repos, LlamaIndex newsletters, and Google Cloud documentation — but did not run antigravity-demo or sandboxed-lit locally. Code examples are taken directly from official documentation and READMEs.

Why Document Parsing Is the Real Bottleneck for Agents

A language model can only reason about what reaches its context window. Most production documents (insurance filings, financial reports, corporate disclosures) carry information in tables, charts, multi-column layouts, and meaning-bearing formatting such as strikethrough amounts or superscript footnotes. When a parser drops that structure, the agent reasons from corrupted inputs.

Until recently, teams had no principled way to measure this. Benchmarks like DocVQA and DeepDive measure OCR character accuracy or question-answering over documents, but neither directly tells you whether a parsed table is structurally intact or whether a chart's data points survive the transformation. LlamaIndex built ParseBench to address that gap.

The Google Managed Agents API, released in preview in 2026, addresses a separate problem worth understanding: how to run an LLM agent inside an ephemeral Linux VM, where it can execute code and manage files without requiring a vector database. The antigravity-demo template connects these two ideas: a sandboxed agent with real document understanding capabilities.

ParseBench: What It Measures and Why the Methodology Matters

ParseBench (arXiv 2604.08538) was published April 9, 2026 by Boyang Zhang and colleagues at LlamaIndex. The dataset covers 2,078 human-verified pages from 1,211 documents, with 169,011 rule-based test cases. License is Apache 2.0. The documents skew toward real-world enterprise content: insurance regulatory filings make up roughly 54.5% of the dataset, public financial reports around 31%, with government documents and corporate reports filling the remainder.

The benchmark design has one significant advantage over prior art: scoring is entirely deterministic and rule-based. There is no LLM-as-a-judge step and no text similarity metric like BLEU or ROUGE. Ground truth was generated via a two-pass process: frontier vision-language models generated initial labels, then human reviewers verified and corrected them. The 169,011 test rules evaluate parsing outputs without introducing another model's biases into the assessment.

ParseBench covers five dimensions:

| Dimension | Pages | Core Metric |
| --- | --- | --- |
| Tables | 503 | TableRecordMatch: treats tables as bags of records keyed by column headers; handles merged cells and hierarchical headers; column and row order do not affect the score |
| Charts | 568 | ChartDataPointMatch: 10 spot-checked data points per chart; exact match for labeled values; 1% tolerance for axis-read values |
| Content Faithfulness | 506 | Rule-based detection of omissions, hallucinations, and reading-order violations at word, sentence, and digit granularity |
| Semantic Formatting | 476 | Preservation of meaning-bearing formatting: strikethrough amounts, superscript/subscript notation, bold emphasis, and title hierarchy |
| Visual Grounding | 500 | Element Pass Rate: joint evaluation of bounding box localization, element classification (Text/Table/Picture/Header/Footer), and content attribution |

The Table metric is worth a closer look. Rather than requiring character-for-character match, TableRecordMatch (GTRM) evaluates each row as a set of values keyed by column headers. This makes it insensitive to cosmetic differences in presentation while still catching dropped cells, misread numbers, and broken hierarchy.
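To make the bag-of-records idea concrete, here is a toy sketch in Python. It is not the paper's implementation: the function names are hypothetical, and it assumes all-or-nothing matching per record, whereas the real metric also handles merged cells and hierarchical headers and may award partial credit.

```python
from collections import Counter


def to_records(header, rows):
    """Turn a table into a multiset of records keyed by column header."""
    records = Counter()
    for row in rows:
        # A frozenset of (header, value) pairs ignores column order.
        records[frozenset(zip(header, row))] += 1
    return records


def record_match(expected, predicted):
    """Fraction of expected records that also appear in the predicted table."""
    exp = to_records(*expected)
    pred = to_records(*predicted)
    if not exp:
        return 1.0
    matched = sum((exp & pred).values())  # multiset intersection
    return matched / sum(exp.values())


truth = (["Year", "Revenue"], [["2024", "10.5"], ["2025", "12.1"]])
# Same data with columns and rows reordered: still a full match.
reordered = (["Revenue", "Year"], [["12.1", "2025"], ["10.5", "2024"]])
# One misread number: only one of the two records survives.
misread = (["Year", "Revenue"], [["2024", "10.5"], ["2025", "21.1"]])

print(record_match(truth, reordered))  # 1.0
print(record_match(truth, misread))    # 0.5
```

The reordered table scores the same as the original, while a single transposed digit costs a full record, which is the behavior the paragraph above describes.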

The Chart metric is the benchmark's most discriminating dimension. Only 4 of the 14 methods evaluated in the paper score above 50% on charts. This is the clearest signal in the paper: chart data extraction is largely unsolved for the methods tested.
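The two matching rules in ChartDataPointMatch (exact match for labeled values, 1% tolerance for axis-read values) can be sketched as follows. This is an illustration only: the function names are hypothetical, and measuring the tolerance relative to the expected value is an assumption, since this article does not state how the paper defines it.

```python
def point_matches(expected, predicted, labeled):
    """Check one spot-checked data point against the parser's output."""
    if labeled:
        # Values printed on the chart must be reproduced exactly.
        return predicted == expected
    # Assumption: axis-read values may deviate by 1% of the expected value.
    return abs(predicted - expected) <= 0.01 * abs(expected)


def chart_score(points):
    """Fraction of (expected, predicted, labeled) points that pass."""
    if not points:
        return 0.0
    passed = sum(point_matches(e, p, lab) for e, p, lab in points)
    return passed / len(points)


points = [
    (42.0, 42.0, True),     # labeled value, exact: pass
    (42.0, 42.1, True),     # labeled value, off by 0.1: fail
    (200.0, 201.5, False),  # axis-read, within 1%: pass
    (200.0, 203.0, False),  # axis-read, outside 1%: fail
]
print(chart_score(points))  # 0.5
```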

The Visual Grounding dimension reveals a different pattern. Several general-purpose vision-language models score below 8% on this dimension (GPT-5 Mini at 6.2%, Anthropic Haiku 4.5 at 6.7%), while LlamaParse Agentic leads at 80.62%. The gap is so large it suggests the dimension is measuring something most general-purpose VLMs are not optimized for: returning element bounding boxes and semantic classifications alongside extracted content.
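The Content Faithfulness dimension listed in the table above checks for omissions and hallucinations at word, sentence, and digit granularity. A rough word-level version of that idea can be sketched with multiset differences. The function name is hypothetical, reading-order violations are not covered, and the benchmark's actual rules are more detailed than this.

```python
from collections import Counter


def word_diff(source_text, parsed_text):
    """Return (omitted, hallucinated) word counts between source and parse."""
    src = Counter(source_text.lower().split())
    out = Counter(parsed_text.lower().split())
    omitted = src - out        # words in the source but missing from the parse
    hallucinated = out - src   # words in the parse that the source never had
    return omitted, hallucinated


source = "Net income rose 12 percent in 2025"
parsed = "Net income rose 21 percent in 2025"

omitted, hallucinated = word_diff(source, parsed)
print(dict(omitted))       # {'12': 1}
print(dict(hallucinated))  # {'21': 1}
```

A transposed digit shows up as both an omission and a hallucination, which is why digit-level checks matter for financial documents.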

Benchmark Results: The Numbers Teams Should Know

The paper evaluated 14 methods. Here are the key results for systems that agents are likely to use in production in 2026:

| Method | Overall | Tables | Charts | Content | Formatting | Grounding |
| --- | --- | --- | --- | --- | --- | --- |
| LlamaParse Agentic (~1.2¢/page) | 84.88% | 90.74% | 78.11% | 89.68% | 85.24% | 80.62% |
| LlamaParse Cost Effective (<0.4¢/page) | 71.89% | 73.16% | 66.66% | 88.02% | 73.04% | 58.56% |
| Gemini 3 Flash | 71.0% | 89.9% | 64.8% | 86.2% | 58.4% | 56.0% |
| Reducto | 67.8% | 70.3% | 57.0% | 86.4% | 56.8% | 68.7% |
| Azure Document Intelligence | 59.6% | 86.0% | 1.6% | 84.9% | 51.9% | 73.8% |
| AWS Textract | 47.9% | 84.6% | 6.0% | 74.8% | 3.7% | 70.4% |
| GPT-5 Mini | 46.8% | 69.8% | 30.1% | 82.3% | 45.8% | 6.2% |
| Anthropic Haiku 4.5 | 45.2% | 77.2% | 13.8% | 78.7% | 49.4% | 6.7% |

Two results deserve attention beyond the overall score. Azure Document Intelligence scores 86.0% on tables but 1.6% on charts, an 84-point gap between its best and worst dimensions. AWS Textract shows a similar pattern: strong on tables (84.6%) but only 3.7% on semantic formatting. These are not bad systems; they are optimized for OCR and tabular extraction and have not addressed the chart and formatting dimensions. Teams using them for chart-heavy documents are working with incomplete data.
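The best-to-worst gaps quoted above can be reproduced from the per-dimension scores in the results table. The snippet below is plain arithmetic over those published numbers; the `spread` helper is a hypothetical name used for illustration.

```python
SCORES = {
    "Azure Document Intelligence": {
        "tables": 86.0, "charts": 1.6, "content": 84.9,
        "formatting": 51.9, "grounding": 73.8,
    },
    "AWS Textract": {
        "tables": 84.6, "charts": 6.0, "content": 74.8,
        "formatting": 3.7, "grounding": 70.4,
    },
}


def spread(dims):
    """Return (best dimension, worst dimension, gap in percentage points)."""
    best = max(dims, key=dims.get)
    worst = min(dims, key=dims.get)
    return best, worst, round(dims[best] - dims[worst], 1)


for method, dims in SCORES.items():
    print(method, spread(dims))
# Azure Document Intelligence ('tables', 'charts', 84.4)
# AWS Textract ('tables', 'formatting', 80.9)
```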

Gemini 3 Flash handles tables (89.9%) and basic content (86.2%) well, but its 58.4% on semantic formatting and 56.0% on visual grounding are constraints worth knowing before deploying it as a standalone parser for structured financial documents.

The community leaderboard at parsebench.ai has expanded since the paper was published. As of the scout date, infly/Infinity-Parser2-Pro leads the community submission list at 74.3%. The leaderboard has 23+ entries; evaluations from the community and the paper-reported scores are not directly comparable since the paper evaluated a fixed 14-method set.

The antigravity-demo: LlamaIndex + Google Managed Agents

The antigravity-demo repository (github.com/run-llama/antigravity-demo) is a CLI template that connects LlamaParse and LiteParse to Google's Managed Agents API. It was announced in the LlamaIndex Newsletter dated May 26, 2026. The CLI installs as llamagrav and exposes four commands: three that run in sequence (git-wiz, setup, run) and a reset command for starting over:

# Install the llamagrav CLI
uv tool install .

# Step 1: Create a local git repo and GitHub remote (stores documents)
llamagrav git-wiz

# Step 2a: Provision the Antigravity environment with LlamaParse support
llamagrav setup

# Step 2b: Or provision in LiteParse-only mode (no API key needed)
llamagrav setup --no-send-api-key

# Step 3: Run a query against the documents
llamagrav run --prompt "Summarize the key risks in these financial filings."

# Reset to start fresh
llamagrav reset

Requirements:

  • Python 3.13+
  • uv package manager
  • gh CLI (authenticated)
  • GOOGLE_API_KEY or GEMINI_API_KEY — required
  • LLAMA_CLOUD_API_KEY or LLAMA_PARSE_API_KEY — optional (LiteParse mode works without it)
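Before running setup, it can help to verify the requirements above in one pass. The following is a standalone sketch, not part of llamagrav: the function names are hypothetical, and it only checks that gh is on PATH, not that it is authenticated.

```python
import os
import shutil
import sys


def missing_prerequisites(env=None, which=shutil.which, version=None):
    """List unmet requirements; an empty list means ready to run setup."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    problems = []
    if version < (3, 13):
        problems.append("Python 3.13+ required")
    for tool in ("uv", "gh"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    if not (env.get("GOOGLE_API_KEY") or env.get("GEMINI_API_KEY")):
        problems.append("set GOOGLE_API_KEY or GEMINI_API_KEY")
    return problems


def parser_mode(env=None):
    """The LlamaParse key is optional; without it only LiteParse is usable."""
    env = os.environ if env is None else env
    has_key = env.get("LLAMA_CLOUD_API_KEY") or env.get("LLAMA_PARSE_API_KEY")
    return "llamaparse" if has_key else "liteparse-only"


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing:", problem)
    print("parser mode:", parser_mode())
```

If the reported mode is liteparse-only, the matching provisioning command from the listing above is llamagrav setup --no-send-api-key.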

State is tracked in a .config.json file, which stores GitHub repository URLs and environment IDs between runs. This means once provisioned, a team can run repeated queries against the same document set without reprovisioning.

The --no-send-api-key flag is worth noting specifically. It lets you evaluate the sandbox agent integration without a LlamaParse subscription, using LiteParse (the open-source Rust parser) instead. For teams evaluating whether the Google Managed Agents integration pattern is worth adopting before committing to cloud parsing costs, this is the right starting point.

Inside the Google Managed Agents Sandbox Environment

The sandboxed environment that antigravity-demo provisions is Google's Managed Agents API (referred to internally as "Antigravity"). Based on the official Google documentation, each environment is an ephemeral Linux VM with the following characteristics:

  • Fresh filesystem on each invocation — no leftover state between runs
  • Pre-installed: standard Linux utilities (curl, git, jq, wget), Python 3.11 with numpy, pandas, requests, and BeautifulSoup4; Node.js 20
  • Built-in agent tools: bash (executes shell commands) and file_system (read, write, delete, list)
  • Network isolated by default; additional domains must be explicitly allowlisted
  • Additional packages installable via pip or npm when network is enabled
  • 7-day TTL per environment
  • Context auto-compacts at approximately 135k tokens

The architecture is different from standard LlamaIndex RAG in an important way. Standard RAG embeds documents, builds a vector index, and retrieves by similarity. The Managed Agents integration runs the agent inside the VM itself: it can execute code, read files, call bash commands, and use LlamaParse or LiteParse to parse documents as part of the reasoning loop. No separate vector database is required. The parsed output, including the table structure, chart data, bounding boxes, and semantic formatting that ParseBench scores, feeds directly into the agent's reasoning.

This is distinct from another LlamaIndex release from the same period: sandboxed-lit (github.com/run-llama/sandboxed-lit), a Rust CLI that runs an LLM agent inside a microsandbox VM (2 CPUs, 1 GB RAM). That project uses OpenAI GPT models and is not connected to the Google Managed Agents API. The two projects are part of the same product wave — sandboxed document agents — but use different infrastructure and different LLM providers.

LiteParse v2.0 and LlamaParse: The Parsing Layer

The antigravity-demo template exposes two parsing options, and the distinction matters for cost and capability.

LiteParse v2.0 was announced May 27, 2026, as a complete Rust rewrite. It requires no LLM and no cloud API key. It processes a 457-page, 100 MB document in 0.777 seconds on reported benchmarks — roughly 3x faster than v1 for large documents, and 5–100x faster for small documents. It is open source (github.com/run-llama/liteparse) and available as a native binary (cargo install liteparse), Python package (pip install liteparse), Node package (npm i @llamaindex/liteparse), and WebAssembly module for edge runtimes (npm i @llamaindex/liteparse-wasm). For documents where raw text and basic structure extraction are sufficient, LiteParse v2.0 is the zero-cost option.

LlamaParse is the cloud service. The Cost Effective tier runs below 0.4 cents per page and scored 71.89% on ParseBench overall. The Agentic tier runs around 1.2 cents per page and scored 84.88% overall — the current top score on the benchmark's 14-method evaluation. For documents with complex charts, structured tables, or visual elements where agent-quality parsing matters, the Agentic tier's 80.62% on visual grounding and 78.11% on charts represent a meaningful improvement over what any general-purpose VLM currently achieves.
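The per-page prices translate into batch costs with simple arithmetic. The sketch below uses the approximate figures quoted above (about 1.2 cents per page for Agentic, under 0.4 cents for Cost Effective); the 10,000-page workload is a made-up example, and actual billing may differ.

```python
def parsing_cost_usd(pages, cents_per_page):
    """Estimated parsing cost in US dollars for a batch of pages."""
    return round(pages * cents_per_page / 100, 2)


pages = 10_000  # hypothetical workload size

agentic = parsing_cost_usd(pages, 1.2)             # roughly $120
cost_effective_cap = parsing_cost_usd(pages, 0.4)  # upper bound, roughly $40

print(f"Agentic: about ${agentic:.2f}")
print(f"Cost Effective: under ${cost_effective_cap:.2f}")
print(f"Extra spend for Agentic: at least ${agentic - cost_effective_cap:.2f}")
```

On the paper's numbers, that extra spend buys about 13 points of overall ParseBench score (84.88% versus 71.89%), with the largest per-dimension difference on visual grounding (80.62% versus 58.56%).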

The LlamaParse + Gemini integration is documented separately via a Google Developers Blog post on building a financial assistant. That integration uses parse_mode="parse_page_with_agent" with model="gemini-3.1-pro" and achieves roughly 13–15% improvement over raw LLM document processing according to the Google Developers Blog. This is a distinct workflow from the antigravity-demo CLI — it is a standalone Python integration, not the sandboxed agent template.

Common Mistakes to Avoid

Treating overall ParseBench scores as a single ranking. The benchmark's key finding is that no method is consistently strong across all five dimensions. Azure Document Intelligence at 59.6% overall is genuinely better than GPT-5 Mini at 46.8% for tables (86.0% vs 69.8%), but worse for charts (1.6% vs 30.1%). Match the parser to your document type, not to a headline score.
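One way to act on this is to weight the per-dimension scores by what your documents actually contain, instead of reading the overall column. The sketch below uses table and chart scores from the results table; the workload weights are hypothetical, and the weighted sum is an illustrative heuristic, not something the paper proposes.

```python
SCORES = {
    "Azure Document Intelligence": {"tables": 86.0, "charts": 1.6},
    "AWS Textract": {"tables": 84.6, "charts": 6.0},
    "GPT-5 Mini": {"tables": 69.8, "charts": 30.1},
    "Anthropic Haiku 4.5": {"tables": 77.2, "charts": 13.8},
}


def rank(weights):
    """Order methods by a weighted sum of their per-dimension scores."""
    def weighted(name):
        return sum(w * SCORES[name][dim] for dim, w in weights.items())
    return sorted(SCORES, key=weighted, reverse=True)


table_only = {"tables": 1.0}                  # hypothetical workload
chart_heavy = {"tables": 0.3, "charts": 0.7}  # hypothetical workload

print(rank(table_only)[0])   # Azure Document Intelligence
print(rank(chart_heavy)[0])  # GPT-5 Mini
```

The same four systems reorder completely once charts dominate the workload, which is the point of matching the parser to the document type.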

Treating the paper leaderboard and the community leaderboard as the same list. The paper evaluated 14 methods under controlled conditions. The parsebench.ai community leaderboard reflects separate community submissions and includes models not in the paper. Infinity-Parser2-Pro's 74.3% community score and LlamaParse Agentic's 84.88% paper score were measured differently. They are not directly comparable.

Conflating sandboxed-lit and antigravity-demo. Both are LlamaIndex CLI tools for sandboxed document agents, but they are separate projects with different infrastructure. sandboxed-lit is a Rust CLI that uses OpenAI GPT models inside a microsandbox VM. antigravity-demo uses Google Managed Agents (Antigravity) and Gemini API keys. Each tool expects credentials for its own provider, so supplying the other provider's keys will fail at authentication.

Skipping the --no-send-api-key evaluation path. The LiteParse-only mode in antigravity-demo exists for exactly this scenario: validating the Google Managed Agents integration and the agent behavior before deciding whether LlamaParse cloud parsing is worth adding. Use it first.

Assuming Google I/O 2026 announced the LlamaIndex + Antigravity integration. The antigravity-demo announcement came from LlamaIndex (newsletter dated May 26, 2026). ParseBench was published April 9, 2026, more than a month before Google I/O. The products were developed independently and happen to align with the Google I/O timing; there is no verified primary Google source confirming LlamaIndex by name in official I/O announcements.

FAQ

Q: Is the Google Managed Agents API (Antigravity) publicly available?

The API is in preview as of the scout date. Access is available through Google AI Studio and the Gemini API. The antigravity-demo template is publicly available on GitHub and uses the antigravity-preview-05-2026 agent ID. Preview status means the API surface and access requirements may change. Check the current Google documentation at docs.cloud.google.com/gemini-enterprise-agent-platform before building production workflows on top of it.

Q: Can I run ParseBench evaluation without a LlamaParse API key?

Yes. ParseBench includes 90+ pre-configured evaluation pipelines, and many of them cover open-weight models and freely available services. The dataset downloads automatically from HuggingFace. To run a full evaluation against a pipeline that does require LlamaParse, you need a LLAMA_CLOUD_API_KEY. For open-weight model pipelines, no API key is needed. The quickstart is:

git clone https://github.com/run-llama/ParseBench
cd ParseBench
uv sync --extra runners
uv run parse-bench run <pipeline_name> --test   # small test run
uv run parse-bench run <pipeline_name>           # full evaluation
uv run parse-bench serve <pipeline_name>         # browser-based report

Q: How does the Google Managed Agents sandbox compare to E2B or other sandboxes?

Google Managed Agents provides ephemeral Linux VMs with a 7-day TTL, pre-installed Python 3.11 and Node.js 20, bash and file_system as built-in agent tools, and network isolation by default. E2B uses Firecracker microVMs with approximately 150ms cold starts and a 24-hour session maximum. The Google Managed Agents design is optimized for agent workflows tied to the Gemini model family; E2B is LLM-agnostic and purpose-built for executing untrusted code. If your agent architecture is already on Google Cloud and uses Gemini models, the Managed Agents environment fits natively. If you need LLM-agnostic code execution with hardware-level isolation for untrusted user input, E2B is the better match.

Q: What document types does ParseBench cover, and how does that affect generalizability?

ParseBench's 1,211 documents skew heavily toward insurance regulatory filings (~54.5%) and public financial reports (~31%), with government documents and corporate reports in the remainder. These are dense, structured documents with tables, charts, multi-column layouts, and legal formatting. ParseBench scores may not transfer directly to use cases like scientific papers, slide decks, or handwritten forms — though the visual grounding and semantic formatting dimensions are general-purpose enough to remain informative. If your target domain differs significantly from insurance and finance, treat ParseBench as directionally informative rather than precise for your workload.

Key Takeaways

The three LlamaIndex releases from April–May 2026 address a single underlying problem from different angles: agents need document understanding that preserves structure, not just text extraction.

ParseBench fills a genuine gap in how the industry measures parsing quality. The deterministic, rule-based evaluation across five dimensions — tables, charts, content faithfulness, semantic formatting, and visual grounding — reveals failure modes that headline OCR accuracy metrics hide. The main finding holds up across the 14-method evaluation: chart extraction and visual grounding are where most methods still fail, and semantic formatting is nearly universally neglected. Teams building agents that work with structured financial or regulatory documents have, for the first time, a benchmark aligned with what agents actually need.

The antigravity-demo template is early-stage tooling (the Google Managed Agents API remains in preview), but the pattern it demonstrates is worth tracking. Running an LLM agent inside an ephemeral Linux VM with document parsing as a native capability — no vector database, fresh state per run, citation-grade source attribution — is a different architecture from standard RAG. Whether it becomes the dominant model depends partly on how broadly the Managed Agents API becomes accessible and partly on whether the quality improvement justifies the per-page cost.

LiteParse v2.0 is the immediately deployable piece. A Rust parser with zero LLM dependency, 5–100x speedup for small documents, and WebAssembly support for edge runtimes is useful now regardless of whether the rest of the stack is Google, Anthropic, or OpenAI. Use LiteParse where raw extraction is sufficient; use LlamaParse Agentic where chart structure, visual grounding, or semantic formatting are load-bearing for agent quality.

Scout Verdict

ParseBench is a genuine contribution to document AI evaluation — deterministic, multi-dimensional, and built around what agents actually need rather than raw OCR accuracy. The antigravity-demo template is worth watching but depends on the Google Managed Agents API reaching general availability. LiteParse v2.0 is usable today. If your agents work with tables, charts, or structured financial documents, the ParseBench leaderboard at parsebench.ai is now the right first stop before selecting a parsing layer.
