Gemini 3.1 Pro Developer Guide: Benchmarks, API, and Pricing
Google released Gemini 3.1 Pro in preview on February 19, 2026, and it immediately reshaped the frontier model conversation. With a 77.1% score on ARC-AGI-2, it surpassed Claude Opus 4.6 (68.8%) on the benchmark most associated with general reasoning ability. It did this while pricing input tokens at $2.00 per million -- cheaper than both Claude Opus 4.6 and GPT-5.4.
This guide covers what Gemini 3.1 Pro offers developers, how it performs against the competition, what it costs, and how to integrate it into your applications using the Google GenAI SDK.
For broader context on AI framework choices in 2026, see our AI agent frameworks comparison. If you are evaluating terminal-based coding agents that use these models, our terminal AI coding agents guide covers the full landscape.
Why This Matters
The AI model market in early 2026 has three serious contenders: Anthropic's Claude family, OpenAI's GPT series, and Google's Gemini line. Each release shifts the balance of power.
Gemini 3.1 Pro matters for three reasons:
- ARC-AGI-2 leadership. Scoring 77.1% on ARC-AGI-2 makes it the best-performing model on a benchmark designed to test genuine reasoning, not pattern matching. This is not a narrow win -- the gap over Claude Opus 4.6 (68.8%) is over 8 percentage points.
- Aggressive pricing. At $2.00 per million input tokens and $12.00 per million output tokens (under 200K context), Gemini 3.1 Pro undercuts GPT-5.4 ($2.50/$15.00) and is dramatically cheaper than Claude Opus 4.6. For teams spending five or six figures monthly on API calls, this difference compounds fast.
- 1M token context with multimodal input. The context window accepts text, audio, images, video, PDFs, and entire code repositories -- all within a single 1-million-token window. This is not incremental. It means you can feed an entire codebase, its documentation, and a video walkthrough into a single prompt.
Benchmark Comparison: Gemini 3.1 Pro vs Claude Opus 4.6 vs GPT-5.4
Numbers tell the real story. The following table compares Gemini 3.1 Pro against its two closest competitors across the benchmarks that matter most for developer workloads.
Frontier Benchmark Scores
| Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|---|
| ARC-AGI-2 | 77.1% | 68.8% | -- |
| GPQA Diamond | 94.3% | 87.0% | 92.4% |
| SWE-Bench Verified | 78.8% | 80.8% | 80.0% |
What the Numbers Tell Us
Where Gemini 3.1 Pro leads: ARC-AGI-2 and GPQA Diamond are the headliners. The 77.1% ARC-AGI-2 score represents the best general reasoning result from any commercial model. GPQA Diamond at 94.3% leads both Claude Opus 4.6 (87.0%) and GPT-5.4 (92.4%), establishing Gemini 3.1 Pro as the strongest model on graduate-level science questions.
Where Gemini 3.1 Pro trails: SWE-Bench Verified at 78.8% falls behind Claude Opus 4.6 (80.8%) and Claude Sonnet 4.6 (79.6%). For pure software engineering tasks -- fixing real-world GitHub issues -- the Claude models still hold a measurable edge. GPT-5.4 at 80.0% also outperforms Gemini on this benchmark.
The practical takeaway: If your primary use case is reasoning, research, or scientific analysis, Gemini 3.1 Pro is the current leader. If your workload is primarily code generation and automated software engineering, Claude Opus 4.6 retains a slight advantage.
Pricing Comparison: What Gemini 3.1 Pro Actually Costs
Cost per token is only part of the equation. Gemini 3.1 Pro uses tiered pricing based on context window usage.
Gemini 3.1 Pro Pricing Tiers
| Context Usage | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Under 200K tokens | $2.00 | $12.00 |
| Over 200K tokens | $4.00 | $18.00 |
The price doubles for input and increases 50% for output when you exceed the 200K-token context threshold. This is important to factor in if you plan to use the full 1M context window regularly.
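To make the tier break concrete, here is a minimal cost estimator based on the rates in the table above. It assumes the tier is selected by prompt (input) size; exact billing mechanics such as cached-token discounts are outside the scope of this sketch.

```python
def gemini_31_pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD from the tiered rates quoted above.

    Assumption: the pricing tier is chosen by input size. Requests at or
    under 200K input tokens bill at $2/$12 per 1M tokens; larger requests
    bill at $4/$18.
    """
    if input_tokens <= 200_000:
        input_rate, output_rate = 2.00, 12.00
    else:
        input_rate, output_rate = 4.00, 18.00
    return (input_tokens / 1_000_000) * input_rate + \
           (output_tokens / 1_000_000) * output_rate

# A 150K-token prompt with a 4K-token response stays in the cheap tier:
print(f"${gemini_31_pro_cost(150_000, 4_000):.3f}")  # → $0.348
# The same response behind a 300K-token prompt pays the higher rates:
print(f"${gemini_31_pro_cost(300_000, 4_000):.3f}")  # → $1.272
```

Note how the 300K-token request costs nearly four times as much for only twice the input: the doubled input rate and the 50% higher output rate compound.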
Cross-Model Pricing Comparison
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| Gemini 3.1 Pro (under 200K) | $2.00 | $12.00 | 1M tokens |
| Gemini 3.1 Pro (over 200K) | $4.00 | $18.00 | 1M tokens |
| GPT-5.4 (OpenAI) | $2.50 | $15.00 | 256K tokens |
| Claude Opus 4.6 (Anthropic) | $5.00 | $25.00 | 200K tokens |
| Claude Sonnet 4.6 (Anthropic) | $3.00 | $15.00 | 200K tokens |
Under 200K context, Gemini 3.1 Pro is the cheapest frontier model by a meaningful margin. Input costs are 20% less than GPT-5.4 and 60% less than Claude Opus 4.6. Output costs are 20% less than GPT-5.4 and 52% less than Claude Opus 4.6.
Over 200K context, the picture shifts. At $4.00 input, Gemini 3.1 Pro becomes more expensive than GPT-5.4 ($2.50) on input tokens. However, no other model in this comparison offers a 1M token context window, so there is no direct alternative if you need that capacity.
Getting Started: API Setup with the Google GenAI SDK
Google provides a unified Python SDK (google-genai) for accessing Gemini models. Here is how to get from zero to working API calls.
Step 1: Installation and Authentication
```bash
pip install google-genai
```
Get your API key from Google AI Studio. Then set it as an environment variable:
```bash
export GEMINI_API_KEY="your-api-key-here"
```
Step 2: Basic Text Generation
```python
from google import genai

# With GEMINI_API_KEY set in the environment, the client picks the key
# up automatically -- no need to hardcode it in source.
client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.1-pro",
    contents="Explain the transformer attention mechanism in three sentences.",
)
print(response.text)
```
Step 3: Using the Thinking Level Parameter
Gemini 3.1 Pro introduces the thinking_level parameter, which controls how much internal reasoning the model performs before generating a response. This is similar to extended thinking in other models, but with explicit developer control over the tradeoff between speed and reasoning depth.
```python
from google import genai
from google.genai.types import GenerateContentConfig

client = genai.Client(api_key="your-api-key-here")

# LOW: fastest response, minimal internal reasoning
response_fast = client.models.generate_content(
    model="gemini-3.1-pro",
    contents="What is 2 + 2?",
    config=GenerateContentConfig(
        thinking_level="LOW",
    ),
)

# HIGH: deepest reasoning, slower but most accurate
response_deep = client.models.generate_content(
    model="gemini-3.1-pro",
    contents="Prove that the square root of 2 is irrational.",
    config=GenerateContentConfig(
        thinking_level="HIGH",
    ),
)
print(response_deep.text)
```
The three levels -- LOW, MEDIUM, and HIGH -- let you optimize per-request. Use LOW for simple lookups and formatting tasks where speed matters. Use MEDIUM for standard generation. Use HIGH for complex reasoning, math proofs, and multi-step analysis where accuracy justifies the latency cost.
Step 4: Multimodal Input (Image + Text)
Gemini 3.1 Pro natively handles multimodal prompts. You can pass images, audio, video, and PDFs alongside text.
```python
from google import genai

client = genai.Client(api_key="your-api-key-here")

# Upload an image via the Files API. The google-genai SDK takes the
# local path through the `file` argument.
image_file = client.files.upload(file="architecture-diagram.png")

response = client.models.generate_content(
    model="gemini-3.1-pro",
    contents=[
        "Analyze this system architecture diagram. Identify potential single points of failure and suggest improvements.",
        image_file,
    ],
)
print(response.text)
```
Step 5: Processing an Entire Code Repository
The 1M token context window makes it practical to feed entire codebases for analysis:
````python
from pathlib import Path

from google import genai

client = genai.Client(api_key="your-api-key-here")

# Collect all Python files from a project
code_parts = []
for py_file in sorted(Path("./my-project/src").rglob("*.py")):
    content = py_file.read_text()
    code_parts.append(f"### {py_file}\n```python\n{content}\n```\n")

full_context = "\n".join(code_parts)

response = client.models.generate_content(
    model="gemini-3.1-pro",
    contents=[
        f"Here is the complete source code for a Python project:\n\n{full_context}\n\n"
        "Review this codebase for security vulnerabilities, focusing on "
        "input validation, authentication, and data handling. "
        "Prioritize findings by severity.",
    ],
)
print(response.text)
````
This approach works well for projects up to roughly 500K-700K tokens of source code, leaving room for the prompt and response within the 1M context window.
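Before sending a large repository, it helps to sanity-check that it fits the budget. A common rough heuristic is about four characters per token for English text and source code; the sketch below uses that ratio. It is an approximation, not a tokenizer -- for exact counts, the SDK's token-counting endpoint would be authoritative.

```python
def rough_token_estimate(text: str) -> int:
    # ~4 characters per token is a common rough heuristic for English
    # text and source code; real tokenizers vary by content.
    return len(text) // 4

def fits_context(text: str, budget: int = 1_000_000,
                 reserve: int = 100_000) -> bool:
    """Check the payload leaves `reserve` tokens for instructions
    and the model's response within the context `budget`."""
    return rough_token_estimate(text) + reserve <= budget

payload = "x = 1\n" * 50_000          # ~300K characters of toy source
print(rough_token_estimate(payload))  # → 75000
print(fits_context(payload))          # → True
```

Running a check like this before each large request also tells you whether you are about to cross the 200K pricing threshold discussed earlier.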
Key Features for Developers
Thinking Level Parameter
The thinking_level parameter (LOW, MEDIUM, HIGH) is the most developer-relevant feature in Gemini 3.1 Pro. Unlike models that always use maximum reasoning depth (and charge accordingly), Gemini 3.1 Pro lets you choose the right level per request.
In practice, this means you can build applications that route different types of queries to different thinking levels. A customer support bot might use LOW for FAQ responses and HIGH for complex troubleshooting, optimizing both latency and cost without switching models.
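That routing pattern can be sketched as a plain lookup that picks the thinking level before the API call is made. The category names here are hypothetical, and classifying incoming queries into categories is left out of the sketch.

```python
# Hypothetical mapping from query category to thinking level.
THINKING_LEVEL_BY_CATEGORY = {
    "faq": "LOW",               # canned answers: speed matters most
    "formatting": "LOW",
    "standard": "MEDIUM",
    "troubleshooting": "HIGH",  # multi-step diagnosis: accuracy matters
}

def pick_thinking_level(category: str) -> str:
    # Default to MEDIUM for anything unclassified.
    return THINKING_LEVEL_BY_CATEGORY.get(category, "MEDIUM")

print(pick_thinking_level("faq"))              # → LOW
print(pick_thinking_level("troubleshooting"))  # → HIGH
```

The returned value would then be passed as `thinking_level` in `GenerateContentConfig`, as shown in Step 3 above.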
Agentic Improvements
Gemini 3.1 Pro includes targeted improvements for agentic workflows -- multi-step tasks where the model plans, uses tools, and iterates. While specific agentic benchmark numbers beyond ARC-AGI-2 are still being published, the model's architecture was optimized for:
- Multi-step tool calling with state management
- Long-horizon planning across dozens of intermediate steps
- Recovery from errors mid-execution without restarting the entire chain
If you are building agents with frameworks like LangGraph or CrewAI, these improvements translate directly into more reliable multi-step execution.
1M Token Context Window
The 1M token context window is not new to the Gemini line, but Gemini 3.1 Pro maintains it while improving quality on long-context tasks. For reference, 1M tokens is approximately:
- 750,000 words of text
- An entire medium-sized codebase (thousands of files)
- Multiple hours of audio transcripts
- Hundreds of pages of PDF documents combined with code and images
The multimodal nature means you can mix these modalities in a single request. Feed a PDF specification, a video demo, and the existing codebase, then ask the model to identify gaps between the spec and the implementation.
Common Mistakes to Avoid
1. Ignoring the 200K Context Pricing Threshold
The most common cost surprise with Gemini 3.1 Pro is the pricing jump at 200K tokens. Developers who routinely send large context payloads can see their effective cost double without realizing it. Monitor your average context size. If you are consistently above 200K tokens, factor in the $4.00/$18.00 tier when budgeting.
2. Using HIGH Thinking Level for Everything
Setting thinking_level="HIGH" on every request wastes latency and money. The model spends more time reasoning internally, which increases response time and token consumption. Profile your use cases and match the thinking level to the task complexity.
3. Treating Benchmark Scores as Absolutes
Gemini 3.1 Pro leads on ARC-AGI-2 and GPQA Diamond but trails on SWE-Bench Verified. Benchmark rankings do not translate directly into "better for my use case." Run your own evaluations on tasks that represent your actual workload. A model that scores 2 points lower on a general benchmark might perform better on your specific domain.
4. Neglecting Streaming for Long Responses
With HIGH thinking level and complex prompts, Gemini 3.1 Pro responses can take significant time to generate. Always implement streaming for user-facing applications:
```python
stream = client.models.generate_content_stream(
    model="gemini-3.1-pro",
    contents="Write a comprehensive technical design document for a distributed task queue.",
)
for chunk in stream:
    print(chunk.text, end="", flush=True)
```
5. Not Using the Files API for Large Inputs
For multimodal inputs (images, video, PDFs), upload files via the Files API rather than embedding base64-encoded data directly in the prompt. Direct embedding bloats request size and increases latency.
Frequently Asked Questions
How does Gemini 3.1 Pro compare to Gemini 3 Pro?
Gemini 3.1 Pro is the direct successor to Gemini 3 Pro, released as a preview on February 19, 2026. The primary improvements are higher benchmark scores (particularly the 77.1% ARC-AGI-2 result), the new thinking_level parameter for controlling reasoning depth, enhanced agentic capabilities, and improved multimodal processing. Pricing remained unchanged from Gemini 3 Pro at the same tiers.
Is Gemini 3.1 Pro better than Claude Opus 4.6 for coding?
Not across the board. Claude Opus 4.6 scores 80.8% on SWE-Bench Verified versus Gemini 3.1 Pro's 78.8%. Claude Sonnet 4.6 also outperforms at 79.6%. For automated code generation and real-world software engineering tasks, the Claude models maintain a measurable lead. However, Gemini 3.1 Pro leads on general reasoning (ARC-AGI-2) and scientific reasoning (GPQA Diamond), so the better choice depends on your specific workload balance between coding and reasoning tasks.
What is the thinking_level parameter and should I use it?
The thinking_level parameter controls how much internal reasoning Gemini 3.1 Pro performs before generating a response. It accepts three values: LOW, MEDIUM, and HIGH. LOW is fastest and cheapest, suitable for simple tasks. HIGH provides the deepest reasoning at the cost of higher latency and token usage. You should use it by matching the level to the task complexity -- there is no reason to pay for HIGH reasoning on tasks that LOW handles well.
Can I use Gemini 3.1 Pro with existing OpenAI SDK code?
Google provides the google-genai SDK, which has a different API surface than the OpenAI SDK. However, you can access Gemini models through platforms like OpenRouter or Vertex AI that provide OpenAI-compatible API endpoints, allowing you to reuse existing code with minimal changes. The native Google SDK is recommended for access to Gemini-specific features like thinking_level.
How much does Gemini 3.1 Pro cost compared to GPT-5.4?
Under 200K context, Gemini 3.1 Pro costs $2.00 input / $12.00 output per million tokens. GPT-5.4 costs $2.50 input / $15.00 output per million tokens. That makes Gemini 3.1 Pro 20% cheaper on both input and output. Above 200K context, Gemini 3.1 Pro's pricing jumps to $4.00/$18.00, which makes it more expensive than GPT-5.4 on input tokens but still cheaper on output.
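As a worked example using only the per-token rates quoted in this article, and assuming a hypothetical workload of 50M input and 10M output tokens per month with every request under the 200K threshold:

```python
MONTHLY_INPUT_M, MONTHLY_OUTPUT_M = 50, 10  # millions of tokens (hypothetical)

RATES = {  # (input, output) USD per 1M tokens, base tier, from this article
    "Gemini 3.1 Pro": (2.00, 12.00),
    "GPT-5.4": (2.50, 15.00),
    "Claude Opus 4.6": (5.00, 25.00),
}

for model, (in_rate, out_rate) in RATES.items():
    cost = MONTHLY_INPUT_M * in_rate + MONTHLY_OUTPUT_M * out_rate
    print(f"{model}: ${cost:.2f}/month")
# → Gemini 3.1 Pro: $220.00, GPT-5.4: $275.00, Claude Opus 4.6: $500.00
```

At this volume the monthly gap between Gemini 3.1 Pro and Claude Opus 4.6 is $280 -- more than Gemini's entire bill.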
Key Takeaways
- Gemini 3.1 Pro leads on general and scientific reasoning. The 77.1% ARC-AGI-2 score surpasses Claude Opus 4.6 (68.8%) by over 8 points. GPQA Diamond at 94.3% is the highest among frontier models.
- Coding benchmarks favor Claude -- for now. SWE-Bench Verified shows Claude Opus 4.6 at 80.8% and Claude Sonnet 4.6 at 79.6%, both ahead of Gemini 3.1 Pro's 78.8%. If automated software engineering is your primary use case, Claude still has the edge.
- Pricing is the most competitive among frontier models. At $2.00/$12.00 per million tokens (under 200K context), Gemini 3.1 Pro undercuts GPT-5.4 by 20% and Claude Opus 4.6 by 52-60%. Watch the 200K context threshold, where prices jump to $4.00/$18.00.
- The thinking_level parameter is a genuine differentiator. Explicit control over reasoning depth (LOW/MEDIUM/HIGH) lets developers optimize the cost-latency-accuracy tradeoff per request rather than per model.
- 1M token multimodal context is production-ready. Text, audio, images, video, PDFs, and code repositories all fit in a single context window. This enables workflows that were previously impossible without complex chunking and retrieval architectures.
- The model is in preview. Gemini 3.1 Pro was released on February 19, 2026 as a preview, so some capabilities may change before general availability. Build with the API, but plan for potential adjustments.
- Model selection should be workload-specific. No single model dominates every benchmark. Use Gemini 3.1 Pro for reasoning-heavy and multimodal workloads. Use Claude for code-heavy workloads. Use GPT-5.4 as a balanced middle option. Evaluate on your own data before committing.
Conclusion
Gemini 3.1 Pro represents Google's strongest entry into the frontier model competition. The ARC-AGI-2 and GPQA Diamond scores are not incremental improvements -- they represent clear leadership on the benchmarks most associated with genuine reasoning ability. The pricing, undercutting GPT-5.4 while offering a 4x larger context window, forces a conversation that benefits every developer.
The practical decision comes down to workload mix. Reasoning and multimodal workloads favor Gemini 3.1 Pro. Automated software engineering favors Claude. The thinking_level parameter adds a dimension of per-request control that other models do not yet offer. If you have not benchmarked Gemini 3.1 Pro against your specific workload, now is the time.