
Gemini 3.1 Pro: What It Is, What Changed, and What It Means for You
Google doesn't usually move this fast. Gemini 3 Pro launched in November 2025, and by February 2026, a meaningfully better version was already shipping. But Google is not the only one moving at a lightning pace.
Anthropic recently released its Opus 4.6 and Sonnet 4.6 models back to back, raising the bar for benchmark performance across several areas.
Gemini 3.1 Pro is Google’s response to recent advances from Anthropic and OpenAI.
This article is for developers, product teams, and technical professionals who want a clear picture of what Gemini 3.1 Pro actually is. We offer a grounded look at what changed, what it's good for, and where it still falls short.
What Is Gemini 3.1 Pro?
Gemini 3.1 Pro is the latest model in Google DeepMind's Pro lineup, released in preview on February 19, 2026. It sits above the standard Gemini models and below the invite-only Deep Think tier.
It covers the broad middle of Google's AI stack where most real-world workloads live.
The naming convention is worth noting. This is the first time Google has used a ".1" increment rather than a ".5," which signals that the capability jump was meaningful but not a full generational leap. Think of it as a focused upgrade rather than a rebuild.
What Actually Changed from Gemini 3 Pro
The gap between Gemini 3 Pro and 3.1 Pro is not cosmetic. The most cited number is the ARC-AGI-2 score:
- 31.1% for Gemini 3 Pro
- 77.1% for Gemini 3.1 Pro
That benchmark tests a model's ability to solve logic patterns it has never seen before, which is a meaningful proxy for genuine reasoning rather than memorization.
Beyond the headline number, there are three specific changes that affect how you'd actually use this model:
- Three-tier Thinking levels: You can now set reasoning depth to Low, Medium, or High per request. High mode behaves like a scaled-down version of Deep Think, applying significantly more computation to hard problems.
- Stronger Agentic Performance: The score on MCP Atlas, which tests multi-step tool use, improved by 15 percentage points over Gemini 3 Pro.
- Dedicated Endpoint for Custom Tools: If your workflow combines bash execution with your own defined tools, gemini-3.1-pro-preview-customtools prioritizes those tools more reliably than the standard endpoint.
The three-tier thinking system is the change with the widest practical impact. It means you can run lightweight queries cheaply and reserve High-mode compute for the requests that actually need it, all within a single model integration.
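To make the tiers concrete, here is a minimal sketch using the google-genai Python SDK. The thinking_level field is what the Gemini 3 API exposes; whether the 3.1 preview also accepts a "medium" value is an assumption inferred from the three tiers described above, not confirmed documentation.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Reserve High-mode compute for the requests that actually need it;
# routine queries can run at "low" for a fraction of the cost.
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Solve this scheduling constraint problem step by step.",
    config=types.GenerateContentConfig(
        # "low" | "medium" | "high" -- "medium" is an assumption here,
        # inferred from the three-tier system described above.
        thinking_config=types.ThinkingConfig(thinking_level="high"),
    ),
)
print(response.text)
```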
The Benchmark Picture
Putting raw numbers in front of people without context is not particularly useful. So here is what the key scores actually mean in practice.
- On reasoning: A score of 77.1% on ARC-AGI-2 puts 3.1 Pro well ahead of GPT-5.2 (52.9%) and Claude Opus 4.6 (68.8%) on this specific test.
- On science: GPQA Diamond consists of expert-authored questions across physics, chemistry, and biology. The 94.3% score represents the top position across all publicly evaluated models, which matters for research-adjacent and technical writing workflows.
- On coding: LiveCodeBench Pro uses live contest problems that don't exist in training data. An Elo of 2887, compared to Gemini 3 Pro's 2439, is a large jump. SWE-Bench Verified (real-world software engineering tasks) came in at 80.6%.
Where it doesn't lead is worth noting too:
- On MMMU Pro (advanced multimodal reasoning), Gemini 3 Pro (81.0%) slightly outperforms 3.1 Pro (80.5%).
- Claude Sonnet 4.6 holds the top spot on GDPval-AA Elo, an expert-task evaluation benchmark, with 1633. Gemini 3.1 Pro scored 1317, higher than Gemini 3 Pro but below both Claude's and OpenAI's entries.
- GPT-5.3-Codex (xhigh) leads on Terminal-Bench 2.0 and SWE-Bench Pro, which are more specialized coding evaluations.
Overall, 3.1 Pro holds the #1 position on 12 of 16 tracked benchmarks. That is a strong record.
Who This Model Is Actually Built For
Benchmark scores give a good idea of a model's capabilities, but they don't translate equally well into every role and use case. When picking a model, the deciding factor shouldn't be raw scores but whether it fits your actual needs.
Developers
The clearest fit is for developers building agentic pipelines. The jump in multi-step tool use reliability, the dedicated customtools endpoint, and the adjustable reasoning depth make it a practical choice for workflows that require sustained decision-making across many steps rather than single-shot responses.
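As a rough illustration of what that looks like in practice, here is a sketch using the google-genai Python SDK. The get_ticket tool and its schema are hypothetical, invented for this example; the customtools endpoint is simply passed as the model ID.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# A hypothetical custom tool the model can request by name.
get_ticket = types.FunctionDeclaration(
    name="get_ticket",
    description="Fetch a support ticket by its ID.",
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={"ticket_id": types.Schema(type=types.Type.STRING)},
        required=["ticket_id"],
    ),
)

response = client.models.generate_content(
    model="gemini-3.1-pro-preview-customtools",
    contents="Summarize ticket TK-1042 and flag anything urgent.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(function_declarations=[get_ticket])],
    ),
)

# The model returns a structured function call for your code to execute.
for call in response.function_calls or []:
    print(call.name, call.args)
```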
Engineering Teams
It is also a strong option for engineering teams with serious coding workloads. The LiveCodeBench Pro score reflects performance on problems with no training data overlap, which means the model generalizes rather than recites.
For code review, refactoring, or architectural reasoning across a full codebase, the 1M token context window removes the need to split work into chunks.
Researchers
Research and knowledge-intensive workflows are a third strong fit. The GPQA Diamond score reflects the model's ability to hold domain knowledge and apply it to problems that specialists themselves find difficult.
Low-Budget Teams
Finally, cost-sensitive teams have a real argument here. At $2 per million input tokens, this is significantly cheaper than Claude Opus 4.6 ($15 per million) while delivering comparable or better results across most benchmarks.
How to Get Started
Getting access is straightforward; the right path depends on how you want to work with it.
Google AI Studio: This is the fastest starting point. It requires no API key setup, it's free to use within usage limits, and it lets you test the model directly in a browser interface before writing a single line of code.
Gemini API: This is the path for production use. The two model IDs to know are:
- gemini-3.1-pro-preview for general use
- gemini-3.1-pro-preview-customtools for agentic workflows with custom tool definitions
One cost consideration worth acting on early: context caching. If your application repeatedly references the same long context, whether a knowledge base, a system prompt, or a large document, caching it can reduce costs by up to 75%.
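As a sketch of the flow with the google-genai Python SDK: create the cache once, then reference it on every request. The file name and TTL below are placeholders.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Cache the long, frequently reused context once.
knowledge_base = open("knowledge_base.txt").read()  # placeholder file
cache = client.caches.create(
    model="gemini-3.1-pro-preview",
    config=types.CreateCachedContentConfig(
        system_instruction="Answer strictly from the provided knowledge base.",
        contents=[knowledge_base],
        ttl="3600s",  # keep the cache alive for an hour
    ),
)

# Subsequent requests pay full input price only for the new tokens.
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="What does the knowledge base say about refunds?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```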
How It Compares to the Competition
Comparisons between frontier models are genuinely useful only when they're specific. Here is what the data actually shows.
Against Claude Opus 4.6
Gemini 3.1 Pro leads on:
- ARC-AGI-2
- GPQA Diamond
- MCP Atlas
- BrowseComp
- LiveCodeBench Pro
Opus 4.6 leads on GDPval-AA Elo and some specialized coding evaluations. If human-preference scoring on expert tasks is your primary signal, Opus 4.6 still has an edge. If price-adjusted benchmark performance is what you're optimizing for, Gemini 3.1 Pro is the stronger pick.
Against GPT-5.2
Gemini 3.1 Pro leads on reasoning and agentic benchmarks. GPT-5.3-Codex, a specialized variant, leads on terminal-based and advanced software engineering benchmarks. For general-purpose work, the gap favors Gemini. For deep coding infrastructure work, GPT-5.3-Codex may be worth testing.
The practical answer is that no single model wins across all use cases. What Gemini 3.1 Pro offers is the strongest aggregate benchmark performance at its price point currently available.
Limitations Worth Knowing Before You Commit
The knowledge cutoff is January 2025. For tasks involving current events or recent developments, you will need Search Grounding enabled or an external retrieval layer.
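Enabling grounding is a one-line config change. A minimal sketch with the google-genai Python SDK:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Ground the answer in live search results to reach past the training cutoff.
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="What did Google announce about Gemini this month?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```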
The preview status has real implications. Google is still refining behavior based on developer feedback, which means you may see inconsistencies in output quality or API behavior before the GA release. For production applications where consistency is critical, building against a preview model carries some risk.
There is also a known behavior difference between the standard API endpoint and Google AI Studio. In AI Studio, the model may spread output across multiple files, producing higher-quality results for certain tasks. Under single-file constraints, such as a single script output, the same model can underperform relative to what it's capable of in a less constrained environment.
Conclusion
Gemini 3.1 Pro is a genuine improvement over its predecessor. The reasoning gains are backed by one of the largest benchmark jumps in the Gemini line's history, the pricing is unchanged, and the new thinking system adds meaningful control over cost and quality per request.
It is not the answer to every use case. For some coding benchmarks, GPT-5.3-Codex is the better tool. For human-preference tasks, Claude Opus 4.6 still holds ground. But for teams building agentic systems, processing large documents, or looking for frontier-class reasoning at a cost that doesn't require justification to a finance team, this is the most practical option available right now.