Blog / Model Launch

Claude Opus 4.8: Faster, More Efficient, and More Honest with Better Judgment

Written by Faisal Saeed

Mon Jun 01 2026

Experience Chatly's groundbreaking features now, use Claude Opus 4.8, GPT-5.5 or any other model inside Chatly.

Claude Opus 4.8: Faster, More Efficient, and More Honest with Better Judgment

Anthropic shipped Claude Opus 4.8 on May 28, 2026, roughly six weeks after 4.7. The version number barely moved, and the spec sheet mostly held:

Same $5/$25 pricing
Same million-token context window
Same model family

So what's new? The model bluffs less, and when it isn't sure whether it actually finished a job, it says so instead of pretending that it did.

That sounds soft next to a coding score, yet it's the change you feel first in real work. Opus 4.8 builds on Claude Opus 4.7 with sharper judgment and more honesty about its own output. We will discuss what that means along with what other new features you should test in Opus 4.8

What Claude Opus 4.8 is

Opus 4.8 is Anthropic's most capable generally available model, sitting above Sonnet and Haiku at the top of the lineup. Developers call it with the model ID claude-opus-4-8.

It's aimed at the demanding end of the work:

production software engineering
multi-step agentic workflows
high-stakes professional tasks where a wrong answer carries real cost

It keeps the 1M-token context window, 128k maximum output, and adaptive thinking from 4.7, and the training cutoff is still January 2026. None of that is new.

What's new is behavioral, and Anthropic put it up front rather than under the benchmark table. You can run it now inside Chatly next to the other frontier models if you'd rather test the difference against your own prompts first.

What Changed Since Claude Opus 4.7

Set the behavior aside for a second, and Opus 4.8 is a tighter Claude Opus 4.7 with a few new features around it. Pricing held flat, which kills the usual is-it-worth-the-cost math before you start.

The specs that define the model carried over untouched, so the upgrade is low-risk by default.

The same-price part matters more than it sounds. A bump that holds cost flat turns the upgrade question from "is it worth paying more" into just "does it regress anything of mine," which is a much easier yes.

The behavioral targets Anthropic names in the docs are practical ones. Long-horizon agentic coding holds context better across long sessions, with fewer compactions and cleaner recovery when they happen.

Tool triggering improved too, fixing a 4.7 complaint where the model would skip a tool call the task clearly needed. And because adaptive thinking now decides per turn whether to reason at length, it wastes fewer thinking tokens on simple steps at the same effort.

Effort calibration got steadier as well, so each level behaves more predictably across different kinds of work.

A few things genuinely are new this release:

Fast mode runs Opus 4.8 at up to 2.5x the output speed and now costs roughly three times less than fast mode did on earlier models, at $10/$50 per million tokens.
Effort control landed in claude.ai and Cowork for every plan, not only the API. A control beside the model picker lets you trade speed for depth on a given task; higher effort thinks harder, lower effort answers quicker and burns your rate limits slower. The default is high.
Mid-conversation system messages let developers update instructions partway through a long task without breaking the prompt cache or faking a user turn.
Dynamic Workflows in Claude Code (research preview) lets the model spin up hundreds of parallel subagents in one session for codebase-scale jobs.
Minimum cacheable prompt dropped to 1,024 tokens from 4,096, so shorter prompts can now hit the cache.

On raw benchmarks, the gains that matter aren't spread evenly. A few stand out:

Agentic coding (SWE-bench Pro): 64.3% to 69.2%.
Long-context retrieval (GraphWalks at 1M tokens): 40.3% to 68.1%. The largest jump by far, and the kind of leap that decides whether you trust the model across a full codebase.
Math (USAMO 2026 proofs): 69.3% to 96.7% in a single cycle.

Tuned your prompts around 4.7's quirks? A couple of these behavior changes are worth a second look before you switch, and our guide to system prompts for the Opus line covers the habits that hold up across versions.

Now the honest part, since a post about a more honest model should hold itself to the same bar. Opus 4.8 isn't strictly better than 4.7 on everything:

GPQA Diamond slipped from 94.2% to 93.6% on a near-saturated science benchmark. Inside normal trial variance, but still a dip.
Prompt-injection resistance got worse. Red-team attack success climbed from about 6% to roughly 9.6% in Anthropic's own testing. Feed the model web pages or user uploads in production, and you'll want to review your sandboxing before migrating.
Multilingual work still goes to Gemini 3.1 Pro and GPT-5.5.

Claude Opus 4.8 is More Honest

Honesty has a precise meaning at Anthropic: the model should avoid claims it can't support.

Easy to say, hard to train.

The failure mode they're fighting is one every heavy LLM user has hit.

You ask for a fix
The model works a while
Then it announces success with total confidence
Later you find it skipped the task completely

Opus 4.8 does that less. A lot less in one case Anthropic measured: it's about four times less likely than Claude Opus 4.7 to let a flaw in code it wrote pass without flagging it.

So when it finishes a function and something's off, it's far likelier to acknowledge it than to report a clean win. For anyone reviewing AI-written code, that quietly removes a whole class of silent defect you used to catch yourself.

The way it gets there is worth understanding, because there's a trade-off. Opus 4.8 posts the lowest factual-error rate of any model Anthropic tested, and it manages that mostly by abstaining: when it doesn't know, it declines to guess.

Independent reviewers who ran it on release day saw the same thing.

A model that says "I'm not certain" costs you a follow-up question; a model that hands you a confident wrong answer costs you an afternoon of debugging on a false premise.

Anthropic's system card puts numbers on it.

On the check for whether the model raises problems it noticed, Claude Opus 4.8 stays quiet only about 3.7% of the time, and it's the first Claude model to score zero on uncritically reporting flawed results.
A related gain shows up on what Anthropic calls lazy investigation: digging into a problem instead of stopping early with a plausible-sounding answer. Opus 4.8 scored cleanly there, where Opus 4.7 returned a wrong answer about a quarter of the time.
On overconfidence, the card reports more than a tenfold improvement over Opus 4.7.

Claude Opus 4.8 Offers Better Judgment

Honesty is about not overstating. Judgment is about knowing what to do, and the two feed each other.

Testers who ran Opus 4.8 on agentic coding described a model that acts less like an eager intern and more like a careful colleague.

A staff engineer testing it in Claude Code was specific:

“In Claude Code, it asks the right questions, catches its own mistakes, pushes back when a plan isn’t sound, and builds up confidence around complex, multi-service explorations before making big changes.”

Instead of executing a flawed instruction and handing back a confidently broken result, it stops and questions the instruction. That shifts what you can safely delegate.

You can give it a fuzzy, half-specified task and trust it to surface the ambiguity rather than guess wrong and burn an hour.

An investment analyst who ran it on long research tasks said the standout was 4.8 proactively flagging problems in both the inputs and outputs of an analysis, the kind of thing other models leave for you to find later.

There's a structural side too.

Anthropic's alignment team, which assesses every model before release, found 4.8 hitting new highs on prosocial traits: backing your autonomy and acting in your interest instead of telling you what you want to hear.

Misaligned behavior like deception or playing along with misuse dropped sharply from Opus 4.7. Their internal misalignment score fell from roughly 2.5 to about 1.9, close to Claude Mythos Preview, the best-aligned model Anthropic has built and one it hasn't released widely.

Moving the alignment needle that far on a point upgrade, in six weeks, is unusual.

Why this Matters for Everyday Work

Benchmarks tell you what a model does on a good day. Honesty and judgment tell you what it does on a bad one, and that's where most of the real cost of AI hides.

The expensive failures aren't the tasks a model gets wrong; they're the ones it gets wrong while insisting it got them right.

For teams shipping AI-written work, the payoff is less review. When the model flags its own weak spots, your people can spend attention on the parts that truly need a second pair of eyes.

Picture a four-hour agentic run. On Claude Opus 4.7, one overconfident "done" three steps in could send the whole chain off a cliff, and you wouldn't know until you reviewed the final output.

With Opus 4.8 abstaining when unsure and flagging what it skipped, that failure surfaces while it's still cheap to fix. The honesty gain compounds with length: the longer the session, the more it's worth.

Independent reviewers settled on a fair read within a day:

Strong on greenfield builds and one-shot features.
Weaker on the last ten percent: edge cases inside an existing codebase and the gnarly integration work that still wants a human watching closely.

That's not a knock. It's the model being honest about its own ceiling, which is the whole theme of the release.

The fastest way to judge it is to run it on work you already understand, where a bluff is obvious the second it shows up. You can do that in Chatly next to the other frontier models without an API key.

Access and Migration

Opus 4.8 went live everywhere at launch: Claude for Pro, Max, Team, and Enterprise, the Claude API, plus Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Pricing matches 4.7 to the dollar.

For most teams the switch is close to drop-in, since the API constraints carry over from 4.7 with no breaking changes. The behavior shifts above can change what your existing prompts produce, so run a quick pass on your evals before you flip production traffic.

Two levers cut the bill at scale:

Prompt caching saves up to 90% on repeated context
Batch processing takes another 50% off.

Teams that have to keep inference in the US can run it there at 1.1x token pricing.

Want to try it for free first? Our writeup on ways to access Opus without paying still applies, and if you're weighing it against OpenAI's frontier, our Opus versus GPT comparison lays out the trade-offs.

Conclusion

Anthropic was unusually direct about the roadmap. Two things are coming: cheaper models with Opus-level capability, and a more powerful class above Opus.

That second one is Mythos, in limited release to a few organizations for cybersecurity work under Project Glasswing, held back until the safety guardrails catch up to the capability. Anthropic expects to open Mythos-class models to everyone in the coming weeks.

Gating release on safety rather than capability is the same restraint the honesty work shows at the model level: hold back the confident claim until it's earned.

Read against that roadmap, 4.8 looks less like a destination than a checkpoint: same price, more trustworthy, a preview of the judgment Anthropic wants its bigger models to carry before they ship. For the longer view, we wrote up what to expect from the next Claude generation.

The short version for anyone deciding today: if you've been burned by a model confidently telling you a job was done when it wasn't, that's the exact problem 4.8 targets. Run it on something you can verify, and you'll know inside an hour whether the honesty story holds.

Frequently Asked Question

Learn what's new and what has people talking about the new Claude Opus 4.8.

Claude Opus 4.7 Overview: Capabilities, Migration Challenges, & Pricing

Faisal Saeed

Claude Opus 4.7 vs GPT-5.4: Benchmarks, Pricing & Which to Use

Faisal Saeed

How to Use Claude Opus 4.7 for Free – 5 Ways in 2026

Faisal Saeed

15 Best System Prompts for Claude Opus 4.7 – Coding, Writing & Research

Faisal Saeed

11 Best AI Tools for Businesses in 2026

Umaima Shah

Here's upto $10 of credits for free, on us.

Not ready? Invite friends instead

Here's upto $10 of credits for free, on us.

Not ready? Invite friends instead

Claude Opus 4.8: Faster, More Efficient, and More Honest with Better Judgment

What Claude Opus 4.8 is

What Changed Since Claude Opus 4.7

Claude Opus 4.8 is More Honest

Claude Opus 4.8 Offers Better Judgment

Why this Matters for Everyday Work

Access and Migration

Conclusion

Frequently Asked Question

Claude Opus 4.7 Overview: Capabilities, Migration Challenges, & Pricing

Claude Opus 4.7 vs GPT-5.4: Benchmarks, Pricing & Which to Use

How to Use Claude Opus 4.7 for Free – 5 Ways in 2026

15 Best System Prompts for Claude Opus 4.7 – Coding, Writing & Research

11 Best AI Tools for Businesses in 2026

Here's upto $10 of credits for free, on us.

Not ready? Invite friends instead

Here's upto $10 of credits for free, on us.

Not ready? Invite friends instead

Claude Opus 4.8: Faster, More Efficient, and More Honest with Better Judgment

What Claude Opus 4.8 is

What Changed Since Claude Opus 4.7

Claude Opus 4.8 is More Honest

Claude Opus 4.8 Offers Better Judgment

Why this Matters for Everyday Work

Access and Migration

Conclusion

Frequently Asked Question

What's the difference between Claude Opus 4.8 and Sonnet?

Can Claude Opus 4.8 analyze images and PDFs?

Can Claude Opus 4.8 access the internet?

Does Anthropic train on my Claude conversations?

Is Claude Opus 4.7 still available?

Can I use Claude Opus 4.8 in Cursor or an OpenAI-compatible client?

Is Claude Opus 4.8 only good for coding?

Claude Opus 4.7 Overview: Capabilities, Migration Challenges, & Pricing

Claude Opus 4.7 vs GPT-5.4: Benchmarks, Pricing & Which to Use

How to Use Claude Opus 4.7 for Free – 5 Ways in 2026

15 Best System Prompts for Claude Opus 4.7 – Coding, Writing & Research

11 Best AI Tools for Businesses in 2026