Claude Sonnet 5 Benchmark: How It Compares to Opus 4.8 and Sonnet 4.6

The headline Claude Sonnet 5 benchmark number is a 63.2% score on agentic coding, which lands the model neatly between the older Sonnet 4.6 at 58.1% and the flagship Opus 4.8 at 69.2%. Anthropic released Sonnet 5 on June 30, 2026, and framed it in its launch announcement as its most agentic Sonnet model yet.

The takeaway for anyone reading a benchmark chart is simple: Sonnet 5 delivers close-to-Opus quality at a much lower price. On launch day it also became the default model for Free and Pro plans, so most people are already using it whether they compared the numbers or not.

Claude Sonnet 5 benchmark performance

The Headline Sonnet 5 Benchmark Numbers

Anthropic positions Sonnet 5 as a strict improvement over its predecessor across reasoning, tool use, coding, and knowledge work. The clearest single figure comes from the company’s agentic coding evaluation, where the model closes nearly half of the distance between Sonnet 4.6 and the flagship in one generation.

Agentic coding: 63.2%

Sonnet 5 scores 63.2% on Anthropic’s agentic coding benchmark, versus 69.2% for Opus 4.8 and 58.1% for the previous Sonnet 4.6. That is a 5.1-point jump over the last Sonnet in a single release, and it narrows the gap to the more expensive Opus tier to roughly six points. For a mid-tier model, closing that much ground while staying cheaper than Opus is the whole pitch.
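As a quick check of the arithmetic behind those figures, the sketch below recomputes the point gaps from the three scores quoted in this article. The helper name `point_gap` and the `share_closed` metric are illustrative choices, not anything Anthropic publishes.

```python
# Agentic coding scores in percent, as quoted in this article.
SCORES = {
    "Claude Opus 4.8": 69.2,
    "Claude Sonnet 5": 63.2,
    "Claude Sonnet 4.6": 58.1,
}


def point_gap(a: str, b: str) -> float:
    """Return score(a) - score(b) in percentage points, rounded to one decimal."""
    return round(SCORES[a] - SCORES[b], 1)


gain_over_previous = point_gap("Claude Sonnet 5", "Claude Sonnet 4.6")
gap_to_flagship = point_gap("Claude Opus 4.8", "Claude Sonnet 5")

# Share of the old Sonnet-to-Opus distance that the new Sonnet closed.
share_closed = gain_over_previous / point_gap("Claude Opus 4.8", "Claude Sonnet 4.6")

print(f"Gain over Sonnet 4.6: {gain_over_previous} points")
print(f"Remaining gap to Opus 4.8: {gap_to_flagship} points")
print(f"Share of the 4.6-to-Opus gap closed: {share_closed:.0%}")
```

Expressing the jump as a share of the earlier gap is one way to judge how much ground a release actually covered, rather than reading the raw point change alone.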

Knowledge work: slightly ahead of Opus

Coding is not the only axis. On a knowledge-work benchmark, Sonnet 5 actually slightly outperforms Opus 4.8 — the model usually reserved for the hardest judgment calls and deep research. Opus 4.8 still wins on top-accuracy tasks, but the fact that a Sonnet-class model can edge the flagship anywhere is new.

| Model | Agentic coding | Notes |
| --- | --- | --- |
| Claude Opus 4.8 | 69.2% | Flagship, highest accuracy |
| Claude Sonnet 5 | 63.2% | Near-Opus, slightly ahead on some knowledge work |
| Claude Sonnet 4.6 | 58.1% | Previous Sonnet, now superseded |

Sonnet 5 vs Opus 4.8: How Close Is It?

Benchmarks alone hide an important lever: effort. Sonnet 5 lets you dial reasoning depth up or down, which changes both the score and the cost of every request.

The effort-level lever

At its Extra High effort setting, Sonnet 5 lands roughly in line with Opus 4.8’s medium-to-high setting on the OSWorld-Verified computer-use benchmark and the BrowseComp agentic-search benchmark. The catch is that running Sonnet 5 that high can cost more than Opus 4.8 at a comparable level, so for top-accuracy work Opus 4.8 remains the better pick. The practical result is a spectrum: you tune effort to find the balance of cost and performance your task needs.
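One way to act on that spectrum is to measure each model and effort combination on your own evaluation, then pick the cheapest one that clears your accuracy bar. The sketch below assumes you have already collected such measurements. The `Config` type, the labels, and every number in `measurements` are hypothetical placeholders, not published results or real API identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    model: str
    effort: str
    score: float     # percent, measured on your own evaluation
    cost_usd: float  # average cost per task, measured on your own workload


def cheapest_meeting_target(configs, target_score):
    """Return the lowest-cost config whose score meets the target, or None."""
    eligible = [c for c in configs if c.score >= target_score]
    return min(eligible, key=lambda c: c.cost_usd, default=None)


# Placeholder numbers for illustration only; replace with your own measurements.
measurements = [
    Config("sonnet-5", "medium", 55.0, 0.10),
    Config("sonnet-5", "extra-high", 62.0, 0.45),
    Config("opus-4.8", "medium", 62.0, 0.40),
    Config("opus-4.8", "high", 66.0, 0.60),
]

choice = cheapest_meeting_target(measurements, target_score=60.0)
print(choice)
```

With these made-up inputs the higher-effort Sonnet setting is not the cheapest way to reach the target, which mirrors the catch described above. Real results depend entirely on your own measurements.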

Where Anthropic draws the line

Anthropic is explicit that the flagship is not obsolete, but that the cheaper tier has closed much of the quality gap:

“Opus 4.8 is still the model of choice for higher accuracy on these tasks, but Sonnet 5 provides developers with lower-priced options that are of much higher quality than what was previously available.” (Anthropic)

That framing matters for reading any benchmark table: the two models are meant to overlap, not to replace each other.

Sonnet 5 vs Sonnet 4.6: A Strict Upgrade

Against the model it replaces, the comparison is one-directional. On every benchmark Anthropic has published so far, Sonnet 5 outperforms Sonnet 4.6 — in reasoning, tool use, software coding, and knowledge work alike.

The score story is backed by behavior that does not show up in a single percentage. Testers cited by Anthropic said the model finishes complex tasks where previous Sonnets would stop short, and that it checks its own output without being asked. This self-correction — reviewing an answer and fixing errors before you see them — is part of why the same task can feel more reliable even when the raw number moves only a few points.

One migration note sits behind the pricing: Sonnet 5 uses a new tokenizer, so the same text maps to roughly 30% more tokens than on Sonnet 4.6. Per-token pricing is unchanged, but because the same request is now counted as more tokens, it can cost roughly 30% more at an identical rate.
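To see what the tokenizer change means for a bill, the sketch below scales a token count by the article's "roughly 30%" figure and prices both counts at the same rate. The 2-million-token workload is a hypothetical example, the $3 rate is the standard Sonnet 5 input price quoted in this article, and the 1.30 multiplier is an approximation that will vary with the text being tokenized.

```python
# "Roughly 30% more tokens" per the article; the true ratio depends on the text.
TOKEN_INFLATION = 1.30


def migrated_cost(old_tokens: int, price_per_million: float,
                  inflation: float = TOKEN_INFLATION) -> tuple[float, float]:
    """Return (old_cost, new_cost) in USD for the same text at the same per-token price."""
    old_cost = old_tokens / 1_000_000 * price_per_million
    new_cost = old_tokens * inflation / 1_000_000 * price_per_million
    return old_cost, new_cost


# Hypothetical workload: text that counted as 2M input tokens on Sonnet 4.6.
old, new = migrated_cost(old_tokens=2_000_000, price_per_million=3.0)
print(f"Before: ${old:.2f}  After: ${new:.2f}  Change: {new / old - 1:.0%}")
```

Counting tokens on a sample of your own prompts gives a better multiplier than the headline figure.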

Safety Benchmarks

Not every benchmark measures capability. Anthropic’s pre-deployment safety evaluations found Sonnet 5 to be an overall improvement on Sonnet 4.6, with lower rates of hallucination and sycophancy and better resistance to misuse.

Prompt injection and cyber

On agentic safety, Sonnet 5 is better at refusing malicious requests and resisting hijack attempts in prompt-injection attacks. The most concrete safety figure comes from a cyber benchmark built with Mozilla that tested whether models could develop exploits for vulnerabilities in Firefox 147. Both new Sonnet models scored 0.0% — Sonnet 5 never produced a full working exploit — and all the tested vulnerabilities were patched in Firefox 148. Anthropic still shipped the model with cyber safeguards enabled by default, and documents the full results in its Claude Sonnet 5 System Card.

Pricing and Cost-per-Benchmark

A benchmark score is only half the value; the other half is what each point costs. This is where Sonnet 5’s numbers become genuinely attractive.

The introductory window

API pricing is an introductory $2 per million input tokens and $10 per million output tokens through August 31, 2026, after which it moves to $3 per million input and $15 per million output. You can confirm the current rates on the official Claude Platform pricing page. By comparison, Opus 4.8 costs $5 per million input and $25 per million output. Sonnet 5 therefore delivers benchmarks within striking distance of the flagship at 60% of its per-token price at standard rates, and 40% during the introductory window.

To read a cost-per-benchmark comparison for your own workload:

  1. Pick the benchmark closest to your task (agentic coding, computer use, or knowledge work).
  2. Note each model’s score on that benchmark.
  3. Multiply your expected input and output tokens by each model’s per-token price.
  4. Divide the cost by the benchmark score to get a rough cost-per-point.
  5. Repeat at a lower effort level for Sonnet 5 to see the cheaper end of its range.

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
| --- | --- | --- |
| Claude Sonnet 5 (intro to Aug 31) | $2 | $10 |
| Claude Sonnet 5 (standard) | $3 | $15 |
| Claude Opus 4.8 | $5 | $25 |
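
Steps 3 and 4 of the list above can be sketched in a few lines. The prices and agentic coding scores are the ones quoted in this article. The workload of 10 million input tokens and 2 million output tokens is a hypothetical example, and the sketch assumes every model uses the same token counts, which will not hold exactly once effort levels and tokenizers differ.

```python
# USD per 1M tokens (input, output), as quoted in this article.
PRICES = {
    "Claude Sonnet 5 (intro)": (2.0, 10.0),
    "Claude Sonnet 5 (standard)": (3.0, 15.0),
    "Claude Opus 4.8": (5.0, 25.0),
}

# Agentic coding scores in percent, as quoted in this article.
SCORES = {
    "Claude Sonnet 5 (intro)": 63.2,
    "Claude Sonnet 5 (standard)": 63.2,
    "Claude Opus 4.8": 69.2,
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Step 3: total USD cost of a workload at the model's per-token prices."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1_000_000 * in_price + output_tokens / 1_000_000 * out_price


def cost_per_point(model: str, input_tokens: int, output_tokens: int) -> float:
    """Step 4: rough USD cost per benchmark point for that workload."""
    return workload_cost(model, input_tokens, output_tokens) / SCORES[model]


# Hypothetical workload; substitute your own expected token counts.
for name in PRICES:
    cost = workload_cost(name, 10_000_000, 2_000_000)
    per_point = cost_per_point(name, 10_000_000, 2_000_000)
    print(f"{name}: ${cost:.2f} total, ${per_point:.3f} per point")
```

For step 5, rerun the calculation with the token counts and score you measure at a lower effort level, since both change with effort.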

How Sonnet 5 Stacks Up Against GPT-5.6 and Gemini

Benchmarks are competitive signals as much as technical ones. Anthropic frames Sonnet 5 as a cheaper way to run agents, in competition with OpenAI’s GPT-5.6 Sol and Google’s Gemini 3.5 Flash, both pitched as agentic models that plan and act with minimal oversight.

Cheaper than most rivals, pricier than one. Sonnet 5 undercuts Opus 4.8, OpenAI’s GPT-5.5, and Google’s Gemini 3.1 Pro, while remaining more expensive than Gemini 3.5 Flash. The competitive read is that agentic capability is now the baseline at every price tier, so the differentiator shifts to how cheaply and how reliably a model can do the work — exactly the ground Sonnet 5’s benchmark-per-dollar story is built to win.
