NVIDIA Blackwell Lowest Token Cost in MLPerf v6.0

NVIDIA's platform just swept MLPerf Inference v6.0, claiming the lowest token costs ever. But after 20 years watching Valley hype, I'm asking: who's actually banking the profits here?

Key Takeaways

  • NVIDIA Blackwell Ultra topped MLPerf v6.0 across all new benchmarks, with massive token throughput gains.
  • Software updates like TensorRT-LLM delivered 2.7x performance on existing hardware, slashing token costs.
  • Skeptical view: NVIDIA's ecosystem moat ensures dominance, but open-source rivals loom.

What if the secret to AI riches isn’t bigger chips, but NVIDIA’s iron grip on the whole damn stack?

NVIDIA Platform Delivers Lowest Token Cost Enabled by Extreme Co-Design—that’s their headline, straight from the press release. And yeah, in MLPerf Inference v6.0, Blackwell Ultra GPUs did smoke the competition across a pile of new models. Highest throughput, widest coverage. But hold on. Benchmarks aren’t paychecks.

I’ve chased these numbers for two decades. Remember when Intel ruled with SPEC scores? Turns out, real workloads laughed at ‘em. NVIDIA’s racking up wins—291 total since 2018, 9x everyone else combined. Impressive? Sure. But who cashes in? Not you, staring at your ChatGPT bill.

Look, 14 partners jumped in this round: ASUS, Cisco, CoreWeave, Dell, and the rest. Broad ecosystem, they brag. Largest ever on one platform. Fine. But it’s still NVIDIA’s playground.

NVIDIA’s Clean Sweep in the New Benchmarks

MLPerf v6.0 threw curveballs: DeepSeek-R1 Interactive, Qwen3-VL-235B-A22B (first vision-language beast), GPT-OSS-120B, WAN-2.2 text-to-video, DLRMv3 recs. NVIDIA submitted on all new ones. Top dog every time.

Here’s a killer stat from their table:

Offline: 2,494,310 tokens/sec on DeepSeek-R1. Server: 1,555,110 tokens/sec. That’s Blackwell Ultra flexing.

Numbers like 1,046,150 tokens/sec on GPT-OSS offline? Insane. WAN-2.2 single-stream at 21 seconds latency—lower’s better there. DLRMv3 cranking 104,637 samples/sec offline.

These aren’t toy tests. They’re edging toward real AI factories, where token cost decides whether your startup lives or folds: MoE architectures, multimodal madness, video gen that could disrupt Hollywood hacks. Or not.

Does Extreme Co-Design Actually Slash Token Costs?

They swear that co-designing hardware, software, and models delivers peak throughput and the lowest token cost. The pitch: real-world inference performance matters more than peak specs.

TensorRT-LLM updates? Up to 2.7x gains on the same Blackwell Ultra GPUs. The GB300 NVL72 from last year now pushes 2.7x more tokens on the DeepSeek-R1 server scenario. Same hardware, 2.7x the output: that’s a cost drop of over 60% per token. Nebius pulled it off.
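The "over 60%" figure is plain arithmetic, and a minimal sketch makes it checkable. It assumes the hardware and its hourly cost stay fixed while throughput rises by the quoted 2.7x; the function name `cost_drop_per_token` is made up for illustration and does not come from NVIDIA's post.

```python
def cost_drop_per_token(speedup: float) -> float:
    """Fractional drop in cost per token when throughput rises by `speedup`.

    Assumption: same hardware, same hourly cost, so cost per token
    scales as 1 / throughput.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# 2.7x is the software speedup quoted in the article.
drop = cost_drop_per_token(2.7)
print(f"Cost per token falls by {drop:.1%}")  # Cost per token falls by 63.0%
```

So the claim holds only if the infrastructure bill really stays flat. Any extra power or licensing cost that comes with the update would shrink that number.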

Faster kernels, they say. (Post cut off there, typical PR tease.) Keeps old GPUs humming in clouds. Headroom for bigger models, longer contexts.

Cynical me sees the moat. CUDA’s lock-in—partners optimize ‘cause they must. Open ecosystem? Ha. It’s velvet handcuffs.

One insight you won’t find in their post, the kind partners have made fortunes on: this mirrors Intel’s x86 stranglehold in the ’90s. Everyone built on it, Intel printed money. NVIDIA’s doing the same with AI stacks. Prediction: two more years of dominance, then open-source challengers like AMD’s ROCm erode it. But right now? Cash cow.

NVIDIA wins.

Partners like CoreWeave and Supermicro are printing revenue serving hyperscalers. Google Cloud, HPE? Integrating deep, locking customers in. You? Pray your inference runs on Blackwell, or watch costs balloon. Extreme co-design sounds sexy (buzzword alert), but it’s NVIDIA dictating terms. Who profits? Jensen Huang’s yacht fund, mostly.

Software keeps improving post-ship. That’s the edge. Competitors ship hardware, pray for drivers. NVIDIA iterates, squeezing more tokens from silicon. Factories hum louder, revenues spike.

But real-world? Power walls. Data center heat. These benchmarks assume perfect cooling, infinite power. Street reality: brownouts, capex overruns.

Why No One Else Showed Up for the New Tests

Only NVIDIA hit every new scenario. Others? Crickets on Qwen3-VL, WAN-2.2. Why? Software lag. Can’t tune like NVIDIA’s army of engineers.

GPT-OSS-120B, OpenAI’s MoE gift—NVIDIA owned offline, server, interactive. 677k tokens/sec interactive? That’s chatbot responsiveness on steroids.

DeepSeek-R1 interactive: 5x faster min token rate, 1.3x quicker first token. High-interactivity wins go to Blackwell.

Skepticism spike: MLPerf’s industry standard, but submitters self-select. NVIDIA floods entries. Others boycott ‘cause they lose?

Token Cost: Hype or Hard Cash?

Lowest token cost—key phrase. Drives AI factory revenue, they claim. Throughput times price minus opex.

2.7x speedup? Same infra, more users, fatter margins. But who’s buying? Hyperscalers already all-in on NVIDIA. Indies? Locked out by scale.
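To see how "throughput times price minus opex" plays out, here is a rough sketch. Every number in it (baseline tokens per second, price per million tokens, hourly opex) is a hypothetical placeholder, not a figure from MLPerf or NVIDIA, and it assumes every token produced gets sold at a fixed price.

```python
def hourly_margin(tokens_per_sec: float,
                  price_per_million_tokens: float,
                  opex_per_hour: float) -> float:
    """Hourly margin = throughput * price - opex, assuming full utilization."""
    tokens_per_hour = tokens_per_sec * 3600
    revenue_per_hour = tokens_per_hour / 1_000_000 * price_per_million_tokens
    return revenue_per_hour - opex_per_hour


# Hypothetical inputs, chosen only to illustrate the formula.
BASE_TOKENS_PER_SEC = 100_000
PRICE_PER_MILLION = 0.50   # dollars per million tokens (assumed)
OPEX_PER_HOUR = 100.0      # power, cooling, staff, depreciation (assumed)
SPEEDUP = 2.7              # software speedup quoted in the article

before = hourly_margin(BASE_TOKENS_PER_SEC, PRICE_PER_MILLION, OPEX_PER_HOUR)
after = hourly_margin(BASE_TOKENS_PER_SEC * SPEEDUP, PRICE_PER_MILLION, OPEX_PER_HOUR)

print(f"margin before: ${before:.2f}/hour")  # margin before: $80.00/hour
print(f"margin after:  ${after:.2f}/hour")   # margin after:  $386.00/hour
```

With these made-up inputs, margin grows far more than 2.7x because opex is fixed. That only holds if demand soaks up the extra tokens and prices don't fall, which is exactly where the skepticism belongs.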

Unique twist: Remember Bitcoin mining? ASICs crushed GPUs. Here, NVIDIA’s the ASIC maker for AI inference. Vertical integration wins. Prediction—bold one: by 2026, token costs halve again, but NVIDIA takes 80% market share. Others? Service providers on NVIDIA gear.

Question the spin. PR screams ‘highest performance,’ but whispers ‘on our stuff.’

I covered CUDA’s birth, and I was skeptical of the “open” talk then. It’s proprietary now. History rhymes.


Frequently Asked Questions

What is MLPerf Inference v6.0? Industry benchmark for AI inference across models like LLMs, vision, video gen. Measures real throughput, not flops.

Does NVIDIA Blackwell deliver lowest token cost? In benchmarks, yes—up to 2.7x gains via software. Real factories? Depends on your workload, power bill.

Can AMD or Intel beat NVIDIA in AI benchmarks? Not yet. Software gap’s huge. Give it 18 months.

Written by Aisha Patel

Former ML engineer turned writer. Covers computer vision and robotics with a practitioner perspective.

Originally reported by NVIDIA Developer Blog
