Nvidia FP64 Emulation Boosts HPC vs AMD

Nvidia's Rubin GPUs lean on FP64 emulation to claim HPC supremacy, but AMD questions whether the approach is ready for real-world workloads. Is it a workaround born of AI dominance?

Key Takeaways

  • Nvidia's Rubin uses FP64 emulation to reach up to 200 TFLOPS in double-precision matrix math, 4.4x Blackwell's hardware rate, targeting HPC.
  • AMD questions emulation's accuracy for real physics, prefers native FP64 hardware.
  • Emulation repurposes AI tensor cores, a pragmatic bridge in the AI-HPC convergence race.

Spotlights flicker in a dimly lit data center at Lawrence Livermore, where a rack of Nvidia Rubin prototypes hums, chasing FP64 precision through software tricks.

Nvidia’s FP64 emulation strategy hits the scene just as Rubin GPUs launch, promising up to 200 teraFLOPS in emulated double-precision matrix math, a 4.4x jump over Blackwell’s hardware limits. It’s a bold pivot: AI chips optimized for FP4 and FP8 are now flexing for high-performance computing (HPC), the domain where AMD’s Instinct accelerators have ruled with native FP64 muscle.

Look, Nvidia didn’t bake in more hardware FP64. Rubin peaks at 33 teraFLOPS natively, a tick below the four-year-old H100. Emulation via CUDA libraries unlocks the rest, decomposing each FP64 matrix operation into a batch of INT8 tensor core operations using the Ozaki scheme. Partners swear by it; Nvidia’s own tests back the accuracy.
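To make the decomposition concrete, here is a minimal Python sketch of the slicing idea behind Ozaki-style emulation, applied to a single dot product. It is an illustration under stated assumptions, not Nvidia’s library code: the names `split_into_int8_slices` and `emulated_dot`, the 7-bit slice width, and the choice of 8 slices are all made up for this example, and plain Python integers stand in for the INT8 tensor cores.

```python
import math

BITS = 7     # assumption: each slice fits a signed 8-bit integer, |digit| <= 127
SLICES = 8   # assumption: 8 slices * 7 bits = 56 bits, covering FP64's 53-bit mantissa


def split_into_int8_slices(vec):
    """Return (slices, exp) so that
    vec[i] ~= 2**exp * sum_k slices[k][i] * 2**(-BITS * (k + 1))."""
    exp = math.frexp(max(abs(v) for v in vec))[1]    # every |v| < 2**exp
    residual = [math.ldexp(v, -exp) for v in vec]    # exact scaling into (-1, 1)
    slices = []
    for _ in range(SLICES):
        shifted = [math.ldexp(r, BITS) for r in residual]    # exact power-of-two shift
        digits = [math.trunc(s) for s in shifted]            # integers in [-127, 127]
        residual = [s - d for s, d in zip(shifted, digits)]  # exact fractional remainder
        slices.append(digits)
    return slices, exp


def emulated_dot(x, y):
    """Dot product assembled from small-integer dot products of the slices."""
    xs, ex = split_into_int8_slices(x)
    ys, ey = split_into_int8_slices(y)
    total = 0
    for k, xk in enumerate(xs):
        for l, yl in enumerate(ys):
            # Exact integer dot product: the part an INT8 unit would compute.
            partial = sum(a * b for a, b in zip(xk, yl))
            total += partial << (BITS * (2 * SLICES - (k + l + 2)))
    # One rounding at the very end, when converting back to a float.
    return math.ldexp(float(total), ex + ey - 2 * BITS * SLICES)


x = [1.5, -2.25, 3.125]
y = [4.0, 0.5, -8.0]
print(emulated_dot(x, y))   # -20.125
```

With 8 slices per operand, this sketch performs 64 small integer dot products to stand in for one FP64 dot product, which hints at why emulation trades many cheap operations for one expensive one. Because every element of a vector shares a single scale here, entries much smaller than the largest one keep fewer significant bits.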

“What we found is, through many studies with partners and with our own internal investigations, is that the accuracy that we get from emulation is at least as good as what we would get out of a tensor core piece of hardware,” Dan Ernst, senior director of supercomputing products at Nvidia, told El Reg.

Ernst’s got a point: tensor cores are beasts at low precision, while Rubin’s 33 teraFLOPS of native FP64 sits roughly 1,000x below its 35 petaFLOPS FP4 peak. Why waste ’em? Emulation’s no stranger, either; software floating point dates to the 1950s, before hardware floating-point units existed. By the ’80s, dedicated FPUs had largely killed the need. Fast-forward: a 2024 paper from Tokyo researchers resurrects the idea with the Ozaki scheme, mapping FP64 math onto tensor cores for speed gains.

Why Is Nvidia Doubling Down on FP64 Emulation for HPC?

Market dynamics supply the answer. AI training guzzles single precision and below, but HPC clings to FP64 like a life raft: roughly 18.4 quintillion possible bit patterns per number (2^64), versus FP8’s measly 256. Planes fly, nukes arm, vaccines get modeled on this stuff. AI’s fuzzy tolerances? Fine for LLMs. Not for simulations where errors cascade into blowups and violate conservation laws.
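The gap is easy to check with a couple of lines of Python. This is plain arithmetic on bit widths: it counts bit patterns, so a handful of them (NaNs, infinities, signed zeros) are not distinct usable numbers.

```python
# Count the bit patterns available to a 64-bit and an 8-bit format.
fp64_patterns = 2 ** 64
fp8_patterns = 2 ** 8

print(f"{fp64_patterns:,}")   # 18,446,744,073,709,551,616
print(fp8_patterns)           # 256
```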

AMD’s MI300X, for instance, cranks out 163 teraFLOPS of native FP64 matrix math: hardware, no hacks. Nvidia’s Rubin? Hardware skimps, emulation fills. It’s pragmatic. Rubin ships as AI kingpin first (hello, 35 petaFLOPS FP4), HPC second. But here’s Nvidia’s spin: “It’s still FP64. Not mixed precision.” Ernst again. Smooth.

And yet.

Skepticism brews.

Nicholas Malaya, AMD fellow, pokes holes. Emulation shines in benchmarks with well-conditioned matrices, sure. Real physics? Dicey. Errors creep into poorly conditioned systems, like turbulent flows or quantum sims. “It’s quite good in some of the benchmarks, it’s not obvious it’s good in real, physical scientific simulations,” Malaya told reporters.
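Malaya’s worry about conditioning can be illustrated without any GPU at all. The sketch below is a toy, not a model of Nvidia’s emulation or of any real simulation: it solves two 2x2 linear systems with Cramer’s rule and nudges the right-hand side by 1e-12. The helpers `solve_2x2` and `shift_in_y` and both matrices are hypothetical, picked only to contrast a well-conditioned system with a nearly singular one.

```python
def solve_2x2(a, b, c, d, e, f):
    """Solve [[a, b], [c, d]] @ [x, y] = [e, f] with Cramer's rule."""
    det = a * d - b * c
    return (e * d - b * f) / det, (a * f - e * c) / det


def shift_in_y(a, b, c, d, e, f, nudge=1e-12):
    """How far y moves when the second right-hand-side entry is nudged."""
    _, y = solve_2x2(a, b, c, d, e, f)
    _, y_nudged = solve_2x2(a, b, c, d, e, f + nudge)
    return abs(y_nudged - y)


# Well-conditioned: the identity matrix.
well = shift_in_y(1.0, 0.0, 0.0, 1.0, 2.0, 2.0)
# Poorly conditioned: two almost identical rows.
ill = shift_in_y(1.0, 1.0, 1.0, 1.0 + 1e-10, 2.0, 2.0 + 1e-10)

print(well)   # on the order of 1e-12
print(ill)    # on the order of 1e-2
```

The same 1e-12 nudge moves the answer about ten billion times further in the nearly singular system. That amplification is the mechanism behind the concern, though it says nothing by itself about how large the emulation error actually is in practice.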

Can Nvidia’s Emulation Actually Replace Hardware FP64?

Data says maybe, mostly. Nvidia’s libraries, released late last year, run across Hopper, Blackwell, and Rubin. Partners like ORNL report parity in apps like OpenFOAM and weather models. But Malaya’s caveat lingers: propagation risk. A single small rounding error can snowball.

My take? Nvidia’s playing catch-up. They’ve ceded HPC ground to AMD since Ampere, betting the AI tsunami would pay dividends. The bet paid off: datacenter revenue exploded 400% YoY. But supercomputing bids (TOP500 lists) still favor AMD’s FP64 density. Emulation’s a bridge: cheap, software-upgradable. No respin required.

This echoes Intel’s x86 glory days. Pentium FDIV bug? Software workarounds until replacement chips shipped. Missing SSE? Emulate till hardware catches up. Nvidia’s tensor cores mirror that: repurposed silicon standing in for legacy precision. Bold prediction: by 2026, if Ozaki-style methods evolve (adaptive rounding, error bounds), emulation becomes standard practice. AMD counters with CDNA 4, hardware FP64 at 300+ TFLOPS. Race tightens.

But hype alert—Nvidia’s “most potent GPU for scientific computing in years”? On paper, yes. Real-world adoption? Labs will test. DOE contracts incoming; watch Frontier successor bids.

Numbers game: Rubin’s 200 TFLOPS of emulated FP64 matrix math tops the MI300X’s 163 TFLOPS native, but trails the 262 TFLOPS native projected for AMD’s MI325X (with denser packaging). Throughput is only part of the story, though. Power? Efficiency? Unclear. Tensor cores sip watts at INT8, but emulation piles on operations; a 2-3x slowdown versus native is possible.

What Does This Mean for the AMD-Nvidia HPC Arms Race?

AMD leads hardware FP64—MI300 series owns 60% of TOP500 non-Nvidia share. Nvidia? 90% AI market. Crossover play: Emulation lets Rubin contest exascale sims without HPC-specific silicon. Smart economics—unified stack sells to hyperscalers moonlighting in science.

Critique time. Nvidia’s PR glosses risks. “At least as good”? Partners cherry-pick. Independent audits scarce. Malaya’s right—prime time? Not yet. But pressure mounts; China’s Huawei eyes domestic HPC, U.S. export curbs bite.

Wider view. FP64’s gold standard endures because physics doesn’t bend. AI approximations warp reality; simulations can’t. Emulation threads the needle—use AI infra for HPC without full redesign.

Still, a hack. Prioritize tensor FLOPS for trillion-dollar AI? Logical. Sacrifice native FP64? Risky. AMD bets opposite—balanced Instincts snag Euro HPC deals.

Prediction sharpens: Nvidia wins volume, AMD keeps the precision purists. Rubin Ultra variants might boost hardware FP64 next gen. For now, emulation’s the squeeze: extra oomph from AI chips.

Market ripple: Nvidia stock +2% post-Rubin; AMD holds steady. TOP500 shifts? Q4 reveals.


Frequently Asked Questions

What is Nvidia’s FP64 emulation and the Ozaki scheme?

Nvidia’s CUDA emulation breaks FP64 matrix math into INT8 tensor ops, based on 2024 Ozaki research—boosting Rubin GPUs to 200 TFLOPS from 33 native.

Is FP64 emulation accurate enough for real HPC workloads?

Benchmarks say yes, partners agree—but AMD warns of error propagation in tough physics sims; more validation needed.

Will Nvidia’s emulation beat AMD in supercomputing rankings?

Possible on throughput, but AMD’s native FP64 density gives edge in power efficiency and proven sims—watch TOP500 lists.

Written by Priya Sundaram

Hardware and infrastructure reporter. Tracks GPU wars, chip design, and the compute economy.

Originally reported by The Register HPC
