Tenstorrent Galaxy: 350 Tokens/s, 5x Lower AI TCO

Here’s the data point that should make you sit up: 350 tokens per second. Not on a cutting-edge NVIDIA Hopper or Blackwell, but on Tenstorrent’s new Galaxy Blackhole server, running the DeepSeek R1 model. This figure, tossed out during their TT-Deploy livestream, isn’t just a boast; it’s a gauntlet thrown at the feet of every established AI hardware vendor. Jim Keller’s crew is aiming to “crush everyone,” and they’re backing it with RISC-V architecture that, if their claims hold water, could redefine the economics of large-scale AI inference.

The Architecture of Ambition: RISC-V Meets the AI Workload

At the heart of Tenstorrent’s Galaxy servers lies the Blackhole chip, a beast built around cores that use the RISC-V instruction set architecture. This isn’t some experimental tinkerer’s project. RISC-V, an open-standard instruction set, has been slowly gaining traction, but its use in a high-performance AI accelerator like this is a significant step. The Tensix core within Blackhole is described as a programmable unit featuring RISC-V processors, matrix-multiply units, and vector units, all communicating over a high-bandwidth Network-on-Chip (NoC). It’s a modular design: many Tensix cores are tiled together to create the immense processing power needed for today’s AI models.

And the scale they’re talking about is staggering. The Galaxy Blackhole server, in its air-cooled configuration, packs 32 Blackhole chips, spitting out 23 PFLOPs of FP8 compute. That’s a raw number, but when you layer in the claimed 2.9 PB/s of on-chip SRAM bandwidth and a whopping 16 TB/s of DRAM bandwidth, you begin to grasp the architectural shift they’re pushing. This isn’t just about cramming more cores onto a die; it’s about optimizing the entire data pipeline – from memory access to compute execution – for the ravenous appetite of AI.
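To put those server-level totals in perspective, here is a rough back-of-the-envelope sketch that divides them across the 32 chips. It assumes the totals split evenly per chip, which the article does not confirm, and the helper name `per_chip` is purely illustrative.

```python
# Server-level figures quoted for the air-cooled Galaxy Blackhole.
CHIPS = 32
FP8_PFLOPS = 23.0      # total FP8 compute, PFLOPs
SRAM_BW_PB_S = 2.9     # total on-chip SRAM bandwidth, PB/s
DRAM_BW_TB_S = 16.0    # total DRAM bandwidth, TB/s


def per_chip(total: float, chips: int = CHIPS) -> float:
    """Divide a server-level total evenly across chips (an assumption)."""
    return total / chips


fp8_tflops_per_chip = per_chip(FP8_PFLOPS) * 1000    # PFLOPs -> TFLOPs
sram_tb_s_per_chip = per_chip(SRAM_BW_PB_S) * 1000   # PB/s -> TB/s
dram_gb_s_per_chip = per_chip(DRAM_BW_TB_S) * 1000   # TB/s -> GB/s

# FP8 operations available per byte fetched from DRAM, a rough
# roofline-style ratio computed from the quoted totals.
flops_per_dram_byte = (FP8_PFLOPS * 1e15) / (DRAM_BW_TB_S * 1e12)

print(f"FP8 per chip:  {fp8_tflops_per_chip:.2f} TFLOPs")
print(f"SRAM per chip: {sram_tb_s_per_chip:.3f} TB/s")
print(f"DRAM per chip: {dram_gb_s_per_chip:.0f} GB/s")
print(f"FLOPs per DRAM byte: {flops_per_dram_byte:.1f}")
```

The last ratio is the interesting one: with well over a thousand FP8 operations available per DRAM byte, keeping data resident in that fast on-chip SRAM is what would make the headline compute figure reachable in practice.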

Crushing Costs: The TCO Gambit

The performance numbers, while impressive, are only half the story. Tenstorrent is making an equally aggressive play on Total Cost of Ownership (TCO). They claim their Galaxy servers can achieve 5x lower AI TCO compared to NVIDIA’s GB300. This isn’t a marginal improvement; it’s a fundamental challenge to the economics of current AI deployments. The company’s PR spin, if you can call it that, is that competitors reach higher token throughput only by drastically reducing the number of users their systems can serve at once. Tenstorrent, conversely, aims to keep token cost low while maintaining high throughput.

The company claims that competing platforms reach higher token throughput only by drastically cutting the number of concurrent users. Tenstorrent says its Galaxy servers avoid that trade-off, holding token cost lower ($6 vs ~$30) and delivering much lower TCO for firms that deploy them.
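The quoted token costs are easy to sanity-check. The sketch below is only arithmetic on the two figures above. The source does not say what quantity of tokens each price covers, so the code treats it as an unspecified "billing unit", and the 1,000-unit workload is a made-up example.

```python
GALAXY_COST = 6.0        # dollars per billing unit (unit not given in source)
COMPETITOR_COST = 30.0   # dollars per billing unit, approximate ("~$30")


def cost_ratio(competitor: float, galaxy: float) -> float:
    """How many times more the competing platform costs per unit."""
    return competitor / galaxy


def spend(units: float, cost_per_unit: float) -> float:
    """Total spend for a workload measured in billing units."""
    return units * cost_per_unit


ratio = cost_ratio(COMPETITOR_COST, GALAXY_COST)
savings_fraction = 1 - GALAXY_COST / COMPETITOR_COST

# Hypothetical workload of 1,000 billing units, for illustration only.
UNITS = 1_000
galaxy_spend = spend(UNITS, GALAXY_COST)
competitor_spend = spend(UNITS, COMPETITOR_COST)

print(f"Cost ratio: {ratio:.1f}x")
print(f"Savings:    {savings_fraction:.0%}")
print(f"Spend:      ${galaxy_spend:,.0f} vs ${competitor_spend:,.0f}")
```

The ratio comes out at 5x, which lines up with the headline TCO figure, though the source does not state that the TCO claim is derived from these token prices.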

This is the critical lever. For enterprises and hyperscalers running massive inference workloads, the per-token cost and the overall power and cooling requirements (often bundled into TCO) are paramount. If Tenstorrent can deliver this 5x TCO reduction while meeting its performance claims, it could force those buyers into a seismic reevaluation of their hardware procurement strategies. It’s a bold move, and one that’s likely to be met with intense scrutiny from NVIDIA and its ecosystem.

Beyond Inference: Video GenAI and Latency-Sensitive Workloads

Tenstorrent isn’t stopping at LLM inference. They also showcased 10x faster GenAI video performance on their Galaxy Supercluster, generating an 81-frame 720p video in just 2.4 seconds. At playback rates up to 30 frames per second, that clip runs longer than the time it took to generate, which means faster-than-real-time video creation, a feat that would have seemed like science fiction a year ago. This demonstrates the versatility of their hardware and the flexibility of the RISC-V architecture in handling diverse AI workloads.
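Whether 81 frames in 2.4 seconds counts as "real time" depends on the playback frame rate, which the source does not state. A quick sketch, with the three playback rates below chosen purely as illustrative assumptions:

```python
FRAMES = 81
GENERATION_SECONDS = 2.4

# Frames produced per second of wall-clock generation time.
generation_fps = FRAMES / GENERATION_SECONDS


def realtime_factor(playback_fps: float) -> float:
    """Clip duration divided by generation time; >1 means faster than real time."""
    clip_seconds = FRAMES / playback_fps
    return clip_seconds / GENERATION_SECONDS


# Assumed playback rates; the article does not specify one.
factors = {fps: realtime_factor(fps) for fps in (16, 24, 30)}

print(f"Generation rate: {generation_fps:.2f} frames/s")
for fps, factor in factors.items():
    print(f"At {fps} fps playback: {factor:.2f}x real time")
```

Any playback rate below the 33.75 frames per second generation rate gives a factor above 1, so the real-time claim holds at all three of the assumed rates.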

Then there’s “Blitz Mode.” Optimized for premium, latency-sensitive AI tasks, this mode is what enables the eye-popping 350+ tokens/second on DeepSeek R1. Critically, they also claim sub-4-second time-to-first-token on a 100K context window. This is the kind of performance that matters for interactive AI applications, chatbots that feel genuinely responsive, and real-time analysis. Couple this with the claimed support for batch sizes from 8 to 64 and contexts of up to 128K, and you’re looking at a system designed to chew through complex, large-context AI problems with remarkable efficiency.
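Those two latency claims can be combined into a rough end-to-end estimate. The sketch below assumes "100K" means 100,000 tokens, treats the whole time-to-first-token as prompt processing, and uses the quoted figures as bounds (4 seconds as a ceiling, 350 tokens/s as a floor). The 700-token response length is a hypothetical example.

```python
CONTEXT_TOKENS = 100_000     # "100K" read as 100,000 tokens (assumption)
TTFT_SECONDS = 4.0           # upper bound implied by "sub-4-second"
DECODE_TOKENS_PER_S = 350.0  # lower bound implied by "350+"


def min_prefill_rate(context_tokens: int, ttft_seconds: float) -> float:
    """Prompt tokens per second needed to hit the TTFT bound."""
    return context_tokens / ttft_seconds


def response_time(output_tokens: int,
                  ttft_seconds: float = TTFT_SECONDS,
                  decode_rate: float = DECODE_TOKENS_PER_S) -> float:
    """Worst-case seconds from request to last token under the stated bounds."""
    return ttft_seconds + output_tokens / decode_rate


prefill_rate = min_prefill_rate(CONTEXT_TOKENS, TTFT_SECONDS)
total_seconds = response_time(700)  # hypothetical 700-token answer

print(f"Implied prefill rate: at least {prefill_rate:,.0f} tokens/s")
print(f"700-token reply on a full context: at most {total_seconds:.1f} s")
```

Under these assumptions the time-to-first-token claim implies ingesting at least 25,000 prompt tokens per second, and for long prompts that wait, not the decode speed, dominates how responsive the system feels.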

A RISC-V Uprising in the Datacenter?

The presence of Jim Keller, a figure synonymous with pioneering CPU designs and a deep understanding of silicon architecture, adds significant weight to Tenstorrent’s pronouncements. His involvement signals that this isn’t just about a new chip; it’s about a fundamental architectural approach aimed at disrupting the incumbent duopoly of x86 and ARM in the server space, with RISC-V as the chosen weapon. The open, royalty-free nature of the RISC-V standard also appeals to a segment of the developer and enterprise community tired of vendor lock-in and proprietary ecosystems.

The initial A0 silicon is shipping, though with acknowledged software bugs. This is typical for early silicon, and the fact that they’re addressing them rather than hiding them is, frankly, a good sign. The real test will be the software stack – the compilers, the runtimes, the libraries that allow developers to actually use this hardware effectively. Tenstorrent’s commitment to an open-source software stack is a strategic move designed to foster an ecosystem, much like what happened with Linux.

The pricing is also aggressive. Starting at $110,000 for an air-cooled rack configuration (with 32 Blackhole chips), and scaling up to supercluster configurations, the entry point is substantial, but the performance and TCO claims suggest a compelling ROI proposition. It’s a gamble, for sure, but for any organization looking to significantly scale its AI operations without breaking the bank, Tenstorrent’s Galaxy Blackhole servers represent a new, potentially disruptive contender. The question is no longer whether RISC-V can compete in AI, but how much it will shake up the established order.
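For a sense of what that entry price buys, here is a quick unit-cost sketch. It assumes the $110,000 starting price applies to the same 32-chip, 23 PFLOPs air-cooled configuration described earlier. It covers purchase price only, so it is not a TCO estimate.

```python
PRICE_USD = 110_000.0   # quoted starting price, air-cooled configuration
CHIPS = 32              # Blackhole chips in that configuration
FP8_PFLOPS = 23.0       # quoted FP8 compute (assumed to be the same config)


def unit_cost(price: float, quantity: float) -> float:
    """Price divided by a quantity (chips, PFLOPs, ...)."""
    return price / quantity


usd_per_chip = unit_cost(PRICE_USD, CHIPS)
usd_per_pflop = unit_cost(PRICE_USD, FP8_PFLOPS)

print(f"Per chip:      ${usd_per_chip:,.2f}")
print(f"Per FP8 PFLOP: ${usd_per_pflop:,.2f}")
```

Power, cooling, and rack space are left out here, and those are exactly the costs a TCO comparison would have to add back in.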

Frequently Asked Questions

What exactly is RISC-V and why is it important for AI?

RISC-V is an open-standard instruction set architecture (ISA). Unlike proprietary ISAs such as ARM or x86, RISC-V can be implemented and extended without license fees, which fosters innovation and allows specialized hardware designs for specific tasks like AI. Its modular base-plus-extensions structure makes it well suited to custom processing cores.

Will Tenstorrent’s Galaxy servers replace NVIDIA GPUs?

It’s unlikely to be a complete replacement in the short term. NVIDIA has a massive software ecosystem (CUDA) and a deeply entrenched market position. However, Tenstorrent’s claims on performance and TCO could make them a very strong alternative or complementary solution, especially for large-scale inference workloads where cost is a major factor.

How does the 5x lower AI TCO claim translate to real-world savings?

The TCO claim suggests that over the lifespan of the hardware, Tenstorrent’s servers would cost one-fifth as much to buy, operate, and maintain as NVIDIA’s GB300 for the same AI workload performance. This includes factors like purchase price, power consumption, cooling, and rack space. The figure is Tenstorrent’s own claim, so actual savings will depend on the workload and deployment.
