Did you ever stop to think about how much energy your chip actually wastes just talking to itself?
It’s an almost absurd question when you’re marveling at the raw teraflops of a new GPU or the clock speed of the latest CPU. We’re so focused on what the silicon does that we forget the immense overhead of how it communicates. And that’s where Broadcom’s gambit, quietly outlined in recent discussions, starts to look less like incremental improvement and more like a fundamental architectural pivot: the move to truly vertical compute.
For years, the industry has been inching its way skyward. We’ve seen High Bandwidth Memory (HBM) stacks, those dense towers of DRAM, conquer the vertical plane. It’s a relatively straightforward win, given memory’s comparatively sedate power demands. Then came 2.5D stacking, essentially a sophisticated interposer acting as a mini-highway to connect separate chiplets—think GPUs and those burgeoning XPUs—to those HBM stacks. AMD, bless its pioneering heart, even managed to stack L3 cache directly atop its Epyc CPUs, a form of 3D stacking that, frankly, I’ve always wondered why more companies didn’t adopt sooner. More cache, right on the CPU die? Why wouldn’t you?
The logic for going vertical is as obvious as the silicon itself. We’re not just building bigger chips anymore; we’re building denser, more integrated systems within the same physical footprint. Forget the sprawling server racks; the real unit of compute is increasingly the socket itself.
But let’s talk numbers, because that’s where this gets interesting. Harish Bharadwaj, a VP at Broadcom steering their 3.5D Extreme Dimension System in Package (XDSiP) initiative, dropped a pretty staggering statistic: connecting four or eight GPUs or XPUs on a standard system board burns between 3 and 5 picojoules per bit. That sounds tiny, but multiplied across the torrent of bits moving between accelerators in high-performance computing, especially AI training, it’s a chasm. Now, collapse that same compute cluster into a single socket using Broadcom’s die-to-die interconnects, and that energy cost plummets to less than 0.2 picojoules per bit. Less distance, less latency, less power burn. It’s a direct consequence of keeping these high-speed conversations inside the package, rather than forcing them across motherboard traces.
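To put those per-bit figures in perspective, here is a rough back-of-the-envelope sketch in Python. It converts energy per bit into sustained link power using power = energy per bit × bits per second. The 3, 5, and 0.2 picojoule figures come from Bharadwaj’s statistic above; the 1 TB/s traffic rate and the helper name `link_power_watts` are illustrative assumptions, not Broadcom numbers.

```python
PICOJOULE = 1e-12  # one picojoule, expressed in joules


def link_power_watts(energy_pj_per_bit: float, traffic_bytes_per_s: float) -> float:
    """Sustained power spent moving data: (joules/bit) * (bits/second) = watts."""
    bits_per_second = traffic_bytes_per_s * 8
    return energy_pj_per_bit * PICOJOULE * bits_per_second


# ASSUMPTION: 1 TB/s of chip-to-chip traffic, chosen only for illustration.
ASSUMED_TRAFFIC = 1e12  # bytes per second

board_low = link_power_watts(3.0, ASSUMED_TRAFFIC)
board_high = link_power_watts(5.0, ASSUMED_TRAFFIC)
# The article says "less than 0.2", so this is an upper bound.
in_package = link_power_watts(0.2, ASSUMED_TRAFFIC)

print(f"System board: {board_low:.1f} to {board_high:.1f} W")
print(f"In package:   under {in_package:.1f} W")
print(f"Reduction:    at least {board_low / in_package:.0f}x")
```

Under that assumed traffic rate, the board-level links cost roughly 24 to 40 watts, against under 1.6 watts inside the package. The absolute wattage scales linearly with whatever traffic rate you plug in; the ratio of at least 15x does not change.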
This is why 3D stacking, despite its inherent complexities and the inevitable cost escalations, isn’t just a trend; it’s an inevitability. Broadcom’s 3.5D XPU designs aren’t just stacking a single compute chiplet. We’re talking multiple stacked compute dies plus multiple stacks of HBM memory. Their initial XDSiP could handle a dozen HBM stacks, and they’re pushing that number higher still. Why? Because the compute giants want to stay a generation behind on expensive HBM, opting for more, cheaper HBM to achieve the sheer capacity and bandwidth they crave.
We’ve seen this play out. Google’s latest TPU 8 XPUs lean on HBM3E instead of the bleeding edge. SambaNova Systems did something similar with their SN50 RDU, using HBM2E for cost-effective depth. While Google gets Broadcom’s help designing and manufacturing its TPUs, it doesn’t appear to be leveraging the 3.5D XDSiP, at least not yet.
Fujitsu, on the other hand? They’re all in. Their forthcoming “Monaka” Arm server CPU, slated for 2027, is a prime example. This beast will boast 144 Armv9-A cores, utilizing a mix of 2nm and 5nm chiplets, all integrated through Broadcom’s 3D compute stacking. Samples are already back from Broadcom’s labs, demonstrating the real-world application of this technology.
So, how does Fujitsu actually implement this complex XDSiP? The details are still under wraps, a calculated reveal planned for the Monaka launch. But Bharadwaj hinted at something fascinating: stacking a 2nm compute chiplet atop a 5nm compute tile. This isn’t just about making chips smaller; it’s about strategically placing the most advanced, power-hungry compute where it’s most efficient, while using older, less demanding nodes for memory and interconnect.
And Fujitsu isn’t alone. Bharadwaj revealed half a dozen other major players are embedding 3.5D XDSiP into their custom AI XPU designs. Amazon Web Services (AWS) with its upcoming Trainium4 and Meta Platforms with its MTIA 500 are rumored to be on this path, likely seeing volume in 2027. The strategy, as Bharadwaj articulates, is elegant:
“The key thing is that customers using 3.5D XDSiP is to keep the top die in the most advanced silicon node so that it can do the highest performance compute. There are customers doing 3 nanometer over 3 nanometer, 2 nanometer over 3 nanometer, and even 1.4 nanometer over 3 nanometer. That thing is kind of evolving. The point is, putting the high performance compute at the top makes it easier for the heat to escape, and then you put the SRAM and some low activity compute and the interconnect at the bottom so that the heat is less and but is still able to escape.”
This isn’t just about squeezing more cores into a socket; it’s a sophisticated dance of thermal management and node optimization. Placing the highest-performance, highest-heat-generating compute at the top makes thermal dissipation (always the bane of dense chip design) more manageable. The lower layers, housing SRAM, low-activity compute, and interconnect, generate less heat, so what they do produce can still escape. It’s an architectural solution to an engineering problem that’s been looming for a decade.
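A toy model makes the ordering argument concrete. The Python sketch below treats a two-die stack as thermal resistances in series, with the heat sink assumed to sit on the top die: all heat exits through the sink, but the bottom die’s heat must first cross the die-to-die interface. Every number here (power draws, thermal resistances, ambient temperature) is a made-up assumption for illustration, and `stack_temperatures` is a hypothetical helper, not anything from Broadcom.

```python
def stack_temperatures(p_top_w, p_bottom_w,
                       r_sink=0.10, r_interface=0.05, t_ambient=40.0):
    """Return (top_die_temp, bottom_die_temp) in degrees C for a two-die stack.

    The heat sink is assumed to sit on the top die. Resistances are in
    kelvin per watt. All default values are illustrative assumptions,
    not measured data.
    """
    # Heat from both dies leaves through the sink on top.
    t_top = t_ambient + r_sink * (p_top_w + p_bottom_w)
    # Only the bottom die's heat has to cross the die-to-die interface.
    t_bottom = t_top + r_interface * p_bottom_w
    return t_top, t_bottom


HOT_DIE_W = 400.0   # ASSUMPTION: high-performance compute die
COOL_DIE_W = 100.0  # ASSUMPTION: SRAM / low-activity die

hot_on_top = stack_temperatures(HOT_DIE_W, COOL_DIE_W)
hot_on_bottom = stack_temperatures(COOL_DIE_W, HOT_DIE_W)

print("Hot die on top:    peak", max(hot_on_top), "C")
print("Hot die on bottom: peak", max(hot_on_bottom), "C")
```

In this simplified model the peak temperature is lower with the hot die on top, because less heat has to cross the interface between the dies. Real packages spread heat in three dimensions, so treat this only as intuition for the ordering, not as a thermal simulation.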
What’s the real differentiator here? It’s about moving beyond the limitations of traditional monolithic chip design. By enabling companies to go vertical, Broadcom is effectively allowing them to build custom, ultra-dense compute engines that are tailored for specific workloads, be it AI training, inference, or high-performance networking. This is the “vertical integration” of chiplets, giving XPU makers far more control over their silicon’s destiny. It’s a subtle but profound shift: more of the decisions that shape a chip now rest with the architects who will deploy it, rather than with whatever a single process node allows.
This is the future of compute – not just faster, but smarter, denser, and a whole lot more power-efficient. And it’s happening, thanks to the unsung heroes of advanced packaging.
Why Does Stacking Matter for AI Compute?
AI workloads are notoriously hungry for both compute power and memory bandwidth, and they’re incredibly sensitive to latency. Traditional chip designs, even those with multiple GPUs on a motherboard, hit bottlenecks as data has to travel off-chip and back. Vertical stacking, as enabled by Broadcom’s XDSiP, collapses these distances. By stacking compute dies atop each other and packing HBM memory stacks into the same package, it slashes latency, significantly reduces the power consumed by data movement, and allows for a much higher density of processing elements. This translates directly into faster AI training times and more efficient AI inference, making advanced AI models more accessible and economically viable.
What is Broadcom’s ‘3.5D’ XDSiP Technology?
Broadcom’s 3.5D Extreme Dimension System in Package (XDSiP) is an advanced packaging technology that integrates multiple chiplets, including vertically stacked compute dies (like GPUs or XPUs) and memory stacks (like HBM), within a single package. The ‘3.5D’ designation signals a level of integration beyond standard 2.5D interposer-based packaging: 3D die-on-die stacking of compute is combined with 2.5D integration of HBM, allowing for direct die-to-die interconnects and a higher degree of vertical density. This approach aims to significantly reduce inter-component latency and power consumption compared to traditional multi-chip configurations on a motherboard, enabling the creation of highly integrated, high-performance compute engines.
Is 3D Stacking the Only Way Forward?
While 3D stacking represents a significant architectural shift and a powerful solution for overcoming current limitations, it’s unlikely to be the only way forward. Monolithic designs will continue to evolve, and other advanced packaging techniques (like advanced 2.5D or novel interconnects) will also play a role. The industry often sees a diversification of approaches, with different solutions optimized for different market segments and performance requirements. However, for cutting-edge AI and HPC applications where latency, power, and density are paramount, 3D stacking, as exemplified by Broadcom’s efforts, is emerging as a critical enabler.
Frequently Asked Questions
What does Broadcom’s 3.5D XDSiP do? Broadcom’s 3.5D XDSiP technology allows multiple chiplets, like processors and memory, to be stacked vertically within a single package. This drastically reduces the distance data has to travel, lowering latency and power consumption for high-performance computing and AI applications.
Will this make my current computer obsolete? Not directly or immediately. This technology is primarily aimed at high-end server CPUs and specialized AI accelerators (XPUs) used in data centers and supercomputers. For your personal computer, the impact will be indirect, potentially leading to more powerful cloud services.
Why are companies using differently sized chiplets in the same stack? Mixing chiplets of different sizes or process-node generations allows for optimization. The highest-performance, most power-intensive compute can be placed on the newest, most advanced (and typically smaller) silicon nodes, while less demanding components like memory controllers or cache can be placed on older, more cost-effective nodes. This also helps with heat management, as the hottest components can be placed where heat can dissipate most effectively.