If we travel back two years in time, before the rapid rise of AI, much of the datacenter hardware world was chasing CXL. It was promised as the messiah that would bring heterogeneous compute, memory pooling, and composable server architectures. Existing players and a whole host of new startups were rushing to integrate CXL into their products or to create new CXL-based products such as memory expanders, poolers, and switches. Fast forward to 2023 and early 2024, and many of those projects have been quietly shelved, while most of the hyperscalers and large semiconductor companies have almost entirely pivoted away.
...
Currently, CXL availability is the main issue: Nvidia GPUs don't support it, and on AMD's side the technology is limited to the MI300A. While the MI300X can theoretically support CXL in hardware, the capability is not properly exposed. Availability of CXL IP will improve in the future, but there are deeper issues than availability that render CXL irrelevant in the era of accelerated computing.
An H100 GPU has three IO formats: PCIe, NVLink, and C2C (for connecting to Grace). Nvidia has decided to include only the minimum of 16 PCIe lanes because it strongly prefers the latter two, NVLink and C2C. Note that server CPUs, such as AMD's Genoa, go up to 128 lanes of PCIe.
The main reason for this choice is bandwidth. A 16-lane PCIe 5.0 interface has 64 GB/s of bandwidth in each direction. Nvidia's NVLink brings 450 GB/s of bandwidth in each direction to other GPUs, roughly 7x higher. Nvidia's C2C likewise brings 450 GB/s in each direction to the Grace CPU. To be fair, Nvidia dedicates far more beachfront area to NVLink, so we need to include silicon area in that equation; but even so, across a large variety of SoCs we estimate that Ethernet-style SerDes such as Nvidia's NVLink and Google's ICI deliver roughly 3x more bandwidth per mm² of shoreline area.
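To make the gap concrete, here is a quick back-of-the-envelope sketch using the headline figures above (32 GT/s per PCIe 5.0 lane with 128b/130b encoding, and Nvidia's quoted 450 GB/s per direction for NVLink and C2C); real delivered bandwidth is lower once protocol overhead is included.

```python
# Back-of-the-envelope check of the per-direction bandwidth figures quoted above.
PCIE5_GT_PER_LANE = 32            # GT/s per lane for PCIe 5.0
PCIE5_LANES = 16                  # lanes exposed on H100
ENCODING_EFFICIENCY = 128 / 130   # 128b/130b line coding for PCIe 5.0

pcie5_x16_GBps = PCIE5_GT_PER_LANE * PCIE5_LANES * ENCODING_EFFICIENCY / 8
nvlink_GBps = 450                 # H100 NVLink, per direction
c2c_GBps = 450                    # Grace C2C, per direction

print(f"PCIe 5.0 x16 : ~{pcie5_x16_GBps:.0f} GB/s per direction")
print(f"NVLink / C2C : {nvlink_GBps} GB/s per direction "
      f"(~{nvlink_GBps / pcie5_x16_GBps:.1f}x PCIe)")
```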
Therefore, if you are a chip designer in a bandwidth-constrained world, you are making your chip roughly 3x worse when you choose PCIe 5.0 instead of 112G Ethernet-style SerDes. The gap persists as next-generation GPUs and AI accelerators adopt 224G SerDes, maintaining the same roughly 3x advantage over PCIe 6.0 / CXL 3.0. We are in a pad-limited world, and throwing away IO efficiency is an insane tradeoff.
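The persistence of the gap follows directly from the per-lane signaling rates. The comparison below is a rough sketch that ignores encoding overhead and differences in lane and bump pitch, but it shows why moving from one generation to the next does not close the gap.

```python
# Raw per-lane signaling rate, Ethernet-style SerDes vs PCIe, across generations.
generations = {
    "current": {"pcie": ("PCIe 5.0", 32), "ethernet": ("112G SerDes", 112)},
    "next":    {"pcie": ("PCIe 6.0", 64), "ethernet": ("224G SerDes", 224)},
}

for gen, pair in generations.items():
    pcie_name, pcie_rate = pair["pcie"]
    eth_name, eth_rate = pair["ethernet"]
    print(f"{gen:>7}: {eth_name:>12} vs {pcie_name} -> "
          f"{eth_rate / pcie_rate:.1f}x raw rate per lane")
```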
The main scale-up and scale-out interconnects for AI clusters will be proprietary protocols such as Nvidia NVLink and Google ICI, or Ethernet and InfiniBand. PCIe's inferiority here comes down to intrinsic limitations of the PCIe SerDes, even in scale-up formats.
...
Ethernet-style SerDes are much less constrained by the stringent PCIe specifications, allowing them to run much faster and deliver higher bandwidth. The tradeoff is that NVLink has higher latency, but this matters far less in the AI world of massively parallel workloads, where ~100ns versus ~30ns is not a meaningful difference.
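A quick illustration of why: for the message sizes AI workloads actually move over these links, serialization time dwarfs link latency. The transfer size below is a hypothetical example, not a measured figure.

```python
# Hypothetical collective-style transfer showing link latency vs serialization time.
message_bytes = 1 * 1024**3        # assume a 1 GiB gradient shard (illustrative only)
nvlink_GBps = 450                  # per-direction NVLink bandwidth from above
latency_ethernet_style_s = 100e-9  # ~100 ns link latency
latency_pcie_style_s = 30e-9       # ~30 ns link latency

serialization_s = message_bytes / (nvlink_GBps * 1e9)
latency_penalty_s = latency_ethernet_style_s - latency_pcie_style_s

print(f"serialization time : {serialization_s * 1e6:.0f} us")
print(f"extra link latency : {latency_penalty_s * 1e9:.0f} ns "
      f"({latency_penalty_s / serialization_s:.4%} of the transfer time)")
```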
First off, the MI300 AID uses most of its beachfront area for PCIe SerDes rather than Ethernet-style SerDes. While this gives AMD more configurability across IFIS, CXL, and PCIe connectivity, it leaves total IO bandwidth at roughly 1/3 of what Ethernet-style SerDes would provide. AMD needs to move off PCIe-style SerDes for its AI accelerators immediately if it wants any hope of competing with Nvidia's B100. We believe it will for the MI400.