Nvidia’s machine learning ecosystem advantage

This is a very detailed comparison of the H100/H200 and the MI300X. The sad thing for AMD is that even a year after the MI300X launch, they still haven't achieved the stability needed to get 100% of the performance out of the MI300X.


Some quotes from the article:
A few days ago, after we informed both that we had confirmed an article publication date of December 20th, AMD requested that we delay publication to include results based on a beta WIP development build on an AMD developer’s branch. All of our benchmarking on Nvidia was conducted on publicly available stable release builds. In the spirit of transparency and fairness, we include these results as well as updated testing harness results on both the original November 25th deadline image and the latest publicly available software. However, we believe that the correct way to interpret the results is to look at the performance of the public stable release of AMD/Nvidia software.
Below is AMD’s December 21st development build docker image. As you can see, it uses a number of non-stable development branches for dependencies such as hipBLASLt, AOTriton, ROCm Attention and installs everything, including PyTorch, from source code, taking upwards of 5 hours to build. These versions of the dependencies haven’t even been merged into AMD’s own main branch yet. 99.9% of users will not be installing PyTorch and all of its dependencies from source code on development branches but will instead use the public stable PyPI PyTorch.
AMD’s December 21st Dev build is on a hanging development branch. That means it is a branch that has not been fully QA’ed and is a use-at-your-own-risk branch. There are many concerns about the validity of results obtained from a development build, from development branches, and from building from source code, as most users are not doing this in real life. Most users will be installing AMD/Nvidia PyTorch from the stable PyPI release, so we recommend readers keep this in mind when analyzing these results.
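If you want to see which side of that divide your own environment falls on, checking the installed wheel's metadata makes it obvious whether you are running a stable PyPI build or a from-source/development ROCm build. A minimal sketch (the example version strings in the comments are illustrative, not from the article):

import torch

# Report whether the running PyTorch is a stable PyPI wheel or a from-source
# dev build, and which ROCm or CUDA toolchain it was built against.
print("torch version:", torch.__version__)   # e.g. "2.5.1" (stable wheel) vs "2.6.0a0+git..." (source/dev build)
print("CUDA toolkit:", torch.version.cuda)    # set on CUDA builds, None on ROCm builds
print("ROCm/HIP:", torch.version.hip)         # set on ROCm builds, None on CUDA builds

if torch.cuda.is_available():
    # On ROCm builds, torch.cuda.* is the portability layer over HIP.
    print("device:", torch.cuda.get_device_name(0))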
 
This bit here is very very important.

The only reason we have been able to get AMD performance within 75% of H100/H200 performance is because we have been supported by multiple teams at AMD in fixing numerous AMD software bugs.

To get AMD to a usable state with somewhat reasonable performance, a giant ~60 command Dockerfile that builds dependencies from source, hand-crafted by an AMD principal engineer, was specifically provided for us, since the PyTorch Nightly and public PyTorch AMD images functioned poorly and had version differences.

This docker image requires ~5 hours to build from source and installs dependencies and sub-dependencies (hipBLASLt, Triton, PyTorch, TransformerEngine), a huge difference compared to Nvidia, which offers a pre-built, out-of-the-box experience and takes but a single line of code. Most users do not build PyTorch or hipBLASLt from source code but instead use the stable release.

This echoes what I have been saying about that old ChipsandCheese article: they were in touch with many AMD contacts to produce their questionable results that weren't based in reality, they never contacted any NVIDIA rep to validate their results, and they also used very slow libraries on NVIDIA. When faced with such criticism, the editors shrugged it off and never cared to address it.
 
AMD's lackluster software would explain why NVIDIA is at 90%+ market share for AI hardware.

That article was horrific reading; it sounded like they were describing a start-up, not a long-running company.
 
Exactly.
The most shocking thing of all is that AMD's software stack is built on Nvidia code!

[Attached image: AMD forks of NV libraries]

Many of AMD’s libraries are forked off Nvidia’s open-source or ecosystem libraries. AMD uses a tool called Hipify to carry out source-to-source translation of Nvidia CUDA to AMD HIP. While the motivation is understandable, they are nevertheless building on top of their competitor’s platform and cannot expect to match or surpass Nvidia’s user experience with this software development strategy. They need to contribute their software to the AMD ecosystem. For example, instead of supporting FP8 training by forking Nvidia/TransformerEngine and doing source-to-source translation, they should attempt to get PyTorch native FP8 training to work well on their own hardware. Currently, PyTorch native FP8 training recipes don’t work on AMD, the unit tests don’t even pass yet, and there is no CI/CD for AMD PyTorch native FP8 training.
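For context, "PyTorch native FP8 training" here refers to the float8 recipes that live in the PyTorch ecosystem (torchao) rather than inside TransformerEngine. Below is a minimal sketch of what that path looks like; the API name convert_to_float8_training comes from torchao and may shift between versions, and whether this runs correctly on ROCm is exactly the gap the article describes.

# Minimal sketch of PyTorch-native FP8 training via torchao's float8 recipe.
# Assumes torchao is installed; API names may differ across versions.
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
).to("cuda")  # on ROCm builds, "cuda" maps to the HIP device

# Swap eligible nn.Linear layers so their matmuls run through scaled FP8 kernels.
convert_to_float8_training(model)

# Training then proceeds as usual; this is the path the article says is untested on AMD.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 4096, device="cuda")
loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()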

No vision, no future. Extremely sad for a billion-dollar company.
 
That’s the kind of stuff you don’t see on the PowerPoint slides but when people actually start trying to get work done…

I didn’t know it was that bad. Maybe Nvidia really is a decade ahead of everyone.
 
CUDA’s software stack has a few distinct pillars that are triumphs of software engineering (let alone software architecture). The driver API was built in C, portable across both operating systems and CPU architectures. Across the 6 years I worked on CUDA, no one questioned why I was making sure it ran on Windows as well as Linux. It was healthy for the code base. Today, NVIDIA has parlayed CUDA’s Windows support into a monopoly position in GPU workstations, because 1,200 workstation apps use CUDA.

Another pillar is PTX, which enables NVIDIA to churn the hardware instruction set with abandon, sometimes even committing featurecide and relying on the PTX translator to emulate instructions that were removed. The PTX translation code is in the driver as well as the offline toolchain (ptxas) and is multithreaded, so it can exploit modern multicore CPUs for performance gains proportional to the core count.

Another triumph of software engineering. All of this great software runs on a span of platforms from tiny SOCs for cars and drones and robots, to the biggest supercomputers in the world. So yeah, CUDA is a deep, deep moat. I get pretty offended when folks intimate that any luck was involved. We knew exactly what we were doing and why.

-Nicholas Wilt

 
AMD is a hardware-first company and it takes a lot of effort to shift the mindset and put more focus on software development and developer ecosystem. There's nothing preventing AMD from working on something home-grown while using forked tech as a stop-gap measure. But even if they are working on something that would let them challenge Nvidia, it will take years to get there.
 
There is also one crucial detail that seems to have been overlooked: inter-GPU bandwidth is very low on the MI300X. The H100 gets a full 450GB/s in all directions, while the MI300X only gets 64GB/s in all directions.

The scale up fabric on H100 and H200 is called NVLink and provides 450GByte/s of bandwidth per GPU and connects 8 GPUs together. On the MI300X, the scale up fabric is called xGMI and, on paper, it connects 8 GPUs, providing 448GByte/s of bandwidth per GPU.

First, MI300X’s xGMI is a point-to-point fabric, which means that it isn’t actually providing 448GByte/s of bandwidth between GPU pairs. Instead, each GPU can only talk to any single other GPU at 64GByte/s. A GPU can only reach the stated 448GByte/s if it addresses all 7 other GPUs simultaneously.

In contrast, since NVIDIA’s NVLink uses a switched topology, one GPU can talk to another GPU at the full 450GByte/s.
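The arithmetic behind that difference is simple enough to check; here is a small illustrative calculation using only the figures quoted above, not additional measured data:

# Back-of-the-envelope check of the per-pair bandwidth figures quoted above.
XGMI_TOTAL_GBPS = 448    # MI300X aggregate xGMI bandwidth per GPU (paper spec)
NVLINK_TOTAL_GBPS = 450  # H100/H200 NVLink bandwidth per GPU
PEERS = 7                # other GPUs in an 8-GPU node

# Point-to-point fabric: the aggregate is split across 7 dedicated links,
# so any single GPU pair only ever sees one link's worth of bandwidth.
xgmi_pair = XGMI_TOTAL_GBPS / PEERS
print(f"xGMI per-pair bandwidth:   {xgmi_pair:.0f} GB/s")      # 64 GB/s

# Switched fabric: NVSwitch lets a single pair use the full per-GPU bandwidth.
print(f"NVLink per-pair bandwidth: {NVLINK_TOTAL_GBPS} GB/s")  # 450 GB/s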

Furthermore, the four NVSwitches in the H100/H200 support in-network reduction, referred to as NVLink SHARP (NVLS) and enabled by default, a technique that reduces data movement by carrying out collectives/reductions inside the switch itself.
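From the framework side, nothing special has to be written to benefit from this: NCCL selects the NVLS algorithm on its own when NVSwitch support is present (it can be toggled with the NCCL_NVLS_ENABLE environment variable). A minimal sketch of an all-reduce that would ride this path on an H100/H200 node, assuming a torchrun-style multi-GPU launch:

# Minimal sketch: an all-reduce over NCCL. On H100/H200 nodes with NVSwitch,
# NCCL can perform the reduction in-switch via NVLS; the Python code is
# identical either way. Assumes launch via torchrun with one process per GPU.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# 256 MiB of fp32 per rank, in the message-size range discussed below.
x = torch.randn(64 * 1024 * 1024, device="cuda")

dist.all_reduce(x)          # NCCL picks ring/tree/NVLS based on the topology
torch.cuda.synchronize()

if rank == 0:
    print("all_reduce done; NVLS usage is decided inside NCCL, not in this code")

dist.destroy_process_group()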

This often leads to significantly degraded performance once you increase the number of AMD GPUs, which is why we haven't seen anyone use AMD hardware in large clusters.

The MI300X’s performance decreases as you scale out (i.e. increase) the number of GPUs participating in a collective. As you can imagine, modern frontier training is carried out on clusters of at least 100,000 GPUs. MI300X RoCEv2 runs at half the speed for all the real-world message sizes of 16MiB to 256MiB when compared to the baseline of InfiniBand Non-SHARP.

For all-gather, all-to-all, and reduce-scatter collectives, the MI300X is anywhere from 2 to 4 times slower.
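For anyone who wants to try this kind of comparison on their own hardware, a rough microbenchmark of two of those collectives over the quoted 16MiB to 256MiB message range looks something like the sketch below. It is illustrative only; the article's numbers come from their own harness, not from this snippet.

# Rough sketch: time all-gather and reduce-scatter at the message sizes quoted
# above. Launch with torchrun, one process per GPU; the backend is NCCL on
# Nvidia and RCCL (exposed through the same "nccl" backend name) on ROCm builds.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

def time_collective(fn, iters=20):
    # Warm up, then time with CUDA events for device-side accuracy.
    for _ in range(5):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

for mib in (16, 64, 256):
    n = mib * 1024 * 1024 // 4                       # fp32 elements per rank
    inp = torch.randn(n, device="cuda")
    gathered = torch.empty(n * world, device="cuda")
    scattered = torch.empty(n // world, device="cuda")

    ag_ms = time_collective(lambda: dist.all_gather_into_tensor(gathered, inp))
    rs_ms = time_collective(lambda: dist.reduce_scatter_tensor(scattered, inp))
    if rank == 0:
        print(f"{mib:4d} MiB  all_gather {ag_ms:7.2f} ms   reduce_scatter {rs_ms:7.2f} ms")

dist.destroy_process_group()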
 