Nvidia’s machine learning ecosystem advantage

This is a very detailed comparison of the H100/H200 and the MI300X. The sad thing for AMD is that even a year after the MI300X launch, they still haven't achieved the software stability needed to get full performance out of the MI300X.


Some quotes from the article:
A few days ago, after we informed both companies that we had confirmed an article publication date of December 20th, AMD requested that we delay publication to include results based on a beta WIP development build on an AMD developer’s branch. All of our benchmarking on Nvidia was conducted on publicly available stable release builds. In the spirit of transparency and fairness, we include these results as well as updated testing harness results on both the original November 25th deadline image and the latest publicly available software. However, we believe that the correct way to interpret the results is to look at the performance of the public stable release of AMD/Nvidia software.
Below is AMD’s December 21st development build docker image. As you can see, it uses a number of non-stable development branches for dependencies such as hipBLASLt, AOTriton, and ROCm Attention, and installs everything, including PyTorch, from source code, taking upwards of 5 hours to build. These versions of the dependencies haven’t even been merged into AMD’s own main branch yet. 99.9% of users will not be installing PyTorch and all of its dependencies from source code on development branches; they will instead use the public stable PyPI PyTorch.
AMD’s December 21st dev build is on a hanging development branch. That means it is a branch that has not been fully QA’ed and is a use-at-your-own-risk branch. There are many concerns about the validity of results obtained from a development build, development branches, and building from source code, as most users are not doing this in real life. Most users will be installing AMD/Nvidia PyTorch from the stable PyPI release, so we recommend readers keep this in mind when analyzing these results.
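For anyone wondering how to tell which kind of build they are actually running, here is a minimal sketch using standard torch version attributes (the example outputs in the comments are illustrative, not measured):

```python
# Quick check of where your PyTorch build came from (stable wheel vs. source/dev build).
# These attributes are part of standard PyTorch; the example values are illustrative.
import torch

print(torch.__version__)           # stable PyPI ROCm wheel looks like "2.x.y+rocmA.B";
                                   # a from-source dev build typically looks like "2.x.0a0+git<hash>"
print(torch.version.hip)           # HIP/ROCm version the build targets (None on CUDA builds)
print(torch.version.cuda)          # CUDA version on Nvidia builds (None on ROCm builds)
print(torch.version.git_version)   # exact commit the binary was built from
```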
 
This bit here is very very important.

The only reason we have been able to get AMD performance within 75% of H100/H200 performance is because we have been supported by multiple teams at AMD in fixing numerous AMD software bugs.

To get AMD to a usable state with somewhat reasonable performance, a giant ~60-command Dockerfile that builds dependencies from source, hand-crafted by an AMD principal engineer, was specifically provided for us, since the PyTorch nightly and public PyTorch AMD images functioned poorly and had version differences.

This docker image requires ~5 hours to build from source and installs dependencies and sub-dependencies (hipBLASLt, Triton, PyTorch, TransformerEngine), a huge difference compared to Nvidia, which offers a pre-built, out-of-the-box experience that takes but a single line of code. Most users do not build PyTorch or hipBLASLt from source but instead use the stable release.

This echoes what I have been saying about that old ChipsandCheese article: they were in touch with many AMD contacts to deliver their questionable, not-based-on-reality results, yet they never contacted any NVIDIA rep to validate those results, and they also used very slow libraries on NVIDIA. When faced with such criticism, the editors shrugged it off and never cared to address it.
 
AMD's lackluster software would explain why NVIDIA holds 90%+ market share in AI hardware.

That article was horrific reading; it sounded like they were describing a start-up, not a long-running company.
 
Exactly.
The most shocking thing of all is that AMD's software stack is built on Nvidia code!

[Image: 450-amd-forks-of-NV-libraries.jpg]

Many of AMD’s libraries are forked off Nvidia’s open-source or ecosystem libraries. AMD uses a tool called Hipify to carry out source-to-source translation of Nvidia CUDA to AMD HIP. While the motivation is understandable, they are nevertheless building on top of their competitor’s platform and cannot expect to match or surpass Nvidia’s user experience with this software development strategy. They need to contribute their software to the AMD ecosystem. For example, instead of supporting FP8 training by forking Nvidia/TransformerEngine and doing source-to-source translation, they should make PyTorch native FP8 training work well on their own hardware. Currently, PyTorch native FP8 training recipes don’t work on AMD, the unit tests don’t even pass, and there is no CI/CD for PyTorch native FP8 training on AMD.
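To make the “PyTorch native FP8 training” point concrete, here is a rough sketch of what such a recipe looks like; it assumes torchao’s float8 API (convert_to_float8_training), and whether this path actually works on MI300X is exactly the gap the article describes:

```python
# Rough sketch of a "PyTorch native" FP8 training step using torchao's float8 recipe.
# Assumes torchao is installed and exposes convert_to_float8_training; on AMD hardware
# this is precisely the path described as not working yet.
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
).to(device="cuda", dtype=torch.bfloat16)

# Swap eligible nn.Linear layers for Float8Linear so their matmuls run in FP8
# with dynamic scaling, instead of relying on a forked TransformerEngine.
convert_to_float8_training(model)

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(32, 4096, device="cuda", dtype=torch.bfloat16)

loss = model(x).float().pow(2).mean()
loss.backward()
opt.step()
```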

No vision, no future. Extremely sad for a billion-dollar company.
 
That’s the kind of stuff you don’t see on the PowerPoint slides, but it’s what shows up when people actually start trying to get work done…

I didn’t know it was that bad. Maybe Nvidia really is a decade ahead of everyone.
 