Nvidia shows signs in [2023]

August 29, 2023
As generative AI and large language models (LLMs) continue to drive innovations, compute requirements for training and inference have grown at an astonishing pace.

To meet that need, Google Cloud today announced the general availability of its new A3 instances, powered by NVIDIA H100 Tensor Core GPUs. These GPUs bring unprecedented performance to all kinds of AI applications with their Transformer Engine — purpose-built to accelerate LLMs.
Today, there are over a thousand generative AI startups building next-generation applications, many using NVIDIA technology on Google Cloud. Some notable ones include Writer and Runway.

Writer uses transformer-based LLMs to enable marketing teams to quickly create copy for web pages, blogs, ads and more. To do this, the company harnesses NVIDIA NeMo, an application framework from NVIDIA AI Enterprise that helps companies curate their training datasets, build and customize LLMs, and run them in production at scale.

Using NeMo optimizations, Writer developers have gone from working with models with hundreds of millions of parameters to 40-billion parameter models. The startup’s customer list includes household names like Deloitte, L’Oreal, Intuit, Uber and many other Fortune 500 companies.

Runway uses AI to generate videos in any style. The AI model imitates specific styles prompted by given images or through a text prompt. Users can also use the model to create new video content using existing footage. This flexibility enables filmmakers and content creators to explore and design videos in a whole new way.
The companies have also made NVIDIA AI Enterprise available on Google Cloud Marketplace and integrated NVIDIA acceleration software into the Vertex AI development environment.
Powered by Arm Neoverse V2 cores, the Grace CPU will be used in NVIDIA's Superchips, which come in both CPU+CPU and CPU+GPU flavors. NVIDIA recently announced its most powerful GPU for AI and compute workloads, the GH200, which comes with the world's fastest HBM3e memory and will be adopted by the Grace Hopper Superchip.
For the Hot Chips 2023 presentation, NVIDIA's Chief Scientist, Bill Dally, presented performance comparisons between an NVIDIA Grace Superchip and competing dual-socket x86 solutions. These include AMD's EPYC 9654, the fastest 96-core, 192-thread solution, and Intel's flagship, the Xeon Platinum 8480+, which features 56 cores and 112 threads. Since both solutions ran in a dual-socket configuration, that's a total of 192 cores for AMD's platform and 112 cores for Intel's.
We know from the official NVIDIA Grace CPU specs that the Grace Superchip offers a total of 144 (72 Arm Neoverse V2 per chip) cores, supports up to 960 GB of LPDDR5X memory with up to 1 TB/s of raw bandwidth, and has a combined power draw of 500W. Additional specs include 117 MB of L3 cache, and 58 Gen5 lanes, all while using the TSMC 4N process node.
The benchmarks selected by NVIDIA cover a wide spectrum of server applications such as Weather WRF, MD CP2K, Climate NEMO, CFD OpenFOAM, and Graph Analytics GapBS BFS. Across these benchmarks, NVIDIA's Grace Superchip offers up to 40% better performance than AMD's Genoa CPUs while sitting well ahead of Intel's Sapphire Rapids CPUs. In the majority of benchmarks Grace was on par with Genoa, and even that is a good result for Grace, since two of those chips have a combined TDP of 640W (320W per EPYC 9654) whereas the Grace Superchip runs at 500W.
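To make the efficiency argument concrete, here is a rough back-of-the-envelope sketch using only the power figures quoted above. It assumes the quoted TDP and power-draw numbers can stand in for actual power consumption, which is a simplification; the function name and constants are illustrative, not from NVIDIA's presentation.

```python
# Power figures as quoted in the post above (watts).
GRACE_SUPERCHIP_W = 500
EPYC_9654_W = 320                        # per socket, as stated in the post
GENOA_DUAL_SOCKET_W = 2 * EPYC_9654_W    # 640 W for the dual-socket system


def perf_per_watt_ratio(relative_perf: float) -> float:
    """Grace perf/W divided by Genoa perf/W.

    relative_perf is Grace performance relative to dual-socket Genoa
    (1.0 = parity, 1.4 = 40% faster). Assumes quoted power == real power.
    """
    return relative_perf * GENOA_DUAL_SOCKET_W / GRACE_SUPERCHIP_W


at_parity = perf_per_watt_ratio(1.0)   # benchmarks where the two are on par
at_best = perf_per_watt_ratio(1.4)     # the "up to 40% better" case

print(f"perf/W advantage at parity: {at_parity:.2f}x")
print(f"perf/W advantage at +40%:   {at_best:.2f}x")
```

Under these assumptions, even a benchmark where Grace merely ties Genoa works out to roughly a 1.3x advantage in performance per watt.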

However, the comparison gets even more interesting when applied to an actual large-scale data center scenario. A 5 MW data center throughput benchmark shows that NVIDIA's Grace Superchips can offer up to 2.5x the performance while being far more power-efficient in the same benchmarks.
What’s the state of HPC on Arm? I’m guessing there’s a ton of datacenter software written for x86 that won’t be ported anytime soon. Grace probably has a better shot at AI workloads.
That's impressive. A lot of stuff is running on x86, but I wouldn't be surprised if Nvidia helps the crap out of some devs to port their stuff over to Arm.
My guess is the state is pretty good. Nvidia has some Dev Kits to assist HPC data centers porting x86 code to Grace.
The NVIDIA Arm HPC Developer Kit is an integrated hardware and software platform for creating, evaluating, and benchmarking HPC, AI, and scientific computing applications on a heterogeneous GPU- and CPU-accelerated computing system. NVIDIA announced its availability in March of 2021.

The kit is designed as a stepping stone to the next-generation NVIDIA Grace Hopper Superchip for HPC and AI applications. It can be used to identify non-obvious x86 dependencies and ensure software readiness ahead of NVIDIA Grace Hopper systems available in 1H23. For more information, see the NVIDIA Grace Hopper Superchip Architecture whitepaper.
Eleven different teams carried out the evaluation work:

  • Oak Ridge National Laboratory
  • Sandia National Laboratories
  • University of Illinois at Urbana-Champaign
  • Georgia Institute of Technology
  • University of Basel
  • Swiss National Supercomputing Centre (CSCS)
  • Helmholtz-Zentrum Dresden-Rossendorf
  • University of Delaware
Table 1 summarizes the final list of applications and their various characteristics. The applications cover eight different scientific domains and include codes written in Fortran, C, and C++. The parallel programming models used were MPI, OpenMP/OpenACC, Kokkos, Alpaka, and CUDA. No changes were made to the application codes during the porting activities. The evaluation process primarily focused on application porting and testing, with less emphasis on absolute performance considering the experimental nature of the testbed.

This post covers the results for four of the applications. For more information about the other applications, see Application Experiences on a GPU-Accelerated Arm-based HPC Testbed.
Not without pre-conditions ... I thought it would be unusual to offer this to just anyone.
The offering from NERSC is undoubtedly generous, and the scientific center could make some easy money if it were offering its capacity commercially.

However, the problem is that they only offer it to existing NERSC users who use the Perlmutter supercomputer for scientific research. Since these users were on summer break, they were probably not running their workloads on the supercomputer and are not going to until the end of the year. At least some of the GPU nodes were idling for some time, which raises the question of why the organization does not backfill its idle capacity with commercial workloads.

While using supercomputers built by the U.S. government for commercial AI and HPC workloads would have brought a lot of money that could be spent to advance American supercomputer prowess, this is not something that institutions like NERSC do.

The U.S. Department of Energy supercomputers are meant to be used primarily for matters of national security or by pre-selected users, including those whose research could lead to commercial applications. As a result, these machines are not available to everyone.

Not to be outdone, Nvidia shared H100 results that demonstrate why it remains the one to beat. As always, the company ran and won every benchmark. No surprise there. Most interesting was the first submission for Grace Hopper, the Arm CPU and Hopper GPU Superchip, which has become the flagship of the Nvidia fleet for AI inferencing.

Note that in these benchmarks, while the H100 is clearly the fastest, the competition also continues to improve. Both Intel and Qualcomm showed excellent results fueled by software improvements. The Qualcomm Cloud AI100 results were achieved using a fraction of the power consumed by the rest of the party.
The Grace Hopper results beat the H100 results by up to 17%. Grace Hopper has more HBM and higher networking speed between the Arm CPU and the GPU than is possible with an x86 CPU and a discrete Hopper. And Nvidia announced that it has enabled “Automatic Power Steering.” No, this is not to steer your car; this steers the right amount of power to each chip to maximize performance at the lowest power consumption.

But 17% should not be a surprise. I had expected a more significant advantage. Nvidia indicated that we would soon learn how server configurations will change when OEMs bring out GH200-based servers next month. There will be no dual x86 CPUs, no PCIe card cages and potentially no off-package memory. Just the GH Superchip with HBM, a power supply, and a NIC. Done. The simplification afforded by Grace Hopper will substantially lower the costs of the server and inference processing; the CPU will essentially be free.

But, unfortunately for competitors, Nvidia announced a new TensorRT for LLMs just last Friday, which they say doubled the performance of the H100 in inference processing after the benchmarks were submitted. Nvidia didn’t have the software ready in time for MLPerf.

While the MLCommons peer review process has not verified this claim, it indicates the massive improvements in inference efficiency possible with clever software. GPUs are underutilized in inferencing today, which means there is a repository of opportunities Nvidia can mine. And there is nothing here that is proprietary to Nvidia; in fact, TensorRT-LLM is open-source code, so in theory, Intel, AMD, and everyone else could implement in-flight batching and other TensorRT-LLM concepts and realize some performance boost themselves.
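For readers wondering why in-flight batching helps utilization, here is a toy simulation. It is only a sketch of the general idea, not TensorRT-LLM's API or implementation: the function names are made up, each request is reduced to a number of decode steps, and every step is assumed to cost the same regardless of how many batch slots are occupied.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Steps needed when each batch must fully finish before the next starts."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch waits for its longest request.
        total += max(lengths[i:i + batch_size])
    return total


def inflight_batching_steps(lengths, batch_size):
    """Steps needed when a finished request's slot is refilled immediately."""
    queue = deque(lengths)
    active = []  # remaining decode steps for each running request
    steps = 0
    while queue or active:
        # Fill any free slots from the waiting queue.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Requests with one step left finish now; the rest advance by one.
        active = [r - 1 for r in active if r > 1]
    return steps


# Hypothetical workload: short and long generations interleaved.
lengths = [2, 8, 2, 8, 2, 8]
static_steps = static_batching_steps(lengths, batch_size=2)
inflight_steps = inflight_batching_steps(lengths, batch_size=2)
print(static_steps, inflight_steps)
```

In this toy workload the short requests no longer hold a slot idle while the long ones finish, so the same work completes in fewer steps. When all requests have the same length, the two schedules take the same number of steps.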

Finally, talking about software, Nvidia has improved the performance of its edge AI Jetson platform by 60-80% since the MLPerf 3.0 benchmarks were run just six months ago. We would assume that TensorRT-LLM could have an impact here. This isn’t a platform for LLMs today, but that could change as LLMs eat the world.

Today, NVIDIA is releasing its first performance benchmarks within the MLPerf Inference v3.1 benchmark suite, which covers a wide range of industry-standard benchmarks for AI use cases. These workloads include Recommender, Natural Language Processing, Large Language Model, Speech Recognition, Image Classification, Medical Imaging, and Object Detection.
The two new benchmarks are DLRM-DCNv2 and GPT-J 6B. The first is a larger multi-hot dataset representation of real recommenders which uses a new cross-layer algorithm to deliver better recommendations and has twice the parameter count of the previous version. GPT-J, on the other hand, is a small-scale LLM whose base model is open source and was released in 2021. This workload is designed for summarization tasks.
In terms of performance benchmarks, the NVIDIA H100 was tested across the entire MLPerf v3.1 Inference set (Offline) against competitors from Intel (HabanaLabs), Qualcomm (Cloud AI 100) and Google (TPUv5e). NVIDIA delivered leadership performance across all workloads.
But coming back to the benchmarks, NVIDIA's GH200 Grace Hopper Superchip also made its first submission on MLPerf, yielding a 17% improvement over the H100 GPU. This performance gain is mainly coming from higher VRAM capacities (96 GB HBM3 vs. 80 GB HBM3) and 4TB/s bandwidth.
The NVIDIA L4 GPU, which is based on the Ada Lovelace GPU architecture, also made a strong entry in MLPerf v3.1. It was not only able to run all workloads but did so very efficiently, running up to 6x faster than modern x86 CPUs (Intel 8380 Dual-Socket) at a 72W TDP in an FHFL form factor. The L4 GPU also offered a 120x increase in Video/AI tasks such as decoding, inferencing, and encoding. Lastly, the NVIDIA Jetson Orin got an up to 84% performance boost thanks to software updates, which shows NVIDIA's commitment to improving its software stack.
In Q2 of fiscal year 2024, Nvidia disclosed datacenter hardware sales totaling $10.3 billion. Analysis by market research firm Omdia indicates that during Q2 of 2023, Nvidia shipped over 900 tons (equivalent to 1.8 million pounds) of H100 compute GPUs, tailored for artificial intelligence (AI) and high-performance computing (HPC) applications.

These findings are based on an assumption that an average Nvidia H100 compute GPU with a heatsink weighs more than 3 kilograms (around 6.6 pounds), suggesting Nvidia's distribution of approximately 300,000 H100 GPUs during the quarter.
For the upcoming quarters, Omdia predicts Nvidia will maintain similar GPU shipment volumes. They project a potential total of roughly 3,600 tons (or 7.2 million pounds) of H100 GPUs shipped annually, which implies a production of about 1.2 million H100 GPUs if the trend remains consistent.
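The arithmetic behind these estimates is easy to reproduce. The sketch below assumes metric tons (1,000 kg) and the 3 kg average weight quoted above; the helper name is illustrative.

```python
KG_PER_TON = 1000        # assumption: metric tons
H100_WEIGHT_KG = 3       # average H100 with heatsink, per the estimate above
QUARTERLY_TONS = 900     # shipped weight per quarter, per the estimate above


def gpus_from_tons(tons, kg_per_gpu=H100_WEIGHT_KG):
    """Convert a shipped weight in tons to an approximate GPU count."""
    return tons * KG_PER_TON // kg_per_gpu


quarterly_gpus = gpus_from_tons(QUARTERLY_TONS)      # one quarter
annual_tons = 4 * QUARTERLY_TONS                     # same volume every quarter
annual_gpus = gpus_from_tons(annual_tons)            # full year

print(quarterly_gpus, annual_tons, annual_gpus)
```

Note that the pound figures quoted above (1.8 million and 7.2 million) correspond to 2,000-pound short tons, while the 3 kg per GPU arithmetic only works out to 300,000 units with metric tons; 900 metric tons is closer to 1.98 million pounds. Either way these are rough estimates.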

The aforementioned estimate points to over 300,000 H100 GPUs being shipped in just one quarter. It should be noted that the H100 figure might also include Nvidia's China-centric H800 processors. Moreover, Nvidia persistently provides substantial amounts of its prior-generation A100, A800, and A30 compute GPUs to enterprises that utilize them for AI inference and technical computing. Therefore, Nvidia's cumulative quarterly GPU shipments are likely higher than 300,000 units and may surpass 900 tons.