Nvidia Hopper Speculation, Rumours and Discussion

Nvidia coprocessors can be found in 154 of the TOP500 supercomputers; only seven of the systems use AMD Instinct cards.


Good PR. Just try to read between the lines though.
You are totally misreading the article. Supercomputers use a lot of GPUs and CPUs, and one small failure can have a big impact on results (chaos theory, the butterfly effect). That's the challenge, and Nvidia faces the same issue. When you have 10,000 GPUs, of course 100 could be dead; that's only 1% of all GPUs, so no big deal. So please stop your Nvidia propaganda.
 

Grace Hopper benchmark results are starting to crop up.
Kicking off our NVIDIA GH200 Grace Hopper benchmarking at Phoronix is an initial look at the 72-core Grace CPU performance with 96GB of HBM3 memory. Here are some initial benchmarks of the Grace CPU performance while the Hopper GPU benchmarks will be coming in a follow-up article.
...
NVIDIA's GH200 combines the 72-core Grace CPU with H100 Tensor Core GPU and support for up to 480GB of LPDDR5 memory and 96GB of HBM3 or 144GB of HBM3e memory. The Grace CPU employs Arm Neoverse-V2 cores with 1MB of L2 cache per core and 117MB of L3 cache.

The other processors featured for this initial GH200 CPU benchmarking included:

- EPYC 8534P
- EPYC 8534PN
- EPYC 9554
- EPYC 9554 2P
- EPYC 9654
- EPYC 9654 2P
- EPYC 9684X
- EPYC 9684X 2P
- EPYC 9754
- EPYC 9754 2P
- Xeon Platinum 8280 2P
- Xeon Platinum 8380
- Xeon Platinum 8380 2P
- Xeon Platinum 8490H
- Xeon Platinum 8490H 2P
- Xeon Platinum 8592+
- Xeon Platinum 8592+ 2P
- Ampere Altra Max M128-30
- GPTshop.ai GH200

Other CPU performance tests ...
The benchmarks shown at the HPC Asia conference last week are perhaps the most detailed we've seen thus far, with the Barcelona and New York researchers each presenting their findings at the conference. Each group tested differently: the Barcelona benchmarks focus on Grace's performance relative to Skylake-X, while the New York tests compare Grace to a variety of AMD and Intel CPUs.

 
Paper includes architectural topics as well as H800 benchmarks.

21 Feb 2024
In this research, we propose an extensive benchmarking study focused on the Hopper GPU. The objective is to unveil its microarchitectural intricacies through an examination of the new instruction-set architecture (ISA) of Nvidia GPUs and the utilization of new CUDA APIs. Our approach involves two main aspects. Firstly, we conduct conventional latency and throughput comparison benchmarks across the three most recent GPU architectures, namely Hopper, Ada, and Ampere. Secondly, we delve into a comprehensive discussion and benchmarking of the latest Hopper features, encompassing the Hopper DPX dynamic programming (DP) instruction set, distributed shared memory, and the availability of FP8 tensor cores.
...
This paper delves into the memory hierarchy and tensor core performance of the three newest Nvidia GPU architectures using instruction-level benchmarks. We found that the Hopper architecture shows advantages in both memory bandwidth and tensor core throughput that are consistent with official claims. It is worth noting that, on the tensor cores, the latest wgmma instructions are needed to extract the full performance of the fourth-generation tensor core. We analyze AI performance across diverse architectures at the library and application levels, emphasizing the impact of varied precisions. Experiments show that when the operation scale is relatively large, low-precision data types show greater advantages. Additionally, we explore key features of the Hopper architecture: DPX, asynchronous data movement, and distributed shared memory. Our research enhances comprehension of the latest architecture's traits and performance, aiding optimized algorithm design and application performance.
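For anyone who wants to poke at one of those features directly: distributed shared memory is exposed through thread block clusters in CUDA 12 and only works on sm_90. Below is a minimal sketch, with the kernel name, block/cluster sizes, and values chosen purely for illustration (compile with nvcc -arch=sm_90):

```cuda
#include <cstdio>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Each block writes its own shared memory, then reads the neighbouring
// block's shared memory through the cluster mapping (Hopper / sm_90 only).
__global__ void __cluster_dims__(2, 1, 1) dsm_exchange(int *out)
{
    __shared__ int smem[32];
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();

    smem[threadIdx.x] = rank * 100 + threadIdx.x;

    // Make this block's shared memory visible to its cluster peers.
    cluster.sync();

    // Map the peer block's shared memory into this block's address space.
    unsigned int peer = (rank + 1) % cluster.num_blocks();
    int *remote = cluster.map_shared_rank(smem, peer);
    out[rank * 32 + threadIdx.x] = remote[threadIdx.x];

    // Keep the peer's shared memory alive until all reads have finished.
    cluster.sync();
}

int main()
{
    int *out;
    cudaMallocManaged(&out, 2 * 32 * sizeof(int));

    // Grid size must be a multiple of the cluster size declared above.
    dsm_exchange<<<2, 32>>>(out);
    cudaDeviceSynchronize();

    printf("block 0 read %d from block 1\n", out[0]);   // expect 100
    cudaFree(out);
    return 0;
}
```

The first cluster.sync() guarantees the peer's shared memory is populated before it is read, and the second keeps it resident until everyone is done; that map_shared_rank() path is the same cluster mapping the paper's distributed shared memory benchmarks exercise.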
 
Phew, I was worried there for a second that they were doing a lot of the same tests I've been working on, but there's not much overlap, and it's a lot less low-level than what I'm planning to analyse. I haven't looked into Distributed Shared Memory properly yet, but I can't make sense of their results, so either they're doing something wrong or I'm going to have to spend more time than I expected to figure it out...

Also, if anyone has anything specific they'd really like tested/analysed on H100 (and AD102/A100 potentially, if I have the time), let me know - I'm running dangerously short on time given my intention to publish something before NVIDIA GTC, though.
 
March 14, 2024
If we travel back 2 years in time, before the rapid rise of AI, much of the datacenter hardware world was chasing CXL. It was promised as the messiah that would bring heterogeneous compute, memory pooling, and composable server architectures. Existing players and a whole host of startups were rushing to integrate CXL into their products or to create new CXL-based products such as memory expanders, poolers, and switches. Fast forward to 2023 and early 2024, and many projects have been quietly shelved, and many of the hyperscalers and large semiconductor companies have almost entirely pivoted away.
...
Currently, CXL availability is the main issue, as Nvidia GPUs don't support it, and with AMD the technology is limited to the MI300A. While the MI300X can theoretically support CXL in hardware, it is not exposed properly. Availability of CXL IP will improve in the future, but there exist deeper issues than just availability that render CXL irrelevant in the era of accelerated computing.

An H100 GPU has 3 IO formats: PCIe, NVLink, and C2C (for connecting to Grace). Nvidia has decided to include only the minimum 16 PCIe lanes, as it strongly prefers the latter two, NVLink and C2C. Note that server CPUs, such as AMD's Genoa, go up to 128 lanes of PCIe.

The main reason for this choice is bandwidth. A 16-lane PCIe 5.0 interface has 64GB/s of bandwidth in each direction. Nvidia's NVLink brings 450GB/s of bandwidth in each direction to other GPUs, roughly 7x higher. Nvidia's C2C also brings 450GB/s in each direction to the Grace CPU. To be fair, Nvidia dedicates far more beachfront area to NVLink, so we need to include silicon area in that equation; but even so, we estimate that per mm² across a large variety of SoCs, Ethernet-style SerDes such as Nvidia NVLink, Google ICI, etc. have 3x more bandwidth per unit of shoreline area.
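That per-direction gap is easy to see from software. Below is a minimal pinned-memory copy sketch (the 1 GiB buffer and 10 iterations are arbitrary choices): on a PCIe-attached H100 it should plateau somewhere under the 64GB/s ceiling, while on a GH200, where host memory sits behind the 450GB/s C2C link, the same code should report several times more.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Times repeated host-to-device copies from pinned memory and reports GB/s.
int main()
{
    const size_t bytes = 1ull << 30;   // 1 GiB per copy (arbitrary)
    const int    iters = 10;

    void *host, *dev;
    cudaMallocHost(&host, bytes);      // pinned host memory, needed for full-speed DMA
    cudaMalloc(&dev, bytes);

    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   // warm-up

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host-to-device bandwidth: %.1f GB/s\n",
           (double)iters * bytes / (ms * 1e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```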

Therefore, if you are a chip designer in a bandwidth-constrained world, you are making your chip roughly 3x worse when you choose PCIe 5.0 instead of 112G Ethernet-style SerDes. The gap persists as next-generation GPUs and AI accelerators adopt 224G SerDes, which keeps roughly the same 3x advantage over PCIe 6.0 / CXL 3.0. We are in a pad-limited world, and throwing away IO efficiency is an insane tradeoff.

The main scale-up and scale-out interconnects for AI clusters will be proprietary protocols such as Nvidia NVLink and Google ICI, or Ethernet and InfiniBand. PCIe's inferiority here is due to intrinsic PCIe SerDes limitations, even in scale-up formats.
...
Ethernet-style SerDes are much less constrained by the stringent PCIe specifications, allowing them to be much faster and deliver higher bandwidth. The tradeoff is that NVLink has higher latency, but this doesn't matter much in the AI world of massively parallel workloads, where ~100ns vs ~30ns is not a meaningful difference.

First off, the MI300 AID uses most of its beachfront area for PCIe SerDes instead of Ethernet-style SerDes. While this gives AMD more configurability in terms of IFIS, CXL, and PCIe connectivity, it results in total IO of about 1/3 that of Ethernet-style SerDes. AMD needs to move off PCIe-style SerDes for its AI accelerators immediately if it wants any hope of competing with Nvidia's B100. We believe it will for MI400.
 


TL;DR

  • A single NVIDIA GH200, enhanced by ZeRO-Inference, effectively handles LLMs up to 176 billion parameters.
  • It significantly improves inference throughput compared to a single NVIDIA H100 or A100 Tensor Core GPU.
  • The superchip’s GPU-CPU 900GB/s bidirectional NVLink Chip-to-Chip (C2C) bandwidth is key to its superior performance.

We first tested the BLOOM 176B model with CPU offload, where the NVIDIA GH200 Grace Hopper Superchip achieved significantly higher throughput than H100 and A100 GPUs, showcasing the benefits of its high bandwidth. Specifically, the GH200 produces 4-5x the inference throughput of the H100 and 9-11x that of the A100, thanks to the significantly higher chip-to-chip bandwidth that removes the communication bottleneck for CPU offload.
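Mechanically, CPU offload is weight streaming: parameters stay in host memory and are copied to the GPU just in time, overlapped with compute, so end-to-end throughput is gated by the host-to-GPU link. A minimal double-buffered sketch of that pattern is below; dummy_layer, the layer count, and the buffer sizes are placeholders, not ZeRO-Inference's actual implementation.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Stand-in for a transformer layer's compute; only there to occupy the GPU.
__global__ void dummy_layer(const float *weights, float *activations, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&activations[i % 4096], weights[i] * 0.001f);
}

int main()
{
    const int    layers      = 8;                 // placeholder layer count
    const size_t layer_elems = 64ull << 20;       // 64M floats (256 MB) per "layer"
    const size_t layer_bytes = layer_elems * sizeof(float);

    // The offloaded copy of the weights lives in pinned host memory.
    float *host_weights;
    cudaMallocHost(&host_weights, layers * layer_bytes);
    memset(host_weights, 0, layers * layer_bytes);

    // Two device buffers so layer l+1 can be fetched while layer l computes.
    float *dev_buf[2], *activations;
    cudaMalloc(&dev_buf[0], layer_bytes);
    cudaMalloc(&dev_buf[1], layer_bytes);
    cudaMalloc(&activations, 4096 * sizeof(float));
    cudaMemset(activations, 0, 4096 * sizeof(float));

    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    // Prefetch layer 0 before entering the loop.
    cudaMemcpyAsync(dev_buf[0], host_weights, layer_bytes,
                    cudaMemcpyHostToDevice, copy_stream);
    cudaStreamSynchronize(copy_stream);

    for (int l = 0; l < layers; ++l) {
        int cur = l & 1, nxt = cur ^ 1;

        // Start streaming the next layer's weights over the host-GPU link...
        if (l + 1 < layers)
            cudaMemcpyAsync(dev_buf[nxt],
                            host_weights + (size_t)(l + 1) * layer_elems,
                            layer_bytes, cudaMemcpyHostToDevice, copy_stream);

        // ...while the current layer computes.
        dummy_layer<<<(unsigned)((layer_elems + 255) / 256), 256, 0, compute_stream>>>(
            dev_buf[cur], activations, layer_elems);

        // Both the copy and the compute must finish before the buffers swap roles.
        cudaStreamSynchronize(copy_stream);
        cudaStreamSynchronize(compute_stream);
    }

    printf("streamed %d layers (%zu MB each) through %zu MB of device memory\n",
           layers, layer_bytes >> 20, 2 * (layer_bytes >> 20));

    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
    cudaFree(dev_buf[0]);
    cudaFree(dev_buf[1]);
    cudaFree(activations);
    cudaFreeHost(host_weights);
    return 0;
}
```

With the same code, the only thing that changes between a PCIe-attached H100 and a GH200 is how fast the copy_stream side keeps up, which is where the 900GB/s bidirectional C2C link should pay off.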



Our second benchmark tested throughput with and without CPU offload on the GH200, to see whether CPU offload can produce higher overall throughput. The figure below shows that, at the same batch size, CPU-offload inference indeed slightly reduces throughput. Nonetheless, CPU offload also enables running inference with a larger batch size (in this case, 32) and produces the highest overall throughput, 1.5x the best throughput of inference without CPU offload (batch size 32 versus batch size 16, the maximum without offload).

[Figure: GH200 inference throughput with and without CPU offload across batch sizes]
 
Does this fluid dynamics application really use OpenCL 1.2, released in 2011? 😅

It's weird they even included the macOS benchmarks, as OpenCL was deprecated on that platform back in 2018.
 
I think he borrows benchmarks from past reviews (not all of them current), so the results reflect whatever optimizations those external reviews included or excluded at the time.
Would be a bit more relevant if he had included GH200 benchmarks.
 
Meta recently released a study detailing its Llama 3 405B model training run on a cluster containing 16,384 Nvidia H100 80GB GPUs. The training run took place over 54 days and the cluster encountered 419 unexpected component failures during that time, averaging one failure every three hours. In half of the failure cases, GPUs or their onboard HBM3 memory were to blame.
...
The scale and synchronous nature of 16,384-GPU training makes it prone to failures. If failures aren't mitigated correctly, a single GPU failure can disrupt the entire training job and necessitate a restart. However, the Llama 3 team maintained an effective training time of over 90%.
...
To enhance efficiency, Meta's team reduced job startup and checkpointing times and developed proprietary diagnostic tools. PyTorch’s NCCL flight recorder was used extensively to quickly diagnose and resolve hangs and performance issues, particularly related to NCCLX. This tool captures collective metadata and stack traces, aiding in swift problem resolution.
 

Researchers Benchmark Nvidia's GH200 Supercomputing Chips

September 4, 2024
Nvidia is putting its GH200 chips in European supercomputers, and researchers are getting their hands on those systems and releasing research papers with performance benchmarks. In the first paper, Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip, researchers benchmarked various applications of the GH200, which has an integrated CPU and GPU. The numbers highlighted the chip’s blazing speed and how AI and scientific application performance can benefit from the localized HBM3 and DDR5 memory.

One benchmark from the Alps system — which is still being upgraded — measures the GH200 performance when running AI applications.

Another paper, Boosting Earth System Model Outputs And Saving PetaBytes in their Storage Using Exascale Climate Emulators, compares the performance of large GH200 clusters against AMD's MI250X in Frontier, the Nvidia A100 in Leonardo, and the Nvidia V100 in Summit. Those systems are former Top500 chart-toppers and remain in the top 10.

[Image: benchmark results table]
Article and Summary link:
 
Benchmark results of NVIDIA GH200 Superchip on OCI
August 28, 2024
This benchmark gives us a look at the power of the NVIDIA GH200 Grace Hopper Superchip running on Oracle Cloud Infrastructure (OCI). While the topic is utterly obscure to anyone who isn't up to date on the latest GPU revolution, this blog post explains to layfolk and decision-makers what's happening from a computer science standpoint, along with the results from our point of view at OCI.
...
The fact that the NVIDIA GH200 benchmarks from Oracle closely track results published by NVIDIA for both ML and more traditional data processing tasks points to reliable high-performance acceleration using this new hardware.
 
Cutting-edge LLMs, like Llama 3.1 405B, require multiple GPUs working together for peak performance. To effectively use multiple GPUs for processing inference requests, an inference software stack must provide developers with optimized implementations of key parallelism techniques, including tensor, pipeline, and expert parallelism. These parallelism techniques require that GPUs be able to transfer data quickly and efficiently, necessitating a robust GPU-to-GPU interconnect fabric for maximum performance.

In this post, we explain two of these parallelism techniques and show, on an NVIDIA HGX H200 system with NVLink and NVSwitch, how the right parallelism increases Llama 3.1 405B performance by 1.5x in throughput-sensitive scenarios. We also show how use of pipeline parallelism enabled a 1.2x speedup in the MLPerf Inference v4.1 Llama 2 70B benchmark on HGX H100 compared to our results published in August. These improvements are possible due to recent software improvements in TensorRT-LLM with NVSwitch.
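To make the data flow concrete: in tensor parallelism each GPU holds a shard of a layer's weights, computes a partial output, and an all-reduce across the GPUs sums those partials, and it is precisely this all-reduce traffic that NVLink and NVSwitch accelerate. A minimal single-process NCCL sketch under that framing (the buffer size is arbitrary, and a cudaMemsetAsync stands in for the real per-shard GEMM):

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

// Each GPU holds a shard of a layer's weights and produces a partial output;
// the all-reduce sums those partials into the full layer output on every GPU.
int main()
{
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus < 2) { printf("need at least 2 GPUs\n"); return 0; }
    if (ngpus > 8) ngpus = 8;

    int devs[8];
    for (int i = 0; i < ngpus; ++i) devs[i] = i;

    ncclComm_t comms[8];
    ncclCommInitAll(comms, ngpus, devs);   // one communicator per GPU, single process

    const size_t n = 1 << 20;              // elements in the (stand-in) partial output
    float *partial[8];
    cudaStream_t streams[8];

    for (int i = 0; i < ngpus; ++i) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&partial[i], n * sizeof(float));
        // A real tensor-parallel layer would run a GEMM against this GPU's
        // weight shard here; a memset stands in for that compute.
        cudaMemsetAsync(partial[i], 0, n * sizeof(float), streams[i]);
    }

    // Sum the per-GPU partials; this is the traffic NVLink/NVSwitch accelerates.
    ncclGroupStart();
    for (int i = 0; i < ngpus; ++i)
        ncclAllReduce(partial[i], partial[i], n, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ngpus; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(partial[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("all-reduced %zu floats across %d GPUs\n", n, ngpus);
    return 0;
}
```

Pipeline parallelism swaps this all-reduce for point-to-point activation sends between pipeline stages, which is why the two techniques stress the interconnect so differently.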
 