NVidia Hopper Speculation, Rumours and Discussion

Have you actually read the article you linked? It has absolutely nothing to do with DL or AI or any such thing; it just says that AMD has decided to use Google for additional capacity on top of what they already had. The hardware they're using in Google Cloud is nothing special either, just Milan Epycs.

Improved design and operations from applied Google Cloud artificial intelligence and machine learning tools and frameworks

What other reason is there to pay a company to use your own CPUs?!
 
What other reason is there to pay a company to use your own CPUs?!
Must have missed that part when reading it through, but it still in no way indicates or suggests this would be the first time they're using such technologies.
As for the economics of it, I can see plenty of scenarios where it's more economical to rent your "own hardware" from an outside source rather than build a similarly sized server farm yourself.
 
Upcoming GTC sessions related to the Hopper architecture.

  • Learn about the latest additions to the CUDA platform, language, and toolkit, and what the new Hopper GPU architecture brings to CUDA. Presented by one of the architects of CUDA, this engineer-focused session covers all the latest developments for NVIDIA's GPU developer ecosystem and looks ahead to where CUDA will be going over the coming year.

  • NVIDIA’s new Hopper GPUs contain advanced features that can unleash tremendous application performance, but they can also require new techniques for coding your applications. Learn how to access Hopper’s advanced capabilities and squeeze all the juice from your hardware while retaining performance portability. We’ll cover how to opt in to performance features, design patterns for structuring code for compatibility and performance, and strategies for effective testing and QA of high-performance applications with an eye to portability.

  • This session will introduce new features in CUDA for programming the Hopper architecture. The new programming model for Hopper is more hierarchical and asynchronous. CUDA programming for Hopper introduces an optional level of hierarchy called Thread Block Clusters, which enables multiple thread blocks within a cluster to communicate using a common pool of shared memory (see the sketch after this list). Asynchronous data movement is now hardware accelerated in all directions between global and shared memory. We will look at how to exploit the new programming model in applications for performance tuning.

  • The upcoming Hopper-based platforms, as well as the Grace Hopper Superchip, are exciting developments for high-performance computing. Taking advantage of this new hardware requires great software, and CUDA developer tools are here to help. We'll give a brief overview of the tools available for free to developers, then detail the newest features and explain how they help users identify performance and correctness issues, where they are occurring, and some options to fix them. We'll pay specific attention to features supporting new architectures. CUDA developer tools are designed in lockstep with the CUDA ecosystem, including hardware and software. With new technologies like the Grace Hopper Superchip, visibility and optimization of the entire platform are key to unleashing the next level of accelerated computing performance. This presentation will prepare you for that move to the leading edge.
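For anyone curious what the Thread Block Cluster model from that third session looks like in practice, here's a rough sketch (untested, names are illustrative, not from the sessions): two blocks in one cluster exchange data through each other's shared memory. It assumes an sm_90 GPU and a CUDA 12 toolkit, compiled with nvcc -arch=sm_90.

// Sketch: two thread blocks in one cluster exchange values via
// distributed shared memory. Requires Hopper (sm_90) and CUDA 12.
#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Compile-time cluster size of 2x1x1 blocks; the grid must be divisible by it.
__global__ void __cluster_dims__(2, 1, 1) exchange_kernel(int *out)
{
    __shared__ int smem[1];
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0)
        smem[0] = blockIdx.x;           // each block publishes its own ID
    cluster.sync();                     // cluster-wide barrier: writes visible to the peer block

    // Map the shared memory of the other block in this 2-block cluster.
    unsigned peer = cluster.block_rank() ^ 1;
    int *peer_smem = cluster.map_shared_rank(smem, peer);

    if (threadIdx.x == 0)
        out[blockIdx.x] = peer_smem[0]; // read the peer's value
    cluster.sync();                     // keep peer shared memory alive until all reads finish
}

int main()
{
    int *out = nullptr;
    cudaMallocManaged(&out, 2 * sizeof(int));
    exchange_kernel<<<2, 32>>>(out);    // 2 blocks = exactly 1 cluster
    cudaDeviceSynchronize();
    printf("block 0 read %d, block 1 read %d\n", out[0], out[1]); // expect 1 and 0
    cudaFree(out);
    return 0;
}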
 
The H100 is built on the PG520 PCB, which has over 30 power VRMs and a massive integral interposer that uses TSMC's CoWoS tech to combine the Hopper H100 GPU with a 6-stack HBM3 design. Some of the main technologies of the Hopper H100 GPU include:

  • 132 SMs (2x Performance Per Clock)
  • 4th Gen Tensor Cores
  • Thread Block Clusters
  • 2nd Gen Multi-Instance GPU
  • Confidential Computing
  • PCIe Gen 5.0 Interface
  • World's First HBM3 DRAM
  • Larger 50 MB L2 Cache
  • 4th Gen NVLink (900 GB/s Total Bandwidth)
  • New SHARP support
  • NVLink Network
Out of the six stacks, two stacks are kept to ensure yield integrity. But the new HBM3 standard allows for up to 80 GB of capacity at 3 TB/s, which is crazy. For comparison, the current fastest gaming graphics card, the RTX 3090 Ti, offers just 1 TB/s of bandwidth and a 24 GB VRAM capacity. Other than that, the H100 Hopper GPU also packs in the latest FP8 data format, and its new SXM connection helps accommodate the 700 W power design the chip is built around. It also offers twice the FP32 and FP64 FMA rates and 256 KB of L1 cache (shared memory).
...
Rounding up the performance figures, NVIDIA's GH100 Hopper GPU will offer 4000 TFLOPS of FP8, 2000 TFLOPS of FP16, 1000 TFLOPS of TF32 and 60 TFLOPS of FP64 compute performance. These record-shattering figures decimate every HPC accelerator that came before it. For comparison, this is 3.3x faster than NVIDIA's own A100 GPU and 28% faster than AMD's Instinct MI250X in FP64 compute. In FP16 compute, the H100 GPU is 3x faster than A100 and 5.2x faster than MI250X, which is literally bonkers.
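On the FP8 data format mentioned above, here's a minimal sketch (illustrative, not from the article) of round-tripping a value through the two 8-bit floating-point encodings Hopper's tensor cores consume, assuming the cuda_fp8.h header that ships with CUDA 11.8 and later:

// Sketch: convert a float to Hopper's two FP8 encodings and back on the host.
// Assumes CUDA 11.8+ (cuda_fp8.h). Compile with: nvcc fp8_demo.cu
#include <cstdio>
#include <cuda_fp8.h>

int main()
{
    float x = 0.3333f;
    __nv_fp8_e4m3 a(x);   // 4 exponent bits, 3 mantissa bits: more precision
    __nv_fp8_e5m2 b(x);   // 5 exponent bits, 2 mantissa bits: more range
    printf("input %f -> e4m3 %f, e5m2 %f\n", x, float(a), float(b));
    return 0;
}

The e4m3 result lands closer to the input, while e5m2 trades precision for dynamic range; which one a kernel uses is a per-operand choice.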
 
Didn't people here laugh at AMD going 1:1 FP64? Yet here we are a generation later and NVIDIA is doing the exact same thing (Intel too).
 
According to the specs, H100 is 1:2 non-tensor and roughly 1:17 tensor (FP64 vs. TF32: about 60 TFLOPS vs. 1000 TFLOPS from the figures above).

Not that I’m laughing at either NVIDIA or AMD’s decisions here. I’m sure they’ve worked out the math.
Was the link above wrong, then? It at least lists a 1:1 FP32:FP64 core ratio.
 

 
NVIDIA says H100 is up to 4x faster than A100.

[Image: NVIDIA slide comparing H100 performance to A100]


 
He said that running the HPL benchmark is different from engaging the system while running scientific applications “without a hardware failure, without a hiccup in the network, and getting everything tuned.”

Distributed computing is hard. Just building a Linpack chip and outdated interconnects is not enough. Optical NVLink is so far ahead right now that real scientific workloads will run so much better on their platform.
 

Distributed computing is hard. Just building a Linpack chip and outdated interconnects is not enough. Optical NVLink is so far ahead right now that real scientific workloads will run so much better on their platform.
Seriously, you need to fix your hostility.
The Slingshot interconnect, launched in 2020, was definitely not outdated when they specced Frontier and the nodes it uses. Optical NVLink was nowhere near ready when those systems were specced.
 