Nvidia Hopper Speculation, Rumours and Discussion

no word about clock though ....
There are footnotes speaking of 1.9 GHz. Edit: those were given under the impression that the SXM5 model would use 124 SMs, so that may be a bit on the high side for the 132-SM incarnation. Also not finalized, though.
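For what that rumour would imply, here's a quick back-of-the-envelope FP32 number (SM count, lanes per SM and clock are all unconfirmed assumptions here):

```python
# Back-of-the-envelope FP32 peak from the rumoured figures (all unconfirmed assumptions)
sms = 132                  # rumoured SXM5 SM count
fp32_lanes_per_sm = 128    # assuming a GA10x-style 128 FP32 lanes per SM
clock_ghz = 1.9            # the clock mentioned in the footnotes

# FMA counts as 2 FLOPs per lane per clock; lanes * GHz gives GFLOPS, /1000 -> TFLOPS
peak_fp32_tflops = sms * fp32_lanes_per_sm * 2 * clock_ghz / 1000
print(f"~{peak_fp32_tflops:.0f} TFLOPS FP32")   # ~64 TFLOPS
```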
 
p18 of the Whitepaper PDF:
"Only two TPCs in both the SXM5 and PCIe H100 GPUs are graphics-capable (that is, they can run vertex, geometry, and pixel shaders)."

Note how Nvidia refers to the productized H100 chips rather than the H100 GPU in general. And also how it's not the "1 SM" speculated earlier solely on the basis of a slightly misaligned SM.
 
AFAIU it's conceptually the same setup, just with 2x the execution resources per register file.
 
Do the 64 INT32 lanes share data paths with 64 of the FP32 lanes like in Ampere?
I see 32 lanes and I think it's pretty much the same as in Ampere (not GA100 but GA10x): one SIMD is FP32/INT and the other is FP32-only. It doesn't make much sense to build a separate INT32 SIMD, as you wouldn't be able to keep it fed.
 
The Whitepaper says it's 16 each of INT32, FP32, FP32 and FP64 lanes, plus 4 SFUs and 8 LD/ST units, inside each quarter-cluster of an SM.
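Scaling that per-quadrant breakdown up to a full SM (assuming four processing blocks per SM, as in Volta/Ampere) gives:

```python
# Per-SM totals implied by the per-quadrant figures above
# (assumes 4 processing blocks per SM, as in Volta/Ampere SMs)
quadrants = 4
per_quadrant = {"INT32": 16, "FP32": 16 + 16, "FP64": 16, "SFU": 4, "LD/ST": 8}

per_sm = {unit: n * quadrants for unit, n in per_quadrant.items()}
print(per_sm)  # {'INT32': 64, 'FP32': 128, 'FP64': 64, 'SFU': 16, 'LD/ST': 32}
```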
 
So, the H100 (1 GPU) has the following advantages vs the MI250X (2 GPUs):

FP32: 60 vs 48
FP32 Matrix: 500 vs 95 (with sparsity: 1000 vs 95)
FP16 Matrix: 1000 vs 383 (with sparsity: 2000 vs 383)
INT8 Matrix: 2000 vs 383 (with sparsity: 4000 vs 383)
INT4 Matrix: 4000 vs 383 (with sparsity: 8000 vs 383)

And all the reliability of a single monolithic GPU die that has 3 TB/s of bandwidth dedicated to it alone, vs the MI250X's two dies sharing 3.2 TB/s of bandwidth and a slow interconnect.

Still, the MI250X enjoys better theoretical FP64 performance than the H100, when it can utilize both of its GPUs of course:

FP64: 48 vs 30
FP64 Matrix: 97 vs 60
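For reference, here's roughly where the FP64 vector numbers come from (the clocks and lane counts are my assumptions based on public specs, not confirmed figures):

```python
# Rough derivation of the FP64 vector numbers (clocks and lane counts are my
# assumptions from public specs, not confirmed figures)
mi250x_fp64 = 2 * 110 * 64 * 2 * 1.7 / 1000   # 2 dies x 110 CUs x 64 lanes, FMA=2, ~1.7 GHz -> ~47.9 TFLOPS
h100_fp64   = 132 * 64 * 2 * 1.8 / 1000       # 132 SMs x 64 FP64 lanes, FMA=2, ~1.8 GHz   -> ~30.4 TFLOPS
print(round(mi250x_fp64, 1), round(h100_fp64, 1))  # 47.9 30.4
```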
 
MI250X bandwidth is only 400 GB/s when sharing data, or 13.3% of the H100's. With Hopper, Nvidia introduces its NVLink Switch System with full speed across up to 32 nodes. So even inter-node communication is now 2.25x faster than the MI200 interconnect...

The loser is the US government, still building their exascale system, which will be beaten by a computer in a basement. And next year with Grace, Nvidia will be so far ahead that Frontier looks DOA.
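Just to show where those ratios come from, using the publicly quoted bandwidth figures (reading "13.3% of H100" as die-to-die bandwidth vs H100 HBM bandwidth is my interpretation of the comparison):

```python
# The ratios quoted above, using the publicly stated bandwidth figures
# (comparing die-to-die bandwidth against HBM bandwidth is the framing of the post above)
mi250x_die_to_die = 400   # GB/s between the two MI250X dies
h100_hbm          = 3000  # GB/s HBM3 bandwidth quoted for H100 SXM5
h100_nvlink4      = 900   # GB/s total NVLink 4 bandwidth per H100

print(f"{mi250x_die_to_die / h100_hbm:.1%}")       # 13.3%
print(f"{h100_nvlink4 / mi250x_die_to_die:.2f}x")  # 2.25x
```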
 
Funny, it sounds almost like you think they don't know what they want, which in this case wasn't NVIDIA. Next year there will be new products from everyone that make "Frontier look DOA". But it's a completely different story as to when they could get a government-scale supercomputer up and running on those.
 

AFAICT there's still no option to pair NV GPUs with high-end CPUs and get accelerated unified memory as well. There are virtually no high-end CPUs released yet with PCIe 5.0 either ...
 
“Hopper” GH100 GPUs Are The Heart Of A More Expansive Nvidia System (nextplatform.com)
The Hopper GPU has over 80 billion transistors and is implemented in a custom 4N 4 nanometer chip making process from TSMC, which is a pretty significant shrink from the N7 7 nanometer process used in the Ampere GA100 GPUs. That process shrink is being used to boost the number of streaming multiprocessors on the unit, which adds floating point and integer capacity. According to Paresh Kharya, senior director of accelerated computing at Nvidia, the devices also have a new FP8 eight-bit floating point format that can be used for machine learning training, radically improving the size and throughput of the neural networks that can be created, as well as a new Tensor Core unit that can dynamically change its bit-ness for different parts of what are called transformer models to boost their performance in particular.

“The challenge always with mixed precision is to intelligently manage the precision for performance while maintaining the accuracy,” explains Kharya. “Transformer Engine does precisely this with custom, Nvidia-tuned heuristics to dynamically choose between the 8-bit and 16-bit calculations and automatically handle the recasting and scaling that is required between the 16-bit and 8-bit calculations in each layer to deliver dramatic speed ups without the loss of accuracy. With Transformer Engine training of transformer models can be reduced from weeks down to days.”
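For what it's worth, a minimal conceptual sketch of the per-tensor scaling and recasting step the quote describes; the FP8 range constant (~448, E4M3-style) and the amax heuristic here are assumptions for illustration, not Nvidia's actual Transformer Engine implementation:

```python
import numpy as np

# Conceptual sketch only: the per-tensor scaling/recasting step described above.
# The FP8 E4M3-style max value (~448) and this amax heuristic are assumptions for
# illustration, not Nvidia's actual Transformer Engine heuristics.
FP8_MAX = 448.0

def to_scaled_fp8(tensor):
    """Scale so the largest magnitude fits the FP8 range; return data plus the scale."""
    amax = np.abs(tensor).max()
    scale = FP8_MAX / amax if amax > 0 else 1.0
    # Real hardware would cast to 8-bit floats here; clipping just emulates the range.
    return np.clip(tensor * scale, -FP8_MAX, FP8_MAX), scale

def from_scaled_fp8(quantized, scale):
    """Undo the scaling ('recast' back to higher precision after the layer's matmul)."""
    return quantized / scale

acts = np.random.randn(4, 8).astype(np.float32) * 5.0
q, s = to_scaled_fp8(acts)
print(np.allclose(from_scaled_fp8(q, s), acts))  # True: only the range is emulated, not 8-bit rounding
```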

These new fourth-generation Tensor Cores, therefore, are highly tuned to the BERT language translation model and the Megatron Turing conversational AI model, which are in turn used in applications such as the SuperGLUE leaderboard for language interpretation, AlphaFold2 for protein structure prediction, SegFormer for semantic segmentation, OpenAI CLIP for creating images from natural language and vice versa, Google ViT for computer vision, and Decision Transformer for reinforcement learning.

The combination of this Transformer Engine inside the Tensor Cores and the use of FP8 data formats (in conjunction with the FP16 and TF32 formats still needed for AI training, as well as the occasional FP64 for that matter) means the performance of these transformer models has increased by a factor of 6X over what the Ampere A100 can do.
...
Nvidia will be building a follow-on to its existing “Selene” supercomputer based on all of this technology, to be called “Eos” and expected to be delivered in the next few months, according to Jensen Huang, Nvidia’s co-founder and chief executive officer. This machine will weigh in at 275 petaflops at FP64 precision and 18 exaflops at FP8 precision, presumably with sparsity support doubling the raw metrics for both of these precisions.
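Quick sanity check on those Eos figures, assuming the announced 4,608 H100 GPUs (576 DGX H100 nodes) and the per-GPU peaks discussed earlier in the thread:

```python
# Sanity check on the Eos figures, assuming the announced 4,608 H100 GPUs
# (576 DGX H100 nodes) and the per-GPU peaks discussed earlier in the thread
gpus = 4608
fp64_tensor_tflops = 60    # per GPU, FP64 matrix
fp8_sparse_tflops  = 4000  # per GPU, FP8 with sparsity

print(gpus * fp64_tensor_tflops / 1e3, "PFLOPS FP64")  # ~276, in line with the quoted 275
print(gpus * fp8_sparse_tflops / 1e6, "EFLOPS FP8")    # ~18.4, in line with the quoted 18
```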

Because many servers today do not have PCI-Express 5.0 slots available, Nvidia has cooked up a variant of the PCI-Express 5.0 Hopper CNX card that includes the Hopper GPU and a ConnectX-7 adapter card that plugs into a PCI-Express port but lets the GPUs talk to each other over 400 Gb/sec InfiniBand or 400 Gb/sec Ethernet networks using GPUDirect drivers over RDMA and RoCE. This is smart. And in fact, it is a GPU SmartNIC.
 
A small detour to mention that the biggest hardware announcement of GTC 2022 is the Spectrum-4 switch with its mind-blowing 100 billion transistors, 20 billion more than Hopper :runaway::runaway::runaway:


Most people don't realize that interconnect speed and the ability to handle massive data transfers are the key to high-performance distributed HPC workloads. Now that Nvidia controls the full ecosystem (CPU + DPU + GPU + interconnect + software), they can innovate at their own pace and push their standards more easily (it starts with NVLink-C2C).
 