DegustatoR
So it's actually a bit smaller than GA100.
"no word about clock though ...."

Math gives 1.775 GHz for the 60 TF FP32 SXM5 module. Whitepaper table 3 says "not finalized" and figure 13 says 1.3x clock speed compared to A100. There are footnotes speaking of 1.9 GHz. Edit: those were given under the impression that the SXM5 model would use 124 SMs, so maybe that's a bit on the high side for the 132 SM incarnation. Also not finalized though.
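For reference, here is the arithmetic behind those figures (a rough sketch assuming 128 FP32 lanes per SM and 2 FLOPs per lane per clock for a fused multiply-add; none of that is stated in the whitepaper text quoted above):

```python
# Back-of-the-envelope clock estimate from the quoted 60 TFLOPS FP32 figure.
# Assumptions: 128 FP32 lanes per SM, 2 FLOPs per lane per clock (FMA).
def implied_clock_ghz(tflops, sms, lanes_per_sm=128, flops_per_lane=2):
    return tflops * 1e12 / (sms * lanes_per_sm * flops_per_lane) / 1e9

print(implied_clock_ghz(60, 132))  # ~1.776 GHz for the 132 SM SXM5 part
print(implied_clock_ghz(60, 124))  # ~1.89 GHz if 124 SMs were assumed
```

The 124 SM case lands right around the 1.9 GHz mentioned in the footnotes, which fits the edit above.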
"Do the 64 INT32 lanes share data paths with 64 of the FP32 lanes like in Ampere?"

I see 32 lanes and I think it's pretty much the same as in Ampere (not GA100 but GA10x): one SIMD is FP32/INT and the other is FP32 only. It doesn't make much sense to build a separate INT32 path as you won't be able to keep it properly loaded.
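To illustrate why a dedicated INT32 block would be hard to keep busy, here is a toy throughput model (an illustration under assumed issue fractions, not a description of Hopper's actual scheduler): with a shared path, FP32 only loses whatever slice of issue slots the integer work actually consumes.

```python
# Toy model: one FP32-only issue path and one shared FP32/INT32 path of
# equal width, roughly the GA10x-style arrangement described above.
# Purely illustrative; the fractions are assumptions, not measured behavior.
def fp32_throughput_fraction(int_issue_fraction):
    """Fraction of peak FP32 rate when int_issue_fraction of the shared
    path's issue slots are taken by INT32 instructions."""
    dedicated = 0.5                              # the FP32-only path
    shared = 0.5 * (1.0 - int_issue_fraction)    # what INT32 leaves over
    return dedicated + shared

for f in (0.0, 0.25, 0.5):
    print(f, fp32_throughput_fraction(f))        # 1.0, 0.875, 0.75
```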
The loser is the US government, still building their exascale system which will be beaten by a computer in a basement. And next year with Grace nVidia will be so far ahead that Frontier looks DOA.

Funny, it sounds almost like you think they don't know what they want, which in this case wasn't NVIDIA. Next year there will be new products from everyone which make "Frontier look DOA". But it's a completely different story as to when they could get a government-scale supercomputer up and running on those.

MI250X bandwidth is only 400 GB/s when sharing data, or 13.3% of H100. With Hopper nVidia introduces their NVLink Switch System with full speed across up to 32 nodes. So even inter-node communication is now 2.25x faster than the MI200 interconnect...
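For reference, the arithmetic behind those two figures (a rough check assuming 900 GB/s of NVLink bandwidth per H100 and roughly 3 TB/s of HBM3 memory bandwidth, which are Nvidia's announced numbers rather than anything stated in the posts above):

```python
# Rough arithmetic behind the 13.3% and 2.25x figures quoted above.
mi250x_link_gb_s = 400     # GB/s, MI250X coherent link when sharing data
h100_hbm3_gb_s   = 3000    # GB/s, approximate H100 SXM5 memory bandwidth
h100_nvlink_gb_s = 900     # GB/s, 4th-gen NVLink per H100

print(mi250x_link_gb_s / h100_hbm3_gb_s)    # ~0.133 -> "13.3% of H100"
print(h100_nvlink_gb_s / mi250x_link_gb_s)  # 2.25   -> "2.25x faster"
```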
The Hopper GPU has over 80 billion transistors and is implemented in a custom 4N 4 nanometer chip making process from TSMC, which is a pretty significant shrink from the N7 7 nanometer process used in the Ampere GA100 GPUs. That process shrink is being used to boost the number of streaming multiprocessors on the device, which adds floating point and integer capacity. According to Paresh Kharya, senior director of accelerated computing at Nvidia, the devices also have a new FP8 eight-bit floating point format that can be used for machine learning training, radically improving the size and throughput of the neural networks that can be created, as well as a new Tensor Core unit that can dynamically change its precision for different parts of what are called transformer models to boost their performance.
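As a rough feel for what an 8-bit float keeps and gives up, here is a toy round-trip through an E4M3-style layout (4 exponent bits, 3 mantissa bits, one of the two FP8 variants Nvidia describes); this is a generic quantization sketch, not the Hopper conversion hardware:

```python
import numpy as np

# Toy FP8 (E4M3-style: 4 exponent bits, 3 mantissa bits) round-trip, to show
# the precision/range trade-off versus FP16/FP32.
def quantize_e4m3(x, exp_bits=4, man_bits=3):
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1          # 7 for E4M3
    max_exp = 2 ** exp_bits - 2 - bias      # largest normal exponent (toy rule)
    out = np.zeros_like(x)
    nz = x != 0
    e = np.floor(np.log2(np.abs(x[nz])))
    e = np.clip(e, 1 - bias, max_exp)       # clamp to the normal exponent range
    step = 2.0 ** (e - man_bits)            # spacing between representable values
    out[nz] = np.round(x[nz] / step) * step
    return out

vals = np.array([0.1234, 1.7, 3.14159, 240.0])
print(quantize_e4m3(vals))                  # -> 0.125, 1.75, 3.25, 240.0
```

With only three mantissa bits the rounding is coarse, which is why the per-layer management described next matters.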
“The challenge always with mixed precision is to intelligently manage the precision for performance while maintaining the accuracy,” explains Kharya. “Transformer Engine does precisely this with custom, Nvidia-tuned heuristics to dynamically choose between the 8-bit and 16-bit calculations and automatically handle the recasting and scaling that is required between the 16-bit and 8-bit calculations in each layer to deliver dramatic speedups without the loss of accuracy. With Transformer Engine, training of transformer models can be reduced from weeks down to days.”
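A heavily simplified sketch of the kind of per-layer decision being described; the threshold, the amax-based scaling, and the layer interface below are invented for illustration and are not Nvidia's actual Transformer Engine heuristics:

```python
import numpy as np

# Conceptual sketch of per-layer precision selection with amax-based scaling.
# All heuristics and thresholds here are made up for illustration.
FP8_MAX = 448.0     # assumed representable maximum for an E4M3-style format

def choose_precision_and_scale(amax_history, dynamic_range_limit=1e4):
    """Pick FP8 vs FP16 for a layer and compute a scale factor from the
    running amax (largest absolute value seen over recent iterations)."""
    amax = max(amax_history)
    amin = min(a for a in amax_history if a > 0)
    if amax / amin > dynamic_range_limit:   # range too wide: stay in 16-bit
        return "fp16", 1.0
    scale = FP8_MAX / amax                  # map observed range onto FP8 range
    return "fp8", scale

def quantize(tensor, scale):
    # Cast down with scaling; the matching de-scale happens after the matmul.
    return np.clip(tensor * scale, -FP8_MAX, FP8_MAX)

layer_amax = [3.2, 2.9, 3.5]                # per-iteration amax for one layer
prec, scale = choose_precision_and_scale(layer_amax)
print(prec, scale)                          # "fp8", 128.0 in this toy case
```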
These new fourth-generation Tensor Cores, therefore, are highly tuned to the BERT language model and the Megatron-Turing conversational AI model, which are in turn used in applications such as the SuperGLUE leaderboard for language understanding, AlphaFold2 for protein structure prediction, SegFormer for semantic segmentation, OpenAI CLIP for creating images from natural language and vice versa, Google ViT for computer vision, and Decision Transformer for reinforcement learning.
With the combination of this Transformer Engine inside the Tensor Cores and the use of FP8 data formats (in conjunction with the FP16 and TF32 formats still needed for AI training, as well as the occasional FP64 for that matter), the performance of these transformer models has increased by a factor of 6X over what the Ampere A100 can do.
...
Nvidia will be building a follow-on to its existing “Selene” supercomputer based on all of this technology, to be called “Eos” and expected to be delivered in the next few months, according to Jensen Huang, Nvidia’s co-founder and chief executive officer. This machine will weigh in at 275 petaflops at FP64 precision and 18 exaflops at FP8 precision, presumably with sparsity support doubling the raw metrics for both of these precisions.
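Those two headline figures are at least self-consistent with H100 per-GPU peaks (a back-of-the-envelope check; the per-GPU rates below are assumptions based on Nvidia's published H100 specs, and the implied GPU count is an inference, not a number given here):

```python
# Implied GPU counts from the quoted Eos totals, assuming H100 per-GPU peaks.
fp64_tensor_per_gpu = 60e12      # FLOPS, FP64 Tensor Core (assumed)
fp8_sparse_per_gpu  = 4e15       # FLOPS, FP8 with sparsity (assumed)

print(275e15 / fp64_tensor_per_gpu)   # ~4583 GPUs implied by 275 PF FP64
print(18e18  / fp8_sparse_per_gpu)    # ~4500 GPUs implied by 18 EF FP8
# Both land in the same ballpark; the FP8 total lines up with the sparse
# per-GPU rate, which fits the "with sparsity" reading above.
```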
Because many servers today do not have PCI-Express 5.0 slots available, Nvidia has cooked up a variant of the PCI-Express 5.0 Hopper card, called the Hopper CNX, which combines the Hopper GPU and a ConnectX-7 adapter on one card; it plugs into a PCI-Express port but lets the GPUs talk to each other over 400 Gb/sec InfiniBand or 400 Gb/sec Ethernet networks using GPUDirect drivers over RDMA and RoCE. This is smart. And in fact, it is a GPU SmartNIC.
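To put the link speeds in context, here is a rough comparison of the 400 Gb/sec network rate against host PCI-Express slot bandwidth (the per-generation PCIe figures are standard spec rates with overheads ignored, not something stated above):

```python
# Why pairing the GPU directly with the NIC helps on pre-Gen5 servers.
nic_gb_s       = 400 / 8        # 400 Gb/s InfiniBand/Ethernet -> 50 GB/s
pcie4_x16_gb_s = 16 * 2.0       # ~32 GB/s usable, PCIe 4.0 x16
pcie5_x16_gb_s = 16 * 4.0       # ~64 GB/s usable, PCIe 5.0 x16

print(nic_gb_s, pcie4_x16_gb_s, pcie5_x16_gb_s)
# A Gen4 host link (~32 GB/s) cannot feed a 50 GB/s NIC at line rate, so
# letting the GPU talk to the ConnectX-7 directly on the card sidesteps
# that host-side bottleneck.
```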