Re: shared memory, Hopper adds an optional "cluster" level. By grouping thread blocks into clusters, those thread blocks can access each other's shared memory. Much, much more at the link.
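For anyone curious what that looks like from software, here is a minimal sketch using CUDA's cooperative-groups cluster API (needs an sm_90 target); the kernel name, launch sizes, and the trivial exchange workload are my own illustration, not anything from the whitepaper:

```cpp
// Minimal sketch of Hopper thread block clusters: two blocks form one cluster
// and each block reads a value out of the other block's shared memory.
#include <cooperative_groups.h>
#include <cstdio>

namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1) exchange_kernel(int *out) {
    __shared__ int smem[1];
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0) smem[0] = 100 + cluster.block_rank();
    cluster.sync();  // make every block's write visible across the cluster

    // Map the neighbouring block's shared memory into this block's address space.
    unsigned int peer = (cluster.block_rank() + 1) % cluster.num_blocks();
    int *peer_smem = cluster.map_shared_rank(smem, peer);

    if (threadIdx.x == 0) out[cluster.block_rank()] = peer_smem[0];
    cluster.sync();  // keep peers' shared memory alive until all reads are done
}

int main() {
    int *out;
    cudaMallocManaged(&out, 2 * sizeof(int));
    exchange_kernel<<<2, 32>>>(out);  // a 2-block grid, i.e. exactly one cluster
    cudaDeviceSynchronize();
    std::printf("block 0 read %d, block 1 read %d\n", out[0], out[1]);
    cudaFree(out);
    return 0;
}
```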
To be honest, building an ALU that can do a multiply-add is relatively straightforward, and even though I don’t want to offend anybody, I probably will by saying that. The trick, the art, the skill of doing an FP8 operation, to make it work and be successful, is doing so while operating with two or three bits of mantissa. You might have four or five bits of exponent. But it is a small representation, and we can make it work because AI is fundamentally a statistical problem – you are working on probabilities at the layer level and that kind of stuff. But making it work well and making it able to train a model like GPT-3 or Megatron 530B is where the art is.
So what Hopper does that is unique is that it actually implements what we call the Transformer Engine, which is a combination of hardware and software. We built a brand new Tensor Core that has the FP8 capability, but the Transformer Engine has special functions for collecting statistics and adjusting the range and bias of the computation on a layer-by-layer basis during the training run. So as the transformer model is being trained, the Tensor Cores are outputting statistics to the software, which then applies the scales and biases to maintain the computation. It can keep it in the range of FP8 or, if necessary, promote it back to FP16 or FP32. So the back end of the ALU is highly configurable: it takes in FP8 but can output FP16 or FP32.
To create the Transformer Engine, we had to dedicate the entire “Selene” supercomputer (which annoyed a lot of people) to running training simulations so it could learn to maintain the accuracy of the model training while running it at FP8 precision on the inputs.
This is the key: when people buy Hopper, they are not just getting the H100 GPU accelerator, they are getting this optimized Transformer Engine that knows how to train a transformer model.

Much more at the source.
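For a sense of what that per-layer statistics-and-rescaling loop amounts to, here is a rough sketch of the general "delayed scaling" idea: track recent abs-max values per tensor and derive a scale that maps them into the FP8 range before casting. This is my own illustration, not NVIDIA's Transformer Engine code; all the names and the history length are invented.

```cpp
// Rough, invented illustration of per-tensor "delayed scaling" for FP8 training.
#include <algorithm>
#include <cstdio>
#include <deque>

// Largest finite magnitudes of the two common FP8 encodings.
constexpr float kE4M3Max = 448.0f;    // 4 exponent bits, 3 mantissa bits
constexpr float kE5M2Max = 57344.0f;  // 5 exponent bits, 2 mantissa bits

struct Fp8Scaling {
    std::deque<float> amax_history;  // abs-max statistics reported for recent steps
    float scale = 1.0f;              // multiply the tensor by this before casting to FP8

    void update(float observed_amax, float fp8_max = kE4M3Max, size_t window = 16) {
        amax_history.push_back(observed_amax);
        if (amax_history.size() > window) amax_history.pop_front();
        float amax = *std::max_element(amax_history.begin(), amax_history.end());
        // If nothing representable was observed, keep the previous scale; a real
        // implementation could also fall back to FP16/FP32 for that layer.
        if (amax > 0.0f) scale = fp8_max / amax;
    }
};

int main() {
    Fp8Scaling layer;
    for (float amax : {12.5f, 30.0f, 22.0f}) {  // pretend per-step statistics
        layer.update(amax);
        std::printf("observed amax %.1f -> scale %.3f\n", amax, layer.scale);
    }
    return 0;
}
```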
“We are not averse to chiplets,” explains Jonah Alben, senior vice president of GPU engineering, referring directly to the co-packaged “Grace” Arm server CPU and the Hopper GPU. “But we are really good at making big dies, and I would say that I think we were actually better with Hopper than we were with Ampere at making a big die. One big die is still the best place to be if you can do it, and I think we know how to do that better than anybody else. So we built Hopper that way.”
Cerebras might have something to say about big dies: "46,225 mm², 2.6 trillion transistors".
Nvidia is said to have opted to outsource the production of its next-generation GPUs to Taiwan's TSMC. Nvidia intends to manufacture its H100 GPUs on TSMC's 4-nanometer manufacturing technology. The new GPUs will be available beginning in the third quarter of 2022.
TSMC's 5-nm technology will also be used to mass-produce Nvidia's other GPUs, the RTX 4000 series, according to industry reports.
How is this news? They said at launch that it's made on the "4N process", which is a "TSMC 5nm class process customized for NVIDIA".
It's apparently news in Korea (as of April 4) that TSMC will carry all Hopper (4nm) and Lovelace (5nm) production. The article highlights Samsung's woes, as Qualcomm will likely move to TSMC as well.
...GDep Advance, a retailer specializing in HPC and workstation systems, recently began taking pre-orders for Nvidia's H100 80GB AI and HPC PCI 5.0 compute card with passive cooling for servers. The board costs ¥4,745,950 ($36,405), which includes a ¥4,313,000 base price ($32,955), a ¥431,300 ($3308) consumption tax (sales tax), and a ¥1,650 ($13) delivery charge, according to the company (via Hermitage Akihabara). The board will ship in the latter half of the year, though we are unsure as to exactly when this will be.
...
Nvidia's H100 PCIe 5.0 compute accelerator carries the company's latest GH100 compute GPU with 7296/14592 FP64/FP32 cores (see exact specifications below) that promises to deliver performance of up to 24 FP64 TFLOPS, 48 FP32 TFLOPS, 800 FP16 TFLOPS, and 1.6 INT8 POPS. The board carries 80GB of HBM2E memory with a 5120-bit interface offering a bandwidth of around 2TB/s and has NVLink connectors (up to 600 GB/s) that allow building systems with up to eight H100 GPUs. The card is rated for a 350W thermal design power (TDP).
We do not know whether Nvidia plans to increase the list price of its H100 PCIe cards compared to A100 boards, given that customers get at least two times higher performance at lower power. Meanwhile, we do know that initially Nvidia will ship its DGX H100 and DGX SuperPod systems containing SXM5 versions of GH100 GPUs, as well as SXM5 boards to HPC vendors like Atos, Boxx, Dell, HP, and Lenovo.
Later on, the company will begin shipping its H100 PCIe cards to HPC vendors, and only then will those H100 PCIe boards become available to smaller AI/HPC system integrators and value-added resellers. All of these companies are naturally more interested in shipping complete systems with H100 inside rather than selling only cards. Therefore, it is possible that initially H100 PCIe cards will be overpriced due to high demand, limited availability, and the appetites of retailers.
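Side note: the ~2 TB/s memory bandwidth figure in the excerpt above is consistent with a 5120-bit interface running at roughly 3.2 Gb/s per pin; the per-pin rate here is my assumption for the arithmetic, not something from the article.

```cpp
// Back-of-the-envelope check of the ~2 TB/s memory bandwidth figure quoted above.
#include <cstdio>

int main() {
    const double bus_width_bits = 5120.0;  // HBM2E interface width from the article
    const double pin_rate_gbps  = 3.2;     // assumed data rate per pin (HBM2E-class)
    const double gb_per_s = bus_width_bits * pin_rate_gbps / 8.0;
    std::printf("%.0f GB/s (~%.2f TB/s)\n", gb_per_s, gb_per_s / 1000.0);
    return 0;
}
```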
Moving over to the accelerated nodes, Kestrel will deploy 132 of these configurations, each with four NVIDIA H100 GPU accelerators based on the Hopper graphics architecture and a dual-socket AMD EPYC Genoa CPU config. That's 528 NVIDIA Hopper H100 GPUs and 264 AMD EPYC Genoa chips packed within these accelerated nodes.
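Quick check that the totals in that excerpt line up with 132 nodes (my own trivial arithmetic, nothing from the article):

```cpp
// 132 accelerated nodes, each with 4 H100 GPUs and 2 EPYC Genoa sockets.
#include <cstdio>

int main() {
    const int nodes = 132, gpus_per_node = 4, cpus_per_node = 2;
    std::printf("H100 GPUs: %d, EPYC Genoa CPUs: %d\n",
                nodes * gpus_per_node, nodes * cpus_per_node);
    return 0;
}
```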
Nvidia has published its own numbers comparing the A100 and MI250: https://developer.nvidia.com/blog/fueling-high-performance-computing-with-full-stack-innovation/
Those show that these Linpack numbers aren't really relevant anymore. Hopper will be in a league of its own with the new NVLink network.
The interesting part is that MI300 should have been an option for this project (installation by end of 2023), but they went with H100 instead of full AMD for the accelerated nodes. I've heard that a lot of HPC projects are following this trend, with the most prestigious going with Grace. And MI250X is nowhere to be seen, except for the two exascale systems. On the other side, A100 is still selling very well, with still a lot of back orders...

Hopper wins the GPU portion of the US DOE's new supercomputer, Kestrel:
https://wccftech.com/nrels-kestrel-...-genoa-dual-socket-cpus-528-nvidia-h100-gpus/
MI250X is found in LUMI too, at least. And I'm pretty sure there are other machines using the same HPE Cray EX235a blades.