Marble Madness 2020 confirmed! The Omniverse ray-tracing stuff looks great. The link should go to the right time; if not, it starts around 9:42. The actual explanation of what Omniverse is comes before the demo part.
> Looking at AT's article, single precision is 19.5 TF, up from 15.7 TF on V100, and double precision is 9.7 TF, up from 7.8 TF. The boost clock is down by about 100 MHz. How different would the gaming chip have to be, considering these changes look anemic for the node jump?

You are looking at the wrong metrics. They expanded Tensor functionality to 32-bit, where they achieve 156 TF/312 TF. This is an AI-optimized chip and should be treated accordingly.
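For what it's worth, the vanilla FP32/FP64 numbers fall straight out of the core counts and clock. A quick sanity check (assuming the published ~1410 MHz A100 boost clock):

```python
# Peak rate = cores × 2 FLOPs per clock (an FMA counts as two ops) × boost clock
fp32_cores = 6912        # 108 SMs × 64 FP32 cores/SM
fp64_cores = 3456        # 108 SMs × 32 FP64 cores/SM
boost_hz = 1.41e9        # ~1410 MHz boost clock

print(f"FP32: {fp32_cores * 2 * boost_hz / 1e12:.1f} TFLOPS")  # 19.5
print(f"FP64: {fp64_cores * 2 * boost_hz / 1e12:.1f} TFLOPS")  # 9.7
```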
> So, the current crop of A100 GPUs disables one of the eight GPCs to gain yields (plus a few extra SMs); that's why MIG virtualization is limited to 7 partitions?!

Possibly one memory partition as well, since there are six symmetric places for HBM2 stacks, one of them being a dummy, with 40 GBytes per SXM.
A100 GPU hardware architecture
The NVIDIA GA100 GPU is composed of multiple GPU processing clusters (GPCs), texture processing clusters (TPCs), streaming multiprocessors (SMs), and HBM2 memory controllers.
The full implementation of the GA100 GPU includes the following units:
- 8 GPCs, 8 TPCs/GPC, 2 SMs/TPC, 16 SMs/GPC, 128 SMs per full GPU
- 64 FP32 CUDA Cores/SM, 8192 FP32 CUDA Cores per full GPU
- 4 third-generation Tensor Cores/SM, 512 third-generation Tensor Cores per full GPU
- 6 HBM2 stacks, 12 512-bit memory controllers

The A100 Tensor Core GPU implementation of the GA100 GPU includes the following units:
- 7 GPCs, 7 or 8 TPCs/GPC, 2 SMs/TPC, up to 16 SMs/GPC, 108 SMs
- 64 FP32 CUDA Cores/SM, 6912 FP32 CUDA Cores per GPU
- 4 third-generation Tensor Cores/SM, 432 third-generation Tensor Cores per GPU
- 5 HBM2 stacks, 10 512-bit memory controllers
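As a sanity check, those counts compose like this (just arithmetic over the configuration above):

```python
# Full GA100: 8 GPCs × 8 TPCs/GPC × 2 SMs/TPC
full_sms = 8 * 8 * 2
print(full_sms, full_sms * 64, full_sms * 4)  # 128 SMs, 8192 FP32 cores, 512 Tensor Cores

# A100 as shipped: 54 enabled TPCs (one GPC off, plus a few TPCs fused off)
a100_sms = 54 * 2
print(a100_sms, a100_sms * 64, a100_sms * 4)  # 108 SMs, 6912 FP32 cores, 432 Tensor Cores

# Memory bus: 10 enabled controllers × 512 bits, across 5 HBM2 stacks
print(10 * 512, "bit bus")                    # 5120-bit bus
```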
> Possibly one memory partition as well, since there are six symmetric places for HBM2 stacks, one of them being a dummy, with 40 GBytes per SXM.

Can you test the GPU before assembly when using CoWoS? If not, there might not be any dummies; instead, they just bin the fully working ones to release at a later date.
So, no one else feeling a little baffled about the die size and transistor count? 54 bln x-tors in 826 mm² is more than 1.5x the density AMD gets with the (albeit much smaller) Navi 10 and Vega 20.
Here, Jensen mentions 70% more transistors (indirectly referencing Volta), which would put it at 35.7 bln x-tors and a much more plausible 43.2M x-tors/mm².
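The arithmetic behind both readings (Navi 10's 10.3 bln x-tors in 251 mm² taken from AMD's public specs):

```python
print(54e9 / 826 / 1e6)    # ≈ 65.4 Mx-tors/mm² if the 54 bln figure is literal
print(10.3e9 / 251 / 1e6)  # ≈ 41.0 Mx-tors/mm² for Navi 10, so ~1.6x denser
print(35.7e9 / 826 / 1e6)  # ≈ 43.2 Mx-tors/mm² under the "70% more than Volta" reading
```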
> Official Ampere deep dive from Nvidia: https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/

Interestingly, RT cores are omitted from the Tesla Ampere variants, the same way display connectors and the NVENC encoder are omitted. Which means NVIDIA will be customizing its GPUs heavily this time around.
> Unless tensor cores are being used in the normal shader pipeline, the gaming Ampere chip would be drastically different.

Yes indeed. I expect them to ditch a lot of the tensor cores in the gaming chips; FP64 will be gone too, along with a lot of the HPC silicon. RT cores will be back.
8 GPUs pushing 5 PFLOPS is around 625 TFLOPS per GPU; clearly they're talking about Tensor FLOPS and not general FP32 FLOPS.
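The division checks out against the announced tensor peaks:

```python
per_gpu = 5e15 / 8                       # 8-GPU DGX rated at 5 PFLOPS
print(per_gpu / 1e12, "TFLOPS per GPU")  # 625.0
# A100 FP16 Tensor Core peak: 312 TFLOPS dense, 624 TFLOPS with 2:4 sparsity.
# 625 ≈ 624, so the headline is sparse FP16 tensor math,
# not the 19.5 TFLOPS general FP32 rate.
```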
> Thinking ahead to consumer parts, obviously the FP64 cores will go bye-bye, but I don't see how they can cut back on the tensor cores with this SM architecture in a way that saves die space. But it looks like load/store throughput and L1 cache have doubled compared to the Turing SM, so that should lead to some IPC gains.

Double the L/S units, but still just one TMU quad. That SM layout is probably not conclusive for the consumer parts, particularly regarding the increased RT performance.
> but I don't see how they can cut back on the tensor cores with this SM architecture in a way that saves die space

Tensor Cores occupy a third of the SM now, which is a massive increase over Volta and Turing, so they will be restructuring the SM to add RT cores and minimize tensor space for the consumer chips.