Nvidia Ampere Discussion [2020-05-14]

Looking at AT's article, single precision is 19.5 TF, up from 15.7 TF on V100, and double precision is 9.7 TF, up from 7.8 TF. The boost clock is down by about 100 MHz. I wonder how different the gaming chip will have to be, considering these changes look anemic for the node jump.
 
You are looking at the wrong metrics. They expanded Tensor Core functionality to 32 bit (TF32), where they hit 156 TF/312 TF (dense/sparse). This is an AI-optimized chip and should be treated accordingly.
 
You are looking at the wrong comment, then. I'm talking about the gaming chip, which is supposedly Ampere too. Unless tensor cores are being used in the normal shader pipeline, the gaming Ampere chip would have to be drastically different.
 
So, the current crop of A100 GPUs disables one of the eight GPCs (plus a few extra SMs) to improve yields - that's why MIG virtualization is limited to 7 partitions?!
Possibly one memory partition as well, since there are six symmetric sites for HBM2 stacks - one of them being a dummy - giving 40 GB per SXM module.
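A back-of-the-envelope check of both points against the unit counts in the deep dive quoted below (8 GB per stack and ~2.43 Gbps per pin are assumptions from the public A100 specs, not anything stated in this thread):
Code:
# Yield math: the full GA100 has 8 GPCs x 16 SMs = 128 SMs, the shipping A100 has 108
full_sms = 8 * 16
a100_sms = 108
print(full_sms - 16, "SMs after dropping one GPC;", (full_sms - 16) - a100_sms, "more SMs fused off on top")

# Memory math, assuming 8 GB per HBM2 stack and ~2.43 Gbps per pin
active_stacks = 5                       # six physical sites, one dummy
bus_bits = active_stacks * 2 * 512      # two 512-bit controllers per stack
print(active_stacks * 8, "GB per SXM,", bus_bits, "bit bus, ~", round(bus_bits / 8 * 2.43), "GB/s")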
 
Official Ampere deep dive from Nvidia:
https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/
A100 GPU hardware architecture
The NVIDIA GA100 GPU is composed of multiple GPU processing clusters (GPCs), texture processing clusters (TPCs), streaming multiprocessors (SMs), and HBM2 memory controllers.

The full implementation of the GA100 GPU includes the following units:

  • 8 GPCs, 8 TPCs/GPC, 2 SMs/TPC, 16 SMs/GPC, 128 SMs per full GPU
  • 64 FP32 CUDA Cores/SM, 8192 FP32 CUDA Cores per full GPU
  • 4 third-generation Tensor Cores/SM, 512 third-generation Tensor Cores per full GPU
  • 6 HBM2 stacks, 12 512-bit memory controllers
The A100 Tensor Core GPU implementation of the GA100 GPU includes the following units:

  • 7 GPCs, 7 or 8 TPCs/GPC, 2 SMs/TPC, up to 16 SMs/GPC, 108 SMs
  • 64 FP32 CUDA Cores/SM, 6912 FP32 CUDA Cores per GPU
  • 4 third-generation Tensor Cores/SM, 432 third-generation Tensor Cores per GPU
  • 5 HBM2 stacks, 10 512-bit memory controllers
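For reference, the headline throughput numbers fall straight out of those A100 unit counts. A minimal sanity check, assuming the ~1410 MHz boost clock, 32 FP64 cores per SM, and the per-clock tensor core rates from NVIDIA's public material (none of which appear in the list above):
Code:
boost_ghz    = 1.41            # assumed A100 boost clock
sms          = 108
fp32_cores   = sms * 64        # 6912
fp64_cores   = sms * 32        # 3456 (assumption: 32 FP64 cores per SM)
tensor_cores = sms * 4         # 432

def tflops(flops_per_clock):
    return flops_per_clock * boost_ghz * 1e9 / 1e12

print("FP32:", round(tflops(fp32_cores * 2), 1), "TF")              # ~19.5
print("FP64:", round(tflops(fp64_cores * 2), 1), "TF")              # ~9.7
print("TF32 tensor:", round(tflops(tensor_cores * 128 * 2)), "TF")  # ~156, 312 with sparsity
print("FP16 tensor:", round(tflops(tensor_cores * 256 * 2)), "TF")  # ~312, 624 with sparsity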
 
Possibly one memory partition as well, since there are six symmetric sites for HBM2 stacks - one of them being a dummy - giving 40 GB per SXM module.
Can you test the GPU before assembly when using CoWoS? If not, there might not be any dummies; instead they might just bin the fully working ones and release them at a later date.
 
So, no one else feeling a little baffled by the die size and transistor count? 54 billion transistors in 826 mm² is more than 1.5x the density AMD gets with the (albeit much smaller) Navi 10 and Vega 20.

Here, Jensen mentions 70% more transistors (indirectly referencing Volta), which would put it at 35.7 billion transistors and a much more plausible 43.2M transistors/mm².
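A quick back-of-the-envelope on those densities (the transistor counts and die sizes for the comparison chips are taken from public specs, not from this post):
Code:
chips = {                      # billions of transistors, die size in mm^2
    "GA100":   (54.2, 826),
    "GV100":   (21.1, 815),
    "Navi 10": (10.3, 251),
    "Vega 20": (13.2, 331),
}
for name, (xtors, area) in chips.items():
    print(f"{name}: {xtors * 1000 / area:.1f} M transistors/mm^2")
# GA100 ~65.6 vs Navi 10 ~41.0 is roughly 1.6x; a 35.7B reading would land at ~43.2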
 
I really think he's talking about the new tensor cores though.

The density of the entire chip doesn't surprise me one bit, as I've discussed several times in the forums. It's what I would expect from a node that is claimed to be 3x denser. Actual density has always been relatively close to what the foundry claimed before 7nm, so why would it be different now? It's always been AMD's density that didn't make any sense.
 
Interestingly, RT cores are omitted from the Tesla Ampere variants, the same way display connectors and the NVENC encoder are omitted. Which means NVIDIA will highly customize their GPUs this time around.

Unless tensor cores are being used in the normal shader pipeline, the gaming Ampere chip would have to be drastically different.
Yes indeed, I expect them to ditch a lot of the tensor cores in the gaming chips, and FP64 will be gone too, along with a lot of the HPC silicon. RT cores will be back.
 
8 GPUs pushing 5 PFLOPS is around 625 TFLOPS per GPU; clearly they're talking about tensor FLOPS and not general FP32 FLOPS.

625 TFLOPS per GPU is actually its FP16 tensor core number (edit: with sparse matrix optimization).
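The arithmetic behind that, for what it's worth (the sparse FP16 tensor figure is taken from NVIDIA's published A100 specs, not from this thread):
Code:
dgx_ai_pflops = 5
gpus = 8
print(dgx_ai_pflops * 1000 / gpus, "TFLOPS per GPU")   # 625.0
print(312 * 2, "TFLOPS FP16 tensor with sparsity")     # ~624, i.e. the marketing figure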
 
Thinking ahead to consumer parts, obviously the FP64 cores will go bye-bye, but I don't see how they can cut back on the tensor cores with this SM architecture in a way that saves die space. It looks like load/store throughput and L1 cache have doubled compared to the Turing SM, though, so that should lead to some IPC gains.

I'm guessing we'll see something in the range of 84-90 SMs (5376-5760 FP32 CUDA cores) for GA102,
320-384 bit crossbar memory controller with 20-24GB GDDR6,
ditch most of the NVlink connections and add in RT cores,
should give us a die size in the 600-650mm^2 range.
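Purely illustrative math on that guess, assuming the 64 FP32 cores/SM layout carries over to GA102 and a ~1.8 GHz gaming boost clock (both are assumptions, nothing confirmed):
Code:
for sm_count in (84, 90):
    cores = sm_count * 64
    print(f"{sm_count} SMs -> {cores} CUDA cores -> ~{cores * 2 * 1.8e9 / 1e12:.1f} TF FP32")
# 84 SMs -> 5376 cores -> ~19.4 TF; 90 SMs -> 5760 cores -> ~20.7 TF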
 
Thinking ahead to consumer parts, obviously the FP64 cores will go bye-bye, but I don't see how they can cut back on the tensor cores with this SM architecture in a way that saves die space. It looks like load/store throughput and L1 cache have doubled compared to the Turing SM, though, so that should lead to some IPC gains.
Double the L/S units, but still just one TMU quad. That SM layout probably isn't indicative of the consumer parts, though, particularly when it comes to increased RT performance.
 
GA100 SM doesn't even have RT cores, so GA102 will need to incorporate them.
 