Marble Madness 2020 confirmed! The Omniverse ray-tracing stuff looks great. The link should go to the right time; if not, it starts around 9:42. The actual explanation of what Omniverse is comes before the demo part.
> Looking at AT's article, single precision is 19.5 TF, up from 15.7 TF on V100, and double precision is 9.7 TF, up from 7.8 TF. The boost clock is down by about 100 MHz. How different would the gaming chip have to be, considering these changes look anemic for the node jump?

You are looking at the wrong metrics. They expanded Tensor functionality to 32-bit, where they achieve 156 TF/312 TF. This is an AI-optimized chip and should be treated accordingly.
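For what it's worth, the vanilla FP32/FP64 numbers fall straight out of the core counts and clock. A quick sanity check (assuming the published ~1410 MHz A100 boost clock):

```python
# Peak rate = cores × 2 FLOPs per clock (an FMA counts as two ops) × boost clock
fp32_cores = 6912        # 108 SMs × 64 FP32 cores/SM
fp64_cores = 3456        # 108 SMs × 32 FP64 cores/SM
boost_hz = 1.41e9        # ~1410 MHz boost clock

print(f"FP32: {fp32_cores * 2 * boost_hz / 1e12:.1f} TFLOPS")  # 19.5
print(f"FP64: {fp64_cores * 2 * boost_hz / 1e12:.1f} TFLOPS")  # 9.7
```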
> So, the current crop of A100 GPUs disables one of the eight GPCs to gain yields (plus a few extra SMs); that's why MIG virtualization is limited to 7 partitions?!

Possibly one memory partition as well, since there are six symmetric places for HBM2 stacks, one of them being a dummy, with 40 GBytes per SXM.
A100 GPU hardware architecture
The NVIDIA GA100 GPU is composed of multiple GPU processing clusters (GPCs), texture processing clusters (TPCs), streaming multiprocessors (SMs), and HBM2 memory controllers.
The full implementation of the GA100 GPU includes the following units:
- 8 GPCs, 8 TPCs/GPC, 2 SMs/TPC, 16 SMs/GPC, 128 SMs per full GPU
- 64 FP32 CUDA Cores/SM, 8192 FP32 CUDA Cores per full GPU
- 4 third-generation Tensor Cores/SM, 512 third-generation Tensor Cores per full GPU
- 6 HBM2 stacks, 12 512-bit memory controllers

The A100 Tensor Core GPU implementation of the GA100 GPU includes the following units:
- 7 GPCs, 7 or 8 TPCs/GPC, 2 SMs/TPC, up to 16 SMs/GPC, 108 SMs
- 64 FP32 CUDA Cores/SM, 6912 FP32 CUDA Cores per GPU
- 4 third-generation Tensor Cores/SM, 432 third-generation Tensor Cores per GPU
- 5 HBM2 stacks, 10 512-bit memory controllers
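As a sanity check, those counts compose like this (just arithmetic over the configuration above):

```python
# Full GA100: 8 GPCs × 8 TPCs/GPC × 2 SMs/TPC
full_sms = 8 * 8 * 2
print(full_sms, full_sms * 64, full_sms * 4)  # 128 SMs, 8192 FP32 cores, 512 Tensor Cores

# A100 as shipped: 54 enabled TPCs (one GPC off, plus a few TPCs fused off)
a100_sms = 54 * 2
print(a100_sms, a100_sms * 64, a100_sms * 4)  # 108 SMs, 6912 FP32 cores, 432 Tensor Cores

# Memory bus: 10 enabled controllers × 512 bits, across 5 HBM2 stacks
print(10 * 512, "bit bus")                    # 5120-bit bus
```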
> Possibly one memory partition as well, since there are six symmetric places for HBM2 stacks, one of them being a dummy, with 40 GBytes per SXM.

Can you test the GPU before assembly when using CoWoS? If not, there might not be any dummies; instead, they just bin the fully working ones to release at a later date.
So, no one else feeling a little baffled about the die size and transistor count? 54 bln x-tors in 826 mm² is more than 1.5x the density AMD gets with the (albeit much smaller) Navi 10 and Vega 20.
Here, Jensen mentions 70% more transistors (indirectly referencing Volta), which would put it at 35.7 bln x-tors and a much more plausible 43.2M x-tors/mm².
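The arithmetic behind both readings (Navi 10's 10.3 bln x-tors in 251 mm² taken from AMD's public specs):

```python
print(54e9 / 826 / 1e6)    # ≈ 65.4 Mx-tors/mm² if the 54 bln figure is literal
print(10.3e9 / 251 / 1e6)  # ≈ 41.0 Mx-tors/mm² for Navi 10, so ~1.6x denser
print(35.7e9 / 826 / 1e6)  # ≈ 43.2 Mx-tors/mm² under the "70% more than Volta" reading
```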
> Official Ampere deep dive from Nvidia: https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/

Interestingly, RT cores are omitted from the Tesla Ampere variants, the same way display connectors and the NVENC encoder are omitted. Which means NVIDIA will be customizing its GPUs heavily this time around.
> Unless tensor cores are being used in the normal shader pipeline, the gaming Ampere chip would be drastically different.

Yes indeed. I expect them to ditch a lot of the tensor cores in the gaming chips; FP64 will be gone too, along with a lot of the HPC silicon. RT cores will be back.
8 GPUs pushing 5 PFLOPS is around 625 TFLOPS per GPU; clearly they're talking about Tensor FLOPS and not general FP32 FLOPS.
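The division checks out against the announced tensor peaks:

```python
per_gpu = 5e15 / 8                       # 8-GPU DGX rated at 5 PFLOPS
print(per_gpu / 1e12, "TFLOPS per GPU")  # 625.0
# A100 FP16 Tensor Core peak: 312 TFLOPS dense, 624 TFLOPS with 2:4 sparsity.
# 625 ≈ 624, so the headline is sparse FP16 tensor math,
# not the 19.5 TFLOPS general FP32 rate.
```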
> Thinking ahead to consumer parts, obviously the FP64 cores will go bye-bye, but I don't see how they can cut back on the tensor cores with this SM architecture in a way that saves die space. But it looks like load/store throughput and L1 cache have doubled compared to the Turing SM, so that should lead to some IPC gains.

Double the L/S units, but still just one TMU quad. That SM layout is probably not conclusive for the consumer parts, particularly regarding the increased RT performance.
> but I don't see how they can cut back on the tensor cores with this SM architecture in a way that saves die space

Tensor Cores occupy a third of the SM now, which is a massive increase over Volta and Turing, so they will be restructuring the SM to add RT cores and minimize tensor space for the consumer chips.