Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. Benetanegia

    Benetanegia Regular

    Probably just like they replaced tensor cores with dedicated FP16 units in the GTX vs RTX Turing chips — they just can. This time around they'll just replace them with simpler tensor cores.

    IMO they'll most likely shrink it to 128KB, just as Turing's L1 got reduced to 96KB from Volta's 128KB. Still a nice improvement IMO, though not keeping those 192KB will likely make game/shader programmers cry, lol.

    Another thing that is likely to go is the massive 40MB (48MB on the full die?) L2 cache. 12MB I can see happening.

    Any other thoughts?
     
    ShaidarHaran likes this.
  2. ShaidarHaran

    ShaidarHaran hardware monkey Veteran

    That approach makes sense, although I would point out that if NV keeps the 4 tensor core/SM layout for consumer Ampere parts, that will result in a tensor core deficit compared to Turing. Perhaps they will carry forward the Turing tensor cores to consumer Ampere?

    I suspect you are correct about reducing the L1 (and certainly the L2) caches for consumer Ampere. 40MB of L2 (48MB on the full die, with 8MB disabled alongside the disabled SMs) has surely got to occupy a large die area. 12MB of L2 for GA102 would be double that of TU102 but still a massive saving over GA100; I think that's a reasonable estimate.

    Some quick math shows that a 48MB L2 cache should occupy somewhere around 44mm^2 of die area on TSMC's 7nm HP node, given a cell density of 64.98 MTr/mm^2 and an approximate transistor count of 60M per 1MB of cache. Reducing this to 12MB gives roughly 11mm^2, shaving ~33mm^2 off GA100's die for GA102. Not quite down to the 600-650mm^2 range yet, but I suspect the additional changes I mentioned previously ought to get us there.
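    The arithmetic is easy to check; using only the assumptions quoted in the post (60 MTr per 1MB of cache, 64.98 MTr/mm^2 density — both rough figures, not official numbers), it lands at roughly 44mm^2 for 48MB and 11mm^2 for 12MB:

```python
# Back-of-the-envelope L2 die-area estimate, using the figures from the post.
# Both constants are assumptions, not official Nvidia/TSMC data.

MTR_PER_MB = 60.0   # million transistors per 1MB of cache (incl. tags/overhead)
DENSITY = 64.98     # MTr/mm^2 on TSMC 7nm

def l2_area_mm2(megabytes: float) -> float:
    """Estimated die area of an L2 cache of the given size."""
    return megabytes * MTR_PER_MB / DENSITY

full = l2_area_mm2(48)  # GA100 full-die L2
cut = l2_area_mm2(12)   # hypothetical GA102 L2
print(f"48MB: {full:.1f} mm^2, 12MB: {cut:.1f} mm^2, saved: {full - cut:.1f} mm^2")
# → 48MB: 44.3 mm^2, 12MB: 11.1 mm^2, saved: 33.2 mm^2
```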
     
    Last edited: May 14, 2020
  3. It has almost the same die size (826 vs 815mm2) as GV100 and a lower clock speed (1410 vs 1530MHz), yet power consumption is 100W higher (400 vs 300W). Is it not on TSMC N7 EUV, or is it on the same process as AMD's RDNA 1 GPUs?
     
  4. del42sa

    del42sa Newcomer

    he says "optimized for Nvidia"
     
  5. Oh well...

    At least they released a non-completely-castrated lower cost version of the Xavier NX devkit, complete with GPIOs, I2C, MIPI-CSI, etc.
    It's a very interesting solution for IIoT development.
     
  6. Kaotik

    Kaotik Drunk Member Legend

    That's most likely more marketing than anything else, just like "12FFN" was
     
  7. Bondrewd

    Bondrewd Veteran

    Yeah because nothing works without lotsa DTCO these days.
     
  8. CarstenS

    CarstenS Legend Subscriber

    Where did you get that figure from?
     
  9. CarstenS

    CarstenS Legend Subscriber

    Not seeing where that assessment is coming from. I listened again closely to a whole minute before he mentions transistor budget and nothing indicates that he's talking about transistors.
    Given the large amount of tightly packable SRAM, I am inclined to give Nvidia a bit more leeway here, since they have usually had less dense chips than AMD on the same process. But +50% looks more like a sign that one figure, AMD's or Nvidia's, is not telling the whole truth.
     
  10. ShaidarHaran

    ShaidarHaran hardware monkey Veteran

    I haven't seen that figure officially mentioned anywhere, but if we assume the SM block diagram I posted is even remotely analogous to the real thing, 1/3 looks about right to me.
     
  11. Benetanegia

    Benetanegia Regular

    He's talking about AI performance. I'm 99% sure there's not a single mention of general compute performance in the entire video (he explicitly names training and inference performance in the same sentence as the transistor budget, for example), and maybe not in the entire presentation. It's pretty much all about AI and the new tensor core capability and performance. It makes much more sense that he's talking about the size of the relevant silicon rather than the whole chip. Especially since taking it the other way creates the problem of not aligning with the published transistor count (off by 17 billion, no less!) and forces you to come up with some sort of "someone is lying" scenario to explain the discrepancy.
     
    [IMG]

     
    Last edited: May 14, 2020
    sonen, nnunn, DavidGraham and 6 others like this.
  13. trinibwoy

    trinibwoy Meh Legend

    It’s not.
     
  14. DegustatoR

    DegustatoR Veteran

    You don't really need tensor cores specifically to run workloads that are targeted at tensor cores. They can easily cut a lot out of GA100's tensor core capabilities and just run that code on the general SIMDs. They can probably scale down the matrix size as well, making them slower.
    But with that being said, it's worth remembering that "gaming" GPUs are used by NV in Tesla parts targeting AI inferencing. So it's kinda possible that they won't cut anything but FP64 math here.
     
    pharma likes this.
  15. Did anyone else notice how the dGPUs in the Drive L5 "Robotaxi" are a smaller Ampere with 4 HBM2 stacks?

    [IMG]

    Nvidia is saying each of these GPUs can go as high as 400W.
     
    Lightman and Man from Atlantis like this.
  16. Kaotik

    Kaotik Drunk Member Legend

    Not quite. Each Orin can go at least up to 45W (the L2+ spec), but @Ryan Smith suggests they could go as high as 65-70W each. Then there's whatever that daughterboard on the upper edge is, too. The whole platform is supposed to be 800W.
     
  17. pharma

    pharma Veteran

  18. TheAlSpark

    TheAlSpark Moderator Moderator Legend

    2 pops? :wink2:
     
    pharma and nnunn like this.
  19. xpea

    xpea Regular

    [IMG]
    To put into perspective...
     
    Newguy, pharma, DavidGraham and 4 others like this.
  20. Frenetic Pony

    Frenetic Pony Regular

    The sparse INT support just looks like it's for poorly pruned deployment neural nets to begin with; how else could you zero half the weights with no effect on the outcome?

    Well, I suppose an easy way to optimize is highly tempting for a lot of devs, and locking them into an Nvidia-only supported mode is good for Nvidia.
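    For context, Ampere's sparse mode expects 2:4 structured sparsity — exactly two of every four consecutive weights zeroed — which is why a net has to be pruned to that specific pattern to benefit. A minimal magnitude-based pruning sketch in NumPy (purely illustrative, not Nvidia's actual tooling):

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude values in every group of 4 (2:4 sparsity)."""
    w = weights.reshape(-1, 4).copy()
    # indices of the two smallest |w| in each group of four
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.array([0.9, -0.1, 0.05, 0.7, -0.3, 0.2, 0.8, -0.01])
print(prune_2_4(w))
# → [ 0.9  0.   0.   0.7 -0.3  0.   0.8  0. ]
```

    The hardware then stores only the nonzero half plus 2-bit indices, which is where the 2x throughput claim comes from.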
     