Nvidia Ampere Discussion [2020-05-14]

but I don't see how they can cut back on the tensor cores with this SM architecture in a way that saves die space.

Probably the same way they replaced tensor cores with FP16 units in the GTX vs RTX Turing chips: they just can. They'll simply swap them for simpler tensor cores this time around.

But it looks like load/store throughput and L1 cache have doubled compared to the Turing SM, so that should lead to some IPC gains.

IMO they'll most likely shrink it to 128KB, as Turing got reduced to 96KB from Volta's 128KB. Still a nice improvement, though not keeping those 192KB will likely make game/shader programmers cry, lol.

Another thing that is likely to go is the massive 40MB (48MB on the full die?) L2 cache. 12MB I can see happening.

Any other thoughts?
 
Tensor cores occupy a third of the SM now, which is a massive increase over Volta and Turing, so they will be restructuring the SM to add in RT cores and minimize tensor-core area for the consumer chips.

Probably the same way they replaced tensor cores with FP16 units in the GTX vs RTX Turing chips: they just can. They'll simply swap them for simpler tensor cores this time around.

That approach makes sense, although I would point out that if NV keeps the 4-tensor-core-per-SM layout for consumer Ampere parts, that will result in a tensor core deficit compared to Turing. Perhaps they will carry the Turing tensor cores forward to consumer Ampere?
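
To put rough numbers on that concern (dense FP16 FMA rates per tensor core taken from the public Turing/Ampere whitepapers; the "consumer guess" line is purely the hypothetical layout described above, nothing announced):

```python
# Dense FP16 FMAs per clock, per tensor core, per the published whitepapers.
turing_tc_fma = 64    # 2nd-gen tensor core, 8 per Turing SM
ga100_tc_fma = 256    # 3rd-gen tensor core, 4 per GA100 SM

per_sm_turing = 8 * turing_tc_fma          # 512 FMA/clk per SM
per_sm_ga100 = 4 * ga100_tc_fma            # 1024 FMA/clk per SM

# Hypothetical consumer Ampere: keep the 4-per-SM layout but fall back
# to Turing-class tensor cores -> half of Turing's per-SM tensor rate.
per_sm_consumer_guess = 4 * turing_tc_fma  # 256 FMA/clk per SM
print(per_sm_turing, per_sm_ga100, per_sm_consumer_guess)
```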

IMO they'll most likely shrink it to 128KB, as Turing got reduced to 96KB from Volta's 128KB. Still a nice improvement, though not keeping those 192KB will likely make game/shader programmers cry, lol.

Another thing that is likely to go is the massive 40MB (48MB on the full die?) L2 cache. 12MB I can see happening.

Any other thoughts?

I suspect you are correct about reducing the L1 (and certainly the L2) caches for consumer Ampere. 40MB of L2 (48MB physically, with 8MB disabled along with the disabled SMs) has surely got to occupy a large die area. 12MB of L2 for GA102 would be double that of TU102 but still a massive saving over GA100; I think that's a reasonable estimate.

Some quick math shows that a 48MB L2 cache should occupy somewhere around 44mm^2 of die area on TSMC's 7nm HP node, given a cell density of 64.98 MTr/mm^2 and an approximate transistor count of 60M per 1MB of cache. Cutting that to 12MB gives roughly 11mm^2, shaving about 33mm^2 off GA100's die for GA102. Not quite down to the 600-650mm^2 range yet, but I suspect the additional changes I mentioned previously ought to get there.
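
Spelling that estimate out (same assumptions: 64.98 MTr/mm^2 cell density and roughly 60M transistors per 1MB of L2, both rough figures rather than measured ones):

```python
MTR_PER_MM2 = 64.98    # assumed cell density on TSMC 7nm, MTr/mm^2
MTR_PER_MB_L2 = 60.0   # rough transistor cost of 1MB of L2 (incl. tags/overhead), MTr

def l2_area_mm2(mbytes: float) -> float:
    """Approximate die area of an L2 cache of the given size."""
    return mbytes * MTR_PER_MB_L2 / MTR_PER_MM2

full_die = l2_area_mm2(48)   # ~44 mm^2 for GA100's full 48MB of L2
trimmed = l2_area_mm2(12)    # ~11 mm^2 for a hypothetical 12MB GA102 L2
print(full_die, trimmed, full_die - trimmed)  # saving on the order of 33 mm^2
```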
 
It has almost the same die size (826 vs 815 mm2) as GV100 and a lower clock speed (1410 vs 1530 MHz), yet power consumption is 100W higher (400 vs 300W). Is it not on TSMC N7 EUV, but rather the same process as AMD's RDNA 1 GPUs?
 
I sure hope this server GPU isn't the only thing they're going to show today, and we'll be able to get a glimpse of the new consumer lineup.

Oh well...

At least they released a non-completely-castrated lower cost version of the Xavier NX devkit, complete with GPIOs, I2C, MIPI-CSI, etc.
It's a very interesting solution for IIoT development.
 
I really think he's talking about the new tensor cores though.
I'm not seeing where that assessment is coming from. I listened closely again to the whole minute before he mentions the transistor budget, and nothing there indicates that he's talking about transistors.
The density of the entire chip doesn't surprise me one bit, as I've discussed several times in the forums. It's what I would expect from a node that is claimed to be 3x denser... Density has always been relatively close to what the foundry claimed before 7nm, so why would it be different now? It's always been AMD's density that didn't make any sense.
Given the large amount of tightly packable SRAM, I am inclined to give Nvidia a bit more leeway here, since they have usually had less dense chips than AMD on the same process. But at +50%, it's more likely that one figure, AMD's or Nvidia's, is not telling the whole truth.
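
For reference, a quick back-of-the-envelope density check using the commonly reported transistor counts and die sizes (treat the exact figures as approximate public numbers):

```python
# (transistors in billions, die area in mm^2), as commonly reported.
chips = {
    "GV100 (12FFN)": (21.1, 815),
    "GA100 (7nm)": (54.2, 826),
    "Navi 10 (N7)": (10.3, 251),
}

for name, (billions, area_mm2) in chips.items():
    density = billions * 1000 / area_mm2  # MTr/mm^2
    print(f"{name}: {density:.1f} MTr/mm^2")

# Roughly: GV100 ~26, GA100 ~66, Navi 10 ~41 MTr/mm^2 -- i.e. GA100 lands
# around 2.5x Volta's density and ~60% above Navi 10 on the same node class.
```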
 
I'm not seeing where that assessment is coming from. I listened closely again to the whole minute before he mentions the transistor budget, and nothing there indicates that he's talking about transistors.

He's talking about AI performance. I'm 99% sure there's not a single mention of general compute performance in the entirety of the video (he clearly states training and inference performance in the same sentence as he mentions transistor budget, for example), and maybe even the entirety of the presentation. It's pretty much all about AI and the new tensor core capability and performance. It makes much more sense that he's talking about the size of the relevant silicon rather than the whole chip. Especially when taking it the other way creates the problem of not aligning with the provided transistor count (off by 17 billion, no less!) and you have to come up with some sort of "someone is lying" scenario to explain the discrepancy.
 
Thinking ahead to consumer parts, obviously the FP64 cores will go bye-bye, but I don't see how they can cut back on the tensor cores with this SM architecture in a way that saves die space.
You don't really need tensor cores specifically to run workloads that are targeted at tensor cores. They can easily cut a lot out of GA100's tensor core capabilities and just run that code on the general SIMDs. They can probably scale down the matrix size as well, making them slower.
That being said, it's worth remembering that "gaming" GPUs are used by NV in Tesla parts targeting AI inferencing. So it's kinda possible that they won't cut anything but FP64 math here.
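
To illustrate (a minimal numpy sketch, not how the hardware actually schedules anything): the small matrix-multiply-accumulate tile a tensor core handles as one operation can always be reproduced with ordinary per-element FMAs on the regular SIMD lanes, it just takes far more instructions per tile.

```python
import numpy as np

# Illustrative 16x16x16 tile; real tensor core MMA tile shapes vary.
M = N = K = 16
A = np.random.rand(M, K).astype(np.float16)
B = np.random.rand(K, N).astype(np.float16)
C = np.zeros((M, N), dtype=np.float32)

# "Tensor core"-style path: one fused tile op, D = A @ B + C.
D_tile = A.astype(np.float32) @ B.astype(np.float32) + C

# "General SIMD" path: the same result built from plain scalar FMAs,
# which the ordinary CUDA cores can issue -- just many more of them.
D_fma = C.copy()
for k in range(K):
    for i in range(M):
        for j in range(N):
            D_fma[i, j] += float(A[i, k]) * float(B[k, j])

assert np.allclose(D_tile, D_fma, rtol=1e-3)
```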
 
Did anyone else notice how the dGPUs in the DRIVE L5 "Robotaxi" are a smaller Ampere with 4 HBM2 stacks?

[image: DRIVE Robotaxi platform board]


Nvidia is saying each of these GPUs can go as high as 400W.
 
Did anyone else notice how the dGPUs in the DRIVE L5 "Robotaxi" are a smaller Ampere with 4 HBM2 stacks?

[image: DRIVE Robotaxi platform board]


Nvidia is saying each of these GPUs can go as high as 400W.
Not quite. Each Orin can go up to at least 45W (the L2+ spec), but @Ryan Smith suggests they could go as high as 65-70W each. Then there's whatever that daughterboard along the upper edge is, too. The whole platform is supposed to be 800W.
 
The sparse INT support just looks like it's there for poorly pruned deployment neural nets to begin with; how else would you zero half the nodes with no effect on the outcome?

Well, I suppose an easy way to optimize is highly tempting for a lot of devs, and locking them into an Nvidia-only supported mode is good for Nvidia.
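
For context on what that pruning looks like: Ampere's sparsity feature expects a 2:4 structured pattern, i.e. two zeros in every group of four weights. A minimal magnitude-pruning sketch (hypothetical helper, not Nvidia's actual tooling, and with no retraining step):

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Force a 2:4 structured sparsity pattern: in every group of four
    consecutive weights, zero the two smallest-magnitude values."""
    w = weights.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(w), axis=1)[:, :2]  # two smallest per group
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

dense = np.random.randn(8, 16).astype(np.float32)
sparse = prune_2_4(dense)
assert np.count_nonzero(sparse) == dense.size // 2  # exactly half zeroed
```

Whether a given deployed net can actually take that 50% cut without an accuracy hit (and without fine-tuning) is exactly the question above.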
 