Nvidia Ampere Discussion [2020-05-14]

Nvidia explains how they have achieved 2x FP32 throughput, enjoy:
[attached image]


Heh! So we nailed it several months ago.
 
What does this actually mean? Instead of a crossbar for data transfers between the GPCs and ROPs, that traffic now has to route through GPCs?

"Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs."

Also, how is the combined FP32/INT32 pipeline any different to Pascal/Maxwell/Kepler etc? This doesn't sound like a new thing.

Doing a bit of ROP math, assuming GA102 has 84 SMs total.

96 ROPs = 6 GPCs = 14 SMs per GPC or
192 ROPs = 12 GPCs = 7 SMs per GPC.

If it's the latter, that's going to be a chunky increase in raw rasterization throughput.
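
A quick sketch of that math, assuming the 84 SMs above and 16 ROPs per GPC (two partitions of eight, per the whitepaper quote):

```python
# Hypothetical GA102 configs: 84 SMs total, 16 ROPs per GPC assumed.
TOTAL_SMS = 84
ROPS_PER_GPC = 2 * 8

for gpcs in (6, 12):
    print(f"{gpcs} GPCs: {gpcs * ROPS_PER_GPC} ROPs, {TOTAL_SMS // gpcs} SMs per GPC")

# 6 GPCs: 96 ROPs, 14 SMs per GPC
# 12 GPCs: 192 ROPs, 7 SMs per GPC
```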
 
Not sure why doubling cache bandwidth was needed, though. In Turing there were 2 datapaths too, from what we were told, one for FP, the other for INT; with Ampere the INT one just gets shared with FP, right?
I guess in Turing the INT datapath was a second-rate bandwidth consumer because it would be idling more often than not anyway? And now that the second datapath will consume at all times, a doubling is required? Any thoughts?
 
Not sure why doubling cache bandwidth was needed, though. In Turing there were 2 datapaths too, from what we were told, one for FP, the other for INT; with Ampere the INT one just gets shared with FP, right?
I guess in Turing the INT datapath was a second-rate bandwidth consumer because it would be idling more often than not anyway? And now that the second datapath will consume at all times, a doubling is required? Any thoughts?

Good point, 32 bits is 32 bits. The “new” FP32 ALUs should be able to use the existing INT32 data path. Weird.
 
What does this actually mean? Instead of a crossbar for data transfers between the GPCs and ROPs, that traffic now has to route through GPCs?

"Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs."

Also, how is the combined FP32/INT32 pipeline any different to Pascal/Maxwell/Kepler etc? This doesn't sound like a new thing.

Doing a bit of ROP math, assuming GA102 has 84 SMs total.

96 ROPs = 6 GPCs = 14 SMs per GPC or
192 ROPs = 12 GPCs = 7 SMs per GPC.

If it's the latter, that's going to be a chunky increase in raw rasterization throughput.

Turing 1 FP32 + 1 INT32
Ampere 1 FP32 + 1 (FP32 or INT32)

So Ampere can have double the FP32 rate of Turing, but when INT32 is mixed it is the same as Turing. So mileage will vary based on workload.
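
A minimal sketch of that reading, assuming two 16-wide datapaths per SM partition; the pipe layout here illustrates the interpretation above, not an NVIDIA-documented structure:

```python
# Turing: a dedicated FP32 pipe and a dedicated INT32 pipe.
# Ampere (per the reading above): the second pipe can do FP32 or INT32.
TURING_PIPES = (("FP32",), ("INT32",))
AMPERE_PIPES = (("FP32",), ("FP32", "INT32"))

def fp32_pipes_free(pipes, int32_issuing):
    """Count pipes still available for FP32 this clock, given whether an
    INT32 instruction is also issuing (it occupies one INT32-capable pipe)."""
    free = 0
    int32_slot_needed = int32_issuing
    for caps in pipes:
        if int32_slot_needed and "INT32" in caps:
            int32_slot_needed = False   # this pipe is busy with the INT32 op
        elif "FP32" in caps:
            free += 1
    return free

# Pure FP32 work: Ampere doubles the FP32 rate.
print(fp32_pipes_free(TURING_PIPES, False), fp32_pipes_free(AMPERE_PIPES, False))  # 1 2
# Mixed FP32 + INT32: both architectures issue one FP32 alongside the INT32 op.
print(fp32_pipes_free(TURING_PIPES, True), fp32_pipes_free(AMPERE_PIPES, True))    # 1 1
```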

Edit: Turing ROPs were tied to the memory controllers (1 per controller), but now they're tied to the GPC (2 per GPC). I'm not sure what effect that has on the data path.
 
Three SIMDs with only two being accessible each clock means only good things for utilization - no scheduling bubbles to speak of. INTs are very simple in h/w so them being idle doesn't mean much either. This is an obvious and the most straightforward enhancement of the Turing SM. The driver compiler will get a bit more complex, which means that good old "game ready" releases will likely result in higher (or, to be precise, any at all) performance gains again. But otherwise this points to FP32 utilization likely being on par with Turing: Ampere will either go with 32 FP32 launches or 16 FP32 + 16 INT32 when there aren't enough of the former in a warp. What will kinda suffer is INT32 h/w utilization, but who cares.
 
What does this actually mean? Instead of a crossbar for data transfers between the GPCs and ROPs, that traffic now has to route through GPCs?

"Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs."

Also, how is the combined FP32/INT32 pipeline any different to Pascal/Maxwell/Kepler etc? This doesn't sound like a new thing.

Doing a bit of ROP math, assuming GA102 has 84 SMs total.

96 ROPs = 6 GPCs = 14 SMs per GPC or
192 ROPs = 12 GPCs = 7 SMs per GPC.

If it's the latter, that's going to be a chunky increase in raw rasterization throughput.

In Turing, there was a set of FP32 ALUs and a set of INT32 ALUs, and each could work in parallel. With Ampere, the INT32 set was expanded to be able to do FP32 as well. That's my take on it.
GA102, judging from the die shot, should be at 7 GPCs, so 7x8x2 = 112 ROPs in the full config. Probably Nvidia can enable/disable ROPs independently of memory partitions now.

I wonder why no one's talking about GA104 already... ;)
 
Turing 1 FP32 + 1 INT32
Ampere 1 FP32 + 1 (FP32 or INT32)

So Ampere can have double the FP32 rate of Turing, but when INT32 is mixed it is the same as Turing. So mileage will vary based on workload.

In Turing, there was a set of FP32 ALUs and a set of INT32 ALUs, and each could work in parallel. With Ampere, the INT32 set was expanded to be able to do FP32 as well. That's my take on it.
GA102, judging from the die shot, should be at 7 GPCs, so 7x8x2 = 112 ROPs in the full config. Probably Nvidia can enable/disable ROPs independently of memory partitions now.

I wonder why no one's talking about GA104 already... ;)

Yeah I get that. I was asking about the ROPs being inside the GPC part.
 
GA102, judging from the die shot, should be at 7 GPCs, so 7x8x2 = 112 ROPs in the full config. Probably Nvidia can enable/disable ROPs independently of memory partitions now.

Oh, that's interesting. It must mean they decoupled ROPs from the memory controllers so the number of ROPs is no longer determined by bus width.
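
A rough sketch of that decoupling, assuming Turing's eight ROPs per 32-bit memory controller and the 16-ROPs-per-GPC arrangement from the whitepaper quote (the 7-GPC GA102 figure is the die-shot guess above, not a confirmed spec):

```python
# Where the ROP count comes from under each scheme.
ROPS_PER_MC = 8          # Turing: ROPs attached to each 32-bit memory controller
ROPS_PER_GPC = 2 * 8     # GA10x: two partitions of eight ROPs inside each GPC

def turing_rops(bus_width_bits):
    # Bus width fixes the controller count, which fixes the ROP count.
    return (bus_width_bits // 32) * ROPS_PER_MC

def ga10x_rops(gpc_count):
    # ROP count now follows the GPC configuration instead of the bus width.
    return gpc_count * ROPS_PER_GPC

print(turing_rops(384))  # TU102's 384-bit bus -> 96 ROPs
print(ga10x_rops(7))     # speculative 7-GPC full GA102 -> 112 ROPs
```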
 
This seems like a logical option but I wonder if having them separate in h/w would actually be _less_ complex than having one SIMD capable of FP32+INT32.
You mean like FP32+FP32+INT32? I would have approached it the other way around: have both blocks be able to do FP32 or INT32. Both would be a little bit bigger, but you'd have more opportunities to schedule.
 
You mean like FP32+FP32+INT32? I would have approached it the other way around: have both blocks be able to do FP32 or INT32. Both would be a little bit bigger, but you'd have more opportunities to schedule.
Yeah. Three 16-wide SIMDs: FP32+FP32+INT32. This would explain the reasoning behind the extension of data paths at least, as otherwise you'd have the same 64 bits going to the same two SIMDs, with one or two of them now being capable of FP32+INT32.
Both of them being FP32+INT32 capable would be easy on the driver team and thus I kinda doubt that NV went this way )) It would also mean that there would be a lot of unavoidably idle h/w each clock - again, not something which NV is known for. So I'd wager that it's either FP32+FP32+INT32 or at least FP32/INT32+FP32.

In any case, I don't see how any of these arrangements would create issues with FP32 utilization, especially if we consider that next-gen console gaming code will be 32- or 64-wide FP32 - which should fit nicely onto the Ampere FP32 pipeline and saturate it to 100%.

Btw, do we know if gaming Ampere will keep TF32 support of GA100 on its tensor cores?
 
Also, how is the combined FP32/INT32 pipeline any different to Pascal/Maxwell/Kepler etc? This doesn't sound like a new thing.

I think the difference is in power-gating granularity: being able to switch off part of it versus only being able to switch the entire thing off when doing nothing. Specifically, switching off the INT SIMDs when only doing FP. Because iirc from some (Twitter?) posts at the time, the FP unit actually only takes care of the most relevant (more used) FP instructions, while the so-called INT units also handle some fringe, less-used FP instructions. I suppose this way the FP units become the leaner (fewer micro-instructions supported), most-used units that are enabled most of the time, and the INT units are the fat units that get switched off most of the time.
 
I was confused by the mention of data paths at first, too. But I think he just meant the blocks the warps are assigned to.
FP32+[FP32|INT32] is still my go-to choice with the following reasoning from available material:
Evidence 1: https://www.nvidia.com/content/dam/...ure/NVIDIA-Turing-Architecture-Whitepaper.pdf
Page 13, concurrent execution: ~36 INT32 instructions for every 100 FP32 instructions over a variety of gaming workloads.

Going FP32+[FP32|INT32] would obviously reduce performance compared to FP32+FP32+INT32.

Evidence 2:
The chart at timestamp 19:05 shows 3080 vs. 2080 performance at 4K, which for the games - i.e. workloads using the aforementioned mix - is at roughly 1.6x to 1.7x.

200 (2x FP32) - 36 (INT32 share) is 164, which is almost exactly where the performance seems to land.
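
Spelling that accounting out, using the whitepaper's ~36:100 INT32:FP32 mix; this just restates the estimate above, it isn't a measured result:

```python
# Ampere's peak FP32 rate is 2x Turing's, but the shared pipe loses slots
# to the ~36 INT32 instructions issued per 100 FP32 (Turing whitepaper mix).
INT32_PER_100_FP32 = 36

turing_fp32 = 100                       # baseline FP32 issue slots
ampere_peak = 2 * turing_fp32           # 200 slots if both pipes ran FP32
ampere_effective = ampere_peak - INT32_PER_100_FP32  # 164 slots left for FP32

print(ampere_effective / turing_fp32)   # 1.64 -> within the 1.6x-1.7x seen in the chart
```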
 
I think the difference is in power-gating granularity: being able to switch off part of it versus only being able to switch the entire thing off when doing nothing. Specifically, switching off the INT SIMDs when only doing FP. Because iirc from some (Twitter?) posts at the time, the FP unit actually only takes care of the most relevant (more used) FP instructions, while the so-called INT units also handle some fringe, less-used FP instructions. I suppose this way the FP units become the leaner (fewer micro-instructions supported), most-used units that are enabled most of the time, and the INT units are the fat units that get switched off most of the time.

That's a great point. I think FP units are more complex/fatter than INT units though.
 
FP32+[FP32|INT32] is still my go-to choice

Yeah, it has to be that. I suspect the reason for increasing L1 throughput is that the INT32 pipe was idle much of the time, so it didn't consume as much operand fetch bandwidth. With two FP32 pipes running full tilt, you need that fatter memory pipe.
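
One way to see the operand-bandwidth point, assuming 16-wide pipes and FMA-style instructions reading three 4-byte operands per lane (illustrative numbers, not NVIDIA figures):

```python
# Rough operand-fetch demand per SM partition per clock under the assumptions above.
LANES = 16
OPERANDS_PER_FMA = 3
BYTES_PER_OPERAND = 4

per_pipe = LANES * OPERANDS_PER_FMA * BYTES_PER_OPERAND   # 192 bytes/clock per busy pipe

turing_typical = 1 * per_pipe   # INT32 pipe often idle (and INT ops tend to read fewer operands)
ampere_peak    = 2 * per_pipe   # both pipes issuing FP32 FMAs every clock

print(turing_typical, ampere_peak)  # 192 vs 384 bytes/clock of operand traffic
```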
 