Nvidia Ampere Discussion [2020-05-14]

Benetanegia · Sep 2, 2020

troyan said:
nVidia explains how they have achieved 2x FP32 throughput, enjoy:

https://twitter.com/x/status/1301246897202573319

Heh! So we nailed it several months ago.

CarstenS · Sep 2, 2020

troyan said:
nVidia explains how they have achieved 2x FP32 throughput, enjoy:

Good that it's finally in the open.

And while we're at it, here's the original link:

https://www.reddit.com/r/nvidia/comments/iko4u7/geforce_rtx_30series_community_qa_submit_your/g3qkzva

Moreover, I can now also say that HDMI 2.1 is the full 48 Gbps:

https://www.reddit.com/r/nvidia/comments/iko4u7/geforce_rtx_30series_community_qa_submit_your/g3qj39p

trinibwoy · Sep 2, 2020

What does this actually mean? Instead of a crossbar for data transfers between the GPCs and ROPs, that traffic now has to route through GPCs?

"Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs."

Also, how is the combined FP32/INT32 pipeline any different to Pascal/Maxwell/Kepler etc? This doesn't sound like a new thing.

Doing a bit of ROP math, assuming GA102 has 84 SMs total.

96 ROPs = 6 GPCs = 14 SMs per GPC or
192 ROPs = 12 GPCs = 7 SMs per GPC.

If it's the latter that's going to be a chunky increase in raw rasterization throughput.
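
As a rough sketch of the ROP math above (Python): the 16 ROPs per GPC come from the quoted "two ROP partitions, eight ROP units each", the 84-SM total is the post's own assumption, and `gpc_layout` is just a hypothetical helper name.

```python
# ROPs per GPC, taken from the quoted whitepaper sentence:
# two ROP partitions per GPC, eight ROP units per partition.
ROP_PARTITIONS_PER_GPC = 2
ROPS_PER_PARTITION = 8
ROPS_PER_GPC = ROP_PARTITIONS_PER_GPC * ROPS_PER_PARTITION  # 16


def gpc_layout(total_rops, total_sms=84):
    """Return (GPC count, SMs per GPC) for a given total ROP count.

    total_sms=84 is the assumption made in the post, not a confirmed spec.
    """
    gpcs, remainder = divmod(total_rops, ROPS_PER_GPC)
    if gpcs == 0 or remainder != 0:
        raise ValueError("total_rops must be a positive multiple of 16")
    return gpcs, total_sms / gpcs


for rops in (96, 192):
    gpcs, sms_per_gpc = gpc_layout(rops)
    print(f"{rops} ROPs -> {gpcs} GPCs -> {sms_per_gpc:g} SMs per GPC")
```

Both candidate configurations divide evenly, which is why they look plausible from the numbers alone.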

Benetanegia · Sep 2, 2020

Not sure why doubling cache bandwidth was needed tho? In Turing there were 2 datapaths too, from what was told, one for FP, the other for INT, with Ampere the INT one just gets shared with FP, right?
I guess in Turing the INT datapath was like a second-rate bandwidth consumer because it would be idling more often than not anyway? And now that the second datapath will consume at all times, a doubling is required? Any thoughts?

trinibwoy · Sep 2, 2020

Benetanegia said:
Not sure why doubling cache bandwidth was needed tho? In Turing there were 2 datapaths too, from what was told, one for FP, the other for INT, with Ampere the INT one just gets shared with FP, right?
I guess in Turing the INT datapath was like a second-rate bandwidth consumer because it would be idling more often than not anyway? And now that the second datapath will consume at all times, a doubling is required? Any thoughts?

Good point, 32 bits is 32 bits. The “new” FP32 ALUs should be able to use the existing INT32 data path. Weird.

Scott_Arm · Sep 2, 2020

trinibwoy said:
What does this actually mean? Instead of a crossbar for data transfers between the GPCs and ROPs, that traffic now has to route through GPCs?

"Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs."

Also, how is the combined FP32/INT32 pipeline any different to Pascal/Maxwell/Kepler etc? This doesn't sound like a new thing.

Doing a bit of ROP math, assuming GA102 has 84 SMs total.

96 ROPs = 6 GPCs = 14 SMs per GPC or
192 ROPs = 12 GPCs = 7 SMs per GPC.

If it's the latter that's going to be a chunky increase in raw rasterization throughput.

Turing 1 FP32 + 1 INT32
Ampere 1 FP32 + 1 (FP32 or INT32)

So Ampere can have double the FP32 rate of Turing, but when INT32 is mixed it is the same as Turing. So mileage will vary based on workload.
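
A minimal sketch of that comparison (Python), in the same relative units as the two lines above: one unit is whatever a single pipe issues per clock. The function name and the `int_fraction` parameter are hypothetical, and this is a toy model of the description in this thread, not of the real scheduler.

```python
def fp32_rate(arch, int_fraction):
    """Relative FP32 issue rate per clock.

    int_fraction: share of clocks (0.0 to 1.0) in which the second pipe
    is busy with INT32 work instead of FP32.
    """
    if not 0.0 <= int_fraction <= 1.0:
        raise ValueError("int_fraction must be between 0 and 1")
    if arch == "turing":
        # 1 FP32 pipe + 1 INT32-only pipe: FP32 rate never exceeds 1.
        return 1.0
    if arch == "ampere":
        # 1 FP32 pipe + 1 pipe that does FP32 whenever it is not doing INT32.
        return 1.0 + (1.0 - int_fraction)
    raise ValueError(f"unknown arch: {arch}")


for frac in (0.0, 0.5, 1.0):
    print(frac, fp32_rate("turing", frac), fp32_rate("ampere", frac))
```

With no INT32 in the mix the Ampere figure is double Turing's, and with the second pipe fully occupied by INT32 the two are equal.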

Edit: Turing ROPs were tied to the memory controllers (1 ROP partition per controller), but now they're tied to the GPC (2 ROP partitions per GPC). I'm not sure what effect that has on the data path.

DegustatoR · Sep 2, 2020

Three SIMDs with only two being accessible each clock means only good things for utilization - no scheduling bubbles to speak of. INTs are very simple in h/w so them being idle doesn't mean much either. This is an obvious and the most straightforward enhancement of the Turing SM. The driver compiler will get a bit more complex, which means that good old "game ready" releases will likely result in higher (or to be precise - any at all) performance gains again. But otherwise this points to FP32 utilization likely being on par with Turing: Ampere will either go with 32 FP32 launches or 16 FP32 + 16 INT32 when there aren't enough of the former in a warp. What will kinda suffer is INT32 h/w utilization but who cares.

CarstenS · Sep 2, 2020

trinibwoy said:
What does this actually mean? Instead of a crossbar for data transfers between the GPCs and ROPs, that traffic now has to route through GPCs?

"Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs."

Also, how is the combined FP32/INT32 pipeline any different to Pascal/Maxwell/Kepler etc? This doesn't sound like a new thing.

Doing a bit of ROP math, assuming GA102 has 84 SMs total.

96 ROPs = 6 GPCs = 14 SMs per GPC or
192 ROPs = 12 GPCs = 7 SMs per GPC.

If it's the latter that's going to be a chunky increase in raw rasterization throughput.

In Turing, there was a set of FP32 ALUs and a set of INT32 ALUs, and the two could work in parallel. With Ampere, the INT32-set was expanded to be able to do FP32 as well. That's my take on it.
GA102, judging from the die shot, should be at 7 GPCs, so 7x8x2=112 ROPs with full config. Probably, Nvidia can enable/disable ROPs independently from memory partitions now.
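
To put numbers on that (a Python sketch under stated assumptions): the GPC-based count uses the 7 GPCs guessed from the die shot, while the bus-based count assumes the Turing-style coupling of 8 ROPs per 32-bit memory controller and a 384-bit bus purely as an example. Both function names are hypothetical.

```python
def rops_from_gpcs(gpcs, partitions_per_gpc=2, rops_per_partition=8):
    """ROP count when ROPs live in the GPCs (the GA10x arrangement quoted above)."""
    return gpcs * partitions_per_gpc * rops_per_partition


def rops_from_bus(bus_width_bits, controller_width_bits=32, rops_per_controller=8):
    """ROP count when ROPs are tied to memory controllers (assumed Turing-style)."""
    controllers = bus_width_bits // controller_width_bits
    return controllers * rops_per_controller


print("GPC-coupled, 7 GPCs:      ", rops_from_gpcs(7))
print("Bus-coupled, 384-bit bus: ", rops_from_bus(384))
```

The two counts differ, which only works out if ROPs can be configured independently of the memory partitions.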

I wonder why no one's talking about GA104 already...

trinibwoy · Sep 2, 2020

Scott_Arm said:
Turing 1 FP32 + 1 INT32
Ampere 1 FP32 + 1 (FP32 or INT32)

So Ampere can have double the FP32 rate of Turing, but when INT32 is mixed it is the same as Turing. So mileage will vary based on workload.

CarstenS said:
In Turing, there was a set of FP32 ALUs and a set of INT32 ALUs, and the two could work in parallel. With Ampere, the INT32-set was expanded to be able to do FP32 as well. That's my take on it.
GA102, judging from the die shot, should be at 7 GPCs, so 7x8x2=112 ROPs with full config. Probably, Nvidia can enable/disable ROPs independently from memory partitions now.

I wonder why no one's talking about GA104 already...

Yeah I get that. I was asking about the ROPs being inside the GPC part.

DegustatoR · Sep 2, 2020

CarstenS said:
With Ampere, the INT32-set was expanded to be able to do FP32 as well.

This seems like a logical option but I wonder if having them separate in h/w would actually be _less_ complex than having one SIMD capable of FP32+INT32.

trinibwoy · Sep 2, 2020

CarstenS said:
GA102, judging from the die shot, should be at 7 GPCs, so 7x8x2=112 ROPs with full config. Probably, Nvidia can enable/disable ROPs independently from memory partitions now.

Oh, that's interesting. It must mean they decoupled ROPs from the memory controllers so the number of ROPs is no longer determined by bus width.

CarstenS · Sep 2, 2020

trinibwoy said:
Oh, that's interesting. It must mean they decoupled ROPs from the memory controllers so the number of ROPs is no longer determined by bus width.

That's what Tamasi's answer sounds like to me, yeah.

CarstenS · Sep 2, 2020

DegustatoR said:
This seems like a logical option but I wonder if having them separate in h/w would actually be _less_ complex than having one SIMD capable of FP32+INT32.

You mean like FP32+FP32+INT32? I would have approached it the other way around: have both blocks be able to do FP32 or INT32. Both would be a little bit bigger, but you'd have more opportunities to schedule.

DegustatoR · Sep 2, 2020

CarstenS said:
You mean like FP32+FP32+INT32? I would have approached it the other way around: have both blocks be able to do FP32 or INT32. Both would be a little bit bigger, but you'd have more opportunities to schedule.

Yeah. Three 16-wide SIMDs: FP32+FP32+INT32. This would explain the reasoning behind the extension of data paths at least as otherwise you'd have the same 64 bits going to the same two SIMDs with one or two of them being capable of FP32+INT32 now.
Both of them being FP32+INT32 capable would be easy on the driver team and thus I kinda doubt that NV went this way )) It would also mean that there would be a lot of unavoidably idle h/w each clock - again, not something which NV is known for. So I'd wager that it's either FP32+FP32+INT32 or at least FP32/INT32+FP32.

In any case, I don't see how any of these arrangements would create issues with FP32 utilization, especially if we consider that next gen console gaming code will be 32 or 64 wide FP32 - which should fit nicely onto Ampere FP32 pipeline and saturate it to 100%.

Btw, do we know if gaming Ampere will keep TF32 support of GA100 on its tensor cores?

Benetanegia · Sep 2, 2020

trinibwoy said:
Also, how is the combined FP32/INT32 pipeline any different to Pascal/Maxwell/Kepler etc? This doesn't sound like a new thing.

I think the difference is in power gating granularity. Being able to switch off part of it versus only being able to switch the entire thing off when doing nothing. Specifically, switching off the INT SIMDs when only doing FP. Because iirc from some (twitter?) posts at the time, the FP unit actually only takes care of the most relevant (most used) FP instructions, while the so-called INT units also do some fringe, less used FP instructions. I suppose this way the FP units become the leaner (fewer micro-instructions supported), most used units that are enabled most of the time, and the INT units are the fat units that get switched off most of the time.

CarstenS · Sep 2, 2020

I was confused by the mention of data paths at first, too. But I think he just meant the blocks the warps are assigned to.
FP32+[FP32|INT32] is still my go-to choice with the following reasoning from available material:
Evidence 1: https://www.nvidia.com/content/dam/...ure/NVIDIA-Turing-Architecture-Whitepaper.pdf
Page 13, concurrent execution: 36x INT32 for every 100 FP32 over a variety of gaming workloads

Going FP32+[FP32|INT32] would obviously reduce performance compared to FP32+FP32+INT32.

Evidence 2:

Chart at timestamp 19:05 shows 3080 vs. 2080 perf in 4k which for the games, i.e. workloads using aforementioned mix, is at roughly 1.6x to 1.7x

200 (2x FP32) - 36 (INT32 share) is 164, which is almost exactly where the perf seems to be.
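
Spelled out as a quick Python sketch, this is only the back-of-envelope arithmetic from the post: it ignores SM counts, clocks and memory bandwidth, and `estimated_speedup` is a hypothetical name.

```python
def estimated_speedup(int_per_100_fp=36):
    """Relative FP32 throughput vs. Turing (= 100), per the post's reasoning.

    Start from 200 (2x FP32) and subtract the INT32 share that occupies
    the shared FP32/INT32 pipe. 36 is the whitepaper figure cited above.
    """
    return (200 - int_per_100_fp) / 100


speedup = estimated_speedup()
print(f"estimated: {speedup:.2f}x")
# The chart reading quoted in the post is roughly 1.6x to 1.7x.
print("within 1.6x to 1.7x:", 1.6 <= speedup <= 1.7)
```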

trinibwoy · Sep 2, 2020

Benetanegia said:
I think the difference is in power gating granularity. Being able to switch off part of it versus only being able to switch the entire thing off when doing nothing. Specifically, switching off the INT SIMDs when only doing FP. Because iirc from some (twitter?) posts at the time, the FP unit actually only takes care of the most relevant (most used) FP instructions, while the so-called INT units also do some fringe, less used FP instructions. I suppose this way the FP units become the leaner (fewer micro-instructions supported), most used units that are enabled most of the time, and the INT units are the fat units that get switched off most of the time.

That's a great point. I think FP units are more complex/fatter than INT units though.

PSman1700 · Sep 2, 2020

trinibwoy said:
I prefer the more sober approach.

Yea, didn't expect anything else

Goes for every product though, I can find a lot of sober approaches to PS5, AMD, NV, MS if I want to.

trinibwoy · Sep 2, 2020

CarstenS said:
FP32+[FP32|INT32] is still my go-to choice

Yeah it has to be that. I suspect the reason for increasing L1 throughput is that the INT32 pipe was idle much of the time so it didn't consume as much operand fetch bandwidth. With 2 FP32 pipes running full tilt you need that fatter memory pipe.

Davros · Sep 2, 2020

If no one's posted it: Digital Foundry benchmarked a 3080 (on an Intel board, so no PCIe 4.0). It performed 70-80% faster than a 2080 in non-ray-tracing scenarios.
