Nvidia Ampere Discussion [2020-05-14]

I wonder if the beta Automatic Tuning feature could accomplish a similar result.
geforce-experience-new-performance-tuning-and-monitoring-options.png

https://www.nvidia.com/en-us/geforc...tform/#automatic-tuning-in-geforce-experience
that's a nice discovery, 'cos right now MSi Afterburner has auto OC, and it is working well for me. MSI Afterburner seems to use some kind of undervolt. But if it adapts to your computer's PSU -dunno how-, that's another reason to leave MSi Afterburner -maybe-.
 
to sum some things up.

- CUDA Cores: int32 and fp32 operations are indifferently performed by a CUDA core. That's HUGE!! as shown by the graph below --no specific cuda cores for int32 and fp32 operations. Means: no idle int32 or fp32 cores because cuda cores either perform floating point or integer operations. (games perform around 20-30% operations in int32 format)

- Where you will notice the generational leap is in the fact that new GPUs perform muh better raytracing denoising.

- IMO, where series 3000 destroys previous generations is in RT games, which is the way to go.

- The 3070 performs as the 2080Ti despite having 20 teraflops vs 13 teraflops. This shows how different they are now -this imho means again, that where they truly shine is at RT games.

jm55Use.png
 
that's a nice discovery, 'cos right now MSi Afterburner has auto OC, and it is working well for me. MSI Afterburner seems to use some kind of undervolt. But if it adapts to your computer's PSU -dunno how-, that's another reason to leave MSi Afterburner -maybe-.

Yah, one way people undervolt is to let oc scanner in afterburner run and generate a curve. Then they know what frequency is stable at each voltage. So you kind flatten out from whatever frequency you have as your target. I would not be surprised if you can push things a little bit more manually, but it's a good starting point. You should be able to do the same with the power limit adjusted up or down. This nvidia tool looks like it can do the same thing, but if it requires geforce experience I'm not sure that I want it.
 
to sum some things up.

- CUDA Cores: int32 and fp32 operations are indifferently performed by a CUDA core. That's HUGE!! as shown by the graph below --no specific cuda cores for int32 and fp32 operations. Means: no idle int32 or fp32 cores because cuda cores either perform floating point or integer operations. (games perform around 20-30% operations in int32 format)

- Where you will notice the generational leap is in the fact that new GPUs perform muh better raytracing denoising.

- IMO, where series 3000 destroys previous generations is in RT games, which is the way to go.

- The 3070 performs as the 2080Ti despite having 20 teraflops vs 13 teraflops. This shows how different they are now -this imho means again, that where they truly shine is at RT games.

Not true. Supported modes are only 128+0 or 64+64.
 
Not true. Supported modes are only 128+0 or 64+64.
Not sure about that. Isn't it really per SM partition? It should be at least. 32+0 or 16+16. I'm splitting hairs like this, because that's leaving the pure FP32 block idle only when you have a 1:2 ratio of FP vs. INT (and underutilized at anything less than 1:1).

In other news, and because it was mentioned earlier: This makes paper-TFlops IMO actually more comparable between Nvidia and AMD('s current gen), since now INT-ops go against the TFlops budget on both sides, where as they were kind of free on Nvidia with Turing, making it overperform wrt to its TFlops rating.
 
Yah, one way people undervolt is to let oc scanner in afterburner run and generate a curve. Then they know what frequency is stable at each voltage. So you kind flatten out from whatever frequency you have as your target. I would not be surprised if you can push things a little bit more manually, but it's a good starting point. You should be able to do the same with the power limit adjusted up or down.
I do it like this :
Main problem with TDP slider is inefficiency (since it doesn't throttle voltage as much as it can [for GPU to be stable], performance drop is also higher than it should be under maximum load).
 
Not sure about that. Isn't it really per SM partition? It should be at least. 32+0 or 16+16. I'm splitting hairs like this, because that's leaving the pure FP32 block idle only when you have a 1:2 ratio of FP vs. INT (and underutilized at anything less than 1:1).
Shouldn't that split be like this :
128:0 = 4x16 [FP32 fixed] + 4x16 FP32 ["mixed" units]
64:64 = 4x16 [FP32 fixed] + 4x16 INT32 ["mixed" units]
?
 
Shouldn't that split be like this :
128:0 = 4x16 [FP32 fixed] + 4x16 FP32 ["mixed" units]
64:64 = 4x16 [FP32 fixed] + 4x16 INT32 ["mixed" units]
?

No, instruction dispatch is handled independently each clock in each of the 4 SM partitions. Each partition has 16 FP32 fixed + 16 FP32/INT32 mixed pipelines. The mix of instructions each clock can be different between partitions in the same SM.
 
Shouldn't that split be like this :
128:0 = 4x16 [FP32 fixed] + 4x16 FP32 ["mixed" units]
64:64 = 4x16 [FP32 fixed] + 4x16 INT32 ["mixed" units]
?
Each SM-Partition (i.e. block consisting of 256kiB RF, 16xINT32, 16xFP32, 4xL/S, 4xSFU) has its own scheduler and dispatch, so they should be able to operate independently.
 
Not sure about that. Isn't it really per SM partition? It should be at least. 32+0 or 16+16. I'm splitting hairs like this, because that's leaving the pure FP32 block idle only when you have a 1:2 ratio of FP vs. INT (and underutilized at anything less than 1:1).

In other news, and because it was mentioned earlier: This makes paper-TFlops IMO actually more comparable between Nvidia and AMD('s current gen), since now INT-ops go against the TFlops budget on both sides, where as they were kind of free on Nvidia with Turing, making it overperform wrt to its TFlops rating.
I wasn't sure how granular the split could be but Computerbase.de states its either 128FP or 64FP + 64INT. I figured it was a scheduling limitation of some type and would help explain the performance scaling deficit. With finer grained scheduling and higher utilization i would have expected the dramatic increase in core counts to result in a bigger performance increase even with other possible bottlenecks.
 
Last edited:
I wasn't sure how granular the split could be but Computerbase.de states its either 128FP or 64FP + 64INT. I figured it was a scheduling limitation of some type and would help explain the performance scaling deficit. With finer grained scaling and higher utilization i would have expected the dramatic increase in core counts to result in a bigger performance increase even with other possible bottlenecks.
Bandwidth, CPU, etc are still there. The more shading limited Ampere is - the closer it gets to its peak flops.
 
I wasn't sure how granular the split could be but Computerbase.de states its either 128FP or 64FP + 64INT. I figured it was a scheduling limitation of some type and would help explain the performance scaling deficit. With finer grained scheduling and higher utilization i would have expected the dramatic increase in core counts to result in a bigger performance increase even with other possible bottlenecks.
Not sure why Nvidia would invest in more control logic (scheduler, dispatch, compared to Pascal) if not for more fine grained control. Haven't been in the briefings, so this is just me making assumptions, though.
 
Not sure about that. Isn't it really per SM partition? It should be at least. 32+0 or 16+16. I'm splitting hairs like this, because that's leaving the pure FP32 block idle only when you have a 1:2 ratio of FP vs. INT (and underutilized at anything less than 1:1).

In other news, and because it was mentioned earlier: This makes paper-TFlops IMO actually more comparable between Nvidia and AMD('s current gen), since now INT-ops go against the TFlops budget on both sides, where as they were kind of free on Nvidia with Turing, making it overperform wrt to its TFlops rating.
BTW I think TFlops was a somewhat fair indicative between turing and navi 10 gaming performance, but ampere is definitely another kind of beast. Nvdia said 14.2TFlop for 2080 Ti FE at 1635Mhz boost, and AMD 9.7 for 5700 XT at 1905Mhz, but 2080 Ti FE real gaming median clock is 1830Mhz and 1890Mhz for 5700 XT, so should be around 16TFlop and 9.5TFlop...in latest TPU performance chart, 2080 Ti is 50% faster at gaming 4K, so looks like neck and neck or even AMDs TFlops are already a bit ahead there.
 
Back
Top