Nvidia Ampere Discussion [2020-05-14]

This nvidia tool looks like it can do the same thing, but if it requires geforce experience I'm not sure that I want it.
I made the move to GFE right after registering for a GOG account and have not looked back. I think if you want to take full advantage of features using tensor/rt cores (gamers or professionals) then it will become necessary.
 
Which tests demonstrate this?
Most of those which give an option between a simpler and more complex shading workload.
I've already mentioned TS vs TSE.
RT results are similar albeit it's harder to decouple RT improvements from flops gains there.
I think that people don't fully understand that Ampere is mostly limited by other stuff besides math atm. And it will likely increase it's performance advantage over Turing over time as more next gen games (even without RT) will come to market.
 
Most of those which give an option between a simpler and more complex shading workload.
I've already mentioned TS vs TSE.
RT results are similar albeit it's harder to decouple RT improvements from flops gains there.
I think that people don't fully understand that Ampere is mostly limited by other stuff besides math atm. And it will likely increase it's performance advantage over Turing over time as more next gen games (even without RT) will come to market.

Nvidia hasn't said anything yet about why they went so hard on flops. It's certainly not guaranteed to pay dividends in the future. SVOGI is getting some attention but I don't know if that's even compute bound. There's also a lot of focus on high quality asset streaming but that's likely to be rasterization bound anyway. Not sure where all these flops will prove beneficial.
 
To demonstrate progress? If the 1080 was twice as fast as per your suggestion, it would be as fast as RTX 3080 in that example.

But there is no progress to demonstrate so the slide is misleading at best. A Pascal SM retired 256 FP32 flops per clock and INT32 had to compete for instruction slots, same as Ampere.
 
Nvidia hasn't said anything yet about why they went so hard on flops. It's certainly not guaranteed to pay dividends in the future. SVOGI is getting some attention but I don't know if that's even compute bound. There's also a lot of focus on high quality asset streaming but that's likely to be rasterization bound anyway. Not sure where all these flops will prove beneficial.
Well, RTX I/O is flops, for one.
 
Most of those which give an option between a simpler and more complex shading workload.
I've already mentioned TS vs TSE.
I think you should post specifics so that we can have a discussion about your theory.

RT results are similar albeit it's harder to decouple RT improvements from flops gains there.
I think that people don't fully understand that Ampere is mostly limited by other stuff besides math atm. And it will likely increase it's performance advantage over Turing over time as more next gen games (even without RT) will come to market.
So "fine wine"?
 
But there is no progress to demonstrate so the slide is misleading at best. A Pascal SM retired 256 FP32 flops per clock and INT32 had to compete for instruction slots, same as Ampere.

They're demonstrating the architectural improvement irrespective of unit counts.

Ampere can issue FP and INT simultaneously just like Turing and unlike Pascal. The difference is in the ratios which in Ampere represent a more realistic real world ratio. This test demonstrates that. Pascal would always come out behind, but if the test featured a more unrealistic 50/50 FP/INT split then Turing would have equaled Ampere in this unit count agnostic test.
 
But there is no progress to demonstrate so the slide is misleading at best. A Pascal SM retired 256 FP32 flops per clock and INT32 had to compete for instruction slots, same as Ampere.
An SM is just a measure of control per compute. Or do you think, all generations after Kepler with its 192 FP32 ALUs per SM are a regression?
 
I wasn't sure how granular the split could be but Computerbase.de states its either 128FP or 64FP + 64INT. I figured it was a scheduling limitation of some type and would help explain the performance scaling deficit. With finer grained scheduling and higher utilization i would have expected the dramatic increase in core counts to result in a bigger performance increase even with other possible bottlenecks.
SMs can issue 4 warps per clock, warp is 32 threads. There are 4 fp32 SIMDs that are 16 wide and 4 fp32+int32 SIMDs that are also 16 wide. Issuing a warp to SIMD will take 2 clocks to consume. So other combinations are possible if warps are available.
 
They're demonstrating the architectural improvement irrespective of unit counts.

And this is why the picture is wrong. Looking at a single unit (i.e. SM) Pascal has the same instruction throughput as Ampere for mixed INT and FP loads. Ampere wins because it has more SMs total.

Ampere can issue FP and INT simultaneously just like Turing and unlike Pascal.

And Pascal's units are twice as wide so it's a wash.

Pascal would always come out behind

No, there's no scenario under which a Pascal SM has lower ALU throughput than an Ampere SM.
 
And this is why the picture is wrong. Looking at a single unit (i.e. SM) Pascal has the same instruction throughput as Ampere for mixed INT and FP loads. Ampere wins because it has more SMs total.



And Pascal's units are twice as wide so it's a wash.



No, there's no scenario under which a Pascal SM has lower ALU throughput than an Ampere SM.

I don't think the chart was supposed to represent an SM, rather it's just normalised ALU throughput.
 

gb5drjaf.png


https://www.3dcenter.org/news/erster-geekbench-wert-zur-geforce-rtx-3080-aufgetaucht
 

Seems about right. I'm super curious about overclocking. Overclocking will speed up the rops, the raster engines, the texture units, but I would assume in an OpenCL test those wouldn't be the bottleneck. That means the gpu most likely doesn't have enough bandwidth in all situations for the number of cuda cores, or is there something else in the front-end that could explain it? Is it possible this test is mixing int32, but in an unequal ratio to fp32, like a 2:1 split for fp32 to int32?
 
Can somebody explain me why they build : fp32 + (int32 or fp32) instead of fp32 + int32 + fp32?
They have all 3 units on the chip, why not use all together the same time?
is the scheduling so much more complicated? Maybe issue with cache size and latency?
 
Back
Top