Speculation: GPU Performance Comparisons of 2020 *Spawn*

First of all, no one has said any of that.

Second, because those efficiencies are repeatable. Going by SGEMM, for example, nothing in the Ampere architecture suggests that it would be less able to reach the same % as Turing on that workload. On the contrary, 50% more L1, 2x the L1 bandwidth and more varied L1/shared memory configs suggest even better efficiency is possible. Now, instead of a pure SGEMM program, it's a certain portion of the frametime doing mostly MxM math. What prevents Ampere from being 2x faster per SM in these portions? Nothing. And the 90% figure is lost and irrelevant to that.
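A back-of-the-envelope sketch of what "reaching the same %" would mean in absolute terms (the SM counts and boost clocks are public spec-sheet numbers; the 90% efficiency figure is the one from the discussion and is purely illustrative):

```python
# Peak FP32 throughput: SMs * FP32 lanes per SM * 2 flops (FMA) * clock.
def peak_fp32_tflops(sms, fp32_lanes_per_sm, boost_ghz):
    return sms * fp32_lanes_per_sm * 2 * boost_ghz / 1000

turing_2080ti = peak_fp32_tflops(68, 64, 1.545)   # ~13.4 TFLOPs
ampere_3080   = peak_fp32_tflops(68, 128, 1.71)   # ~29.8 TFLOPs

# "Same % as Turing" means achieved/peak stays constant across architectures:
print(f"SGEMM at 90%: {0.9 * turing_2080ti:.1f} vs {0.9 * ampere_3080:.1f} TFLOPs")
```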

Quite frankly, if someone rereads the whole discussion up to now, that is the impression they'll get.

And again, you are trying to generalize specific workloads to every workload. Here you are describing a specific scenario, and then you are applying that result to every scenario. Could what you just said be applied to every workload? No. Because as soon as I mix in INT instructions, I stop being 2x or more faster than Turing. So instead of being 2x faster, we get "something that is more than 1x and less than 2x".
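For what it's worth, that "more than 1x and less than 2x" range can be made concrete with a toy issue model, under the assumption that Turing co-issues FP32 and INT32 on separate pipes while Ampere has one FP32-only pipe plus one pipe that runs FP32 or INT32 each clock (dependencies, memory and scheduling stalls ignored):

```python
# Toy per-SM issue model for a stream of fp + integer instructions.
def turing_cycles(fp, integer):
    # Turing: dedicated FP32 and INT32 pipes issue in parallel; the busier one dominates.
    return max(fp, integer)

def ampere_cycles(fp, integer):
    # Ampere: pipe A is FP32-only; pipe B runs FP32 or INT32, one per clock.
    # All INT must go through pipe B; FP fills whatever capacity is left on both pipes.
    return max((fp + integer) / 2, integer)

for fp_share in (1.0, 0.8, 2 / 3, 0.5):
    fp, i = fp_share, 1 - fp_share
    print(f"FP share {fp_share:.0%}: {turing_cycles(fp, i) / ampere_cycles(fp, i):.2f}x Turing per SM")
# Pure FP32 -> 2.00x; 2:1 FP:INT -> 1.33x; 1:1 -> 1.00x: always between 1x and 2x.
```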
 
I can't see how Ampere's utilisation of ALUs is lower than Turing's? If Turing had completely separate INT units that were only used 33% of the time, and those now double as FP units as well, wouldn't that mean that utilisation has actually increased? That when those units are not doing INT calculations they can do FP instead of staying idle? I guess this might be more complex than that, in case some INT/FP units might not be used at all because something on the same SM is doing INT, preventing some cores from doing FP?
 
Yes, I've read it, and some parts do not make sense at all, because you are trying to reach a conclusion.
I'm not the one who's trying to reach a predetermined conclusion here.

Who spoke about FLOPS utilization?
The guys at the start of the discussion? "GA102 never reaches 30 TFLOPs in games because the shader processors will halt while waiting for other bottlenecks in the chip"
You in your previous post? "Does it mean Ampere will hit double the FP throughput with respect to Turing at ISO clocks? No, it will depend on the workload. This is all."

I was speaking about hardware utilization, in particular the second FP pipeline staying idle whenever an INT instruction must be executed.
H/w utilization means little if your chip is still efficient enough per transistor to be competitive. That's the beauty of GPUs - you can do things in a million different ways; the only thing that matters is performance per price. So strictly speaking it doesn't even matter, and all this discussion is pure theory.

But you are trying to push your view, even saying that "Ampere has increased HW utilization". If we were talking about ROPs or TMUs, I could agree, but we were specifically talking about the shader core, and in the shader core I ALWAYS have either the second FP or the INT pipeline idle. Whereas in Turing that does not happen. So there is hardware in Ampere that is always unused. How you can write "hardware utilization increases" is beyond me.
Well, let's look at the information we have then.

[Image: ampere-sm.jpg - Ampere SM block diagram]


Do you see any "second FP OR the INT pipeline" here?

As I've said, the answer to that question is tied to how exactly Ampere handles INT execution.
If it's a separate SIMD then yeah, there will be more idle h/w in Ampere than in Turing when running the same code.
If it's the same SIMD as the one used for FP32 then no, there will be less idle h/w than in Turing.

You are continuing to move the target.
I'm dead solid on my target. It's you who constantly moves between h/w and flops utilization - which aren't at all the same.

And then you will have the hardware on the INT pipeline completely unused.
Again, choose what you're talking about. It's either perf/flop, which this discussion started on, or general h/w utilization. If it's the latter, then there are two possible scenarios for Ampere, not one.

But as I understand that you are not trying to discuss honestly, but rather pushing your agenda here, I will stop discussing here.
Says the man who has moved the goalposts at least twice in two consecutive posts.
 
And again, you are trying to generalize specific workloads to every workload. Here you are describing a specific scenario, and then you are applying that result to every scenario. Could what you just said be applied to every workload? No. Because as soon as I mix in INT instructions, I stop being 2x or more faster than Turing. So instead of being 2x faster, we get "something that is more than 1x and less than 2x".

I didn't apply it to every workload. We are discussing capabilities. Hence my example. That's the peak.

Because as soon as I mix in INT instructions, I stop being 2x or more faster than Turing. So instead of being 2x faster, we get "something that is more than 1x and less than 2x".

That's correct, and it applies to GCN and RDNA too; depending on the INT share it's going to be something between 0x and 1x, instead of between 1x and 2x. Only Turing gets a pass on this. If this has never ever been discussed before, why do we have to mention it now exactly?

Anyway, for Ampere, if you decrease the INT share in favor of the FP share, do you get closer to or farther away from that 2x figure? Literally no one has said that future games will reach 2x, only that they will get closer to it than current games. The closest has been DegustatoR talking about a hypothetical workload that uses FP32 exclusively. Disregarding how realistic such an application would be, would such an FP32-bound workload be 2x faster or not?
 
I didn't apply it to every workload. We are discussing capabilities. Hence my example. That's the peak.



That's correct, and it applies to GCN and RDNA too; depending on the INT share it's going to be something between 0x and 1x, instead of between 1x and 2x. Only Turing gets a pass on this. If this has never ever been discussed before, why do we have to mention it now exactly?

Anyway, for Ampere, if you decrease the INT share in favor of the FP share, do you get closer to or farther away from that 2x figure? Literally no one has said that future games will reach 2x, only that they will get closer to it than current games. The closest has been DegustatoR talking about a hypothetical workload that uses FP32 exclusively. Disregarding how realistic such an application would be, would such an FP32-bound workload be 2x faster or not?

I think everyone here agrees that if the workload is purely FP32 then Ampere has a 2x advantage. Problem is that the Nvidia PR man is trying to push that advantage everywhere, every time.
 
I can't see how Ampere's utilisation of ALUs is lower than Turing's? If Turing had completely separate INT units that were only used 33% of the time, and those now double as FP units as well, wouldn't that mean that utilisation has actually increased? That when those units are not doing INT calculations they can do FP instead of staying idle? I guess this might be more complex than that, in case some INT/FP units might not be used at all because something on the same SM is doing INT, preventing some cores from doing FP?

Per SM, FP32 throughput is up 2x over Turing. So utilisation has improved immensely (with the help of 33% more L1 cache and twice the bandwidth). A 2080 Ti has 47% more SMs than the RTX 3070 and both perform about equally (minus some ROP and geometry performance, plus 37.5% more bandwidth).
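The quoted figures are easy to sanity-check against the public spec sheets:

```python
# Spec-sheet numbers: SM counts and memory bandwidth.
sms_2080ti, sms_3070 = 68, 46
print(f"SM advantage: {sms_2080ti / sms_3070 - 1:.1%}")       # ~47.8% more SMs

bw_2080ti, bw_3070 = 616, 448                                  # GB/s (352-bit vs 256-bit GDDR6)
print(f"Bandwidth advantage: {bw_2080ti / bw_3070 - 1:.1%}")   # 37.5%
```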
 
I guess this might be more complex than that
It can be more complex than that if Ampere has added a second FP32 SIMD next to Turing's INT32 and FP32 SIMDs. In this case there are actually three math units in Ampere h/w (well, four counting the SFUs), with two of them sitting on the same datapath and thus unable to be used in parallel. This is possible, as NV generally tends to build specialized h/w for different math types in their GPUs. It would still be somewhat weird to see in Ampere, though, considering that a SIMD which can run both FP32 and INT32 isn't rocket science - NV had them up until Turing, AMD has them, Intel has them too I think?

I can think of only one apparent advantage of having them separate in h/w - if the combined complexity of separate FP32 and INT32 SIMDs is less than that of a universal FP32+INT32 SIMD. In that case it won't matter much that one of them is idle on any given clock, as you would still have a net win in perf/transistor.
 
I can't see how Ampere's utilisation of ALUs is lower than Turing's? If Turing had completely separate INT units that were only used 33% of the time, and those now double as FP units as well, wouldn't that mean that utilisation has actually increased? That when those units are not doing INT calculations they can do FP instead of staying idle? I guess this might be more complex than that, in case some INT/FP units might not be used at all because something on the same SM is doing INT, preventing some cores from doing FP?

Because these are not the same ALUs. It is the same datapath/scheduler. So if you are saying that the scheduler has increased utilization, that is true. If you take the ALUs, I could have more ALU-dedicated transistors idling per clock. So an SM can do more work per clock? Yes. Does it use more of its ALU transistors per clock, or get the same average ALU utilization? Depends on the workload. For gaming, this seems not to be the case.
 
I think everyone here agrees that if the workload is purely FP32 then Ampere has a 2x advantage.

Then why doesn't everyone agree that if the mix is 1.6x FP rather than, say, the 1.4x FP that is the average in current games, there would be an increase in performance?
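Reading "1.4x FP" and "1.6x FP" as FP:INT ratios (my assumption; the post doesn't define them), the toy issue model from earlier in the thread puts rough numbers on this:

```python
# Per-SM speedup over Turing at two assumed FP:INT instruction ratios.
for ratio in (1.4, 1.6):
    fp, integer = ratio / (ratio + 1), 1 / (ratio + 1)
    speedup = max(fp, integer) / max((fp + integer) / 2, integer)
    print(f"FP:INT = {ratio}:1 -> ~{speedup:.2f}x Turing per SM")
# 1.4:1 -> ~1.17x, 1.6:1 -> ~1.23x: a more FP-heavy mix does move performance toward 2x.
```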

Problem is that the Nvidia PR man is trying to push that advantage everywhere, every time.

Literally no one is doing that. The only time the 2x has been brought up has been in purely hypothetical scenarios like the one I presented, because it was put into question whether Ampere can reach that peak performance in pure FP workloads.
 
Because these are not the same ALUs. It is the same datapath/scheduler. So if you are saying that the scheduler has increased utilization, that is true. If you take the ALUs, I could have more ALU-dedicated transistors idling per clock. So an SM can do more work per clock? Yes. Does it use more of its ALU transistors per clock, or get the same average ALU utilization? Depends on the workload. For gaming, this seems not to be the case.

Edit - You are right.
 
Is it completely clear that they are not the same ALUs? Because my understanding is that they are, as opposed to Turing, which had separate ones. If they are still separate, why did Nvidia make the distinction with Ampere? Did Turing have different scheduling for INT and FP? That sounds wasteful?

There is waste both ways. FP:INT in modern games varies greatly, but it should normally be around 2:1 on average (it depends on the engine, the shaders and so on; sometimes it is more). If this is the ratio, then in Turing you have the INT pipeline working half the time and the FP pipeline working all the time. In Ampere you have one FP unit working all the time, and the other FP unit working only 33% of the time, while 66% of the time the INT unit in the second datapath will work. So in Turing I have an INT unit sitting idle 50% of the time; in Ampere I always have either an INT or an FP SIMD idle. The higher the FP:INT ratio, the better Ampere's usage and the worse Turing's. For a pure FP workload, Turing is less efficient in terms of ALU utilization, as the INT pipeline will always be idle; the same happens in Ampere, but there you still have 2/3 of the SIMDs in action.
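A small sketch of those duty cycles, under this post's premise that Ampere's second datapath hosts physically separate FP32 and INT32 SIMDs:

```python
# Pipe duty cycles at a sustained FP:INT instruction ratio r (both pipes kept fed).
def duty_cycles(r):
    turing = {"FP pipe": 1.0, "INT pipe": 1 / r}   # INT pipe busy 1/r of the time
    fp_on_b = (r - 1) / (r + 1)                    # pipe B's FP share so total FP:INT = r
    ampere = {"FP pipe A": 1.0, "FP on pipe B": fp_on_b, "INT on pipe B": 1 - fp_on_b}
    return turing, ampere

print(duty_cycles(2.0))
# Turing: INT pipe idle 50% of the time.
# Ampere: pipe B runs FP 1/3 and INT 2/3 of the time - the 33%/66% split above.
```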
 
Then why doesn't everyone agree that if the mix is 1.6x FP rather than, say, the 1.4x FP that is the average in current games, there would be an increase in performance?

I don't know, because it is true (I explained it in the post just above). The question is whether, for better transistor utilization, it would have been better to have 2 complete FP pipelines and 1 INT pipeline; probably the transistor budget would not have allowed that. Likewise, it is true that increasing the FP:texture ratio in Vega would have improved ALU utilization and thus performance.

Literally no one is doing that. The only time the 2x has been brought up has been in purely hypothetical scenarios like the one I presented, because it was put into question whether Ampere can reach that peak performance in pure FP workloads.

The question was not posed because of peak FP performance in pure FP workloads. That part is true: in those workloads Ampere will do 2x Turing per SM. The question was about ALU utilization, and about that 2x FP per-SM performance not being achievable, in general, in gaming workloads.
 
Isn't it the other way around? I'm very confused lol.
It's not. All GPUs but Turing run INTs on the same SIMDs as FP, which means that these SIMDs have sets of ALUs for both math types. So in this sense Ampere is the same as Navi or GCN or Pascal or Kepler, etc. So when he says "these are not the same ALUs", implying that Ampere is idling a set of ALUs of a SIMD when it's running FP or INT on it - that is the exact same thing every other GPU on the market does, with the exception of Turing, which runs INTs on a separate SIMD where there are no FP ALUs.
 
And to that I responded that there is no such thing as a predefined "gaming workload". One could decide to write a software renderer using purely (or mostly) FP32 for their game. Is that less of a gaming workload?

Semantics. Gaming workloads are those that are found in the real world. At the moment, no one has written a pure FP32 gaming workload; the FP:INT ratio varies from 1.7:1 to around 3:1.
In the future? Who knows. But that does not demonstrate that such gaming workloads will exist.
 
So in Turing I have an INT unit sitting idle 50% of the time; in Ampere I always have either an INT or an FP SIMD idle.
That's assuming that there are two separate SIMDs for FP and INT - which is unlikely. A far more likely scenario is one SIMD with two sets of ALUs - just like in RDNA or Pascal or whatever. So what are you even arguing about?
RDNA, which has 4 SIMDs in a WGP, each of which is capable of running INT32, is wasting a lot more h/w than Ampere, for example, in games where only 25% of the math is INT32.
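The same duty-cycle lens applied to a combined FP32+INT32 SIMD of the RDNA/Pascal kind (a sketch using the 25% INT share from the post):

```python
# On a combined FP32+INT32 SIMD, each issued instruction lights up one set of
# ALUs on the datapath and leaves the other dark.
int_share = 0.25   # the post's "25% of math in INT32"
print(f"FP ALUs busy {1 - int_share:.0%} of issue slots, INT ALUs busy {int_share:.0%}")
# Some ALU hardware idles on every architecture that overlays FP and INT on one
# datapath; the argument is only about how much, and at what transistor cost.
```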
 