Speculation: GPU Performance Comparisons of 2020 *Spawn*

I'm sure you didn't forget the part where you inflated Vega's clocks; you're only pretending you did.

I didn't inflate anything. I went with official specs. You know, the ones given by your beloved company.

Nah, they're just typical load frequencies when playing games, the same ones seen in the benchmarks in your TechPowerUp reference.

ModEdit: Removed unnecessary lingo

Sure, those are coincidentally the only two graphics cards in existence that boost the same in every game. *redacted* In your link they were literally stress testing the Vega card...

What exactly is "all performance metrics!!!111oneone"?

Vega 56 (typical 1.3GHz clock) vs. RX 580 (typical >1.4GHz clock):
1.43x FLOPs (9.3 vs. 6.5 TFLOPs)
1.60x memory BW (410 vs. 256 GB/s)
1.84x pixel fillrate (83 vs. 45 GPixel/s)
1.45x texel fillrate (291 vs. 201 GTexel/s)
1.38x higher performance at 1440p, 1.45x higher performance at 4K

So where is your 21% between all performance metrics here?
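For anyone who wants to check the arithmetic, here is a minimal sketch reproducing these ratios from the public unit counts, under the same clock assumptions as above (1.3GHz Vega 56, 1.4GHz RX 580); small deviations from the quoted ratios are just rounding:

```python
# Reproduce the Vega 56 vs. RX 580 ratios from unit counts and assumed clocks.
# Unit counts are from the public spec sheets; the clocks are the typical
# gaming clocks assumed in this post, not guaranteed boost values.

specs = {
    #          shaders, ROPs, TMUs, clock (GHz), mem BW (GB/s)
    "Vega 56": (3584, 64, 224, 1.3, 410),
    "RX 580":  (2304, 32, 144, 1.4, 256),
}

def metrics(shaders, rops, tmus, clk, bw):
    return {
        "TFLOPs":   2 * shaders * clk / 1000,  # 2 FP32 ops per shader per clock
        "GPixel/s": rops * clk,                # pixel fillrate
        "GTexel/s": tmus * clk,                # texel fillrate
        "GB/s":     bw,                        # memory bandwidth
    }

vega, rx = metrics(*specs["Vega 56"]), metrics(*specs["RX 580"])
for k in vega:
    print(f"{k:>8}: {vega[k]:7.1f} vs {rx[k]:7.1f} -> {vega[k] / rx[k]:.2f}x")
```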

In the Vega 64, which by your own admission was as fast as the Vega 56 at the same clocks. You forget we are talking about scaling. You deliberately chose a card with less FP32, less pixel fillrate, less texel fillrate, less of everything to make the comparison; that you end up proving the very scaling issues on Vega you are so desperately trying to disprove is hilarious. You're literally saying that Vega 64 had too much of everything == scaling issues. You don't disprove scaling issues by choosing a GPU with less of "everything" that needs to scale. *redacted*

BTW, would you like to make a similar comparison between e.g. the RTX 3090 and the RTX 2060? We would all like to see how "Ampere is broken" after looking at that comparison.

Sure. But I warn you, the results are going to be very different from what you expect, so:

4.09x TOPs (36 vs. 8.8 TOPs; "TOPs" being FLOPs plus 36 INT per 100 FP32)
2.79x memory BW (936 vs. 336 GB/s)
2.77x texel fillrate (556 vs. 201 GTexel/s)
2.03x pixel fillrate (162 vs. 80 GPixel/s)
2.94x higher performance at 4K

So performance actually scales even better relative to the metrics than in the 3080 vs. 2080 comparison. Uh oh... that's not what we wanted to find. Oops!
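To make the "TOPs" figure reproducible: on Turing the INT32 SIMDs sit next to the FP32 ones, so the 36-INT-per-100-FP32 workload runs on top of the FP32 rating; on Ampere the second datapath runs FP32 or INT32, so the INT work comes out of the same rating and nothing gets added. A sketch, with the TFLOPs inputs being approximate boost-clock spec values (small deviations from the quoted 4.09x are rounding):

```python
# "TOPs" as defined above: FP32 ops plus 36 INT32 ops per 100 FP32 ops.

def tops_turing(fp32_tflops):
    # Separate INT32 SIMDs: the INT work is free capacity on top of FP32.
    return fp32_tflops * 1.36

def tops_ampere(fp32_tflops):
    # Shared FP32/INT32 datapath: INT work steals slots from the FP32 rating.
    return fp32_tflops

rtx_2060 = tops_turing(6.5)    # ~6.5 TFLOPs FP32 -> ~8.8 TOPs
rtx_3090 = tops_ampere(35.6)   # ~35.6 TFLOPs FP32 -> ~36 TOPs
print(f"{rtx_3090:.1f} vs {rtx_2060:.1f} TOPs -> {rtx_3090 / rtx_2060:.2f}x")
```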

So Vega 10 was bad because it needed "full refactoring of a renderer to make use of its FP32 capabilities". Ampere is good because "it needs a new game engine".
Got it.

To make use of its FP32, pixel and texel crunching, and enormous bandwidth capabilities, you mean. In Vega, if you used more FP32, you would just have that many more ROPs, TMUs and bandwidth sitting around doing nothing, which is again bad scaling. You just moved the inefficiency from FP32 to the ROPs or TMUs; you didn't make better use of the silicon. In Ampere the only things sitting around doing nothing are the FP32 units; everything else is forcefully used to its full capabilities.
 
Show me where I've said that Ampere "needs a new game engine". And also, do explain why, please.
You've been counter-arguing all my posts that explicitly condemn that message. If you don't disagree with it then why are we talking about it?
 
You've been counter-arguing all my posts that explicitly condemn that message. If you don't disagree with it then why are we talking about it?
I don't know what you are talking about, which is why I ask you to show where I've said what you've quoted. Because I certainly didn't say anything of the kind.
And do explain why you think that "Ampere needs a new game engine" to make full use of its FP32 h/w. Because it certainly doesn't. It doesn't even need any new API for that.
 
In Vega, if you used more FP32, you would just have that many more ROPs, TMUs and bandwidth sitting around doing nothing, which is again bad scaling. You just moved the inefficiency from FP32 to the ROPs or TMUs; you didn't make better use of the silicon. In Ampere the only things sitting around doing nothing are the FP32 units; everything else is forcefully used to its full capabilities.
This here sums up pretty much every important point regarding this topic.
 
To make use of its FP32, pixel and texel crunching, and enormous bandwidth capabilities, you mean. In Vega, if you used more FP32, you would just have that many more ROPs, TMUs and bandwidth sitting around doing nothing, which is again bad scaling. You just moved the inefficiency from FP32 to the ROPs or TMUs; you didn't make better use of the silicon. In Ampere the only things sitting around doing nothing are the FP32 units; everything else is forcefully used to its full capabilities.

Quite frankly, this is simply not true. It depends on the workload, for both Vega and Ampere. There will always be a bottleneck somewhere, whether the architecture is called Vega, RDNA, Ampere, Turing or Hopper. We can agree that for gaming workloads at the time, Vega's unit utilization was crap, Pascal was better, Turing better still, and Ampere decreased the average utilization. That said, declaring that "everything else is used to its full capabilities" is simply crap. There will never be such a thing as "full utilization".
 
That said, declaring that "everything else is used to its full capabilities" is simply crap. There will never be such a thing as "full utilization".

I'm pretty sure that I said full capabilities, not full utilization. I think you even quoted it. They don't mean the same thing by any means. I think I made good use of language.
 
We can agree that for gaming workloads at the time, Vega's unit utilization was crap, Pascal was better, Turing better still, and Ampere decreased the average utilization.
Nope. With only ~25% of gaming math on average being INT32, Turing's INT SIMDs are idling in 75% of clocks in gaming, which is definitely worse than Pascal in h/w utilization but better for actual math throughput.
The elephant in the room for Ampere is this: does Ampere have one FP32/INT SIMD, or two separate INT and FP32 SIMDs on the same datapath?
If it's the former, then Ampere has actually improved h/w utilization over Turing, as it can run those 75% of FP32 instructions on both SIMDs once it's done with the 25% of INTs - nothing idles at any clock.
If it's the latter, then sure, there will be a decrease of actual h/w utilization compared to Turing.

Edit: Also of note - h/w utilization isn't the same as FLOPS utilization.
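To make the two hypotheses concrete, here is a toy issue model for one SM partition at the 75/25 FP32/INT mix above. It assumes one warp instruction per SIMD per clock and perfect scheduling - an illustration of the argument, not a description of the real hardware:

```python
# 100 instructions at a 75% FP32 / 25% INT32 mix, one SM partition.

FP, INT = 75, 25

def report(name, cycles, simds):
    util = (FP + INT) / (cycles * simds)  # busy SIMD-cycles / total SIMD-cycles
    print(f"{name:26} {cycles:3} clks  {(FP + INT) / cycles:.2f} instr/clk  util {util:.0%}")

# Turing: dedicated FP32 SIMD + dedicated INT32 SIMD. The FP SIMD needs
# 75 clocks; the 25 INT ops hide under them, leaving the INT SIMD mostly idle.
report("Turing (FP + INT)", max(FP, INT), simds=2)

# Ampere, former hypothesis: the second SIMD runs FP32 *or* INT32. Split the
# FP work so both SIMDs finish together: 50 clocks, nothing idles at any clock.
report("Ampere (FP + FP/INT)", (FP + INT) // 2, simds=2)

# Ampere, latter hypothesis: separate FP and INT SIMDs behind one shared
# datapath. Same 50 clocks of throughput, but one of the SIMDs on the shared
# datapath idles every clock, so hardware utilization drops.
report("Ampere (FP + [FP|INT])", (FP + INT) // 2, simds=3)
```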
 
I'm pretty sure that I said full capabilities, not full utilization. I think you even quoted it. They don't mean the same thing by any means. I think I made good use of language.

What does that mean? It makes no sense. Full capabilities means utilization of all units to their theoretical maximum. There are no other meanings.
 
What does that mean? It makes no sense. Full capabilities means utilization of all units to their theoretical maximum. There are no other meanings.

Theoretical maximum is impossible. There is, however, a certain level that is achievable and repeatable for each use case. For example, for SGEMM on the latest generations it's somewhere above 90% of the theoretical maximum.

As for my sentence: we are talking about averages, general cases. By your definition, "x% faster in games" makes as little sense as my sentence above. No one seems to have a problem with that, though.
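For what it's worth, the "achievable and repeatable level" is easy to pin down for SGEMM: FLOPs actually delivered divided by the theoretical peak. A minimal sketch of that calculation; the numbers in the example call are made up purely to show the shape of it:

```python
# Efficiency of an M x N x K SGEMM: achieved FLOPs over theoretical peak.
# C = A @ B performs 2*M*N*K floating-point operations.

def sgemm_efficiency(m, n, k, seconds, peak_tflops):
    achieved_tflops = 2 * m * n * k / seconds / 1e12
    return achieved_tflops / peak_tflops

# Hypothetical: an 8192^3 SGEMM finishing in 0.08 s on a 15 TFLOPs card.
print(f"{sgemm_efficiency(8192, 8192, 8192, 0.08, 15.0):.0%}")  # ~92%
```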
 
Nope. With only ~25% of gaming math on average being INT32, Turing's INT SIMDs are idling in 75% of clocks in gaming, which is definitely worse than Pascal in h/w utilization but better for actual math throughput.
The elephant in the room for Ampere is this: does Ampere have one FP32/INT SIMD, or two separate INT and FP32 SIMDs on the same datapath?
If it's the former, then Ampere has actually improved h/w utilization over Turing, as it can run those 75% of FP32 instructions on both SIMDs once it's done with the 25% of INTs - nothing idles at any clock.
If it's the latter, then sure, there will be a decrease of actual h/w utilization compared to Turing.

Edit: Also of note - h/w utilization isn't the same as FLOPS utilization.

It does not matter. Ampere cannot use the second FP units and the INT units at the same time. That means some unit will stay idle EVERY TIME, in the hardware. What does not stay idle is the scheduling of the second pipeline. Also, combining INT and FP will add context switches and overhead in the scheduling.
So, utilization of the available units decreased. Is it a good compromise? Yes, as it cost relatively little area to add this capability, and it even increases the real FP throughput.
Does it mean Ampere will hit double the FP throughput of Turing at ISO clocks? No, it will depend on the workload. This is all.
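The "it will depend on the workload" point can be made quantitative with the same kind of toy issue model used earlier in the thread: per-clock throughput of a dedicated FP+INT partition vs. an FP plus shared FP/INT one, as a function of the INT fraction of the mix. Same idealized assumptions (one instruction per SIMD per clock, perfect scheduling):

```python
# Relative per-clock throughput vs. INT32 fraction i of the instruction mix.

def turing_ipc(i):
    # Dedicated FP and INT SIMDs: the larger fraction is the bottleneck.
    return 1 / max(1 - i, i)

def ampere_ipc(i):
    # FP SIMD + shared FP/INT SIMD: both stay busy until INT alone
    # saturates the shared SIMD (i > 0.5).
    return 1 / max(0.5, i)

for i in (0.0, 0.1, 0.25, 0.5):
    print(f"INT {i:4.0%}: Ampere/Turing = {ampere_ipc(i) / turing_ipc(i):.2f}x")
```

Under this model the 2x shows up only for a pure-FP32 mix and shrinks to nothing by a 50% INT mix, which is exactly the point.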
 
Theoretical maximum is impossible. There is, however, a certain level that is achievable and repeatable for each use case. For example, for SGEMM on the latest generations it's somewhere above 90% of the theoretical maximum.

As for my sentence: we are talking about averages, general cases. By your definition, "x% faster in games" makes as little sense as my sentence above. No one seems to have a problem with that, though.

The problem with your definition is that the averages will depend on the architecture. If, absurdly, efficiency in architecture X can be at most 50% because the architectural constraints say so, it will be using its "maximum capabilities" even while half of its units stay idle all the time. Which common sense says is not so efficient.
 
Does it mean Ampere will hit double the FP throughput of Turing at ISO clocks? No, it will depend on the workload. This is all.

I think this needs addressing, because it's been repeated at least twice and at this point it's hard to tell whether it's part of the problem in this discussion or not. No one has suggested that Ampere will hit double FP throughput under all circumstances. Period.
 
The problem with your definition is that the averages will depend on the architecture. If, absurdly, efficiency in architecture X can be at most 50% because the architectural constraints say so, it will be using its "maximum capabilities" even while half of its units stay idle all the time. Which common sense says is not so efficient.

But we are not talking about efficiencies in that respect, and they don't matter. It doesn't matter if Nvidia's ROPs are typically capable of only 50% of their rated throughput due to architectural constraints; it will be the same on all cards and in all circumstances. Again, this is within the context of "average game performance", not this specific game at this specific resolution with this specific AO setting. ROP or FP32 or any other unit usage could be more efficient in one game or setting, but that likely makes a difference in that game's performance that makes it deviate from the "average game performance". Like the Crysis remaster, for example.

Edit: basically, it's silly to point out that unit utilization can vary from task to task without also taking into consideration that performance in that task most likely also varies because of it.
 
It does not matter. Ampere cannot use the second FP units and the INT units at the same time. That means some unit will stay idle EVERY TIME, in the hardware.
Did you even read what I've said?

Also, combining INT and FP will add context switches and overhead in the scheduling.
Like on every other GPU in existence? Or do you somehow propose to not run INTs on them at all?

Does it mean Ampere will hit double the FP throughput of Turing at ISO clocks? No, it will depend on the workload. This is all.
Yes, it will depend on the workload, and if said workload is FP32 only, then Ampere will hit exactly what you've said. This is all.
 
Not sure why there is a “conclusion” drawn that “everything else sitting idle if you do more maths” is GCN/Vega specific.

The GCN CU is explicitly stated to support issuing up to 5 instructions of different categories from 40 in-order instruction streams. I fail to see how it would be different for SMs that are basically doing the same thing, multiplexing a pool of in-order instruction streams onto two issue ports shared by everything. Parallelism in both is still bound by program order and by compiler-assisted scheduling as permitted by the architecture.

If your work is VALU bound, and you are running just a large grid of it, then of course everything else is sitting idle. The ROPs and caches aren’t gonna mine bitcoins on their own, are they? The same logic applies to any bottleneck of any kernel/workload.
 
Not sure why there is a “conclusion” drawn that “everything else sitting idle if you do more maths” is GCN/Vega specific.

No one said it's specific to GCN/Vega. However, would you agree that the 3080, at a ~200:1 GFLOPs:GPixels ratio vs. Vega 64's ~127:1, is far more likely to hit a pixel bottleneck long before it hits a FLOPs one?
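For reference, this ratio only depends on the shader:ROP balance, since the clock cancels out. A quick sketch with public unit counts (the boost clocks here are an assumption; this lands a bit under the 200:1 quoted above, but the skew is the same):

```python
# GFLOPs : GPixel/s -- math throughput per unit of pixel fillrate.

def flops_per_pixel(shaders, rops, clk_ghz):
    gflops  = 2 * shaders * clk_ghz   # 2 FP32 ops per shader per clock
    gpixels = rops * clk_ghz          # pixel fillrate
    return gflops / gpixels           # = 2 * shaders / rops; clock cancels

print(f"RTX 3080: {flops_per_pixel(8704, 96, 1.71):.0f}:1")  # ~181:1
print(f"Vega 64:  {flops_per_pixel(4096, 64, 1.55):.0f}:1")  # ~128:1
```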
 
But we are not talking about efficiencies in that respect, and they don't matter. It doesn't matter if Nvidia's ROPs are typically capable of only 50% of their rated throughput due to architectural constraints; it will be the same on all cards and in all circumstances. Again, this is within the context of "average game performance", not this specific game at this specific resolution with this specific AO setting. ROP or FP32 or any other unit usage could be more efficient in one game or setting, but that likely makes a difference in that game's performance that makes it deviate from the "average game performance". Like the Crysis remaster, for example.

Edit: basically, it's silly to point out that unit utilization can vary from task to task without also taking into consideration that performance in that task most likely also varies because of it.

If these efficiencies do not matter, and everything depends on the workload, then why are both parties trying to demonstrate that "Vega is broken" and "Ampere will be the best gaming architecture ever" vs "Vega needs a specific workload to shine, and if it's broken then even Ampere is broken"?
Ampere is more gaming-oriented than Vega; both could reach their peak FP throughput in certain workloads, and both have inefficiencies. One can easily argue that GCN has quite a few more problems hitting its peak in gaming, and I agree.
 
Did you even read what I've said?

Yes, I've read it, and some parts do not make sense at all because you are trying to reach a conclusion. Like:

Edit: Also of note - h/w utilization isn't the same as FLOPS utilization.

Who spoke about FLOPS utilization? I was speaking about hardware utilization, in particular the second FP pipeline staying idle whenever an INT instruction must be executed.
But you are trying to push your view even by saying that "Ampere has increased HW utilization". If we were talking about ROPs or TMUs, I could agree, but we were specifically talking about the shader core, and in the shader core I ALWAYS have either the second FP or the INT pipeline idle, when in Turing that does not happen. So there is hardware in Ampere that is always unused. How you can write that "hardware utilization increases" is beyond me.

Like on every other GPU in existence? Or do you somehow propose to not run INTs on them at all?
You keep moving the target.

Yes, it will depend on the workload, and if said workload is FP32 only, then Ampere will hit exactly what you've said. This is all.

And then you will have the hardware of the INT pipeline completely unused, so hardware utilization in the shader core is actually lowered. But as I understand that you are not trying to discuss honestly, and are instead pushing your agenda here, I will stop discussing this.
 
If these efficiencies do not matter, and everything depends on the workload, then why are both parties trying to demonstrate that "Vega is broken" and "Ampere will be the best gaming architecture ever" vs "Vega needs a specific workload to shine, and if it's broken then even Ampere is broken"?
Ampere is more gaming-oriented than Vega; both could reach their peak FP throughput in certain workloads, and both have inefficiencies. One can easily argue that GCN has quite a few more problems hitting its peak in gaming, and I agree.

First of all, no one has said any of that.

Second, because those efficiencies are repeatable. Going by SGEMM, for example, nothing in the Ampere architecture suggests that it would be less capable of reaching the same percentage as Turing on that workload. On the contrary: 50% more L1, 2x the L1 bandwidth, and more varied L1/shared-memory configurations suggest that better efficiency is actually possible. Now, instead of a pure SGEMM program, make it a certain portion of frametime doing mostly MxM math. What prevents Ampere from being 2x faster per SM in those portions? Nothing. And the 90% figure is beside the point and irrelevant to that.
 