Speculation: GPU Performance Comparisons of 2020 *Spawn*

Discussion in 'Architecture and Products' started by eastmen, Jul 20, 2020.

Thread Status:
Not open for further replies.
  1. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    376
    Likes Received:
    385
    I didn't inflate anything. I went with official specs. You know the ones given by your beloved company.

    ModEdit: Removed unnecessary lingo

    Sure, those are coincidentally the only two graphics cards in existence that boost the same in every game. *redacted* In your link they were literally stress testing the Vega card...

    In the Vega 64, which by your own admission was as fast as the Vega 56 at the same clocks. You forget we are talking about scaling. Deliberately choosing a card with less FP32, less pixel fillrate, less texel fillrate, less of everything to make the comparison, and thereby basically proving the very scaling issues on Vega that you are so desperately trying to disprove, is hilarious. You're literally saying that Vega 64 had too much of everything == scaling issues. You don't disprove scaling issues by choosing a GPU with less of "everything" that needs to scale. *redacted*

    Sure. But I warn you, the results are going to be very different than what you expect, so:

    4.09x TOPs (36 vs. 8.8 TOPs) (TOPs being FLOPS + 36 INT per 100 FP32)
    2.79x memory BW (936 vs. 336 GB/s)
    2.77x texel fillrate (556 vs. 201 GTexel/s)
    2.03x pixel fillrate (162 vs. 80 GPixel/s)
    2.94x higher performance at 4K

    So basically performance scales even further above the metric scaling than it does for the 3080 vs. the 2080. Uh oh... that's not what we wanted to find. Oops!
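    The arithmetic behind the quoted ratios can be checked with a quick script. This is just a sanity check of the division; the spec figures are taken as quoted in the post, not verified against any datasheet:

```python
# Verify the scaling ratios quoted above (newer card vs. older card,
# spec figures as given in the post).
specs = {
    "TOPs (FP32 + INT)": (36.0, 8.8),
    "memory BW (GB/s)": (936.0, 336.0),
    "texel fillrate (GTexel/s)": (556.0, 201.0),
    "pixel fillrate (GPixel/s)": (162.0, 80.0),
}

for metric, (new, old) in specs.items():
    print(f"{metric}: {new / old:.2f}x")
```

    All four ratios come out as stated, so the disagreement is about interpretation, not arithmetic.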

    To make use of its FP32, pixel and texel crunching and enormous bandwidth capabilities, you mean. In Vega if you used more FP32, you would just have much more ROPs, TMUs and bandwidth sitting around doing nothing, which is again bad scaling. You just moved the inefficiency from FP32 to ROPs or TMU, you didn't make better use of the silicon. In Ampere the only thing sitting around doing nothing are the FP32 units, everything else is forcefully being used to its full capabilities.
     
    #661 Benetanegia, Oct 6, 2020
    Last edited by a moderator: Oct 7, 2020
  2. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,477
    Likes Received:
    6,216
    You've been counter-arguing all my posts that explicitly condemn that message. If you don't disagree with it then why are we talking about it?
     
  3. Picao84

    Veteran Regular

    Joined:
    Feb 15, 2010
    Messages:
    1,954
    Likes Received:
    981
    Because you have been misrepresenting his point all the time...
     
  4. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,794
    Likes Received:
    713
    Location:
    msk.ru/spb.ru
    I don't know what you are talking about, hence why I ask you to show where I've said what you've quoted. Because I certainly didn't say anything of the kind.
    And do explain why you think that "Ampere needs a new game engine" to make full use of its FP32 h/w. Because it certainly doesn't. It doesn't even need any new API for that.
     
    PSman1700 likes this.
  5. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,372
    Likes Received:
    3,754
    This here sums up pretty much every important point regarding this topic.
     
    PSman1700, T2098, yuri and 4 others like this.
  6. Leoneazzurro5

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    84
    Likes Received:
    119
    Quite frankly, this is simply not true. It depends on the workload, for both Vega and Ampere. There will always be a bottleneck somewhere, whether the architecture is called Vega, RDNA, Ampere, Turing or Hopper. We can agree that for the gaming workloads of the time Vega's unit utilization was crap, Pascal was better, Turing even better, and Ampere decreased the average utilization. That said, declaring "everything else is used to its full capabilities" is simply crap. There will never be such a thing as "full utilization".
     
    Lightman likes this.
  7. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    376
    Likes Received:
    385
    I'm pretty sure that I said full capabilities, not full utilization. I think you even quoted it. They don't mean the same thing by any means. I think I made good use of language.
     
  8. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,794
    Likes Received:
    713
    Location:
    msk.ru/spb.ru
    Nope. With only 25% of gaming math on average being INT32, Turing's INT SIMDs are idling 75% of clocks in gaming, which is definitely worse than Pascal in h/w utilization but better for actual math throughput.
    The elephant in the room for Ampere is this: does Ampere have one FP32/INT SIMD or two separate INT and FP32 SIMDs on the same datapath?
    If it's the former then Ampere has actually improved h/w utilization over Turing as it can run these 75% of FP32 instructions on both SIMDs once it's done with 25% of INTs - nothing idles at any clock.
    If it's the latter then sure there will be a decrease of actual h/w utilization compared to Turing.

    Edit: Also of note - h/w utilization isn't the same as FLOPS utilization.
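    The two readings can be made concrete with a toy issue model. This is a hypothetical back-of-the-envelope sketch, ignoring warps, dependencies and latency, for 100 instructions at the ~25% INT mix quoted above:

```python
# Toy issue model: path A issues FP32 only; path B issues FP32 or INT32.
# Greedy schedule: INT goes to path B first, remaining FP32 fills both paths.
def ampere_cycles(fp=75, ints=25):
    cycles = 0
    while fp > 0 or ints > 0:
        if fp > 0:
            fp -= 1          # path A issues an FP32 instruction
        if ints > 0:
            ints -= 1        # path B issues an INT32 instruction
        elif fp > 0:
            fp -= 1          # path B issues an FP32 instruction instead
        cycles += 1
    return cycles

print(ampere_cycles())       # both paths busy every clock -> 50 cycles

# "One combined SIMD per path" reading: 2 units x 50 cycles for 100 issues
# -> 100% hardware utilization, nothing idles.
# "Two separate SIMDs on path B" reading: 3 units x 50 cycles for 100 issues
# -> ~67% hardware utilization, yet the same cycle count and throughput.
```

    Either way the cycle count is identical; only the "hardware utilization" bookkeeping differs, which is exactly the distinction being argued.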
     
    Man from Atlantis and PSman1700 like this.
  9. Leoneazzurro5

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    84
    Likes Received:
    119
    What does that mean? It makes no sense. Full capabilities means utilization of all units, to their theoretical maximum. There is no other meaning.
     
  10. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    376
    Likes Received:
    385
    Theoretical maximum is impossible. There is however a certain level that is achievable and repeatable for each use case. For example, for SGEMM on the latest generations it is >90% of theoretical maximum.

    As for my sentence: we are talking about averages, general cases. By your definition, "x% faster in games" makes as little sense as my sentence above. No one seems to have a problem with it though.
     
    PSman1700 likes this.
  11. Leoneazzurro5

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    84
    Likes Received:
    119
    It does not matter. Ampere cannot use the second FP units and the INT units at the same time. That means some unit will stay idle EVERY TIME, in the hardware. What does not stay idle is the scheduling of the second pipeline. Also, combining INT and FP will add context switches and overhead in the scheduling.
    So, utilization of available units decreased. Is it a good compromise? Yes, as it cost relatively little area to add this function and it even increases real FP throughput.
    Does it mean Ampere will hit double the FP throughput of Turing at ISO clocks? No, it will depend on the workload. That is all.
     
  12. Leoneazzurro5

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    84
    Likes Received:
    119
    The problem with your definition is that averages depend on the architecture. If, absurdly, efficiency in architecture X could be at most 50% because the architecture's constraints say so, it would be using its "maximum capabilities" even while half of its units stay idle all the time. Which common sense says is not so efficient.
     
  13. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    376
    Likes Received:
    385
    I think this needs addressing, because it's been repeated at least twice and at this point it's hard to tell whether it is part of the problem in this discussion or not. No one has suggested that Ampere will under any circumstance hit double FP throughput. Period.
     
    PSman1700 likes this.
  14. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    376
    Likes Received:
    385
    But we are not talking about efficiencies in that respect, and they don't matter. It doesn't matter if Nvidia's ROPs are typically capable of only 50% of their rated throughput due to architecture constraints; it will be the same on all cards and in all circumstances. Again, this is within the context of "average game performance", not this specific game at this specific resolution with this specific AO setting. ROP or FP32 or any other unit usage could be more efficient in one game or setting, but that likely produces a difference in game performance that makes it deviate from the "average game performance". Like the Crysis remaster, for example.

    Edit: basically, it's silly to point out that unit utilization can vary from task to task without also taking into consideration that performance in that task most likely varies too, because of it.
     
    PSman1700 likes this.
  15. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,794
    Likes Received:
    713
    Location:
    msk.ru/spb.ru
    Did you even read what I've said?

    Like on every other GPU in existence? Or do you somehow propose to not run INTs on them at all?

    Yes, it will depend on the workload and if said workload will be FP32 only then Ampere will hit exactly what you've said. This is all.
     
    PSman1700 likes this.
  16. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    340
    Likes Received:
    278
    Not sure why there is a “conclusion” drawn that “everything else sits idle if you do more math” is GCN/Vega specific.

    The GCN CU is explicitly stated to support issuing up to 5 instructions of different categories from 40 in-order instruction streams. I fail to see how it would be different for SMs that are basically doing the same thing, multiplexing a pool of in-order instruction streams onto two issue ports shared by everything. Parallelism in both is still bound by program order, and by compiler-assisted scheduling as permitted by the architecture.

    If your work is VALU bound, and you are running just a large grid of it, then of course everything else is sitting idle. The ROPs and caches aren’t gonna mine bitcoins on their own, are they? The same logic applies to any bottleneck of any kernel/workload.
     
    #676 pTmdfx, Oct 6, 2020
    Last edited: Oct 6, 2020
  17. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    376
    Likes Received:
    385
    No one said it's specific to GCN/Vega. However, would you agree that the 3080, with a 200:1 GFLOPS:GPixel ratio, vs. Vega 64's 127:1 ratio, is far more likely to hit a pixel bottleneck long before it hits a flops one?
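    The logic of the ratio argument can be sketched in a few lines. The ratios are taken as quoted in the post, and the per-pixel shader workload is a hypothetical number purely for illustration:

```python
# A card's GFLOPS:GPixel ratio is a flops-per-pixel budget: workloads doing
# less shader math per pixel than the budget saturate the ROPs first.
def bottleneck(card_flops_per_pixel, workload_flops_per_pixel):
    """Return which resource saturates first for this simplified model."""
    if workload_flops_per_pixel < card_flops_per_pixel:
        return "pixel-bound"
    return "ALU-bound"

# A hypothetical workload doing ~150 flops of shading per pixel written:
print(bottleneck(200, 150))  # 3080 at 200:1 as quoted -> "pixel-bound"
print(bottleneck(127, 150))  # Vega 64 at 127:1 as quoted -> "ALU-bound"
```

    The same workload lands on opposite sides of the two budgets, which is the point being made: the higher the flops-per-pixel ratio, the sooner the pixel side becomes the wall.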
     
    PSman1700 likes this.
  18. Leoneazzurro5

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    84
    Likes Received:
    119
    If these efficiencies do not matter, and everything depends on the workload, then why are both parties trying to demonstrate that "Vega is broken" and "Ampere will be the best gaming architecture ever" versus "Vega needs a specific workload to shine, and if it's broken then even Ampere is broken"?
    Ampere is more gaming-oriented than Vega; both could reach their peak FP throughput in certain workloads, and both have inefficiencies. One can easily argue that GCN has rather more problems hitting its peak in gaming, and I agree.
     
  19. Leoneazzurro5

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    84
    Likes Received:
    119
    Yes, I've read it, and some parts do not make sense at all because you are trying to reach a predetermined conclusion.
    Like

    Who spoke about FLOPS utilization? I was speaking about hardware utilization, in particular the second FP pipeline staying idle whenever an INT instruction must be executed.
    But you are trying to push your view even by saying that "Ampere has increased HW utilization". If we were talking about ROPs or TMUs, I could agree, but we were specifically talking about the shader core, and in the shader core ALWAYS either the second FP or the INT pipeline is idle, whereas in Turing that does not happen. So there is hardware in Ampere that is always unused. How you can write "hardware utilization increases" is beyond me.

    You are continuing to move the target.

    And then you will have the hardware of the INT pipeline completely unused. So hardware utilization in the shader core is actually lowered. But as I understand that you are not trying to discuss honestly but are pushing your agenda here, I will stop discussing at this point.
     
  20. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    376
    Likes Received:
    385
    First of all, no one has said any of that.

    Second, because those efficiencies are repeatable. Going by SGEMM, for example, nothing in the Ampere architecture suggests that it would be less capable of reaching the same percentage as Turing in that workload. On the contrary: 50% more L1, 2x the L1 bandwidth and more varied L1/shared memory configurations suggest that better efficiency is actually possible. Now, instead of a pure SGEMM program, suppose a certain portion of frametime is spent doing mostly MxM math. What prevents Ampere from being 2x faster per SM in those portions? Nothing. And the 90% figure is lost and irrelevant to that.
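    The "2x faster in the math-heavy portion" claim translates into overall frame impact via a simple Amdahl-style calculation. The fraction of frametime spent in pure FP32 math below is hypothetical, chosen only to show the shape of the relationship:

```python
# Amdahl-style sketch: if a fraction f of frametime is pure-FP32 math that
# runs 2x faster per SM, the per-SM frame speedup is bounded by f.
def frame_speedup(fp32_fraction, fp32_gain=2.0):
    return 1.0 / ((1.0 - fp32_fraction) + fp32_fraction / fp32_gain)

print(frame_speedup(0.3))  # 30% of the frame doubled -> ~1.18x overall
```

    So a 2x win inside the MxM-heavy portion is perfectly compatible with a much smaller average gain over whole frames, which is why nobody should expect the full doubling to show up in game benchmarks.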
     
    pharma and PSman1700 like this.