Speculation: GPU Performance Comparisons of 2020 *Spawn*

Discussion in 'Architecture and Products' started by eastmen, Jul 20, 2020.

Thread Status:
Not open for further replies.
  1. Leoneazzurro5

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    84
    Likes Received:
    119
    Quite frankly, if someone rereads all the discussion until now, that is the impression he'll get.

    And again, you are trying to generalize specific workloads to every workload. Here you are describing a specific scenario and then applying that result to every scenario. Could what you just said be applied to every workload? No. Because as soon as I mix in INT instructions, I stop being 2x or more faster than Turing. So instead of being 2x faster, we get "something that is more than 1 and less than 2"x.
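
    To put rough numbers on that "more than 1 and less than 2": a minimal sketch, assuming an idealised issue-throughput model that does not come from NVIDIA's documentation. It assumes Turing issues FP32 and INT32 on two separate paths in parallel, while Ampere has one FP32-only path plus one shared path that does either FP32 or INT32 in a given clock. Memory, latency and scheduling limits are ignored, and the function names are made up for illustration.

```python
def turing_cycles(fp: float, intg: float) -> float:
    """Separate FP32 and INT32 paths run in parallel (assumed model)."""
    return max(fp, intg)


def ampere_cycles(fp: float, intg: float) -> float:
    """One FP32-only path plus one shared FP32-or-INT32 path (assumed model).

    INT work can only go to the shared path; FP work is balanced over both.
    """
    return max(intg, (fp + intg) / 2)


def ampere_speedup(fp: float, intg: float) -> float:
    """Per-SM, same-clock speedup of the Ampere model over the Turing model."""
    return turing_cycles(fp, intg) / ampere_cycles(fp, intg)


if __name__ == "__main__":
    # Instruction mixes expressed as FP:INT ratios (hypothetical examples).
    for fp, intg in [(1, 0), (3, 1), (2, 1), (1, 1)]:
        print(f"FP:INT = {fp}:{intg} -> {ampere_speedup(fp, intg):.2f}x")
```

    Under these assumptions a pure FP32 mix gives 2x, a 3:1 mix gives 1.5x, a 2:1 mix gives about 1.33x, and a 1:1 mix gives no gain at all.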
     
    Krteq likes this.
  2. Picao84

    Veteran Regular

    Joined:
    Feb 15, 2010
    Messages:
    1,954
    Likes Received:
    981
    I can't see how Ampere's utilisation of ALUs is lower than Turing's. If Turing had completely separate INT units that were only used 33% of the time, and those units now double as FP units as well, wouldn't that mean that utilisation actually increased? That when those units are not doing INT calculations they can do FP instead of staying idle? I guess this might be more complex than that, in case some INT/FP units are not used at all because something on the same SM is doing INT, preventing some cores from doing FP?
     
    T2098, Scott_Arm and PSman1700 like this.
  3. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,794
    Likes Received:
    713
    Location:
    msk.ru/spb.ru
    I'm not the one who's trying to reach a predetermined conclusion here.

    The guys at the start of the discussion? "GA102 never reaches 30 TFLOPs in games because the shader processors will halt while waiting for other bottlenecks in the chip"
    You in your previous post? "Does it mean Ampere will hit double of FP throughput respect to Turing at ISO clocks? No, it will depend on the workload. This is all."

    H/w utilization means little if your chip is still efficient enough per transistor to be competitive. That's the beauty of GPUs - you can do things in a million different ways, and the only thing that matters is the performance per price. So strictly speaking it doesn't even matter and all this discussion is pure theory.

    Well let's look at the information which we have then?

    [IMG]

    Do you see any "second FP OR the INT pipeline" here?

    As I've said, the answer to that question is tied to the answer to how exactly Ampere handles INT execution.
    If it's a separate SIMD then yeah, there will be more idle h/w in Ampere than in Turing when running the same code.
    If it's the same SIMD as that which is used for FP32 then no, there will be less idle h/w here than in Turing.

    I'm dead solid on my target. It's you who constantly moves between h/w and flops utilization - which aren't at all the same.

    Again, choose what you're talking about. It's either perf/flop which this discussion has started on or general h/w utilization. If it's the latter then there are two possible scenarios for Ampere, not one.

    Says the man who switched the goalpost at least two times in two consecutive posts.
     
    #683 DegustatoR, Oct 6, 2020
    Last edited: Oct 6, 2020
    Scott_Arm, pharma and PSman1700 like this.
  4. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    376
    Likes Received:
    385
    I didn't apply it to every workload. We are discussing capabilities, hence my example. That's peak.

    That's correct, and it applies to GCN and RDNA too: depending on INT share it's going to be something between 0 and 1, instead of 1 and 2. Only Turing gets a pass on this. If this has never ever been discussed before, why do we have to mention it now exactly?

    Anyway, for Ampere, if you decrease INT share in favor of FP share, do you get closer to or farther away from that 2x figure? Literally no one has said that future games will reach 2x, only that they will get closer to it than current games. The closest has been DegustatoR talking about a hypothetical workload that uses FP32 exclusively. Disregarding how realistic such an application would be, would such an FP32-bound workload be 2x faster or not?
     
    PSman1700 likes this.
  5. Leoneazzurro5

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    84
    Likes Received:
    119
    I think everyone here agrees that if the workload is purely FP32 then Ampere has a 2x advantage. The problem is that the Nvidia PR man is trying to push that advantage everywhere, every time.
     
  6. troyan

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    224
    Likes Received:
    405
    Per SM block, FP32 is up 2x over Turing. So utilisation has improved immensely (with the help of 33% more L1 cache and twice the bandwidth). A 2080 Ti has 47% more SMs than the RTX 3070 and both perform equally (minus less ROP and geometry performance, plus 37.5% more bandwidth).
     
  7. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,794
    Likes Received:
    713
    Location:
    msk.ru/spb.ru
    It can be more complex if Ampere has added a second FP32 SIMD to Turing's INT32+FP32 SIMDs. In this case there are actually three math units in Ampere h/w (well, four counting SFUs), with two of them sitting on the same datapath and thus unable to be used in parallel. This is possible, as NV generally tends to do specialized h/w for different math types in their GPUs. It would still be somewhat weird to see in Ampere though, considering that a SIMD which can run both FP32 and INT32 isn't rocket science - NV had them up until Turing, AMD has them, Intel has them too I think?

    I can think of only one apparent advantage of having them separate in h/w - if the combined complexity of separate FP32+INT SIMDs is less than the complexity of a universal FP32+INT SIMD. In which case it won't matter much that one of them will be idle at each given clock as you would still have a net win in perf/transistor.
     
    PSman1700 likes this.
  8. Leoneazzurro5

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    84
    Likes Received:
    119
    Because these are not the same ALUs. It is the same datapath/scheduler. So if you are saying that the scheduler has increased utilization, it is true. If you take the ALUs, I could have more ALU-dedicated transistors idling per clock. So can an SM do more work per clock? Yes. Does it use more of its ALU transistors per clock, or get the same average ALU utilization? Depends on the workload. For gaming, this seems not to be true.
     
    #688 Leoneazzurro5, Oct 6, 2020
    Last edited: Oct 6, 2020
  9. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,794
    Likes Received:
    713
    Location:
    msk.ru/spb.ru
    Which can be said about any GPU on the market but Turing then?
     
    PSman1700 likes this.
  10. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    376
    Likes Received:
    385
    Then why doesn't everyone agree that if the mix is 1.6x FP rather than, say, 1.4x FP, as is the case on average in current games, there would be an increase in performance?

    Literally no one is doing that. The only time the 2x has been brought up has been in purely hypothetical scenarios like the one I presented, because it was put into question that Ampere can reach that peak performance in pure FP workloads.
     
    PSman1700 likes this.
  11. Picao84

    Veteran Regular

    Joined:
    Feb 15, 2010
    Messages:
    1,954
    Likes Received:
    981
    Edit - You are right.
     
    #691 Picao84, Oct 6, 2020
    Last edited: Oct 6, 2020
    PSman1700 likes this.
  12. Picao84

    Veteran Regular

    Joined:
    Feb 15, 2010
    Messages:
    1,954
    Likes Received:
    981
    Isn't it the other way around? I'm very confused lol.
     
  13. Picao84

    Veteran Regular

    Joined:
    Feb 15, 2010
    Messages:
    1,954
    Likes Received:
    981
    Never mind, I looked at the White Paper. Yes, they are independent ALUs.
     
  14. Leoneazzurro5

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    84
    Likes Received:
    119
    There is waste both ways. FP:INT in modern games varies greatly, but it should normally be around 2:1 on average (it depends on the engine, shaders and so on; sometimes it is more). If this is the ratio, then in Turing you have the INT pipeline working half the time and the FP pipeline working all the time. In Ampere you have one FP unit working all the time, and the other FP unit working only 33% of the time, while for the other 66% of the time the INT unit works on that second datapath. So in Turing I have an INT unit sitting idle 50% of the time; in Ampere I always have an INT or an FP SIMD idle. The higher the FP:INT ratio, the better the Ampere usage and the worse the Turing usage. For a pure FP workload, Turing is more inefficient in terms of ALU utilization, as the INT pipeline will always be idle; the same is true in Ampere, but in that case you still have 2/3 of the SIMDs in action.
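
    The percentages above can be checked with a short sketch. It assumes the picture used in this post: Turing has two ALU sets (FP, INT) that run in parallel, and Ampere has three (FP, FP, INT) with the second FP set and the INT set sharing one datapath. That three-unit layout is an assumption which is itself disputed in this thread, and the function names are hypothetical.

```python
def turing_utilisation(fp: float, intg: float) -> float:
    """Fraction of Turing's two ALU sets (FP, INT) busy, averaged over time."""
    cycles = max(fp, intg)
    return (fp + intg) / (2 * cycles)


def ampere_cycles(fp: float, intg: float) -> float:
    """Cycles needed with one FP-only path and one shared FP-or-INT path."""
    return max(intg, (fp + intg) / 2)


def ampere_utilisation(fp: float, intg: float) -> float:
    """Fraction of Ampere's three assumed ALU sets (FP, FP, INT) busy.

    The second FP set and the INT set share a datapath, so at most one of
    them is active in a given clock.
    """
    return (fp + intg) / (3 * ampere_cycles(fp, intg))


def shared_path_int_share(fp: float, intg: float) -> float:
    """Fraction of time the shared Ampere datapath spends on INT work."""
    return intg / ampere_cycles(fp, intg)


if __name__ == "__main__":
    for fp, intg in [(2, 1), (1, 0)]:
        print(
            f"FP:INT = {fp}:{intg}: "
            f"Turing {turing_utilisation(fp, intg):.0%}, "
            f"Ampere {ampere_utilisation(fp, intg):.0%}, "
            f"shared path on INT {shared_path_int_share(fp, intg):.0%}"
        )
```

    With a 2:1 mix this model has the shared path on INT about two thirds of the time, Turing's ALU sets 75% busy and Ampere's about 67% busy. With pure FP, Turing drops to 50% while Ampere stays at about 67%.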
     
  15. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    3,443
    Likes Received:
    1,364
    One thing has to be said, discussions here in the GPU/PC sections are on a whole different level compared to the Console forums. Thanks.
     
    Lightman likes this.
  16. Leoneazzurro5

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    84
    Likes Received:
    119
    I don't know, because it is true (I explained it in the post just above). The question is, for better transistor utilization it would have been better to have 2 complete FP pipelines and 1 INT; probably the transistor budget would not have allowed that. Likewise, it is true that increasing the FP:texture ratio in Vega would have improved ALU utilization and thus performance.

    The question was not posed because of peak FP performance in pure FP workloads, because that is true: in those workloads Ampere will do 2x Turing, per SM. The question was about ALU utilization and that 2x FP per SM performance not being achievable, in general, in gaming workloads.
     
  17. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    376
    Likes Received:
    385
    And to that I responded that there is no such thing as a predefined "gaming workload". One could decide to write a software renderer using purely (mostly) FP32 for their game. Is that less of a gaming workload?
     
    PSman1700 likes this.
  18. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,794
    Likes Received:
    713
    Location:
    msk.ru/spb.ru
    It's not. All GPUs but Turing run INTs on the same SIMDs as FP, which means that these SIMDs have sets of ALUs for both math types. So in this sense Ampere is the same as Navi or GCN or Pascal or Kepler, etc. So when he's saying that "these are not the same ALUs", implying that Ampere is idling a set of ALUs of a SIMD when it's running FP or INT on it - it is the exact same thing as any other GPU on the market does, with the exception of Turing, which is running INTs on a separate SIMD where there are no FP ALUs.
     
    PSman1700 likes this.
  19. Leoneazzurro5

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    84
    Likes Received:
    119
    Semantics. Gaming workloads are those that are found in the real world. At the moment, no one has written a pure FP32 gaming workload; the FP:INT ratio varies from 1.7:1 to around 3:1.
    In the future? Who knows. That does not demonstrate these gaming workloads will exist.
     
  20. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,794
    Likes Received:
    713
    Location:
    msk.ru/spb.ru
    That's assuming that there are two separate SIMDs for FP and INT - which is unlikely. A far more likely scenario is one SIMD with two sets of ALUs - just like in RDNA or Pascal or whatever. So what are you even arguing about?
    RDNA, for example, which has 4 SIMDs in a WGP, each of which is capable of running INT32, is wasting a lot more h/w than Ampere in games where only 25% of the math is INT32.
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.