Speculation: GPU Performance Comparisons of 2020 *Spawn*

Discussion in 'Architecture and Products' started by eastmen, Jul 20, 2020.

Thread Status:
Not open for further replies.
  1. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    2,414
    Likes Received:
    1,963
    Location:
    msk.ru/spb.ru
    Crysis Remastered does use RT h/w but in a weird way - via a Vulkan interop from its D3D11 renderer.
    20 series cards fare a bit better here thanks to this than in the purely s/w RT Neon Noir demo.
     
    Lightman, Krteq, PSman1700 and 3 others like this.
  2. T2098

    Newcomer

    Joined:
    Jun 15, 2020
    Messages:
    49
    Likes Received:
    105
    Huh - today I learned. Thanks for this! Details are pretty sparse on exactly how they implemented it.
    What few references I could find to it seemed to mention that they were only using the Vulkan VKRay extensions to accelerate RT reflections and not for any of the GI stuff.

    Crytek also said that the remaster only uses RT reflections sporadically - where they would produce a significantly different result. Most reflections are still done the old way, in screen space.

    https://www.pcgamer.com/crysis-rema...nd-hardware-ray-tracing-and-is-built-on-dx11/
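    That "only when it would differ" policy sounds like the usual hybrid fallback: march in screen space first and spawn a hardware ray only where screen space can't answer. A toy sketch of that decision logic (my own illustration in Python, not Crytek's actual code):

    Code:
    from dataclasses import dataclass

    @dataclass
    class SSRResult:
        found: bool
        color: tuple = (0.0, 0.0, 0.0)

    def screen_space_march(uv, reflect_dir):
        # Toy stand-in: "succeed" only if the reflected sample stays on screen.
        u, v = uv[0] + reflect_dir[0], uv[1] + reflect_dir[1]
        if 0.0 <= u <= 1.0 and 0.0 <= v <= 1.0:
            return SSRResult(True, (u, v, 0.0))   # pretend sample of the frame
        return SSRResult(False)

    def trace_hardware_ray(uv, reflect_dir):
        # Stand-in for a DXR/VKRay dispatch against the scene BVH.
        return (0.2, 0.4, 0.8)

    def reflection_color(uv, reflect_dir):
        ssr = screen_space_march(uv, reflect_dir)
        return ssr.color if ssr.found else trace_hardware_ray(uv, reflect_dir)

    print(reflection_color((0.5, 0.5), (0.1, 0.1)))   # SSR answers
    print(reflection_color((0.9, 0.9), (0.5, 0.5)))   # falls back to a hardware ray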
     
    #642 T2098, Oct 6, 2020
    Last edited: Oct 6, 2020
    PSman1700 and pharma like this.
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,308
    Likes Received:
    1,587
    Location:
    London
    Quake 2 RTX is the old OpenGL-era game with its renderer replaced by a Vulkan path tracer, which uses Vulkan ray tracing extensions for hardware-based ray tracing.

    Crysis Remastered ray tracing adds to CPU bottlenecks:



    I've timestamped the section of the video showing CPU bottlenecking with RT.
     
    Lightman likes this.
  4. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    2,414
    Likes Received:
    1,963
    Location:
    msk.ru/spb.ru
    Building and updating BVHs for RT adds to CPU work of course.
     
    PSman1700 likes this.
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,308
    Likes Received:
    1,587
    Location:
    London
    It certainly does. But there are plenty of spare cores available on the systems tested in the video, so CPU threading for RT looks pretty broken, if that's the cause.

    RT also adds data that traverses PCI Express, if we assume the BVH is updated on the CPU. But in the video the BVH looks like it would be mostly static in the places where the "CPU bottleneck from RT" is being shown.
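    A quick back-of-envelope for that PCI Express point (every number here is my own assumption, purely illustrative):

    Code:
    # How much bus traffic would per-frame CPU-side geometry updates for the
    # BVH generate? Assumed workload: 200k skinned triangles re-uploaded per frame.
    dynamic_triangles = 200_000
    bytes_per_vertex  = 16        # assumed: position + padding
    fps               = 60

    upload_gb_per_s = dynamic_triangles * 3 * bytes_per_vertex * fps / 1e9
    pcie3_x16_gb_s  = 15.75       # practical peak for PCIe 3.0 x16

    print(f"{upload_gb_per_s:.2f} GB/s, "
          f"{upload_gb_per_s / pcie3_x16_gb_s:.1%} of PCIe 3.0 x16")
    # ~0.58 GB/s, under 4% of the bus - so a mostly static BVH shouldn't
    # bottleneck PCIe, which points back at threading as the suspect.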

    RT itself seems to add to the "draw call count" problem, on PC. Here is the file src/refresh/vkpt/shader/path_tracer.h from NVidia's Quake 2 RTX implementation:

    https://github.com/NVIDIA/Q2RTX/blob/master/src/refresh/vkpt/shader/path_tracer.h

    which documents RT usage:

    Code:
    The path tracer is separated into 4 stages for performance reasons:
      1. `primary_rays.rgen` - responsible for shooting primary rays from the
         camera. It can work with different projections, see `projection.glsl` for
         more information. The primary rays stage will find the primary visible
         surface or test the visibility of a gradient sample. For the visible
         surface, it will compute motion vectors and texture gradients.
         This stage will also collect transparent objects such as sprites and
         particles between the camera and the primary surface, and store them in
         the transparency channel. Surface information is stored as a visibility
         buffer, which is enough to reconstruct the position and material parameters
         of the surface in a later shader stage.
         The primary rays stage can potentially be replaced with a rasterization
         pass, but that pass would have to process checkerboarding and per-pixel
         offsets for temporal AA using programmable sample positions. Also, a
         rasterization pass will not be able to handle custom projections like
         the cylindrical projection.
      2. `reflect_refract.rgen` - shoots a single reflection or refraction ray
         per pixel if the G-buffer surface is a special material like water, glass,
         mirror, or a security camera. The surface found with this ray replaces the
         surface that was in the G-buffer originally. This shader is executed a
         number of times to support recursive reflections, and that number is
         specified with the `pt_reflect_refract` cvar.
         To support surfaces that need more than just a reflection, the frame is
         separated into two checkerboard fields that can follow different paths:
         for example, reflection in one field and refraction in another. Most of
         that logic is implemented in the shader for stage (2). For additional
         information about the checkerboard rendering approach, see the comments
         in `checkerboard_interleave.comp`.
         Between stages (1) and (2), and also between the first and second
         iterations of stage (2), the volumetric lighting tracing shader is
         executed, `god_rays.comp`. That shader accumulates the inscatter through
         the media (air, glass or water) along the primary or reflection ray.
      3. `direct_lighting.rgen` - computes direct lighting from local polygonal and
         sphere lights and sun light for the surface stored in the G-buffer.
      4. `indirect_lighting.rgen` - takes the opaque surface from the G-buffer,
         as produced by stages (1-2) or previous iteration of stage (4). From that
         surface, it traces a single bounce ray - either diffuse or specular,
         depending on whether it's the first or second bounce (second bounce
         doesn't do specular) and depending on material roughness and incident
         ray direction. For the surface hit by the bounce ray, this stage will
         compute direct lighting in the same way as stage (3) does, including
         local lights and sun light.
         Stage (4) can be invoked multiple times, currently that number is limited
         to 2. First invocation computes the first lighting bounce and replaces
         the surface parameters in the G-buffer with the surface hit by the bounce
         ray. Second invocation computes the second lighting bounce.
         Second bounce does not include local lights because they are very
         expensive, yet their contribution from the second bounce is barely
         noticeable, if at all.
    You can see that there are multiple invocations of some stages, which I'm guessing correspond to "draw calls" in conventional rasterisation-based rendering.
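    Reading that comment, the per-frame dispatch order would look roughly like the loop below (my own pseudocode of the documented stage sequence, not NVidia's actual host code; each dispatch() stands in for one ray-gen or compute invocation):

    Code:
    def dispatch(stage):
        print(f"dispatch: {stage}")

    def render_frame(pt_reflect_refract=2, indirect_bounces=2):
        dispatch("primary_rays.rgen")            # stage 1: primary visibility
        dispatch("god_rays.comp")                # volumetrics between stages 1 and 2
        for i in range(pt_reflect_refract):      # stage 2: recursive reflect/refract
            dispatch("reflect_refract.rgen")
            if i == 0:
                dispatch("god_rays.comp")        # also between iterations 1 and 2
        dispatch("direct_lighting.rgen")         # stage 3: direct lighting
        for _ in range(indirect_bounces):        # stage 4: up to two bounces
            dispatch("indirect_lighting.rgen")

    render_frame()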

    So far I've not been able to find a description of the hardware/driver execution model for ray generation versus intersection testing. It seems that intersection results are written to a VRAM buffer, and so would be consumed by the next stage of ray generation (e.g. for the later iterations of stages 2 and 4).

    Ideally it would be possible to pipeline all these stages, so that the entire pipeline stays fully occupied and intersection test results never have to leave the GPU die. But my idea of "ideally" may be a mismatch for the quantities of data and for GPU capability. It might also be a mismatch with respect to cache coherency during BVH traversal.
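    For a sense of the quantities of data involved, a rough guess at the per-stage output buffers (purely illustrative numbers, not taken from the Q2RTX source):

    Code:
    # Assumed payload per pixel: hit position, normal, material id, etc.
    width, height = 1920, 1080
    bytes_per_pixel_payload = 32   # my assumption
    dispatches = 6                 # primary + 2x reflect/refract + direct + 2x indirect

    buffer_mb = width * height * bytes_per_pixel_payload / 1e6
    print(f"~{buffer_mb:.0f} MB per buffer, ~{dispatches * buffer_mb:.0f} MB of "
          f"intermediate traffic if every stage round-trips through VRAM")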

    So there's a lot unknown about the hardware, about the nature of "draw calls" with respect to RT invocations, and about where CPU limitations would arise.

    Obviously Q2RTX is a "path tracer", so it makes much more intensive use of RT than Crysis Remastered does. But at least it provides some prompts on how to think about the hardware, the driver and CPU threading.
     
    Lightman, PSman1700, T2098 and 2 others like this.
  6. ToTTenTranz

    Legend Veteran

    Joined:
    Jul 7, 2008
    Messages:
    12,294
    Likes Received:
    7,248
    There's no game with a 50/50 split between FP32 and INT32 math. This information comes directly from Nvidia, who saw an average of ~25% of total operations in games being INT32. The highest they found was Battlefield 1, with 33% of operations being INT32.
    Just like the 2080 never reaches 10 TFLOPs + 10 TIOPs, GA102 never reaches 30 TFLOPs in games, because the shader processors will stall while waiting on other bottlenecks in the chip (e.g. memory bandwidth, triangle setup, fillrate, etc.).
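    To put numbers on that, here's a toy per-SM issue model (my own simplification of the published SM layouts, not an official figure) showing why Ampere's paper FLOPS don't fully materialize once the INT mix is accounted for:

    Code:
    # Turing SM: 64 FP32 lanes + 64 INT32 lanes per clock.
    # Ampere SM: 64 FP32 lanes + 64 lanes that run FP32 *or* INT32.
    def fp32_per_clock(int_frac, fp_capable, int_capable, total_lanes):
        fp_work, int_work = 1.0 - int_frac, int_frac
        # Cycle cost of a (normalized) instruction stream is the tightest limit:
        cycles = max(fp_work / fp_capable, int_work / int_capable, 1.0 / total_lanes)
        return fp_work / cycles   # FP32 instructions issued per clock

    mix = 0.25   # ~25% of ops being INT32, per the Nvidia average quoted above
    turing = fp32_per_clock(mix, fp_capable=64,  int_capable=64, total_lanes=128)
    ampere = fp32_per_clock(mix, fp_capable=128, int_capable=64, total_lanes=128)
    print(f"Turing: {turing:.0f} FP32/clk, Ampere: {ampere:.0f} FP32/clk, "
          f"{ampere / turing:.2f}x at equal clocks and SM count")
    # -> 64 vs. 96 FP32/clk: a 1.5x real gain from a 2x paper-FLOPS increase.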



    Wow. So classy.


    Except your metrics aren't wrong by 5%. They're wrong by a lot more than that. Where you claimed 12.6 TFLOPs on Vega 64 vs. 6.1 TFLOPs on the RX580, it's actually 11.5 TFLOPs on Vega 64 vs. 6.5 TFLOPs on the RX580. We just went from 106% higher throughput to 77% higher throughput. That's a 30% difference from your claims, not 5%.
    In the case of Vega 56, it's 9.3 TFLOPs vs. 6.5 TFLOPs, which means it has 43% higher FP32 throughput than the RX580.


    And if you increased only FP32 math in games and nothing else (somehow without increasing bandwidth requirements, which might be impossible in any real-life scenario unless other loads are reduced), then the Vega 64 would also gain game performance. At ISO clocks, the Vega 56 and Vega 64 have exactly the same gaming performance, meaning the Vega 64 always has at least 8 CUs / 512 SPs / 1.4 TFLOPs - one Xbox One's worth of compute power - just idling. That didn't mean the architecture was broken; it meant RTG at the time had very limited resources, and since they could only launch one 14nm GCN5 chip for desktop, they chose to make one that served several markets at once, at the cost of gaming performance.
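    Checking that idle-resource figure with nominal numbers:

    Code:
    # Worked check of the "8 CUs idling" figure (nominal clocks, not measured).
    idle_cus     = 64 - 56   # Vega 64 vs. Vega 56 at ISO clocks
    sp_per_cu    = 64        # GCN shader processors per CU
    flops_per_sp = 2         # one FMA = 2 FLOPs per clock
    clock_ghz    = 1.4

    idle_tflops = idle_cus * sp_per_cu * flops_per_sp * clock_ghz / 1000
    print(f"{idle_cus} CUs ~ {idle_tflops:.2f} TFLOPs idle")   # ~1.43 TFLOPs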

    You either believe that games will push higher proportions of FP32 or you don't. You don't get to decide that only your favorite IHV gets to blame game engines while the other gets accused of being broken and hopeless.
     
    Lightman, T2098 and Kaotik like this.
  7. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    2,414
    Likes Received:
    1,963
    Location:
    msk.ru/spb.ru
    Feel free to adjust the proportion to whatever you want, really. 25% INT on average means that you have to add about 25% to a Turing FLOPS number to see it in Ampere FLOPS metrics.

    Well, that's what I call progress. The next step is substituting "any GPU" for the 2080 and GA102 there, really.
    The difference between Turing and Ampere here is also pretty obvious: while it's somewhat unlikely that INT usage in games will rise above this ~25% average in the future, it is far more likely that FP32 usage in games will in fact increase.
    Thus while Turing will never be able to fully utilize its INT h/w in gaming, Ampere may well eventually reach its peak FLOPS in games.
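    Worked out, taking the "25% of total ops are INT" figure at face value (the 2080 TFLOPs number below is its nominal boost rating, my choice of example):

    Code:
    # If 25% of total ops are INT32, then INT/FP = 0.25/0.75, so a Turing FLOPS
    # number scales by ~1.33x when expressed in "Ampere FLOPS", where the same
    # INT work would occupy FP32-capable lanes.
    int_share_of_total = 0.25
    scale = 1 + int_share_of_total / (1 - int_share_of_total)   # ~1.33x

    turing_2080_tflops = 10.1   # nominal RTX 2080 boost FP32
    print(f"RTX 2080: {turing_2080_tflops} TFLOPs -> "
          f"~{turing_2080_tflops * scale:.1f} TFLOPs in Ampere's metric")
    # ~13.5 with this reading; a flat "+25%" rule of thumb gives ~12.6.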
     
    Picao84, PSman1700 and pharma like this.
  8. T2098

    Newcomer

    Joined:
    Jun 15, 2020
    Messages:
    49
    Likes Received:
    105
    One other thing I'd be very curious about: Crysis Remastered's settings are not particularly granular. From what I've seen, unlike in Control for example, there aren't fine-grained options for things like reflections.
    Obviously SVOGI on the shaders is there for every card with RT enabled. The question I have is: are the RT reflections there on non-Turing and non-Ampere cards, or do those cards just use the screen-space fallback code path 100% of the time?

    And if the RT reflections *are* done on the shaders for the other cards, are they the same quality, resolution, draw distance, etc.? That makes apples-to-apples comparisons a bit difficult. Is there a way to completely disable all the RTX extensions in software on Turing or Ampere and make it behave like all the other cards?

    It almost seems like Crytek added the tiny bit of RTX hardware support as an afterthought / marketing push, so that people who have RTX-capable cards don't dismiss Crysis Remastered as 'all software, for AMD cards' and skip it. It lets them tick the little box on the spec sheet saying 'Look, we use your RTX hardware too! Buy our game!'
     
  9. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,891
    Likes Received:
    4,079
    Location:
    Finland
    RT reflections are there even on current-gen consoles, though the draw distance is much lower on them. For PC I haven't seen anyone mention the range being dependent on hardware acceleration.
     
    T2098 likes this.
  10. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    394
    Likes Received:
    425
    6.2 vs. 6.5 is what again? Hmm, 5%? Wow!

    And even with your nitpicked numbers - best-possible RX580 clocks and worst-case Vega clocks - you still have to explain a 21% gap between all of Vega's performance metrics (not just FP32) and its actual gaming performance. Meanwhile the 3080's scaling vs. the 2080 is 30% faster than its increase in both pixel and texture fillrate. Totally the same situation. Totally! Not.

    Confirming that Vega had scaling issues that have nothing to do with FP32, because, for the nth time, Vega 56 not only had less FP32, it had less memory bandwidth, less texture fillrate, etc., all reduced by the same proportion. It was not only the FP32 that went unused.

    It wouldn't gain anything, because nothing else was holding it back: it has the exact same proportion of extra texture fillrate, pixel fillrate and memory bandwidth as it does FP32, so using more of one type is not going to change the existing balance. Vega is not held back by anything in particular, except probably triangle setup, as was mentioned (which again qualifies as a scaling issue). Ampere, by contrast, has only an excess of FP32 and lacks fillrate and bandwidth. Ampere didn't get its FP32 increase by adding more of everything (i.e. more CUs/SMs); they did it by adding FP32 SIMDs.

    Which is the definition of having scaling issues. CUs do a lot more than just FP32. It does INT. It does textures. It does special functions. It does load/store. It does atomics. Etc.

    They didn't sacrifice anything. Compared to the RX580 it was a pixel- and texel-fillrate monster, with an over 2x increase! They didn't only increase the FP32 SIMDs, they more than doubled everything. It just couldn't scale.

    No one has done that. No one has blamed game engines. Pointing out that an architecture that is very heavy in a single metric would see gains when that metric is used more intensively than the others is not blaming anything; it's pointing out the obvious. Pointing out that an architecture which was much stronger than its predecessor in every metric, yet didn't scale performance accordingly, has scaling issues is not the same as saying it is "broken and hopeless". No one has said that either, so stop putting words in other people's mouths, in the most hilarious and pathetic attempt at a strawman that I've seen in months. And maybe then we can start talking about class.
     
    Konan65, Picao84, pharma and 2 others like this.
  11. Geeforcer

    Geeforcer Harmlessly Evil
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,320
    Likes Received:
    525
    Has AMD been notified yet?

    (attached image: upload_2020-10-6_12-3-26.png)
     


    Jensen Krage, Konan65, pharma and 3 others like this.
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,308
    Likes Received:
    1,587
    Location:
    London
    GA102 has an extra GPC compared with TU102.
     
  13. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    394
    Likes Received:
    425
    The 3080 doesn't.
     
    PSman1700 likes this.
  14. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    2,414
    Likes Received:
    1,963
    Location:
    msk.ru/spb.ru
    Not really.
    FP32 is a math precision, and you have to be very specific about what you're actually increasing here for it to be of any actual benefit on Vega, or GCN specifically.
    If we're talking about compute, then yes, it's pretty likely that Vega will scale at least better than it does in games which don't push as much FP32 compute (not that precision matters for Vega, which runs all math on the same units, just with different throughputs). I have to say this isn't an obscure case, with more and more engines using compute for tasks which were previously all graphics (fixed function, to a degree). So in this sense, sure, Vega is likely to do better in the future - not because games use more FP32 specifically, but because games keep moving tasks from the older graphics pipeline to generic compute approaches.
    If we're talking about graphics shading, though, things get a lot more complex: by simply increasing the shading workload - even with the math in it being exclusively FP32 - you're very likely to run into one of Vega's various bottlenecks, which won't allow it to scale as well as you'd think.
    And this is really the key difference between GCN and Ampere here: Ampere doesn't require a straight-up refactoring of your renderer to make full use of its FP32 capabilities; GCN did.
     
    Konan65, Picao84, pharma and 2 others like this.
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,308
    Likes Received:
    1,587
    Location:
    London
    If you want to make a decent argument you had better not cherry pick for random reasons.
     
    ToTTenTranz likes this.
  16. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    394
    Likes Received:
    425
    What? We are talking about the 3080. Why should we care about GPCs that aren't there?
     
  17. ToTTenTranz

    Legend Veteran

    Joined:
    Jul 7, 2008
    Messages:
    12,294
    Likes Received:
    7,248
    I'm sure you didn't forget the part where you inflated Vega's clocks, you're only pretending you did.


    Nah, they're just typical load frequencies when playing games, from the same benchmarks seen in your TechPowerUp reference.


    What exactly is "all performance metrics!!!111oneone" ?

    Vega 56 (typical 1.3GHz clock) vs. RX580 (typical >1.4GHz clock):
    1.43x FP32 throughput (9.3 vs. 6.5 TFLOPs)
    1.60x memory bandwidth (410 vs. 256 GB/s)
    1.84x pixel fillrate (83 vs. 45 GPixel/s)
    1.45x texel fillrate (291 vs. 201 GTexel/s)
    1.38x higher performance at 1440p, 1.45x higher performance at 4K.

    So where is your 21% gap between all performance metrics here?
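    The ratios in that list are easy to recompute from the quoted specs:

    Code:
    # Recomputing the list above from the quoted spec numbers.
    vega56 = dict(tflops=9.3, bw_gbs=410, gpix_s=83, gtex_s=291)
    rx580  = dict(tflops=6.5, bw_gbs=256, gpix_s=45, gtex_s=201)

    for metric in vega56:
        print(f"{metric}: {vega56[metric] / rx580[metric]:.2f}x")
    # tflops 1.43x, bw 1.60x, gpix 1.84x, gtex 1.45x - against 1.38-1.45x
    # measured gaming performance, i.e. roughly in line with FLOPs and texel rate.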


    BTW, would you like to make a similar comparison between, say, the RTX 3090 and the RTX 2060? We'd all love to see how "Ampere is broken" holds up after that one.



    Notified about what? That their 3rd-party partners only launched factory-overclocked variants of the RX580, which clock above 1.4GHz? Or that Vega 10 reference cards thermal-throttle over time?
    I think they know about it already, but you can send them an e-mail to warn them if you want.


    So Vega 10 was bad because it needed a "full refactoring of a renderer to make use of its FP32 capabilities", but Ampere is good because "it needs a new game engine".
    Got it.
     
  18. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    13,566
    Likes Received:
    16,623
    Location:
    The North
    The whole execute-one-instruction-every-4-cycles design is the challenge GCN faces, and they resolved it for RDNA. It seems to be exceptionally penalized by poor optimization, and we see this comparing RDNA cards vs. Vega and the Radeon VII. CDNA continues the trend of 1 instruction per 4 cycles, IIRC - not ideal for gaming workloads, but very good for pure compute stuff.

    For very big compute loads, I think it still works and will be beneficial.
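    A simplified way to see the issue-rate difference (a toy model; it ignores that a GCN CU has four SIMDs whose other wavefronts can fill those slots):

    Code:
    # GCN executes a 64-wide wavefront on a 16-lane SIMD over 4 clocks;
    # RDNA executes a 32-wide wavefront on a 32-lane SIMD in 1 clock.
    def cycles_per_instruction(wave_size, simd_width):
        return wave_size // simd_width

    gcn  = cycles_per_instruction(wave_size=64, simd_width=16)   # 4
    rdna = cycles_per_instruction(wave_size=32, simd_width=32)   # 1
    print(f"GCN: {gcn} cycles/instr per wave, RDNA: {rdna} cycle/instr")
    # A dependent instruction chain on GCN needs 4x the cycles unless other
    # wavefronts hide the gap - fine for big throughput compute, harder for
    # short, latency-sensitive gaming shaders.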
     
    pharma, BRiT and PSman1700 like this.
  19. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    2,414
    Likes Received:
    1,963
    Location:
    msk.ru/spb.ru
    Show me where I've said that Ampere "needs a new game engine". And please also explain why.
     
    PSman1700 likes this.
  20. Picao84

    Veteran Regular

    Joined:
    Feb 15, 2010
    Messages:
    2,003
    Likes Received:
    1,053
    I think you mean Vega 64 and not Vega 10? Regardless, you are greatly simplifying what he's saying for effect, or you don't understand his point. What he is saying is:

    a) Vega 64 had problems with the game engines of its time, yes (it was only 25% faster on average than its predecessor, the Fury X, despite also having double the RAM!), because it was a design ahead of its time - and it shows, since it aged relatively well with newer engines, finally fulfilling its birthright of at least 50% higher performance than the Fury X (https://overclock-then-game.com/ind...iew-kana-s-finewine-edition?showall=&start=13).

    https://www.techpowerup.com/review/amd-radeon-rx-vega-64/31.html


    b) The RTX 3080 does not have a problem with current engines at all, as it performs well (often 50-60% faster than the RTX 2080 it replaces); on top of that, its design is forward-looking as well and should age better than previous Nvidia GPUs when and if game engines make more use of FP32.

    https://www.techpowerup.com/review/nvidia-geforce-rtx-3080-founders-edition/34.html

    There is a nuance between the two: the starting points are widely different.

    Edit - Before you come and say it's not an apples-to-apples comparison because the RTX 3080 uses GA102 and not GA104: even if we apply the performance figures released by Nvidia for the RTX 3070 (which doesn't even use the full GA104 die), it would still be 40-50% faster than the RTX 2070 Super on current engines.
     
    #660 Picao84, Oct 6, 2020
    Last edited: Oct 6, 2020
    Konan65, pharma and PSman1700 like this.