Speculation: GPU Performance Comparisons of 2020 *Spawn*

Crysis Remastered does use RT h/w, but in a weird way - via a Vulkan interop from the D3D11 renderer.
20 series cards fare a bit better here thanks to this than in the purely s/w RT Neon Noir demo.

Huh - today I learned. Thanks for this! Details are pretty sparse on exactly how they implemented it.
The few references I could find seemed to mention that they were only using the Vulkan VKRay extensions to accelerate RT reflections, and not for any of the GI stuff.

Crytek also said that the remaster only uses RT reflections sporadically - when they would produce a significantly different result. Most reflections are still done the old way, in screen space.

https://www.pcgamer.com/crysis-rema...nd-hardware-ray-tracing-and-is-built-on-dx11/
 
Quake 2 RTX is an OpenGL rendered game that uses Vulkan extensions to add hardware based ray tracing.

Crysis Remastered ray tracing adds to CPU bottlenecks:


I've timestamped the section of the video showing CPU bottlenecking with RT.
 
Building and updating BVHs for RT adds to CPU work of course.
It certainly does. But there are plenty of spare cores available on the tested systems in the video, so CPU threading for RT looks pretty broken, if that's the cause.

RT also adds data that traverses PCI Express, if we assume the BVH is updated on the CPU. But in the video the BVH looks like it would be mostly static in the sections where the "CPU bottleneck from RT" is being shown.
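A rough back-of-envelope on the PCI Express angle (the data size and bus throughput below are made-up assumptions of mine, just to get a feel for the magnitude, not measurements):

Code:
# Cost of shipping a CPU-updated BVH/vertex refit across PCIe every frame.
# Both input figures are assumptions purely for illustration.
refit_data_mb   = 32.0            # assumed data re-uploaded per frame
pcie_gbps       = 13.0            # assumed effective PCIe 3.0 x16 throughput
frame_budget_ms = 1000.0 / 60.0   # 60 fps target

transfer_ms = refit_data_mb / 1024.0 / pcie_gbps * 1000.0
print(f"{transfer_ms:.2f} ms per frame "
      f"({100.0 * transfer_ms / frame_budget_ms:.1f}% of a 60 fps budget)")

So even a moderate per-frame upload could eat a noticeable slice of the frame budget, but it wouldn't look like a classic single-thread CPU bottleneck.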

RT itself seems to add to the "draw call count" problem on PC. Here is the file src/refresh/vkpt/shader/path_tracer.h from NVIDIA's Quake 2 RTX implementation:

https://github.com/NVIDIA/Q2RTX/blob/master/src/refresh/vkpt/shader/path_tracer.h

which documents RT usage:

Code:
The path tracer is separated into 4 stages for performance reasons:
  1. `primary_rays.rgen` - responsible for shooting primary rays from the
     camera. It can work with different projections, see `projection.glsl` for
     more information. The primary rays stage will find the primary visible
     surface or test the visibility of a gradient sample. For the visible
     surface, it will compute motion vectors and texture gradients.
     This stage will also collect transparent objects such as sprites and
     particles between the camera and the primary surface, and store them in
     the transparency channel. Surface information is stored as a visibility
     buffer, which is enough to reconstruct the position and material parameters
     of the surface in a later shader stage.
     The primary rays stage can potentially be replaced with a rasterization
     pass, but that pass would have to process checkerboarding and per-pixel
     offsets for temporal AA using programmable sample positions. Also, a
     rasterization pass will not be able to handle custom projections like
     the cylindrical projection.
  2. `reflect_refract.rgen` - shoots a single reflection or refraction ray
     per pixel if the G-buffer surface is a special material like water, glass,
     mirror, or a security camera. The surface found with this ray replaces the
     surface that was in the G-buffer originally. This shader is executed a
     number of times to support recursive reflections, and that number is
     specified with the `pt_reflect_refract` cvar.
     To support surfaces that need more than just a reflection, the frame is
     separated into two checkerboard fields that can follow different paths:
     for example, reflection in one field and refraction in another. Most of
     that logic is implemented in the shader for stage (2). For additional
     information about the checkerboard rendering approach, see the comments
     in `checkerboard_interleave.comp`.
     Between stages (1) and (2), and also between the first and second
     iterations of stage (2), the volumetric lighting tracing shader is
     executed, `god_rays.comp`. That shader accumulates the inscatter through
     the media (air, glass or water) along the primary or reflection ray
     and accumulates that inscatter.
  3. `direct_lighting.rgen` - computes direct lighting from local polygonal and
     sphere lights and sun light for the surface stored in the G-buffer.
  4. `indirect_lighting.rgen` - takes the opaque surface from the G-buffer,
     as produced by stages (1-2) or previous iteration of stage (4). From that
     surface, it traces a single bounce ray - either diffuse or specular,
     depending on whether it's the first or second bounce (second bounce
     doesn't do specular) and depending on material roughness and incident
     ray direction. For the surface hit by the bounce ray, this stage will
     compute direct lighting in the same way as stage (3) does, including
     local lights and sun light.
     Stage (4) can be invoked multiple times, currently that number is limited
     to 2. First invocation computes the first lighting bounce and replaces
     the surface parameters in the G-buffer with the surface hit by the bounce
     ray. Second invocation computes the second lighting bounce.
     Second bounce does not include local lights because they are very
     expensive, yet their contribution from the second bounce is barely
     noticeable, if at all.

You can see that there are multiple invocations of some stages, each of which I'm guessing corresponds to a "draw call" in conventional rasterisation-based rendering.
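To put a rough number on that guess, here's a tally of the separate ray-generation / compute dispatches the CPU would record per frame under the scheme documented above, assuming pt_reflect_refract is 2 and two invocations of the indirect stage (my assumptions; each dispatch is loosely analogous to one draw call):

Code:
# Rough per-frame dispatch count for Q2RTX's staged path tracer,
# derived from the documentation quoted above.
reflect_refract_iters = 2   # assumed value of the pt_reflect_refract cvar
indirect_invocations  = 2   # "currently that number is limited to 2"

dispatches = ["primary_rays.rgen", "god_rays.comp"]       # stage 1 + volumetrics
for i in range(reflect_refract_iters):                    # stage 2
    dispatches.append(f"reflect_refract.rgen #{i + 1}")
    if i == 0:
        dispatches.append("god_rays.comp")                # between iterations
dispatches.append("direct_lighting.rgen")                 # stage 3
for i in range(indirect_invocations):                     # stage 4
    dispatches.append(f"indirect_lighting.rgen #{i + 1}")

print(len(dispatches), "dispatches per frame:", dispatches)   # 8 in this setup

A handful of dispatches per frame is small next to the thousands of draw calls a rasteriser might issue, so if RT is adding CPU cost it's presumably in the work the driver and engine do around each dispatch (BVH builds/refits, shader binding tables, descriptor updates) rather than in the dispatch count itself.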

So far I've not been able to find a description of the hardware/driver execution model for ray generation versus intersection testing. It seems that intersection results are written to a VRAM buffer, and so would be consumed by the next stage of ray generation (e.g. for the later iterations of stages 2 and 4).

Ideally it would be possible to pipeline all of these stages, so that intersection test results would not have to leave the GPU die. But my idea of "ideally" may be a mismatch for the quantities of data involved and for the GPU's capabilities. It might also be a mismatch with regard to "cache coherency" during BVH traversal.

So there's a lot that's unknown about the hardware, the nature of "draw calls" with respect to RT invocations, and where CPU limitations would arise.

Obviously Q2RTX is a "path tracer", so it makes much more intensive use of RT than what's seen in Crysis Remastered. But at least it provides some prompts on how to think about the hardware, the driver and CPU threading.
 
Since this seems to be such a hard thing to grasp for some:

2080 has 10 TFLOPS + 10 TeraInt32OPS - that's 20 teraOPS in total.
3080 has 30 TFLOPS.

If we have a game which is 50/50 math between FP32 and INT32 then what is the maximum theoretical performance gain in it between 2080 and 3080? And does it mean that Ampere "reaches 30 TFLOPs when rendering" such game?
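A minimal worked version of the arithmetic, under the idealised assumption (mine) that both cards can keep all their datapaths busy, which real games won't:

Code:
# Hypothetical game with a 50/50 FP32/INT32 instruction mix, best case.
# 2080: separate FP32 and INT32 datapaths; 3080: 30 TFLOPS worth of FP32 lanes,
# of which half can alternatively run INT32.
ops_2080 = 10.0 + 10.0   # 10 TFLOPS FP32 + 10 TIOPS INT32 = 20 TOPS
ops_3080 = 30.0          # half the lanes on FP32, half on INT32 = 30 TOPS

print(f"Max theoretical gain: {ops_3080 / ops_2080:.2f}x")             # 1.50x
print(f"FP32 actually used on the 3080: {ops_3080 * 0.5:.0f} TFLOPS")  # 15, not 30

So the ceiling is 1.5x, and no - it would not mean Ampere "reaches 30 TFLOPs when rendering" such a game, since half of those operations would be INT32.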

There's no game with 50/50 math between FP32 and INT32. This is information coming directly from nvidia who saw an average of ~25% of total operations in games being INT32. The best they found was Battlefield 1 using 33% of operations being INT32.
Just like the 2080 never reaches 10 TFLOPs + 10 TIOPs, GA102 never reaches 30 TFLOPs in games because the shader processors will halt while waiting for other bottlenecks in the chip (e.g. memory bandwidth, triangle setup, fillrate, etc.).



Yeah, that was far from my point, keep trying.
Wow. So classy.


My point was actually that the claim being made - that, just like Ampere, Vega also had "too much FP32" and so could just be "fixed" by increasing FP32 usage - is sadly not true at all. All metrics in Vega increased by the same ~2x amount. Trying to hilariously argue about 5% differences in clocks doesn't change that fact.
Except your metrics aren't wrong by 5%. They're wrong by a lot more than that. Where you claimed 12.6 TFLOPs on Vega 64 vs. 6.1 TFLOPs on RX580, it's actually 11.5 TFLOPs on Vega 64 vs. 6.5 TFLOPs RX580. We just went from 106% higher throughput to 77% higher throughput. It's a 30% difference from your claims, not 5%.
In the case of Vega 56, it's 9.3 TFLOPs vs. 6.5 TFLOPs, which means it has 43% higher FP32 throughput than the RX580.
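For reference, here's where those TFLOPs figures come from (FP32 rate = ALUs × 2 ops per clock for FMA × clock; the clocks are the typical sustained game clocks being assumed in this discussion, not boost-spec numbers):

Code:
# FP32 throughput = shader ALUs * 2 (FMA counts as 2 ops) * sustained clock (GHz).
def tflops(alus, clock_ghz):
    return alus * 2 * clock_ghz / 1000.0

vega64 = tflops(4096, 1.40)    # ~11.5 TFLOPS
vega56 = tflops(3584, 1.30)    # ~9.3 TFLOPS
rx580  = tflops(2304, 1.41)    # ~6.5 TFLOPS

print(f"Vega 64 {vega64:.1f}, Vega 56 {vega56:.1f}, RX 580 {rx580:.1f} TFLOPS")
print(f"Vega 64 / RX 580 = {vega64 / rx580:.2f}x, Vega 56 / RX 580 = {vega56 / rx580:.2f}x")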


Ampere has a disproportionate amount of FP32, and FP32 only, and an increase in FP32 usage will undoubtedly increase game performance.
And if you only increased FP32 math on games and nothing else (somehow without increasing bandwidth requirements, which might be impossible in any real life scenario unless other loads are reduced), then the Vega 64 would also increase game performance. At ISO clocks, the Vega 56 and Vega 64 have the exact same gaming performance, meaning the Vega 64 always has, at least, 8 CUs / 512sp / 1.4 TFLOPs / 1xXBOne-worth-of-compute-power just idling. It didn't mean the architecture was broken then, it just meant RTG at the time had very limited resources and by only being able to launch one 14nm GCN5 chip for desktop, they chose to make one that served several markets at the same time, while sacrificing gaming performance.
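Quick sanity check on the "one Xbox One worth of compute" bit, at the same assumed ~1.4 GHz game clock (the Xbox One GPU is rated at roughly 1.31 TFLOPS):

Code:
# The 8 CUs Vega 64 carries over Vega 56, at an assumed ~1.4 GHz game clock.
idle_cus, sp_per_cu = 64 - 56, 64
idle_tflops = idle_cus * sp_per_cu * 2 * 1.4 / 1000.0   # ~1.43 TFLOPS
print(f"{idle_tflops:.2f} TFLOPS idle, vs. Xbox One's ~1.31 TFLOPS")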

You either believe that games will push higher proportions of FP32 or you don't. You don't get to decide that only your favorite IHV gets to blame game engines while the other gets accused of being broken and hopeless.
 
There's no game with 50/50 math between FP32 and INT32. This is information coming directly from nvidia who saw an average of ~25% of total operations in games being INT32. The best they found was Battlefield 1 using 33% of operations being INT32.
Feel free to adjust the proportion to whatever you want, really. 25% INT on average means that you have to add about 25% to Turing's FLOPS number to see it in Ampere's FLOPS metric.

Just like the 2080 never reaches 10 TFLOPs + 10 TIOPs, GA102 never reaches 30 TFLOPs in games because the shader processors will halt while waiting for other bottlenecks in the chip (e.g. memory bandwidth, triangle setup, fillrate, etc.).
Well, that's what I call progress. The next step is substituting "any GPU" for the 2080 and GA102 there, really.
The difference between Turing and Ampere here is also pretty obvious: while it's somewhat unlikely that INT usage in games will rise above this ~25% average in the future, it is far more likely that FP32 usage in games will in fact increase.
Thus while Turing will never be able to fully utilize its INT h/w in gaming, Ampere may well eventually reach its peak FLOPS in games.
 
Quake 2 RTX is an OpenGL rendered game that uses Vulkan extensions to add hardware based ray tracing.

Crysis Remastered ray tracing adds to CPU bottlenecks:


I've timestamped the section of the video showing CPU bottlenecking with RT.

One other thing I'd be very curious about - Crysis Remastered's settings are not particularly granular. From what I've seen, unlike for example Control, there aren't fine-grained options for things like reflections.
Obviously SVOGI on the shaders is there for every card with RT enabled. However, the question I have is: are the RT reflections there on non-Turing and non-Ampere cards, or do they just use the screen-space fallback code path 100% of the time?

And if the RT reflections *are* done on the shaders for the other cards - are they the same quality, resolution, draw distance, etc.? That makes apples-to-apples comparisons a bit difficult. Is there a way to completely disable all the RTX extensions in software on Turing or Ampere and make it behave like all the other cards?

Seems almost like Crytek added the tiny bit of RTX hardware support as an afterthought / marketing push, so that people who have RTX-enabled cards don't dismiss Crysis Remastered as 'all software, for AMD cards' and not buy it. It lets them check the little box on the spec sheet saying 'Look, we use your RTX hardware too! Buy our game!'
 
One other thing I'd be very curious about - Crysis Remastered's settings are not particularly granular. From what I've seen, unlike for example Control, there aren't fine-grained options for things like reflections.
Obviously SVOGI on the shaders is there for every card with RT enabled. However, the question I have is: are the RT reflections there on non-Turing and non-Ampere cards, or do they just use the screen-space fallback code path 100% of the time?
RT reflections are there even on current-gen consoles, though the draw distance is much lower on them. For PC, I haven't seen anyone mention the range being dependent on hardware acceleration.
 
Except your metrics aren't wrong by 5%. They're wrong by a lot more than that. Where you claimed 12.6 TFLOPs on Vega 64 vs. 6.1 TFLOPs on RX580, it's actually 11.5 TFLOPs on Vega 64 vs. 6.5 TFLOPs RX580. We just went from 106% higher throughput to 77% higher throughput. It's a 30% difference from your claims, not 5%.

6.2 vs 6.5 is what again? Ammm 5%? Wow!

We just went from 106% higher throughput to 77% higher throughput. It's a 30% difference from your claims, not 5%.

And even with your nitpicked numbers of best possible RX580 clocks and worst-case Vega clocks, you still have to explain a 21% difference between all performance metrics on Vega (and not just FP32) and actual gaming performance. Meanwhile the 3080's scaling vs the 2080s is 30% faster than its increase in both pixel and texture fillrate. Totally same situation. Totally! Not.

In the case of Vega 56, it's 9.3 TFLOPs vs. 6.5 TFLOPs, which means it has 43% higher FP32 throughput than the RX580.

Confirming that Vega had scaling issues that have nothing to do with FP32, because, for the nth time, Vega 56 not only had less FP32, it had less memory bandwidth, less texture fillrate, etc., by the exact same amount. It is not only FP32 that was unused.

And if you only increased FP32 math on games and nothing else (somehow without increasing bandwidth requirements, which might be impossible in any real life scenario unless other loads are reduced), then the Vega 64 would also increase game performance.

It wouldn't gain anything, because nothing else was holding its FP32 back: it has the exact same amount of extra texture fillrate, pixel fillrate and memory bandwidth as it does extra FP32, so using more of one type is not going to change the existing balance. Vega is not held back by anything in particular, except probably triangle setup, as was mentioned (which again qualifies as a scaling issue). Ampere has only an excess of FP32, and lacks fillrate and bandwidth. Ampere didn't get its FP32 increase by adding more of everything (i.e. CUs/SMs); they did it by adding more FP32 SIMDs.

At ISO clocks, the Vega 56 and Vega 64 have the exact same gaming performance, meaning the Vega 64 always has, at least, 8 CUs / 512sp / 1.4 TFLOPs / 1xXBOne-worth-of-compute-power just idling.

Which is the definition of having scaling issues. A CU does a lot more than just FP32. It does INT. It does textures. It does special functions. It does load/store. It does atomics. Etc.

It didn't mean the architecture was broken then, it just meant RTG at the time had very limited resources and by only being able to launch one 14nm GCN5 chip for desktop, they chose to make one that served several markets at the same time, while sacrificing gaming performance.

They didn't sacrifice anything. It was a pixel and texel fillrate monster compared to RX580 with an over 2X increase!!!!! They didn't only increase FP32 SIMDs, they more than doubled everything. It just couldn't scale.

You either believe that games will push higher proportions of FP32 or you don't. You don't get to decide that only your favorite IHV gets to blame game engines while the other gets accused of being broken and hopeless.

No one has done that. No one has blamed game engines. Pointing out that an architecture that is very heavy in a single metric would see gains when that single metric is used more extensively than the others is not blaming anything, it's pointing out the obvious. Pointing out that an architecture which was much stronger than its predecessor in every metric, yet didn't scale performance accordingly, has scaling issues is not the same as saying it is "broken and hopeless". No one has said that either, so stop putting words in other people's mouths, in the most hilarious and pathetic attempt at a strawman that I've seen in months. And maybe then we can start talking about class.
 
Except your metrics aren't wrong by 5%. They're wrong by a lot more than that. Where you claimed 12.6 TFLOPs on Vega 64 vs. 6.1 TFLOPs on RX580, it's actually 11.5 TFLOPs on Vega 64 vs. 6.5 TFLOPs RX580. We just went from 106% higher throughput to 77% higher throughput. It's a 30% difference from your claims, not 5%.

Has AMD been notified yet?

[attached image: upload_2020-10-6_12-3-26.png]
 

And if you only increased FP32 math on games and nothing else (somehow without increasing bandwidth requirements, which might be impossible in any real life scenario unless other loads are reduced), then the Vega 64 would also increase game performance.
Not really.
FP32 is math precision, and you have to be very specific in what you're actually increasing here for it to be of actual benefit on Vega or GCN specifically.
If we're talking about compute, then yes, it's pretty likely that Vega will scale at least better than it does in games which don't push as much FP32 compute (not that precision matters for Vega, which runs all math the same way but with different throughputs). I have to say that this isn't an obscure case really, with more and more engines using compute for tasks which were previously all graphics (FF to a degree). So in this sense, sure, Vega is likely to do better in the future - not because of games using more FP32 specifically, but because of games moving more and more tasks from the older graphics pipelines to generic compute approaches.
If we're talking about graphics shading, though, things get a lot more complex, and by simply increasing the shading workload - even with the math in it being exclusively FP32 - you're very likely to run into one of Vega's various bottlenecks which won't allow it to scale as well as you think it would.
And this is really a key difference between GCN and Ampere here: Ampere doesn't require a straight-up full refactoring of your renderer to make full use of its FP32 capabilities; GCN did.
 
6.2 vs 6.5 is what again? Ammm 5%? Wow!
I'm sure you didn't forget the part where you inflated Vega's clocks; you're only pretending you did.


And even with your nitpicked numbers of best possible RX580 clocks and worst-case Vega clocks
Nah, they're just typical load frequencies when playing games, e.g. in the same benchmarks seen in your techpowerup reference.


you still have to explain a 21% difference between all performance metrics on Vega (and not just FP32) and actual gaming performance.
What exactly is "all performance metrics!!!111oneone" ?

Vega 56 (typical 1.3GHz clock) vs. RX580 (typical >1.4GHz clock):
1.43x FLOPs (9.3 vs. 6.5 TFLOPs)
1.6x memory BW (410 vs. 256 GB/s)
1.84x pixel fillrate (83 vs. 45 GPixel/s)
1.45x texel fillrate (291 vs. 201 GTexel/s)
1.38x higher performance at 1440p, 1.45x higher performance at 4K.

So where is your 21% between all performance metrics here?
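Those ratios fall straight out of the unit counts at the same assumed clocks (Vega 56: 64 ROPs, 224 TMUs, 410 GB/s HBM2; RX 580: 32 ROPs, 144 TMUs, 256 GB/s GDDR5 - any small differences vs. the rounded figures above are just clock rounding):

Code:
# Vega 56 vs. RX 580 metric ratios at the assumed typical game clocks.
v56_clk, rx580_clk = 1.30, 1.41   # GHz

pixel = (64 * v56_clk) / (32 * rx580_clk)        # ROPs * clock
texel = (224 * v56_clk) / (144 * rx580_clk)      # TMUs * clock
bw    = 410 / 256                                # GB/s
fp32  = (3584 * v56_clk) / (2304 * rx580_clk)    # ALUs * clock

print(f"pixel {pixel:.2f}x, texel {texel:.2f}x, bandwidth {bw:.2f}x, FP32 {fp32:.2f}x")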


BTW, would you like to make a similar comparison between e.g. the RTX 3090 and the RTX 2060? We would all like to see how "Ampere is broken" after looking at that comparison.



Has AMD been notified yet?
Notified about what? That their 3rd-party partners only launched factory-overclocked variations of the RX580 that clock above 1.4GHz? Or that Vega 10 reference cards thermal throttle over time?
I think they already know about that, but you can send them an e-mail to warn them if you want.


And this is really a key difference between GCN and Ampere here: Ampere doesn't require a straight-up full refactoring of your renderer to make full use of its FP32 capabilities; GCN did.
So Vega 10 was bad because it needed "full refactoring of a renderer to make use of its FP32 capabilities". Ampere is good because "it needs a new game engine".
Got it.
 
So Vega 10 was bad because it needed "full refactoring of a renderer to make use of its FP32 capabilities". Ampere is good because "it needs a new game engine".
Got it.
The whole "execute 1 instruction every 4 cycles" cadence is the challenge it faces, and they resolved it for RDNA. It seems to be exceptionally penalized by poor optimization, and we see this when comparing RDNA cards vs Vega and the Radeon VII. CDNA still continues the 1-instruction-per-4-cycles trend IIRC; not ideal for gaming workloads, but it's very good for pure compute stuff.

For very big compute loads, I think it still works and will be beneficial.
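A minimal sketch of what that cadence means for issue latency (GCN runs a 64-wide wavefront over a 16-lane SIMD, so each instruction occupies the SIMD for 4 clocks; RDNA runs a 32-wide wave over a 32-lane SIMD in 1 clock - peak throughput per CU/WGP is similar, but each wave finishes its instructions much sooner, so RDNA needs far less parallelism in flight to stay busy):

Code:
# Clocks needed to issue a chain of dependent instructions for a single wave,
# ignoring memory latency. Illustrates GCN's 4-clock cadence vs. RDNA's 1-clock.
def issue_clocks(n_instructions, wave_size, simd_width):
    return n_instructions * (wave_size // simd_width)

n = 100
print("GCN  (wave64 on 16-wide SIMD):", issue_clocks(n, 64, 16), "clocks")   # 400
print("RDNA (wave32 on 32-wide SIMD):", issue_clocks(n, 32, 32), "clocks")   # 100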
 
So Vega 10 was bad because it needed "full refactoring of a renderer to make use of its FP32 capabilities". Ampere is good because "it needs a new game engine".
Got it.

I think you mean Vega 64 and not Vega 10? Regardless, you are either greatly simplifying what he's saying for effect, or you don't understand his point. What he is saying is:

a) Vega 64 had problems with the game engines of the time, yes (it was only about 25% faster on average than its predecessor, the Fury X, despite also having double the RAM!), because it was a design ahead of its time. And it shows, since it aged relatively well with newer engines, finally fulfilling its birthright of at least 50% higher performance than the Fury X (https://overclock-then-game.com/ind...iew-kana-s-finewine-edition?showall=&start=13).

https://www.techpowerup.com/review/amd-radeon-rx-vega-64/31.html


b) The RTX3080 does not have a problem with current engines at all, as it performs well (often 50-60% faster than the RTX2080 it replaces), and on top of that its design is forward-looking as well and should age better than previous Nvidia GPUs if and when game engines make more use of FP32.

https://www.techpowerup.com/review/nvidia-geforce-rtx-3080-founders-edition/34.html

There is a nuance between the two: the starting points are widely different.

Edit - Before you come and say it's not an apples-to-apples comparison because the RTX3080 uses GA102 and not GA104: even if we apply the performance figures released by Nvidia for the RTX3070 (which doesn't even use the full GA104 die), it would still be 40-50% faster than the RTX2070 Super on current engines.
 