GPU Ray Tracing Performance Comparisons [2021-2022]

In path traced games like Quake 2 RTX, the A770 is massively behind NVIDIA, to the point that it behaves more like an AMD GPU: the 3060 is 70% faster than the A770, and 80% faster than the RX 6650 XT.

Elsewhere, the A770 is 40% faster than the 3060 in Metro Exodus, Dying Light 2 and Hitman 3 (as expected), but less than 10% faster in Cyberpunk 2077, Doom Eternal, Deathloop and Ghostwire Tokyo.

Worth remembering that the A770 uses a chip which is in fact more complex than GA104, so unless it beats the 3070 I don't see how we can say that Intel has (relatively) good RT performance.
 
The A770 is a good alternative to RDNA2, not so much to a 3060/Ti. Though it sports a generous 16GB framebuffer, which is nice. It's their first GPU, too.
 
Worth remembering that the A770 uses a chip which is in fact more complex than GA104
Agreed. The A770 enjoys a healthy lead in memory bandwidth over the RTX 3060 (560GB/s vs 360GB/s), operates at higher clocks (2100MHz vs 1800MHz), has a higher FP32 core count (4096 vs 3584) and more ray tracing cores (32 vs 28), so it's only natural that the A770 beats the 3060 in ray tracing. Though it's obvious it's winning on raw brute-force specs alone (35% higher FLOPs and 55% higher bandwidth), and not by some better RT engine that Intel has.
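For anyone who wants to sanity-check that, here's a rough back-of-envelope sketch using the figures quoted above (peak FP32 = cores × 2 ops per FMA × clock); the clocks are nominal boost numbers, so treat the ratios as ballpark:

```python
# Rough theoretical throughput comparison, using the specs quoted above.
# Clocks are approximate boost figures, so the ratios are ballpark numbers.

def fp32_tflops(cores: int, clock_ghz: float) -> float:
    """Peak FP32 = cores * 2 ops per FMA * clock (GHz), returned in TFLOPS."""
    return cores * 2 * clock_ghz / 1000

a770_tflops = fp32_tflops(4096, 2.1)     # ~17.2 TFLOPS
rtx3060_tflops = fp32_tflops(3584, 1.8)  # ~12.9 TFLOPS

a770_bw, rtx3060_bw = 560, 360  # GB/s

print(f"FLOPs advantage:     {a770_tflops / rtx3060_tflops - 1:+.0%}")  # ~+33%
print(f"Bandwidth advantage: {a770_bw / rtx3060_bw - 1:+.0%}")          # ~+56%
```

That lands close to the ~35% / ~55% figures above; the small FLOPs gap just comes down to which boost clock you assume for the 3060.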
 
and not by some better RT engine that Intel has.

I don't know, some of the results for the games that have been 'optimised' (I use that term loosely) seem to perform better than the difference in RT units suggests.

Factor in that certain games will have been built with RTX in mind (Quake 2 RTX as an example) and I think Intel's RT implementation may actually be better than Nvidia's 3000 series efforts.

I would love to know why Metro Exodus and Hitman are so much faster on the A770 compared to the 3060.
 
I would love to know why Metro Exodus and Hitman are so much faster on the A770 compared to the 3060.
The difference in FLOPs alone explains that.
seem to perform better than the difference in RT units suggests
You also have the FP32 and memory bandwidth differences.

will have been built with RTX in mind (Quake 2 RTX as an example)
Quake 2 RTX is optimized for the cross-vendor Vulkan ray tracing extensions now, for all GPUs.

I think Intel's RT implementation may actually be better than Nvidia's 3000 series efforts
I don't think so, to be honest. The 3060 Ti is far ahead of the A770, despite being a closer match to it on specs than the 3060 is.
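For what it's worth, running the same rough math against the 3060 Ti's reference specs (4864 FP32 cores, ~1.67GHz boost, 448GB/s; figures from memory, so double-check them) shows why it's the closer match:

```python
# Back-of-envelope A770 vs RTX 3060 Ti, same formula as above (cores * 2 * clock).
# Reference boost clocks; real sustained clocks differ, so these are rough ratios.
a770_tflops = 4096 * 2 * 2.1 / 1000         # ~17.2 TFLOPS
rtx3060ti_tflops = 4864 * 2 * 1.665 / 1000  # ~16.2 TFLOPS

print(f"FLOPs gap:     {a770_tflops / rtx3060ti_tflops - 1:+.0%}")  # ~+6% for the A770
print(f"Bandwidth gap: {560 / 448 - 1:+.0%}")                       # ~+25% for the A770
```

On paper the A770 still has a small edge, yet the 3060 Ti pulls well ahead in RT, which is the point.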
 
In path traced games like Quake 2 RTX, the A770 is massively behind NVIDIA, to the point that it behaves more like an AMD GPU: the 3060 is 70% faster than the A770, and 80% faster than the RX 6650 XT.

Elsewhere, the A770 is 40% faster than the 3060 in Metro Exodus, Dying Light 2 and Hitman 3 (as expected), but less than 10% faster in Cyberpunk 2077, Doom Eternal, Deathloop and Ghostwire Tokyo.

Great find for the Q2 RTX benchmarks. The one thing I found quite curious is how the A770 LE 16GB and the A750 8GB are within <1% of each other, essentially within the range of measurement error.

The A770 LE has a ~10% memory bandwidth uplift, along with ~14% more functional units across the board, including RT ones, and a small clockspeed bump vs the A750 8GB.
And yet performance is identical. If they were overflowing on-chip caches you'd figure the memory bandwidth bump would help. I almost wonder if there's a CPU limitation or some other odd driver bottleneck at play there. I haven't been able to find any published benchmarks for BabyArc (A380), but I'd be very curious to see how it performs in Q2 RTX.

EDIT: Aha, I just needed to search in German:

It's only 720p, but the A380 is doing way better than you'd expect given that it's essentially 1/4 of an A770 LE.
 

Attachment: q2rtx-a380.PNG
One more curious Intel Arc performance result of note, comparing the A770 LE 16GB vs the A750 8GB:


Check out the Riftbreaker benchmarks with RT on. The A750 is roughly half the speed of the A770LE, even at 1080p, which makes no sense at all.
It doesn't appear that the 8GB of VRAM is limiting anything, as it's being beaten by the 6GB 2060, along with all the other 8GB cards in the comparison.

Even weirder, and more evidence that VRAM capacity has nothing to do with it: relative to the rest of the product stack it somehow does better at 4K.
It performs strangely poorly at all resolutions though, so it's also clearly not an artifact of one benchmark run gone wrong.
 

Attachment: riftbreaker-a750vs770le.PNG
Sorry if this isn't the right place to ask, but do we have a game with RT that supports both Vulkan and DX12? I'm curious to know whether Vulkan can be faster than DX for RT, or the opposite, or the same...
 
Sebbi made some very important discussion points regarding the thread sorting hardware; I will list them here for a quick read.

Let's discuss shader permutation hell, with the latest hardware: Intel's Thread Sorting Unit (TSU) and Nvidia's Shader Execution Reordering (SER). Now that the RTX 4090 is massively CPU bound, could we spend 1% of that perf to get rid of shader permutations?

These new hardware blocks shuffle the registers of multiple SIMDs in a way that each SIMD can run coherent threads. This is super important for ray-tracing and explains why Intel's mid-range GPU is so good at ray-tracing, but also explains why the RTX 4090 is such a beast in RT apps.

But these hardware blocks are not just a great fit for ray-tracing. They could be used to make GPU dynamic branching faster in all shaders. As a result, we could write CPU-style shader code with branches, instead of compiling (hundreds of) thousands of permutations.

Even with hardware like this, it's not free to shuffle SIMD data around. There would be a slight performance hit. CPUs have to pay similar costs for branches too. But CPUs are now fast enough to make this a minor annoyance. I think these GPUs are starting to be there too.

Also, the RTX 4090 is so fast that we desperately need better API support for GPU-driven rendering. We need a fine-grained way of spawning new GPU work from shaders. Mesh shaders are great, but they are still lacking the ability to select the shader like ray-tracing does.

These thread sorting units finally make me want to do some HW ray-tracing. The DXR API is still not a perfect fit for GPU SIMD execution, but at least it's not dead stupid anymore. But please, give me access to this magical HW block also for traditional shaders!
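To picture what that sorting buys, here's a toy sketch (purely illustrative, with made-up numbers, and not how TSU/SER actually work internally): rays are grouped into 32-wide waves, and a wave that contains several different hit shaders has to run each of them in turn.

```python
# Toy model of why reordering rays by hit shader helps SIMD execution.
# Simplified assumption: a divergent wave serialises over every distinct
# shader present in it, so fewer distinct shaders per wave = fewer passes.
import random

WAVE_SIZE = 32
NUM_RAYS = 32 * 1024
NUM_SHADERS = 16  # distinct hit shaders ("permutations") a ray can land on

random.seed(0)
hit_shader = [random.randrange(NUM_SHADERS) for _ in range(NUM_RAYS)]

def shader_passes(shader_ids):
    """Total shader executions if each wave runs each of its unique shaders once."""
    passes = 0
    for i in range(0, len(shader_ids), WAVE_SIZE):
        wave = shader_ids[i:i + WAVE_SIZE]
        passes += len(set(wave))  # one pass per distinct shader in the wave
    return passes

unsorted_cost = shader_passes(hit_shader)        # incoherent waves
sorted_cost = shader_passes(sorted(hit_shader))  # after reordering by shader ID

print(f"unsorted: {unsorted_cost} shader passes")
print(f"sorted:   {sorted_cost} shader passes")
print(f"~{unsorted_cost / sorted_cost:.1f}x fewer passes after sorting")
```

Sorting the rays by shader before forming waves collapses that to roughly one shader per wave. Real hardware obviously can't sort for free, which is the shuffle cost sebbi mentions, but it shows why coherent waves matter both for ray tracing and for a branchy ubershader that replaces permutations.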


 
Let's discuss shader permutation hell, with the latest hardware: Intel's Thread Sorting Unit (TSU) and Nvidia's Shader Execution Reordering (SER). Now that the RTX 4090 is massively CPU bound, could we spend 1% of that perf to get rid of shader permutations?
I believe this is the case:
SER requires developer integration
TSU is handled automatically

So I wonder if it's even possible to really expose the TSU.
It would definitely be interesting if they got exposed the way sebbi wants.
 
I don't see why not; it should be s/w controllable in any case.
A more interesting question is whether Alchemist would get a performance boost in some situations if it were disabled.
I would need to double check what they said, but I wouldn't assume it's software controllable if it's totally invisible and automatically used.
Especially if they didn't expect to expose it.
Remember this is first gen too, so things like making it accessible could've been very far down the list of design considerations if it's automatic.
I'm not saying it's definitely not viable.
 
I would need to double check what they said, but I wouldn't assume it's software controllable if it's totally invisible and automatically used.
I'm fairly sure that all GPU vendors can control most of the GPU's innards through s/w (BIOS/MC and drivers). Whether this can be exposed through drivers into a public API is another issue, which may not have a positive answer if such exposure would lead to more problems than performance wins.
 
Let's discuss shader permutation hell, with the latest hardware: Intel's Thread Sorting Unit (TSU) and Nvidia's Shader Execution Reordering (SER). Now that the RTX 4090 is massively CPU bound, could we spend 1% of that perf to get rid of shader permutations?


The least CPU-bound result I've seen for the RTX 4090 was in a GTA V video I shared in another thread, where the CPU shows 4% usage but the 4090 sits at 100% usage, when running GTA V at 16K.

Whenever I get the A770 delivered, one of the games I want to play the most, and haven't seen benchmarked in any A770 video or article review, is Resident Evil 2 Remake, one of my favourite games ever, just to check how efficient it can be now that Capcom added RT under DirectX 12 in the latest patch.
 