AMD Radeon RDNA2 Navi (RX 6500, 6600, 6700, 6800, 6900 XT)

No, they mix voxel- and mesh-based tracing; the fallback is not SSR but voxel cone tracing when mesh-based tracing is not needed. For very small objects like bullet casings they use mesh-based ray tracing all the time. I think many engines will do it like this in the future.

To me it reads like they're using RT hardware for tracing triangle meshes when the hardware is available, but fall back to cone-tracing of voxels when that hardware is not present. It seems like they have a distance cut-off for meshes as well where they switch to some other representation.

Edit:
Nope, I'm wrong. It explicitly says they don't use the RT hardware.

At the moment we don’t benefit from any additional performance that modern APIs like Vulkan or DX12 or dedicated hardware like the latest generation of graphics cards could give us. But of course, we will optimize the feature to achieve a performance uplift from these APIs and graphics cards.

This is pure software based, so any performance gains of Ampere over Turing are going to be related to parts of the architecture other than the RT cores.
 

This is pure software based, so any performance gains of Ampere over Turing are going to be related to parts of the architecture other than the RT cores.

Yes, but the part where they use mesh tracing will benefit in the future from hardware-accelerated ray tracing. I suppose we will begin to see this type of engine more often later.


EDIT: They will not replace the current system, but will use hardware-accelerated ray tracing for better performance.

However, RTX will allow the effects to run at a higher resolution. At the moment on GTX 1080, we usually compute reflections and refractions at half-screen resolution. RTX will probably allow full-screen 4k resolution. It will also help us to have more dynamic elements in the scene, whereas currently, we have some limitations. Broadly speaking, RTX will not allow new features in CRYENGINE, but it will enable better performance and more details.
 
So looking at that Crysis Remastered benchmark, based on the clocks they're showing (why so low?), the RTX 3090 should be 36.8 TFLOPS, the RTX 3080 should be 31 TFLOPS and the RX 5700 XT should be 9.2 TFLOPS.

The 3090 is scoring almost exactly 4x the 5700XT which scales perfectly with TFLOPS. The 3080 is scoring 3x the 5700XT which is a little short of the 3.4x TFLOPS differential, but it's fairly close. It actually looks like this title is ALU limited.
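
Just to make the arithmetic explicit, here's a back-of-envelope sketch of how those figures fall out. The shader counts are the published ones (10496 / 8704 / 2560); the ~1.75-1.8 GHz clocks are my assumption, backed out from the TFLOPS numbers above:

```cpp
// Back-of-envelope check of the TFLOPS figures quoted above.
// Peak FP32 TFLOPS = 2 (FMA) * shader count * clock (GHz) / 1000.
// Shader counts are the published ones; the clocks are assumed values
// backed out from the 36.8 / 31 / 9.2 TFLOPS figures in the post.
#include <cstdio>

int main() {
    struct Gpu { const char* name; int shaders; double clockGHz; };
    const Gpu gpus[] = {
        { "RTX 3090",   10496, 1.75 },
        { "RTX 3080",    8704, 1.78 },
        { "RX 5700 XT",  2560, 1.80 },
    };

    const double baseline = 2.0 * gpus[2].shaders * gpus[2].clockGHz / 1000.0; // 5700 XT
    for (const Gpu& g : gpus) {
        const double tflops = 2.0 * g.shaders * g.clockGHz / 1000.0;
        std::printf("%-11s %5.1f TFLOPS  (%.2fx the 5700 XT)\n",
                    g.name, tflops, tflops / baseline);
    }
    return 0;
}
```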
 
So looking at that Crysis Remastered benchmark, based on the clocks they're showing (why so low?), the RTX 3090 should be 36.8 TFLOPS, the RTX 3080 should be 31 TFLOPS and the RX 5700 XT should be 9.2 TFLOPS.
Crysis Remastered uses RT cores on NVIDIA GPUs through NVIDIA's proprietary Vulkan RT extension, which is superimposed on the DX11 path that the game is using.
Hardware-assisted ray tracing is planned for the next CRYENGINE release and is already available in the PC version of Crysis Remastered.

We are using the Vulkan extensions to enable hardware ray tracing on NVIDIA RTX cards. This gives us a significant performance boost in the game. The differences you will see in the game are the reflections of animated objects, besides the main character, and the performance.

Hardware support gives us a 5-9 ms rendering-time performance boost with ray tracing enabled. In areas where ray tracing is not 100% present, like on a wooden floor, you won't see many differences in performance between software and hardware ray tracing, but for 95% of the game, you will feel the performance benefits.
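
For reference, the "Vulkan extensions" mentioned here presumably mean NVIDIA's VK_NV_ray_tracing. A minimal sketch (my own, not Crytek's code) of how an engine could detect whether the hardware path is available before falling back to the software path:

```cpp
// Minimal sketch: check whether a physical device exposes VK_NV_ray_tracing
// (presumably the extension referred to above), so an engine could pick the
// hardware RT path or stay on the software voxel/compute path.
#include <vulkan/vulkan.h>
#include <cstring>
#include <vector>

bool hasHardwareRayTracing(VkPhysicalDevice device) {
    uint32_t count = 0;
    vkEnumerateDeviceExtensionProperties(device, nullptr, &count, nullptr);
    std::vector<VkExtensionProperties> exts(count);
    vkEnumerateDeviceExtensionProperties(device, nullptr, &count, exts.data());

    for (const VkExtensionProperties& e : exts) {
        if (std::strcmp(e.extensionName, VK_NV_RAY_TRACING_EXTENSION_NAME) == 0)
            return true; // request it via VkDeviceCreateInfo::ppEnabledExtensionNames
    }
    return false; // no hardware RT: keep the software ray tracing path
}
```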
 
So looking at that Crysis Remastered benchmark, based on the clocks they're showing (why so low?), the RTX 3090 should be 36.8 TFLOPS, the RTX 3080 should be 31 TFLOPS and the RX 5700 XT should be 9.2 TFLOPS.

The 3090 is scoring almost exactly 4x the 5700XT which scales perfectly with TFLOPS. The 3080 is scoring 3x the 5700XT which is a little short of the 3.4x TFLOPS differential, but it's fairly close. It actually looks like this title is ALU limited.

Yes, there are actual use cases for that amount of compute performance. Since UE5 it seems we are moving in that direction. Much as I'm against NV, I don't actually think they are doing it wrong. But as with anything, time will tell.
 
I've read on AMD's website what they wrote about Smart Access Memory, but I don't get where the speedup is.

Could data sent to the GPU by the CPU go CPU => PCIe => GPU directly? Right now it goes to the main PC RAM first?

What kind of transfers or instructions would benefit from that?
 
I've read on AMD's website what they wrote about Smart Access Memory, but I don't get where the speedup is.

Could data sent to the GPU by the CPU go CPU => PCIe => GPU directly? Right now it goes to the main PC RAM first?

What kind of transfers or instructions would benefit from that?
Perhaps transfers to the super-fast Infinity Cache. It's similar to RTX I/O but inverse, IIRC, where the CPU can DIRECTLY access part of the GPU memory. It works at a driver level and you can enable or disable it depending on the game, and if games are designed to use it, this technology performs better.
 
This should be taught in schools. This guy knows his stuff. He is a Spaniard (use subs, it's worth it) but I haven't seen or listened to a better explanation of the advantages of the new AMD GPUs, especially why the Infinity Cache is such a great idea when "slow" memories can't handle everything super fast and be efficient. He mostly uses nVidia in his rigs, so he is not your typical suspiciously biased fanboy. He has a way with words to explain it.

 
Yeah, it was interesting. But we really need the arch day to see what's inside that die. The artistic render is beautiful, but I'm very curious to see the real die. And for sure this will be a game changer for mobile, but mostly for APUs, with the extra bandwidth, simplified design and lower power draw.

One thing is for sure. I'm enjoying the Hitler videos of this AMD beating Nvidia thing. :LOL:
 
I've read on AMD's website what they wrote about Smart Access Memory, but I don't get where the speedup is.

Could data sent to the GPU by the CPU go CPU => PCIe => GPU directly? Right now it goes to the main PC RAM first?

What kind of transfers or instructions would benefit from that?
It's what you need to do after the transfer. You only had a small window for uploading via the CPU with push semantics, and if you couldn't fit in there, you had to take the detour of writing to CPU address space first, ending up in RAM, and then trigger either a shader or the copy engines to perform the transfer from RAM into VRAM.

Twice the transfer size wasted in memory bandwidth on the CPU (with a good chance of cache miss on the CPU), plus the transfer size as wasted buffer space on the CPU, plus time further wasted as the shaders/copy engine are stalled by the slow PCIe bus.

Allowing the entire VRAM to be directly accessible from the CPU eliminates that buffer in RAM, and not only for a few selected resources as with the previous 256MB, but now for all of them. Effectively this is shifting the stall in the slow PCIe bus from the GPU to the CPU, but that is actually mostly fine. You can afford to spare a CPU core (or even one per direction) to drive the data transfer.

The speedup occurs only in titles which already made use of the AMD-"exclusive" host-visible GPU memory pool. It's only "exclusive" in the sense that none of the other vendors in this market used the APIs; it's not as if it was locked away behind any proprietary extension.
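
In Vulkan terms, that host-visible GPU memory pool shows up as a memory type flagged both DEVICE_LOCAL and HOST_VISIBLE. A rough sketch of how an application finds it; the heap size is whatever the driver exposes, typically ~256 MB without resizable BAR and the full VRAM with SAM:

```cpp
// Sketch: find a DEVICE_LOCAL + HOST_VISIBLE memory type, i.e. VRAM that the
// CPU can map and write directly over PCIe. Without resizable BAR this heap
// is typically ~256 MB; with SAM / resizable BAR it covers the whole VRAM.
#include <vulkan/vulkan.h>
#include <cstdio>

int findMappableVramType(VkPhysicalDevice device) {
    VkPhysicalDeviceMemoryProperties props{};
    vkGetPhysicalDeviceMemoryProperties(device, &props);

    const VkMemoryPropertyFlags wanted =
        VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;

    for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        if ((props.memoryTypes[i].propertyFlags & wanted) == wanted) {
            const VkDeviceSize heapSize =
                props.memoryHeaps[props.memoryTypes[i].heapIndex].size;
            std::printf("mappable VRAM: type %u, heap %llu MiB\n",
                        i, (unsigned long long)(heapSize >> 20));
            return (int)i; // allocate from this type, vkMapMemory, write in place
        }
    }
    return -1; // not available: stage through system RAM + copy engine instead
}
```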
 
If the 3080 cannot utilize its bandwidth, then neither can Turing and Pascal cards, since their bandwidth deltas correspond with performance deltas.
Only the 3090 has somewhat excessive bandwidth.

RTX 3080 offers 52-59 %* higher performance than RTX 2080 while having 69 % higher bandwidth.
RTX 3090 offers 39-45 %* higher performance than RTX 2080 Ti while having 52 % higher bandwidth.

The difference is not big, but when comparing product to product, bandwidth efficiency is worse for Ampere than for Turing. For many previous generations it was better when compared to the preceding one. RTX 3070 shows that the problem is not in architecture (which is in fact more bandwidth efficient), but in configuration of particular products.

*1440p-2160p, ComputerBase
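
For the record, those bandwidth deltas follow directly from the spec-sheet numbers (448 / 616 / 760 / 936 GB/s); a quick sketch of the comparison, using the midpoints of the ComputerBase performance ranges quoted above:

```cpp
// Where the bandwidth deltas above come from (spec-sheet numbers):
//   RTX 2080: 448 GB/s, RTX 3080: 760 GB/s; RTX 2080 Ti: 616 GB/s, RTX 3090: 936 GB/s.
// Performance gains use the midpoints of the ComputerBase ranges quoted above.
#include <cstdio>

int main() {
    struct Pair { const char* newer; const char* older; double bwNew, bwOld, perfGain; };
    const Pair pairs[] = {
        { "RTX 3080", "RTX 2080",    760.0, 448.0, 0.555 }, // 52-59 % -> ~55.5 %
        { "RTX 3090", "RTX 2080 Ti", 936.0, 616.0, 0.42  }, // 39-45 % -> ~42 %
    };
    for (const Pair& p : pairs) {
        const double bwGain = p.bwNew / p.bwOld - 1.0;
        std::printf("%s vs %s: +%.0f %% bandwidth, +%.0f %% performance (ratio %.2f)\n",
                    p.newer, p.older, 100.0 * bwGain, 100.0 * p.perfGain,
                    p.perfGain / bwGain);
    }
    return 0;
}
```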
 
What kind of transfers or instructions would benefit from that?
It's probably the main difference between PC and console that is now being addressed? I think Intel has this feature too (for their new discrete GPUs), and I hope we get it for any CPU / GPU vendor combination soon.

The slow PCIe data transfer forced me to implement everything on the GPU itself, including things like BVH build / refit, work generation, etc. Those are small tasks that cannot saturate the GPU, but doing them on the CPU would require too much data transfer, ending up much slower.
(For debug purposes I may still download all data from the GPU. If I do this, the framerate drops to 1 fps, not using GCN's shared 256 MB memory feature.)
So that allows some things to be done very differently. One thing where this is very useful is simulations, e.g. fluid sim on the GPU, rigid bodies on the CPU, and having proper interaction between them.
Currently we might not want to do this because data transfer and sync within a single frame could add too much.

Both vendors did much better than I expected this time. I thought Moore's Law was dead and stagnation was ahead, but it seems we're not there yet.
I'm surprised RDNA2 can compete with Ampere with only about half the shader cores. I don't think Infinity Cache alone can explain this. Maybe the frontend situation has now reversed between vendors as well, but for now it looks like 1 AMD TF > 1 NV TF in general.
I'll get me a 6800 next year to arrive at next gen. : )
 
RTX 3080 offers 52-59 %* higher performance than RTX 2080 while having 69 % higher bandwidth.
RTX 3090 offers 39-45 %* higher performance than RTX 2080 Ti while having 52 % higher bandwidth.

The difference is not big, but when comparing product to product, bandwidth efficiency is worse for Ampere than for Turing. For many previous generations it was better when compared to the preceding one. RTX 3070 shows that the problem is not in architecture (which is in fact more bandwidth efficient), but in configuration of particular products.

*1440p-2160p, ComputerBase

Right, so don't say they are less bandwidth efficient when you deny it in the next sentence.
We can also make a product-to-product comparison (you probably wanted to use something like tiers) between the 3080 and the 2080 Ti and see the same bandwidth utilization. Just to explain my carefulness regarding the claim that "Ampere cannot utilize its bandwidth".

The mystery is why Nvidia bothered with GDDR6X in the first place.

It is possible Nvidia overshot just because they can. But there is also the option that today's games are not a good workload for GA102 cards.
 
I'm surprised RDNA2 can compete with Ampere with only about half the shader cores. I don't think Infinity Cache alone can explain this. Maybe the frontend situation has now reversed between vendors as well, but for now it looks like 1 AMD TF > 1 NV TF in general.
I'll get me a 6800 next year to arrive at next gen. : )

Doubling shader cores is so much more than doubling the number of FP32 ALUs.
There is the instruction decoder/dispatcher, the register file size, the number of INT32 units, the number of load/store units, the number of special function units, the number of texture units, and the size of the L0/L1/L2 caches, all of which have not doubled.
 
I'm surprised RDNA2 can compete with Ampere with only about half the shader cores. I don't think Infinity Cache alone can explain this. Maybe the frontend situation has now reversed between vendors as well, but for now it looks like 1 AMD TF > 1 NV TF in general.

Doubling shader cores is so much more than doubling the number of FP32 ALUs.

Because that's just shady marketing from Nvidia. The number of ALUs is identical between Turing and Ampere; they just extended the integer ALUs to handle FP data, basically going from FP32 + INT32 per cycle to FP32 + FP32/INT32 per cycle. That does allow you to claim that your peak FP32 throughput has doubled, but it will almost never happen in practice. Don't get me wrong, I think it was a smart move by Nvidia, since modern GPU programs do need more floating-point performance, but marketing it as 2x shader cores is grossly misleading.

At the same time, Ampere's focus on FP throughput should give it an edge in many GPGPU workloads. I would love to see some comparisons between Navi 2 and Ampere for compute.
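
A toy model makes the "almost never in practice" point concrete: per clock, one pipe is FP32-only and the other is FP32/INT32, so any integer work eats into the second pipe's FP capacity. Plugging in roughly 36 INT instructions per 100 FP instructions (the ratio NVIDIA cited for typical game shaders in its Turing material) gives about 1.5x rather than 2x. The model below is my own simplification:

```cpp
// Toy model of Ampere's dual-issue datapath: one FP32-only pipe plus one
// FP32/INT32 pipe, vs. Turing's FP32 pipe plus INT32 pipe. INT instructions
// can only go to the shared pipe on Ampere, so effective FP32 throughput
// depends on how much integer work the shader stream contains.
#include <cstdio>

// intPerFp: INT instructions issued per FP instruction (0.36 ~ the
// "36 INT per 100 FP" ratio NVIDIA cited for typical game shaders).
double ampereFp32SpeedupVsTuring(double intPerFp) {
    // Ampere: 2 issue slots per clock shared between FP and INT work,
    // so FP instructions per clock = 2 / (1 + intPerFp)  (for intPerFp <= 1).
    const double ampereFpPerClock = 2.0 / (1.0 + intPerFp);
    // Turing: the dedicated FP32 pipe always delivers 1 FP per clock here.
    const double turingFpPerClock = 1.0;
    return ampereFpPerClock / turingFpPerClock;
}

int main() {
    const double ratios[] = { 0.0, 0.36, 1.0 };
    for (double r : ratios)
        std::printf("INT:FP = %.2f -> ~%.2fx Turing FP32 throughput\n",
                    r, ampereFp32SpeedupVsTuring(r));
    return 0;
}
```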
 
Because that's just shady marketing from Nvidia. The number of ALUs is identical between Turing and Ampere; they just extended the integer ALUs to handle FP data, basically going from FP32 + INT32 per cycle to FP32 + FP32/INT32 per cycle. That does allow you to claim that your peak FP32 throughput has doubled, but it will almost never happen in practice. Don't get me wrong, I think it was a smart move by Nvidia, since modern GPU programs do need more floating-point performance, but marketing it as 2x shader cores is grossly misleading.

At the same time, Ampere's focus on FP throughput should give it an edge in many GPGPU workloads. I would love to see some comparisons between Navi 2 and Ampere for compute.

They also never marketed the INT cores on Turing, so that was also misleading. Essentially they undersold Turing and are overselling Ampere in comparison. However, relative to Pascal and AMD's stuff, Ampere’s marketing is fine.

Either way RDNA 2 looks to be a far more balanced architecture for gaming. Idle ALUs are no good and AMD seems to be tackling that head on with their cache implementation. Will be interesting to see how it holds up in a broader set of games in 3rd party reviews.
 
This should be taught in schools. This guy knows his stuff. He is a Spaniard (use subs, it's worth it) but I haven't seen or listened to a better explanation of the advantages of the new AMD GPUs, especially why the Infinity Cache is such a great idea when "slow" memories can't handle everything super fast and be efficient. He mostly uses nVidia in his rigs, so he is not your typical suspiciously biased fanboy. He has a way with words to explain it.


I may be going blind but I can't find captions.
 
Either way RDNA 2 looks to be a far more balanced architecture for gaming. Idle ALUs are no good and AMD seems to be tackling that head on with their cache implementation. Will be interesting to see how it holds up in a broader set of games in 3rd party reviews.
It all boils down to balancing the die budget between compute and memory resources. With RDNA2, AMD put a stop to chasing raw FLOPS numbers and shifted the budget to the memory side of the equation with the Infinity Cache. I hope it pays off in the long term.
 