AMD: Navi Speculation, Rumours and Discussion [2019-2020]

I wouldn't look at the 5700XT's 4K results as a means to compare N22 with the 2080Ti. The former is clearly bandwidth limited in that case.

At TPU, the difference between the two is 46% at 1440p and 35% at 1080p.

Navi 22 doesn't need to be faster than the RTX3070, it just needs to be within 5-10% (biting at the heels) to leave the 3070 in an uncomfortable position, especially if it's significantly cheaper.
 
On a 192-bit bus? This cache must really be something special.
Isn't the 256-bit GDDR6 Navi 21 going head to head against an RTX 3080 with 320-bit GDDR6X?

Higher bandwidth effectiveness on RDNA2 should be the least surprising factor at this point.
 

And with 192-bit memory, it can offer 12 GB of VRAM against the 8 GB 3070, which is in a bit of a tight spot as 8 GB isn't an upgrade even after four years (compared to, say, the GTX 1000 series or the RX 400/500 series). I don't expect the 3070 to actually beat the 2080 Ti anyway; I'd expect it to be behind by 5-10%. It's gonna be a close fight.
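Back-of-the-envelope on the capacity point, using nothing beyond the standard GDDR6 options (one 32-bit channel per package, 8 Gb or 16 Gb density):

```python
# Standard GDDR6 packaging: one 32-bit channel per chip, 8 Gb (1 GB) or
# 16 Gb (2 GB) density. Nothing here is a leaked spec, just the catalogue options.
def capacities(bus_width_bits):
    chips = bus_width_bits // 32
    return f"{chips} chips -> {chips} GB (8 Gb parts) or {chips * 2} GB (16 Gb parts)"

for bus in (192, 256):
    print(f"{bus}-bit: {capacities(bus)}")
# 192-bit: 6 chips -> 6 GB or 12 GB
# 256-bit: 8 chips -> 8 GB or 16 GB
```

So 192 bit lands naturally on 12 GB with 16 Gb chips, while the 3070's 256-bit bus with 8 Gb chips gives the 8 GB it's announced with.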
All we have is AMD's 4K teaser. I don't know how to extrapolate that to N22-class hardware.

If anything, a 40CU part with a 192 bit interface should do even better than a 64/72/80 CU part with 256 bit.
 

The 3080 has 70% more bandwidth and 48% more (boost) FP32 than the 3070, and it will be ~35% faster. I don't think that bandwidth is such a huge factor outside of certain workloads like raytracing.
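For reference, the arithmetic behind those percentages, using the announced Ampere specs (the FP32 figure shifts by a point or two depending on which boost clock you plug in):

```python
# Announced specs: 3080 = 19 Gbps GDDR6X on 320 bit, 8704 FP32 lanes @ ~1.71 GHz boost;
#                  3070 = 14 Gbps GDDR6  on 256 bit, 5888 FP32 lanes @ ~1.73 GHz boost.
def bw_gbs(rate_gbps, bus_bits):
    return rate_gbps * bus_bits / 8        # GB/s

def tflops(lanes, clock_ghz):
    return lanes * clock_ghz * 2 / 1000    # FMA counts as 2 flops

bw_3080, bw_3070 = bw_gbs(19, 320), bw_gbs(14, 256)
fp_3080, fp_3070 = tflops(8704, 1.71), tflops(5888, 1.73)
print(f"bandwidth: {bw_3080:.0f} vs {bw_3070:.0f} GB/s -> +{bw_3080 / bw_3070 - 1:.0%}")
print(f"FP32:      {fp_3080:.1f} vs {fp_3070:.1f} TFLOPS -> +{fp_3080 / fp_3070 - 1:.0%}")
# bandwidth: 760 vs 448 GB/s -> +70%
# FP32:      29.8 vs 20.4 TFLOPS -> +46%
```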
 
The RT unit returns intermediate nodes too, not just leaves.
Yup, but I was thinking about treelet leaves, not the leaves of the whole BVH. Regardless, it was just a way of saying that there is no strict need to read the BVH nodes twice.
 
The patent indicated that the RT hardware would pass back what the algorithm considered relevant for further BVH instructions. That would at least mean intersection results if the rays in a given wavefront's group reached leaf nodes, or metadata indicating how traversal needed to continue and the pointers to be fed into the next BVH instruction.

Whether the actual instructions fully match that will hopefully be detailed once the architecture is fully published, but at least in theory it seemed like the ray tracing hardware would make the evaluation of the next step in the traversal process and give that recommendation to the SIMD.
In theory, involving the SIMD could mean there's the possibility for the programmable portion to not follow those recommendations, or use additional data to control the evaluation. Such a change wouldn't be necessary in the default case. At least some of Nvidia's RT method can be substituted with custom shaders for intersections, though at least with Turing the recommendation for performance was to keep to the built-in methods.
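To make the division of labour concrete, here is a toy CPU-side model of that scheme as I read the patent: the "shader" loop owns the stack and decides which nodes to visit next, while a stand-in for the fixed-function BVH instruction just reports what it found at each node. All the names here are invented for illustration; none of this is the actual ISA.

```python
# Toy model of shader-managed BVH traversal with a fixed-function intersection step.
# Node, bvh_intersect, trace: all hypothetical, purely to illustrate the control flow.
from dataclasses import dataclass, field

@dataclass
class Node:
    children: list = field(default_factory=list)   # internal node: child "pointers"
    triangles: list = field(default_factory=list)  # leaf node: primitive ids
    hit: bool = True                                # pretend result of the box test

def bvh_intersect(node, ray):
    """Stand-in for the RT unit: tests one node against the ray and hands back
    whatever the shader needs next (children to visit, or hit candidates at a leaf)."""
    if not node.hit:
        return [], []
    return node.children, node.triangles

def trace(root, ray):
    stack, hits = [root], []            # the *shader* owns this stack (registers / cache)
    while stack:
        node = stack.pop()
        children, candidates = bvh_intersect(node, ray)   # round trip to the RT unit
        hits.extend(candidates)         # programmable code decides what to do with results...
        stack.extend(children)          # ...and which nodes to visit next, and in what order
    return hits

leaf = Node(triangles=["tri0"])
print(trace(Node(children=[leaf, Node(hit=False)]), ray=None))   # ['tri0']
```

The point being illustrated: the loop, the stack and the visit order all stay in programmable hands, which is where both the flexibility and the cache-pressure question come from.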

Ok, it’s probably true that the shader doesn’t need to inspect the contents of the node in order to schedule it. But that doesn’t seem to be a notable benefit of shader-based scheduling, given it’s also the case for Nvidia’s fixed-function approach.

AMD’s patent calls for storing traversal state in registers and the texture cache. It would seem the shader is responsible for managing the traversal stack for each ray and that stack presumably lives in L0. I don’t see how you would avoid thrashing the cache if you try to do anything else alongside RT. Unless of course you have an “infinite” amount of cache :eek:
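Rough numbers on that thrashing worry. The 16 KB figure is RDNA's per-CU L0 vector cache from the whitepaper; the stack depth and entry size are guesses:

```python
# Assumed: 32-entry traversal stack of 4-byte node pointers per ray (a guess),
# wave32, and RDNA's 16 KB per-CU L0 vector cache.
STACK_ENTRIES, ENTRY_BYTES, WAVE_SIZE, L0_BYTES = 32, 4, 32, 16 * 1024

per_ray  = STACK_ENTRIES * ENTRY_BYTES              # 128 B
per_wave = per_ray * WAVE_SIZE                      # 4 KiB
print(f"{per_wave // 1024} KiB of stack per wavefront -> "
      f"{L0_BYTES // per_wave} wavefronts' stacks fill L0, "
      f"before counting the BVH nodes themselves or anything else running on the CU")
```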
 
But how much are GPUs bandwidth bound in reality? I mean, look at the projected performance level of the 3070: the same as the 2080 Ti even at 4K, but with 73% of the latter's available bandwidth (and the same bandwidth as the 5700XT). Rumors say that Navi 21 will use 16 Gbps GDDR6 (+15% bandwidth compared to 14 Gbps), and while I'm skeptical about this "magic cache", I am not so sure that everything about the performance level can be explained by the available bandwidth.
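For what it's worth, the arithmetic behind those two figures (the 16 vs 14 Gbps uplift is really ~14%):

```python
# 2080 Ti: 14 Gbps GDDR6 on 352 bit; 3070: 14 Gbps on 256 bit (same as the 5700 XT).
gbs = lambda rate_gbps, bus_bits: rate_gbps * bus_bits / 8
print(f"3070 vs 2080 Ti: {gbs(14, 256):.0f} / {gbs(14, 352):.0f} GB/s "
      f"= {gbs(14, 256) / gbs(14, 352):.0%}")           # ~73%
print(f"16 Gbps vs 14 Gbps: +{16 / 14 - 1:.0%}")        # ~+14%
```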
 
Aren’t ROPs typically bandwidth limited except at lower precision? If you’re writing out a lot of buffers this could matter.
 
Do better as in be less bandwidth bound?

Yes, the bandwidth-to-flops ratio should be the best for the 40 CU part, then the 64, 72 and 80 CU parts respectively. Assuming that the rumoured bus widths of 192 bit and 256 bit are true, of course.
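A quick sketch of how those ratios fall out if the rumoured bus widths hold, assuming (purely for illustration) 16 Gbps GDDR6 and the same game clock on every part:

```python
# Hypothetical: 2.0 GHz on every SKU and 16 Gbps GDDR6 throughout; only the
# 192/256 bit widths and CU counts come from the rumours being discussed.
CLOCK_GHZ, RATE_GBPS = 2.0, 16

def gb_per_tflop(cus, bus_bits):
    tflops = cus * 64 * 2 * CLOCK_GHZ / 1000    # 64 FP32 lanes per CU, FMA = 2 flops
    bandwidth = RATE_GBPS * bus_bits / 8        # GB/s
    return bandwidth / tflops

for cus, bus in [(40, 192), (64, 256), (72, 256), (80, 256)]:
    print(f"{cus} CU / {bus} bit: {gb_per_tflop(cus, bus):.1f} GB/s per TFLOP")
# 40 CU / 192 bit: 37.5
# 64 CU / 256 bit: 31.2
# 72 CU / 256 bit: 27.8
# 80 CU / 256 bit: 25.0
```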

Agreed. And as we saw with the 5600 XT, even with 25% less bandwidth the performance hit was only in the single digits.
 
Yeah, but in RDNA2 the RBEs are not physically connected to the RAM controllers; they are clients of the L2 cache instead.
Curious, aren't all ROPs typically tied to caches in past and current gens? IIRC the difference with RDNA is that compute is now tied in with the L2 cache, whereas with GCN it went directly to the memory controller. But I think the ROPs are unchanged.

See this older post by sebbbi:
https://forum.beyond3d.com/posts/1934106/
Some years ago I did ROP cache experiments with AMD GCN (7970) in order to optimize particle rendering. GCN has dedicated ROP caches (16 KB color, 4 KB depth). In my experiment I split the rendering to 64x64 tiles (= 16 KB). This resulted in huge memory bandwidth savings (and over 100% performance increase), especially when the overdraw was large (lots of full screen alpha blended particles close to the camera). You can certainly get big bandwidth advantages also on AMD hardware, as long as you sort your workload (by screen locality) before submitting it.

with respect to RDNA
The final fixed-function graphics stage is the RB, which performs depth, stencil, and alpha tests and blends pixels for anti-aliasing. Each of the RBs in the shader array can test, sample, and blend pixels at a rate of four output pixels per clock. One of the major improvements in the RDNA architecture is that the RBs primarily access data through the graphics L1 cache, which reduces the pressure on the L2 cache and saves power by moving less data.
It does look like they changed how the RBs access data, however.
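Running sebbbi's numbers from the quote above, plus a made-up overdraw figure just to show the scale of the saving he's describing:

```python
# 64x64 px at 4 B/px (RGBA8) exactly fills GCN's 16 KB ROP colour cache, so all
# the blending for a tile can happen on-chip. The overdraw figure below is invented.
TILE, BPP, COLOR_CACHE = 64, 4, 16 * 1024

tile_bytes = TILE * TILE * BPP
print(f"tile footprint: {tile_bytes // 1024} KB (vs {COLOR_CACHE // 1024} KB colour cache)")

overdraw = 20                                # blended particle layers per pixel
untiled = overdraw * 2 * tile_bytes          # read + write the render target every layer
tiled   = tile_bytes                         # blend in-cache, write back roughly once
print(f"per-tile traffic at {overdraw}x overdraw: ~{untiled // 1024} KB -> ~{tiled // 1024} KB")
```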
 

Well, we don't know if the ROPs are unchanged; someone here pointed out the opposite. But we will see. What I mean is that, being decoupled from the memory controllers by the L2 cache, the RBEs will be served first by internal bandwidth, and external memory is accessed only if the data is not present in the on-chip caches. So yes, there will be some limitation due to bandwidth, but there are also techniques for lowering bandwidth needs within certain limits (data compression, larger caches, and so on).
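Crude illustration of that point (hit rates here are completely invented): the more of the RBEs' traffic the on-chip caches absorb, the less ever has to touch GDDR6.

```python
# Only misses reach external memory; compression would shrink the numbers further
# (left out here). The 600 GB/s demand figure is arbitrary.
def external_gbs(demand_gbs, hit_rate):
    return demand_gbs * (1 - hit_rate)

for hit in (0.0, 0.3, 0.5, 0.7):
    print(f"hit rate {hit:.0%}: {external_gbs(600, hit):.0f} GB/s of a 600 GB/s demand goes off-chip")
```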
 

You may be right that it’s more balanced. In terms of absolute performance though it’ll be really interesting to see where the chips fall.
 

Given the streaming nature of graphics workloads, I understand GPU caches are mostly helpful for spatial locality on reads. Do the RDNA L1 and L2 caches also buffer writes to render targets and UAVs? I.e., do those caches coalesce multiple writes before flushing to VRAM?
 

This is a good question; I think we will have an answer on the 28th.
 