According to Techpowerup, that is the difference, but at 1080p. In 4K, it is around 50%
My bad, I thought we are comparing to 4K.
Isn't the 256bit GDDR6 Navi 21 going head to head against a RTX 3080 with 320bit GDDR6X?
On a 192-bit bus? This cache must really be something special.
Isn't the 256bit GDDR6 Navi 21 going head to head against a RTX 3080 with 320bit GDDR6X?
Higher bandwidth effectiveness on RDNA2 should be the least surprising factor at this point.
I wouldn't look at the 5700XT's 4K results as a means to compare N22 with the 2080Ti. The former is clearly bandwidth limited in that case.
At TPU, the difference between the two is 46% at 1440p and 35% at 1080p.
Navi 22 doesn't need to be faster than the RTX3070, it just needs to be within 5-10% (biting at the heels) to leave the 3070 in an uncomfortable position, especially if it's significantly cheaper.
All we have is AMD's 4K teaser. I don't know how to extrapolate that to Navi 22-class hardware.
The RT unit returns intermediate nodes too, not just leaves.
Yup, but I was thinking about treelet leaves, not the whole BVH. Regardless, it was just a way of saying that there is no strict need to read the BVH nodes twice.
The patent indicated that the RT hardware would pass back what the algorithm considered relevant for further BVH instructions. That would at least mean intersection results if the rays in a given wavefront's group reached leaf nodes, or metadata indicating how traversal needed to continue and the pointers to be fed into the next BVH instruction.
Whether the actual instructions fully match that will hopefully be detailed once the architecture is published, but at least in theory it seemed like the ray tracing hardware would evaluate the next step in the traversal process and hand that recommendation to the SIMD.
In theory, involving the SIMD could leave room for the programmable portion to ignore those recommendations, or to use additional data to control the evaluation. Such a change wouldn't be necessary in the default case. At least some of Nvidia's RT method can be substituted with custom intersection shaders, though at least with Turing the recommendation for performance was to stick to the built-in methods.
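For illustration, here's a rough CPU-side sketch of the hybrid loop the patent seems to describe: the fixed-function unit handles one node per "BVH instruction" and hands results back, while the programmable side owns the stack. None of these type or function names are real ISA; they're hypothetical stand-ins for that split.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for what the intersection engine might return:
// either a leaf hit result, or the child pointers traversal should
// visit next (the "recommendation" discussed above).
struct NodeResult {
    bool     isLeaf;        // true: hitT is a candidate intersection
    float    hitT;          // closest hit distance found in this node
    uint32_t children[4];   // interior node: up to 4 children to revisit
    int      numChildren;
};

// Stub for the fixed-function box/triangle test (the "BVH instruction").
NodeResult intersectNode(uint32_t /*nodePtr*/) {
    return {true, 1.0f, {}, 0};  // dummy: pretend every node is a leaf
}

// Shader-side loop: the SIMD keeps the traversal stack and decides
// which of the returned pointers to feed into the next BVH instruction.
float traverse(uint32_t rootPtr) {
    float closest = 1e30f;
    std::vector<uint32_t> stack{rootPtr};
    while (!stack.empty()) {
        uint32_t node = stack.back();
        stack.pop_back();
        NodeResult r = intersectNode(node);   // fixed-function step
        if (r.isLeaf) {
            if (r.hitT < closest) closest = r.hitT;  // record hit
        } else {
            // The programmable part is free to reorder, cull, or skip
            // these -- the flexibility (and cost) of not putting the
            // whole traversal in hardware.
            for (int i = 0; i < r.numChildren; ++i)
                stack.push_back(r.children[i]);
        }
    }
    return closest;
}
```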
If anything, a 40CU part with a 192 bit interface should do even better than a 64/72/80 CU part with 256 bit.
Aren’t ROPs typically bandwidth limited except at lower precision? If you’re writing out a lot of buffers, this could matter.
Do better as in be less bandwidth bound?
But how much are GPUs bandwidth bound in reality? I mean, look at the projected performance level of the 3070 - same as the 2080Ti even at 4K, but with 73% of the latter's available bandwidth (and the same as the 5700XT). Rumors say that Navi 21 will use 16 Gbps GDDR6 (~14% more bandwidth than 14 Gbps), and while I'm skeptical about this "magic cache", I am not so sure that everything about performance can be explained by the available bandwidth alone.
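For reference, the raw arithmetic behind those percentages (peak figures only; effective bandwidth additionally depends on compression and caching):

```cpp
#include <cstdio>

// Peak bandwidth in GB/s = bus width (bits) * data rate (Gbps) / 8.
double peakBandwidth(int busBits, double gbps) { return busBits * gbps / 8.0; }

int main() {
    double rtx2080Ti = peakBandwidth(352, 14.0);  // 616 GB/s
    double rtx3070   = peakBandwidth(256, 14.0);  // 448 GB/s, same as 5700XT
    double navi21_14 = peakBandwidth(256, 14.0);  // 448 GB/s
    double navi21_16 = peakBandwidth(256, 16.0);  // 512 GB/s, the rumoured config
    printf("3070 / 2080Ti : %.0f%%\n", 100.0 * rtx3070 / rtx2080Ti);              // ~73%
    printf("16 vs 14 Gbps : +%.0f%%\n", 100.0 * (navi21_16 / navi21_14 - 1.0));   // ~14%
}
```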
Curious, aren't all ROPs typically tied to caches, past and current gen?
Yeah, but in RDNA2 the RBEs are not physically connected to the RAM controllers; they are clients of the L2 cache instead.
Some years ago I did ROP cache experiments with AMD GCN (7970) in order to optimize particle rendering. GCN has dedicated ROP caches (16 KB color, 4 KB depth). In my experiment I split the rendering to 64x64 tiles (= 16 KB). This resulted in huge memory bandwidth savings (and over 100% performance increase), especially when the overdraw was large (lots of full screen alpha blended particles close to the camera). You can certainly get big bandwidth advantages also on AMD hardware, as long as you sort your workload (by screen locality) before submitting it.
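As a rough illustration of that experiment's structure (the API calls below are placeholder hooks, not any real graphics API):

```cpp
// Sketch of the tiled submission described above: render the particle
// batch once per 64x64 screen tile so all blending for a tile stays in
// the 16 KB ROP color cache (64 * 64 * 4 bytes at 32bpp) before being
// flushed to memory. Heavy overdraw then costs cache bandwidth, not DRAM.
struct Rect { int x, y, w, h; };
void setScissor(const Rect&) { /* bind scissor rect in the real renderer */ }
void drawParticles()         { /* submit the pre-sorted particle batch   */ }

void renderParticlesTiled(int screenW, int screenH) {
    const int tile = 64;
    for (int y = 0; y < screenH; y += tile) {
        for (int x = 0; x < screenW; x += tile) {
            setScissor({x, y, tile, tile});
            drawParticles();  // overdraw now hits the ROP cache, not DRAM
        }
    }
}
```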
It does look like they changed how the RBs access data, however:
The final fixed-function graphics stage is the RB, which performs depth, stencil, and alpha tests and blends pixels for anti-aliasing. Each of the RBs in the shader array can test, sample, and blend pixels at a rate of four output pixels per clock. One of the major improvements in the RDNA architecture is that the RBs primarily access data through the graphics L1 cache, which reduces the pressure on the L2 cache and saves power by moving less data.
Curious, aren't all ROPs typically tied to caches, past and current gen?
IIRC the difference with RDNA is that compute is now tied in with the L2 cache, whereas with GCN it went directly to the memory controller. But I think the ROPs are unchanged.
Yes, the bandwidth-to-FLOPs ratio should be best for the 40CU part, then the 64, 72 and 80CU parts respectively. Assuming the rumoured bus widths of 192 bit and 256 bit are true, of course.
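Plugging in the rumoured configurations (assuming, purely for illustration, the same 16 Gbps GDDR6 on every part, which is not confirmed):

```cpp
#include <cstdio>

// GB/s per CU, from the rumoured bus widths and an assumed 16 Gbps rate.
double perCU(int busBits, double gbps, int cus) {
    return busBits * gbps / 8.0 / cus;
}

int main() {
    printf("40 CU / 192-bit: %.1f GB/s per CU\n", perCU(192, 16.0, 40)); // 9.6
    printf("64 CU / 256-bit: %.1f GB/s per CU\n", perCU(256, 16.0, 64)); // 8.0
    printf("72 CU / 256-bit: %.1f GB/s per CU\n", perCU(256, 16.0, 72)); // 7.1
    printf("80 CU / 256-bit: %.1f GB/s per CU\n", perCU(256, 16.0, 80)); // 6.4
}
```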
Well, we don't know that the ROPs are unchanged; someone here pointed out the opposite. But we will see. What I mean is that, being physically decoupled from memory by the L2 cache, the RBEs will be served first by internal bandwidth, and external memory is accessed only if the data is not present in the cache. So yes, there will be some limitation due to bandwidth, but there are also techniques for lowering bandwidth needs within certain limits (data compression, larger caches, and so on).
Given the streaming nature of graphics workloads, I understand GPU caches are mostly helpful for spatial locality on reads. Do the RDNA L1 and L2 caches also buffer writes to render targets and UAVs? I.e., do those caches absorb multiple writes before flushing to VRAM?
The comparison would really only be relevant if they were both priced the same or very close.