According to Techpowerup, that is the difference, but at 1080p. In 4K, it is around 50%
My bad, I thought we are comparing to 4K.
Isn't the 256bit GDDR6 Navi 21 going head to head against a RTX 3080 with 320bit GDDR6X?
On a 192-bit bus? This cache must really be something special.
Isn't the 256bit GDDR6 Navi 21 going head to head against a RTX 3080 with 320bit GDDR6X?
Higher bandwidth effectiveness on RDNA2 should be the least surprising factor at this point.
I wouldn't look at the 5700XT's 4K results as a means to compare N22 with the 2080Ti. The former is clearly bandwidth limited in that case.
At TPU, the difference between the two is 46% at 1440p and 35% at 1080p.
Navi 22 doesn't need to be faster than the RTX3070, it just needs to be within 5-10% (biting at the heels) to leave the 3070 in an uncomfortable position, especially if it's significantly cheaper.
All we have is AMD's 4K teaser. I don't know how to extrapolate that to Navi 22-class hardware.
The RT unit returns intermediate nodes too, not just leaves.
Yup, but I was thinking about treelet leaves, not the whole BVH. Regardless, it was just a way of saying that there is no strict need to read the BVH nodes twice.
The patent indicated that the RT hardware would pass back what the algorithm considered relevant for further BVH instructions. That would at least mean intersection results if the rays in a given wavefront's group reached leaf nodes, or metadata indicating how traversal needed to continue and the pointers to be fed into the next BVH instruction.
Whether the actual instructions fully match that will hopefully be detailed once the architecture is published, but at least in theory it seemed like the ray tracing hardware would evaluate the next step in the traversal process and hand that recommendation to the SIMD.
In theory, involving the SIMD could leave room for the programmable portion to ignore those recommendations, or to use additional data to control the evaluation. Such a change wouldn't be necessary in the default case. At least some of Nvidia's RT method can be substituted with custom intersection shaders, though at least with Turing the recommendation for performance was to stick to the built-in methods.
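For illustration, here's a rough CPU-side sketch of the hybrid loop the patent seems to describe: the fixed-function unit handles one node per "BVH instruction" and hands results back, while the programmable side owns the stack. None of these type or function names are real ISA; they're hypothetical stand-ins for that split.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for what the intersection engine might return:
// either a leaf hit result, or the child pointers traversal should
// visit next (the "recommendation" discussed above).
struct NodeResult {
    bool     isLeaf;        // true: hitT is a candidate intersection
    float    hitT;          // closest hit distance found in this node
    uint32_t children[4];   // interior node: up to 4 children to revisit
    int      numChildren;
};

// Stub for the fixed-function box/triangle test (the "BVH instruction").
NodeResult intersectNode(uint32_t /*nodePtr*/) {
    return {true, 1.0f, {}, 0};  // dummy: pretend every node is a leaf
}

// Shader-side loop: the SIMD keeps the traversal stack and decides
// which of the returned pointers to feed into the next BVH instruction.
float traverse(uint32_t rootPtr) {
    float closest = 1e30f;
    std::vector<uint32_t> stack{rootPtr};
    while (!stack.empty()) {
        uint32_t node = stack.back();
        stack.pop_back();
        NodeResult r = intersectNode(node);   // fixed-function step
        if (r.isLeaf) {
            if (r.hitT < closest) closest = r.hitT;  // record hit
        } else {
            // The programmable part is free to reorder, cull, or skip
            // these -- the flexibility (and cost) of not putting the
            // whole traversal in hardware.
            for (int i = 0; i < r.numChildren; ++i)
                stack.push_back(r.children[i]);
        }
    }
    return closest;
}
```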
If anything, a 40CU part with a 192 bit interface should do even better than a 64/72/80 CU part with 256 bit.
Aren’t ROPs typically bandwidth limited except at lower precision? If you’re writing out a lot of buffers, this could matter.
Do better as in be less bandwidth bound?
But how much are GPUs bandwidth bound in reality? I mean, look at the projected performance level of the 3070 - same as the 2080Ti even at 4K, but with 73% of the latter's available bandwidth (and the same as the 5700XT). Rumors say that Navi 21 will use 16 Gbps GDDR6 (~14% more bandwidth than 14 Gbps), and while I'm skeptical about this "magic cache", I am not so sure that everything about performance can be explained by the available bandwidth alone.
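For reference, the raw arithmetic behind those percentages (peak figures only; effective bandwidth additionally depends on compression and caching):

```cpp
#include <cstdio>

// Peak bandwidth in GB/s = bus width (bits) * data rate (Gbps) / 8.
double peakBandwidth(int busBits, double gbps) { return busBits * gbps / 8.0; }

int main() {
    double rtx2080Ti = peakBandwidth(352, 14.0);  // 616 GB/s
    double rtx3070   = peakBandwidth(256, 14.0);  // 448 GB/s, same as 5700XT
    double navi21_14 = peakBandwidth(256, 14.0);  // 448 GB/s
    double navi21_16 = peakBandwidth(256, 16.0);  // 512 GB/s, the rumoured config
    printf("3070 / 2080Ti : %.0f%%\n", 100.0 * rtx3070 / rtx2080Ti);              // ~73%
    printf("16 vs 14 Gbps : +%.0f%%\n", 100.0 * (navi21_16 / navi21_14 - 1.0));   // ~14%
}
```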
Curious, aren't all ROPs typically tied to caches, past and current gen?
Yeah, but in RDNA2 the RBEs are not physically connected to the RAM controllers; they are clients of the L2 cache instead.
Some years ago I did ROP cache experiments with AMD GCN (7970) in order to optimize particle rendering. GCN has dedicated ROP caches (16 KB color, 4 KB depth). In my experiment I split the rendering to 64x64 tiles (= 16 KB). This resulted in huge memory bandwidth savings (and over 100% performance increase), especially when the overdraw was large (lots of full screen alpha blended particles close to the camera). You can certainly get big bandwidth advantages also on AMD hardware, as long as you sort your workload (by screen locality) before submitting it.
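As a rough illustration of that experiment's structure (the API calls below are placeholder hooks, not any real graphics API):

```cpp
// Sketch of the tiled submission described above: render the particle
// batch once per 64x64 screen tile so all blending for a tile stays in
// the 16 KB ROP color cache (64 * 64 * 4 bytes at 32bpp) before being
// flushed to memory. Heavy overdraw then costs cache bandwidth, not DRAM.
struct Rect { int x, y, w, h; };
void setScissor(const Rect&) { /* bind scissor rect in the real renderer */ }
void drawParticles()         { /* submit the pre-sorted particle batch   */ }

void renderParticlesTiled(int screenW, int screenH) {
    const int tile = 64;
    for (int y = 0; y < screenH; y += tile) {
        for (int x = 0; x < screenW; x += tile) {
            setScissor({x, y, tile, tile});
            drawParticles();  // overdraw now hits the ROP cache, not DRAM
        }
    }
}
```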
It does look like they changed how the RBs access data, however:
The final fixed-function graphics stage is the RB, which performs depth, stencil, and alpha tests and blends pixels for anti-aliasing. Each of the RBs in the shader array can test, sample, and blend pixels at a rate of four output pixels per clock. One of the major improvements in the RDNA architecture is that the RBs primarily access data through the graphics L1 cache, which reduces the pressure on the L2 cache and saves power by moving less data.
Curious, aren't all ROPs typically tied to caches, past and current gen?
IIRC the difference with RDNA is that compute is now tied in with the L2 cache, whereas with GCN it went directly to the memory controller. But I think the ROPs are unchanged.
Yes, the bandwidth-to-FLOPs ratio should be best for the 40CU part, then the 64, 72 and 80CU parts respectively. Assuming the rumoured bus widths of 192 bit and 256 bit are true, of course.
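Plugging in the rumoured configurations (assuming, purely for illustration, the same 16 Gbps GDDR6 on every part, which is not confirmed):

```cpp
#include <cstdio>

// GB/s per CU, from the rumoured bus widths and an assumed 16 Gbps rate.
double perCU(int busBits, double gbps, int cus) {
    return busBits * gbps / 8.0 / cus;
}

int main() {
    printf("40 CU / 192-bit: %.1f GB/s per CU\n", perCU(192, 16.0, 40)); // 9.6
    printf("64 CU / 256-bit: %.1f GB/s per CU\n", perCU(256, 16.0, 64)); // 8.0
    printf("72 CU / 256-bit: %.1f GB/s per CU\n", perCU(256, 16.0, 72)); // 7.1
    printf("80 CU / 256-bit: %.1f GB/s per CU\n", perCU(256, 16.0, 80)); // 6.4
}
```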
Well, we don't know that the ROPs are unchanged; someone here pointed out the opposite. But we will see. What I mean is that, being physically decoupled from memory by the L2 cache, the RBEs will be served first by internal bandwidth, and external memory is accessed only if the data is not present in the cache. So yes, there will be some limitation due to bandwidth, but there are also techniques for lowering bandwidth needs within certain limits (data compression, larger caches, and so on).
Given the streaming nature of graphics workloads, I understand GPU caches are mostly helpful for spatial locality on reads. Do the RDNA L1 and L2 caches also buffer writes to render targets and UAVs? I.e., do those caches absorb multiple writes before flushing to VRAM?
The comparison would really only be relevant if they were both priced the same or very close.