AMD: Navi Speculation, Rumours and Discussion [2019-2020]

L2 (global memory) atomics are effectively a superset of ROP functions, but it could be argued that when a rasteriser generates work items, there is no need to invalidate L1 lines. Compute atomics must invalidate simply because any CU anywhere on the GPU could use an atomic on that same address, so all L1 lines that cache that L2 line need to be invalidated.
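A toy model of that argument (my own sketch, not how the hardware is documented to work): a compute atomic serviced at L2 has to knock the line out of every L1 that might hold a copy, while ROP-style traffic could skip that step if rasteriser output never sits in the CUs' L1s.

Code:
# Toy coherence model for the argument above (illustrative only, not the real protocol).
class L2Slice:
    def __init__(self):
        self.data = {}      # address -> value
        self.sharers = {}   # address -> set of L1 ids holding a copy

    def l1_read(self, l1_id, addr):
        # An L1 fill registers that cache as a sharer of the line.
        self.sharers.setdefault(addr, set()).add(l1_id)
        return self.data.get(addr, 0)

    def compute_atomic_add(self, addr, value):
        # Any CU anywhere may have the line cached, so every L1 copy must be invalidated.
        invalidated = self.sharers.pop(addr, set())
        self.data[addr] = self.data.get(addr, 0) + value
        return invalidated

    def rop_blend(self, addr, value):
        # Per the argument: rasteriser output isn't cached in the CUs' L1s,
        # so no L1 invalidation traffic is needed for ROP-style updates.
        self.data[addr] = self.data.get(addr, 0) + value

l2 = L2Slice()
l2.l1_read(l1_id=0, addr=0x1000)
l2.l1_read(l1_id=7, addr=0x1000)
print(l2.compute_atomic_add(0x1000, 1))   # {0, 7}: both L1 copies invalidated
l2.rop_blend(0x2000, 1)                   # no sharers to invalidate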

128B cache lines are "relatively large" compared with render-target pixels (at least for simple 32-bit-per-pixel formats), and of course the alignment of rasterised fragments to cache lines is quite coarse and will generally suffer "quad misalignment". So it would seem best to talk about the bandwidth amplification of L1 in terms of cache lines rather than pixels.
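For a rough feel of the numbers (assuming a simple linear 32-bit-per-pixel layout, which real swizzled render targets won't match exactly):

Code:
# Back-of-envelope: 128 B cache lines vs. 32-bit render-target pixels.
LINE_BYTES = 128
BPP = 4                          # bytes per pixel for a simple 32-bit format
pixels_per_line = LINE_BYTES // BPP
print(pixels_per_line)           # 32 pixels share one cache line

# A 2x2 quad only touches 16 B, so a single misaligned quad can still dirty
# one or two whole 128 B lines -- hence counting bandwidth amplification
# in cache lines rather than pixels.
quad_bytes = 4 * BPP
print(LINE_BYTES / quad_bytes)   # 8x line-to-quad size ratio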
 
Big Navi could be faster than the 3080. There could also be a new Navi with the speed of the 5700 and 2160 SPs, which could cost as little as 250€ and become the successor to the RX 580/570 in the low/mid-range segment.
Maybe only on blowout sales, but you can already get 5600 XTs (basically a 5700-ish class of card) for 250 €. If it comes with 8+ GByte and DXR hardware, then OK, fair point.
 
What I want is around 2060 performance at RX 580 price. So far it has been a sidegrade if you want to upgrade but only have RX 580 money (even at launch price!).
 
So would 4-8 GB of HBM2 together with a 256-bit GDDR bus make sense for them?

I guess it depends on how hard/easy it is to route 8x 32-bit channels of GDDR6 out of an interposer.

If they could be used effectively at the same time, 256bit + 1 HBM2E stack would make for some very interesting combinations, though.
Cards with that chip could range from 336GB/s (e.g. cut-down 192bit 14Gbps GDDR6) all the way up to 922GB/s (256bit 16Gbps GDDR6 + 3.2Gbps HBM2e), or even more if HBM2e goes out of spec.
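Reproducing those figures (straight per-pin math, assuming a single 1024-bit HBM2e stack):

Code:
# Aggregate bandwidth for the hypothetical GDDR6 + HBM2e combinations above.
def gddr6_gbps(bus_bits, pin_gbps):
    return bus_bits * pin_gbps / 8                 # GB/s

def hbm2e_gbps(stacks, pin_gbps, stack_bits=1024):
    return stacks * stack_bits * pin_gbps / 8      # GB/s

print(gddr6_gbps(192, 14))                         # 336 GB/s (cut-down GDDR6 only)
print(gddr6_gbps(256, 16) + hbm2e_gbps(1, 3.2))    # 512 + 409.6 = 921.6 ~ 922 GB/s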

Desktop midrange cards could use just GDDR6, while higher-end offerings could have HBM2e + GDDR6. They could also have premium mobile versions with only HBM2e for maximum power efficiency, plus cheaper high-performance mobile cards with 192/256-bit GDDR6.
 
Some recent LLVM changes have started fleshing out details on the BVH instructions for GFX1030.
Among other things, some stubs for instruction errors dating back to June 2019 now have more context as to what they were referring to.
https://github.com/llvm/llvm-project/commit/91f503c3af190e19974f8832871e363d232cd64c

Code:
image_bvh_intersect_ray v[4:7], v[9:24], s[4:7]
// GFX10: encoding: [0x01,0x9f,0x98,0xf1,0x09,0x04,0x01,0x00]

image_bvh_intersect_ray v[4:7], v[9:16], s[4:7] a16
// GFX10: encoding: [0x01,0x9f,0x98,0xf1,0x09,0x04,0x01,0x40]

image_bvh64_intersect_ray v[4:7], v[9:24], s[4:7]
// GFX10: encoding: [0x01,0x9f,0x9c,0xf1,0x09,0x04,0x01,0x00]

image_bvh64_intersect_ray v[4:7], v[9:24], s[4:7] a16
// GFX10: encoding: [0x01,0x9f,0x9c,0xf1,0x09,0x04,0x01,0x40]

image_bvh_intersect_ray v[39:42], [v50, v46, v23, v17, v16, v15, v21, v20, v19, v37, v40], s[12:15]
// GFX10: encoding: [0x07,0x9f,0x98,0xf1,0x32,0x27,0x03,0x00,0x2e,0x17,0x11,0x10,0x0f,0x15,0x14,0x13,0x25,0x28,0x00,0x00]

image_bvh_intersect_ray v[39:42], [v50, v46, v23, v17, v16, v15, v21, v20], s[12:15] a16
// GFX10: encoding: [0x05,0x9f,0x98,0xf1,0x32,0x27,0x03,0x40,0x2e,0x17,0x11,0x10,0x0f,0x15,0x14,0x00]

image_bvh64_intersect_ray v[39:42], [v50, v46, v23, v17, v16, v15, v21, v20, v19, v37, v40, v42], s[12:15]
// GFX10: encoding: [0x07,0x9f,0x9c,0xf1,0x32,0x27,0x03,0x00,0x2e,0x17,0x11,0x10,0x0f,0x15,0x14,0x13,0x25,0x28,0x2a,0x00]

image_bvh64_intersect_ray v[39:42], [v50, v46, v23, v17, v16, v15, v21, v20, v19], s[12:15] a16
// GFX10: encoding: [0x05,0x9f,0x9c,0xf1,0x32,0x27,0x03,0x40,0x2e,0x17,0x11,0x10,0x0f,0x15,0x14,0x13]
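Reading the register lists in the NSA forms above, the VGPR counts line up with what you'd expect if the ray payload is (node pointer, ray extent, origin.xyz, dir.xyz, inverse dir.xyz), with the a16 variants packing the direction vectors into fp16 and the bvh64 variants using a 64-bit node pointer. A quick tally under that assumption (inferred from the operand counts, not from any documentation):

Code:
# Tally of VGPRs per variant, assuming a payload of
# node_ptr + ray_extent + origin.xyz + dir.xyz + inv_dir.xyz
# (this breakdown is inferred from the register ranges above, not confirmed).
def vaddr_dwords(node_ptr_dwords, a16):
    origin = 3                    # fp32 x/y/z
    dirs = 3 if a16 else 6        # dir.xyz + inv_dir.xyz, packed to fp16 with a16
    return node_ptr_dwords + 1 + origin + dirs   # +1 for the ray extent

print(vaddr_dwords(1, a16=False))   # 11 VGPRs -> image_bvh_intersect_ray
print(vaddr_dwords(1, a16=True))    #  8 VGPRs -> image_bvh_intersect_ray a16
print(vaddr_dwords(2, a16=False))   # 12 VGPRs -> image_bvh64_intersect_ray
print(vaddr_dwords(2, a16=True))    #  9 VGPRs -> image_bvh64_intersect_ray a16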
 
What I want is around 2060 performance at RX 580 price. So far it has been a sidegrade if you want to upgrade but only have RX 580 money (even at launch price!).

Wow I forgot the 580 was $220 at launch. I think there's a very good chance of seeing 2060 performance under $250.
 
You know, the easy answer to the "small" bus width is that RDNA2 just doubles the L2$ size per slice. That'd give a doubled CU count the same L2$ per CU on an unchanged bus width. Makes more sense to me than any of the other proposals so far.
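To put rough numbers on that (assuming Navi 10's 4 MB L2 in 16 x 256 KB slices on a 256-bit bus, and a hypothetical 80-CU big chip on the same bus; my figures, not confirmed specs):

Code:
# What "double the L2 per slice" buys on an unchanged 256-bit bus (assumed figures).
NAVI10_CUS, NAVI10_L2_KB, SLICES = 40, 4096, 16   # 256 KB per slice assumed
BIG_CUS = 80                                      # hypothetical doubled-CU part

print(NAVI10_L2_KB / NAVI10_CUS)                  # 102.4 KB of L2 per CU on Navi 10
print(2 * NAVI10_L2_KB / BIG_CUS)                 # 102.4 KB per CU with 512 KB slices
print(256 / SLICES)                               # still 16 bits of GDDR6 per slice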

Regardless, we can assume the CU-to-performance ratio has either stayed the same or improved. With PS5-level clockspeeds, and the fact that 18 Gbps GDDR6 has been demonstrated but isn't known to be shipping anywhere, we can hazard a guess at the performance metrics. Note: all performance is highly title-dependent.

SC (big): low expectations: performs anywhere from 3080 level to 15% faster than a 3090. High expectations: better than a 3080 in all tests, up to 30% faster than a 3090 (though the 3090 should be faster in id Tech titles; they really like throwing low-bandwidth shaders at the screen, it seems).

Flounder (mid-high): low expectations: performs between a 2080 Ti and a 3080. High expectations: performs between just below a 3080 and a 3090.

Cavefish (mid-low): low expectations: performs the same as a 5700 XT, but, you know, with raytracing and whatnot. High expectations: performs between a 2080 Super and a 2080 Ti.

From there we can guess that big will cost $750+, mid-high between $600-800, and mid-low $300-450.
 
Perhaps an even easier answer is that L2 cache slices simply need not be 1:1 to 16-bit GDDR6 channels?

(or 4:1 if you prefer to express in terms of 64-bit memory controller blocks)

GCN first broke the strict power-of-two arrangement with Tahiti and Tonga (HD 7970, R9 280), albeit only for the render backends. This was followed by Xbox One X (Scorpio), with a 384-bit GDDR5 bus behind 8 L2 cache slices. Many other products also don't operate at a 1:1 ratio (the HBM ones, and Renoir), albeit still under a power-of-two arrangement.
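Taking the quoted Scorpio figures at face value, the arrangement works out to a repeating 2-slices-to-3-channels grouping (which is where the "four 2x3 crossbars" mentioned later in the thread come from):

Code:
# Scorpio's non-1:1 arrangement, using the figures quoted above.
from math import gcd

slices, bus_bits, channel_bits = 8, 384, 32
channels = bus_bits // channel_bits     # 12 GDDR5 channels
g = gcd(slices, channels)
print(channels)                         # 12
print(f"{slices // g} slices : {channels // g} channels, x{g} groups")  # 2 : 3, x4 -> four 2x3 crossbars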
 
So this is an old and perhaps relevant thread, from 2015:

https://forum.beyond3d.com/threads/gpu-cache-sizes-and-architectures.56731/

And relevant posting by sebbbi in 2013:

(Slightly OT, continued from my above post)

I did some extra testing with the GCN ROP caches. Assuming the mobile versions have equally sized ROP caches as my Radeon 7970 (128 KB), it looks like rendering in 128x128 tiles might help these new mobile APUs a lot, as APUs are very much BW limited.

Particle rendering has a huge backbuffer BW cost, especially when rendering HDR particles to a 4x16f backbuffer. Our system renders all particles using a single draw call (the particle index buffer is depth sorted, and we use premultiplied alpha to achieve both additive and alpha-channel-blended particles simultaneously). It is actually over 2x faster to brute-force render a draw call containing 10k particles 60 times to 128x128 tiles (moving a scissor rectangle across a 1280x720 backbuffer) than to render it once (single draw call, full screen). And you can achieve these kinds of gains by spending 15 minutes (just a brute-force hack). With a little bit of extra code, you can skip particle quads (using a geometry shader) that do not land on the active 128x128 scissor area (and save most of the extra geometry cost). This is a good way to reduce the particle overdraw BW cost to zero. A 128x128 tile is rendered (alpha blended) completely inside the GPU ROP cache. This is an especially good technique for low-BW APUs, but it helps even the Radeon 7970 GE (with its massive 288 GB/s of BW).

With this technique, soft particles gain even more, since the full-screen depth texture reads (a 128x128 area) fit in the GCN 512/768 KB L2 cache (and become BW-free as well). Of course Kepler-based chips should see similar gains (but I don't have one for testing).

If techniques like this become popular in the future, and developers start to spend lots of time optimizing for the modern GPU L2/ROP caches, it might make larger GPU memory pools (such as the Haswell 128 MB L4 cache) less important. It's going to be interesting to see how things pan out.
Tahiti saw benefits in this "fully-tiled" benchmark, and we now have GPUs with ~6x more fillrate but only ~3x more bandwidth (the RTX 3090, of course, has tiled rasterisation, but that wouldn't be relevant to sebbbi's HDR particle test).

Still, we don't know how RDNA works and whether it uses the ROP depth/colour cache that we saw in GCN.
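A minimal sketch of the tiling trick sebbbi describes, with hypothetical set_scissor/draw_particles calls standing in for the real graphics API:

Code:
# Sketch of the tile-scissored particle pass (hypothetical API, illustrative only).
RT_W, RT_H = 1280, 720
TILE = 128                                   # sized so a tile's colour data stays in the ROP cache

def set_scissor(x0, y0, x1, y1):             # hypothetical: binds a scissor rectangle
    pass

def draw_particles():                        # hypothetical: one draw call, all 10k sorted particles
    pass

tiles = 0
for ty in range(0, RT_H, TILE):
    for tx in range(0, RT_W, TILE):
        set_scissor(tx, ty, min(tx + TILE, RT_W), min(ty + TILE, RT_H))
        draw_particles()                     # overdraw now blends inside the ROP cache
        tiles += 1

print(tiles)                                 # 10 x 6 = 60 passes, matching the post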

You know, B3D forum is epic:

Trinity and Richland are both highly bandwidth bound when the integrated GPU is used. The gaming performance scales up almost linearly when memory is overclocked. These GPUs could perform better, if they either had better/larger caches or access to faster memory.

In comparison, HD 3000 (Sandy) and HD 4000 (Ivy) are not bandwidth bound. We see only minor gains from memory overclocks. This is explained by two things: first, the Intel HD 4000 GPU is slightly slower than the Trinity/Richland GPUs, and thus processes slightly less data, and thus requires slightly less bandwidth. Secondly, Intel processors share their L3 cache (3MB-8MB) with the GPU. A large general-purpose read/write cache reduces GPU memory bandwidth requirements nicely. HD 4000 seems to be a very well balanced architecture (with regard to bandwidth). Intel did a good job.

GT3e upgrades the EU count from 16 to 40. That is 2.5x raw performance boost. In order to actually get close to that much extra performance, Intel needed to improve the bandwidth to 2.5x compared to Ivy Bridge. They could have added a quad channel DDR3 memory controller and slightly increased the memory clocks. However they chose to instead add a 128 MB L4 cache. This wasn't the cheapest choice (more research and manufacturing cost), but it was likely better for performance/watt than doubling the memory bus width.

AMD needs to do something drastic to their memory subsystem when they release Kaveri, if they intend to improve the GPU performance by 2x or more. GCN has a nice cache hierarchy compared to their old VLIW architectures (still used in Trinity and Richland). It should reduce memory bandwidth usage, but not anywhere near enough to double the performance on the current dual-channel DDR3 memory architecture. They need to either improve the caches further (larger L2 and ROP caches) and introduce a triple-channel memory controller (Nehalem already had one, and we all remember the 6GB and 12GB memory configurations), or go directly to a wide (and expensive) quad-channel memory controller (similar to Sandy Bridge-E). I doubt AMD is going to introduce a huge L4 cache like Intel did for GT3e to solve their bandwidth issues. However, AMD cannot just do nothing. If they don't get more bandwidth by some means (or save bandwidth with clever tricks / caches), they cannot improve their integrated GPU performance any more. Their current APUs are already heavily bandwidth bound.
128MB L4 :)
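For scale, the bandwidth wall sebbbi describes is just channel math for DDR3-era APUs (64-bit channels, per-pin transfer rates of the time):

Code:
# DDR3-era bandwidth options behind sebbbi's argument.
def ddr_gbps(channels, mt_per_s, channel_bits=64):
    return channels * channel_bits / 8 * mt_per_s / 1000   # GB/s

print(ddr_gbps(2, 2133))   # ~34 GB/s for a top-end dual-channel DDR3 setup
print(ddr_gbps(3, 2133))   # ~51 GB/s with a triple-channel controller
print(ddr_gbps(4, 2133))   # ~68 GB/s with a quad-channel controller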
 
No, but being on a 1:1 saves you one very complex crossbar in design.
Complexity is relative. In absolute terms, that is four 2x3 crossbars for Xbox One X (Scorpio), and presumably the same for Tahiti/Tonga’s RBEs. It is simpler in switching complexity than an Infinity Fabric router node, which IIRC routes at most 5x5 (mesh), and we have these IF nodes everywhere in AMD’s SoCs.

Bespoke crossbars are also not the only option. One could make use of the existing Infinity Fabric mesh-like setup (introduced in Vega), which likely already has the per-node switching capacity (L2/GMC, DCT, two neighbours), if not a wider data bus width.
 
Efficient L2 Cache Management to Boost GPGPU Performance

This paper, from 2019, is a direct study of GCN: it improves an existing simulator (achieving substantially closer simulated performance versus actual chip performance) and then goes on to propose a new cache architecture:

The FRC approach is based on a small auxiliary cache structure that efficiently unclogs the memory subsystem, enhancing the GPU performance up to 118% on average compared to the studied baseline. In addition, the FRC approach reduces the energy consumption of the memory hierarchy by a [sic] 57%.
 
Complexity is relative. In absolute terms, that is four 2x3 crossbars for Xbox One X (Scorpio), and presumably the same for Tahiti/Tonga’s RBEs. It is simpler in switching complexity than an Infinity Fabric router node, which IIRC routes at most 5x5 (mesh), and we have these IF nodes everywhere in AMD’s SoCs.

Bespoke crossbars are also not the only option. One could make use of the existing Infinity Fabric mesh-like setup (introduced in Vega), which likely already has the per-node switching capacity (L2/GMC, DCT, two neighbours), if not a wider data bus width.
I think there's a reason why IdF exists in addition to IcF and dedicated memory buses in AMD's chips. What's the current transfer rate of IF in the server-grade IOD?
 
https://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3

The Crystalwell enabled graphics driver can choose to keep certain things out of the eDRAM. The frame buffer isn’t stored in eDRAM for example.
The focus here, for graphics, appears to be purely textures.

On 22nm:

The Crystalwell die measures 7mm x 12mm (84mm^2)
https://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/4

20mm² on TSMC 7nm for 128MB?
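A very rough sanity check on that guess, assuming TSMC's reported ~0.027 µm² N7 high-density SRAM bit cell (my assumption; eDRAM isn't an option on N7) and ignoring all array overhead such as tags, sense amps and redundancy, which adds a lot in practice:

Code:
# Rough lower bound for 128 MB of SRAM on N7 (assumed ~0.027 um^2/bit HD cell, no overhead).
BITCELL_UM2 = 0.027
bits = 128 * 1024 * 1024 * 8
raw_mm2 = bits * BITCELL_UM2 / 1e6
print(round(raw_mm2, 1))   # ~29 mm^2 of raw bit cells alone, so 20 mm^2 looks optimistic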
 