AMD: Navi Speculation, Rumours and Discussion [2019-2020]

While GPUs will always crave bandwidth, I believe GPUs are slowly getting more latency sensitive.
RDNA 2, according to the capabilities matrix:

https://videocardz.com/newz/amd-navi-21-to-feature-80-cus-navi-22-40-cus-and-navi-23-32-cus

reduces the count of hardware threads per SIMD: 20 down to 16.

Overall I'd agree that latency is rising in relevance. A large unanswered question is how much of a GPU's latency sensitivity is self-inflicted, due to such things as an out-dated architecture or a driver that allocates resources badly.
 

I don't think that one really matters. You're also limited by the number of VGPRs (512 per SIMD for 64-lane waves). The limit is really a problem for shaders that use > 64 VGPRs (i.e. occupancy < 8, or < 4 for GCN, I think; I haven't looked at this that closely on RDNA yet, to be fair), and there 16 vs 20 doesn't really make much of a difference. Shaders for which we can reach an occupancy of 20 (24 VGPRs due to rounding) are typically short and simple, and at that point rasterization and/or wave launch speed becomes the issue.

(On RDNA1 launch speed was <= 1 wave/clock, so if the shader has fewer than 40 64-lane VALU operations and fewer than 3 texture sampling operations, you're launch-speed limited.)
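To make the arithmetic concrete, here is a minimal back-of-envelope sketch of VGPR-limited occupancy. The 512-VGPR and 20-wave figures come from the post above; the allocation granularity of 4 and the function itself are illustrative assumptions, not vendor numbers:

```python
# Back-of-envelope VGPR-limited occupancy. Figures from the post above:
# 512 VGPRs per SIMD for 64-lane waves, a cap of 20 waves per SIMD on RDNA1.
# The allocation granularity of 4 is an assumption for illustration.
VGPR_BUDGET = 512
MAX_WAVES = 20
ALLOC_GRANULARITY = 4

def occupancy(vgprs_per_wave: int) -> int:
    """Waves per SIMD that fit, given per-wave VGPR usage."""
    rounded = -(-vgprs_per_wave // ALLOC_GRANULARITY) * ALLOC_GRANULARITY  # round up
    return min(MAX_WAVES, VGPR_BUDGET // rounded)

print(occupancy(24))   # 20 waves: only short, simple shaders get here
print(occupancy(64))   # 8 waves: a 16 vs 20 wave cap is irrelevant
print(occupancy(128))  # 4 waves: clearly VGPR-limited, not cap-limited
```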
 
There's been an increase in the context expected for some architectural elements, such as the shift to static 128-register allocations for scalar registers. It's possible that, with all the additional features, VGPR usage is also gradually increasing.
On top of that, other occupancy-related pressures, like instruction buffers and demands on shared caches, are growing.
In a higher-clocked architecture, shaving a few entries off wavefront evaluation could streamline a pipeline stage as well.

Other latency measures have increased with recent generations. RDNA separated out vector writes into their own tracking category to better stream out writes, and it has a greater waitcnt capacity for memory operations in general.
 
I know I am late, but I want to chime in on the latency vs bandwidth debate. I believe that a large LLC would not only provide more bandwidth at lower latency, but it could also help AMD relax the amount of latency hiding per CU. For example, Nvidia runs 32 threads per shader to hide the very long latency of GDDR6X; with a large LLC, AMD could play the odds, assume most memory accesses hit the cache, and lower the number of threads per shader to something like 16* or 8* threads, freeing up registers and thus transistors!

*Note: I don't know for sure how many threads of latency hiding RDNA 1 has, since I haven't worked with AMD GPUs at a low level in a very long time.
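A rough sketch of the idea, with made-up latencies and hit rates (none of the numbers below are measured values):

```python
# Rough illustration: a large LLC that catches most accesses lowers the
# *average* memory latency, so fewer in-flight waves are needed to hide it.
# All numbers are placeholders, not measured values.
def avg_latency(hit_rate, llc_latency, dram_latency):
    return hit_rate * llc_latency + (1 - hit_rate) * dram_latency

def waves_to_hide(latency_cycles, issue_cycles_per_wave):
    # Waves needed so other waves' work covers one wave's stall (ceiling division).
    return int(-(-latency_cycles // issue_cycles_per_wave))

DRAM, LLC = 500, 150  # placeholder latencies in cycles
WORK = 40             # placeholder issue cycles per wave between memory stalls

print(waves_to_hide(avg_latency(0.0, LLC, DRAM), WORK))  # no cache:     13 waves
print(waves_to_hide(avg_latency(0.8, LLC, DRAM), WORK))  # 80% LLC hits:  6 waves
```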
 

The 32-threads-per-shader thing from Nvidia (and 32/64 for AMD) is how wide the SIMD is. On top of that they have SMT for latency hiding. Not sure how many for Nvidia, but for RDNA1 it is 2-20x depending on how many registers are used (they're likely going to 2-16x in RDNA2, but I don't think that is a material change).

I think the cache with lower latency is a great counterbalance for RT though (walking the BVH tree is multiple dependent loads, and hence pretty latency sensitive).
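A tiny illustration of why dependent loads make BVH traversal latency-bound; the depth and latency values are placeholders, not real RDNA figures:

```python
# Why BVH traversal is latency-bound: each node's load address depends on the
# previous node's result, so the loads serialize. Numbers are placeholders.
def traversal_cycles(depth, load_latency_cycles):
    return depth * load_latency_cycles  # dependent loads cannot overlap

DEPTH = 20  # placeholder number of dependent node fetches per ray
print(traversal_cycles(DEPTH, 500))  # DRAM-latency loads -> 10000 cycles
print(traversal_cycles(DEPTH, 150))  # LLC-latency loads  ->  3000 cycles
```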
 

It’s 16 warps max per scheduler in Ampere.
 
AMD's GPU frequencies have roughly tripled over the course of about 10 years (900 MHz HD 6870 in October 2010), while memory latencies are the same or longer.

(Well, I admit I don't know actual GDDR5 and GDDR6 latencies.)
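The effect of clocks on latency measured in cycles can be sketched quickly; the nanosecond latency and the clock values below are placeholders, not datasheet numbers:

```python
# Same absolute memory latency costs roughly 3x more cycles at roughly 3x the
# clock. The nanosecond latency and clocks below are placeholders.
def latency_in_cycles(latency_ns, clock_ghz):
    return latency_ns * clock_ghz  # ns * (cycles/ns) = cycles

LATENCY_NS = 300  # placeholder end-to-end memory latency
print(latency_in_cycles(LATENCY_NS, 0.9))  # ~900 MHz (HD 6870 era) -> 270 cycles
print(latency_in_cycles(LATENCY_NS, 2.5))  # ~2.5 GHz placeholder    -> 750 cycles
```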
 
GCN has 10 wavefronts per SIMD, RDNA1 has 20, at least some of the RDNA2-based consoles have 20, and Sienna Cichlid appears to have 16.

The different SIMD widths and cadences make comparisons more complex. A single GCN wavefront stalling on an instruction might have up to 9 other wavefronts, each with one non-stalling instruction on a 4-cycle cadence, giving 36 cycles if they each issued one instruction. An RDNA SIMD would have 19 other wavefronts, but they would be single-cycle. AMD provided a wave64 mode, which among other things emulates a 2-cycle cadence and brings the number of lanes per wave into a more equivalent relationship with a GCN wavefront. The recommendation for that mode is apparently workloads like pixel shading where latency hiding is a major need, and the 2-cycle cadence gives roughly the same number of cycles as GCN.
Sienna Cichlid reduces the amount somewhat in this scenario, although the majority of the hiding is still possible.
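Writing out that arithmetic as a simple model (it ignores memory-issue limits and instruction buffering, and assumes each resident wave issues exactly one instruction while one wave stalls):

```python
# Cycles of latency hidden when one wave stalls and every other resident wave
# issues exactly one instruction (the scenario described above).
def cycles_hidden(waves_per_simd, issue_cadence):
    return (waves_per_simd - 1) * issue_cadence

print(cycles_hidden(10, 4))  # GCN: 9 other waves, 4-cycle cadence     -> 36
print(cycles_hidden(20, 1))  # RDNA1 wave32: 19 other single-cycle     -> 19
print(cycles_hidden(20, 2))  # RDNA1 wave64: emulated 2-cycle cadence  -> 38
print(cycles_hidden(16, 2))  # Sienna Cichlid wave64, 16 waves         -> 30
```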

Not evaluated is how many memory accesses can be issued sequentially before stalling. Vega doubled the amount over earlier GCN, and RDNA further split reads and writes to allow for more flexible batching of accesses and to avoid stalls on generally fire-and-forget writes.

Wave32 would be more latency-sensitive, but it may make more sense for less parallel workloads or ones better contained in the LDS or register file.

Guys, why are you still mentioning MALL in reference to the cache subsystem?

It's part of DCN (the Display Controller) and it's about the timings of framebuffers etc.

Or am I reading that code incorrectly?
It depends on whether this MALL functionality is interpreted as applying only to the display controller, or as a larger functionality change with a subset relating to the display controller.
Changes relating to cache allocation masks and a change to page table entries seem to imply that things like depth tiling, SDMA, and other functions could be related.
https://lists.freedesktop.org/archives/amd-gfx/2020-October/055005.html
https://lists.freedesktop.org/archives/amd-gfx/2020-October/055006.html
Values introduced in the first show up in the second, which deals with MALL.
 
Only 18 hours left... I was very surprised the reviews for the 3070 came today... Nvidia gave AMD another target and price to beat...
 
Within the patch, there are declarations with the term LLC, which is usually an abbreviation for last-level cache. Combined with terms like “no alloc” and “GCMC”, these patches sound like they are adding support for memory-side LLC bypass* on a per-page basis, while some blocks (e.g. the SDMA copy engine) can override the page-level settings.

* probably like SLC=1 policy for L2: write no-allocate, read miss-evict
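If that interpretation is right, the behaviour might be loosely modelled as below; every name and field here is invented for illustration and is not the amdgpu driver's actual interface:

```python
# Purely hypothetical model of the interpretation above: a per-page "no alloc"
# bit decides whether an access allocates in the LLC, and certain clients
# (e.g. an SDMA copy engine) can override the page-level policy.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PageTableEntry:
    phys_addr: int
    llc_no_alloc: bool = False   # hypothetical per-page "no alloc" bit

def allocates_in_llc(pte: PageTableEntry, client_override: Optional[bool] = None) -> bool:
    # A block-level setting (e.g. the SDMA copy engine) overrides the page policy.
    if client_override is not None:
        return client_override
    return not pte.llc_no_alloc

rt_page = PageTableEntry(0x1000)                               # render target: allocates
streaming_page = PageTableEntry(0x2000, llc_no_alloc=True)     # streaming data: bypasses
print(allocates_in_llc(rt_page))                               # True
print(allocates_in_llc(streaming_page))                        # False
print(allocates_in_llc(streaming_page, client_override=True))  # block forces allocation
```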
 
Looks like the LLC is enabled by default, which is why we've seen so little driver stuff about it. I suspect MALL may be something like skipping the LLC so that it can be power-gated (though with it being only enabled for size < 128MiB, maybe it is the other way around: use the cache so you can stop or clock down the memory)? It sounds like it is only enabled when it is "idle", i.e. nothing changes about the image, so that very much sounds like an opportunity to shut things off.
 
Guess also worth noting that the earlier link contains a condition with a magic number: surface_size < 128 * 1024 * 1024.

So, ehm, maybe an interpretation is:
  • It has 128 MB last level cache
  • The hardware feature (& hence the flag) is called Memory Access at Last Level (MALL)
  • It can be turned on & off. (for lower idle power?)
  • The driver probably allows only render targets to allocate in the LLC in some phases, in which case the display controller can be assured that any <128MB RT to be presented always hits the LLC, and uses way tighter timing. (Eh, or maybe all the time? It is an IMR GPU after all)
  • Edit: ^^ is nonsense if you consider basics like double buffering... So maybe it is like what andermans said, MALL allows the 128MB LLC to be used as a scratchpad (hence "Memory Access"), while the GDDR6 pool is powered off?
?_?
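For what it's worth, the surface_size < 128 * 1024 * 1024 condition can be sanity-checked against common framebuffer sizes; the double-buffering factor here is my assumption, not something the driver condition states:

```python
# Sanity-checking the surface_size < 128 * 1024 * 1024 condition against common
# framebuffer sizes. The double-buffering factor is an assumption for illustration.
MALL_SIZE = 128 * 1024 * 1024

def fits_in_mall(width, height, bytes_per_pixel, buffers=2):
    return width * height * bytes_per_pixel * buffers < MALL_SIZE

print(fits_in_mall(3840, 2160, 4))  # two 4K RGBA8 buffers (~63 MiB)  -> True
print(fits_in_mall(7680, 4320, 4))  # two 8K RGBA8 buffers (~253 MiB) -> False
```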
 