AMD: Navi Speculation, Rumours and Discussion [2019-2020]

3dilettante · Oct 28, 2020

pTmdfx said:
Within the patch, there are declarations with the term LLC, which is usually an abbreviation of last-level cache. Combined with terms like “no alloc” and “GCMC”, these patches sound like they are adding support for memory-side LLC bypass* on a per-page basis, while some blocks (e.g. the SDMA copy engine) can override the page-level settings.

* probably like SLC=1 policy for L2: write no-allocate, read miss-evict

The L2 was at least implicitly the LLC in previous generations, so having a separate reference now may mean a distinct layer is present.
There may be other considerations. There are various no_alloc values, but the L2-related ones are no_alloc without LLC in the name, while others like SDMA have no-allocation values that do name the LLC.
Would that mean those flags are for bypassing the L2 in favor of a separate layer, or is it something like the L2 has no need for a redundant LLC designation since that is what it is?

pTmdfx said:
Guess also worth noting that the earlier link contains a condition with a magic number: surface_size < 128 * 1024 * 1024.

So, ehm, maybe an interpretation is:

It has 128 MB last level cache

The hardware feature (& hence the flag) is called Memory Access at Last Level (MALL)

It can be turned on & off. (for lower idle power?)

The driver probably allows only render targets to allocate in the LLC in some phases, so that the display controller can be assured that any <128MB RT to be presented always hits the LLC, and can use much tighter timing. (Eh, or maybe all the time? It is an IMR GPU after all)

Edit: ^^ is nonsense if you consider basics like double buffering... So maybe it is like what andermans said, MALL allows the 128MB LLC to be used as a scratchpad (hence "Memory Access"), while the GDDR6 pool is powered off?

?_?

Some kind of display controller self-refresh from a local memory might work.
It's a possible interpretation, although 128MB has shown up as a limit for buffers in compute or graphics in other instances.
Another possibility is that 128 * 1024 * 1024 isn't a size in bytes. Some references have values like maxTexelBufferElements = 128 * 1024 * 1024, which may explain the curious way of subdividing 2^27.
https://phabricator.pmoreau.org/file/data/5btjflw6ul4wk3qrodo2/PHID-FILE-s3kiruzwymgchgag3eid/file
That wouldn't point to a cache that's literally 128MB in size, just a possible addressing limit for some of the hardware, beyond which additional units or the driver might have to intervene. That intervention might be self-defeating, going by code that has microsecond time constants and might be for low-power operation.
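For reference, a quick Python check of the arithmetic behind that constant. It only shows that the expression equals 2^27 and how the implied size changes with the unit. The 4-byte element width is an assumption picked for illustration, not something taken from the patch.

```python
# The magic number, written the way the patch condition writes it.
LIMIT = 128 * 1024 * 1024

# It is exactly 2^27, whatever unit it is counted in.
assert LIMIT == 2 ** 27 == 134_217_728

# Read as a byte count, that is 128 MiB.
limit_mib = LIMIT / (1024 * 1024)

# Read as an element count (as in maxTexelBufferElements), the byte size
# depends on the element width. 4 bytes per element is an assumed example.
ASSUMED_BYTES_PER_ELEMENT = 4
limit_as_elements_mib = LIMIT * ASSUMED_BYTES_PER_ELEMENT / (1024 * 1024)
```

So the same expression can describe a 128 MiB capacity or an element-count limit covering a larger footprint, which is why the unit matters here.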

pTmdfx · Oct 28, 2020

3dilettante said:
The L2 was at least implicitly the LLC in previous generations, so having a separate reference now may mean a distinct layer is present.
There may be other considerations. There are various no_alloc values, but the L2-related ones are no_alloc without LLC in the name, while others like SDMA have no-allocation values that do name the LLC.

Yeah, that is what I meant by "memory-side LLC".

3dilettante said:
Would that mean those flags are for bypassing the L2 in favor of a separate layer, or is it something like the L2 has no need for a redundant LLC designation since that is what it is?

I am leaning towards these flags independently controlling the access behaviour when a request hits the memory-side LLC controller, whereas L2, as a GPU-internal cache, continues to be controlled by the instruction-level SLC bit (likewise GLC for L0 and DLC for L1). Otherwise, if one assumes LLC=L2, it would mean a slight departure from GCN/RDNA 1's approach of per-instruction cache policy selection for all levels of GPU-internal caches.
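To make that reading concrete, here is a toy Python model of the layering. Everything in it is hypothetical: the `Access` fields, the `allocates_in_llc` helper and the "block override wins over page flag" rule are my own illustration of the interpretation above, not names or behaviour taken from the patches.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Access:
    # Per-instruction policy bits for the GPU-internal caches (GCN/RDNA 1 style).
    glc: bool = False  # L0 policy bit
    dlc: bool = False  # L1 policy bit
    slc: bool = False  # L2 policy bit
    # Hypothetical per-page flag: True means "do not allocate in the LLC".
    page_llc_noalloc: bool = False
    # Hypothetical per-block override (e.g. SDMA); None means "follow the page".
    block_llc_noalloc: Optional[bool] = None


def allocates_in_llc(access: Access) -> bool:
    """Return True if this access would allocate in the memory-side LLC.

    Assumed rule: a block-level override, when present, replaces the page
    flag. The instruction-level bits are deliberately ignored here.
    """
    if access.block_llc_noalloc is not None:
        noalloc = access.block_llc_noalloc
    else:
        noalloc = access.page_llc_noalloc
    return not noalloc
```

In this model, setting SLC on an instruction changes nothing about LLC allocation, while the page flag and a block override do. That is the separation I am arguing for, under the stated assumptions.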

andermans · Oct 28, 2020

3dilettante said:
Some kind of display controller self-refresh from a local memory might work.
It's a possible interpretation, although 128MB has shown up as a limit for buffers in compute or graphics in other instances.
Another possibility is that 128 * 1024 * 1024 isn't a size in bytes. Some references have values like maxTexelBufferElements = 128 * 1024 * 1024, which may explain the curious way of subdividing 2^27.
https://phabricator.pmoreau.org/file/data/5btjflw6ul4wk3qrodo2/PHID-FILE-s3kiruzwymgchgag3eid/file
That wouldn't point to a cache that's literally 128MB in size, just a possible addressing limit for some of the hardware, beyond which additional units or the driver might have to intervene. That intervention might be self-defeating, going by code that has microsecond time constants and might be for low-power operation.

I think you're linking to Intel though, which indeed has weird limits, but I don't think I've seen that particular limit before for GCN/RDNA? The next patch, which actually initializes surface_size, clearly shows it is in bytes: https://lists.freedesktop.org/archives/amd-gfx/2020-October/055215.html

I agree, though, that the 128 Mi might only point to a cache limit; we do not know for sure. Furthermore, from just an enable/disable flag it is hard to tell what MALL actually is.
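As a rough sanity check on the byte interpretation, here is a small Python sketch comparing tightly packed surface sizes against the constant. The helper names and the pixel formats are assumptions for illustration, and it ignores pitch alignment and padding, so it is not the driver's actual computation.

```python
LIMIT_BYTES = 128 * 1024 * 1024  # the constant from the patch condition


def surface_bytes(width: int, height: int, bytes_per_pixel: int) -> int:
    """Tightly packed surface size; ignores pitch alignment and padding."""
    return width * height * bytes_per_pixel


def under_limit(size: int) -> bool:
    # Same shape as the condition: surface_size < 128 * 1024 * 1024
    return size < LIMIT_BYTES


# Assumed example surfaces: 4 bytes/pixel (32-bit) and 8 bytes/pixel (FP16 RGBA).
uhd_4k = surface_bytes(3840, 2160, 4)
uhd_8k = surface_bytes(7680, 4320, 4)
uhd_8k_fp16 = surface_bytes(7680, 4320, 8)
```

With these assumed formats, a 4K and even an 8K 32-bit surface come in under the constant, while an 8K surface at 8 bytes per pixel does not.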

3dilettante · Oct 28, 2020

andermans said:
I think you're linking to Intel though, which indeed has weird limits, but I don't think I've seen that particular limit before for GCN/RDNA? The next patch, which actually initializes surface_size, clearly shows it is in bytes: https://lists.freedesktop.org/archives/amd-gfx/2020-October/055215.html

I agree, though, that the 128 Mi might only point to a cache limit; we do not know for sure. Furthermore, from just an enable/disable flag it is hard to tell what MALL actually is.

I was giving the Intel link as an example of where that convention of breaking a size value down into that expression comes up for a non-capacity reason.

edit: nevermind, missed a parenthetical
That code seems to clarify what the source is.

Erinyes · Oct 28, 2020

Wasmachineman_NL said:
So I guess I answered my own question, more than half a year later: RDNA2 will come with AV1 decode. I wonder how Vegas will run on RDNA2? GPU acceleration would be awesome for the stuff I do.

It was more or less a given that both Ampere and RDNA2 would have AV1 decode acceleration, especially for the consoles. I expect their Cezanne APU to also get the updated decode block even though it's rumored to be based on Vega.

~9 hours to go. The wait is almost over!

CarstenS · Oct 28, 2020

Erinyes said:
~9 hours to go. The wait is almost over!

It seems like an infinitely long time.

Putas · Oct 28, 2020

Wrap yourself in a cozy fabric.

CarstenS · Oct 28, 2020

Maybe I'll open my good ol Can of Whoopass(tm) tonight.

Wasmachineman_NL · Oct 28, 2020

Has it been confirmed somewhere if RDNA2 is SIMD or VLIW?

andermans · Oct 28, 2020

The instruction set should be almost identical to Navi1x (which is pretty close to GCN), so no VLIW involved.

BRiT · Oct 28, 2020

Dedicated thread for the event and Navi Reveal: https://forum.beyond3d.com/threads/amd-radeon-rdna2-navi-2020-10-28.62091/
