Actually, shader interlocks are supported on recent AMD HW. The reason they aren't exposed in either the GL or VK drivers is that using them is a bad idea: executing critical sections is a high-latency operation. Their recommendation is that you're better off using per-pixel linked lists for arbitrary blending or OIT.
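For reference, the linked-list approach recommended there looks roughly like the sketch below (GLSL fragment shader; the buffer/image names and layouts are placeholders of mine):
Code:
// Append pass: push this fragment onto the pixel's linked list, no ordering required.
layout(binding = 0, r32ui) coherent uniform uimage2D head_pointers; // per-pixel list heads
layout(binding = 0) uniform atomic_uint node_counter;               // global node allocator

struct Node { vec4 color; float depth; uint next; };
layout(std430, binding = 1) buffer NodeBuffer { Node nodes[]; };

void append_fragment(vec4 color, float depth)
{
    uint node = atomicCounterIncrement(node_counter);   // allocate a node
    uint prev = imageAtomicExchange(head_pointers, ivec2(gl_FragCoord.xy), node); // swap in new head
    nodes[node].color = color;
    nodes[node].depth = depth;
    nodes[node].next  = prev;                            // link to previous head
}
A separate resolve pass then sorts and blends each pixel's list, so no critical section is needed during rasterization.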
Ordered shader interlock is implemented, you mean? And only on Vega / RDNA family. (SOPP, S_SENDMSG, MSG_ORDERED_PS_DONE).
There is no native unordered shader interlock support, and the ordered one appears to be hard-wired with severe implications for rasterization efficiency (not just the denoted critical section is blocked off, but the whole work generation is stalled).
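For context, on hardware and drivers where it is exposed, the ordered interlock looks like this from the GL side (GL_ARB_fragment_shader_interlock); just a minimal sketch with placeholder image names:
Code:
#extension GL_ARB_fragment_shader_interlock : require
layout(pixel_interlock_ordered) in;   // ordered variant; pixel_interlock_unordered also exists

layout(binding = 0, rgba8) coherent uniform image2D color_buf;

void main()
{
    beginInvocationInterlockARB();    // enter the per-pixel critical section, in primitive order
    vec4 dst = imageLoad(color_buf, ivec2(gl_FragCoord.xy));
    // ... programmable blending against dst would go here ...
    imageStore(color_buf, ivec2(gl_FragCoord.xy), dst);
    endInvocationInterlockARB();      // leave the critical section
}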
With the instructions supported so far, you could only construct unordered critical section support by using a mix of atomics and sleeps. That's the worst-case scenario, as you get serialized execution with random latency in between the serialized parts on top.
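Spelled out, such a construction would look roughly like this (the lock buffer and the sleep() call are placeholders of mine; there is no real intrinsic for the latter):
Code:
layout(std430, binding = 2) buffer LockBuffer { uint lock; }; // 0 = free, 1 = taken

void critical_section()
{
    // Keep trying to take the lock; every loser serializes behind the current owner.
    while(atomicCompSwap(lock, 0u, 1u) != 0u) {
        sleep();                  // placeholder, S_SLEEP is not exposed by any intrinsic
    }
    // ... exclusive work here ...
    atomicExchange(lock, 0u);     // release the lock
}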
Special CS support with first-to-arrive logic is actually simpler (as you may cache the guard value locally once shared mode has been reached), but still inefficient to implement in software:
Code:
// First-to-arrive guard: 0 = uninitialized, 1 = initialization in progress, 2 = ready
if(init_guard == 2) {
    // NOP, lucky cache hit: already initialized
} else {
    int state = atomicCompSwap(init_guard, 0, 1);
    if(state == 0) {
        // We won the race: initialize, then publish the result
        init();
        atomicExchange(init_guard, 2);
    } else while(state != 2) {
        // Someone else is initializing: wait until they publish
        sleep();
        state = atomicCompSwap(init_guard, 2, 2);
    }
}
With the sleep instruction (SOPP S_SLEEP) not exposed by any intrinsic, the atomicCompSwap loop is still a bad choice. So there has to be some hardware arbitration, or at least an intrinsic, to handle this properly without an (unthrottled) spin-lock.
The whole thing is then probably interleaved with memory management. There is no visible page fault handler in RDNA, but in order to provide the benefits described in the patent, that logic has to operate at least on a virtual memory segment which is subject to being dropped on L2 cache eviction. LDS and GDS don't fit the size requirements, and spilling to main memory defeats the point of using texture compression.
At least for RDNA 1.0, I don't see such a capability documented yet, but it doesn't sound too far off either.
For the purpose of texture space shading, sub-allocations linked from an instanced lookup table should suffice. Effectively the good old tiled / partially resident texture, but with a device-managed allocation strategy.
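In shader terms, that indirection could look roughly like this (a sketch; the page table, tile pool and tile size are placeholders of mine):
Code:
layout(binding = 1) uniform usampler2D page_table;    // one texel per virtual tile -> physical slice index
layout(binding = 2) uniform sampler2DArray tile_pool; // the sub-allocated physical tiles

const ivec2 TILE_SIZE = ivec2(64);

vec4 sample_tiled(ivec2 virt_texel)
{
    ivec2 tile  = virt_texel / TILE_SIZE;              // which virtual tile
    uint  phys  = texelFetch(page_table, tile, 0).r;   // indirection through the lookup table
    ivec2 local = virt_texel % TILE_SIZE;               // position inside the tile
    return texelFetch(tile_pool, ivec3(local, int(phys)), 0);
}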