AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Huh, an interesting idea. Take JPEG XL, the JPEG committee's finally produced next-generation spec, as an example compression scheme: 20:1 or better compression ratios with few artifacts, at the cost of... whatever the decompression shader costs. Could easily be worth it for the right titles: tons of ultra-high-res textures with zero pop-in for the cost of X performance. (For scale: a 4096×4096 RGBA8 texture is 64 MiB raw; at 20:1 that shrinks to just over 3 MiB.)

I wonder if this will actually make it into PS5/Xsx, or if this is just some random patent they felt like applying for.
 
I wonder if this will actually make it into PS5/Xsx, or if this is just some random patent they felt like applying for.
Pretty sure this is already in. There isn't much missing. The first-stage lookup is effectively provided by the normal texturing hardware; the second stage is textbook tiled resources, but with sparse / transient tiles. From there on, what's missing is essentially a "cache line lock" intrinsic, providing a concurrency-safe approach to filling in the tiled resource tile by tile.

I assume even the latter has almost been in place already. It wouldn't actually be tied to the L1/L2 cache, but rather to an arbitrary memory region whose initialization is protected under a critical section. The required feature is to block until the generating shader is done, and to invoke the generating shader only when not hitting the cache.
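
To illustrate, here is roughly what the two-stage read looks like from the shader side. This is only a sketch in GLSL; the page-table encoding, tile counts, and all names (sampleVirtual, page_table, tile_cache) are made up for the example:
Code:
#version 450
// Stage 1: ordinary texture fetch from a low-res indirection table.
// Stage 2: read from the physical tile cache that tiles get decoded into.
layout(binding = 0) uniform usampler2D page_table;  // one entry per virtual tile
layout(binding = 1) uniform sampler2D  tile_cache;  // atlas of resident tiles

layout(location = 0) in vec2 v_uv;
layout(location = 0) out vec4 o_color;

const vec2 VIRT_TILES  = vec2(256.0);  // virtual texture size in tiles (assumed)
const vec2 CACHE_TILES = vec2(64.0);   // cache atlas size in tiles (assumed)

vec4 sampleVirtual(vec2 uv)
{
    uint entry = texture(page_table, uv).x;  // stage 1: plain texturing hardware
    if ((entry & 0x80000000u) == 0u)         // top bit = resident (invented encoding)
        return vec4(0.0);                    // miss: tile not generated yet
    uint slot = entry & 0x7FFFFFFFu;
    vec2 slot_uv = vec2(slot % uint(CACHE_TILES.x), slot / uint(CACHE_TILES.x));
    vec2 in_tile = fract(uv * VIRT_TILES);   // position within the tile
    // (filtering across tile borders ignored for brevity)
    return texture(tile_cache, (slot_uv + in_tile) / CACHE_TILES);  // stage 2
}

void main()
{
    o_color = sampleVirtual(v_uv);
}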

And what makes me suspect that AMD is adding these capabilities? Because it's a building block for another extension AMD can't provide on GCN / RDNA yet: VK_EXT_fragment_shader_interlock (specifically, see beginInvocationInterlockARB() and endInvocationInterlockARB() in
https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_fragment_shader_interlock.txt ). In other words: device-wide critical sections on "arbitrary" tags.
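
For reference, the fragment-level form of that extension looks like this in GLSL; a minimal programmable-blending sketch against the ARB spec, not anything AMD-specific:
Code:
#version 450
#extension GL_ARB_fragment_shader_interlock : require

layout(pixel_interlock_ordered) in;             // ordered per-pixel critical section
layout(binding = 0, rgba8) uniform coherent image2D color_buf;
layout(location = 0) in vec4 src;               // premultiplied source color

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    beginInvocationInterlockARB();              // no other invocation touches this pixel past here
    vec4 dst = imageLoad(color_buf, p);
    imageStore(color_buf, p, src + dst * (1.0 - src.a));  // custom "over" blend
    endInvocationInterlockARB();                // leave the critical section
}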

The patent just describes a clever application of that missing feature, or rather of its generalized form, which applies to tags other than fragments.

Actually, that application might have been devised in the process of implementing ROP-independent, device-wide critical sections. Sounds like a typical AMD move: head straight for a generic hardware implementation (not patentable on its own), then figure out what else it could be good for later on.
 
And what makes me suspect that AMD is adding these capabilities? Because it's a building block for another extension AMD can't provide on GCN / RDNA yet: VK_EXT_fragment_shader_interlock (specifically, see beginInvocationInterlockARB() and endInvocationInterlockARB() in
https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_fragment_shader_interlock.txt ). In other words: device-wide critical sections on "arbitrary" tags.

The patent just describes a clever application of that missing feature, or rather of its generalized form, which applies to tags other than fragments.

Actually, that application might have been devised in the process of implementing ROP-independent, device-wide critical sections. Sounds like a typical AMD move: head straight for a generic hardware implementation (not patentable on its own), then figure out what else it could be good for later on.

Actually, shader interlocks are supported on recent AMD HW. The reason they aren't exposed in either the GL or VK drivers is that it's a bad idea to use them, since executing critical sections is a high-latency operation. Their recommendation is that you're better off using linked lists for doing arbitrary blending or OIT.
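
For context, the linked-list alternative they recommend needs nothing beyond atomics. A rough GLSL build pass (buffer sizing and the later sort/resolve pass omitted; all bindings and names are just for the example):
Code:
#version 450
layout(binding = 0, r32ui) uniform coherent uimage2D head_ptr;  // per-pixel list head
layout(binding = 0) uniform atomic_uint node_count;             // node allocator

struct Node { vec4 color; float depth; uint next; };
layout(std430, binding = 1) buffer NodePool { Node nodes[]; };  // pre-sized node pool

layout(location = 0) in vec4 frag_color;

void main()
{
    uint idx = atomicCounterIncrement(node_count);  // allocate (overflow check omitted)
    // Swap ourselves in as the new list head; unordered, but race-free:
    uint prev = imageAtomicExchange(head_ptr, ivec2(gl_FragCoord.xy), idx);
    nodes[idx].color = frag_color;
    nodes[idx].depth = gl_FragCoord.z;
    nodes[idx].next  = prev;
    // A fullscreen resolve pass later walks and sorts each pixel's list to blend.
}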
 
Actually, shader interlocks are supported on recent AMD HW. The reason they aren't exposed in either the GL or VK drivers is that it's a bad idea to use them, since executing critical sections is a high-latency operation. Their recommendation is that you're better off using linked lists for doing arbitrary blending or OIT.
Ordered shader interlock is implemented, you mean? And only on the Vega / RDNA family. (SOPP, S_SENDMSG, MSG_ORDERED_PS_DONE.)

There is no native unordered shader interlock support, and the ordered one appears to be hard-wired, with severe implications for rasterization efficiency (not just the denoted CS is blocked off, the whole work generation is stalled).
With the instructions supported so far, you could only construct unordered CS support using a mix of atomics and sleeps. Worst-case scenario, as you get serialized execution with random latency in between the serialized parts on top.

Special CS support with first-to-arrive logic is actually simpler (as you may cache locally once the shared mode has been reached), but still inefficient to implement in software:
Code:
// init_guard states: 0 = uninitialized, 1 = init in progress, 2 = initialized
if(init_guard == 2) {
   // NOP, lucky cache hit
} else {
   // try to claim the initialization: 0 -> 1
   int state = atomicCompSwap(init_guard, 0, 1);
   if(state == 0) {
       init();                           // we won the race
       atomicExchange(init_guard, 2);    // publish completion
   } else {
       while(state != 2) {               // someone else is initializing
           sleep();
           state = atomicCompSwap(init_guard, 2, 2);   // atomic re-read
       }
   }
}
With the sleep instruction (SOPP S_SLEEP) not exposed by any intrinsic, an atomicCompSwap loop is still a bad choice. So there has to be some hardware arbitration, or at least an intrinsic to handle that properly without an (unthrottled) spin-lock.

The whole thing is then probably interleaved with memory management. There is no visible page fault handler in RDNA, but in order to provide the benefits described in the patent, that logic at least has to operate on a virtual memory segment which is subject to being dropped on L2 cache eviction. LDS and GDS don't fit the size requirements, and spilling to main memory defeats the point of using texture compression.
At least for RDNA 1.0, I don't see such a capability documented yet, but it doesn't sound too far off either.
For the purpose of texture-space shading, sub-allocations linked from an instanced lookup table should suffice. Effectively good old tiled / partially resident textures, but with a device-managed allocation strategy.
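
A device-managed allocation strategy can be as simple as an atomic bump allocator over the physical tile pool. A sketch, assuming the first-to-arrive guard above guarantees that a single winner calls this per tile (capacity, encoding, and names all invented):
Code:
layout(std430, binding = 2) buffer Allocator { uint next_free; };     // bump pointer
layout(binding = 0, r32ui) uniform coherent uimage2D page_table_img;  // lookup table

const uint CACHE_CAPACITY = 4096u;  // physical tile slots (assumed)

uint allocateTile(ivec2 virt_tile)
{
    uint slot = atomicAdd(next_free, 1u);
    if (slot >= CACHE_CAPACITY)
        return 0xFFFFFFFFu;         // pool exhausted: an eviction policy goes here
    // Publish the slot with the resident bit set, so readers see a valid entry:
    imageAtomicExchange(page_table_img, virt_tile, slot | 0x80000000u);
    return slot;
}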
 
In the past, some games used NVIDIA's CUDA cores to do texture decompression, like Rage 2 and Wolfenstein: The Old Blood, with no measurable hit to performance, though the advantages were limited too.
It's still used these days, just for other tasks; for example, nvJPEG is used to make NN training faster.
There is even an nvJPEG hw decompression block in the A100 for these purposes :D
 
Ordered shader interlock is implemented, you mean? And only on the Vega / RDNA family. (SOPP, S_SENDMSG, MSG_ORDERED_PS_DONE.)

Yes ...

There is no native unordered shader interlock support, and the ordered one appears to be hard-wired, with severe implications for rasterization efficiency (not just the denoted CS is blocked off, the whole work generation is stalled).
With the instructions supported so far, you could only construct unordered CS support using a mix of atomics and sleeps. Worst-case scenario, as you get serialized execution with random latency in between the serialized parts on top.

'Unordered' shader interlock doesn't require any special HW support, since it was the "default case" prior to ordered interlocks. Any HW/driver combination can implement unordered interlocks with UAVs or images by doing atomic R/W ops on those resources and then observing the resulting race conditions afterwards.
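
That is, something like the following, where a commutative update needs no ordering at all (sketch only, names invented):
Code:
#version 450
layout(binding = 0, r32ui) uniform coherent uimage2D accum;  // e.g. additive coverage

layout(location = 0) in vec4 frag_color;

void main()
{
    // Order doesn't matter for addition, so a plain image atomic suffices;
    // any non-commutative read-modify-write would expose exactly those races.
    imageAtomicAdd(accum, ivec2(gl_FragCoord.xy), uint(frag_color.a * 255.0));
}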

The reason interlocks are a bad idea on AMD HW is that you're trading decreased memory bandwidth consumption for decreased parallelism, so it has a crappy payoff for them in the end. It cannot be a good idea performance-wise to stall fragment shader execution, unless you're Intel or one of those tiler GPUs you'd see on mobile devices.

I think we might've painted ourselves into a corner with shader interlocks, since they have massive performance implications for future discrete GPU HW designs ...
 
Better performance per watt was one of the expected RDNA2 benefits, especially since they need to put it into the next mobile APUs and have licensed it to Samsung for mobile GPUs.

Tinkering with voltage domains and sleep states and such should be pretty par for the course then. Wonder what "display/video core next 3.0" and the like will entail. AV1 decode, I hope?
 

The slide is not from AMD; it's Locuza's own analysis of the patches.
I did find multiple clock domains for DCN 3.0 and VCN 3.0 in the source code.
The SDMA engine is indeed updated, to v5.2.
The SMU is updated as well.
But it's hard to say anything about the important parts.

There was a second graphics queue referenced in driver commits for Navi as well, though it would frequently be left off or deactivated. Not sure what would distinguish it for the successor.
The ACE queue count being 4 sounds like a possible reduction if confirmed.
The ACE count and at least one reference to 128-bit GDDR6 sound more appropriate for a portable or lower-range product.
 
The ACE queue count being 4 sounds like a possible reduction if confirmed.
From the RDNA architecture whitepaper, what I could find is that Navi10 has four ACEs. Each ACE handles one shader array.

The ACE count and at least one reference to 128-bit GDDR6 sound more appropriate for a portable or lower-range product.
Yeah, I saw this too, but it's only for emulation mode, so I'm not sure what to make of it. I think it will disappear soon, because the value is usually read from the firmware.

Some things I could quickly glean from the patches:
  • DCN and VCN indeed seem to be major changes. VCN 2.0 was introduced with Navi, and now both DCN and VCN are at 3.0.
  • There are two clock sources for VCN and two clock sources for DCN in the patches.
  • There are 2 additional SDMA engines, for a total of 4 compared to 2 on Navi10.
  • Firmware identifies whether the chip is air-cooled or liquid-cooled.
  • XGMI support!? So far this was seen only for Arcturus, so I wonder what is up here. It means the chip is foreseen to link up with other chips to share workloads.
  • PP table clocks conveniently removed, with a TODO.
  • New PCI audio device.
 
From the RDNA architecture whitepaper, what I could find is that Navi10 has four ACEs. Each ACE handles one shader array.
There has been only one ACE since early GCN, and there still is only one. What's shown as 2 ACEs is a single core with 2x SMT, where each thread polls a number of queues.

AMD's presentation of that implementation detail is more artistic freedom than anything else.
 
The ACE count and at least one reference to 128-bit GDDR6 sound more appropriate for a portable or lower-range product.
Playstation 4 Portable confirmed!

:runaway::runaway::runaway:

XGMI support!? So far this was seen only for Arcturus, so I wonder what is up here. It means the chip is foreseen to link up with other chips to share workloads.
Well, one of my theories for how the PC could catch up with the next-gen consoles in I/O speed is that future graphics cards may get a direct connection to a fast SSD, without having to send data through main system RAM.
 