I now understand the difficulties of getting two GPUs to scale properly in gaming. I quoted the paragraph above because it is most of what I was getting at. Is it that two command processors can't ever have such functionality, or that there is no economic reason to get that elaborate and add it, etc.?
AMD doesn't comment much about the command processor or the related controllers that make up many parts of the control logic of the GPU.
Some LinkedIn posts and the PS4 hack discuss how the command processor and ACEs are custom "F32" microprocessors: simple cores with a straightforward set of operations that load command packets, reference them against a loaded microcode store, and then perform the defined actions or set hardware state, interrupts, or internal signals.
These slides discuss the multiple processors that make up the "command processor" for the PS4:
https://fail0verflow.com/media/33c3-slides/#/74
Radeon driver notes discuss how the command processor contains multiple internal processors, and other mentions indicate the ACEs also have a history of using F32 cores.
GFX10 mentions yet another microcontroller in the command processor, the MES, which appears to schedule the command processor's main elements like the ME and PFP.
I've tried searching for older slides to confirm how far back this architecture goes, with limited success. There are vague allusions to the custom command processor as far back as the VLIW days with Cypress.
There might have been a reference to a custom RISC-like core for graphics chips even prior to that, but I cannot find it now.
There's nothing theoretically preventing multiple controllers from cooperating, if the architecture makes room for it, but AMD hasn't discussed such a use case. Descriptions of these cores' behavior involve multiple front ends, some arranged in a hierarchy, all talking to their local hardware or each other over internal paths, with hidden state that never leaves their little domains or the same device. Together they manage segments of a big black box of graphics state, many parts of which never migrate off the chip.
If AMD wanted to create an SMP-capable form of an architecture that appears to predate GCN, some/all of Terascale, and even somewhat coherent GPU memory, I suppose it could invest in doing so.
The last time something like this was sort of asked, AMD opted for explicit multi-GPU, which is closer to saying "treat each front end as an isolated stupid slave device and manage it with the API".
Since the latest leadership took over, AMD's stance is that multi-GPU won't happen unless they can make the GPUs appear as a single unit--but with no mention of how they intend to do so or if they are seriously evaluating it at present.
What choices they'd make for the architecture, and what sort of problem space hides in the GFX details AMD doesn't talk about, is unknown at this point.
One option here would be to assign regions of memory to be writable only from a certain chiplet at a time, with that region invisible to the others (e.g. parts of the current framebuffer).
Texture / mesh / last-frame data could be in regions declared read-only. That way, no cache synchronization would be necessary.
Many of these techniques want that prior-frame data, so making it invisible would prompt them to error out or pull in garbage data. If the hardware sits on a barrier or lock until the data is ready, it's back to the problem of heavy synchronization and spin-up latency that currently exists. Also, at some point these areas need to be made writable in order to fill them.
This also leaves unexplained how attributes like "read-only" are attached to these regions and how ownership is handled. A popular way is to use page table attributes, but this is not trivial: with TLBs in play, flipping something between writable and read-only is already a serious pain for OS writers and CPUs, the TLB being a system-critical resource that even the usually coherent x86 CPU domain does not treat as coherent.