I now understand the difficulties of getting two GPUs to scale properly in gaming. I quoted the paragraph above because it is most of what I was getting at. Is it that two command processors can't ever have such functionality, or that there is no economic reason to get that elaborate and add it, etc.?
AMD doesn't comment much about the command processor or the related controllers that make up many parts of the control logic of the GPU.
Some LinkedIn posts and the PS4 hack discuss how the command processor and ACEs are custom "F32" microprocessors: simple cores with a straightforward set of operations that load command packets, reference them against a loaded microcode store, and then perform the defined actions or set hardware state, interrupts, or internal signals.
These slides discuss the multiple processors that make up the "command processor" for the PS4:
https://fail0verflow.com/media/33c3-slides/#/74
Radeon driver notes discuss how the command processor contains multiple internal processors, and other mentions indicate the ACEs also have a history of using F32 cores.
GFX10 mentions yet another microcontroller in the command processor, the MES, which appears to schedule the command processor's main elements like the ME and PFP.
I've tried searching for older slides to confirm how far back this architecture goes, with limited success. There are vague allusions to the custom command processor as far back as the VLIW days with Cypress.
There might have been a reference to a custom RISC-like core for graphics chips even prior to that, but I cannot find it now.
There's nothing theoretically preventing multiple controllers from cooperating, if the architecture makes room for it, but AMD hasn't discussed such a use case. Descriptions of these cores' behavior involve multiple front ends, some arranged in a hierarchy, all talking to their local hardware or each other over internal paths, with hidden state that never leaves their little domains or the same device. Together they manage segments of a big black box of graphics state, many parts of which never migrate off the chip.
If AMD wanted to create an SMP-capable form of an architecture that appears to predate GCN, some/all of Terascale, and even somewhat coherent GPU memory, I suppose it could invest in doing so.
The last time something like this was sort of asked, AMD opted for explicit multi-GPU, which is closer to saying "treat each front end as an isolated stupid slave device and manage it with the API".
Since the latest leadership took over, AMD's stance is that multi-GPU won't happen unless they can make the GPUs appear as a single unit--but with no mention of how they intend to do so or if they are seriously evaluating it at present.
What choices they'd make for the architecture, and what sort of problem space hides in the GFX details AMD doesn't talk about, is unknown at this point.
One option here would be to assign regions of memory to be writable only from a certain chiplet at a time, with that region invisible to the others (e.g. parts of the current framebuffer).
Texture / mesh / last-frame data could be in regions declared read-only. That way, no cache synchronization would be necessary.
Many of these techniques want that prior-frame data, so making it invisible would prompt them to error out or pull in garbage data. If the hardware sits on a barrier or lock until the data is ready, it's back to the problem of heavy synchronization and spin-up latency that currently exists. Also, at some point these areas need to be made writable in order to fill them.
This also leaves unexplained how attributes like "read-only" are attached to these regions and how ownership is handled. A popular way is to use page table attributes, but this is not trivial: with TLBs in play, flipping something between writable and read-only is already a serious pain for OS writers and CPUs, the TLB being a system-critical resource that even the usually coherent x86 CPU domain does not treat as coherent.