I don't know what RISC cores you're referring to. I don't know the architecture of Maxwell's command processing. Is there parallelism there for command processing?
Are we seeing a combination of a parallelised command processor + coalescing?
Internally, the front ends run microcode on custom RISC or RISC-like processors. This was stated explicitly back in the VLIW days in a few AMD presentations, and it is also mentioned here:
http://www.beyond3d.com/content/reviews/52/7.
Later resources like the Xbox SDK discuss a microcode engine, though they also mention a prefetch processor that feeds into it. How new that is, I'm not sure.
There are vendors of inexpensive, licensable, and customizable RISC cores for this sort of thing. The other command front ends likely have something similar, possibly with support for the graphics-specific functionality stripped out.
Various custom units probably have them as well, like the DMA engines. UVD and TrueAudio are the most obvious examples of licensed Tensilica cores, though I forget who is the go-to company for the sort of cores used in the command processors.
It looks like the processor reads in a queue command, pulls in the necessary program from microcode, runs it, then fetches the next command. Within a queue, this seems pretty serial. There is parallelism for compute, at least between queues, but not within one.
The ACEs can be used freely in parallel, and they are a subset of command processing. I guess the more complex graphics state hinders doing the same for graphics, although AMD seems to have been moving towards tidying that up if preemption is coming.
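To make that concrete, here's a toy model of the serial-within-a-queue, parallel-across-queues behaviour described above. The packet format, opcodes, and microcode routines are invented for illustration and don't correspond to AMD's actual packet spec.

```cpp
// Toy model: each queue is drained strictly in order (fetch a packet, look up
// its microcode routine, run it, fetch the next), while separate queues, e.g.
// ones owned by different ACEs, can make progress independently.
#include <cstdio>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <thread>

struct Packet { std::string opcode; int payload; };

// "Microcode store": each opcode maps to the small routine the front end runs.
static const std::map<std::string, std::function<void(int)>> kMicrocode = {
    {"DISPATCH",   [](int groups) { std::printf("dispatch %d workgroups\n", groups); }},
    {"WRITE_DATA", [](int value)  { std::printf("write %d to memory\n", value); }},
};

// Serial within one queue: no packet starts before the previous one finishes.
void drain_queue(const std::string& name, std::queue<Packet> q) {
    while (!q.empty()) {
        Packet p = q.front();
        q.pop();
        std::printf("[%s] ", name.c_str());
        kMicrocode.at(p.opcode)(p.payload);
    }
}

int main() {
    std::queue<Packet> q0, q1;
    q0.push({"DISPATCH", 64});
    q0.push({"WRITE_DATA", 1});
    q1.push({"DISPATCH", 128});
    q1.push({"WRITE_DATA", 2});

    // Parallelism between queues, modelled here as two threads.
    std::thread t0(drain_queue, "queue0", q0);
    std::thread t1(drain_queue, "queue1", q1);
    t0.join();
    t1.join();
    return 0;
}
```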
Nvidia's solution was not described so plainly, but it has to do the same things, so there's some kind of simple processor hiding inside the front end. AMD and Nvidia at this time do make non-RISC cores, but simple cores have sufficed until recently.
I have no idea how you make the leap to an order of magnitude.
I admit that was probably an overestimate for rhetorical effect.
The 980's submission time is 3.3 ms, while the 290X's is 4.8 ms and 9.0 ms for DX12 and Mantle, respectively.
Somehow, 4.2 ms of Oxide's work (the gap between the 290X's Mantle and DX12 figures) is being fit into a fraction of that 3.3 ms, and a large portion of that 3.3 ms is going to be devoted to the actual submission process rather than to analyzing batch contents.
Either Nvidia is very good at doing what Oxide is doing, or its inherent overhead is low enough to leave that much leeway for the analysis, or some combination of the two.
"only allows one to be used by the developer". Let's see if there really is a second processor and if it's any use for games... It's why I used the word "apparently".
I'm pretty sure there is. Whether it can be readily exposed to games, I'm not sure. It would have benefits in a system that is split into two guest VMs.
So it seems to me that AMD's recently moved in the wrong direction with regard to driver interaction with GPU command processing. I'm not making excuses, merely pointing out that there's plenty of room to manoeuvre.
It is unfortunate that this is still such an acute problem, given how important freely mixing different workloads is to AMD's vision of the future.
It would be interesting to compare CPU load in a D3D12 benchmark like this with the GPU framerate held the same on both AMD and NVidia (a locked 20 fps, say), so that the application's draw-call rate is known and we can observe whether one driver is doing more work on the CPU.
If the AI load could be ramped up, it could also be approached from the other direction: keep loading the CPU until the frame rate suffers. It doesn't seem like anything happens with all the core cycles that get freed up.
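A minimal sketch of the locked-framerate comparison, assuming the cap is enforced in the application: submit_frame() is a stand-in for the benchmark's per-frame draw-call submission, and the timing is plain wall-clock on the submission thread rather than a proper per-thread CPU counter, so treat it as illustrative only.

```cpp
// Cap presentation at 20 fps and log how much of each 50 ms frame budget the
// CPU-side submission path consumes; with the draw-call rate pinned, a smaller
// "busy" share on one vendor suggests its driver does less CPU work.
#include <chrono>
#include <cstdio>
#include <thread>

using Clock = std::chrono::steady_clock;

// Placeholder for one frame's worth of draw-call submission.
void submit_frame() {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
}

int main() {
    constexpr auto kFrameBudget = std::chrono::milliseconds(50);  // 20 fps cap
    for (int frame = 0; frame < 20; ++frame) {
        const auto start = Clock::now();
        submit_frame();
        const auto busy = Clock::now() - start;

        std::printf("frame %2d: %6.2f ms busy of %.0f ms budget\n", frame,
                    std::chrono::duration<double, std::milli>(busy).count(),
                    std::chrono::duration<double, std::milli>(kFrameBudget).count());

        // Sleep off the rest of the frame so the frame rate stays locked.
        if (busy < kFrameBudget)
            std::this_thread::sleep_for(kFrameBudget - busy);
    }
    return 0;
}
```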