For Maxwell, on the other hand, the driver is unaware that a compute engine was "exposed"; it only provides access to a single instance of the 3D engine (and obviously the copy engine etc.). Nvidia isn't lying when it says it never enabled the compute engine for Maxwell and older generations.
So when an application asks the driver to allocate a queue, or to use one, the driver doesn't receive the request?
What you don't see is the compatibility/emulation layer that is part of the D3D12 runtime Microsoft patched in, which provides scheduling for hardware where over-allocation of the queues provided by the driver becomes necessary.
Then why can't Fermi support be finalized even in the absence of Nvidia's drivers?
For hardware exposing multiple queues / engine instances, this layer only acts as a message broker for synchronization via fences. For hardware that doesn't, it also performs arbitration.
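To make the arbitration part concrete, here's a toy sketch of the idea (all names are made up, and this is in no way Microsoft's actual scheduler): several application-visible queues get time-sliced onto a single engine, so submissions that look concurrent to the app end up serialized on the hardware.

```cpp
// Conceptual illustration only: a round-robin arbiter feeding several
// application-visible queues into one hardware engine. Hypothetical names,
// not the real D3D12 runtime code.
#include <deque>
#include <iostream>
#include <string>
#include <vector>

struct CommandBatch {             // stand-in for one ExecuteCommandLists call
    std::string label;
};

using SoftwareQueue = std::deque<CommandBatch>;

// Drain N app-visible queues onto a single engine in round-robin order.
void Arbitrate(std::vector<SoftwareQueue>& appQueues) {
    bool workLeft = true;
    while (workLeft) {
        workLeft = false;
        for (auto& q : appQueues) {
            if (q.empty()) continue;
            workLeft = true;
            std::cout << "engine executes: " << q.front().label << "\n";
            q.pop_front();        // the single engine runs batches serially
        }
    }
}

int main() {
    std::vector<SoftwareQueue> queues(2);
    queues[0] = { {"gfx A"}, {"gfx B"} };          // "direct" queue submissions
    queues[1] = { {"compute A"}, {"compute B"} };  // "compute" queue submissions
    Arbitrate(queues);   // everything ends up serialized on one engine
}
```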
Going by AMD's HSA work and other driver documentation, multiple hardware queues and engines are exposed and instantiated by the driver. Does it forget all of this information later and allow a compatibility layer that doesn't have any of that data to take over?
Why is that? If GCN has 8 ACEs, and each can handle 8 queues, we're talking about being able to schedule 64 queues to the compute units. Even the RX 480 is able to handle up to 36 queues.
The 64-queue figure is at least partly influenced by Sony's desire to trick out its console with room to grow for middleware and system services.
It's admitted to be overkill.
The CU count does not determine how many queues a GPU can handle. The minimum granularity would be one wavefront, and a single CU can support 40 of those.
Even then, with HWS and oversubscription directed by the front end and driver, the GPU can track far more than 64.
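Just to put numbers on that, using the figures quoted in this thread (so treat the constants as illustrative, not a spec dump):

```cpp
// Back-of-the-envelope numbers from this discussion, not an authoritative spec.
#include <iostream>

int main() {
    const int aces         = 8;   // ACEs on a big GCN part
    const int queuesPerAce = 8;   // hardware queue slots per ACE
    const int cus          = 36;  // CU count of an RX 480
    const int wavesPerCu   = 40;  // wavefront slots a single CU can track

    std::cout << "hardware compute queues: " << aces * queuesPerAce << "\n"; // 64
    std::cout << "wavefronts in flight:    " << cus * wavesPerCu    << "\n"; // 1440
}
```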
Currently, the workloads being demanded of compute don't lean on it heavily enough for developers to feel the need for more than one queue; Doom's developers said that, for the purposes of their engine, it didn't matter. The front-end processor's handling of a dispatch command is only one small part of the process: because a single command can spawn a large amount of back-end work, one queue can push through a lot of commands in a limited number of clock cycles, feeding shaders whose wavefronts can last for milliseconds from initial launch to final release.
What having more queues would be about is handling general mixes of compute and graphics where there are differing behaviors in terms of synchronization and burstiness. The current use cases within an application can generally be satisfied with a direct queue and one compute queue: the game or benchmark has a set of operations with a reasonable set of inputs and dependencies, and a pretty straightforward path from frame start to end.
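For reference, the "direct plus one compute" arrangement looks roughly like this in D3D12 terms. This is just a minimal sketch of queue creation and a single fence dependency; device creation, command-list recording and error handling are left out.

```cpp
// Minimal sketch: one direct (graphics) queue, one compute queue, and a fence
// so the direct queue only consumes the compute results once they're done.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void CreateQueuesAndSync(ID3D12Device* device)
{
    ComPtr<ID3D12CommandQueue> directQueue, computeQueue;

    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&directQueue));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // ... ExecuteCommandLists on computeQueue for the async compute work ...

    // Compute queue signals when its work is finished; the direct queue waits
    // on that value before running anything that reads the compute output.
    computeQueue->Signal(fence.Get(), 1);
    directQueue->Wait(fence.Get(), 1);

    // ... ExecuteCommandLists on directQueue for the dependent graphics work ...
}
```

Whether the two queues actually run concurrently is then up to the driver and hardware underneath; on a single-engine part the runtime/driver can legally serialize them, which is the point of the discussion above.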
What Sony wanted was a front end that could handle multiple clients: an arbitrary amount of middleware, system services, virtualized/secured resources, and potentially compute types with very different scales of synchronization.
For the PC space, it might matter to a hypervisor trying to host multiple compute clients, but neither Time Spy nor Doom has 64 different clients or workloads with no relation to each other.