WDDM 2.0 and beyond

What I expect beyond WDDM 2.0 (WDDM 2.1?): command buffer submission in user mode (as far as I know, it currently still happens in kernel mode). That should make async command list execution straightforward, since the application would no longer risk depending on when the driver flushes between kernel and user mode, and it should reduce CPU overhead in multithreaded code even further. I would also like to see support for more than one graphics/default command queue per node, if the driver and hardware support it.
 
I believe that's absolutely possible with WDDM 2.0.

In Direct3D 12, command lists are submitted through multiple command queues, and since the lists and pipeline states are pre-compiled and read-only, they are inherently thread-safe with no need for inter-process synchronization. You can use multiple threads and open a separate queue in each one, provided the driver uses one of the new virtual addressing models above, so that the kernel-mode portion of the driver does not have to patch or verify every virtual address on each draw call.

This was the single reason for the call submission inefficiencies in WDDM 1.x, and as I said in another thread, AMD, NVIDIA, Intel and Microsoft have all been well aware of this limitation since at least WinHEC 2006, and all of them were apparently planning their new hardware designs accordingly:
https://forum.beyond3d.com/posts/1854837/


I can only speculate, but they probably came to realize that these improvements would still require a major overhaul of pipeline state management and draw call submission in the Direct3D API, which Microsoft was probably not comfortable undertaking under the pressure of Windows 8 and Xbox One development. It might also be related to Steven Sinofsky's (mis)management of the Windows division.
 
Yes, all that could be possible, but it is not implemented yet.

Naturally, we can have multiple threads creating and submitting command lists, and a list can be submitted multiple times (though the application needs to make sure it is not still executing).
However, the system executes only one command list at a time, and the runtime serializes submissions so that their order is preserved. Moreover, the methods of a given command list object are not free-threaded, and neither are command allocators, so the application must take care of synchronizing command lists and allocators itself. Command queues, on the other hand, are free-threaded, but we can currently have only one graphics queue (copy and compute queues should not have this restriction).
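
To make that model a bit more concrete, here is a toy sketch in plain standard C++. It is not the Direct3D 12 API: `ToyCommandList`, `ToyQueue`, `reset` and `run_demo` are hypothetical names made up for illustration, and the "fence" is just a counter. It only shows the shape of the rules described above: each thread records into its own list, the queue accepts submissions from any thread, execution is one list at a time in submission order, and a list is only reused once it has finished.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Toy stand-in for a command list. It is NOT free-threaded:
// exactly one thread records into it at a time.
struct ToyCommandList {
    std::vector<std::string> commands;
    std::uint64_t last_fence = 0;  // fence value assigned at the latest submit
    void record(const std::string& cmd) { commands.push_back(cmd); }
};

// Toy stand-in for the single graphics queue: submit() may be called from any
// thread, but execution is strictly one list at a time, in submission order.
class ToyQueue {
public:
    void submit(ToyCommandList& list) {
        std::lock_guard<std::mutex> lock(mutex_);
        list.last_fence = ++submitted_;
        pending_.push_back(&list);
    }

    // Plays the role of the GPU: runs every pending list to completion, in order.
    void drain() {
        std::lock_guard<std::mutex> lock(mutex_);
        for (const ToyCommandList* list : pending_) {
            for (const std::string& cmd : list->commands) log_.push_back(cmd);
            ++completed_;
        }
        pending_.clear();
    }

    bool is_done(const ToyCommandList& list) const {
        std::lock_guard<std::mutex> lock(mutex_);
        return list.last_fence <= completed_;
    }

    std::vector<std::string> log() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return log_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<ToyCommandList*> pending_;
    std::vector<std::string> log_;
    std::uint64_t submitted_ = 0;
    std::uint64_t completed_ = 0;
};

// Toy rule (an assumption of this sketch): a list may only be cleared for
// re-recording once its last submission has finished executing.
void reset(ToyCommandList& list, const ToyQueue& queue) {
    assert(queue.is_done(list));
    list.commands.clear();
}

// Each thread records its own list and submits it; the queue is then drained.
std::vector<std::string> run_demo(int threads, int draws_per_list) {
    ToyQueue queue;
    std::vector<ToyCommandList> lists(static_cast<std::size_t>(threads));
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&queue, &lists, t, draws_per_list] {
            ToyCommandList& list = lists[static_cast<std::size_t>(t)];
            for (int d = 0; d < draws_per_list; ++d)
                list.record("t" + std::to_string(t) + ":draw" + std::to_string(d));
            queue.submit(list);
        });
    }
    for (std::thread& w : workers) w.join();
    queue.drain();
    std::vector<std::string> executed = queue.log();
    for (ToyCommandList& list : lists) reset(list, queue);
    return executed;
}
```

Calling `run_demo(4, 3)` returns twelve entries. Which thread's list comes first can vary from run to run, but the three draws of each list always stay together, which is the serialization point: recording is parallel, execution is not.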

I am not aware of the details (someone here could probably give you a better answer :) ), but it looks like the context switches caused by async command buffer execution (which actually happen in kernel mode) could cause some sort of issue.
I would not be surprised, though, if a future update moved command execution completely into user mode, with a true async command list model.

Finally, I am not aware of a single GPU with more than one graphics engine per node, so I do not know how well multiple graphics command queues would suit the hardware.
 