DirectX 12 API Preview

This may be inaccurate, but I think of the levels as categories of features that are not necessarily dependent on any other category, and a higher feature level does not necessarily indicate more advanced features. As Andrew mentioned, they are nothing more than a grouping of feature functionality ...
 
a higher feature level does not necessarily indicate more advanced features
Feature levels directly expose very specific capabilities of the underlying hardware - these are "advanced" by definition, since working around unsupported capabilities could be very costly or outright impossible.

Level 3 contains everything in level 1 and level 2, so just have level 3! Just because a version has certain features doesn't mean the game has to use them, and just because a game doesn't use certain features doesn't mean there needs to be a separate DX version with those features omitted.
We discussed this in the Direct3D feature levels discussion thread.

"Let's just have the highest level" logic doesn't work here, because there is still graphics hardware that doesn't support the higher levels (and also uses a simpler version of the driver API (WDDM/DXGK) that does not expose advanced features of the runtime), and there is still code which uses these lower levels and would not really benefit from a higher level without much refactoring and creating new graphics assets.


The feature levels were not designed from the top down. If you recall, Direct3D 10 was designed as a clean break to solve the capability bits (CAPS) problem of DirectX 8.x-9.0, where multiple optional features made it hard to maintain code paths for different vendors. So Direct3D 10.0 eliminated the feature bits almost completely and required a strict set of features, including a set of supported DXGI texture formats - however, many operations on these formats (filtering, multisample, MRT, etc.) still had to be queried with D3D10_FORMAT_SUPPORT.
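For example, here is a minimal sketch (illustrative helper name, assuming an already created ID3D10Device) of such a per-format query:

```cpp
#include <d3d10.h>

// Even on Direct3D 10.0, per-format capabilities such as MSAA render target
// support still have to be queried at runtime.
bool SupportsMsaaRenderTarget(ID3D10Device* device, DXGI_FORMAT format)
{
    UINT support = 0;
    if (FAILED(device->CheckFormatSupport(format, &support)))
        return false;

    // D3D10_FORMAT_SUPPORT is a bitmask of the operations the format allows.
    return (support & D3D10_FORMAT_SUPPORT_RENDER_TARGET) != 0 &&
           (support & D3D10_FORMAT_SUPPORT_MULTISAMPLE_RENDERTARGET) != 0;
}
```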

As more capable hardware appeared with Direct3D 10.1, new "global" features had to be advertised for the programmer to discover. This is how feature levels first appeared, and there were only two of them: 10_0 for existing hardware and 10_1 as a strict superset that included the new capabilities. This was further expanded with 11_0 and 9_x (10Level9) in Direct3D 11; level 11_1 and a few options were added in Direct3D 11.1 for Windows 8.0, and even more options in 11.2 for Windows 8.1 and 11.3 for Windows 10.
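As a rough sketch (assuming the default hardware adapter; illustrative only), an application discovers the highest supported level by passing an ordered list to D3D11CreateDevice - on a pre-11.1 runtime the 11_1 entry would have to be dropped from the list first:

```cpp
#include <d3d11.h>

D3D_FEATURE_LEVEL QueryHighestFeatureLevel()
{
    // Ordered from highest to lowest; the runtime picks the first level
    // that the adapter/driver combination supports.
    const D3D_FEATURE_LEVEL requested[] = {
        D3D_FEATURE_LEVEL_11_1, D3D_FEATURE_LEVEL_11_0,
        D3D_FEATURE_LEVEL_10_1, D3D_FEATURE_LEVEL_10_0,
        D3D_FEATURE_LEVEL_9_3,  D3D_FEATURE_LEVEL_9_2, D3D_FEATURE_LEVEL_9_1,
    };

    D3D_FEATURE_LEVEL achieved = D3D_FEATURE_LEVEL_9_1;
    ID3D11Device* device = nullptr;
    ID3D11DeviceContext* context = nullptr;

    if (SUCCEEDED(D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                                    requested, sizeof(requested) / sizeof(requested[0]),
                                    D3D11_SDK_VERSION, &device, &achieved, &context)))
    {
        context->Release();
        device->Release();
    }
    return achieved;
}
```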



Now, from the system architecture point of view, the device driver doesn't really have to support all the lower levels when it supports the higher ones. It could advertise only the highest possible capabilities and let the Direct3D runtime handle the rest, since the capabilities of each higher level are a strict superset of the lower ones - and this is exactly how it works for levels 10_x and 11_x in Direct3D 11.1/11.2 (though the runtime still uses DDI9 for 10Level9 even on level 11_x hardware).

In Direct3D 12, developers have explicit control over this with the Direct3D 11on12 layer.

The only reason that makes sense to me as to why MS would not just have DX12 level 3 as DX12 with no feature levels is if an IHV put pressure on them, saying "our GPU supports all of DX12 except for one or two features which devs can work around or aren't that important, and because of that we will have to market our GPUs as only being DX11 compliant and we will lose sales; you need to come up with a solution so that we can market our GPUs as DX12 compliant".
Hence this feature level nonsense.
I think the logic was quite different.

Level 12_0 is supported on the Xbox One.
Level 12_1 requires Conservative Rasterization and Rasterizer Ordered Views - they provide a very efficient way to implement occlusion culling, order-independent transparency and ambient shadows, which require a lot of effort on current hardware.
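Both also show up as individual caps; a small sketch (assuming an existing ID3D12Device; the helper name is made up) of checking for them:

```cpp
#include <d3d12.h>

bool HasLevel12_1Features(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    if (FAILED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                           &options, sizeof(options))))
        return false;

    // Rasterizer Ordered Views and Conservative Rasterization are the two
    // capabilities that distinguish feature level 12_1.
    return options.ROVsSupported &&
           options.ConservativeRasterizationTier !=
               D3D12_CONSERVATIVE_RASTERIZATION_TIER_NOT_SUPPORTED;
}
```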
 
Now that I've had more time to mull over it, it's a bit of a shame that FL12_1 is actually not FL12_0. As a developer to have those 2 features standard in every DX12 card would be quite massive I think. Then again, perhaps the difference in performance between emulation vs hardware is not as large as I think it is.
 
Now that I've had more time to mull over it, it's a bit of a shame that FL12_1 is actually not FL12_0. As a developer to have those 2 features standard in every DX12 card would be quite massive I think. Then again, perhaps the difference in performance between emulation vs hardware is not as large as I think it is.
Yeah but you can't retroactively decide what features are in what hardware. The reality is if FL12_1 was FL12 then there would be very few FL12_1 cards out there and you'd have to treat everything as 11_1 devices, despite most of it being fully capable of "bindless" stuff (i.e. FL12).

I'm with you in that those two features are great and useful and I want to see them everywhere ASAP but bindless is important too.
 
http://channel9.msdn.com/Events/Build/2015/3-673

Advanced DirectX12 Graphics and Performance
  • Date: April 30, 2015 from 2:00PM to 3:00PM
  • Speakers: Max McMullen
DirectX12 enables graphics intensive apps to deliver better performance with greater flexibility and control. This technical session goes deep into the DirectX12 APIs you can use to reduce CPU rendering overhead, manage GPU resource usage more efficiently, and express the most cutting-edge 3D graphics possible across the spectrum of Windows devices. Whether you are building an app for the phone, PC, or Xbox, you don't want to miss this session.


http://channel9.msdn.com/Events/Build/2015/2-637

Game Developers: Get the Most Out of Windows 10
In this session, we will tour the new APIs, learn techniques and design considerations for building multi-device Windows games, explore how to integrate Windows games with Xbox Live, and discuss updates on the most popular gaming middleware and engines now ready for Windows 10.
 
http://channel9.msdn.com/Events/Build/2015/3-673

Advanced DirectX12 Graphics and Performance
  • Date: April 30, 2015 from 2:00PM to 3:00PM
  • Speakers: Max McMullen
DirectX12 enables graphics intensive apps to deliver better performance with greater flexibility and control. This technical session goes deep into the DirectX12 APIs you can use to reduce CPU rendering overhead, manage GPU resource usage more efficiently, and express the most cutting-edge 3D graphics possible across the spectrum of Windows devices. Whether you are building an app for the phone, PC, or Xbox, you don't want to miss this session.

Thanks Dmitry for posting the Microsoft gaming related talks again. Originally my talk was planned to be a repeat of my GDC talk this year, thus the same title and description, but it now has a lot of new content with one more new Direct3D 12 API feature that I haven't talked about yet.
 
Now that I've had more time to mull over it, it's a bit of a shame that FL12_1 is actually not FL12_0. As a developer to have those 2 features standard in every DX12 card would be quite massive I think. Then again, perhaps the difference in performance between emulation vs hardware is not as large as I think it is.
Emulation of ROV and conservative rasterization is very difficult and would likely have unsolvable corner cases.

Conservative rasterization could be (at least) partially emulated by doing edge expansion in a geometry shader and adding lots of custom math instead of relying on fixed-function rasterization hardware. However, this would mean that the driver had to transparently add completely new shader stages (or combine them intelligently if a geometry shader was already present), reroute the data and change the communication behind the scenes. This runs counter to the design goal of a low-level API with less abstraction. Obviously there would also be a huge performance drop (as geometry shaders are dead slow, especially on AMD hardware).

ROV emulation would need driver-generated data structures for custom global atomic synchronization. DX12 has manual resource management: the programmer manages the memory. It would make the API really bad if you had to ask the driver whether it needs some extra temporary buffers and pass the resource descriptors to it through some side channel. If a programmer wants to emulate ROV, he/she can write the necessary code.

I don't like the idea of the driver modifying my shaders and data structures based on some arcane IHV-specific logic. There would definitely be corner cases where this fails with your particular resource layout or your particular shader. It is impossible to prove the correctness of complex shaders (that include flow control and synchronization between threads). I don't believe the driver should try to do massive structural transformations to our shader code. Automatic code refactoring should always be verifiable by the programmer; in this case it would be completely hidden.
 
FYI, there are a few mismatches. Firstly, on Resource Binding Tier 2 the maximum number of UAVs across all stages is 64 in your slides but "full heap" in the MSDN docs. Secondly, on Tier 1 the maximum size of a descriptor heap is 2^20 in your slides but "~55K" in the MSDN docs.

The MSDN docs are based on an earlier version of the spec. A hardware vendor came along with a hardware limitation of 64 UAVs in Tier 2 while meeting all the other specs. We (D3D) didn't want to fork the binding tiers again, so we limited all of Tier 2 to 64. My team worked with the hardware vendor in Tier 1 that had the 55K limit to find alternate means of programming the GPU. Micro-optimizing the CPU overhead leads to that 55K limit, but there's an alternate path that has slightly more CPU overhead in the state management and overall seems a win given the app complexity of dealing with 55K. As you might guess, the real hardware limit is actually 65K, with some reserved for the driver.

My slides are correct and MSDN should be updated soon.
 
But if the Tier 1 limit for the descriptor heap is ~64K, shouldn't it show as 2^16 in your slide, not 2^20, which is actually 1M? It was 2^16 in an earlier IDF 2014 presentation, BTW.

I wasn't clear. My team worked with the hardware vendor to get rid of the 55K limit by avoiding the hardware limited to 2^16 descriptors. That GPU now has a 2^20 limit by using a little more CPU for state management, so Tier 1 is 2^20 per my slide. The increased CPU overhead is mitigated by simpler app logic for dealing with descriptor heaps.
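For reference, the limit being discussed bounds the NumDescriptors of a single shader-visible heap; a minimal sketch (assuming an existing ID3D12Device, illustrative helper name):

```cpp
#include <windows.h>
#include <d3d12.h>

ID3D12DescriptorHeap* CreateSrvUavCbvHeap(ID3D12Device* device, UINT numDescriptors)
{
    D3D12_DESCRIPTOR_HEAP_DESC desc = {};
    desc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
    desc.NumDescriptors = numDescriptors;   // e.g. up to 1 << 20 on Tier 1
    desc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_SHADER_VISIBLE;

    ID3D12DescriptorHeap* heap = nullptr;
    device->CreateDescriptorHeap(&desc, IID_PPV_ARGS(&heap));
    return heap;
}
```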
 
My team worked with the hardware vendor to get rid of the 55K limit by avoiding the hardware limited to 2^16 descriptors. That GPU now has a 2^20 limit by using a little more CPU for state management
Very interesting, thank you for the explanation.

These are probably the same programmers who implemented Windows multi-tasking in x86 real mode :)
 
For the curious, the limitation was on Haswell specifically, related to how the GPU manages binding updates and versioning. I discussed a little bit about how the new implementation works in my GDC talk:
https://software.intel.com/sites/de...ndering-with-DirectX-12-on-Intel-Graphics.pdf

The increased CPU overhead is mitigated by simpler app logic for dealing with descriptor heaps.
I'm not sure the increased CPU overhead in the driver is even true anymore. Because of the extra information provided by root signatures in DX12, we avoid a lot of the overhead that would have been associated with using this alternate path in previous APIs (DX11, etc.). Thus, in my experience, this new path now has lower overhead in pretty much all cases across the board (driver, application, runtime, GPU).

It's one of the reasons I'm a big fan of the DX12 resource binding model vs. alternatives - it efficiently maps to quite a wide variety of hardware architectures while exposing bindless and other great features in a straightforward manner. Kudos again to you guys on that :)
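To make the "extra information" point concrete, here is a rough sketch (plain d3d12.h, no d3dx12 helpers, illustrative layout and names) of a small root signature - one SRV descriptor table plus one root CBV - which tells the driver up front how bindings will be used:

```cpp
#include <windows.h>
#include <d3d12.h>

ID3D12RootSignature* CreateExampleRootSignature(ID3D12Device* device)
{
    // A table of 8 SRVs (t0..t7), visible to the pixel shader.
    D3D12_DESCRIPTOR_RANGE range = {};
    range.RangeType = D3D12_DESCRIPTOR_RANGE_TYPE_SRV;
    range.NumDescriptors = 8;
    range.BaseShaderRegister = 0;

    D3D12_ROOT_PARAMETER params[2] = {};
    params[0].ParameterType = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE;
    params[0].DescriptorTable.NumDescriptorRanges = 1;
    params[0].DescriptorTable.pDescriptorRanges = &range;
    params[0].ShaderVisibility = D3D12_SHADER_VISIBILITY_PIXEL;

    // A root CBV at b0, visible to all stages.
    params[1].ParameterType = D3D12_ROOT_PARAMETER_TYPE_CBV;
    params[1].Descriptor.ShaderRegister = 0;
    params[1].ShaderVisibility = D3D12_SHADER_VISIBILITY_ALL;

    D3D12_ROOT_SIGNATURE_DESC desc = {};
    desc.NumParameters = 2;
    desc.pParameters = params;
    desc.Flags = D3D12_ROOT_SIGNATURE_FLAG_ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT;

    ID3DBlob* blob = nullptr;
    if (FAILED(D3D12SerializeRootSignature(&desc, D3D_ROOT_SIGNATURE_VERSION_1,
                                           &blob, nullptr)))
        return nullptr;

    ID3D12RootSignature* rootSig = nullptr;
    device->CreateRootSignature(0, blob->GetBufferPointer(), blob->GetBufferSize(),
                                IID_PPV_ARGS(&rootSig));
    blob->Release();
    return rootSig;
}
```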
 
Interesting read. Thanks for the detailed info!

So, did I get it correctly: ExecuteIndirect only needs the extra driver generated compute shader call if I change the bindings? Is the basic version (similar to OpenGL MDI) directly supported by the command processor? If I have generated the ExecuteIndirect input arrays ahead of time (on the GPU), is there any way to execute that driver generated compute shader so that there is enough time between that dispatch and the multiple draw calls (indirect parameter setup in a compute shader + a directly following draw causes a pipeline stall)?
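(For concreteness, by "the basic version" I mean a draw-only command signature with no per-draw binding changes - roughly this sketch, with names of my own invention:)

```cpp
#include <windows.h>
#include <d3d12.h>

ID3D12CommandSignature* CreateIndexedDrawSignature(ID3D12Device* device)
{
    // A single argument per command: an indexed draw, nothing else.
    D3D12_INDIRECT_ARGUMENT_DESC arg = {};
    arg.Type = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW_INDEXED;

    D3D12_COMMAND_SIGNATURE_DESC desc = {};
    desc.ByteStride       = sizeof(D3D12_DRAW_INDEXED_ARGUMENTS);
    desc.NumArgumentDescs = 1;
    desc.pArgumentDescs   = &arg;

    ID3D12CommandSignature* signature = nullptr;
    // No root signature is required when the commands don't change bindings.
    device->CreateCommandSignature(&desc, nullptr, IID_PPV_ARGS(&signature));
    return signature;
}
```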

In the future it would be awesome to allow compute shaders to write directly to the command queue. But I understand the limitations of the PC (multiple vendors with completely different command processor designs). OpenCL 2.1 manages to do this however (but it only supports compute, not draw call generation).
 
Interesting read. Thanks for the detailed info!
No problem! Generally we try to be as transparent as possible about how our driver and hardware work so that game developers can understand the performance they see and optimize appropriately.

So, did I get it correctly: ExecuteIndirect only needs the extra driver generated compute shader call if I change the bindings? Is the basic version (similar to OpenGL MDI) directly supported by the command processor?
Kind of - some of this is still up in the air as to how the driver will implement it on different hardware. The command processor on Haswell can do "draw indirect" natively. It can do MDI with a CPU-side count (by unrolling it on the CPU). While the command processor can kind-of loop, it's not terribly efficient so GPU-side counts probably imply the compute shader path.

There's also a trade-off based on the number of commands. While the command processor can indeed fetch the arguments for regular draw indirect, its memory path is not as fast as that of the execution units. Thus for a sufficient number of draws it's better to do the compute shader version as well. Where exactly that line is will depend on a few factors, but certainly if you're going to be doing hundreds or thousands of draws it's likely worth doing the CS.

If I have generated the ExecuteIndirect input arrays ahead of time (on the GPU), is there any way to execute that driver generated compute shader so that there is enough time between that dispatch and the multiple draw calls (indirect parameter setup in a compute shader + a directly following draw causes a pipeline stall)?
There's an opportunity for the driver to do this by making use of the resource barrier that is required for the indirect arguments buffer. Remember, though, that compute->3D or 3D->compute transitions already cause a pipeline stall on Haswell, so the key optimization is to group compute work (including any indirect CS work) together.
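A rough sketch of that pattern (assuming an existing command signature, an argument buffer written by a compute pass, and a GPU-side count buffer - all names illustrative):

```cpp
#include <d3d12.h>

void RecordIndirectDraws(ID3D12GraphicsCommandList* cmd,
                         ID3D12CommandSignature* drawSignature,
                         ID3D12Resource* argBuffer,
                         ID3D12Resource* countBuffer,
                         UINT maxDraws)
{
    // Transition the argument buffer out of the UAV state the compute pass
    // used; this required barrier is where the driver has a chance to
    // schedule any fix-up work it needs.
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource   = argBuffer;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_INDIRECT_ARGUMENT;
    cmd->ResourceBarrier(1, &barrier);

    // GPU-side count: the actual draw count is read from countBuffer at
    // offset 0, capped by maxDraws.
    cmd->ExecuteIndirect(drawSignature, maxDraws, argBuffer, 0, countBuffer, 0);
}
```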

OpenCL 2.1 manages to do this however (but it only supports compute, not draw call generation).
I need to check the details again but IIRC OpenCL's solution was more like execute indirect than exposing the command buffer format. In general I'll just say that OpenCL drivers typically play a lot of games to give the illusion of self dispatch... I wouldn't assume that a given feature in OCL is "native" or even particularly efficient without testing a specific implementation.
 
No problem! Generally we try to be as transparent as possible about how our driver and hardware work so that game developers can understand the performance they see and optimize appropriately.
That is highly appreciated. Intel has clearly stepped up in the recent years, thanks to you and the other enthusiastic people in the GPU team.
The command processor on Haswell can do "draw indirect" natively. It can do MDI with a CPU-side count (by unrolling it on the CPU). While the command processor can kind-of loop, it's not terribly efficient so GPU-side counts probably imply the compute shader path.

There's also a trade-off based on the number of commands. While the command processor can indeed fetch the arguments for regular draw indirect, its memory path is not as fast as that of the execution units. Thus for a sufficient number of draws it's better to do the compute shader version as well. Where exactly that line is will depend on a few factors, but certainly if you're going to be doing hundreds or thousands of draws it's likely worth doing the CS.
We do have a GPU-side count and GPU-side data; the CPU knows nothing about the rendered scene. A round-trip back to the CPU sounds like a horrible option (but it's likely not as bad on Intel integrated GPUs as on discrete GPUs). Our pipeline performs fewer than 50 MDI calls (ExecuteIndirect on DX12) in total (including all shadow maps), so the compute shader solution sounds like the best option.

Wouldn't there be a possibility for a fast path when the draw call count comes from the GPU side, while otherwise the operation is comparable with OpenGL 4.3 (CPU draw count) MDI: perform a (driver generated) compute shader that overwrites the draw call count in the MDI packet. This way you don't need to write N indirect draw calls to the buffer (instead you just write a single 4-byte value).
There's an opportunity for the driver to do this by making use of the resource barrier that is required for the indirect arguments buffer. Remember, though, that compute->3D or 3D->compute transitions already cause a pipeline stall on Haswell, so the key optimization is to group compute work (including any indirect CS work) together.
This is exactly what I was wondering about. If I, for example, need to render 20 shadow maps and perform a single ExecuteIndirect for each, am I able to write the code so that the GPU first performs the 20 (driver generated) compute dispatches that write the commands to the command buffer, and then performs the draw calls for the shadow maps (with no compute between them)?

Are the Intel GPUs able to render multiple draw calls simultaneously if I change the render target between them (change RT -> ExecuteIndirect -> change RT -> ExecuteIndirect -> ...)? Obviously I could have my shadow maps in a texture array and use SV_RenderTargetArrayIndex to push different triangles to different shadow maps. This results in a single ExecuteIndirect call that renders all the shadow maps at once. I will PM you about the details.
I need to check the details again but IIRC OpenCL's solution was more like execute indirect than exposing the command buffer format. In general I'll just say that OpenCL drivers typically play a lot of games to give the illusion of self dispatch... I wouldn't assume that a given feature in OCL is "native" or even particularly efficient without testing a specific implementation.
Robert Ioffe (Intel) has a nice article that shows big gains from self dispatch on Broadwell (he doesn't give the exact numbers however):
https://software.intel.com/en-us/ar...ted-parallelism-and-work-group-scan-functions

This algorithm spawns an unknown number of different kernels. A simple indirect dispatch doesn't solve this case. Self-enqueue on OpenCL 2.0 needs to either write to the command queue or have some command processor support (indirect dispatch + a command buffer loop would be enough, since the number of shader permutations is known in advance).
 