DirectX 12 API Preview

Same hardware, same game settings, same graphics driver:


The only difference is that on the left we have Windows 10 and on the right Windows 8.1.

 
I think it's a side effect of AMD's driver, to be honest. The game is controversial because AMD performance generally looks very bad, but less bad under W10.

There may be an interaction with PhysX CPU code, since the game is known to use that library.

Perhaps it's something to do with multi-threading and whether AMD's driver behaves differently with respect to threading under W10.
 
In case anyone wanted to see what's in DX11.3:

Direct3D 11.3 Features

Adaptive Scalable Texture Compression

ASTC provides developers with greater control over the size versus quality trade-off with textures. ASTC is a lossy format, but one that is designed to provide an inexpensive route to greater quality textures. The idea is that a developer can choose the optimum format without having to support multiple compression schemes.

Conservative Rasterization

Conservative rasterization adds some certainty to pixel rendering, which is helpful in particular to collision detection algorithms.
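
For reference, in D3D11.3 this surfaces as just one extra field on the rasterizer state. A minimal sketch, assuming an ID3D11Device3 is already available and the ConservativeRasterizationTier cap reports support (the helper name is mine, error handling omitted):

```cpp
#include <d3d11_3.h>
#include <wrl/client.h>

Microsoft::WRL::ComPtr<ID3D11RasterizerState2> CreateConservativeRasterState(ID3D11Device3* device3)
{
    D3D11_RASTERIZER_DESC2 desc = {};
    desc.FillMode = D3D11_FILL_SOLID;
    desc.CullMode = D3D11_CULL_BACK;
    desc.DepthClipEnable = TRUE;
    // Every pixel the primitive touches at all is rasterized, which is what
    // makes this useful for collision detection / voxelization style algorithms.
    desc.ConservativeRaster = D3D11_CONSERVATIVE_RASTERIZATION_MODE_ON;

    Microsoft::WRL::ComPtr<ID3D11RasterizerState2> state;
    device3->CreateRasterizerState2(&desc, &state);
    return state;
}
```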

Default Texture Mapping

The use of default texture mapping reduces copying and memory usage while sharing image data between the GPU and the CPU. However, it should only be used in specific situations. The standard swizzle layout avoids copying or swizzling data in multiple layouts.

Rasterizer Order Views

Rasterizer ordered views (ROVs) allow pixel shader code to mark UAV bindings with a declaration that alters the normal requirements for the order of graphics pipeline results for UAVs. This enables Order Independent Transparency (OIT) algorithms to work, which give much better rendering results when multiple transparent objects are in line with each other in a view.

Shader Specified Stencil Reference Value

Enabling pixel shaders to output the Stencil Reference Value, rather than using the API-specified one, enables a very fine granular control over stencil operations.

Typed Unordered Access View Loads

Unordered Access View (UAV) Typed Load is the ability for a shader to read from a UAV with a specific DXGI_FORMAT.

Unified Memory Architecture

Querying for whether Unified Memory Architecture (UMA) is supported can help determine how to handle some resources.

Volume Tiled Resources

Volume (3D) textures can be used as tiled resources, noting that tile resolution is three-dimensional.
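
Most of these are capability bits rather than guarantees, so they still have to be queried per device. A minimal sketch of that query (the helper name is mine, error handling kept to a minimum), assuming the D3D11.3 headers:

```cpp
#include <d3d11_3.h>
#include <cstdio>

void PrintD3D11_3Caps(ID3D11Device* device)
{
    D3D11_FEATURE_DATA_D3D11_OPTIONS2 opts = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D11_FEATURE_D3D11_OPTIONS2,
                                              &opts, sizeof(opts))))
    {
        printf("Conservative rasterization tier: %d\n", opts.ConservativeRasterizationTier);
        printf("ROVs supported:                  %d\n", opts.ROVsSupported);
        printf("PS stencil ref supported:        %d\n", opts.PSSpecifiedStencilRefSupported);
        printf("Typed UAV load (extra formats):  %d\n", opts.TypedUAVLoadAdditionalFormats);
        printf("Map on default textures:         %d\n", opts.MapOnDefaultTextures);
        printf("Standard swizzle:                %d\n", opts.StandardSwizzle);
        printf("Unified memory architecture:     %d\n", opts.UnifiedMemoryArchitecture);
        printf("Tiled resources tier:            %d\n", opts.TiledResourcesTier);
    }
}
```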
 
So, the biggest performance-boosting features were left out of DX11.3.

There is no ExecuteIndirect. It would have been a super nice feature, especially for DX11, as DX11 is so slow on draw calls. This feature would have single-handedly brought DX11 draw call performance on par with DX12 (in cases where you don't need to change the GPU state between the draws).

And there is no asynchronous compute. This was of course expected, as it would have required big API changes.
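
For comparison, this is roughly what the D3D12 side of asynchronous compute looks like: a second queue of type COMPUTE plus a fence for cross-queue ordering. A minimal sketch with my own names, error handling omitted:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

ComPtr<ID3D12CommandQueue> graphicsQueue, computeQueue;
ComPtr<ID3D12Fence> fence;
UINT64 fenceValue = 0;

void CreateQueues(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC gfx = {};
    gfx.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    device->CreateCommandQueue(&gfx, IID_PPV_ARGS(&graphicsQueue));

    D3D12_COMMAND_QUEUE_DESC compute = {};
    compute.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;   // can run alongside graphics work
    device->CreateCommandQueue(&compute, IID_PPV_ARGS(&computeQueue));

    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
}

// Make the graphics queue wait for compute work submitted earlier this frame.
void SyncGraphicsToCompute()
{
    ++fenceValue;
    computeQueue->Signal(fence.Get(), fenceValue);   // GPU-side signal
    graphicsQueue->Wait(fence.Get(), fenceValue);    // GPU-side wait, no CPU stall
}
```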
 
This feature would have single-handedly brought DX11 draw call performance on par with DX12 (in cases where you don't need to change the GPU state between the draws).
Which is one of the main public features they're touting for DX12, which also requires Windows 10. Hence there's no way they'd want it ending up in DX11?
 
There is no ExecuteIndirect. It would have been a super nice feature, especially for DX11, as DX11 is so slow on draw calls. This feature would have single-handedly brought DX11 draw call performance on par with DX12 (in cases where you don't need to change the GPU state between the draws).
ExecuteIndirect needs root constants/descriptors and resource barriers to be useful and efficient on a variety of hardware.

IMHO, while Microsoft did expose some of the easier GPU capabilities in DX11.3, if developers want to plan ahead it's better to start transitioning to DX12, even if it means making use of the DX11on12 layer in the short term to do partial ports.
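
To make the binding point concrete, here is roughly what a D3D12 command signature that mixes a per-draw root constant with an indexed draw looks like. This is a minimal sketch under my own names; the root signature, the argument/count buffers, and the culling shader that fills them are assumed to exist:

```cpp
#include <d3d12.h>
#include <cstdint>

struct IndirectCommand            // must match the command signature layout below
{
    uint32_t drawId;              // consumed as a 32-bit root constant
    D3D12_DRAW_INDEXED_ARGUMENTS draw;
};

ID3D12CommandSignature* CreateDrawSignature(ID3D12Device* device,
                                            ID3D12RootSignature* rootSig)
{
    D3D12_INDIRECT_ARGUMENT_DESC args[2] = {};
    args[0].Type = D3D12_INDIRECT_ARGUMENT_TYPE_CONSTANT;
    args[0].Constant.RootParameterIndex = 0;      // root parameter holding drawId
    args[0].Constant.DestOffsetIn32BitValues = 0;
    args[0].Constant.Num32BitValuesToSet = 1;
    args[1].Type = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW_INDEXED;

    D3D12_COMMAND_SIGNATURE_DESC desc = {};
    desc.ByteStride = sizeof(IndirectCommand);
    desc.NumArgumentDescs = 2;
    desc.pArgumentDescs = args;

    ID3D12CommandSignature* sig = nullptr;
    // The root signature is required here because the signature writes a root constant.
    device->CreateCommandSignature(&desc, rootSig, IID_PPV_ARGS(&sig));
    return sig;
}

void SubmitDraws(ID3D12GraphicsCommandList* cmdList, ID3D12CommandSignature* sig,
                 ID3D12Resource* argBuffer, ID3D12Resource* countBuffer, UINT maxDraws)
{
    // The GPU reads the real draw count from countBuffer (clamped to maxDraws),
    // so a culling compute shader can decide how many of the slots are used.
    cmdList->ExecuteIndirect(sig, maxDraws, argBuffer, 0, countBuffer, 0);
}
```

The count buffer in the last call is what gives the multiDrawIndirect-style "GPU decides how many draws actually run" behaviour; the root-constant argument is the part that needs the new binding model.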
 
ExecuteIndirect needs root constants/descriptors and resource barriers to be useful and efficient on a variety of hardware.
That's true. Full ExecuteIndirect needs root constants/descriptors. However, they could have implemented a limited subset (equivalent to OpenGL multiDrawIndirect) in DX11.3. That wouldn't need any API refactoring at all.
 
That's true. Full ExecuteIndirect needs root constants/descriptors. However, they could have implemented a limited subset (equivalent to OpenGL multiDrawIndirect) in DX11.3. That wouldn't need any API refactoring at all.
I think it would need to at least have support for something like draw parameters to be very useful though. If you literally just need a sequence of DrawIndirect calls that's not terribly inefficient to do today; GPUs are pretty efficient at throwing out 0-length draws if you need to cull some of them out.

Don't get me wrong, I like the feature but it's really the binding changes that make it cool.
 
I think it would need to at least have support for something like draw parameters to be very useful though. If you literally just need a sequence of DrawIndirect calls that's not terribly inefficient to do today; GPUs are pretty efficient at throwing out 0-length draws if you need to cull some of them out.

Don't get me wrong, I like the feature but it's really the binding changes that make it cool.
We ONLY need the ability to control the draw call count from the GPU side. Pushing a constant number of draw calls (most of them empty) from the CPU side wastes lots of GPU performance (empty draws cost a surprising amount). We don't need binding changes since we use virtual texturing (and all our mesh data is in a single big raw buffer). SV_DrawId would obviously be mandatory.
 
With D3D11? What about 12?
I am talking about the GPU cost. The command processor will be a big bottleneck if you push the maximum worst-case (let's say 50k, mostly empty) draws for each viewport (let's say main + 4 shadow cascades + 10 shadow-casting local lights). If you don't know on the CPU side what you are going to render, it is hard to estimate tight (conservative) maximums that are never exceeded, especially when you use fine-grained (sub-object precision) occlusion culling for all viewports (including shadows).

If DX11 had GPU buffer predicates (skip over a set of commands if a GPU buffer memory location contains zero), you could divide the (potentially empty) draws into groups (of 1000, for example) and pay only for the GPU overhead of the last group on each viewport. Unfortunately this would not save any CPU cost.
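
For what it's worth, D3D12 does get exactly this: SetPredication can point at an arbitrary 8-byte-aligned location in a buffer. A minimal sketch with my own names, assuming the buffer has already been transitioned to the predication state; as I read the docs, with EQUAL_ZERO the predicated commands are skipped when the 64-bit value is zero:

```cpp
#include <d3d12.h>

void RecordPredicatedGroups(ID3D12GraphicsCommandList* cmdList,
                            ID3D12Resource* predicationBuffer, // one UINT64 counter per group
                            UINT groupCount)
{
    for (UINT group = 0; group < groupCount; ++group)
    {
        // Offset must be 8-byte aligned. A culling shader writes the group's
        // visible-draw count here; a zero count makes the GPU skip the group.
        cmdList->SetPredication(predicationBuffer, group * sizeof(UINT64),
                                D3D12_PREDICATION_OP_EQUAL_ZERO);

        // ... record the (potentially empty) draws for this group ...
    }

    // Disable predication for the rest of the command list.
    cmdList->SetPredication(nullptr, 0, D3D12_PREDICATION_OP_EQUAL_ZERO);
}
```

As noted above, though, this only removes the GPU-side cost of the empty groups; the CPU still records every draw.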
 
A quick question, perhaps not such a quick answer. Is it true that D3D12's lower cost for draw calls only helps bad console ports and bad coding in general? Is it true that writing better code and drawing things in batches would overcome every benefit D3D12 has from its low cost for draw calls?
 
A quick question, perhaps not such a quick answer. Is it true that D3D12's lower cost for draw calls only helps bad console ports and bad coding in general? Is it true that writing better code and drawing things in batches would overcome every benefit D3D12 has from its low cost for draw calls?
Simple answer: NO.
Spending less time in the API means spending more time on the game's computations, which is always good.
 
A quick question, perhaps not such a quick answer. Is it true that D3D12's lower cost for draw calls only helps bad console ports and bad coding in general? Is it true that writing better code and drawing things in batches would overcome every benefit D3D12 has from its low cost for draw calls?
There are tradeoffs when you batch draw calls. Good code should leverage draw calls where necessary and batch where necessary.

As the scene gets more graphically complex, and you want to reach a certain level of graphical fidelity, draw calls are likely going to increase with scene complexity.
 
So is this in terms of cost efficiency, as in less time spent optimizing/minimizing draw calls means more time for other things? Or is it also a pure technical limitation of D3D11 which no amount of optimizing can overcome? Either way, is the lower overhead going to be a big step forward in practice?
 
So is this in terms of cost efficiency, as in less time spent optimizing/minimizing draw calls means more time for other things? Or is it also a pure technical limitation of D3D11 which no amount of optimizing can overcome? Either way, is the lower overhead going to be a big step forward in practice?
I can't answer technically. Senior members here can give you a more accurate answer. But my understanding is that there is no way to optimize the API overhead itself, but you can optimize around it - hence batched draw calls. In D3D11, say you make a call to draw a triangle strip; maybe that unpacks to 50 instructions for the GPU (that the CPU needs to send), whereas with D3D12 maybe it only takes 8 instructions. As the instruction overhead drops, GPU saturation can increase. In this scenario the GPU is waiting for all the commands to come in before it starts doing work, so the fewer instructions it needs to wait for before it starts doing work the better. Lower overhead should result in immediate gains, as well as allowing for better control over the GPU, so there should be less time spent fighting against what the API is doing and more time programming the graphics for the game.
 
So is this in terms of cost efficiency, as in less time spent optimizing/minimizing draw calls means more time for other things? Or is it also a pure technical limitation of D3D11 which no amount of optimizing can overcome? Either way, is the lower overhead going to be a big step forward in practice?
I suppose small studios might not have people proficient enough to use a low-level API, which is why MS is updating D3D11; the gain for those able to use the new APIs will therefore be freed-up CPU & GPU time.
Having finer control over memory and lower overhead in the API will make game streaming much easier, and with current hardware flexibility it should open the door for noticeable new/improved gfx.
As you said, it should also free up dev time that can be spent on better shaders, less aliasing and new techniques...
 