DirectX 12 API Preview

Discussion in 'PC Hardware, Software and Displays' started by PeterAce, Apr 28, 2014.

  1. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,911
    Likes Received:
    1,608
    This may be inaccurate but I think of the levels as categories of features that are not necessarily dependent on any other category, and a higher feature level does not necessarily indicate more advanced features. As Andrew mentioned they are nothing more than a grouping of feature functionality ...
     
    #181 pharma, Apr 11, 2015
    Last edited: Apr 11, 2015
  2. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    646
    Likes Received:
    489
    Location:
    55°38′33″ N, 37°28′37″ E
    Feature levels directly expose very specific capabilities of the underlying hardware - these are "advanced" by definition, since working around unsupported capabilities could be very costly or outright impossible.

    We discussed this in Direct3D feature levels discussion.

    "Let's just have the highest level" logic doesn't work here, because there is still graphics hardware that doesn't support the higher levels (and also uses a simpler version of the driver API (WDDM/DXGK) that does not expose advanced features of the runtime), and there is still code which uses these lower levels and would not really benefit from a higher level without much refactoring and creating new graphics assets.


    The feature levels were not designed from top to bottom. If you recall, DirectX 10 was designed as a clean break to control the capability bits (CAPS) problem of DirectX 8.x-9.0, where multiple optional features made it hard to maintain code paths for different vendors. So Direct3D 10.0 eliminated the feature bits almost completely and required a strict set of features, including a set of supported DXGI texture formats - however, many operations on these formats (filtering, multisampling, MRT, etc.) still had to be queried with D3D10_FORMAT_SUPPORT.

    As more capable hardware appeared with Direct3D 10.1, new "global" features had to be advertised for the programmer to discover. This is how feature levels first appeared, and there were only two of them: 10_0 for existing hardware and 10_1 as a strict superset which included the new capabilities. This was further expanded with 11_0 and 9_x (10level9) in Direct3D 11; level 11_1 and a few options were added in Direct3D 11.1 for Windows 8.0, and even more options in 11.2 for Windows 8.1 and 11.3 for Windows 10.



    Now from the system architecture point of view, the device driver doesn't really have to support all the lower levels when it supports the higher ones. It could advertise only the highest possible capabilities and let the Direct3D runtime handle the rest, since each higher level is a strict superset of the levels below it - and this is exactly how it works for levels 10_x and 11_x in Direct3D 11.1/11.2 (though the runtime still uses DDI9 for 10level9 even on level 11_x hardware).
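
    For illustration, here's a minimal sketch (in C++, assuming an existing IDXGIAdapter*; error handling trimmed) of how an application discovers this in Direct3D 12: create the device at the 11_0 baseline and then ask the runtime/driver for the highest supported level.

```cpp
#include <d3d12.h>
#include <dxgi1_4.h>

// Sketch: create a device at the 11_0 baseline, then query which of the
// requested feature levels is the highest one the driver actually supports.
D3D_FEATURE_LEVEL QueryMaxFeatureLevel(IDXGIAdapter* adapter)
{
    ID3D12Device* device = nullptr;
    // Direct3D 12 requires at least feature level 11_0 to create a device.
    if (FAILED(D3D12CreateDevice(adapter, D3D_FEATURE_LEVEL_11_0,
                                 IID_PPV_ARGS(&device))))
        return static_cast<D3D_FEATURE_LEVEL>(0); // no D3D12 device at all

    const D3D_FEATURE_LEVEL requested[] = {
        D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_11_1,
        D3D_FEATURE_LEVEL_12_0, D3D_FEATURE_LEVEL_12_1
    };
    D3D12_FEATURE_DATA_FEATURE_LEVELS levels = {};
    levels.NumFeatureLevels        = _countof(requested);
    levels.pFeatureLevelsRequested = requested;
    device->CheckFeatureSupport(D3D12_FEATURE_FEATURE_LEVELS,
                                &levels, sizeof(levels));
    device->Release();
    return levels.MaxSupportedFeatureLevel;
}
```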

    In Direct3D 12, developers have explicit control over this with the Direct3D 11on12 layer.

    I think the logic was quite different.

    Level 12_0 is supported on the Xbox One.
    Level 12_1 requires Conservative Rasterization and Rasterizer Ordered Views - they provide a very efficient way to implement occlusion culling, order-independent transparency and ambient shadows, which require a lot of effort on current hardware.
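
    As a rough illustration (a hedged sketch, assuming an already-created ID3D12Device* device), an application can probe for exactly these two capabilities regardless of the reported feature level:

```cpp
// Sketch: query the two capabilities that feature level 12_1 requires.
D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                          &options, sizeof(options))))
{
    bool hasConservativeRaster =
        options.ConservativeRasterizationTier !=
        D3D12_CONSERVATIVE_RASTERIZATION_TIER_NOT_SUPPORTED;
    bool hasROVs = options.ROVsSupported != FALSE;
    // A 12_1 device reports both; an 11_x/12_0 device may report either,
    // both, or neither - which is why these are caps and not just levels.
}
```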
     
    Andrew Lauritzen and liquidboy like this.
  3. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
  4. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    7,788
    Likes Received:
    6,079
    Now that I've had more time to mull over it, it's a bit of a shame that FL12_1 is actually not FL12_0. As a developer, to have those 2 features standard in every DX12 card would be quite massive, I think. Then again, perhaps the difference in performance between emulation and hardware is not as large as I think it is.
     
  5. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Yeah but you can't retroactively decide what features are in what hardware. The reality is if FL12_1 was FL12 then there would be very few FL12_1 cards out there and you'd have to treat everything as 11_1 devices, despite most of it being fully capable of "bindless" stuff (i.e. FL12).

    I'm with you in that those two features are great and useful and I want to see them everywhere ASAP but bindless is important too.
     
    Lightman and iroboto like this.
  6. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    646
    Likes Received:
    489
    Location:
    55°38′33″ N, 37°28′37″ E
    http://channel9.msdn.com/Events/Build/2015/3-673

    Advanced DirectX12 Graphics and Performance
    • Date: April 30, 2015 from 2:00PM to 3:00PM
    • Speakers: Max McMullen
    DirectX12 enables graphics intensive apps to deliver better performance with greater flexibility and control. This technical session goes deep into the DirectX12 APIs you can use to reduce CPU rendering overhead, manage GPU resource usage more efficiently, and express the most cutting-edge 3D graphics possible across the spectrum of Windows devices. Whether you are building an app for the phone, PC, or Xbox, you don't want to miss this session.


    http://channel9.msdn.com/Events/Build/2015/2-637

    Game Developers: Get the Most Out of Windows 10
    In this session, we will tour the new APIs, learn techniques and design considerations for building multi-device Windows games, explore how to integrate Windows games with Xbox Live, and discuss updates on the most popular gaming middleware and engines now ready for Windows 10.
     
  7. Max McMullen

    Newcomer

    Joined:
    Apr 4, 2014
    Messages:
    20
    Likes Received:
    104
    Location:
    Seattle, WA
    Thanks Dmitry for posting the Microsoft gaming related talks again. Originally my talk was planned to be a repeat of my GDC talk this year, thus the same title and description, but it now has a lot of new content with one more new Direct3D 12 API feature that I haven't talked about yet.
     
    BRiT, mosen, liquidboy and 3 others like this.
  8. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Emulation of ROV and conservative rasterization is very difficult and would likely have unsolvable corner cases.

    Conservative rasterization could be (at least) partially emulated by doing edge expansion in a geometry shader and adding lots of custom math instead of relying on the fixed function rasterization hardware. However, this would mean that the driver would have to transparently add completely new shader stages (or combine them intelligently if a geometry shader was already present), reroute the data and change the communication behind the scenes. This runs counter to the design goal of creating a low level API with less abstraction. Obviously there would also be a huge performance drop (as geometry shaders are dead slow, especially on AMD hardware).

    ROV emulation would need driver generated data structures for custom global atomic synchronization. DX12 has manual resource management. The programmer manages the memory. It would make the API really bad if you had to ask the driver whether it needs some extra temporary buffers and pass the resource descriptors to it through some side channel. If a programmer wants to emulate ROV, he/she can write the necessary code.

    I don't like the idea of the driver modifying my shaders and data structures based on some arcane IHV specific logic. There would definitely be corner cases where this fails with your particular resource layout or your particular shader. It is impossible to prove the correctness of complex shaders (that include flow control and synchronization with other threads). I don't believe the driver should try to do massive structural transformations to our shader code. Automatic code refactoring should always be verifiable by the programmer; in this case it would be completely hidden.
     
    Lightman likes this.
  9. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    646
    Likes Received:
    489
    Location:
    55°38′33″ N, 37°28′37″ E
    Thank you for the clarifications, Max.

    Hopefully you have an update on Resource Binding Tiers as preliminary MSDN documentation is out of sync with your GDC slides...
     
  10. Max McMullen

    Newcomer

    Joined:
    Apr 4, 2014
    Messages:
    20
    Likes Received:
    104
    Location:
    Seattle, WA
    No update in the talk but I can probably take a look back through this thread after the BUILD conference is over and reply.
     
  11. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    646
    Likes Received:
    489
    Location:
    55°38′33″ N, 37°28′37″ E
    FYI, there are a few mismatches. Firstly, on Resource Binding Tier 2 the maximum number of UAVs in all stages is 64 in your slides but "full heap" in the MSDN docs. Secondly, on Tier 1 the maximum size of descriptor heap is 2^20 in your slides but "~55K" in the MSDN docs.
     
  12. Max McMullen

    Newcomer

    Joined:
    Apr 4, 2014
    Messages:
    20
    Likes Received:
    104
    Location:
    Seattle, WA
    The MSDN docs are based on an earlier version of the spec. A hardware vendor came along with a hardware limitation of 64 UAVs in Tier 2 while meeting all the other specs. We (D3D) didn't want to fork the binding tiers again and so limited all of Tier 2 to 64. My team worked with the Tier 1 hardware vendor that had the 55K limit to find alternate means of programming the GPU. Micro-optimizing the CPU overhead leads to the 65K limit, but there's an alternate path that has slightly more CPU overhead in the state management and overall seems a win given the app complexity of dealing with 55K. As you might guess, the real hardware limit is actually 65K, with some reserved for the driver.

    My slides are correct and MSDN should be updated soon.
     
    pjbliverpool, mosen, Lightman and 3 others like this.
  13. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    646
    Likes Received:
    489
    Location:
    55°38′33″ N, 37°28′37″ E
    OK, thank you.

    But if the Tier 1 limit for the descriptor heap is ~64K, shouldn't it show as 2^16 in your slide, not 2^20, which is actually 1M? It was 2^16 in an earlier IDF2014 presentation, BTW.
     
  14. Max McMullen

    Newcomer

    Joined:
    Apr 4, 2014
    Messages:
    20
    Likes Received:
    104
    Location:
    Seattle, WA
    I wasn't clear. My team worked with the hardware vendor to get rid of the 55K limit by avoiding the hardware limited to 2^16 descriptors. That GPU now has a 2^20 limit by using a little more CPU for state management, so Tier 1 is 2^20 per my slide. The increased CPU overhead is mitigated by simpler app logic for dealing with descriptor heaps.
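
    For readers following along, a small sketch (assuming an existing ID3D12Device* device; illustrative, not production code) of what that limit governs - the shader-visible descriptor heap is simply created with a NumDescriptors count, and per the slides 2^20 is the ceiling for that count on Tier 1:

```cpp
// Sketch: create a shader-visible CBV/SRV/UAV descriptor heap sized near
// the 2^20 ceiling discussed above.
D3D12_DESCRIPTOR_HEAP_DESC heapDesc = {};
heapDesc.Type           = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
heapDesc.NumDescriptors = 1u << 20;   // ~1M descriptors
heapDesc.Flags          = D3D12_DESCRIPTOR_HEAP_FLAG_SHADER_VISIBLE;
heapDesc.NodeMask       = 0;

ID3D12DescriptorHeap* heap = nullptr;
HRESULT hr = device->CreateDescriptorHeap(&heapDesc, IID_PPV_ARGS(&heap));
// If this failed on a ~55K-limited path, the app would have to split its
// descriptors across smaller heaps - the extra complexity mentioned above.
```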
     
    BRiT, DmitryKo, mosen and 1 other person like this.
  15. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    646
    Likes Received:
    489
    Location:
    55°38′33″ N, 37°28′37″ E
    Very interesting, thank you for the explanation.

    These are probably the same programmers who implemented Windows multi-tasking in x86 real mode :)
     
  16. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    For the curious, the limitation was on Haswell specifically, related to how the GPU manages binding updates and versioning. I discussed a little bit about how the new implementation works in my GDC talk:
    https://software.intel.com/sites/de...ndering-with-DirectX-12-on-Intel-Graphics.pdf

    I'm not sure the bit about increased CPU overhead in the driver is even true anymore. Because of the extra information provided by root signatures in DX12, we avoid a lot of the overhead that would have been associated with using this alternate path in previous APIs (DX11, etc.). Thus in my experience this new path is now lower overhead in pretty much all cases across the board (driver, application, runtime, GPU).

    It's one of the reasons I'm a big fan of the DX12 resource binding model vs. alternatives - it efficiently maps to quite a wide variety of hardware architectures while exposing bindless and other great features in a straightforward manner. Kudos again to you guys on that :)
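
    For readers who haven't touched the new binding model yet, here's a minimal sketch (illustrative only, nothing Intel-specific) of the kind of up-front information a root signature gives the driver - one SRV descriptor table plus one root CBV, declared before any draw is recorded:

```cpp
#include <d3d12.h>

// Sketch: a root signature with a single SRV descriptor table and one root
// CBV. Declaring this up front tells the driver exactly which bindings a
// pipeline can touch. Assumes an existing ID3D12Device*; errors not checked.
ID3D12RootSignature* CreateExampleRootSignature(ID3D12Device* device)
{
    D3D12_DESCRIPTOR_RANGE srvRange = {};
    srvRange.RangeType          = D3D12_DESCRIPTOR_RANGE_TYPE_SRV;
    srvRange.NumDescriptors     = 8;   // t0..t7
    srvRange.BaseShaderRegister = 0;
    srvRange.OffsetInDescriptorsFromTableStart = 0;

    D3D12_ROOT_PARAMETER params[2] = {};
    params[0].ParameterType = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE;
    params[0].DescriptorTable.NumDescriptorRanges = 1;
    params[0].DescriptorTable.pDescriptorRanges   = &srvRange;
    params[0].ShaderVisibility = D3D12_SHADER_VISIBILITY_PIXEL;

    params[1].ParameterType = D3D12_ROOT_PARAMETER_TYPE_CBV; // root CBV at b0
    params[1].Descriptor.ShaderRegister = 0;
    params[1].ShaderVisibility = D3D12_SHADER_VISIBILITY_ALL;

    D3D12_ROOT_SIGNATURE_DESC desc = {};
    desc.NumParameters = 2;
    desc.pParameters   = params;
    desc.Flags = D3D12_ROOT_SIGNATURE_FLAG_ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT;

    ID3DBlob* blob = nullptr;
    ID3DBlob* error = nullptr;
    D3D12SerializeRootSignature(&desc, D3D_ROOT_SIGNATURE_VERSION_1,
                                &blob, &error);
    ID3D12RootSignature* rootSig = nullptr;
    device->CreateRootSignature(0, blob->GetBufferPointer(),
                                blob->GetBufferSize(), IID_PPV_ARGS(&rootSig));
    if (error) error->Release();
    blob->Release();
    return rootSig;
}
```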
     
    #196 Andrew Lauritzen, Apr 30, 2015
    Last edited: May 1, 2015
  17. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    7,788
    Likes Received:
    6,079
    DX12 slides posted in the Build 2015 thread.
     
  18. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Interesting read. Thanks for the detailed info!

    So, did I get it correctly: ExecuteIndirect only needs the extra driver generated compute shader call if I change the bindings? Is the basic version (similar to OpenGL MDI) directly supported by the command processor? If I have generated the ExecuteIndirect input arrays ahead of time (on the GPU), is there any way to execute that driver generated compute shader so that there is enough time between that dispatch and the multiple draw calls (indirect parameter setup in a compute shader + a directly following draw causes a pipeline stall)?
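
    For reference, my mental model of the basic version on the app side is roughly this (a sketch with illustrative names; the command signature contains only draw arguments, so no per-draw binding changes):

```cpp
#include <d3d12.h>

// Sketch: "plain MDI" style ExecuteIndirect - the command signature contains
// only a draw-indexed argument, so no driver-generated compute pass should be
// needed for binding changes. Names are illustrative.
void RecordIndirectDraws(ID3D12Device* device,
                         ID3D12GraphicsCommandList* cmdList,
                         ID3D12Resource* argBuffer,    // GPU-written args
                         ID3D12Resource* countBuffer,  // GPU-written count
                         UINT maxDraws)
{
    D3D12_INDIRECT_ARGUMENT_DESC arg = {};
    arg.Type = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW_INDEXED;

    D3D12_COMMAND_SIGNATURE_DESC sigDesc = {};
    sigDesc.ByteStride       = sizeof(D3D12_DRAW_INDEXED_ARGUMENTS);
    sigDesc.NumArgumentDescs = 1;
    sigDesc.pArgumentDescs   = &arg;

    ID3D12CommandSignature* sig = nullptr;
    // Root signature may be null when the signature changes no bindings.
    device->CreateCommandSignature(&sigDesc, nullptr, IID_PPV_ARGS(&sig));

    // GPU-side draw count: the count is read from countBuffer by the GPU.
    cmdList->ExecuteIndirect(sig, maxDraws, argBuffer, 0, countBuffer, 0);
    // In real code the signature would be created once and kept alive until
    // the GPU has finished with it, not created and released per call.
    sig->Release();
}
```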

    In the future it would be awesome to allow compute shaders to write directly to the command queue. But I understand the limitations of the PC (multiple vendors with completely different command processor designs). OpenCL 2.1 manages to do this, however (though it only supports compute, not draw call generation).
     
    BRiT likes this.
  19. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    No problem! Generally we try to be as transparent as possible about how our driver and hardware work so that game developers can understand the performance they see and optimize appropriately.

    Kind of - some of this is still up in the air as to how the driver will implement it on different hardware. The command processor on Haswell can do "draw indirect" natively. It can do MDI with a CPU-side count (by unrolling it on the CPU). While the command processor can kind-of loop, it's not terribly efficient so GPU-side counts probably imply the compute shader path.

    There's also a trade-off based on the number of commands. While the command processor can indeed fetch the arguments for regular draw indirect, its memory path is not as fast as the execution units'. Thus for a sufficient number of draws, it's better to do the compute shader version as well. Where exactly that line is will depend on a few factors, but certainly if you're going to be doing hundreds or thousands of draws it's likely worth doing the CS.

    There's an opportunity for the driver to do this by making use of the resource barrier that is required for the indirect arguments buffer. Remember that compute->3D or 3D->compute transitions already cause a pipeline stall on Haswell though, so the key optimization is to group compute work (including any indirect-argument CS work) together.
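
    A small sketch of that barrier (illustrative names; assuming the arguments were just written by a compute shader through a UAV):

```cpp
#include <d3d12.h>

// Sketch: after a compute pass has written the indirect arguments, transition
// the buffer from UAV writes to the indirect-argument read state before
// calling ExecuteIndirect.
void TransitionArgsForIndirect(ID3D12GraphicsCommandList* cmdList,
                               ID3D12Resource* argBuffer)
{
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type  = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource   = argBuffer;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_INDIRECT_ARGUMENT;
    cmdList->ResourceBarrier(1, &barrier);
}
```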

    I need to check the details again but IIRC OpenCL's solution was more like execute indirect than exposing the command buffer format. In general I'll just say that OpenCL drivers typically play a lot of games to give the illusion of self dispatch... I wouldn't assume that a given feature in OCL is "native" or even particularly efficient without testing a specific implementation.
     
    #199 Andrew Lauritzen, May 3, 2015
    Last edited: May 4, 2015
    BRiT likes this.
  20. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    That is highly appreciated. Intel has clearly stepped up in recent years, thanks to you and the other enthusiastic people in the GPU team.
    We do have a GPU side count and GPU side data. The CPU knows nothing about the rendered scene. A round-trip back to the CPU sounds like a horrible option (but it's likely not as bad on Intel integrated GPUs compared to discrete GPUs). Our pipeline performs fewer than 50 MDI calls (ExecuteIndirect on DX12) in total (including all shadow maps), so the compute shader solution sounds like the best option.

    Wouldn't there be a possibility for a fast path when the draw call count comes from the GPU side and the operation is otherwise comparable to OpenGL 4.3 (CPU draw count) MDI: perform a (driver generated) compute shader that overwrites the draw call count in the MDI packet. This way you don't need to write N indirect draw calls to the buffer (instead you just write a single 4 byte value).
    This is exactly what I was wondering about. If I, for example, need to render 20 shadow maps and perform a single ExecuteIndirect for each, am I able to write the code so that the GPU first performs the 20 (driver generated) compute dispatches that write the commands to the command buffer, and then the GPU performs the draw calls for the shadow maps (with no compute between them)?
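
    In other words, the recording order I'm hoping for would look roughly like this (a sketch with hypothetical names and types; PSO and root binding setup omitted):

```cpp
#include <d3d12.h>

// Hypothetical per-shadow-map data for this sketch.
struct ShadowMapPass
{
    UINT                        groupsX;      // CS groups that write args
    UINT                        maxDraws;
    ID3D12Resource*             argBuffer;
    ID3D12Resource*             countBuffer;
    D3D12_CPU_DESCRIPTOR_HANDLE dsvHandle;
};

// Sketch: batch every argument-writing dispatch first, then one barrier,
// then all the ExecuteIndirect calls, so there is a single compute->3D
// transition instead of one per shadow map.
void RecordShadowMaps(ID3D12GraphicsCommandList* cmdList,
                      ID3D12CommandSignature* sig,
                      ShadowMapPass* passes, UINT passCount)
{
    for (UINT i = 0; i < passCount; ++i)       // 1) all compute dispatches
    {
        // (compute PSO and root bindings for pass i are set here)
        cmdList->Dispatch(passes[i].groupsX, 1, 1);
    }

    // 2) one ResourceBarrier call covering every argument buffer
    //    (UNORDERED_ACCESS -> INDIRECT_ARGUMENT)

    for (UINT i = 0; i < passCount; ++i)       // 3) all the draws
    {
        cmdList->OMSetRenderTargets(0, nullptr, FALSE,
                                    &passes[i].dsvHandle);
        cmdList->ExecuteIndirect(sig, passes[i].maxDraws,
                                 passes[i].argBuffer, 0,
                                 passes[i].countBuffer, 0);
    }
}
```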

    Are the Intel GPUs able to render multiple draw calls simultaneously if I change the render target between them (change RT -> ExecuteIndirect -> change RT -> ExecuteIndirect -> ...)? Obviously I could have my shadow maps in a texture array and use SV_RenderTargetArrayIndex to push different triangles to different shadow maps. This results in a single ExecuteIndirect call that renders all the shadow maps at once. I will PM you about the details.
    Robert Ioffe (Intel) has a nice article that shows big gains from self dispatch on Broadwell (he doesn't give the exact numbers however):
    https://software.intel.com/en-us/ar...ted-parallelism-and-work-group-scan-functions

    This algorithm spawns an unknown number of different kernels. Simple indirect dispatch doesn't solve this case. Self enqueue on OpenCL 2.0 needs to either write to the command queue or have some command processor support (indirect dispatch + a command buffer loop would be enough, since the number of shader permutations is known in advance).
     
    #200 sebbbi, May 4, 2015
    Last edited: May 4, 2015