DirectX 12: The future of it within the console gaming space (specifically the XB1)

Discussion in 'Console Technology' started by Shortbread, Mar 7, 2014.

  1. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,632
    Likes Received:
    1,250
    Location:
    British Columbia, Canada
    I'm talking about PC. On console it's somewhat less relevant as very few developers who are willing to target - say - PS3 are scared of a "low level" API like libgcm. And rightfully so... these APIs aren't really difficult per se, especially when they are only for one piece of pre-determined hardware.

    Also note that when I say "engines", I'm talking about technology created by graphics "experts" that are used across multiple titles. Ex. Frostbite is obviously an engine even though it is not sold as middleware.

    That further supports my point - if they haven't moved forward to even DX10/11 then it has nothing to do with the "ease of use" of the API and continuing to cater to programmers who want "safer, easier" APIs is wasted effort.

    I don't think the resourcing has ever been a huge concern to be honest... as you yourself point out, to these big companies it's peanuts. I wouldn't be surprised if some misguided notion of "protecting" the advantages of the Xbox platform vs. PC have played a role in the past, but I don't think anyone has really said "we're not doing this because it would take some time".

    Obviously DX12 as it is defined would not have worked on hardware 12 years ago, so you can hand wave about a "DX12-like API" but I think it's far from clear that you could do a similarly portable and efficient API more than a few years ago.

    They couldn't support WDDM2.0 for one, which is an important part of the new API semantics. It hasn't been that long that GPUs have had properly secured per-process VA spaces - certainly 12 years ago there were a lot of GPUs still using physical addressing and KMD patching, hence the WDDM1.x design in the first place.

    Ha, don't be fooled by their re-positioning - do the math and it's about the same cost as it was before for a AAA studio. The only difference is it's somewhat more accessible to indie devs now too, à la Unity.

    Not sure which developers you were "preaching" to, but as I said for as long as I've been in graphics it has been clear to at least the AAA folks.

    Meh, you can already do that with "compute", and the notion isn't even well-defined with the fixed function hardware. There are sort/serialization points in the very definition of the graphics pipeline - you can't just hand wave those away and "we'll just talk *directly* to the hardware this time guys!". Talking directly to the hardware *is* talking to the units that spawn threads, create rasterizer work, etc.

    By all means expose the CP directly and make it better at what it does! But that's arguing my point: you're effectively just putting a (currently kinda crappy) CPU on the front of the GPU. That's totally fine, but the notion that the "GPU is a better CPU" is kind of silly if your idea of a GPU includes a CPU. And you're going to start to ask why you need a separate CPU-like core just to drive graphics on an SoC that already has very capable ones nearby...
     
    #881 Andrew Lauritzen, Feb 12, 2015
    Last edited: Feb 12, 2015
    Starx, Jwm, liquidboy and 2 others like this.
  2. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,411
    Location:
    Wrong thread
    DX12 may very well be the API (or one of them) that drives the next Xbox.

    How would you like to see CPU/GPU integration progress to support a DX12 like API on a next gen console?
     
  3. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,632
    Likes Received:
    1,250
    Location:
    British Columbia, Canada
    That's kind of a big topic for another thread I think, but at a minimum you need tighter integration of caches between CPU/GPU (at least something like Haswell's LLC) with some basic controls for coherency (even if explicit), shared virtual memory (consoles can get away with shared physical, I guess) and efficient, low-latency atomics and signaling mechanisms between the two. The multiple compute queues stuff in the current gen is a good start, but ultimately I'd like to see that a bit finer grained - i.e. warp-level work stealing on the execution units themselves rather than everything having to go through the frontend.
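    A toy CPU-side sketch (Python, purely illustrative - real hardware would do this in the execution units, not in software, and all names here are invented) of what warp-level work stealing buys: an idle unit pulls work directly from a peer's queue instead of round-tripping through a central frontend:

    ```python
    import collections
    import random

    # Toy work-stealing model: each "execution unit" has its own deque and
    # pops work locally; when its deque is empty, it steals from another
    # unit's tail instead of going back through a central dispatcher.

    def run_units(tasks, n_units, seed=0):
        rng = random.Random(seed)
        deques = [collections.deque() for _ in range(n_units)]
        for i, t in enumerate(tasks):
            deques[i % n_units].append(t)   # initial distribution of work
        done = [0] * n_units
        while any(deques):
            for uid, dq in enumerate(deques):
                if dq:
                    dq.popleft()            # local pop from own queue head
                    done[uid] += 1
                else:
                    victims = [d for d in deques if d]
                    if victims:
                        rng.choice(victims).pop()   # steal from a victim's tail
                        done[uid] += 1
        return done

    done = run_units(range(100), 4)
    print(sum(done))  # 100: all work completed without a central frontend
    ```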
     
    sebbbi likes this.
  4. mosen

    Regular

    Joined:
    Mar 30, 2013
    Messages:
    452
    Likes Received:
    152
    I'm eager to know more about WDDM 2.0 and its specifications. Is it possible for all GPUs (with FL_11) to fully support WDDM 2.0, or does it need new hardware?
     
  5. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    Err, and why do you need that to reduce draw call pressure, exactly?

    Yeah, but the problem is: AAA studios are not buying anymore, hence the "accessible for indy" repositioning.

    Yep, and I don't see any problem with that, just need a compiler, obviously.

    No, it doesn't; I'm describing the current state of affairs, where each GPU has a "crappy CPU" attached.
    I would argue that making it possible to drive FFP parts from GPU code eliminates the need for any CPU altogether.
     
  6. forumaccount

    Newcomer

    Joined:
    Jan 30, 2009
    Messages:
    140
    Likes Received:
    86
    And here I am thinking it's already too easy to hang GPUs.
     
    iroboto and liquidboy like this.
  7. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,632
    Likes Received:
    1,250
    Location:
    British Columbia, Canada
    So that you can bake GPU virtual addresses into user-visible structures (descriptors, etc) and avoid any submission-time patching based on residency (WDDM1.x style). i.e. to get rid of the "KM Driver" block in the diagram from here:
    http://blogs.msdn.com/b/directx/archive/2014/03/20/directx-12.aspx
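    As an illustration (a toy Python model, not driver code - `residency`, the command tuples, and all names here are invented): with WDDM1.x-style addressing, the kernel-mode driver must re-patch pointers at every submission based on where allocations currently live, whereas a stable per-process GPU VA can be baked into descriptors once:

    ```python
    # Toy model: WDDM1.x-style submit-time patching vs. baked virtual
    # addresses. Names (residency table, command tuples) are illustrative.

    # --- WDDM1.x style: commands reference allocation handles; the KMD
    # patches in the current physical address at every submission, since
    # an allocation may have moved since the command list was recorded.
    residency = {"tex0": 0x1000}          # handle -> current physical address

    def submit_with_patching(command_list):
        return [(op, residency[handle]) for op, handle in command_list]

    cmds = [("bind", "tex0")]
    print(submit_with_patching(cmds))     # patched at submit: [('bind', 4096)]

    residency["tex0"] = 0x8000            # allocation moved between frames
    print(submit_with_patching(cmds))     # re-patched: [('bind', 32768)]

    # --- WDDM2.0 style: the GPU virtual address is stable for the process
    # lifetime, so it is baked into the descriptor up front and the command
    # list is submitted as-is, with no kernel-mode patching pass.
    BAKED_VA = 0x200000
    cmds_baked = [("bind", BAKED_VA)]
    print(cmds_baked == [("bind", BAKED_VA)])  # True: nothing to patch
    ```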
     
    psorcerer, mosen, sebbbi and 3 others like this.
  8. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    This is also one of the reasons why deferred contexts in DX11 are practically useless. You can't pre-patch the deferred command lists on the other cores, because you don't know when the deferred lists are going to be submitted (residency is not yet known). You need to patch when the list is submitted, meaning that the main thread (controlling the immediate context) needs to do it. It needs to go through all the lists by all the cores as it is the only thread actually submitting anything to the GPU.

    This is how you should do it in DX11:

    GCN Performance Tip 31: A dedicated thread solely responsible for making D3D calls is usually the best way to drive the API.

    Notes: The best way to drive a high number of draw calls in DirectX 11 is to dedicate a thread to graphics API calls. This thread's sole responsibility should be to make DirectX calls; any other types of work should be moved onto other threads (including processing memory buffer contents). This graphics "producer thread" approach allows the feeding of the driver's "consumer thread" as fast as possible, enabling a high number of API calls to be processed.

    DirectX11 Deferred Contexts will not achieve faster results than this approach when it is implemented correctly.
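    A minimal sketch of that producer-thread pattern (Python threads for illustration; the real thing would be C++ driving the D3D11 immediate context, and `submitted.append` stands in for the actual API call):

    ```python
    import queue
    import threading

    N_WORKERS = 4

    def worker(cmd_queue, draw_items):
        # Game threads do culling/gathering and enqueue ready-to-issue
        # command descriptions; they never touch the API themselves.
        for item in draw_items:
            cmd_queue.put(("draw", item))

    def render_thread(cmd_queue, submitted, n_workers):
        # The single dedicated thread is the only one making "API calls",
        # feeding the driver's consumer thread as fast as possible.
        done = 0
        while done < n_workers:
            cmd = cmd_queue.get()
            if cmd is None:
                done += 1          # one sentinel per finished worker
            else:
                submitted.append(cmd)  # stands in for a D3D11 draw call

    cmd_queue = queue.Queue()
    submitted = []
    renderer = threading.Thread(
        target=render_thread, args=(cmd_queue, submitted, N_WORKERS))
    renderer.start()
    workers = [threading.Thread(
        target=worker, args=(cmd_queue, range(i * 10, i * 10 + 10)))
        for i in range(N_WORKERS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    for _ in range(N_WORKERS):
        cmd_queue.put(None)        # signal end of work
    renderer.join()
    print(len(submitted))          # 40: every call issued by one thread
    ```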
     
  9. oldschoolnerd

    Newcomer

    Joined:
    Sep 13, 2013
    Messages:
    65
    Likes Received:
    8
    So fundamentally this is what DirectX 12 improves upon: each core can be a producer thread, concurrently. Have you had a go on it yet? Would love to know your views on Mr Wardell and his 500% performance improvement claims. To my mind this would only be possible if the GPU was currently only 20% utilised ...
     
  10. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    Cool, but it seems like that can be accomplished in other ways (i.e. you just need one level of indirection; it doesn't require full-fledged VA, although solving it through VA looks "right").
    Anyway, the change is not that big, and there were quite a lot of changes in GPUs for DX10 and DX11 anyway. I still fail to see why it couldn't have been done earlier, or at least why it's more complex than, say, tessellation.
     
  11. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
    I think you missed out on the discussion of large batched jobs and small batched jobs from earlier.
     
  12. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    I would assume that the old PC GPUs used 32 bit physical memory pointers. No paging and no virtual memory, just raw physical memory pointers to the GPU memory. Nowadays you have 64 bit pointers and virtual memory with CPU compatible page sizes and cache line sizes. I believe this is a much bigger change than tessellation or geometry shaders for example.

    A good analogy would be compute shaders. Compute shaders would have been problematic with DX9 hardware. DX10 brought us unified shaders, and geometry shaders needed some on-chip scratch pad to store their results. Later the on-chip scratch pad buffer was exposed to developers (as thread local shared memory) and it was straightforward to extend the existing unified shading cores with synchronization primitives to allow threads to cooperate with each other (= compute shaders). DX10 also introduced integer arithmetic (very useful for address calculation, for example).
     
    #892 sebbbi, Feb 13, 2015
    Last edited: Feb 13, 2015
    mosen likes this.
  13. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
    I'm going to try my best to apply some of what I learned from this thread, let me know how I did.

    As soon as I read what you wrote, sebbbi, it completely and entirely brought me back to the Eurogamer Titanfall dev interview:


    To note, Baker writes: in the BEST case this removes frame latency; in the worst case it was probably the cause of the frame drops when there were too many draw calls.

    Interestingly enough, they also used a lot of compute shaders; if you were really GPU bottlenecked from an ALU perspective, I'd imagine you'd try to move some of those tasks back to the CPU. The 792p resolution is not reflective of a lack of available GPU resources in this case.

    It sounds to me, after taking in everything I possibly can from reading all your thoughts, that DX12 is massively beneficial especially to weaker CPUs. The claim that DX12 only benefits games that are CPU bound is too generic a statement.

    In this case Respawn traded off batching all its draw calls to reduce the frame latency of that batch. They wanted the render without latency, so they submitted as many draw calls as possible instead of sending large batches, and as you wrote, sebbbi, GCN demands an entire thread/core dedicated to this job. Because of this they multi-threaded the culling and gathering of objects to be rendered, a situation I don't think would occur if each thread could draw, cull and gather on its own. They basically have this one thread/core waiting to do what the main thread would normally do.

    Respawn writes:
    So synchronization between cores is another issue that needs to be contended with, especially if you are sending information back to thread 0 to render, but before that thread 1 needs to cull and gather items. All these moving parts exist on the CPU side because it's attempting to do its absolute best at removing frame latency. If they didn't have to, they would have extra time to do things properly, to a degree, I imagine.

    I can imagine that in TF's scenario the CPU couldn't send enough work to saturate the GPU, but the time available to render at 60 fps is still ~16 ms, so when certain larger jobs came into play the GPU had very limited time left to do them, and with less burst muscle available, the only way to make that frame interval was ultimately to reduce resolution.

    I still think they suffered with ESRAM, but it sounds like if they'd had DX12 before launch instead of de facto DX11, Titanfall would be performing much better.
     
    #893 iroboto, Feb 13, 2015
    Last edited: Feb 13, 2015
  14. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
  15. Jwm

    Jwm
    Veteran

    Joined:
    Feb 27, 2013
    Messages:
    1,037
    Likes Received:
    155
    Location:
    Texas
    ^ Interesting results
     
  16. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    17,884
    Likes Received:
    5,334
    I think that's going to turn out as "some function runs 500% quicker", not "we'll see a 500% fps increase".
     
  17. forumaccount

    Newcomer

    Joined:
    Jan 30, 2009
    Messages:
    140
    Likes Received:
    86
    The most charitable interpretation is that he meant command buffer creation time could be 5x faster because it's running on 6 threads. As you state it has nothing to do with framerate at all.
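    Rough numbers for that reading (fractions assumed purely for illustration): even if command-buffer recording parallelizes across 6 threads, any remaining serial work (submit, present, etc.) caps the speedup per Amdahl's law:

    ```python
    # Back-of-envelope: speedup of command-buffer creation when recording
    # is spread over 6 threads, with some fraction left serial.
    # The parallel fractions below are assumed for illustration.

    def amdahl(parallel_fraction, n_threads):
        # Classic Amdahl's law: serial part runs at 1x, parallel part at Nx.
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_threads)

    print(round(amdahl(1.00, 6), 2))  # 6.0 - perfectly parallel recording
    print(round(amdahl(0.95, 6), 2))  # 4.8 - 5% serial already drops it near 5x
    print(round(amdahl(0.80, 6), 2))  # 3.0 - 20% serial halves the ideal gain
    ```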
     
  18. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    It's trivial to do real-time address patching in this one (even more trivial if you have 64 -> 32 bit). Do you remember that the bytecode is completely controlled by the driver? You just never load any address without adding an offset to it, and voilà - you have a "VA-like" thing in software. Google did it for x86.
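    A toy sketch of that indirection (Python; the bytecode format and every name here are invented for illustration): the driver rewrites each absolute load into a base-relative one, so allocations can be relocated just by changing the base:

    ```python
    # Toy sketch of the suggestion above: the driver owns the bytecode, so
    # it can rewrite every address load as base + offset, giving a "VA-like"
    # indirection in software (cf. software fault isolation on x86).

    def rewrite(bytecode, base_reg="rB"):
        # Turn every absolute LOAD into a base-relative LOAD_REL.
        out = []
        for instr in bytecode:
            if instr[0] == "LOAD":
                _, dst, addr = instr
                out.append(("LOAD_REL", dst, base_reg, addr))  # dst = mem[rB + addr]
            else:
                out.append(instr)
        return out

    def run(bytecode, memory, base):
        # Tiny interpreter for the rewritten program.
        regs = {"rB": base}
        for instr in bytecode:
            if instr[0] == "LOAD_REL":
                _, dst, breg, off = instr
                regs[dst] = memory[regs[breg] + off]
        return regs

    memory = {0x1000 + 4: 42}                    # allocation lives at base 0x1000
    prog = rewrite([("LOAD", "r0", 4)])          # program thinks it loads address 4
    print(run(prog, memory, base=0x1000)["r0"])  # 42: relocated transparently
    ```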
     
  19. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,632
    Likes Received:
    1,250
    Location:
    British Columbia, Canada
    Physical address patching/resolution must be done in kernel mode for security reasons. Hence the whole allocation/patch lists and KMD design in WDDM1.0.
     
  20. Metal_Spirit

    Regular

    Joined:
    Jan 3, 2007
    Messages:
    632
    Likes Received:
    397
    And is there any other way to think?
    Since DX12 will not improve hardware calculation performance (no API can improve the hardware's capability), what we will have is a lot of bottlenecks removed, especially on the CPU side.
    A 500% increase would mean a scenario where only 20% of the GPU's full power was in use. That is not a normal scenario, but a very specific one.

    Wardell is talking about theoretical maximum gains, but let's not forget Star Swarm is a stress test. It is made to create the maximum possible bottleneck on the CPU so that we can see the gains of a low-level API.
    It is in no way a usual scenario in current games. Just check the Mantle gains in BF4 and other games and you will see that average gains are much, much smaller. And may I remind you that Mantle beat DX12 in the Anandtech tests?
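    The 20% figure follows from simple arithmetic (numbers assumed for illustration): if the CPU takes the whole frame to issue work while the GPU is busy for only 20% of it, removing the CPU bottleneck entirely yields at most 1/0.2 = 5x:

    ```python
    # Back-of-envelope for the "500% => 20% GPU utilization" point.
    # All numbers are assumed for illustration.

    frame_cpu_ms = 16.6          # CPU time to build/submit one frame (the bottleneck)
    gpu_busy_ms = 16.6 * 0.20    # GPU is only busy 20% of that frame

    fps_before = 1000.0 / frame_cpu_ms   # frame rate limited by the CPU
    fps_after = 1000.0 / gpu_busy_ms     # CPU cost removed entirely: GPU-limited

    print(round(fps_after / fps_before, 1))  # 5.0 -> the "500%" best case
    ```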
     