DirectX 12: The future of it within the console gaming space (specifically the XB1)

Discussion in 'Console Technology' started by Shortbread, Mar 7, 2014.

  1. oldschoolnerd

    Newcomer

    Joined:
    Sep 13, 2013
    Messages:
    65
    Likes Received:
    8
Interesting. According to Brad, the Xbox One will benefit from a potential 300% to 500% performance jump with DX12. However, only engines written from the ground up for DX12 will show this. So we have a wait on our hands...
     
  2. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
Edit: scratch that.

The improvements to the Xbox One GCP may mean it doesn't need to perform the batching job that Mantle does when dealing with large numbers of small batches. Mantle, at the cost of CPU time, will group more batches together before submitting them.

If the CPU is weak, it's actually going to be penalized by this grouping phase. The customizations done to the X1 GCP seem like they could be addressing this directly, keeping as much work as possible off the CPU.

With regard to the Xbox One: that is a lot more focus on the CPU side of things than I ever could have imagined. It's actually shocking; perhaps the toll of running the separate VMs on the CPU is much larger than we expected and impacts game performance more heavily than we thought. Or perhaps compute shader performance is better with a lot of small jobs and tons of dispatches vs. one big dispatch where sync points are involved, etc.

I'm left to ask the obvious question, however: do large batch jobs actually fully utilize the GPU? Do they monopolize the GPU's resources while leaving some of them idle? Is it better to submit a lot of small jobs, eat the cost of the overhead, but be able to cram enough of them into the GPU to fully saturate it?
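    A toy cost model can make the trade-off in the question concrete: every submission pays a fixed launch overhead, so splitting the same work into many small jobs only wins if the improved saturation recovers more time than the extra launches cost. All numbers below are hypothetical, not measurements from any actual GPU.

    ```python
    def total_time_ms(work_ms, n_jobs, launch_overhead_ms, utilization):
        """Time to finish work_ms of GPU work split into n_jobs submissions.
        Each submission pays a fixed launch overhead; 'utilization' models how
        fully the jobs saturate the GPU (many small jobs may pack better)."""
        return work_ms / utilization + n_jobs * launch_overhead_ms

    # One big job that leaves some units idle vs. many small jobs that
    # saturate the GPU but pay per-launch overhead (all numbers made up):
    one_big    = total_time_ms(1000.0, 1,     0.05, utilization=0.70)
    many_small = total_time_ms(1000.0, 1500,  0.05, utilization=0.95)
    too_many   = total_time_ms(1000.0, 10000, 0.05, utilization=0.95)

    print(one_big, many_small, too_many)
    # Moderate splitting wins; extreme splitting loses to launch overhead.
    ```

    With these made-up constants the model reproduces the qualitative pattern discussed later in the thread: a moderate number of submissions beats one monolithic job, while pushing the submission count an order of magnitude higher lets the fixed overhead dominate.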
     
    #802 iroboto, Feb 8, 2015
    Last edited: Feb 8, 2015
  3. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    Option c would be horrible.
     
  4. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
    lol yea I was reading around a bit more, I scratched it after the fact and managed to figure it out lol
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    In compute code I'm working on there's about 10% performance loss from smaller jobs. That's launching about 15 kernels per second, versus about 1500 per second (by launching sub-tasks defined by regions of the "compute grid"). If I try to use substantially more kernels per second e.g. 10,000, performance falls off a cliff (85% performance loss). The kernel itself is pretty unfriendly to the GPU (not able to hide its own latency, though it spends the vast majority of its time in stretches of work without incurring any latency due to branching or memory) which prolly exacerbates the problem.

    At some point I'll have a go at setting up multiple-context enqueuing, because AMD doesn't support OpenCL's out-of-order queue processing.
     
    iroboto likes this.
  6. Metal_Spirit

    Regular

    Joined:
    Jan 3, 2007
    Messages:
    632
    Likes Received:
    397
I wonder. Can this grouping phase ever be done by the GPGPU?
    And are the Xbox One GCP customizations hardware-based, or just firmware?
     
  7. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
Hardware, from what I understand. We have a section about it in the leaked SDK thread. I'm not sure if grouping is required. The frame gain is minimal, if that.

    edit: well, 3-6 fps seems minimal, but it's actually close to 10-20% lol, so maybe significant.
     
    #807 iroboto, Feb 8, 2015
    Last edited: Feb 8, 2015
  8. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
Thanks Jawed! This is actually really good info. I guess the push for a faster GCP is just for CPU purposes. Unless scheduling is heavily improved, it's impossible to know what performance benefits it brings without benchmarks. But my thoughts about async compute tasks are dashed. I was worried about GPU/CPU stalling being a bad thing, but it can't be as bad as an 85% drop-off.

    Though, would OpenCL and a DX12 compute shader be the same in terms of overhead? I guess such things will be tested in the future.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
There's a single memory copy at either end of the task, so memory copy overheads are not contributing to the performance loss. I'm purely seeing some kind of kernel launch overhead, which I'd expect to be the same regardless of compute API. I haven't analysed this overhead across a variety of kernels, so my experience is just a taste of the pitfalls.

My experiment was to get a feel for the effect on system responsiveness, since this kernel can run for as long as 1.3 s when I really give it work to do (on a 1 GHz 7970), which makes the Windows desktop really juddery. So my intention is to split it up into roughly 1/100ths to retain responsiveness.

    I'm still optimising the kernel. It was taking 3s and I have a couple more tricks up my sleeve. Hopefully that'll compensate for the loss in performance due to the use of sub-kernels.
     
  10. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
Intel GPUs seem to be getting quite big gains from OpenCL 2.0 nested parallelism (GPU-side enqueue). In this example they got nice gains from separating the launch code into a tiny separate kernel: https://software.intel.com/en-us/ar...ted-parallelism-and-work-group-scan-functions

Aren't AMD's OpenCL 2.0 drivers still in beta? The performance issues might be driver related. It's kind of sad that AMD has had OpenCL 2.0 hardware available for a long time, but Intel beat them in the driver race (Broadwell supported OpenCL 2.0 at launch and the drivers seem solid).
     
    Jwm, chris1515, mosen and 1 other person like this.
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I'm writing something that I hope will run on NVidia (since the previous version does) and so sticking to OpenCL 1.1. Some time later I might try OpenCL 2.

    Unfortunately I'm working with huge amounts of intermediate data (about 10GB per kernel invocation in the extreme case) that needs to be sorted. My sort is about 10x faster than off the shelf sorts, but that's because I don't need ordering, merely to know which are the best 128 items from a 624 long list (there's millions of these lists to sort per kernel invocation). I get that performance using registers across cooperating work items to hold the list rather than a combination of global and local memory.
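    The partial-selection shortcut described above (keep only the best k items, never fully order the list) can be sketched on the CPU with Python's `heapq`; this is just an illustration of the algorithmic idea in O(n log k), not the register-based GPU implementation the post describes. The list length 624 and k = 128 are taken from the post.

    ```python
    import heapq

    def best_k(items, k=128):
        """Return the k smallest values. We only need membership in the
        best-k set; heapq.nsmallest happens to return them sorted, which
        is more than required but harmless."""
        return heapq.nsmallest(k, items)

    # A 624-long list, as in the post, in worst-case (descending) order:
    lst = list(range(624, 0, -1))
    top = best_k(lst, 128)
    print(len(top), max(top))  # 128 items; none worse than the 128th best
    ```

    The GPU version gets its ~10x speedup by holding the running best-k in registers across cooperating work items instead of bouncing through global and local memory, but the selection logic is the same.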

    A fundamental problem with OpenCL 2.0 GPU enqueue is that newly generated parent data has to go off chip to be used by the child kernels. Sure caching might work, but since I'm sticking with OpenCL 1.1 for the time being, it's going to be a while before I get to explore whether it's possible to make caching work by constructing a suitably fine-grained cluster of parents and their children. I'm doubtful.

    Ultimately I'm looking at finding the best 128 items from ~4000-item long lists for millions of lists (~80 GB of raw data), potentially using multiple kernels in succession per list, which would necessitate off-chip storage twixt kernels. Which would also require that I break up the work into sub-domain kernels, since the minimum working set is a smidgen more than 1GB, and I'd prefer not to exclude 1GB cards.
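    Breaking the work into sub-domain kernels so the working set fits a memory budget, as suggested above, might look like this sketch (the function name and the numbers in the example are made up for illustration):

    ```python
    def sub_domains(n_lists, bytes_per_list, budget_bytes):
        """Yield (start, end) index ranges over the lists such that each
        range's working set stays within budget_bytes."""
        per_batch = max(1, budget_bytes // bytes_per_list)
        for start in range(0, n_lists, per_batch):
            yield start, min(start + per_batch, n_lists)

    # e.g. ~80 GB of raw data over a hypothetical 1 GB working-set budget:
    ranges = list(sub_domains(n_lists=2_000_000,
                              bytes_per_list=40_000,
                              budget_bytes=1 << 30))
    print(len(ranges))  # number of sub-domain kernel launches needed
    ```

    Each range would become one kernel launch, trading the launch overheads discussed earlier for the ability to run on cards with small memory.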

    The Intel article's use of GPU enqueue seems to be partly about not being dependent upon the CPU to set up kernel launch parameters, based on results produced. I imagine you were alluding to that aspect for my purposes, but not necessarily with varying parent-data-dependent kernel parameters. That use case, where a parent kernel simply issues sub-domain kernels, does sound like an interesting way to avoid launch overheads that arise due to CPU/GPU interaction. Though I can imagine it still being subject to the command processor bottleneck, if indeed that is what's happening.

    Catalyst Omega has OpenCL 2.0 support. I'm unclear if that is beta, per se. The APP SDK, version 3.0 with OpenCL 2.0 support is itself beta.

    http://developer.amd.com/community/blog/2014/12/09/amd-app-sdk-3-0-beta/
     
  12. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,237
    Likes Received:
    4,260
    Location:
    Guess...
    Starx likes this.
  13. Jwm

    Jwm
    Veteran

    Joined:
    Feb 27, 2013
    Messages:
    1,037
    Likes Received:
    155
    Location:
    Texas
Mainly a CPU-bound scenario though, so the DX11/low-level API is only feeding the GPU through one CPU core (or so they say). So this benchmark would be nearly the same, but on the GPU side we do not yet fully know what the X1 will contain (outside of those under NDA).
     
  14. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    No, not impressive at all: "these results are with D3D 11 deferred contexts disabled".
     
  15. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
    The only thing I was excited about from Brad Wardell's interview was the introduction of a new Star Control game. How is the world not going crazy over that? Like what gives seriously.
     
    Jwm likes this.
  16. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,237
    Likes Received:
    4,260
    Location:
    Guess...
    It's still a massive performance boost as long as you're comparing the same scenarios:

    https://forum.beyond3d.com/posts/1823756/
     
    shredenvain and iroboto like this.
  17. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
This just seems like blatant trolling. I'm not sure what your definition of impressive is, but DX12 has enough power with 2-4 CPU cores to max out the graphics command processor of GCN GPUs.

    Now I know you know more than me, but I think it's fair to say most developers on this board would agree that DX11 was never capable of such a feat, and the reason we didn't see a lot of deferred contexts used in DX11 games is that DX11 never did a great job with them anyway. There are enough bar graphs to go around showing how full the immediate-context thread is compared to the deferred ones. I'll let the numbers speak for themselves.
     
    mosen, 3dcgi and liquidboy like this.
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I would be curious to see the test run on a console APU.
    The Anandtech tests were run on Intel desktop cores, so while they were able to launch sufficient batches, they operate in an entirely different league in terms of performance.

The 8-core Jaguar APU is roughly equivalent, throughput-wise, to a 4-core Steamroller (closely enough that Orbis was once rumored to have been exactly that), and in many games we see AMD needing a whole 2-core module to match a single Intel core.
    In the console space, it would be 6 to almost 7 cores due to the system reserve, with some possible performance loss from straddling the non-unified L2s.

    The console GPUs in this case may not be that far behind in command processor capability, as 800-850 MHz is not a large drop from the desktop GPU clocks regardless of whether the secondary command processors ever become available to games.
     
    mosen and pjbliverpool like this.
  19. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
It doesn't matter. The problem is that the benchmark was run without deferred contexts (DC) for a game that was proven to gain a lot from DC, so the benchmark differences are inflated.
    My problem with DX12 and all the praise it gets is that it's no different from Mantle (in my opinion, it's 99.999% Mantle), and the only reason for its existence is that Nvidia and Microsoft weren't willing to use Mantle.
     
  20. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    ... and Intel, Imagination (PowerVR) and Qualcomm. All of these companies are involved in DirectX 12. I think it is extremely valuable that Microsoft is pushing this highly efficient API also to mobile devices. Mobile devices gain the most from an efficient low level API.
     