DirectX 12: Its future in the console gaming space (specifically the XB1)

Edit: scratch that.

The improvements to the Xbox One GCP may mean it doesn't need to perform the batching job that Mantle does when dealing with large numbers of small batches. Mantle, at the cost of CPU time, groups more batches together before submitting them.

If the CPU is weak, it's actually going to be penalized by this grouping phase. The customizations done to the X1 GCP seem like they could be addressing this directly, so the aim would be to keep as much of that work as possible off the CPU.
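Purely as an illustration of the trade-off (hypothetical C, with invented names like cmd_t and submit_to_gpu; this is not Mantle, D3D12, or XB1 API):

/* Hypothetical sketch of the grouping trade-off: instead of handing every
   small command to the GPU individually, the CPU packs them into groups
   first. The packing loop is exactly the CPU cost a weak CPU gets hit by;
   a command processor that can swallow many small submissions directly
   would make it unnecessary. cmd_t and submit_to_gpu() are invented names. */
#define GROUP_SIZE 64

typedef struct { int state; } cmd_t;                 /* stand-in for recorded draw/dispatch state */

void submit_to_gpu(const cmd_t *cmds, int count);    /* hypothetical driver entry point */

void submit_grouped(const cmd_t *cmds, int total)
{
    for (int i = 0; i < total; i += GROUP_SIZE) {
        int n = (total - i < GROUP_SIZE) ? (total - i) : GROUP_SIZE;
        submit_to_gpu(&cmds[i], n);                  /* one submission per group of small batches */
    }
}

void submit_ungrouped(const cmd_t *cmds, int total)
{
    for (int i = 0; i < total; ++i)
        submit_to_gpu(&cmds[i], 1);                  /* one submission per batch: more front-end pressure */
}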

With respect to the Xbox One: that is a lot more focus on the CPU side of things than I ever could have imagined. It's actually shocking; perhaps the toll on the CPU from running the separate VMs is much larger than we expected and impacts game performance far more heavily than we thought. Or perhaps compute shader performance is better when working with a lot of small jobs and tons of dispatches versus one dispatch where sync points are involved, etc.

I'm left to ask the obvious question, however: do large batch jobs actually fully utilize the GPU? Do they monopolize the GPU's resources without using them all? Is it better to submit a lot of small jobs, eat the cost of the overhead, but be able to cram enough smaller jobs into the GPU to fully saturate it?
 
Or perhaps compute shader performance is better when working with a lot of small jobs and tons of dispatches versus one dispatch where sync points are involved, etc.

I'm left to ask the obvious question, however: do large batch jobs actually fully utilize the GPU? Do they monopolize the GPU's resources without using them all? Is it better to submit a lot of small jobs, eat the cost of the overhead, but be able to cram enough smaller jobs into the GPU to fully saturate it?
In compute code I'm working on there's about a 10% performance loss from smaller jobs. That's launching about 15 kernels per second versus about 1500 per second (by launching sub-tasks defined by regions of the "compute grid"). If I try to use substantially more kernels per second, e.g. 10,000, performance falls off a cliff (85% performance loss). The kernel itself is pretty unfriendly to the GPU (not able to hide its own latency, though it spends the vast majority of its time in stretches of work without incurring any latency due to branching or memory), which probably exacerbates the problem.
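Roughly, the sub-launches look like this (a minimal OpenCL 1.1 host-side sketch, not my actual code; queue and kernel are assumed to have been created already, and the kernel uses get_global_id() so the offset is honoured):

/* Minimal sketch: splitting one big 1D launch into regions of the compute
   grid using the global work offset (supported since OpenCL 1.1). Each
   clEnqueueNDRangeKernel call pays its own launch overhead. */
#include <CL/cl.h>

void launch_in_regions(cl_command_queue queue, cl_kernel kernel,
                       size_t total_work, size_t region_size)
{
    for (size_t offset = 0; offset < total_work; offset += region_size) {
        size_t chunk = (total_work - offset < region_size)
                     ? (total_work - offset) : region_size;
        clEnqueueNDRangeKernel(queue, kernel, 1,
                               &offset, &chunk, NULL,   /* offset selects the region */
                               0, NULL, NULL);
        clFlush(queue);   /* push the region to the GPU while the next is enqueued */
    }
    clFinish(queue);
}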

At some point I'll have a go at setting up multiple-context enqueuing, because AMD doesn't support OpenCL's out-of-order queue processing.
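A lighter variant that's often tried first is several in-order queues on a single context, rather than multiple contexts; a sketch, assuming the sub-tasks are independent and that ctx, dev and kernel already exist:

/* Sketch: several in-order queues on one context as a stand-in for
   out-of-order queue processing. Independent sub-launches are round-robined
   across queues so the driver/GPU is free to overlap them. Error handling
   is omitted for brevity. */
#include <CL/cl.h>

enum { NUM_QUEUES = 4, NUM_SUBTASKS = 100 };

void launch_across_queues(cl_context ctx, cl_device_id dev, cl_kernel kernel,
                          size_t subtask_size)
{
    cl_command_queue queues[NUM_QUEUES];
    cl_int err;

    for (int q = 0; q < NUM_QUEUES; ++q)
        queues[q] = clCreateCommandQueue(ctx, dev, 0, &err);   /* plain in-order queues (OpenCL 1.1) */

    for (int i = 0; i < NUM_SUBTASKS; ++i) {
        size_t offset = (size_t)i * subtask_size;              /* each sub-task covers one region */
        clEnqueueNDRangeKernel(queues[i % NUM_QUEUES], kernel, 1,
                               &offset, &subtask_size, NULL, 0, NULL, NULL);
        clFlush(queues[i % NUM_QUEUES]);
    }

    for (int q = 0; q < NUM_QUEUES; ++q) {
        clFinish(queues[q]);
        clReleaseCommandQueue(queues[q]);
    }
}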
 
If the CPU is weak, it's actually going to be penalized by this grouping phase. The customizations done to the X1 GCP seem like they could be addressing this directly, so the aim would be to keep as much of that work as possible off the CPU.

I wonder. Can this grouping phase ever be done by the GPGPU?
And are the Xbox One GCP customizations hardware-based, or just firmware?
 
I wonder. Can this grouping phase ever be done by the GPGPU?
And are the Xbox One GCP customizations hardware-based, or just firmware?
Hardware, from what I understand. We have a section about it in the leaked SDK thread. I'm not sure if grouping is required. The frame gain is minimal, if that.

edit: well, 3-6 fps seems minimal, but off a ~30 fps baseline that's close to 10-20%, lol, so maybe significant.
 
In compute code I'm working on there's about a 10% performance loss from smaller jobs. That's launching about 15 kernels per second versus about 1500 per second (by launching sub-tasks defined by regions of the "compute grid"). If I try to use substantially more kernels per second, e.g. 10,000, performance falls off a cliff (85% performance loss). The kernel itself is pretty unfriendly to the GPU (not able to hide its own latency, though it spends the vast majority of its time in stretches of work without incurring any latency due to branching or memory), which probably exacerbates the problem.

At some point I'll have a go at setting up multiple-context enqueuing, because AMD doesn't support OpenCL's out-of-order queue processing.
Thanks Jawed! This is actually really good info. I guess the push for a faster GCP is just for CPU purposes. Unless scheduling is heavily improved, it's impossible to know what performance benefits it brings without benchmarks. But my thoughts about async compute tasks are dashed. I was worried about GPU/CPU stalling being a bad thing, but it can't be as bad as an 85% drop-off.

Though, would OpenCL and a DX12 compute shader be the same in terms of overhead? I guess such things will be tested in the future.
 
There's a single memory copy at either end of the task, so memory-copy overheads are not contributing to the performance loss. I'm purely seeing some kind of kernel launch overhead, which I'd expect to be the same regardless of compute API. I haven't analysed this overhead across a variety of kernels, so my experience is just a taste of the pitfalls.

My experiment was to get a feel for the effect on system responsiveness, since this kernel can run for as long as 1.3s when I really give it work to do (on a 1GHz 7970), which makes the Windows desktop really juddery. So my intention is to split it up into roughly 1/100ths to retain responsiveness.

I'm still optimising the kernel. It was taking 3s and I have a couple more tricks up my sleeve. Hopefully that'll compensate for the loss in performance due to the use of sub-kernels.
 
I'm still optimising the kernel. It was taking 3s and I have a couple more tricks up my sleeve. Hopefully that'll compensate for the loss in performance due to the use of sub-kernels.
Intel GPUs seem to be getting quite big gains from OpenCL 2.0 nested parallelism (GPU-side enqueue). In this example they got nice gains from separating the launch code into a tiny separate kernel: https://software.intel.com/en-us/ar...ted-parallelism-and-work-group-scan-functions

Aren't AMD's OpenCL 2.0 drivers still in beta? Performance issues might be driver-related. It's kind of sad that AMD has had OpenCL 2.0 hardware available for a long time, but Intel beat them in the driver race (Broadwell supported OpenCL 2.0 at launch and the drivers seem solid).
 
I'm writing something that I hope will run on NVidia (since the previous version does) and so sticking to OpenCL 1.1. Some time later I might try OpenCL 2.

Unfortunately I'm working with huge amounts of intermediate data (about 10GB per kernel invocation in the extreme case) that needs to be sorted. My sort is about 10x faster than off-the-shelf sorts, but that's because I don't need full ordering, merely to know which are the best 128 items from a 624-long list (there are millions of these lists to sort per kernel invocation). I get that performance by using registers across cooperating work items to hold the list, rather than a combination of global and local memory.
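For a sense of the selection problem, here's a deliberately naive per-work-item version in OpenCL C (one work item per list, private-memory array). It is not the cooperating-work-item, register-resident scheme described above, just the baseline shape of "best 128 of 624 without a full sort", assuming smaller values are better:

/* OpenCL C sketch: naive "best 128 of 624" per list, one work item per list.
   best[] sits in private memory (it will spill registers at this size), which
   is exactly why the real version spreads the list across cooperating work
   items' registers instead. */
#define LIST_LEN 624
#define KEEP 128

kernel void select_best_128(global const float *lists,   /* num_lists * LIST_LEN values */
                            global float *best_out)      /* num_lists * KEEP values, unordered */
{
    const size_t list = get_global_id(0);
    global const float *in = lists + list * LIST_LEN;

    float best[KEEP];
    for (int i = 0; i < KEEP; ++i)       /* seed with the first KEEP items */
        best[i] = in[i];

    int worst = 0;                       /* index of the largest (worst) value kept so far */
    for (int i = 1; i < KEEP; ++i)
        if (best[i] > best[worst]) worst = i;

    for (int i = KEEP; i < LIST_LEN; ++i) {
        float v = in[i];
        if (v < best[worst]) {           /* better than the current worst: replace and rescan */
            best[worst] = v;
            worst = 0;
            for (int j = 1; j < KEEP; ++j)
                if (best[j] > best[worst]) worst = j;
        }
    }

    for (int i = 0; i < KEEP; ++i)
        best_out[list * KEEP + i] = best[i];
}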

A fundamental problem with OpenCL 2.0 GPU enqueue is that newly generated parent data has to go off chip to be used by the child kernels. Sure caching might work, but since I'm sticking with OpenCL 1.1 for the time being, it's going to be a while before I get to explore whether it's possible to make caching work by constructing a suitably fine-grained cluster of parents and their children. I'm doubtful.

Ultimately I'm looking at finding the best 128 items from ~4000-item long lists for millions of lists (~80 GB of raw data), potentially using multiple kernels in succession per list, which would necessitate off-chip storage twixt kernels. Which would also require that I break up the work into sub-domain kernels, since the minimum working set is a smidgen more than 1GB, and I'd prefer not to exclude 1GB cards.

The Intel article's use of GPU enqueue seems to be partly about not being dependent upon the CPU to set up kernel launch parameters, based on results produced. I imagine you were alluding to that aspect for my purposes, but not necessarily with varying parent-data-dependent kernel parameters. That use case, where a parent kernel simply issues sub-domain kernels, does sound like an interesting way to avoid launch overheads that arise due to CPU/GPU interaction. Though I can imagine it still being subject to the command processor bottleneck, if indeed that is what's happening.
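For that simple parent-issues-sub-domain-kernels case, OpenCL 2.0 device-side enqueue looks roughly like this (a sketch only: it assumes the host created a default on-device queue with CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT and built the program with -cl-std=CL2.0; process_region is an invented helper):

/* OpenCL C 2.0 sketch: a parent kernel fans out sub-domain work without a
   CPU round trip. process_region is a hypothetical helper that each work
   item of a child launch would run, indexing its data via get_global_id(). */
void process_region(global float *data, int region)
{
    /* ... per-work-item work for one element of this sub-domain ... */
}

kernel void parent(global float *data, int num_regions, int region_size)
{
    if (get_global_id(0) != 0)
        return;                                    /* a single work item does the fan-out */

    queue_t q = get_default_queue();               /* default on-device queue set up by the host */
    for (int r = 0; r < num_regions; ++r) {
        enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT,
                       ndrange_1D((size_t)region_size),     /* one child launch per sub-domain */
                       ^{ process_region(data, r); });      /* block captures data and r by value */
    }
}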

Catalyst Omega has OpenCL 2.0 support; I'm unclear whether that is beta, per se. The APP SDK version 3.0, with OpenCL 2.0 support, is itself beta.

http://developer.amd.com/community/blog/2014/12/09/amd-app-sdk-3-0-beta/
 
http://www.dsogaming.com/news/dx12-...00-performance-increase-achieved-on-amd-gpus/

Interesting that they're saying the XBO would see a similar performance increase to the AMD PC GPUs in Star Swarm. This suggests the XBO API isn't much more efficient than DX11 at present.

Mainly a CPU-bound scenario though, so the DX11/low-level API is only feeding the GPU through one CPU core (or so they say). So this benchmark would be nearly the same, but on the GPU side we do not yet fully know what the X1 will contain (outside of those under NDA).
 
The only thing I was excited about from Brad Wardell's interview was the introduction of a new Star Control game. How is the world not going crazy over that? Like, what gives, seriously?
 
No, not impressive at all: "these results are with D3D 11 deferred contexts disabled".
This just seems like blatant trolling. I'm not sure what your definition of impressive is, but DX12 has enough power with 2-4 CPU cores to max out the graphics command processor of GCN GPUs.

Now, I know you know more than I do, but I think it's fair to say most developers on this board would agree that DX11 was never capable of such a feat, and the reason we didn't see a lot of deferred contexts used in DX11 games was that DX11 never did a great job with them anyway. There are enough bar graphs to go around showing how full the immediate-context thread is compared to the deferred ones. I'll let the numbers speak for themselves.
 
I would be curious to see the test run on a console APU.
The Anandtech tests were run on Intel desktop cores, so while they were able to launch sufficient batches, they operate in an entirely different league in terms of performance.

The 8-core Jaguar APU is equivalent, throughput-wise, to a 4-core Steamroller (closely enough that Orbis was once rumored to have been exactly that), and in many games we see AMD needing a whole 2-core module to match an Intel core.
In the console space, it would be 6 to almost 7 cores due to system reserve, and some possible performance loss due to straddling the non-unified L2s.

The console GPUs in this case may not be that far behind in command processor capability, as 800-850 MHz is not a large drop from the desktop GPU clocks regardless of whether the secondary command processors ever become available to games.
 
DX11 was never capable of such a feat

It doesn't matter. The problem is that the benchmark, for a game that was proven to gain a lot from deferred contexts (DC), was not run with DC enabled, i.e. the benchmark differences are inflated.
My problem with DX12 and all the praise it gets is that it's not different from Mantle (in my opinion it's 99.999% Mantle), and the only reason for its existence is that Nvidia and Microsoft were not willing to use Mantle.
 