DirectX 12: The future of it within the console gaming space (specifically the XB1)

Discussion in 'Console Technology' started by Shortbread, Mar 7, 2014.

  1. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,632
    Likes Received:
    1,250
    Location:
    British Columbia, Canada
    I'm talking about PC. On console it's somewhat less relevant as very few developers who are willing to target - say - PS3 are scared of a "low level" API like libgcm. And rightfully so... these APIs aren't really difficult per se, especially when they are only for one piece of pre-determined hardware.

    Also note that when I say "engines", I'm talking about technology created by graphics "experts" that are used across multiple titles. Ex. Frostbite is obviously an engine even though it is not sold as middleware.

    That further supports my point - if they haven't moved forward to even DX10/11 then it has nothing to do with the "ease of use" of the API and continuing to cater to programmers who want "safer, easier" APIs is wasted effort.

    I don't think the resourcing has ever been a huge concern to be honest... as you yourself point out, to these big companies it's peanuts. I wouldn't be surprised if some misguided notion of "protecting" the advantages of the Xbox platform vs. PC have played a role in the past, but I don't think anyone has really said "we're not doing this because it would take some time".

    Obviously DX12 as it is defined would not have worked on hardware 12 years ago, so you can hand wave about a "DX12-like API" but I think it's far from clear that you could do a similarly portable and efficient API more than a few years ago.

    They couldn't support WDDM2.0 for one, which is an important part of the new API semantics. It hasn't been that long that GPUs have had properly secured per-process VA spaces - certainly 12 years ago there were a lot of GPUs still using physical addressing and KMD patching, hence the WDDM1.x design in the first place.

    Ha, don't be fooled by their re-positioning - do the math and it's about the same cost as it was before for a AAA studio. The only difference is it's somewhat more accessible to indie devs now too, à la Unity.

    Not sure which developers you were "preaching" to, but as I said for as long as I've been in graphics it has been clear to at least the AAA folks.

    Meh, you can already do that with "compute", and the notion isn't even well-defined with the fixed function hardware. There are sort/serialization points in the very definition of the graphics pipeline - you can't just hand wave those away and "we'll just talk *directly* to the hardware this time guys!". Talking directly to the hardware *is* talking to the units that spawn threads, create rasterizer work, etc.

    By all means expose the CP directly and make it better at what it does! But that's arguing my point: you're effectively just putting a (currently kinda crappy) CPU on the front of the GPU. That's totally fine, but the notion that the "GPU is a better CPU" is kind of silly if your idea of a GPU includes a CPU. And you're going to start to ask why you need a separate CPU-like core just to drive graphics on an SoC that already has very capable ones nearby...
     
    #881 Andrew Lauritzen, Feb 12, 2015
    Last edited: Feb 12, 2015
    Starx, Jwm, liquidboy and 2 others like this.
  2. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,411
    Location:
    Wrong thread
    DX12 may very well be the API (or one of them) that drives the next Xbox.

    How would you like to see CPU/GPU integration progress to support a DX12 like API on a next gen console?
     
  3. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,632
    Likes Received:
    1,250
    Location:
    British Columbia, Canada
    That's kind of a big topic for another thread I think, but at a minimum you need tighter integration of caches between CPU/GPU (at least something like Haswell's LLC) with some basic controls for coherency (even if explicit), shared virtual memory (consoles can get away with shared physical, I guess) and efficient, low-latency atomics and signaling mechanisms between the two. The multiple compute queues stuff in the current gen is a good start, but ultimately I'd like to see that a bit finer grained - i.e. warp-level work stealing on the execution units themselves rather than everything having to go through the frontend.
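    A toy CPU-side sketch (Python, purely illustrative - real hardware would do this in the execution units, not in software, and all names here are invented) of what warp-level work stealing buys: an idle unit pulls work directly from a peer's queue instead of round-tripping through a central frontend:

    ```python
    import collections
    import random

    # Toy work-stealing model: each "execution unit" has its own deque and
    # pops work locally; when its deque is empty, it steals from another
    # unit's tail instead of going back through a central dispatcher.

    def run_units(tasks, n_units, seed=0):
        rng = random.Random(seed)
        deques = [collections.deque() for _ in range(n_units)]
        for i, t in enumerate(tasks):
            deques[i % n_units].append(t)   # initial distribution of work
        done = [0] * n_units
        while any(deques):
            for uid, dq in enumerate(deques):
                if dq:
                    dq.popleft()            # local pop from own queue head
                    done[uid] += 1
                else:
                    victims = [d for d in deques if d]
                    if victims:
                        rng.choice(victims).pop()   # steal from a victim's tail
                        done[uid] += 1
        return done

    done = run_units(range(100), 4)
    print(sum(done))  # 100: all work completed without a central frontend
    ```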
     
    sebbbi likes this.
  4. mosen

    Regular

    Joined:
    Mar 30, 2013
    Messages:
    452
    Likes Received:
    152
    I'm eager to know more about WDDM 2.0 and its specifications. Is it possible for all GPUs (with FL_11) to fully support WDDM 2.0, or does it need new hardware?
     
  5. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    Err, and why do you need that to reduce draw call pressure, exactly?

    Yeah, but the problem is: AAA studios are not buying anymore, hence the "accessible for indy" repositioning.

    Yep, and I don't see any problem with that, just need a compiler, obviously.

    No, it doesn't; I'm describing the current state of affairs, where each GPU has a "crappy CPU" attached.
    I would argue that making it possible to drive FFP parts from GPU code eliminates the need for any CPU altogether.
     
  6. forumaccount

    Newcomer

    Joined:
    Jan 30, 2009
    Messages:
    140
    Likes Received:
    86
    And here I am thinking it's already too easy to hang GPUs.
     
    iroboto and liquidboy like this.
  7. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,632
    Likes Received:
    1,250
    Location:
    British Columbia, Canada
    So that you can bake GPU virtual addresses into user-visible structures (descriptors, etc) and avoid any submission-time patching based on residency (WDDM1.x style). i.e. to get rid of the "KM Driver" block in the diagram from here:
    http://blogs.msdn.com/b/directx/archive/2014/03/20/directx-12.aspx
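    As an illustration (a toy Python model, not driver code - `residency`, the command tuples, and all names here are invented): with WDDM1.x-style addressing, the kernel-mode driver must re-patch pointers at every submission based on where allocations currently live, whereas a stable per-process GPU VA can be baked into descriptors once:

    ```python
    # Toy model: WDDM1.x-style submit-time patching vs. baked virtual
    # addresses. Names (residency table, command tuples) are illustrative.

    # --- WDDM1.x style: commands reference allocation handles; the KMD
    # patches in the current physical address at every submission, since
    # an allocation may have moved since the command list was recorded.
    residency = {"tex0": 0x1000}          # handle -> current physical address

    def submit_with_patching(command_list):
        return [(op, residency[handle]) for op, handle in command_list]

    cmds = [("bind", "tex0")]
    print(submit_with_patching(cmds))     # patched at submit: [('bind', 4096)]

    residency["tex0"] = 0x8000            # allocation moved between frames
    print(submit_with_patching(cmds))     # re-patched: [('bind', 32768)]

    # --- WDDM2.0 style: the GPU virtual address is stable for the process
    # lifetime, so it is baked into the descriptor up front and the command
    # list is submitted as-is, with no kernel-mode patching pass.
    BAKED_VA = 0x200000
    cmds_baked = [("bind", BAKED_VA)]
    print(cmds_baked == [("bind", BAKED_VA)])  # True: nothing to patch
    ```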
     
    psorcerer, mosen, sebbbi and 3 others like this.
  8. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    This is also one of the reasons why deferred contexts in DX11 are practically useless. You can't pre-patch the deferred command lists on the other cores, because you don't know when the deferred lists are going to be submitted (residency is not yet known). You need to patch when the list is submitted, meaning that the main thread (controlling the immediate context) needs to do it. It needs to go through all the lists by all the cores as it is the only thread actually submitting anything to the GPU.

    This is how you should do it in DX11:

    GCN Performance Tip 31: A dedicated thread solely responsible for making D3D calls is usually the best way to drive the API.

    Notes: The best way to drive a high number of draw calls in DirectX 11 is to dedicate a thread to graphics API calls. This thread's sole responsibility should be to make DirectX calls; any other types of work should be moved onto other threads (including processing memory buffer contents). This graphics "producer thread" approach allows the feeding of the driver's "consumer thread" as fast as possible, enabling a high number of API calls to be processed.

    DirectX11 Deferred Contexts will not achieve faster results than this approach when it is implemented correctly.
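    A minimal sketch of that producer-thread pattern (Python threads for illustration; the real thing would be C++ driving the D3D11 immediate context, and `submitted.append` stands in for the actual API call):

    ```python
    import queue
    import threading

    N_WORKERS = 4

    def worker(cmd_queue, draw_items):
        # Game threads do culling/gathering and enqueue ready-to-issue
        # command descriptions; they never touch the API themselves.
        for item in draw_items:
            cmd_queue.put(("draw", item))

    def render_thread(cmd_queue, submitted, n_workers):
        # The single dedicated thread is the only one making "API calls",
        # feeding the driver's consumer thread as fast as possible.
        done = 0
        while done < n_workers:
            cmd = cmd_queue.get()
            if cmd is None:
                done += 1          # one sentinel per finished worker
            else:
                submitted.append(cmd)  # stands in for a D3D11 draw call

    cmd_queue = queue.Queue()
    submitted = []
    renderer = threading.Thread(
        target=render_thread, args=(cmd_queue, submitted, N_WORKERS))
    renderer.start()
    workers = [threading.Thread(
        target=worker, args=(cmd_queue, range(i * 10, i * 10 + 10)))
        for i in range(N_WORKERS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    for _ in range(N_WORKERS):
        cmd_queue.put(None)        # signal end of work
    renderer.join()
    print(len(submitted))          # 40: every call issued by one thread
    ```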
     
  9. oldschoolnerd

    Newcomer

    Joined:
    Sep 13, 2013
    Messages:
    65
    Likes Received:
    8
    So fundamentally this is what DirectX 12 improves upon: each core can be a producer thread, concurrently. Have you had a go on it yet? Would love to know your views on Mr Wardell and his 500% performance improvement claims. To my mind this would only be possible if the GPU was currently only 20% utilised ...
     
  10. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    Cool, but it seems like that can be accomplished in other ways (i.e. you just need one level of indirection; it doesn't require full-fledged VA, although solving it through VA looks "right").
    Anyway, the change is not that big, and there were quite a lot of changes in GPUs for DX10 and DX11 anyway. I still fail to see why it couldn't have been done earlier, or at least why it's more complex than, say, tessellation.
     
  11. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
    I think you missed out on the discussion of large batched jobs and small batched jobs from earlier.
     
  12. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    I would assume that the old PC GPUs used 32 bit physical memory pointers. No paging and no virtual memory, just raw physical memory pointers to the GPU memory. Nowadays you have 64 bit pointers and virtual memory with CPU compatible page sizes and cache line sizes. I believe this is a much bigger change than tessellation or geometry shaders for example.

    A good analogy would be compute shaders. Compute shaders would have been problematic with DX9 hardware. DX10 brought us unified shaders, and geometry shaders needed some on-chip scratch pad to store their results. Later the on-chip scratch pad buffer was exposed to developers (as thread local shared memory) and it was straightforward to extend the existing unified shading cores with synchronization primitives to allow threads to cooperate with each other (= compute shaders). DX10 also introduced integer arithmetic (very useful for address calculation, for example).
     
    #892 sebbbi, Feb 13, 2015
    Last edited: Feb 13, 2015
    mosen likes this.
  13. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
    I'm going to try my best to apply some of what I learned from this thread, let me know how I did.

    As soon as I read what you wrote, sebbbi, it completely and entirely brought me back to the Eurogamer Titanfall dev interview:


    To note, Baker writes: in the BEST case this removes frame latency; in the worst case it was probably the cause of the frame drops when there were too many draw calls.

    Interestingly enough, they also used a lot of compute shaders; if you were really GPU bottlenecked from an ALU perspective, I'd imagine you'd try to move some of those tasks back to the CPU. The 792p resolution is not reflective of a lack of available GPU resources in this case.

    It sounds to me, after taking in everything I possibly can from reading all your thoughts, that DX12 is massively beneficial especially to weaker CPUs. The claim that DX12 only benefits games that are CPU bound is too generic a statement.

    In this case Respawn traded off batching all its draw calls to reduce the frame latency of that batch. They wanted the render without latency, so they submitted as many draw calls as possible instead of sending large batches, and as you wrote, sebbbi, GCN demands an entire thread/core dedicated to this job. Because of this they multi-threaded the culling and gathering of objects to be rendered, a situation I don't think would occur if each thread could draw, cull and gather on its own. They basically have this one thread/core waiting to do what the main thread would normally do.

    Respawn writes:
    So synchronization between cores is another issue that needs to be contended with, especially if you are sending information back to thread 0 to render, but before that thread 1 needs to cull and gather items. All these moving parts exist on the CPU side because it's attempting to do its absolute best at removing frame latency. If they didn't have to, they would have extra time to do things properly, to a degree, I imagine.

    I can imagine that in TF's scenario the CPU couldn't send enough work to saturate the GPU, but the time available to render at 60 fps is still ~16 ms, so when certain larger jobs came into play the GPU had very limited time left to do them, and with less burst muscle available, the only way to make that frame interval was ultimately to reduce resolution.

    I still think they suffered with ESRAM, but it sounds like if they'd had DX12 before launch instead of de facto DX11, Titanfall would be performing much better.
     
    #893 iroboto, Feb 13, 2015
    Last edited: Feb 13, 2015
  14. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
  15. Jwm

    Jwm
    Veteran

    Joined:
    Feb 27, 2013
    Messages:
    1,037
    Likes Received:
    155
    Location:
    Texas
    ^ Interesting results
     
  16. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    17,884
    Likes Received:
    5,334
    I think that's going to turn out as "some function runs 500% quicker", not "we'll see a 500% fps increase".
     
  17. forumaccount

    Newcomer

    Joined:
    Jan 30, 2009
    Messages:
    140
    Likes Received:
    86
    The most charitable interpretation is that he meant command buffer creation time could be 5x faster because it's running on 6 threads. As you state it has nothing to do with framerate at all.
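    Rough numbers for that reading (fractions assumed purely for illustration): even if command-buffer recording parallelizes across 6 threads, any remaining serial work (submit, present, etc.) caps the speedup per Amdahl's law:

    ```python
    # Back-of-envelope: speedup of command-buffer creation when recording
    # is spread over 6 threads, with some fraction left serial.
    # The parallel fractions below are assumed for illustration.

    def amdahl(parallel_fraction, n_threads):
        # Classic Amdahl's law: serial part runs at 1x, parallel part at Nx.
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_threads)

    print(round(amdahl(1.00, 6), 2))  # 6.0 - perfectly parallel recording
    print(round(amdahl(0.95, 6), 2))  # 4.8 - 5% serial already drops it near 5x
    print(round(amdahl(0.80, 6), 2))  # 3.0 - 20% serial halves the ideal gain
    ```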
     
  18. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    It's trivial to do real-time address patching in this one (even more trivial if you have 64 -> 32 bit). Do you remember that the bytecode is completely controlled by the driver? You just never load any address without adding an offset to it, and voilà - you have a "VA-like" thing in software. Google did it for x86.
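    A toy sketch of that indirection (Python; the bytecode format and every name here are invented for illustration): the driver rewrites each absolute load into a base-relative one, so allocations can be relocated just by changing the base:

    ```python
    # Toy sketch of the suggestion above: the driver owns the bytecode, so
    # it can rewrite every address load as base + offset, giving a "VA-like"
    # indirection in software (cf. software fault isolation on x86).

    def rewrite(bytecode, base_reg="rB"):
        # Turn every absolute LOAD into a base-relative LOAD_REL.
        out = []
        for instr in bytecode:
            if instr[0] == "LOAD":
                _, dst, addr = instr
                out.append(("LOAD_REL", dst, base_reg, addr))  # dst = mem[rB + addr]
            else:
                out.append(instr)
        return out

    def run(bytecode, memory, base):
        # Tiny interpreter for the rewritten program.
        regs = {"rB": base}
        for instr in bytecode:
            if instr[0] == "LOAD_REL":
                _, dst, breg, off = instr
                regs[dst] = memory[regs[breg] + off]
        return regs

    memory = {0x1000 + 4: 42}                    # allocation lives at base 0x1000
    prog = rewrite([("LOAD", "r0", 4)])          # program thinks it loads address 4
    print(run(prog, memory, base=0x1000)["r0"])  # 42: relocated transparently
    ```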
     
  19. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,632
    Likes Received:
    1,250
    Location:
    British Columbia, Canada
    Physical address patching/resolution must be done in kernel mode for security reasons. Hence the whole allocation/patch lists and KMD design in WDDM1.0.
     
  20. Metal_Spirit

    Regular

    Joined:
    Jan 3, 2007
    Messages:
    632
    Likes Received:
    397
    And is there any other way to think?
    Since DX12 will not improve hardware calculation performance (no API can improve the hardware's capability), what we will have is a lot of bottlenecks removed, especially on the CPU side.
    A 500% increase would mean a scenario where only 20% of the GPU's full power was in use. That is not a normal scenario, but a very specific one.

    Wardell is talking about theoretical maximum gains, but let's not forget Star Swarm is a stress test. It is made to create the maximum possible bottleneck on the CPU so that we can see the gains of a low-level API.
    It is in no way a usual scenario in current games. Just check the Mantle gains in BF4 and other games and you will see that average gains are much, much smaller. And may I remind you that Mantle beat DX12 in the Anandtech tests?
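    The 20% figure follows from simple arithmetic (numbers assumed for illustration): if the CPU takes the whole frame to issue work while the GPU is busy for only 20% of it, removing the CPU bottleneck entirely yields at most 1/0.2 = 5x:

    ```python
    # Back-of-envelope for the "500% => 20% GPU utilization" point.
    # All numbers are assumed for illustration.

    frame_cpu_ms = 16.6          # CPU time to build/submit one frame (the bottleneck)
    gpu_busy_ms = 16.6 * 0.20    # GPU is only busy 20% of that frame

    fps_before = 1000.0 / frame_cpu_ms   # frame rate limited by the CPU
    fps_after = 1000.0 / gpu_busy_ms     # CPU cost removed entirely: GPU-limited

    print(round(fps_after / fps_before, 1))  # 5.0 -> the "500%" best case
    ```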
     