DirectX 12: The future of it within the console gaming space (specifically the XB1)

If the memory system manages this automagically, it's the first I've read of it.
Maybe I'm wrong, but the impression I got from the earlier comment was of a handled structure, although that wouldn't really be the memory system (as I called it) but, as Iroboto points out, the API. :oops:

On PS4, most GPU commands are just a few DWORDs written into the command buffer, let's say just a few CPU clock cycles. On Xbox One it easily could be one million times slower because of all the bookkeeping the API does.
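For illustration only: a rough C++ sketch of what "a few DWORDs written into the command buffer" can look like. The packet layout, opcode value and CommandRing type below are invented for the example, loosely in the spirit of AMD's PM4 packets, and are not GNM's real format.

#include <cstdint>

// Hypothetical ring-buffer writer for the sketch; not a real API.
struct CommandRing { uint32_t* cursor; };

inline uint32_t PacketHeader(uint32_t opcode, uint32_t payloadDwords) {
    return (opcode << 8) | payloadDwords;   // made-up encoding
}

void EmitDraw(CommandRing& ring, uint32_t vertexCount, uint32_t drawFlags) {
    // The whole "draw call" is three 32-bit writes and a pointer bump.
    ring.cursor[0] = PacketHeader(0x2D /* pretend DRAW opcode */, 2);
    ring.cursor[1] = vertexCount;
    ring.cursor[2] = drawFlags;
    ring.cursor += 3;
}

The point being: on a thin API that is essentially all the CPU work there is, while a thick API adds validation, state tracking and allocation on top of those few writes.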

Or you need more draw calls if you want individuality in things like exploding walls and debris, with every piece acting on its own and affected by its own physics.
Which would require more processing power and potentially resources.
Or if you want tons of different objects each with their own material and shaders.
Which would require more textures and resources.
Or if you want to just ramp up performance faster, like being able to fill a command queue quickly if it gets empty.
Yep.

Basically, in all the years of developers talking about bottlenecks, of GPU power limits and triangle setup limits and CPU limits and memory limits, and them eking out every ounce of usability from the finite RAM and bandwidth, draw calls have never been an issue. The lack of variety in games has never been because there aren't enough draw calls to go around, but because the RAM is full already or the cost of creating assets is prohibitive. Eliminating the draw call cap is going to be a plus for devs as it makes their lives easier, but it surely can't be a world-changing event. Especially coupled with async compute for maximal utilisation of GPU resources.
 
Maybe I'm wrong, but the impression I got from the earlier comment was of a handled structure, although that wouldn't really be the memory system (as I called it) but, as Iroboto points out, the API. :oops:

I would expect a software API (for DirectX/OpenGL libraries) to schedule and queue calls like this but I've not read anything to suggest that GPU hardware does. What GNM has is anybody's guess - we only have "most GPU commands are just a few DWORDs written into the command buffer". Most. I must admit I assumed "command buffer" was a reference to the GPU itself but it could be the API.

Maybe Sebbbi can offer some wisdom without incurring the wrath of Sony ninjas.
 
Maybe I'm wrong, but the impression I got from the earlier comment was of a handled structure, although that wouldn't really be the memory system (as I called it) but, as Iroboto points out, the API. :oops:

On PS4, most GPU commands are just a few DWORDs written into the command buffer, let's say just a few CPU clock cycles. On Xbox One it easily could be one million times slower because of all the bookkeeping the API does.

Which would require more processing power and potentially resources.
Which would require more textures and resources.
Yep.

Basically, in all the years of developers talking about bottlenecks, of GPU power limits and triangle setup limits and CPU limits and memory limits, and them eking out every ounce of usability from the finite RAM and bandwidth, draw calls have never been an issue. The lack of variety in games has never been because there aren't enough draw calls to go around, but because the RAM is full already or the cost of creating assets is prohibitive. Eliminating the draw call cap is going to be a plus for devs as it makes their lives easier, but it surely can't be a world-changing event. Especially coupled with async compute for maximal utilisation of GPU resources.

Yeah, I mainly agree, though for PC and Xbox it's a little different; PS4 is excluded due to its memory architecture (from what I understand so far).
There is the async compute part of the multithreading, which everyone benefits from.
As well as DMA/resource copying between memory pools, which can be multithreaded - and that part is limited to dual-memory-pool situations.

In the serial situation from the earlier AMD slides, you're loading a lot of stuff, processing a lot of stuff, loading up more stuff, processing more stuff.
In a parallel situation you can hide that latency by having one thread loading resources while another thread is processing (see the sketch at the end of this post). It's a fairly good win for dual-pool situations, and creates a more seamless flow.
Streaming should be improved in this case. The idea of having to load all your textures up front is now more flexible: you can begin processing earlier as textures are constantly being streamed into the GPU, and maybe it's a little less of an engineering nightmare trying to cram everything into a limited space.

This should be a decent win for Xbox with its extremely limited ESRAM, though I imagine the limit is the ~30 GB/s speed of its DMEs. So there might be some resident textures that stay in ESRAM as well.
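A rough DX12-flavoured sketch of the overlap, since DX12 exposes the copy engines as their own queue type. Everything here is illustrative: the function names are mine, and resource/command-list creation is omitted.

#include <windows.h>
#include <d3d12.h>

// Assumes the upload commands are already recorded into 'copyList' and that
// 'copyFence' currently sits below 'fenceValue'.
void KickStreamingUpload(ID3D12CommandQueue* copyQueue,
                         ID3D12CommandList*  copyList,
                         ID3D12Fence*        copyFence,
                         UINT64              fenceValue)
{
    // The COPY queue typically maps onto the GPU's DMA engines, so this upload
    // can run alongside whatever the graphics queue is rendering right now.
    ID3D12CommandList* lists[] = { copyList };
    copyQueue->ExecuteCommandLists(1, lists);
    copyQueue->Signal(copyFence, fenceValue);
}

bool StreamedTextureReady(ID3D12Fence* copyFence, UINT64 fenceValue)
{
    // The render loop keeps drawing with the textures it already has and just
    // checks each frame whether the new one has landed yet.
    return copyFence->GetCompletedValue() >= fenceValue;
}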
 
Basically, in all the years of developers talking about bottlenecks, of GPU power limits and triangle set up limits and CPU limits and memory limits, and them eking out every ounce of usability from the finite RAM and bandwidth, drawcalls has never been an issue. The lack of variety in games has never been because there's not enough drawcalls to go around, but because the RAM is full already or the cost of creating assets is prohibitive. Eliminating the drawcall cap is going to be a plus for devs as it makes their lives easier, but it surely can't be a world-changing event. Especially coupled with async compute for maximal utilisation of GPU resources.
Of course draw calls have been an issue. That's why new APIs keep trying to address it: you can find comments about DX10 and DX11 reducing draw call overhead, and DX12 is like a step function here. Things like instancing came about at least partially to reduce draw overhead (quick sketch at the end of this post).

You're correct it won't be a world changing event. Just reducing a bottleneck and allowing developers to have more control.
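For example (a hedged D3D12-style sketch; the counts are placeholders and the per-object constant setup is omitted), instancing is exactly this kind of draw-overhead reduction:

#include <d3d12.h>

// N separate draws: one call, and one lot of submission overhead, per piece of debris.
void DrawDebrisIndividually(ID3D12GraphicsCommandList* cl, UINT indexCountPerPiece, UINT debrisCount)
{
    for (UINT i = 0; i < debrisCount; ++i) {
        // ... bind this piece's constants here (omitted) ...
        cl->DrawIndexedInstanced(indexCountPerPiece, 1, 0, 0, 0);
    }
}

// One instanced draw: per-piece transforms come from an instance buffer read in the shader.
void DrawDebrisInstanced(ID3D12GraphicsCommandList* cl, UINT indexCountPerPiece, UINT debrisCount)
{
    cl->DrawIndexedInstanced(indexCountPerPiece, debrisCount, 0, 0, 0);
}

The trade-off is flexibility: instancing only helps when the pieces share geometry and material, which is exactly why lower per-draw overhead is still welcome.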
 
I would expect a software API (for DirectX/OpenGL libraries) to schedule and queue calls like this but I've not read anything to suggest that GPU hardware does. What GNM has is anybody's guess - we only have "most GPU commands are just a few DWORDs written into the command buffer". Most. I must admit I assumed "command buffer" was a reference to the GPU itself but it could be the API.

It's _a_ command buffer, not _one single_ command buffer. If you've ever written self-modifying code, you know it's hardly possible to run the code _at the same time_ as it's being created. Command buffers are _buffers_: they buffer the commands for later execution/processing. And if you have the knowledge (or limit yourself) to make them context-free blobs, then they just linger around until _dispatched_, without any further dependency/synchronization. The dispatch of course is a critical section - there is no parallel kicking of the GPU in the soft parts. You can build as many command buffers - in parallel - as you wish, if you know how to hack them together. If not, as in DX11 (missing abstraction, too much possible hardware), then you're stuck dictating - rather like a doctor and a secretary with hearing loss - every bit you want to do in ridiculously high-level API style; that is what's called a deferred context. The critical section plus the crazy code and the memory manager was what's called the immediate context.
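In DX12 terms that split looks roughly like this (a sketch only: the function names are mine, and device, PSO, allocator and thread setup are all omitted):

#include <d3d12.h>
#include <vector>

// Each worker thread records into its own allocator/command-list pair, so there
// is no shared state and no locking while the "blobs" are being hacked together.
void RecordChunk(ID3D12GraphicsCommandList* list,
                 ID3D12CommandAllocator* allocator,
                 ID3D12PipelineState* pso)
{
    allocator->Reset();
    list->Reset(allocator, pso);
    // ... record the draws for this chunk of the scene ...
    list->Close();
}

// The one critical section: a single place hands the finished buffers to the queue.
void Submit(ID3D12CommandQueue* queue, std::vector<ID3D12CommandList*>& lists)
{
    queue->ExecuteCommandLists(static_cast<UINT>(lists.size()), lists.data());
}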
 
Interesting. I'd like to see more details on implementation. How are shaders assigned to queues, and at what granularity? Do you assign priorities to queues, or are the queues for each ACE of a fixed priority (eg 0 lowest, 7 highest)?
 
Interesting. I'd like to see more details on implementation. How are shaders assigned to queues, and at what granularity? Do you assign priorities to queues, or are the queues for each ACE of a fixed priority (eg 0 lowest, 7 highest)?
From what I understand so far, you are responsible for the CPU side of multithreading, so in this case you are responsible for assigning work to the queues; DX12 will be responsible for managing them (rough example below).
On the MSDN site there is a full breakdown from beginning to end, with examples, of how to set up the pipeline and submit work to the queue.

https://msdn.microsoft.com/en-us/library/dn899121(v=vs.85).aspx
I believe that "work submission" is what you're looking for.
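As a rough example of what that looks like through the API (based on the documented D3D12_COMMAND_QUEUE_DESC fields; queue creation only, nothing else set up): you create the queues yourself and choose a priority per queue, and how that maps onto the ACE hardware queues is the driver's business.

#include <windows.h>
#include <d3d12.h>

HRESULT CreateQueues(ID3D12Device* device,
                     ID3D12CommandQueue** graphicsQueue,
                     ID3D12CommandQueue** computeQueue)
{
    // One direct (graphics) queue at normal priority.
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    gfxDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;
    HRESULT hr = device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(graphicsQueue));
    if (FAILED(hr)) return hr;

    // One compute queue, bumped to high priority (e.g. for latency-sensitive work).
    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    compDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;
    return device->CreateCommandQueue(&compDesc, IID_PPV_ARGS(computeQueue));
}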
 
This is a cross post from @Kaotik 's post in the other thread - except this accompanying write-up is in English and not in Chinese ;)

Asynchronous Shaders in DX12
http://www.tomshardware.com/news/amd-dx12-asynchronous-shaders-gcn,28844.html
Also up at AnandTech, but @Ryan Smith seems to have made some mistakes, namely the number of queues on GCN (which should be 64 for GCN 1.1/1.2 GPUs with 8 ACEs, not 8 - it's also not clear whether the compute queues actually eat any graphics queues on AMD like AnandTech's article claims)
http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading
 
Also up at AnandTech, but @Ryan Smith seems to have made some mistakes, namely the number of queues on GCN (which should be 64 for GCN 1.1/1.2 GPUs with 8 ACEs, not 8 - it's also not clear whether the compute queues actually eat any graphics queues on AMD like AnandTech's article claims)
http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading
Today must have been embargo day. Interesting.

You are referring to the table with graphics + compute queues, right?
 
Anandtech:
Why Asynchronous Shading Wasn’t Accessible Before
AMD has offered multiple Asynchronous Compute Engines (ACEs) since the very first GCN part in 2011, the Tahiti-powered Radeon HD 7970. However prior to now the technical focus on the ACEs was for pure compute workloads, which true to their name allow GCN GPUs to execute compute tasks from multiple queues. It wasn’t until very recently that the ACEs became important for graphical (or rather mixed graphics + compute) workloads.

Why? Well the short answer is that in another stake in the heart of DirectX 11, DirectX 11 wasn’t well suited for asynchronous shading. The same heavily abstracted, driver & OS controlled rendering path that gave DX11 its relatively high CPU overhead and poor multi-core command buffer submission also enforced very stringent processing requirements. DX11 was a serial API through and through, both for command buffer execution and as it turned out shader execution.

lol ouch, but we all knew this to be true.
This is starting to make a lot of sense. I think this is good evidence that, upon release, GNM already had something close to full DX12 functionality.

It would explain why I:SS runs hotter, by reports, than other games do.

Well, I'm a guy that likes to fail fast: it looks like PS4 has really had both a major API and hardware advantage. I'm surprised Driveclub and The Order: 1886 didn't leverage this - they may have, though; it's just not listed on the slide.

Async_Games_575px.png
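To make the "asynchronous shaders" part concrete, here's a minimal DX12-style sketch (names and setup are mine, command lists assumed recorded elsewhere): compute work goes to its own queue, runs alongside graphics, and a fence only gates the graphics work that actually consumes the result.

#include <windows.h>
#include <d3d12.h>

void SubmitFrame(ID3D12CommandQueue* graphicsQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12CommandList*  graphicsList,
                 ID3D12CommandList*  computeList,
                 ID3D12Fence*        computeFence,
                 UINT64              fenceValue)
{
    // Kick the compute work on its own queue; on GCN this is the sort of thing
    // the ACEs can pick up and overlap with graphics.
    ID3D12CommandList* compute[] = { computeList };
    computeQueue->ExecuteCommandLists(1, compute);
    computeQueue->Signal(computeFence, fenceValue);

    // The GPU-side Wait only gates work submitted to the graphics queue after
    // this point, i.e. the pass that reads the compute output.
    graphicsQueue->Wait(computeFence, fenceValue);
    ID3D12CommandList* graphics[] = { graphicsList };
    graphicsQueue->ExecuteCommandLists(1, graphics);
}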
 
Today must have been embargo day. Interesting.

You are referring to the table with graphics + compute queues, right?
Yeah.
Anand's looks like this:
AMD GCN 1.2 (285): 1 Graphics + 7 Compute | 8 Compute
AMD GCN 1.1 (290 Series): 1 Graphics + 7 Compute | 8 Compute
AMD GCN 1.1 (260 Series): 1 Graphics + 1 Compute | 2 Compute
AMD GCN 1.0 (7000/200 Series): 1 Graphics + 1 Compute | 2 Compute
260-series should be 1 gfx + 7 or 8 compute (depending on whether the gfx actually eats one slot or not)
290-series & 285 should be 1 gfx + 63 or 64 compute (depending on whether the gfx actually eats one slot or not)
And I'm fairly certain that GCN 1.0's ACEs could each do 2 queues, not 1, so it should be 1 gfx + 3 or 4 compute (depending on whether the gfx actually eats one slot or not)

edit:
AnandTech's article has been fixed :)
 
Anandtech:


lol ouch, but we all knew this to be true.
This is starting to make a lot of sense. I think this is good evidence that, upon release, GNM already had something close to full DX12 functionality.

It would explain why I:SS runs hotter, by reports, than other games do.

Well, I'm a guy that likes to fail fast: it looks like PS4 has really had both a major API and hardware advantage. I'm surprised Driveclub and The Order: 1886 didn't leverage this - they may have, though; it's just not listed on the slide.

Async_Games_575px.png

I think the weather uses compute in Driveclub.
 
Anandtech:


lol ouch, but we all knew this to be true.
This is starting to make a lot of sense. I think this is good evidence that, upon release, GNM already had something close to full DX12 functionality.

It would explain why I:SS runs hotter, by reports, than other games do.

Well, I'm a guy that likes to fail fast: it looks like PS4 has really had both a major API and hardware advantage. I'm surprised Driveclub and The Order: 1886 didn't leverage this - they may have, though; it's just not listed on the slide.

Async_Games_575px.png

The PS4 version of Battlefield 4 runs at 900p and the Xbox One version at 720p... you would think there would be even more of a gap though, wouldn't you?
 
The PS4 version of Battlefield 4 runs at 900p and the Xbox One version at 720p... you would think there would be even more of a gap though, wouldn't you?
Unfortunately, no one knows the extent of what was actually done on any of those titles. And once again, optimization time must have been fairly limited back then: five different platforms, etc., all trying to make the launch date.
I'm going to wait to see later releases of Frostbite (by DICE) to compare how far that engine has come.
 
900p is 56% more pixels than 720p. That's more than the theoretical difference in GPU power.

No, but that is about exactly the theoretical power difference.
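For reference, assuming 900p here means 1600x900: 1600x900 = 1,440,000 pixels vs 1280x720 = 921,600 pixels, which is about 56% more. The raw shader-throughput gap between the two GPUs (1.84 TFLOPS vs 1.31 TFLOPS) is roughly 40%, while the CU-count gap (18 vs 12) is 50%, so whether 56% is "more than" or "about exactly" the theoretical difference depends on which metric you pick.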

Unfortunately, no one knows the extent of what was actually done on any of those titles. And once again, optimization time must have been fairly limited back then: five different platforms, etc., all trying to make the launch date.
I'm going to wait to see later releases of Frostbite (by DICE) to compare how far that engine has come.

I agree. Probably didn't really have much time to work on it.
 