DirectX 12: The future of it within the console gaming space (specifically the XB1)

AMD's slide deck:

http://amd-dev.wpengine.netdna-cdn....endering-in-VR-and-Graphics-Applications.ppsx

I'm not sure about the idea of a rolling shutter. Brings a whole new meaning to tearing.

I also wonder just how large a visual field GPUs are going to have to render in order to do a time-warp translation. We could be talking 2-3x more pixels rendered, just to enable time warping. Anyone have solid numbers? Or want to do the math for, say, an 8 ms (120 Hz) frame time? And that doesn't account for head strafing (e.g. leaning)...
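For a rough sense of scale, here's a back-of-the-envelope sketch (my own math, not from the slides): assuming a symmetric planar projection and rotation-only reprojection, the extra field you need to render for timewarp grows with the tangent of the widened half-FOV. The head-speed figure is a hypothetical worst case.

```cpp
#include <cassert>
#include <cmath>

// Back-of-the-envelope overdraw estimate for asynchronous timewarp.
// Assumes a symmetric planar projection and rotation-only reprojection
// (translation/lean is not covered). All inputs are hypothetical.
double timewarp_overdraw(double fov_deg, double frame_time_s,
                         double head_speed_deg_per_s) {
    const double deg = 3.14159265358979323846 / 180.0;
    // Worst-case rotation the warp may need to absorb within one frame.
    double margin_deg = head_speed_deg_per_s * frame_time_s;
    // Half-width of the view plane before and after widening the FOV.
    double h0 = std::tan(fov_deg * 0.5 * deg);
    double h1 = std::tan((fov_deg * 0.5 + margin_deg) * deg);
    double ratio = h1 / h0;
    return ratio * ratio; // pixel-count multiplier (area scales quadratically)
}
```

Plugging in a 90° FOV, an 8 ms frame and a fast ~300°/s head turn gives a margin of about 2.4° per side, i.e. only ~18% more pixels. So 2-3x looks pessimistic for rotation alone; lean/translation is the part simple timewarp can't cover anyway.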
 
Some excellent links there. It's kind of surprising how NV, AMD and Sony are all implementing more or less identical solutions (async timewarp) under different names, as though each were some kind of awesome proprietary solution.
 
CAkJnWbUYAAGYnk.png:large



Slide deck from the WinHEC 2015 PDF

Wanted to build off of this. I hadn't read the slide deck until now, but things turned out roughly as I thought they might, at least according to these marketing slides: fine-grained draw calls, with shorter and smaller jobs that fit into gaps in GPU resource usage.
azwT05l.jpg


This first slide represents the move to parallel rendering and its advantages. IMO this is about much more than just relieving the CPU of draw-call overhead; this is GPU stuffing.

eyXVBKH.jpg

This I didn't know; I guess it was just worth sharing. I didn't realize so many different stages of rendering were split up like this. I assume the colour code represents different parts of the GPU: green being ROPs? Blue bandwidth/bus? Red ALU? And purple? Maybe it doesn't work like that. Anyway, the next slide shows what can now run in parallel. Is this all marketing mumbo jumbo? Because IMO these are huge gains, and also seriously taxing on the GPU compared to DX11. And if anyone knows the answer to this: wouldn't each coloured block on that diagram require a read from memory, followed by a write when it hands off to the next block, and then another read?
So wrt the Xbox One: in this scenario, if you flood all parts of the GPU like this, you'll see a lot of concurrent reads/writes happening at different times. If so, ESRAM is a cost-effective solution to that problem, with a greater purpose than just supplementing DDR3. Concurrent reads/writes on a traditional GDDR memory setup would kill the bandwidth, if I understand correctly. And as I understand it, with seven or so threads each submitting their own GPU work, there's no way a developer could time when all those jobs land, so you're going to get bursts of reads/writes whenever jobs finish and start. You'd need bandwidth that's unaffected by that, hence ESRAM was the best choice for what was available, since it isn't hindered by concurrent access.

If all of that is true, we should expect the way developers program for the Xbox One to change. Does it still really make sense to keep the render target in ESRAM? You're going to want that scratchpad free if this is how the GPU pipeline behaves.


2KRVJ2J.jpg


The last image is a question for you guys. Feb 17 was a while back, so I assumed we already knew everything about DX12 post-GDC. But this slide is telling me the whole story hasn't been told yet? One major pillar left in DX12?
1sphIc6.jpg



edit: and man is Brad Wardell all over those slides. LOL. I guess he's pretty involved with the marketing side of things here.
Here's another picture of the final pillar.
If the 8-core CPU is the first one, and fine-grained compute on the GPU is the second, I'm completely confused about what the bitmap for the last one could be. VR? Is that a volcano spewing lava?

w3BUfo8.jpg
 
Impressive knowledge, thanks for sharing.

Could this have something to do with the upcoming DirectX 12?

RIDE, a game from developer Milestone, has bumped the resolution from the original 900p to 1080p on the X1, and they praised the console's ever-evolving code libraries. They also mention the ESRAM.

http://gamingbolt.com/rides-xbox-on...-dev-praises-x1s-ever-evolving-code-libraries
 

Yellow Brick Road?

We are off to see the wizard (MrX) for some magic sauce.... ;)
 
Rofl.

Hahaha. That actually does look like a yellow brick road. LOLOLOLOL. It's so bad, imo, that it's being handled this way.
 
Eh, not really. I'm waiting for a senior member to sort me out; I just throw stuff at the wind that I think can stick. All I recall is that in the SDK leak thread @mosen and I were discussing the possibility that this would be DX12's future mode of operation, but I recall being told that the pipeline couldn't operate that way, so we stopped looking down that corridor. My assumption going forward was that when you run a for loop to draw 1000 trees, then draw all the cars and all the particle effects, your CPU spends a lot of time walking through that code, so I assumed multi-threading would just relieve that symptom. But then I was properly confused as to how the GPU sorted out which item to draw first if submission wasn't in order. Now I see a slide showing what we theorized: all parts of the GPU working concurrently.
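To make that "for loop over 1000 trees" point concrete, here's a toy sketch of the CPU-side fix: record command lists on several threads instead of one, the way D3D12 allows one command list per worker. The `CommandList`/`DrawCmd` types here are hypothetical stand-ins, not the real API.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Toy model of multi-threaded command recording (hypothetical types,
// not the real D3D12 interfaces).
struct DrawCmd { int object_id; };
struct CommandList { std::vector<DrawCmd> cmds; };

// Each thread records its slice of the scene into its own command list;
// no locking is needed because the lists are independent.
std::vector<CommandList> record_parallel(int object_count, int thread_count) {
    std::vector<CommandList> lists(thread_count);
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&, t] {
            for (int i = t; i < object_count; i += thread_count)
                lists[t].cmds.push_back(DrawCmd{i});
        });
    }
    for (auto& w : workers) w.join();
    return lists; // in D3D12 these would all be handed to the GPU together
}
```

The point is that the per-object CPU cost gets spread across cores rather than serialized in one big loop; the GPU-side ordering question is what the queue/engine slides address.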

I stopped looking for hidden hardware at some point. Phil says they knew about DX12 when they made the Xbox, so to me that means the choices they made for it should reflect that, if he was telling the truth. The only things that really stand out are the memory architecture and the cooling; everything else seems like subtle improvements here and there. The Xbox having feature level 12_1 would definitely be forward-looking, but it shouldn't be the only requirement to pass to prove the Xbox was designed with DX12 in mind.

We didn't understand DX12, so it was hard to understand the choices they made on the hardware side. Then again, I could still be wrong about how many reads/writes will occur here. But even so, this should effectively change the way engines are designed.
 
Hi everyone. @iroboto, IMO that represents the command queues available for sending commands to the GPU. It's like Mantle: you have three queues (universal, compute and DMA). With the universal queue you can submit graphics or compute jobs (the green and purple colours), with the compute queue you can submit asynchronous compute (red), and with DMA, transfers (blue).

Bye.
 
Thanks this was what I was looking for.

http://blogs.msdn.com/b/directx/arc...-gdc-2015-and-a-year-of-amazing-progress.aspx

GPU Efficiency
Currently, there are three key areas where GPU improvements can be made that weren’t possible before: Explicit resource transitions, parallel GPU execution, and GPU generated workloads. Let’s take a quick look at all three.

Explicit resource transitions
In DirectX 12, the app has the power of identifying when resource state transitions need to happen. For instance, a driver in the past would have to ensure all writes to a UAV are executed in order by inserting ‘Wait for Idle’ commands after each dispatch with resource barriers.



If the app knows that certain dispatches can run out of order, the ‘Wait for Idle’ commands can be removed.



Using the new Resource Barrier API, the app can also specify a ‘begin’ and ‘end’ transition while promising not to use the resource while in transition. Drivers can use this information to eliminate redundant pipeline stalls and cache flushes.
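A toy way to see the saving described above (illustrative only, not the real API): count one sync per dispatch for a conservative DX11-style driver, versus one sync per genuine dependency for an app using explicit DX12-style barriers.

```cpp
#include <cassert>
#include <vector>

// Toy model of the barrier difference (illustrative only).
// A conservative DX11-style driver syncs after every UAV dispatch; a
// DX12-style app inserts a barrier only where a later dispatch actually
// reads the earlier one's output.
int syncs_conservative(int dispatch_count) {
    return dispatch_count; // wait-for-idle after each dispatch
}

int syncs_explicit(const std::vector<bool>& depends_on_previous) {
    int syncs = 0;
    for (bool dep : depends_on_previous)
        if (dep) ++syncs; // barrier only at a real dependency
    return syncs;
}
```

With, say, eight dispatches of which only two truly depend on earlier results, the conservative model pays eight stalls and the explicit model pays two.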

Parallel GPU execution
Modern hardware can run multiple workloads on multiple ‘engines’. Three types of engines are exposed in DirectX 12: 3D, Compute, and Copy. It is up to the app to manage dependencies between queues.

We are really excited about two notable compute engine scenarios that can take advantage of this GPU parallelism: long running but low priority compute work; and tightly interleaved 3D/Compute work within a frame. An example would be compute-heavy dispatches during shadow map generation.

Another notable example use case is in texture streaming where a copy engine can move data around without blocking the main 3D engine which is especially great when going across PCI-E.
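The "app manages dependencies between queues" part can be sketched with two threads standing in for the copy and 3D engines and a fence object between them. The names here are illustrative, not the real D3D12 interfaces.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Toy fence: one queue signals a value, another waits for it
// (roughly the shape of a GPU fence, but purely illustrative).
struct Fence {
    std::mutex m;
    std::condition_variable cv;
    std::uint64_t value = 0;
    void signal(std::uint64_t v) {
        { std::lock_guard<std::mutex> lk(m); value = v; }
        cv.notify_all();
    }
    void wait(std::uint64_t v) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return value >= v; });
    }
};

// The copy engine uploads a texture; the 3D engine waits on the fence
// before sampling it, and could do unrelated work in the meantime.
std::vector<int> upload_then_draw() {
    Fence fence;
    std::vector<int> texture;
    std::thread copy_engine([&] {
        texture = {1, 2, 3, 4}; // pretend DMA upload
        fence.signal(1);        // copy queue signals completion
    });
    std::vector<int> frame;
    std::thread gfx_engine([&] {
        fence.wait(1);          // 3D queue waits on the copy fence
        frame = texture;        // now safe to read the uploaded data
    });
    copy_engine.join();
    gfx_engine.join();
    return frame;
}
```

The key design point is that the runtime does not insert these waits for you; the app decides where the real dependencies are, exactly as the blog post says.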

GPU-generated workloads
ExecuteIndirect is a powerful new API for executing GPU-generated Draw/Dispatch workloads that has broad hardware compatibility. Being able to vary things like Vertex/Index buffers, root constants, and inline SRV/UAV/CBV descriptors between invocations enables new scenarios as well as unlocking possible dramatic efficiency improvements.
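As a rough mental model of ExecuteIndirect (with illustrative stand-in structs, not the real argument-buffer layout): earlier GPU work, say a culling compute pass, writes a buffer of draw arguments, and the command processor then replays one draw per record without a CPU round trip.

```cpp
#include <cassert>
#include <vector>

// Toy model of an indirect argument buffer. In the real API the records
// can also vary vertex/index buffers, root constants and descriptors;
// this sketch keeps just the draw parameters.
struct IndirectDrawArgs {
    int vertex_count;
    int instance_count;
};

// Stand-in for the GPU front end consuming the argument buffer:
// one "draw" per record, with no CPU involvement between them.
int execute_indirect(const std::vector<IndirectDrawArgs>& arg_buffer) {
    int vertices_drawn = 0;
    for (const auto& args : arg_buffer)
        vertices_drawn += args.vertex_count * args.instance_count;
    return vertices_drawn;
}
```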
 
Thanks, Dobwal.

The "Parallel GPU execution" part really clarifies some of the DX12 slide deck from GDC. When sebbbi said he was looking forward to that feature, I didn't quite understand what it did, but now the picture is pretty clear. So the pipelines really are async compute, 3D, and copy. I didn't realize it couldn't do this before; I just assumed these were functions that should have been there in DX11.

I'm curious now to know whether some of this is already in DX11 fast semantics for the Xbox One.
It would be poor to assume that none of it is; I think the leaked SDK documentation says async compute was added to the SDK late in the cycle. It also mentions that waves 3-4 of games will finally be good at asynchronously moving data in and out of ESRAM. But we're unsure of the timeline for waves 3-4, and whether those games were meant to be DX12 games anyway.

I'll need to look into the SDK to see if these features (parallel rendering) exist, but I'm fairly positive I did not see ExecuteIndirect or UAV loading in the Xbox API.
 
DirectCompute in DirectX 12, by Chas Boyd (Microsoft)

Sorry everyone, here's the link: http://on-demand.gputechconf.com/gtc/2015/video/S5561.html


DX11 saw the CPU and GPU like this:

CAwpGvrVEAEsus0.png:large




DX12 sees the GPU as individual cores that a developer has complete control over.

The CPU, GPU and DMA (copy engine) are all just "cores"; if you know how to program a CPU core, then all these other cores are handled the same way.

CAwpRBLUcAA2A2h.png:large


These so-called "cores" (CPU/GPU/DMA etc.) are now called "engines".

CAwqOkBUMAAtz_d.png




Much like HSA, DX12 seems to have ended up in the same place, where cores are all just "compute cores".
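That "everything is just a compute core" idea can be sketched as a single engine abstraction that is fed the same kind of command regardless of what executes it. This is purely my own illustration of the talk's framing, assuming nothing about the real runtime.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>

// Toy sketch of the uniform "engine" model: 3D, compute and copy work are
// all plain commands pushed onto a queue and drained the same way,
// whichever kind of core actually runs them. Illustrative only.
struct Engine {
    std::string name; // "3D", "Compute", "Copy", ...
    std::queue<std::function<void()>> work;
    void submit(std::function<void()> cmd) { work.push(std::move(cmd)); }
    int drain() { // execute everything in submission order
        int executed = 0;
        while (!work.empty()) {
            work.front()();
            work.pop();
            ++executed;
        }
        return executed;
    }
};
```

A copy engine, a compute engine and the 3D engine would all be instances of the same type here, which is the HSA-flavoured point the slide seems to be making.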



CAwstGmUUAAAUxp.jpg:large
 
Huh? DirectCompute has been around since DX11?
I'd never heard it referenced before and I don't work on Microsoft platforms myself. I wonder how much support it gets compared to CUDA or OpenCL.
 