DirectX 12: The future of it within the console gaming space (specifically the XB1)

It seems probable there would be an impact on SIMD throughput as well. One SIMD leeching register capacity from the others means a whole CU's register resources are favoring one SIMD and potentially starving the other three. Since wavefronts are distributed up to 10 per SIMD, that could shut down 3/4 of a CU's instruction issue if the other SIMDs are starved.
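
A minimal sketch of the baseline arithmetic on a stock CU, before any sharing: the per-wave VGPR allocation is what decides how many of those 10 wave slots a SIMD can actually fill. The 256-VGPR and 10-wave limits are the commonly quoted GCN figures and are assumptions here; this is illustrative host code only.

// Illustrative host-side sketch: how per-wave VGPR allocation caps the number
// of wavefronts one GCN SIMD can hold. Limits are assumed (256 VGPRs per lane
// per SIMD, 10-wave buffer), not taken from any particular chip's datasheet.
#include <algorithm>
#include <cstdio>

int wavesPerSimd(int vgprsPerWave)
{
    const int kVgprsPerLanePerSimd = 256;  // assumed register file depth
    const int kMaxWavesPerSimd     = 10;   // assumed wave buffer limit
    if (vgprsPerWave <= 0) return kMaxWavesPerSimd;
    return std::min(kMaxWavesPerSimd, kVgprsPerLanePerSimd / vgprsPerWave);
}

int main()
{
    // A 25-VGPR shader keeps the SIMD full (10 waves); a 128-VGPR shader drops it to 2.
    std::printf("25 VGPRs -> %d waves, 128 VGPRs -> %d waves\n",
                wavesPerSimd(25), wavesPerSimd(128));
    return 0;
}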

Sure, but the program which uses lots of GPRs finishes quicker. Instead of 3 possible wavefronts, with which you might stall really quickly, you may still have 10. You can schedule lightweight programs like shadow mapping, stencil passes, hull/domain shaders and so on to the other SIMDs.
It's an optimization problem, and I simply would like to know: if I can only utilize e.g. 23.8% of the chip's capability (occupancy) in the graphics pipeline, then why? Can I get 100%? How?
Ultimately the statistics of real workloads could tell if it helps or hinders.
 
Each set (NDRange in OpenCL speak) of work items associated with a kernel launch is divided up into wavefronts. Those are issued to any CU that has the capacity.
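
Roughly, and assuming the usual 64-wide wavefront, that division is just a rounded-up split of the work-item count; a trivial sketch:

// Trivial sketch: how many 64-wide wavefronts a 1D NDRange turns into.
#include <cstdio>

long long wavefrontCount(long long workItems, int waveSize = 64)
{
    return (workItems + waveSize - 1) / waveSize;  // ceiling division
}

int main()
{
    // e.g. a 1,000,000-work-item launch becomes 15625 wavefronts to hand out to CUs.
    std::printf("%lld wavefronts\n", wavefrontCount(1000000));
    return 0;
}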

The graphics engine can issue vertex, hull, domain, geometry and pixel shader wavefronts. So in a given pipeline state the graphics engine has 5 distinct NDRanges it can pull from. Because the graphics pipeline is a pipeline, the number of work items available in each type of NDRange grows and shrinks over time. The work items are, of course, parametrically defined, since much of the state of a work item is held in memory somewhere (off or on chip) or is literally a set of parameters for the NDRange and/or wavefront. So at the time the wavefront is despatched, these parameters are materialised into a context that's delivered to the CU.

The ACEs each have 8 distinct queues, so on a GPU with 8 ACEs, there are 64 NDRanges that could all be waiting to be executed.

So, in effect, there are as many as 69 NDRanges that are all waiting for CUs to become available. Some of those generate wavefronts with high register pressure, so they'll pack less easily.

Generally, graphics NDRanges all have to be long-running (for the duration of the pipeline state, until the next pipeline state replaces it). The GPU has its own algorithms for load-balancing these, to take account of the growing and shrinking sets of work items for each of these, over the lifetime of the pipeline state. In general you can think of greedy wavefront despatch for pixel shading (since pixel shading tends to be the majority of all graphics work) versus carefully balanced hull and domain shader pairs, since the output of HS, after amplification by the tessellator, has to return to the same CU for DS.

By contrast there's no way to pre-determine the allocation of work from a set of 64 competing NDRanges which could all be running the same compute kernel or they could all be distinct kernels. Since the GPU has a limited rate at which it can construct wavefronts (2 cycles per wavefront, I believe) some arbitration amongst the ACEs is required. This is where you get into the subject of priorities and pre-emption. And then have to work out whether the combination of VGPR, SGPR and LDS allocations per wavefront (or workgroup, since a workgroup can have 4 wavefronts) is compatible with any of the CUs that are currently executing work.
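
A toy version of that compatibility check, with assumed GCN-era budgets (256 VGPRs and 512 SGPRs per SIMD, 64 KB of LDS per CU, 10 waves per SIMD); the real scheduler also has to subtract whatever the already-resident wavefronts hold.

// Toy "does this workgroup fit on this CU?" test. All budgets are assumed
// GCN-era figures; real hardware also tracks what resident waves already hold
// and allocates registers at a coarser granularity.
#include <cstdio>

struct CuBudget {
    int freeVgprsPerSimd = 256;   // assumed VGPR file depth per SIMD (per lane)
    int freeSgprsPerSimd = 512;   // assumed SGPR file depth per SIMD
    int freeLdsBytes     = 65536; // assumed 64 KB LDS per CU
    int freeWaveSlots    = 40;    // assumed 10 waves per SIMD x 4 SIMDs
};

bool workgroupFits(const CuBudget& cu, int wavesPerGroup,
                   int vgprsPerWave, int sgprsPerWave, int ldsBytesPerGroup)
{
    // Simplification: ignore which SIMD each wave lands on.
    return wavesPerGroup * vgprsPerWave <= 4 * cu.freeVgprsPerSimd &&
           wavesPerGroup * sgprsPerWave <= 4 * cu.freeSgprsPerSimd &&
           ldsBytesPerGroup             <= cu.freeLdsBytes         &&
           wavesPerGroup                <= cu.freeWaveSlots;
}

int main()
{
    CuBudget idleCu;
    // A 4-wave workgroup at 64 VGPRs / 32 SGPRs per wave with 16 KB of LDS fits.
    std::printf("fits: %s\n",
                workgroupFits(idleCu, 4, 64, 32, 16384) ? "yes" : "no");
    return 0;
}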

Having a plethora of NDRanges, preferably with distinct kernels, ready to issue wavefronts obviously helps the GPU to pack work into CUs especially if some kernels have allocations that hurt their own throughput. Balancing those with lightweight wavefronts will make the CUs run at a higher duty cycle.

It's going to be very rare that NDRanges that consist of only a single wavefront are issued. Usually there'll be 10s of wavefronts all the way up to 100s of thousands. So even with only a small set of NDRanges live on the GPU at any one time, the GPU will have plenty of choice about what work to issue to CUs to maximise their utilisation.

Of course, with 10s of different kernels on the GPU, another constraint arises: instruction cache, which is shared amongst groups of CUs.
 
Sure, but the program which uses lots of GPRs finishes quicker.
This assumes the algorithm cannot be refactored to potentially split into multiple phases, each requiring a lower number of max registers.
The phases could distribute normally within the confines of the existing implementation. It would have overhead, but it also would not disadvantage the breadth of programs out there that do not need to break GCN's resource and execution split.


You can schedule lightweight programs like shadow mapping, stencil passes, hull/domain shaders and so on to the other SIMDs.

There's lightweight and then there's trivial amounts of allocation. In the case of register file contention, just needing a register access may lead to a stall since the register files as they exist have no provision for multiple SIMD clients.
Even if there were a unified file or an interconnect, the routing and arbitration inject much more complexity than the status quo of nothing. Each additional requesting SIMD can generate 3 reads and a write.
Additional details on how extensively the allocation can be split across files would be needed, because the level of sharing would indicate how much the implementation bloats.
A full-share scenario could multiply the port count, and increase the assumed 6T SRAM cell size by 1 transistor per extra port (3 extra clients with 4 ports is 12 extra T), to the point that you could have less capacity per SIMD with sharing than when things were split.
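
Using the post's own assumptions (6T baseline cell, roughly one extra transistor per extra port, four ports per additional SIMD client), the per-bit cost works out like this; a crude tally, not a circuit-level claim.

// Crude tally using the assumptions stated above: a 6T baseline cell,
// ~1 extra transistor per extra port, and 4 ports (3 reads + 1 write) per
// additional SIMD client sharing the register file.
#include <cstdio>

int main()
{
    const int baseCellTransistors = 6;   // assumed single-client SRAM cell
    const int portsPerExtraClient = 4;   // 3 reads + 1 write
    const int extraClients        = 3;   // the other three SIMDs of a CU

    int extraPorts = extraClients * portsPerExtraClient;       // 12
    int sharedCell = baseCellTransistors + extraPorts;         // 18T per bit
    double relativeArea = static_cast<double>(sharedCell) / baseCellTransistors;

    // At equal register-file area, capacity scales inversely with cell size,
    // so a fully shared file would hold roughly 1/3 of the bits per SIMD.
    std::printf("shared cell: %dT, ~%.1fx area per bit, ~%.0f%% of original capacity\n",
                sharedCell, relativeArea, 100.0 / relativeArea);
    return 0;
}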
 
A full-share scenario could multiply the port count, and increase the assumed 6T SRAM cell size by 1 transistor per extra port (3 extra clients with 4 ports is 12 extra T), to the point that you could have less capacity per SIMD with sharing than when things were split.
Or you could just build a local data share...
 
I simply would like to know: if I can only utilize e.g. 23.8% of the chip's capability (occupancy) in the graphics pipeline, then why? Can I get 100%? How?
Ultimately the statistics of real workloads could tell if it helps or hinders.
You can get full utilization with graphics. The easiest way is to have a lot of pixel shaders and the right balance of GPR usage in order to get good occupancy.
 
... (2 cycles per wavefront, I believe) ...

The posted slides state that the ACEs can create one workgroup and dispatch one wavefront per cycle. They don't say anything about the graphics engines; maybe each engine can do the same, or maybe only the big CP can do this.

It's going to be very rare that NDRanges that consist of only a single wavefront are issued. Usually there'll be 10s of wavefronts all the way up to 100s of thousands. So even with only a small set of NDRanges live on the GPU at any one time, the GPU will have plenty of choice about what work to issue to CUs to maximise their utilisation.

Thanks for the explanation. I'm not sure it gets me any nearer to an explicit wavefront life-cycle "picture" for the whole graphics pipeline for one draw call (be it 1 triangle or more).
There must be a way to balance the whole thing: a way to rank and list all local optimization minima for some pipeline configuration and stall probability.

Something like, 1 triangle with [1,1,1,3] tessellation will lead to:
- 3 vertex shader threads: 1 SIMD, 18.75% utilization, if it stalls bad luck; any non-multiple of 4 vertices means underutilization
- 1 hull shader const thread: 1 SIMD, 6.25% utilization, ...
- 1 hull shader cp thread: 1 SIMD, 6.25% utilization, ...
- 6 domain shader threads: 1 SIMD, 37.5% utilization, ...
- say 120 pixel shader threads: 1 SIMD, 750% utilization, ...

Problem is they get executed 4x (why? what's the technical reason?), for a wavefront size of 64, so divide them all by 4. Let's try to tune this (see the sketch after the list below):

- 64 vertex shader threads: 1 SIMD, 400% utilization, if it stalls bad luck
- 62 hull shader const threads: 1 SIMD, 387.5% utilization, ...
- 186 hull shader cp threads: 1 SIMD, 2906.25% utilization, not divisible by 64
- 372 domain shader threads: 1 SIMD, 5881.25% utilization, arrg, not divisible by 64
- say 120 pixel shader threads: 1 SIMD, 750% utilization, ...
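
For what it's worth, a small helper in the spirit of the list above, assuming 64-wide waves (each issued over four cycles on a 16-lane SIMD, which is the usual explanation for the 4x); the interesting figure is how full the last wave of each stage is. Thread counts are the ones from the list; everything else is generic arithmetic.

// Waves needed for N threads and how full the final wave is, assuming
// 64-wide waves. Thread counts are taken from the list above.
#include <cstdio>

void report(const char* stage, int threads)
{
    const int waveSize = 64;
    int waves    = (threads + waveSize - 1) / waveSize;   // ceiling division
    int lastWave = threads - (waves - 1) * waveSize;      // threads in final wave
    std::printf("%-18s %4d threads -> %2d wave(s), last wave %5.1f%% full\n",
                stage, threads, waves, 100.0 * lastWave / waveSize);
}

int main()
{
    report("vertex shader", 64);     // exactly one full wave
    report("HS control point", 186); // 3 waves, last one 58/64 = 90.6% full
    report("domain shader", 372);    // 6 waves, last one 52/64 = 81.3% full
    report("pixel shader", 120);     // 2 waves, last one 56/64 = 87.5% full
    return 0;
}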

Is it all on the same SIMD? It seems so, except if I put the HS into mem-mode, then the DSs might be created somewhere else.
Are the pixel shader threads also all piped through the same SIMD?
What is the really real maximum utilization one could possibly construct?
It appears possible to me to design a slider for tessellation which, when used, oscillates up and down in rendering speed, 4x being faster than 2x and such. I'm certain this oscillation appears in other configurations without tessellation as well. Subdivide your NURBs: oh, 2512 vertices is faster than 2468 (reason in the vertex shader), oh wait, 2546 is even faster (reason in the rasterizer, no thin triangles).
I hate being unable to carefully tune it. :D

I was really happy with the old version of CodeAnalyst which had the pipeline analysis; it helped me so much in getting everything I could out of x86 SIMD.

This assumes the algorithm cannot be refactored to potentially split into multiple phases, each requiring a lower number of max registers.
The phases could distribute normally within the confines of the existing implementation. It would have overhead, but it also would not disadvantage the breadth of programs out there that do not need to break GCN's resource and execution split.

I don't see a possibility to create a custom graphics pipeline, like 2x domain shader stages, or 6 pixel shader stages. I see it as conceptually possible, but I don't know if the hardware could support it. In a compute shader I could fake a pipeline even in one monolithic shader by separating scopes {} and communicating data across scopes through LDS. I don't have an LDS in the pixel shader. I have only the possibility to use a UAV as a scratchpad.

But on the other hand I've seen that it works out better more often than not to split a compute shader into smaller separate pipeline pieces, even if it means going through UAVs; I believe because it was easier to load-balance the hardware with more but tinier fragments of work (fewer points of stall as well). I don't think GPR pressure reduction was causing it, as the data was "cached" in the LDS instead of the UAV anyway.
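
For the single-monolithic-shader variant mentioned above, a rough CUDA analogue (shared memory standing in for LDS, a barrier as the stage boundary; the names and the per-stage work are made up for illustration):

// Two "pipeline stages" faked inside one kernel: stage 1 writes intermediates
// to shared memory (the LDS analogue), a barrier marks the stage boundary,
// stage 2 consumes a neighbour's intermediate. Launch with 256 threads/block,
// e.g. twoStagePipeline<<<gridDim, 256>>>(d_in, d_out, n);
__global__ void twoStagePipeline(const float* in, float* out, int n)
{
    __shared__ float staged[256];                 // per-block scratch
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage 1: produce an intermediate value per thread.
    {
        float v = (gid < n) ? in[gid] : 0.0f;
        staged[threadIdx.x] = v * v;              // placeholder work
    }
    __syncthreads();                              // "stage boundary"

    // Stage 2: consume this thread's and its left neighbour's intermediates.
    {
        int left = (threadIdx.x == 0) ? 0 : threadIdx.x - 1;
        if (gid < n)
            out[gid] = staged[threadIdx.x] + staged[left];
    }
}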

Edit: correct HS/DS numbers
 
I don't understand your executed by 4x statement. 120 pixel shader threads won't launch on 1 SIMD for AMD hardware. There will be at least 2 waves so 2 SIMDs and these SIMDs are likely different than the SIMDs executing the other shader types.

With a small amount of geometry like one triangle or one patch you're not going to get good SIMD utilization on any PC architecture. Intel would fare the best due to its 16 wide SIMD.

You never really want draw calls this small because the last wave in the draw is likely to be underutilized. Using larger draw calls amortizes this cost.

Sometimes under utilization doesn't matter though. For example, drawing a full screen quad has poor VS utilization, but the PS will have good utilization and that's where the cost is because the ratio of PS to VS is high. That's not good enough for some people so they might draw a single triangle large enough that it covers the entire screen. The rasterizer will throw away the parts that lie beyond the screen and PS efficiency will be perfect. If the fill rate is 64 pixels/clock you'll see a new PS wave launched every clock on AMD hardware. Nvidia probably launches 2 warps/clock due to the smaller warp size.
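
The covering triangle is usually built directly in clip space; a generic sketch of the three positions (not tied to any particular engine's conventions):

// The usual full-screen triangle: three clip-space positions whose triangle
// encloses the whole [-1,1] viewport square. The rasterizer clips the overhang,
// so every visible pixel is covered exactly once and there is no diagonal seam.
struct Float4 { float x, y, z, w; };

static const Float4 kFullScreenTriangle[3] = {
    { -1.0f, -1.0f, 0.0f, 1.0f },   // bottom-left corner of the viewport
    {  3.0f, -1.0f, 0.0f, 1.0f },   // two viewports past the right edge
    { -1.0f,  3.0f, 0.0f, 1.0f },   // two viewports past the top edge
};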
 
In a compute shader I could fake a pipeline even in one monolithic shader by separating scopes {} and communicating data across scopes through LDS. I don't have an LDS in the pixel shader. I have only the possibility to use a UAV as a scratchpad.
In theory yes. In practice, the shader compiler mashes all the instructions together as it tries to maximize the instructions between each load and store (to hide potential memory latency). You can try to limit this by loops, branches and barriers, but there's no foolproof technique to make the compiler do exactly what you want. And that's a big problem.
 
I don't understand your executed by 4x statement.

A SIMD executes its vector 4 times to get to a wavefront size of 64. The slides say it. I learned that using thread-groups <64 means underutilization. If I have 48 threads, then the SIMD is underutilized, even though it could do it in just 3 passes.

120 pixel shader threads won't launch on 1 SIMD for AMD hardware.

I want to know if the pixel shaders which are related to the outputs of some vertex/domain shader all execute on the same SIMD as the vertex shader. The interpolants are put into LDS, and they appear not to be transferred around on the chip, so the only solution that appears to me is that all pixel shaders belonging to some vertices need to stay on the same SIMD. Unless the rasterizer has some broadcast functionality where the interpolants are moved to other SIMDs, or the interpolants between vertex/domain shader and pixel shader are not put into LDS. Same for the tessellator.

There will be at least 2 waves so 2 SIMDs and these SIMDs ...

The previous responses told me that this is not the case, that these are 2 waves on the same SIMD. I was asking about vertical vs. horizontal execution of a variable number of waves with the same program.

... are likely different than the SIMDs executing the other shader types.

How does that work? How is information shared between stages if not through LDS? LDS means you could rotate SIMDs within one CU, so it could be on 1 of the other three. There is a register in the hull shader which puts it into memory-mode where interpolants spill to memory, or else they stay in the SIMD, and the domain shaders need to stay where the hull shader was.

That's not good enough for some people so they might draw a single triangle large enough that it covers the entire screen.

It's measurably faster. Two triangles also have interpolation precision problems. On AMD the top-left triangle is perfect and the right-bottom triangle has quantization banding the further you go below half the screen; just use frac() and integer screen coordinates. It's better to use just one continuous bary-space.
 
For AMD, data doesn't completely stage through LDS for pixels. A VS might execute on CU 0, SIMD 0 and a PS created by the triangle executes on CU 5, SIMD 2. Forcing all PS waves to execute on the same CU or SIMD would suck performance-wise. There's some global memory involved.

Also, consecutive PS waves are very unlikely to be placed on the same SIMD.
 
GpuMmu model

https://msdn.microsoft.com/en-us/library/windows/hardware/dn894173(v=vs.85).aspx
 
Hmm...
I've fairly thoroughly read through the leaked SDK and I believe Async Shaders (async compute shaders) have been available on DX11 Fast Semantics since the latter half of last year. A lot of these discussed DX12 features now present in XO have come late, but I believe they are available for use today.

The only thing that has not changed is that X1 is still stuck on immediate and deferred context rendering. The move to DX12 will give them the full multithreaded rendering that PS4 has enjoyed for some time now. This will probably be a boon for async shaders, since it's ideal to be submitting async shaders in parallel with regular draws; at least that's my assumption, as submitting them serially seems to 'miss' the point a bit. Additionally, multi-threaded rendering gives developers a chance to make their draw calls more granular, which combined with async should provide an overall performance increase.

Should be good for Xbox; perhaps it's been way underperforming compared to where they want it to be. But as far as I know, read and understand to date, GNM has provided all these features up till now (though maybe not officially supported by Sony - developers are left to use the API to create these features themselves).

The rest of that article seems a bit misinformed.
 
"Async Shaders have been enabled in DirectX 12, which were not available in its predecessor. A few PS4 titles have gone to the trouble of implementing the feature (such as Infamous: Second Son and Battlefield 4.)" - meaning that most didn't and PS4 is likely to benefit about as much or more, perhaps?
 
"Async Shaders have been enabled in DirectX 12, which were not available in its predecessor. A few PS4 titles have gone to the trouble of implementing the feature (such as Infamous: Second Son and Battlefield 4.)" - meaning that most didn't and PS4 is likely to benefit about as much or more, perhaps?
The list that AMD provided wasn't exhaustive, as I have been directly/indirectly informed on another thread. Not hard to believe that some particular exclusives of PS4 leverage it effectively.
 
Phil Spencer retweeted this article:

[article screenshot]


We've known from the XDK leak that the interesting features were only enabled (out of beta) from around Oct 2014, and only really added a couple of months before. So devs really haven't had a chance to implement them. The XB1's current APIs are DX10/11 based.

The DX12 API, as many have said, has not arrived on the XB1 at all. That arrives when Win10 drops and the Xbox team has had a chance to bring it across to the XB1 OSs. I'm still trying to work out what the XB1 architecture will look like once Win10 arrives, and what the GameOS is going forward now that Win10 will have native Windows containers and all apps/games will go through the One Store...
 