DirectX 12: The future of it within the console gaming space (specifically the XB1)

That madman has finally revealed his motives for his blog, and he is now suckering his followers into donating money to his PayPal account.
That's one way to start a religion.

The GCP is responsible for running a number of contexts; when a context hits a snag and can no longer process instructions, instead of stalling it will switch over to the next context.

The command processors generally receive a queue command, attempt to arbitrate sufficient resources to launch a wavefront, and then let those wavefronts do as they will. Typical stall situations at a shader level are handled by the CUs, whose contexts run independently once initialized and launched.
Front end engines try to race ahead and process what they can from their queues, up until they cannot arbitrate for the resources necessary or are waiting on synchronization.
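As a toy illustration of that race-ahead behaviour (a sketch only; the command names, queues and signals below are made up, not real packet formats), each front end walks its own queue and only stops when it hits a wait it can't satisfy yet:

from collections import deque

signals = set()   # synchronization signals that have fired so far

def run_front_end(name, queue):
    """Consume commands until the queue is empty or a WAIT blocks us."""
    while queue:
        kind, arg = queue[0]
        if kind == "WAIT" and arg not in signals:
            print(name, "stalled waiting on", arg)
            return                      # this front end stalls; others keep going
        queue.popleft()
        if kind == "SIGNAL":
            signals.add(arg)
        print(name, "processed", kind, arg)

gfx     = deque([("DRAW", "scene"), ("SIGNAL", "gbuffer_done")])
compute = deque([("WAIT", "gbuffer_done"), ("DISPATCH", "lighting")])

run_front_end("compute", compute)   # blocks: gbuffer_done hasn't been signalled
run_front_end("gfx", gfx)           # races ahead and raises the signal
run_front_end("compute", compute)   # now proceeds past the wait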

The Xbox pulls memory from two pools, ESRAM and DDR3. One has amazing latency and the other is OK; one has superior bandwidth and the other doesn't.
There are GPUs that run with DDR3 already. It doesn't bother them much. The CUs see the ESRAM as being roughly half the latency of DDR3, which is a significant relative improvement but still slow in CPU terms.

I think this is pretty high level and needs to be verified, but at this point I think this is where the rabbit hole goes deeper: the reason why you may actually prefer smaller draw calls, and more of them.
GPU hardware is pretty coarse in the granularity of work items it can run without wasting hardware. Small batches are an issue with current hardware. Draw calls that mess with graphics state enough can make the GPU stall to handle this, which may have more to do with the hardware contexts than memory latency.

My understanding here isn't fully clear, since I'm not sure what happens at the SIMD level if I render 100K boxes with the same shader.
Primitives are evaluated and transformed into pixel coverage, and depending on what you're doing that takes some number of wavefronts as well. Each pixel, or rather quad of pixels, will need to go into a wavefront, which is 64 pixels in size. The number of wavefronts comes down to how many it takes to contain all the quads. The GPU culls or rejects things at multiple points, so how much gets handed off or may be discarded is variable.
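Napkin math for the pixel-shader side of that, assuming the 2x2 quads and 64-wide wavefronts described above (a sketch, not how the rasteriser actually allocates work):

import math

WAVE_SIZE = 64        # work items (pixels) per wavefront
QUAD = 4              # pixel shading is launched per 2x2 quad

def pixel_waves(covered_quads):
    """Wavefronts needed to shade the quads a primitive actually touches.
    Partially covered quads still occupy all 4 lanes (helper pixels),
    so coverage is counted in quads rather than pixels."""
    return math.ceil(covered_quads * QUAD / WAVE_SIZE)

print(pixel_waves(30))   # a small box touching 30 quads -> 2 wavefronts
print(pixel_waves(0))    # fully culled or rejected -> 0 wavefronts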

Does it do the quoted above? I'm not sure how many wavefronts are submitted.
Without knowing what is being done for each box, how much is culled, and how much they cover, it would be unknown.

edit: spelling
 
Lol this is getting way deeper than I expected, but I might as well ask.
GCN CUs are capable of 10 wavefronts. So is the GCP responsible for submitting 10 wavefronts, and waiting for the CU to signal the work is done?
According to Timothy's blog:
  • The 5 instructions need to come from different wavefronts.
  • The 5 instructions need to be of different types.
So the CU can operate on more than one wavefront at a time, if I'm understanding correctly. Is it up to the graphics programmer to optimize the command buffer such that, for instance, all 5 issue slots are in use at all times? Eventually, as time goes on, I imagine that performance optimization is going to come down to wavefront/CU optimization. And so it seems like a particularly important topic, since the low-level API will be directly responsible for that interface.
 
Lol this is getting way deeper than I expected, but I might as well ask.
GCN CUs are capable of 10 wavefronts. So is the GCP responsible for submitting 10 wavefronts, and waiting for the CU to signal the work is done?
Submitting wavefronts is one of the things the GCP can do. Accepting host commands, maintaining the GPU and fixed-function pipeline, starting certain kinds of DMA transfers early, sending interrupts, and handling system and synchronization tasks at the queue level also happen there.

I'm unclear on the internals, so I'm not sure how much it waits on signals from CUs.
There are enough front ends that there is arbitration and scoreboarding hardware for them to hash out who gets what, so the resources are significantly decoupled from the front ends.

So the CU can operate on more than one wavefront at a time, if I'm understanding correctly. Is it up to the graphics programmer to optimize the command buffer such that, for instance, all 5 issue slots are in use at all times? Eventually, as time goes on, I imagine that performance optimization is going to come down to wavefront/CU optimization. And so it seems like a particularly important topic, since the low-level API will be directly responsible for that interface.
That is probably more dependent on the shader programmer and rendering engine. What goes into the shaders is not the same as what goes in the command queues. The front end passes around data and pointers necessary to set the right modes and then a relatively compact set of values that the CUs use to find their code, target data, and settings.
For compute, it can be very compact with pointers to the shader program and values indicating the dimensions of the kernel. The CUs handle their own instruction fetch.
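Just to illustrate how compact that can be, here's a made-up dispatch record (a hypothetical structure for illustration, not the real packet format the hardware uses):

from dataclasses import dataclass

@dataclass
class ComputeDispatch:           # hypothetical structure, for illustration only
    shader_code_addr: int        # pointer the CUs use to fetch their instructions
    user_data_addr: int          # pointer to constants / resource descriptors
    group_size: tuple            # threads per group, e.g. (8, 8, 1)
    group_count: tuple           # groups in the dispatch, e.g. (160, 90, 1)

# Everything else (instruction fetch, register allocation, wave launch)
# happens on the CU side once something like this is handed over.
packet = ComputeDispatch(0x10000000, 0x20000000, (8, 8, 1), (160, 90, 1))
print(packet)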
 
Yea I started reading this:

http://developer.amd.com/community/blog/2014/05/16/codexl-game-developers-analyze-hlsl-gcn/

HLSL optimization for GCN. You pretty much nailed it. Wavefront occupancy is determined by the shader code: if you exceed the VGPR and SGPR budgets, you'll get fewer wavefronts per CU. You want more wavefronts to improve latency hiding/thread switching; without at least a couple of wavefronts, the risk of stalling your CU is extremely high. But there are other factors that feed into that as well.
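A rough occupancy sketch of that trade-off, using the commonly quoted GCN budgets of 256 VGPRs and 512 SGPRs per SIMD and a 10-wave cap (allocation granularity ignored; the shader numbers are invented for illustration):

def waves_per_simd(vgprs_per_wave, sgprs_per_wave):
    """Estimate how many wavefronts fit on one SIMD at once."""
    MAX_WAVES   = 10
    VGPR_BUDGET = 256    # per-lane vector registers available on one SIMD
    SGPR_BUDGET = 512    # scalar registers available on one SIMD
    return min(MAX_WAVES,
               VGPR_BUDGET // max(vgprs_per_wave, 1),
               SGPR_BUDGET // max(sgprs_per_wave, 1))

print(waves_per_simd(24, 32))    # lean shader   -> 10 waves
print(waves_per_simd(84, 48))    # fatter shader -> 3 waves
print(waves_per_simd(200, 48))   # register hog  -> 1 wave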

So much can be learned from this one page. So much more interesting than web work and networking. Goodness what did I do with my last 8 years.
 
Wow, really? lol, a source I can read, that's pretty cool.

http://www.vgleaks.com/playstation-4-balanced-or-unbalanced

http://www.eurogamer.net/articles/digitalfoundry-face-to-face-with-mark-cerny

The first link talks about the extra ALU on 4 CUs.

From the last link:

"Digital Foundry: Going back to GPU compute for a moment, I wouldn't call it a rumour - it was more than that. There was a recommendation - a suggestion? - for 14 cores [GPU compute units] allocated to visuals and four to GPU compute...
Mark Cerny: That comes from a leak and is not any form of formal evangelisation. The point is the hardware is intentionally not 100 per cent round. It has a little bit more ALU in it than it would if you were thinking strictly about graphics. As a result of that you have an opportunity, you could say an incentivisation, to use that ALU for GPGPU."
 
This is old news and you are misinterpreting what he is saying, as did others last year. The CUs are the same; Cerny was trying to push compute by saying you don't need 100% of all 18 CUs purely for rendering.
 
Anyway, I made a mistake in my answer... I had read that a long time ago (2013), and remembered the words "extra" ALU as meaning a second ALU.
But after finding the links and reading them all over again, I see it's not meant in that sense! There is no second ALU at all. The 4 CUs are simply there for extra ALU power, not fitted with a second ALU.

Sorry about that!
 
This is old news and you are misinterpreting what he is saying, as did others last year. The CUs are the same; Cerny was trying to push compute by saying you don't need 100% of all 18 CUs purely for rendering.

Yes... I saw that as soon as I re-read it! I had read that in 2013... and kept a wrong idea!
 
  • The 5 instructions need to come from different wavefronts.
  • The 5 instructions need to be of different types.
So the CU can operate on more than one wavefront at a time, if I'm understanding correctly.
There are 4 vector ALUs (16-wide) and 1 scalar ALU in each CU. Each of the VALUs spends 4 cycles executing a single instruction and there could be a vertex, hull, pixel and compute shader loaded into the instruction cache:
  • On cycle 1, VALU1 starts a new instruction - e.g. vertex shader: MAD v0, v1, v2, v3
  • On cycle 2, VALU2 starts a new instruction - e.g. hull shader: MUL v4, v2, v3
  • On cycle 3, VALU3 starts a new instruction - e.g. pixel shader: ADD v17, v47, v99
  • On cycle 4, VALU4 starts a new instruction - e.g. compute shader: MIN3 v11, v18, v22, v31

On cycle 5, VALU1 can start another instruction. On cycle 6, VALU2 can start another instruction, etc. Instruction issue to the VALUs is offset by 1 cycle, and each VALU spends four cycles executing the wavefront on 64 work items, as four lots of 16.
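Putting rough numbers on that cadence (the 800 MHz clock is an assumption used purely for scale; the 18 CU count comes from the Cerny quote earlier in the thread):

SIMDS_PER_CU   = 4
LANES_PER_SIMD = 16

# Each SIMD starts one 64-wide wavefront instruction every 4 cycles, staggered
# by one cycle, so a CU completes 4 * 16 = 64 vector results per cycle.
results_per_cycle = SIMDS_PER_CU * LANES_PER_SIMD       # 64
flops_per_cycle   = results_per_cycle * 2               # a MAD counts as 2 flops

# Scale it up: 18 CUs at an assumed 800 MHz clock gives roughly 1.84 TFLOPS
# of peak MAD throughput.
print(flops_per_cycle * 18 * 800e6 / 1e12)              # ~1.84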

VALU1 can get the new instruction from the vertex shader, or from some other wavefront (e.g. another wavefront running the same code). The same applies to all the VALUs: they get their work from any wavefront whose context is assigned to them and which has a VALU instruction ready to go.

The Scalar ALU is shared by all wavefronts on the CU. Every cycle it will execute an instruction. That instruction can come from any of the wavefronts that are loaded into the CU. Since it's a scalar ALU instruction, it only needs to run once and all 64 work items in the wavefront will be able to access the result of that instruction on the next cycle (with some minor exceptions).

So, in a single cycle, the CU can issue 1 VALU and 1 scalar instruction. That leaves 3 other instructions. They can be selected from: local data share, global data share or graphics export, vector global memory access, branching and then machine ops that have no meaning in code (such as signalling which long-latency operations need to complete).

Is it up to the graphics programmer to optimize the command buffer such that, for instance, all 5 issue slots are in use at all times?
It's up to the programmer to utilise all the resources efficiently. e.g. if you can write code that has the VALUs, SALU, LDS and global memory systems all running at full load with none of them being stalled by dependencies upon any of the others, then you can say you are fully using the machine. This is definitely possible.

Bear in mind that all the wavefronts loaded into the CU are being choreographed to use these resources. Whether those wavefronts are all running the same shader or are all from different shaders (kernels), it doesn't matter. The CU works with a variety of scoreboards to identify which wavefronts can have an instruction executed.

Eventually, as time goes on, I imagine that performance optimization is going to come down to wavefront/CU optimization. And so it seems like a particularly important topic, since the low-level API will be directly responsible for that interface.
The API doesn't materially affect what happens once shader/kernel wavefronts are despatched to a CU (unless the API allows for prioritisation of work or pre-emption or allows for exceptions to be handled). Once the wavefront has been delivered to the CU, you can think of the CU as a factory: it'll work on the instructions given to it until the code for that wavefront comes to an end (or is killed).

Progress of the wavefront can stop the whole GPU from doing other things. In old-fashioned GPUs, if you wanted to change the z buffer comparison operator, you'd wait until all the pixel shader wavefronts running with the first z operator had finished. Once done, the GPU would flip a switch and then the pixel shader for the new operator could start. So the wavefronts in this example are the cause of a GPU-wide delay.

I don't actually know if modern GPUs can render to both operators simultaneously. I'd hope so (at least if it was on distinct z buffers). It's certainly theoretically possible, if you scoreboard pixel export at the pixel quad level (or render target tile level), rather than at the state change level (for the z buffer comparison change)...
 
My understanding here isn't fully clear, since I'm not sure what happens at the SIMD level if I render 100K boxes with the same shader.
Let's assume that you change the texture, vertex and constant buffer bindings between the draw calls. Let's also assume a GPU that executes threads using 64-wide waves (64-wide logical SIMD).

A SIMD always executes an identical instruction (in lock step) for 64 threads. This is also true for load/store (including texture and vertex fetch) instructions. A single texture fetch instruction takes UV coordinates from 64 threads and returns 64 filtered texels to the GPRs. All texels come from the same texture. The same is true for vertices: a single data load/store accesses 64 items from a single resource.

Constant data often also flows through an optimized data path (fixed-function hardware or a dedicated read-only scalar cache). One wave reads the constants once.

This all means that every thread in the same wave needs to come from the same draw call. This is true for both vertex shader and pixel shader waves. For small draw calls that process fewer than 64 vertices, the remaining threads of the wave do no productive work (they execute the same instructions as the others but are masked out). The same is true if the pixel shader produces fewer than 64 pixels per object.
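So for the 100K-boxes example, the waste is easy to put a number on (a sketch only; it ignores vertex reuse and assumes every draw has to start its own waves, as described above):

import math

WAVE = 64

def vertex_waves(verts_per_draw, draws):
    """Vertex waves launched, and how full each wave is on average."""
    waves_per_draw = math.ceil(verts_per_draw / WAVE)
    utilisation = verts_per_draw / (waves_per_draw * WAVE)
    return waves_per_draw * draws, utilisation

# 100,000 boxes with 24 unique vertices each, one draw call per box:
print(vertex_waves(24, 100_000))        # (100000, 0.375) -> 62.5% of lanes idle

# the same 2.4M vertices submitted as one big draw:
print(vertex_waves(24 * 100_000, 1))    # (37500, 1.0)    -> fully packed waves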
 
Okay, thanks everyone. At this point it's a good time to just stop, read, and learn. Write some HLSL to get a feel for it, and see what it's doing. It's unfortunate that I don't have a GCN-based card, but I imagine Nvidia has tools for their Kepler line similar to CodeXL.

So I guess what I'm understanding here is: does this indirectly mean you should be optimizing your shader not just for the hardware, but also for the API?
I can imagine that in situations like DX11 things are generally not as compact, data and command calls come at a different rate than they would for DX12, and there are different feature sets as well - but just looking at standardized calls between the two: would timing and synchronization have an effect on how shaders are optimized?

Say you want to optimize for 10 wavefronts; each wavefront would have fewer VGPRs to work with. This might imply your shader code being less complex/smaller, and if you want longer, more complex shaders you're going to use more VGPRs and lose out on some wavefronts. So in the situation of DX11 submitting work into the pipeline, to maximize the use of the GPU, you might write shaders with, say, 3-5 wavefronts. But with DX12 submitting from multiple queues, and a lot of them, maybe you want shaders to operate differently; the optimization point moves, and maybe more wavefronts would get more use out of the pipeline.

What is the actual relationship between API calls and shader code, if any?
 
You misunderstood. You always have a specific SIMD-width granularity on specific hardware, totally unrelated to the API. You have more than one SIMD unit on the chip!
Look here for a graphic: CU/SIMD
The 290X has 44 CUs, each CU has 4 SIMD units (which means 176 SIMDs), and each SIMD unit is 16 elements wide (which means 2816 elements).
The CU can revive/sleep the instances running on its SIMD units when they stall and run another instance _of the same program_. The CU's SIMDs share some memory, which is used for pipeline-stage parameter passing (you know: vertex-shader out -> pixel-shader in). If this memory is too small for 4 instances, then you only get 3 or 2 or 1. That's an optimization problem for the driver; in theory it could spill to memory, but then it gets really slow. It also limits how much revive/sleep can be conducted in the CU.
Each CU can have its own program running and normally has no relationship with the other CUs.

Okay. Now this means you could run 44 different shader programs simultaneously. Of each program you can run 4 different tiles of size 4x4, which is exactly the z-buffer tile size. They don't need to be related at all; not the same triangle, possibly not the same draw call, just the same inputs and program.

It doesn't matter if it's DX11 or DX12 if the local CU's state turns out to be identical. DX12 will not use a different amount of GPRs, and multiple queues don't change the way it operates locally.
 
Each CU can have its own program running and normally has no relationship with the other CUs.
CUs can actually run multiple programs each. Basically, each wave could be running a different compute program. This is because GCN doesn't need global state for compute: all the bindings (resource descriptors) a compute shader needs are loaded into SGPRs. Graphics rendering, on the other hand, needs global state (DirectX graphics state is huge), meaning that there is some global limit on how many simultaneous graphics shaders with different state can be executed concurrently.

How many programs / threads can be running on a single CU is based on the following:

1. The programmer defines a thread group size for his/her compute shader program. Group size is rounded up to the hardware wave size.
2. Shader programs can communicate inside the thread group (synchronization barriers, thread local shared memory = LDS). Communication needs fast connection between the threads, meaning that the whole thread group needs to execute on a single CU.
3. A thread group needs N waves of threads (defined by the programmer), some amount of LDS (defined by the programmer) and some amount of registers (defined by the programmer/compiler) to run.
4. Thread groups are completely independent of each other. Whenever a CU has enough free resources to start executing a new thread group, the scheduler pushes it a new thread group. Thread groups executing on a single CU might come from different queues and might execute different shaders.

Graphics core next CUs have the following resources:
- 256 KB registers (VGPR)
- 8 KB scalar registers (SGPR). These are used mainly for branching, resource descriptors and constants.
- 64 KB LDS (thread group shared work memory)
- 40 waves (2560 threads)

Source: http://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah

A CU can run as many thread blocks as it has resources for. A thread block needs to fit fully on a CU in order to start (otherwise there might be deadlock cases, as barriers need each thread to advance to a certain position in order for the shader to continue). Big thread blocks (lots of threads, lots of registers, big LDS usage) are often problematic. For example, if your thread block needs 33 KB of LDS to run, you can only run one of those per CU. It is important to know the CU resource limits in order to fit as many thread blocks onto each CU as possible.
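A little fitting calculator along those lines, using the per-CU limits listed above (allocation granularities for VGPRs/SGPRs/LDS glossed over; the example shader numbers are invented):

import math

# Per-CU limits quoted above for GCN:
MAX_WAVES_PER_CU = 40
VGPR_BYTES       = 256 * 1024          # vector register file
SGPRS_PER_CU     = 8 * 1024 // 4       # 8 KB of 4-byte scalar registers
LDS_BYTES        = 64 * 1024

def groups_per_cu(group_threads, vgprs_per_thread, sgprs_per_wave, lds_bytes):
    """How many thread groups of this shape can be resident on one CU."""
    waves = math.ceil(group_threads / 64)
    vgpr_bytes_per_group = waves * 64 * vgprs_per_thread * 4
    return min(MAX_WAVES_PER_CU // waves,
               VGPR_BYTES // vgpr_bytes_per_group,
               SGPRS_PER_CU // (waves * sgprs_per_wave),
               LDS_BYTES // max(lds_bytes, 1))

# The 33 KB LDS example from above: only one such group fits per CU.
print(groups_per_cu(256, 32, 48, 33 * 1024))   # -> 1
# A leaner group (256 threads, 24 VGPRs, 8 KB LDS) is LDS-limited to 8 groups.
print(groups_per_cu(256, 24, 48, 8 * 1024))    # -> 8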
 
Okay, this reads as though it's possible for the global scheduler, which assigns programs to CUs, to pack programs with a large GPR footprint together with programs with a small GPR footprint, so as not to lose the slot and underutilize the CU. The question is: is the 40-program queue per CU like a stack, where you can just drop a new program in if you have space? Or do all the programs need to be set at once? GCN also guarantees that all 64 threads run on the same CU, which means if you can only run 1 of them at a time (a lot of GPR use, for example), they are run sequentially.
In addition, the 40 programs would have to be compatible in their packings, or you need logic to only pick combinations of 4 which do not exceed your resources. Say program 1+2+3+4 can be run together, but 1+2+3+5 not, and 2+3+4+5 also not, but 2+3+5+6 and so on. Much effort.

Thinking about the decision levels though, I think it's nasty. The to-be-run programs are not known in advance (anymore), you don't know how often they run, you can't calculate the perfect packing. You have no clear metric to decide if it's faster to run 4 different programs in parallel and their instances sequentially or if you run the instances in parallel and the programs sequentially.

Is there a tool/profiler which can make pipeline graphs of the scheduling behaviour in the CUs? Or if not, is there some way to track the CU's state to do it yourself?
 
All 64 threads in a wave run on one SIMD, so even if you're fully VGPR-limited you'll still have 4 waves per CU (one per SIMD). Each of these waves can execute a different program. You'll want more waves than that to get good performance in most situations.

Console vendors have tools showing when and where a wave executes. There have been a few pictures in presentations.
 
Hm, I see. I thought GPR space was shared across all 4 SIMDs, but it's not. Having it shared could delay GPR overcommission by a bit? Otherwise, wouldn't there be an opportunity to run overcommitted shaders in parallel (or vertically) at the expense of being unable to switch to the same program, but to another, if it stalls?
I'm not interested in compute; I want to see the GPU's state while running pipeline wavefronts.
 
Hm, I see. I thought GPR space was shared across all 4 SIMDs, but it's not. Having it shared could delay GPR overcommission by a bit?
As currently structured, each SIMD has its own register file. These are physically separate pools.
If the register files were somehow able to route signals to other SIMDs, I suppose it would avoid running into register limits in certain combinations of register allocation.
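For scale, the split works out like this (arithmetic only, based on the 256 KB-per-CU figure quoted earlier in the thread):

# 256 KB of VGPRs per CU, carved into four per-SIMD register files:
per_simd_bytes = 256 * 1024 // 4              # 65536 bytes per SIMD
per_lane_regs  = per_simd_bytes // (64 * 4)   # 64 lanes, 4 bytes per register
print(per_simd_bytes, per_lane_regs)          # 65536 256 -> the usual 256-VGPR budget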

However, the register files and their matching SIMDs streamline a lot of things by having only one SIMD capable of addressing or interacting with one register file. In terms of power and footprint, enabling some kind of centrally shared file would necessitate a drop in CU count and/or overall register capacity.
It seems probable there would be an impact on SIMD throughput as well. One SIMD leeching register capacity from others means a whole CU's register resources are favoring one SIMD and potentially starving the other three. Since wavefronts are distributed 10 to a SIMD, that could shut down 3/4 of a CU's instruction issue if the other SIMDs are starved.
Physical constraints on how many units can contend for the same register file may also cause stalls where the favored SIMD steals register file access from its neighbors.

GCN is very heavily architected for a 4-way partition of its resources and instruction issue. It must be a very important shader if 3/4 of everything but registers can be discarded at random.
 