Obviously it would need to keep HW threads alive and so on, but does it affect any of the scheduling beyond just putting a thread to sleep? Globally, it could certainly starve a CU if it was still waiting on an earlier-launched thread to issue the "ordered" operation; what happens if that thread never issues it due to dynamic control flow or otherwise? Does it have to wait until the thread ends completely?
Let's assume the simplest possible (global) GPU scheduler. The scheduler splits the kernel into thread groups. Each thread group executes threads with id in [N, N+M-1], where M is the thread group size and N is a (scheduler) counter that is increased by M every time the scheduler issues a new thread group to a compute unit. When a compute unit finishes any thread group (of any shader, assuming that compute units can run thread groups of multiple shaders), the scheduler checks whether that compute unit has enough resources (GPRs, threads, LDS space, etc) to start a new thread group. If it has enough resources, the scheduler gives it the next thread group.
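A toy simulation can make this scheduler concrete. This is a hypothetical sketch (the slot-based resource model, the `slots_per_cu` parameter, and the lock-step completion assumption are my simplifications, not how any real GPU works); the point it demonstrates is that groups are issued strictly in increasing id order, so group N is never resident before group N-1:

```python
def schedule(num_groups, num_cus, slots_per_cu):
    """Simulate the simple in-order global scheduler described above.

    Toy model: each compute unit (CU) has `slots_per_cu` resource
    slots; a resident thread group occupies one slot. Groups are
    issued strictly in increasing id order, to whichever CU has a
    free slot. Returns the issue order as (group_id, cu_id) pairs.
    """
    issue_order = []
    free_slots = [slots_per_cu] * num_cus
    next_group = 0
    while next_group < num_groups:
        # Issue in order: group N is never issued before group N-1.
        for cu in range(num_cus):
            if free_slots[cu] > 0 and next_group < num_groups:
                free_slots[cu] -= 1
                issue_order.append((next_group, cu))
                next_group += 1
        # Model every CU finishing its oldest resident group,
        # freeing one slot each (groups take roughly equal time).
        free_slots = [min(s + 1, slots_per_cu) for s in free_slots]
    return issue_order
```

With equal per-group durations this degenerates to a simple round-robin: `schedule(8, 2, 2)` issues group 0 to CU 0, group 1 to CU 1, group 2 to CU 0, and so on, always in launch order.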
This kind of scheduling ensures that all previous thread groups have always been issued to compute units (have resources allocated and are ready to execute), meaning that the previous thread groups will eventually also finish. There are no deadlock cases. Waiting for a previous thread group to reach point X in the shader is a safe operation. All modern GPUs have a similar latency hiding mechanism: put waves/warps that are waiting for a memory operation to sleep and immediately switch to another wave/warp. I don't see a problem implementing ordered atomics using this existing mechanism.
Performance:
Let's assume the GPU has the simple global scheduler described above. In the beginning the compute units will be scheduled thread groups [0, K-1], where K is the number of compute units. When the scheduler is done scheduling groups [0, K-1], it notices that each compute unit still has free resources, and schedules thread groups [K, 2*K-1]. So compute unit 0 will receive groups 0 and K, compute unit 1 will receive groups 1 and K+1, etc... If we assume that each thread group takes approximately the same time to complete, the waits will be minimal, because the thread groups were issued in order and will roughly finish in order. And this continues throughout the execution: the scheduler gives a new thread group to the compute unit that finishes a thread group first, the next one to the compute unit that finishes second, and so on.
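To illustrate why the waits stay minimal under these assumptions, here is a toy timing model (entirely my construction: one group per CU at a time, equal group durations, and the ordered point exactly mid-shader). Under these idealized conditions no group ever stalls at the ordered atomic, because its predecessor has always passed the same point earlier:

```python
def ordered_waits(num_groups, num_cus, group_time=10.0):
    """Toy timing model (my simplification, not real HW): each CU runs
    one group at a time, every group takes `group_time` cycles, the
    ordered atomic sits exactly mid-shader, and group g may pass it
    only after group g-1 has. Returns each group's wait at the
    ordered atomic."""
    start = [0.0] * num_groups   # per-group start time
    passed = [0.0] * num_groups  # time each group passes the ordered point
    waits = []
    for g in range(num_groups):
        prev_on_cu = g - num_cus  # previous group on the same CU (round-robin)
        # A group starts when its CU's previous group finishes.
        start[g] = passed[prev_on_cu] + group_time / 2 if prev_on_cu >= 0 else 0.0
        reach = start[g] + group_time / 2        # reaches the ordered atomic
        gate = passed[g - 1] if g > 0 else 0.0   # predecessor's ordered point
        passed[g] = max(reach, gate)
        waits.append(passed[g] - reach)
    return waits
```

If the group durations are made unequal, the same model starts producing nonzero waits, which is exactly the stall that the latency hiding discussed below has to cover.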
In my example prefix sum implementation, the ordered atomic would be in the middle of the shader. When the previous thread group is not yet finished, the GPU would behave similarly to hitting a cache miss in the middle of the shader: other waves/warps (and thread groups) on the same compute unit would execute their instructions to hide the stall, until they also reach the ordered atomic. If there are enough instructions (and waves/warps), the GPU should be able to hide the stall quite well. It also helps that there are instructions at the end of the shader that have no dependencies; these can be executed while another thread group on the same compute unit is waiting for the ordered atomic. So it all boils down to this: are there enough instructions in the local prefix sum shader to hide the ordered atomic latency (the time to bounce the counter value from one compute unit to another)?
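The shader structure being discussed can be sketched as a serialized chained scan. This is a hypothetical Python stand-in for the GPU code, not the actual implementation: `running_total` plays the role of the ordered global counter, and the sequential loop stands in for thread-group launch order:

```python
def chained_prefix_sum(data, group_size):
    """Sketch of the ordered-atomic prefix sum pattern: each thread
    group computes a local prefix sum, then the ordered atomic hands
    it the running total of all earlier groups, which the tail of the
    shader adds to the local results before writing them out."""
    out = []
    running_total = 0  # the "ordered" global counter
    for base in range(0, len(data), group_size):
        group = data[base:base + group_size]
        # Local (inclusive) prefix sum within the group: independent
        # work that can run before the ordered atomic.
        local = []
        acc = 0
        for x in group:
            acc += x
            local.append(acc)
        # Ordered atomic: fetch-add of the group's total, returning
        # the sum of all previous groups in launch order.
        prev_total = running_total
        running_total += local[-1]
        # Tail of the shader: apply the global offset (no further
        # dependencies on other groups).
        out.extend(v + prev_total for v in local)
    return out
```

On a real GPU only the fetch-add in the middle is ordered; the local scan before it and the offset-apply after it are the independent instructions that hide the counter-bounce latency.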
Divergent control flow:
Practically the same issue as barriers + divergent control flow: an ordered atomic inside divergent control flow is invalid, and the compiler should emit an error.
DISCLAIMER: My analysis assumes a super simple global scheduler. Most likely real GPUs have more complex distributed schedulers, making all this much more complex (for both correctness and performance).