DX12 Performance Discussion And Analysis Thread

Just to add hilarious additional vectors to the terminology fun, there is precedent in this specific field for using "concurrent" vs "parallel" (see slide 11). That use gets even more subtle, though, as it speaks to whether code that is currently executing on a processor can guarantee that other code is running in parallel, for coordination/synchronization reasons.

As a language nitpick, I don't know if I like the way concurrent is used in that case, since it makes a distinction that wrecks the meaning of "concurrently". It would make discussion of what a system is doing presently rather awkward when concurrent does not readily flow to its adverb, and parallel...-ly is iffy.
I'm not sure if I would object to asynchronous more than that.
 
DX12 adds both primitives that allow more control over concurrency and primitives that allow multiple queues of asynchronous execution. People seem to be confusing these concepts.

Barriers allow the programmer to define which back-to-back commands from the same queue can be executed concurrently without data hazards. In DX11 the driver decided this by heuristics, which meant that compute dispatches only ran concurrently in simple cases. Graphics ran concurrently more often, since render target writes have a well-defined order (and you could not read and write the same RT, except via blending, which is atomic and ordered).
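For illustration, this is roughly what that explicit control looks like (a minimal sketch; 'cmdList' and 'uavBuffer' are placeholders assumed to exist):

Code:
// Minimal sketch of an explicit hazard in D3D12. 'cmdList' is an
// ID3D12GraphicsCommandList* and 'uavBuffer' an ID3D12Resource*, both
// assumed to be created elsewhere.
cmdList->Dispatch(64, 1, 1);                    // pass 1: writes uavBuffer

D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV; // "wait for UAV accesses"
barrier.UAV.pResource = uavBuffer;              // resource both passes touch
cmdList->ResourceBarrier(1, &barrier);

cmdList->Dispatch(64, 1, 1);                    // pass 2: reads uavBuffer

// Without the barrier the driver is free to overlap the two dispatches;
// with it, pass 2 is guaranteed to see pass 1's writes.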

Multiple command queues didn't exist in DirectX 11; they are a completely new feature in DX12. Commands inside a queue start execution in queue order (DX11's automatic hazard tracking enforced the "visible" order; DX12 needs barriers). With DX12 you can push commands to multiple completely independent command queues that run asynchronously to each other, hence the name asynchronous compute. Some GPUs are also able to execute commands from multiple queues concurrently.
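A minimal sketch of such an independent queue, assuming 'device' (ID3D12Device*) and a recorded compute command list ('computeList') already exist:

Code:
#include <d3d12.h>

// Sketch: creating an independent compute queue next to the usual direct
// (graphics) queue. Work submitted here runs asynchronously to the
// graphics queue; whether it also runs concurrently is up to the hardware.
D3D12_COMMAND_QUEUE_DESC desc = {};
desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute-only queue

ID3D12CommandQueue* computeQueue = nullptr;
device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

// Command lists recorded with the same COMPUTE type are submitted here:
ID3D12CommandList* lists[] = { computeList };
computeQueue->ExecuteCommandLists(1, lists);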
 
As a language nitpick, I don't know if I like the way concurrent is used in that case, since it makes a distinction that wrecks the meaning of "concurrently".
FWIW I agree, hence my preface, but it was too much to resist throwing more fuel on the fire about terminology :)

I'm not sure if I would object to asynchronous more than that.
Asynchronous is clearly too general to describe anything about the actual execution though. Asynchronous really just means... not synchronous. In some sense all graphics APIs are asynchronous (they definitely are for CPU/GPU, and they are in many areas for the GPU internally already), and definitely DX12's queues are asynchronous regardless of the underlying implementation.
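To be concrete about the CPU/GPU part, a sketch (with 'device', 'queue' and 'lists' assumed to exist):

Code:
#include <d3d12.h>
#include <windows.h>

// Submission returns immediately; the CPU runs on, asynchronously to the
// GPU, and only blocks where it explicitly chooses to wait on a fence.
ID3D12Fence* fence = nullptr;
device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
HANDLE done = CreateEvent(nullptr, FALSE, FALSE, nullptr);

queue->ExecuteCommandLists(1, lists);   // returns at once, GPU works later
queue->Signal(fence, 1);                // fence reaches 1 when GPU gets here

// ... CPU does unrelated work here, asynchronously to the GPU ...

fence->SetEventOnCompletion(1, done);   // the one explicit sync point
WaitForSingleObject(done, INFINITE);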

I actually like "simultaneous" and "simultaneously". No one can really misinterpret that :)
 
Does anyone have some more readable and meaningful results about asynchronous copy operations? At least this should be easier to measure.
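For context, here is roughly what such a measurement would submit; a minimal sketch with 'device', 'copyList', 'dst', 'src' and 'numBytes' all assumed:

Code:
#include <d3d12.h>

// Sketch: a dedicated copy queue, which maps to the DMA engines on most
// GPUs, so uploads can overlap graphics/compute work on other queues.
D3D12_COMMAND_QUEUE_DESC desc = {};
desc.Type = D3D12_COMMAND_LIST_TYPE_COPY;
ID3D12CommandQueue* copyQueue = nullptr;
device->CreateCommandQueue(&desc, IID_PPV_ARGS(&copyQueue));

// Recorded into a COPY-type command list:
copyList->CopyBufferRegion(dst, 0, src, 0, numBytes);

// Submit and fence it like any other queue; timing this against the same
// copy on the direct queue is one way to measure the asynchronous gain.
ID3D12CommandList* lists[] = { copyList };
copyQueue->ExecuteCommandLists(1, lists);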

If I understood some documents and presentations correctly, WDDM 2.0 drivers should also bring free-threaded create & destroy operations to non-D3D12 applications too.

EDIT: as for concurrency/async/parallelism, I still like to think of parallelism as a special case of concurrency where two or more tasks are executed on the same hardware simultaneously on different dedicated resources (though not necessarily separate hardware resources). Async just means you don't put explicit synchronization into the code.
 
Asynchronous is clearly too general to describe anything about the actual execution though. Asynchronous really just means... not synchronous. In some sense all graphics APIs are asynchronous (they definitely are for CPU/GPU, and they are in many areas for the GPU internally already), and definitely DX12's queues are asynchronous regardless of the underlying implementation.
Asynchronous does work better above the layer of the abstraction, and concurrent works better below. AMD's asynchronous compute marketing seems to mix above and below that dividing line.

Asynchronous is very encompassing, so I would count that as being a significant demerit for relying on the term when trying to discuss what an architecture is doing. I am just not sure I would object to it more than the usage of concurrent given in that slide, at least in a degenerate case where an implementation is incapable of running the tasks concurrently--either architecturally or due to circumstances at the time.

That set of definitions seems incomplete, or it would require additional qualifications or doubling-back when trying to discuss it in the face of the established meaning of concurrently.

There are some fun alternative word choices, such as coextensive and conterminous.
Coextensive execution would mean the in-flight periods for the tasks overlap.
Conterminous execution means the tasks would execute within the same bounds, which actually might serve better than overloading the term parallel.

I actually like "simultaneous" and "simultaneously". No one can really misinterpret that :)
For some commonly agreed upon margin of error, since that does have an instantaneous component that might be briefly untrue in the case of hardware that can context switch or preempt.
 
(╯°□°)╯
A masterpiece. Your work is done here.

Just a note so I can add something to this discussion. For async compute jobs on GCN (edit: I speculate, based on the information provided below), a default of 4 CUs is allocated for the job unless more are specified.

Fable Legends may use async compute, yes, but it is designed with the XBO in mind, which has 12 CUs, so keep this in mind when you think about this particular benchmark. The Fury X has 64, IIRC.

Optimizing for async compute in games will not be easy because of the varying number of ACE queues and CUs available at any given time across different hardware configurations. It should be doable, but optimizing between Nvidia and AMD will be much harder, since they approach async differently.
 
For async compute jobs on GCN a default of 4 CUs is allocated for the job unless more are specified.
That's not true. At least not in general. I don't know if one of the consoles works this way though and maybe that's where your info comes from.
 
Just a note so I can add something to this discussion. For async compute jobs on GCN a default of 4 CUs is allocated for the job unless more are specified.

I know you read the Xbox One SDK so you might be correct. But I highly doubt it! We are talking GCN here!

If the minimum CU reservation is 4, then you would always work with 4, 8, 12, 16, 20 and so on CUs, never with 1, 2, 3, 5, ..., 17, 18 and so on, right?
In that case, where would the 18 CUs on the PS4 fit? 18 is not divisible by 4, so the maximum CU usage on the PS4 with a default 4 CU allocation would be 16.

This could even be the case, because if we look at this GCN whitepaper we see that every 4 CUs share some L1 caches. Here's a pic:

[Image: GCN.png]


So, what happens with the 2 remaining CUs on the PS4 is a mystery (at least to me). Are they sharing L1 caches alone?

But the 14+4 split has nothing to do with reservations. Mark Cerny talked about the console being balanced at 14 CUs, meaning the 4 extra CUs would give only a minor boost for rendering, and as such would be better used if reserved for compute. It's as simple as that!
Here's Mark Cerny Slide:

[Image: gpu-1024x607.jpg]

Can anyone enlighten us on this?
 
That's not true. At least not in general. I don't know if one of the consoles works this way though and maybe that's where your info comes from.
It is speculative (as you say), as it's written in the XBO SDK, which is where I'm getting my info from. However, after going through all the documentation, I haven't seen much evidence that MS has done anything to really modify the hardware after all is said and done, so I'm leaning towards doubting they changed this aspect too. The default allocation according to the SDK is 4 CUs unless you specify more; there was no indication whether you could allocate less. I'll double check the API though and see how it lines up against the DX12 one.
edit: I did make a mistake earlier though; this appears to only apply to the low priority queue for async compute on XBO.
@Metal_Spirit, as you can see from the documentation below, both MS and Sony have unequalled access to the hardware: they can modify the degree to which they control workload splits over compute units. I'm still searching through the Mantle and DX12 documentation for this type of control over the hardware, but I haven't come across it yet. My feeling is that AMD set the default value at 4. If you don't have access to the hardware like MS and Sony do, what would DX12 request in terms of ALU resources for async compute (for any configuration, Nvidia or AMD)? I speculate the defaults that AMD and nVidia set for async compute.

Even if the numbers are off in terms of default allocation, we at least have some insight here into how it's operating at a lower level. I think some set allocation must be made by AMD and nVidia for async compute resources, and it's unknown whether they allocate a relatively similar amount.

Documentation in question:
[Images: rZ5eqpz.png, 4gyzBG5.png, fRGKjkB.png]
 
The driver controls reservations, thus Microsoft and Sony can use whatever values they feel fit their hardware. They likely expose some reservation capability to programmers. This is not exposed on the PC and is hidden by the driver. You will most definitely get more than four CUs working on compute tasks on Fiji.

The documentation you posted describes something different from reservations: it covers how the wave launcher distributes work, but doesn't specify which CUs are available.

Metal_Spirit, the cache sharing between CUs was described during the Pitcairn launch. Up to 4 CUs can share but there can be fewer.
 
The driver controls reservations, thus Microsoft and Sony can use whatever values they feel fit their hardware. They likely expose some reservation capability to programmers. This is not exposed on the PC and is hidden by the driver. You will most definitely get more than four CUs working on compute tasks on Fiji.

The documentation you posted describes something different from reservations: it covers how the wave launcher distributes work, but doesn't specify which CUs are available.

Metal_Spirit, the cache sharing between CUs was described during the Pitcairn launch. Up to 4 CUs can share but there can be fewer.
But for async compute, isn't the idea to reserve less hardware to complete the jobs? I get that compute jobs, depending on size, will span as many CUs as needed, but if the idea (for async compute) is to insert compute jobs where there are stalls in the graphics work (sync points in the shader, waiting for resources to load, etc.), then ideally wouldn't you want to reserve less hardware?
 
In theory more CUs should give more performance/remove bottlenecks, but isn't there also a point beyond which nothing is gained due to other constraints? Theory specifies a certain maximum, but in reality that may never be reached/utilized.
 
In theory more CUs should give more performance/remove bottlenecks, but isn't there also a point beyond which nothing is gained due to other constraints? Theory specifies a certain maximum, but in reality that may never be reached/utilized.
Yeah, depending on the program, bottlenecks can shift between the different parts of the GPU. I think one of the Fable Legends benchmark articles went into that: Fiji seems to be bottlenecked similarly to Hawaii and its derivatives, much as tessellation will bottleneck Fiji earlier than Maxwell 2.
 
Another bench comparing async on and off. This benchmark version might not be the same as the one reviewers got.

http://forums.anandtech.com/showthread.php?p=37724231#post37724231

Async ON result @ 1080 is far above anything reviewers were able to get. Async OFF, on the other hand, is in the ballpark.

Is the user that posted that comparison a reliable source? It seems the mods aren't giving him the benefit of the doubt at the moment. It's a pretty big claim to make with nothing I can see to back it up thus far.
 
In theory more CUs should give more performance/remove bottlenecks, but isn't there also a point beyond which nothing is gained due to other constraints? Theory specifies a certain maximum, but in reality that may never be reached/utilized.
Having more CUs to complete a job would make it go faster; that part isn't really in debate. It's a question of how large that job is. The idea of async compute isn't to make compute go faster, but to increase the utilization of the compute units: wherever there are gaps, the idea is to fill them with smaller compute jobs that aren't latency sensitive. At least this is my take on it. I assume most async compute jobs are being issued for frame N while the GPU is working on N-1.
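Something like the following, as a minimal sketch (queues, fence, frame counter and command lists all assumed to exist):

Code:
#include <d3d12.h>

// Sketch of overlapping frames across queues. The compute queue starts
// frame N's latency-insensitive jobs while the graphics queue is still
// drawing frame N-1; the fence keeps frame N's graphics from consuming
// results before they are written. Wait() blocks the queue, not the CPU.
ID3D12CommandList* computeLists[]  = { frameNCompute };
ID3D12CommandList* prevLists[]     = { framePrevGraphics };
ID3D12CommandList* graphicsLists[] = { frameNGraphics };

computeQueue->ExecuteCommandLists(1, computeLists);
computeQueue->Signal(fence, frameN);       // mark frame N's compute done

graphicsQueue->ExecuteCommandLists(1, prevLists);  // still on frame N-1
graphicsQueue->Wait(fence, frameN);        // GPU-side wait for the compute
graphicsQueue->ExecuteCommandLists(1, graphicsLists);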
 
Having more CUs won't make the job go faster per se; it will increase the number of queues available, which allows a larger pool of compute instructions to push into the graphics queue.
 
This could even be the case, because if we look at this GCN whitepaper we see that every 4 CUs share some L1 caches. Here's a pic:

So, what happens with the 2 remaining CUs on the PS4 is a mystery (at least to me). Are they sharing L1 caches alone?
Metal_Spirit, the cache sharing between CUs was described during the Pitcairn launch. Up to 4 CUs can share but there can be fewer.
Aye. Pitcairn is by default a 2x4 + 4x3 configuration, and Cape Verde was 1x4 + 2x3. Depending on the specific GCN SKU and the number of disabled CUs for chip salvaging, you'll see 2-4 CUs per CU array.

http://images.anandtech.com/doci/5625/PitcairnArch.png

If I had to guess for PS4, and assuming the last 4 CUs are a single "special" array, you'd likely have 2x4 + 2x3 for the main pool of 14 CUs since it's the most balanced option.

Another bench comparing async on and off. This benchmark version might not be the same as the one reviewers got.

http://forums.anandtech.com/showthread.php?p=37724231#post37724231

Async ON result @ 1080 is far above anything reviewers were able to get. Async OFF, on the other hand, is in the ballpark.
For what it's worth, I'm not 100% sure how he's even doing that. Fable Legends is locked down: the INI files appear to be hardcoded in the EXE and regenerated at runtime, which locks out any obvious means of changing the settings beyond the 3 built-in modes.
 