The test case is whether graphics processing time can overlap with compute; is there any expectation that a 256-thread group would change the verdict for the GPUs running it?
Single-lane compute work requires the scheduler to spawn a single wave on all GPUs. On AMD the wave is 64 lanes wide, meaning the architecture is designed to run/manage fewer waves (as each does more work). If you spawn single-lane work, you are more likely to end up under-utilizing GCN compared to other GPUs. Work group sizes can also expose (academic) bottlenecks, since resource (GPR, LDS) acquisition/release is done at work-group granularity.
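To make the single-lane case concrete, here's a minimal D3D12-side sketch (function and parameter names are my own, not from the test in this thread):

```cpp
#include <d3d12.h>

// Sketch: dispatching "single lane" compute work in D3D12.
// Assumes 'commandList' already has a compute PSO bound whose HLSL entry
// point is declared [numthreads(1, 1, 1)].
void DispatchSingleLane(ID3D12GraphicsCommandList* commandList)
{
    // One thread group containing a single thread. On GCN this still
    // occupies a full 64-wide wave, so 63 of the 64 lanes sit idle --
    // the under-utilization described above. GPRs and LDS are also
    // acquired/released at this (tiny) work-group granularity.
    commandList->Dispatch(1, 1, 1);
}
```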
I am just saying that using a workload that is not realistic can cause various bottlenecks that will not matter in a realistic scenario. Some people seem to be drawing all kinds of conclusions based on the results of this thread, even though the results might not mean anything for most applications (especially games, the main purpose of the DX12 API).
There was puzzlement over why the latency was so disparate for GCN in the lowest cases; this is largely explained by the 4-cycle wavefront cadence having been left out of the analysis.
This is a downside of the GCN architecture, but it almost never matters in real software.
I am curious about the exact placement of the inflection points for the timings, since they don't necessarily line up with some of the most obvious resource limits.
PC DirectX 12 abstracts the barriers quite a bit, so we don't know whether we are timing end-of-pipe or something else. This matters, since GPUs have very long pipelines.
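For reference, here's roughly how GPU timestamps are recorded in D3D12 (a sketch with placeholder names). Note that the API never lets you pick the pipeline stage the timestamp is sampled at; that is driver/GPU defined, which is exactly the ambiguity described above:

```cpp
#include <d3d12.h>

// Sketch: GPU timestamps around a dispatch in D3D12 (placeholder names).
void TimeDispatch(ID3D12GraphicsCommandList* commandList,
                  ID3D12QueryHeap* timestampHeap,   // D3D12_QUERY_HEAP_TYPE_TIMESTAMP
                  ID3D12Resource* readbackBuffer,
                  UINT groupsX)
{
    commandList->EndQuery(timestampHeap, D3D12_QUERY_TYPE_TIMESTAMP, 0);
    commandList->Dispatch(groupsX, 1, 1);
    commandList->EndQuery(timestampHeap, D3D12_QUERY_TYPE_TIMESTAMP, 1);

    // Copy both timestamps into a readback buffer for the CPU to inspect.
    // Whether these samples represent top-of-pipe, end-of-pipe, or
    // something in between is not specified by the API.
    commandList->ResolveQueryData(timestampHeap, D3D12_QUERY_TYPE_TIMESTAMP,
                                  0, 2, readbackBuffer, 0);
}
```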
I believe SALU and VALU ops have to come from different hardware threads, so this specific kernel couldn't be sped up that way.
A single CU can run multiple kernels, and since the test is using async compute, we can assume that multiple queues produce work for the same CU. This means it can interleave SALU from one kernel and VALU from the other, letting both progress at twice the rate.
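A sketch of how you'd feed two kernels through independent compute queues (placeholder names; whether the GPU actually co-schedules the resulting waves on the same CU is entirely up to the hardware and driver):

```cpp
#include <windows.h>
#include <d3d12.h>

// Sketch: submitting two kernels via independent D3D12 compute queues.
// Error handling omitted for brevity.
void SubmitTwoComputeQueues(ID3D12Device* device,
                            ID3D12CommandList* listA,
                            ID3D12CommandList* listB)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;

    ID3D12CommandQueue* queueA = nullptr;
    ID3D12CommandQueue* queueB = nullptr;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queueA));
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queueB));

    // Each queue executes its own pre-recorded command list; the GPU is
    // free to run them concurrently, possibly interleaving SALU ops from
    // one kernel with VALU ops from the other on the same CU.
    queueA->ExecuteCommandLists(1, &listA);
    queueB->ExecuteCommandLists(1, &listB);
}
```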
I strongly disagree, as I have some code that runs fastest with 64, but I come from an OpenCL perspective (where, apart from anything else, you can't get more than 256 work items into a work-group)...
Of course there are kernels that are fastest with thread blocks of 64 or 1024 threads. But these are fast because of some other bottleneck: you are most likely trading some GPU utilization for other improvements (like reduced memory traffic). Also, if the GCN OpenCL compiler is very smart, it could compile thread groups of 64 threads differently (since a 64-thread group is exactly one wave, the scalar unit could be exploited more).
Fillrate tests aren't meaningful work either. This test does reveal serial versus async behaviours, so it's a success on those terms.
It measures many things, not just async compute. This makes the results hard to interpret, and people are drawing wrong conclusions.
It appears that even a single queue (the command list test) results in async compute on GCN.
All modern GPUs are capable of running multiple graphics tasks and multiple compute tasks in parallel, when these tasks originate from the same queue. This has been true for a long time already. However, the DirectX 11 API and DX11 drivers are quite defensive in their resource tracking, meaning that concurrent execution for compute doesn't happen often. Concurrent execution of multiple graphics draw calls, however, happens regularly (unless the graphics shaders use UAVs). How many graphics draws execute simultaneously depends on fixed-function resource limits (only a limited number of global state combinations can execute concurrently).
What forced async? There is no forced async; the single command list is not forced async.
Yes, a single command list by definition is not async. Shaders can still run concurrently even from a single command list (but not asynchronously). DirectX 12 exposes resource barriers to the programmer, giving the programmer more control over concurrent execution within a single command queue. Manual resource barriers allow greater and more controlled parallelism from a single queue, and this is also supported by more GPU vendors. If you don't strictly need async, this is a good way to maximize GPU utilization.
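For example, something like this (placeholder names; PSO and root binding changes omitted): two dispatches with no barrier between them are free to overlap, while a UAV barrier serializes the dependent one:

```cpp
#include <d3d12.h>

// Sketch: manual resource barriers on a single command list/queue.
// Dispatches A and B write to independent resources, so no barrier
// separates them and the GPU may execute them concurrently. Dispatch C
// reads B's output ('outputOfB', placeholder), so a UAV barrier is
// required first.
void RecordThreeDispatches(ID3D12GraphicsCommandList* commandList,
                           ID3D12Resource* outputOfB)
{
    commandList->Dispatch(64, 1, 1);  // A
    commandList->Dispatch(64, 1, 1);  // B (independent of A -> no barrier)

    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
    barrier.UAV.pResource = outputOfB;
    commandList->ResourceBarrier(1, &barrier);

    commandList->Dispatch(64, 1, 1);  // C (consumes B's output)
}
```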
This is latency, not end performance.
Exactly!
This is not a performance (maximum throughput) benchmark. However, it seems that less technically inclined people believe it is, because this thread is called "DX12 performance thread". This thread doesn't in any way imply that "asynchronous compute is broken in Maxwell 2", or that "Fiji (Fury X) is super slow compared to NVIDIA in DX12 compute". This benchmark is not directly relevant for DirectX 12 games. As some wise guy said at SIGGRAPH: graphics rendering is the killer app for compute shaders. DX12 async compute will mainly be used for graphics rendering, and for that use case the CPU->GPU->CPU latency has zero relevance. All that matters is the total throughput with realistic shaders. Like hyperthreading, async compute throughput gains are highly dependent on the shaders you use. Test shaders that are not ALU / TMU / BW bound are not a good way to measure performance (yes, I know, this is not even supposed to be a performance benchmark, but it seems that some people think it is).
This benchmark has relevance for mixed, tightly interleaved CPU<->GPU workloads. However, it is important to realize that the current benchmark does not just measure async compute; it measures the whole GPU pipeline latency. GPUs are good at hiding this latency internally, but are not designed to hide it from external observers (such as the CPU).
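The round trip being measured is essentially this fence-based pattern (a sketch with placeholder names, error handling omitted):

```cpp
#include <windows.h>
#include <d3d12.h>
#include <chrono>

// Sketch: measuring CPU->GPU->CPU round-trip latency with a fence.
// This times the *entire* pipeline latency -- submission, GPU front-end,
// execution, end-of-pipe signal, and OS thread wakeup -- not just the
// kernel itself.
double MeasureRoundTripMs(ID3D12CommandQueue* queue,
                          ID3D12CommandList* list,
                          ID3D12Fence* fence,
                          UINT64& fenceValue,
                          HANDLE event)
{
    auto start = std::chrono::high_resolution_clock::now();

    queue->ExecuteCommandLists(1, &list);            // CPU -> GPU
    queue->Signal(fence, ++fenceValue);              // GPU signals completion
    fence->SetEventOnCompletion(fenceValue, event);
    WaitForSingleObject(event, INFINITE);            // GPU -> CPU

    auto stop = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```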