DX12 Performance Discussion And Analysis Thread

Then, wouldn't it be better for them to just leave async shaders off and work as they used to with DX11?
It's not possible to "leave async shaders off". All D3D12 drivers must support this feature. The hardware may stumble when fed that kind of workload. If the developer discovers this, they can create a version of the graphics algorithms that does not use async shaders.
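For what it's worth, the choice looks something like this on the application side. This is a minimal, hypothetical D3D12 sketch (the function and variable names are mine, not from any real engine or from MDolenc's tester): "using async shaders" means submitting compute work on a compute-type queue alongside the direct queue, while the fallback records the same dispatches on the direct queue, where they serialize behind the graphics work.

Code:
#include <d3d12.h>

// Hypothetical sketch: the "async shaders" decision is which queue the
// compute work is submitted to. Every D3D12 driver must accept a
// compute-type queue; whether work on it overlaps the graphics queue
// is up to the hardware and driver.
void SubmitCompute(bool useAsyncShaders,
                   ID3D12CommandQueue* directQueue,       // graphics queue
                   ID3D12CommandQueue* computeQueue,      // D3D12_COMMAND_LIST_TYPE_COMPUTE
                   ID3D12CommandList*  asyncComputeList,  // dispatches on a COMPUTE-type list
                   ID3D12CommandList*  directComputeList) // same dispatches on a DIRECT-type list
{
    if (useAsyncShaders) {
        // May run concurrently with graphics already queued on directQueue.
        computeQueue->ExecuteCommandLists(1, &asyncComputeList);
    } else {
        // Fallback: identical dispatches, serialized behind graphics.
        directQueue->ExecuteCommandLists(1, &directComputeList);
    }
}

Either way the driver has to accept the compute queue itself; the open question on any given GPU is whether the work on it actually overlaps with the graphics queue.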
 
Request for a favour from the AMD testers: I would like to see what happens when the code runs faster with an unrolled loop. Does it still behave properly if the kernel runs much more quickly?

There are two files in MDolenc's tester. I'd like you to edit shader.hlsl. First take a backup copy of the file (select it, copy it, and paste it into the same folder). Then edit shader.hlsl in Notepad so that it looks like this:

Code:
RWStructuredBuffer<float> oUAV : register(u0);

float4 vsMain(float4 pos : POSITION) : SV_Position {
    return pos;
}

float4 psMain() : SV_Target {
    return float4(1.0f, 1.0f, 1.0f, 1.0f);
}

[numthreads(1, 1, 1)]
void csMain(uint3 id : SV_DispatchThreadID) {
    float a1 = id.x + 1.0f;
    float a2 = id.y + 1.0f;
    float a3;

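    // 1024*4 outer x 128 inner = 524,288 steps, each dependent on the
    // previous result, so a single thread stays busy for a measurable time.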
    for (int i = 0; i < 1024 * 4; ++i) {
        for (int j = 0; j < 128; ++j) {
            a3 = a1 + a2;
            a3 = a3 * 0.500001f;

            a1 = a2;
            a2 = a3;
        }
    }
    oUAV[0] = a3;
}

This version should be faster. It should also work on NVidia.
 
All of this does raise the question of what was going on with Star Swarm, in terms of both Oxide's Nitrous engine implementation and NVIDIA's beta driver implementation back then.
Was this a Mantle/DX12 Nitrous engine demo that actually had no async compute focus but still showed strong benefits on both platforms, or is it similar to the current Nitrous engine and Ashes?
If it is the former, that's rather ironic, considering that when the engine/Star Swarm was released both AMD and Oxide were heavily promoting the benefits of async design for Mantle gaming.
Also, if it is the former, it raises a question about performance comparisons between Star Swarm and Ashes with async disabled for NVIDIA: is Star Swarm's Extreme setting comparable to Ashes' low or high settings, or does it even matter when looking at the gains for one and the loss for the other (I assume it might)?
I'm being vague with the term "async" because Oxide heavily mention async compute, while the discussion here talks about async shaders.
Cheers
 
I would like to see what happens when the code runs faster with an unrolled loop.
Unrolled loop version from GPGPU expert (Vlev):
Code:
RWStructuredBuffer<float> oUAV : register(u0);

float4 vsMain(float4 pos : POSITION) : SV_Position {
    return pos;
}

float4 psMain() : SV_Target {
    return float4(1.0f, 1.0f, 1.0f, 1.0f);
}

[numthreads(1, 1, 1)]
void csMain(uint3 id : SV_DispatchThreadID) {
    float a1 = id.x + 1.0f;
    float a2 = id.y + 1.0f;
    float a3;

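    // 32 steps of the recurrence unrolled per iteration to cut loop overhead;
    // note: as posted, 1024*32 iterations doubles the original workload.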
    for (int i = 0; i < 1024 * 32; ++i) {
        float b3 = a1 + a2; a3 = b3 * 0.50001f;
        float b4 = a2 + b3 * 0.50001f; float a4 = b4 * 0.50001f;
        float b5 = a3 + b4 * 0.50001f; float a5 = b5 * 0.50001f;
        float b6 = a4 + b5 * 0.50001f; float a6 = b6 * 0.50001f;
        float b7 = a5 + b6 * 0.50001f; float a7 = b7 * 0.50001f;
        float b8 = a6 + b7 * 0.50001f; float a8 = b8 * 0.50001f;
        float b9 = a7 + b8 * 0.50001f; float a9 = b9 * 0.50001f;
        float b10 = a8 + b9 * 0.50001f; float a10 = b10 * 0.50001f;
        float b11 = a9 + b10 * 0.50001f; float a11 = b11 * 0.50001f;
        float b12 = a10 + b11 * 0.50001f; float a12 = b12 * 0.50001f;
        float b13 = a11 + b12 * 0.50001f; float a13 = b13 * 0.50001f;
        float b14 = a12 + b13 * 0.50001f; float a14 = b14 * 0.50001f;
        float b15 = a13 + b14 * 0.50001f; float a15 = b15 * 0.50001f;
        float b16 = a14 + b15 * 0.50001f; float a16 = b16 * 0.50001f;
        float b17 = a15 + b16 * 0.50001f; float a17 = b17 * 0.50001f;
        float b18 = a16 + b17 * 0.50001f; float a18 = b18 * 0.50001f;
        float b19 = a17 + b18 * 0.50001f; float a19 = b19 * 0.50001f;
        float b20 = a18 + b19 * 0.50001f; float a20 = b20 * 0.50001f;
        float b21 = a19 + b20 * 0.50001f; float a21 = b21 * 0.50001f;
        float b22 = a20 + b21 * 0.50001f; float a22 = b22 * 0.50001f;
        float b23 = a21 + b22 * 0.50001f; float a23 = b23 * 0.50001f;
        float b24 = a22 + b23 * 0.50001f; float a24 = b24 * 0.50001f;
        float b25 = a23 + b24 * 0.50001f; float a25 = b25 * 0.50001f;
        float b26 = a24 + b25 * 0.50001f; float a26 = b26 * 0.50001f;
        float b27 = a25 + b26 * 0.50001f; float a27 = b27 * 0.50001f;
        float b28 = a26 + b27 * 0.50001f; float a28 = b28 * 0.50001f;
        float b29 = a27 + b28 * 0.50001f; float a29 = b29 * 0.50001f;
        float b30 = a28 + b29 * 0.50001f; float a30 = b30 * 0.50001f;
        float b31 = a29 + b30 * 0.50001f; float a31 = b31 * 0.50001f;
        float b32 = a30 + b31 * 0.50001f; float a32 = b32 * 0.50001f;
        float b33 = a31 + b32 * 0.50001f; float a33 = b33 * 0.50001f;
        float b34 = a32 + b33 * 0.50001f; float a34 = b34 * 0.50001f;
        a1 = a33;
        a2 = a34;
    }
    oUAV[0] = a3;
}
Result (Radeon HD 7850): see the attached graph (post.png).
 
Unrolled loop version from GPGPU expert (Vlev):
Hi Benny-ua, thanks for your results. It looks as if performance becomes less reliable the harder the workload is in the compute + graphics async case. I also notice that all of the steps in time come in increments of approximately 64, which is unlike most AMD cards running MDolenc's original code. But I can't tell whether this is because of the unrolled code from Vlev or because of the card and/or driver.

Some other comments:

1. The loop by Vlev iterates too many times in comparison with the most recent code from MDolenc; it should be 1024 * 16 (see the arithmetic after this list).
2. The raw numbers would be useful, too, if you can provide them.
3. I think you are the first person to post that GPU, so I'm not sure how your results compare with the results from the standard version of the shader file.
4. It is not important, but Vlev re-wrote the shader with a different constant: he used 0.50001f, but the original uses 0.500001f.
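To spell out the arithmetic in point 1: the original loop executes 1024 × 4 × 128 = 524,288 steps of the recurrence, and the unrolled body performs 32 steps per iteration, so 1024 × 16 iterations (1024 × 16 × 32 = 524,288) matches the original workload, while the posted 1024 × 32 runs the chain for twice as long.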
 
3. I think you are the first person to post that GPU, so I'm not sure how your results compare with the results from the standard version of the shader file.
Original shader, early version of this test - http://pastebin.com/XSfm4baS

Regarding 1,2,4 - see attached files (number of iterations decreased to 1024 * 16, constant value restored to 0.500001f)
Attached graph: 7850.png
 

Attachments

  • perf.zip
    39.4 KB
Request for a favour from the AMD testers: I would like to see what happens when the code runs faster with an unrolled loop.

Here's mine on an OC'd 7950 at 1025/1575 (it wasn't OC'd when I did the 2nd test from MDolenc)

Compute only: 1. 4.41ms 512. 39.17ms
Graphics only: 51.75ms (32.42G pixels/s)
Graphics + compute: 1. 52.47ms (31.97G pixels/s) 512. 72.62ms (23.10G pixels/s)
Graphics, compute single commandlist: 1. 56.06ms (29.93G pixels/s) 512. 87.25ms (19.23G pixels/s)
 

Attachments

  • HD 7950 OC (15.8 Beta).zip
    52.4 KB
Are you sure? I mean, register files being too small for actual parallel computation is a well-known issue, so that is a likely problem as parallelism rises.

But the caches shouldn't limit the size of each queue. If you insist on pushing longer programs into each queue, the hardware should cope with that quite well. More cache misses, yes. Possibly even running into the memory bandwidth limit. But I don't see how this would possibly affect the refill of the queues. Currently, it only looks as if the queues are simply underrunning far too often, due to a lack of used queue depth.

I see what you are saying; that makes sense.
 
Original shader, early version of this test - http://pastebin.com/XSfm4baS

Regarding 1,2,4 - see attached files (number of iterations decreased to 1024 * 16, constant value restored to 0.500001f)
Ah, sorry, I meant the version of MDolenc's test without the unroll (which also uses up to 512 kernels). I was curious to see whether faster execution affects the width of the steps or causes more erratic performance.

This result seems less erratic than the one in your first post; after seeing that, I was expecting the opposite.
 
Here's mine on an OC'd 7950 at 1025/1575 (it wasn't OC'd when I did the 2nd test from MDolenc)

Compute only: 1. 4.41ms 512. 39.17ms
Graphics only: 51.75ms (32.42G pixels/s)
Graphics + compute: 1. 52.47ms (31.97G pixels/s) 512. 72.62ms (23.10G pixels/s)
Graphics, compute single commandlist: 1. 56.06ms (29.93G pixels/s) 512. 87.25ms (19.23G pixels/s)
These new results are very erratic. I wonder if overclocking is causing some throttling or some related effect. Or maybe the erratic results are simply because of the short run time of each kernel. Though I still can't rationalise the mechanism for this, other than just saying "chaos".

But with the same count of kernels running for about the same amount of time as on NVidia, I think we're starting to get something that's more comparable.

Something else that's strange here is that in the graphics + compute test, the first step is at around 180 kernels, not 64 or 128. After that there's so much variation that I can't really see where the other steps are. I'm tempted to say the steps come at intervals of ~90, since the erratic behaviour starts at about 90.
 
Here is Fury X GPU usage from the whole run. It seems GPU usage is way higher on 'Graphics, compute single commandlist' compared to 'Graphics + compute'. My Paint skills ain't that great, so I zoomed it a bit to leave the white borders out :D
 

Attachments

  • furyx-gpu usage.png
    11.5 KB
Without OC:

Compute only: 1. 4.99ms 512. 41.28ms
Graphics only: 58.91ms (28.48G pixels/s)
Graphics + compute: 1. 58.90ms (28.49G pixels/s) 512. 112.92ms
Graphics, compute single commandlist: 1. 68.75ms (24.40G pixels/s) 512. 100.04ms (16.77G pixels/s)
 

Attachments

  • HD 7950 Stock (15.8 Beta).zip
    55 KB
For Jawed

Fury-x @1100/550
I dare say ka_rf's graph indicates that Fury X async shaders are worse than the 7950's.

Looking at the data for graphics + compute (async) at 100 kernel launches, we can see that about 30 kernels run for 4ms, one runs until 11ms, and then all of the rest finish at either 33, 37 or 41ms, with about 30, 30 and 10 kernel launches each.

So it seems as if there's a problem here, where no kernels finish at about 16, 20, 24 or 28ms. It's as if there's congestion in launching new kernels and the GPU just stops doing so for around 20ms.

It makes me wonder what would happen if many more queues were used to share the kernel launches around. This might require multiple threads in the application? I don't know how this stuff works...
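For what it's worth, spreading the launches over more queues shouldn't require more application threads: ExecuteCommandLists just enqueues work, so a single thread can round-robin command lists over several compute queues. A rough sketch under those assumptions (names are illustrative, this is not the tester's actual code):

Code:
#include <d3d12.h>
#include <vector>

// Hypothetical sketch: create N compute queues and feed them from one
// thread. How they map onto the ACEs is up to the driver/hardware.
std::vector<ID3D12CommandQueue*> CreateComputeQueues(ID3D12Device* device, int count)
{
    std::vector<ID3D12CommandQueue*> queues(count, nullptr);
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    for (int i = 0; i < count; ++i)
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queues[i]));
    return queues;
}

// Distribute pre-recorded compute lists across the queues, round-robin.
void SubmitRoundRobin(const std::vector<ID3D12CommandQueue*>& queues,
                      const std::vector<ID3D12CommandList*>& lists)
{
    for (size_t i = 0; i < lists.size(); ++i)
        queues[i % queues.size()]->ExecuteCommandLists(1, &lists[i]);
}

Whether those queues then get serviced by separate ACEs, or just collapse into the same hardware path, is exactly the kind of thing this test might expose.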
 
Maybe AMD just needs time to fix their drivers to get everything out of their ACEs. Or in Fury's case, ACEs + HWSs. Tahiti seems to be working better with the simplest form of 2 ACEs alone. What do you think?
 
Without OC:

Compute only: 1. 4.99ms 512. 41.28ms
Graphics only: 58.91ms (28.48G pixels/s)
Graphics + compute: 1. 58.90ms (28.49G pixels/s) 512. 112.92ms
Graphics, compute single commandlist: 1. 68.75ms (24.40G pixels/s) 512. 100.04ms (16.77G pixels/s)
These look about the same as your overclocked results I reckon, so it seems that overclocking is probably not relevant.

Here is Fury X GPU usage from the whole run. It seems GPU usage is way higher on 'Graphics, compute single commandlist' compared to 'Graphics + compute'.
That's to be expected, since it is the best-performing scenario.

Maybe AMD just needs time to fix their drivers to get everything out of their ACEs. Or in Fury's case, ACEs + HWSs. Tahiti seems to be working better with the simplest form of 2 ACEs alone. What do you think?
I think you're right. I wonder what Hawaii does on this test with the unroll, since the behaviour reported earlier in the week seemed to be the closest to what is desirable.

I expect NVidia will see very little difference running the unrolled version of the kernel. It might be slightly faster because the unroll is longer, unless the unroll limit was already reached in the results shown earlier.
 
Here are my results for the unrolled version on a 290X (1000/1250 MHz) with the 15.8 beta driver.
I noticed that the GPU clock rarely rose above 900 MHz during the run, and utilization bounced between 30-100%.
 

Attachments

  • 290x_15.8beta_unrolled.zip
    235.9 KB
I forgot to include the summary for my previous run (290x unrolled):

Compute only: 1. 4.76ms 512. 71.31ms
Graphics only: 27.29ms (61.47G pixels/s)
Graphics + compute: 1. 26.97ms (62.22G pixels/s) 512. 78.01ms (21.51G pixels/s)
Graphics, compute single commandlist: 1. 31.32ms (53.57G pixels/s) 512. 62.35ms (26.91G pixels/s)
 