DX12 Performance Discussion And Analysis Thread

Then, wouldn't it be better for them to just leave async shaders off and work as they used to with DX11?
It's not possible to "leave async shaders off". All D3D12 drivers must support this feature. The hardware may stumble when fed that kind of workload. If the developer discovers this, they can create a version of the graphics algorithms that does not use async shaders.
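For what it's worth, the choice looks something like this on the application side. This is a minimal, hypothetical D3D12 sketch (the function and variable names are mine, not from any real engine or from MDolenc's tester): "using async shaders" means submitting compute work on a compute-type queue alongside the direct queue, while the fallback records the same dispatches on the direct queue, where they serialize behind the graphics work.

Code:
#include <d3d12.h>

// Hypothetical sketch: the "async shaders" decision is which queue the
// compute work is submitted to. Every D3D12 driver must accept a
// compute-type queue; whether work on it overlaps the graphics queue
// is up to the hardware and driver.
void SubmitCompute(bool useAsyncShaders,
                   ID3D12CommandQueue* directQueue,       // graphics queue
                   ID3D12CommandQueue* computeQueue,      // D3D12_COMMAND_LIST_TYPE_COMPUTE
                   ID3D12CommandList*  asyncComputeList,  // dispatches on a COMPUTE-type list
                   ID3D12CommandList*  directComputeList) // same dispatches on a DIRECT-type list
{
    if (useAsyncShaders) {
        // May run concurrently with graphics already queued on directQueue.
        computeQueue->ExecuteCommandLists(1, &asyncComputeList);
    } else {
        // Fallback: identical dispatches, serialized behind graphics.
        directQueue->ExecuteCommandLists(1, &directComputeList);
    }
}

Either way the driver has to accept the compute queue itself; the open question on any given GPU is whether the work on it actually overlaps with the graphics queue.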
 
Request for a favour from the AMD testers: I would like to see what happens when the code runs faster with an unrolled loop. Does it still behave properly if the kernel runs much more quickly?

There are two files in MDolenc's tester. I'd like you to edit shader.hlsl. First take a backup copy of the file (select it, copy it, and paste it into the same folder). Then edit shader.hlsl in Notepad so that it looks like this:

Code:
RWStructuredBuffer<float> oUAV : register(u0);

float4 vsMain(float4 pos : POSITION) : SV_Position {
    return pos;
}

float4 psMain() : SV_Target {
    return float4(1.0f, 1.0f, 1.0f, 1.0f);
}

[numthreads(1, 1, 1)]
void csMain(uint3 id : SV_DispatchThreadID) {
    float a1 = id.x + 1.0f;
    float a2 = id.y + 1.0f;
    float a3;

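    // 1024*4 outer x 128 inner = 524,288 steps, each dependent on the
    // previous result, so a single thread stays busy for a measurable time.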
    for (int i = 0; i < 1024 * 4; ++i) {
        for (int j = 0; j < 128; ++j) {
            a3 = a1 + a2;
            a3 = a3 * 0.500001f;

            a1 = a2;
            a2 = a3;
        }
    }
    oUAV[0] = a3;
}

This version should be faster. It should also work on NVidia.
 
All of this does raise the question of what was going on with Star Swarm, in terms of both Oxide's Nitrous engine implementation and NVIDIA's beta driver implementation back then.
Was this a Mantle/DX12 Nitrous engine demo that actually had no async compute focus but still showed strong benefits on both platforms, or is it similar to the current Nitrous engine and Ashes?
If it is the former, that's rather ironic, considering that when the engine/Star Swarm was released both AMD and Oxide were heavily promoting the benefits of async design for Mantle gaming.
Also, if it is the former, it raises a question about performance comparisons between Star Swarm and Ashes with async disabled for NVIDIA: is Star Swarm's Extreme setting comparable to Ashes' low or high settings, or does it even matter when looking at the gains for one and the loss for the other (I assume it might)?
I'm being vague with the term "async" because Oxide heavily mention async compute, while the discussion here talks about async shaders.
Cheers
 
I would like to see what happens when the code runs faster with an unrolled loop.
Unrolled loop version from GPGPU expert (Vlev):
Code:
RWStructuredBuffer<float> oUAV : register(u0);

float4 vsMain(float4 pos : POSITION) : SV_Position {
    return pos;
}

float4 psMain() : SV_Target {
    return float4(1.0f, 1.0f, 1.0f, 1.0f);
}

[numthreads(1, 1, 1)]
void csMain(uint3 id : SV_DispatchThreadID) {
    float a1 = id.x + 1.0f;
    float a2 = id.y + 1.0f;
    float a3;

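    // 32 steps of the recurrence unrolled per iteration to cut loop overhead;
    // note: as posted, 1024*32 iterations doubles the original workload.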
    for (int i = 0; i < 1024 * 32; ++i) {
        float b3 = a1 + a2; a3 = b3 * 0.50001f;
        float b4 = a2 + b3 * 0.50001f; float a4 = b4 * 0.50001f;
        float b5 = a3 + b4 * 0.50001f; float a5 = b5 * 0.50001f;
        float b6 = a4 + b5 * 0.50001f; float a6 = b6 * 0.50001f;
        float b7 = a5 + b6 * 0.50001f; float a7 = b7 * 0.50001f;
        float b8 = a6 + b7 * 0.50001f; float a8 = b8 * 0.50001f;
        float b9 = a7 + b8 * 0.50001f; float a9 = b9 * 0.50001f;
        float b10 = a8 + b9 * 0.50001f; float a10 = b10 * 0.50001f;
        float b11 = a9 + b10 * 0.50001f; float a11 = b11 * 0.50001f;
        float b12 = a10 + b11 * 0.50001f; float a12 = b12 * 0.50001f;
        float b13 = a11 + b12 * 0.50001f; float a13 = b13 * 0.50001f;
        float b14 = a12 + b13 * 0.50001f; float a14 = b14 * 0.50001f;
        float b15 = a13 + b14 * 0.50001f; float a15 = b15 * 0.50001f;
        float b16 = a14 + b15 * 0.50001f; float a16 = b16 * 0.50001f;
        float b17 = a15 + b16 * 0.50001f; float a17 = b17 * 0.50001f;
        float b18 = a16 + b17 * 0.50001f; float a18 = b18 * 0.50001f;
        float b19 = a17 + b18 * 0.50001f; float a19 = b19 * 0.50001f;
        float b20 = a18 + b19 * 0.50001f; float a20 = b20 * 0.50001f;
        float b21 = a19 + b20 * 0.50001f; float a21 = b21 * 0.50001f;
        float b22 = a20 + b21 * 0.50001f; float a22 = b22 * 0.50001f;
        float b23 = a21 + b22 * 0.50001f; float a23 = b23 * 0.50001f;
        float b24 = a22 + b23 * 0.50001f; float a24 = b24 * 0.50001f;
        float b25 = a23 + b24 * 0.50001f; float a25 = b25 * 0.50001f;
        float b26 = a24 + b25 * 0.50001f; float a26 = b26 * 0.50001f;
        float b27 = a25 + b26 * 0.50001f; float a27 = b27 * 0.50001f;
        float b28 = a26 + b27 * 0.50001f; float a28 = b28 * 0.50001f;
        float b29 = a27 + b28 * 0.50001f; float a29 = b29 * 0.50001f;
        float b30 = a28 + b29 * 0.50001f; float a30 = b30 * 0.50001f;
        float b31 = a29 + b30 * 0.50001f; float a31 = b31 * 0.50001f;
        float b32 = a30 + b31 * 0.50001f; float a32 = b32 * 0.50001f;
        float b33 = a31 + b32 * 0.50001f; float a33 = b33 * 0.50001f;
        float b34 = a32 + b33 * 0.50001f; float a34 = b34 * 0.50001f;
        a1 = a33;
        a2 = a34;
    }
    oUAV[0] = a3;
}
Result (Radeon HD 7850): see the attached graph (post.png).
 
Unrolled loop version from GPGPU expert (Vlev):
Hi Benny-ua, thanks for your results. It looks as if performance becomes less reliable the harder the workload is in the compute + graphics async case. I also notice that all of the steps in time come in increments of approximately 64, which is unlike most AMD cards running MDolenc's original code. But I can't tell whether this is because of the unrolled code from Vlev or because of the card and/or driver.

Some other comments:

1. The loop by Vlev iterates too many times in comparison with the most recent code from MDolenc; it should be 1024 * 16 (see the arithmetic after this list).
2. The raw numbers would be useful, too, if you can provide them.
3. I think you are the first person to post that GPU, so I'm not sure how your results compare with the results from the standard version of the shader file.
4. It is not important, but Vlev re-wrote the shader with a different constant: he used 0.50001f, but the original uses 0.500001f.
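To spell out the arithmetic in point 1: the original loop executes 1024 × 4 × 128 = 524,288 steps of the recurrence, and the unrolled body performs 32 steps per iteration, so 1024 × 16 iterations (1024 × 16 × 32 = 524,288) matches the original workload, while the posted 1024 × 32 runs the chain for twice as long.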
 
3. I think you are the first person to post that GPU, so I'm not sure how your results compare with the results from the standard version of the shader file.
Original shader, early version of this test - http://pastebin.com/XSfm4baS

Regarding 1,2,4 - see attached files (number of iterations decreased to 1024 * 16, constant value restored to 0.500001f)
Attached graph: 7850.png
 

Attachments

  • perf.zip
    39.4 KB
Request for a favour from the AMD testers: I would like to see what happens when the code runs faster with an unrolled loop.

Here's mine on an OC'd 7950 at 1025/1575 (it wasn't OC'd when I did the 2nd test from MDolenc)

Compute only: 1. 4.41ms 512. 39.17ms
Graphics only: 51.75ms (32.42G pixels/s)
Graphics + compute: 1. 52.47ms (31.97G pixels/s) 512. 72.62ms (23.10G pixels/s)
Graphics, compute single commandlist: 1. 56.06ms (29.93G pixels/s) 512. 87.25ms (19.23G pixels/s)
 

Attachments

  • HD 7950 OC (15.8 Beta).zip
    52.4 KB
Are you sure? I mean, register files being too small for actual parallel computation is a well-known issue, so that is a likely problem as parallelism rises.

But the caches shouldn't limit the size of each queue. If you insist on pushing longer programs into each queue, the hardware should cope with that quite well. More cache misses, yes. Possibly even running into the memory bandwidth limit. But I don't see how this would possibly affect the refill of the queues. Currently, it only looks as if the queues are simply underrunning far too often, due to a lack of used queue depth.

I see what you are saying; that makes sense.
 
Original shader, early version of this test - http://pastebin.com/XSfm4baS

Regarding 1,2,4 - see attached files (number of iterations decreased to 1024 * 16, constant value restored to 0.500001f)
Ah, sorry, I meant the version of MDolenc's test without the unroll (which also uses up to 512 kernels). I was curious to see whether faster execution affects the width of the steps or causes more erratic performance.

This result seems less erratic than the one in your first post; after seeing that, I was expecting the opposite.
 
Here's mine on an OC'd 7950 at 1025/1575 (it wasn't OC'd when I did the 2nd test from MDolenc)

Compute only: 1. 4.41ms 512. 39.17ms
Graphics only: 51.75ms (32.42G pixels/s)
Graphics + compute: 1. 52.47ms (31.97G pixels/s) 512. 72.62ms (23.10G pixels/s)
Graphics, compute single commandlist: 1. 56.06ms (29.93G pixels/s) 512. 87.25ms (19.23G pixels/s)
These new results are very erratic. I wonder if overclocking is causing some throttling or some related effect. Or maybe the erratic results are simply because of the short run time of each kernel. Though I still can't rationalise the mechanism for this, other than just saying "chaos".

But with the same count of kernels running for about the same amount of time as on NVidia, I think we're starting to get something that's more comparable.

Something else that's strange here is that in the graphics + compute test, the first step is at around 180 kernels, not 64 or 128. After that there's so much variation that I can't really see where the other steps are. I'm tempted to say the steps come at intervals of ~90, since the erratic behaviour starts at about 90.
 
Here is Fury X GPU usage from the whole run. It seems GPU usage is way higher on 'Graphics, compute single commandlist' compared to 'Graphics + compute'. My Paint skills ain't that great, so I zoomed it a bit to leave the white borders out :D
 

Attachments

  • furyx-gpu usage.png
    11.5 KB
Without OC:

Compute only: 1. 4.99ms 512. 41.28ms
Graphics only: 58.91ms (28.48G pixels/s)
Graphics + compute: 1. 58.90ms (28.49G pixels/s) 512. 112.92ms
Graphics, compute single commandlist: 1. 68.75ms (24.40G pixels/s) 512. 100.04ms (16.77G pixels/s)
 

Attachments

  • HD 7950 Stock (15.8 Beta).zip
    55 KB
For Jawed

Fury-x @1100/550
I dare say ka_rf's graph indicates that Fury X async shaders are worse than the 7950's.

Looking at the data for graphics + compute (async) at 100 kernel launches, we can see that about 30 kernels run for 4ms, one runs until 11ms, and then all of the rest finish at either 33, 37 or 41ms, with about 30, 30 and 10 kernel launches each.

So it seems as if there's a problem here, where no kernels finish at about 16, 20, 24 or 28ms. It's as if there's congestion in launching new kernels and the GPU just stops doing so for around 20ms.

It makes me wonder what would happen if many more queues were used to share the kernel launches around. This might require multiple threads in the application? I don't know how this stuff works...
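For what it's worth, spreading the launches over more queues shouldn't require more application threads: ExecuteCommandLists just enqueues work, so a single thread can round-robin command lists over several compute queues. A rough sketch under those assumptions (names are illustrative, this is not the tester's actual code):

Code:
#include <d3d12.h>
#include <vector>

// Hypothetical sketch: create N compute queues and feed them from one
// thread. How they map onto the ACEs is up to the driver/hardware.
std::vector<ID3D12CommandQueue*> CreateComputeQueues(ID3D12Device* device, int count)
{
    std::vector<ID3D12CommandQueue*> queues(count, nullptr);
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    for (int i = 0; i < count; ++i)
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queues[i]));
    return queues;
}

// Distribute pre-recorded compute lists across the queues, round-robin.
void SubmitRoundRobin(const std::vector<ID3D12CommandQueue*>& queues,
                      const std::vector<ID3D12CommandList*>& lists)
{
    for (size_t i = 0; i < lists.size(); ++i)
        queues[i % queues.size()]->ExecuteCommandLists(1, &lists[i]);
}

Whether those queues then get serviced by separate ACEs, or just collapse into the same hardware path, is exactly the kind of thing this test might expose.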
 
Maybe AMD just needs time to fix their drivers to get everything out of their ACEs. Or in Fury's case, ACEs + HWSs. Tahiti seems to be working better with the simplest form of 2 ACEs alone. What do you think?
 
Without OC:

Compute only: 1. 4.99ms 512. 41.28ms
Graphics only: 58.91ms (28.48G pixels/s)
Graphics + compute: 1. 58.90ms (28.49G pixels/s) 512. 112.92ms
Graphics, compute single commandlist: 1. 68.75ms (24.40G pixels/s) 512. 100.04ms (16.77G pixels/s)
These look about the same as your overclocked results I reckon, so it seems that overclocking is probably not relevant.

Here is Fury X GPU usage from the whole run. It seems GPU usage is way higher on 'Graphics, compute single commandlist' compared to 'Graphics + compute'.
That's to be expected, since it is the best-performing scenario.

Maybe AMD just needs time to fix their drivers to get everything out of their ACEs. Or in Fury's case, ACEs + HWSs. Tahiti seems to be working better with the simplest form of 2 ACEs alone. What do you think?
I think you're right. I wonder what Hawaii does on this test with the unroll, since the behaviour reported earlier in the week seemed to be the closest to what is desirable.

I expect NVidia will see very little difference running the unrolled version of the kernel. It might be slightly faster because the unroll is longer, unless the unroll limit was already reached in the results shown earlier.
 
Here are my results for the unrolled version on a 290X (1000/1250 MHz) with the 15.8 beta driver.
I noticed that the GPU clock rarely rose above 900 MHz during the run, and utilization bounced between 30-100%.
 

Attachments

  • 290x_15.8beta_unrolled.zip
    235.9 KB
I forgot to include the summary for my previous run (290x unrolled):

Compute only: 1. 4.76ms 512. 71.31ms
Graphics only: 27.29ms (61.47G pixels/s)
Graphics + compute: 1. 26.97ms (62.22G pixels/s) 512. 78.01ms (21.51G pixels/s)
Graphics, compute single commandlist: 1. 31.32ms (53.57G pixels/s) 512. 62.35ms (26.91G pixels/s)
 