DX12 Performance Discussion And Analysis Thread

Discussion in 'Rendering Technology and APIs' started by A1xLLcqAgt0qc2RyMz0y, Jul 29, 2015.

  1. ka_rf

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    12
    Likes Received:
    19
Forceman, both of your results conform to the rest of the Maxwell results we've seen. Can anyone with a Kepler card run the new test?

[attached benchmark graphs]
     
  2. Devnant

    Newcomer

    Joined:
    Sep 3, 2015
    Messages:
    10
    Likes Received:
    7
I've posted my results. I'm also using a 980 Ti, and even with TDR on, GPU usage was oscillating between 0% and 100%. The only difference with TDR off is that the benchmark didn't crash.

CPU usage was low throughout the test, with TDR both on and off.

I also disabled TDR by setting the TdrLevel flag to 0 in the registry under HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers.

Using the same drivers as you, but my CPU is an i7 5960X OC @ 4.7 GHz.
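For reference, the same change can be made from an elevated command prompt; this is a sketch of the standard `reg add` invocation (a reboot is still needed before it takes effect):

```shell
:: Disable TDR entirely (TdrLevel = 0). Use with care: with the watchdog off,
:: a hung GPU submission freezes the display instead of being reset.
reg add "HKLM\System\CurrentControlSet\Control\GraphicsDrivers" ^
    /v TdrLevel /t REG_DWORD /d 0 /f
```

Deleting the value afterwards restores the default recovery behaviour.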
     
  3. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,894
    Likes Received:
    4,549
You can after posting a certain number of messages, so you are well on your way (not sure of the exact number).
     
  4. PadyEos

    Joined:
    Sep 1, 2015
    Messages:
    15
    Likes Received:
    6
So did you set the TdrLevel flag to 0, or just increase the delay? Just increasing the delay would still cause the preemption time-out to trigger, and you would never reach the 10 seconds required for preemption to take place.
That would explain why you still had the 0-100-0 spikes yet managed to actually finish the test.
     
  5. ka_rf

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    12
    Likes Received:
    19
PadyEos, here is your result with TDR on; it is in line with all the other results (these results shouldn't be affected by TDR, as they happen before the issue occurs in the benchmark). I also included a result from your TDR-off log. For some reason, turning TDR off seems to cause weird behavior on your machine.

[attached benchmark graphs]
     
  6. Devnant

    Newcomer

    Joined:
    Sep 3, 2015
    Messages:
    10
    Likes Received:
    7
    I did both. There was no difference.
     
  7. PadyEos

    Joined:
    Sep 1, 2015
    Messages:
    15
    Likes Received:
    6
Did you also restart the computer after setting the TdrLevel flag to 0?
     
  8. PadyEos

    Joined:
    Sep 1, 2015
    Messages:
    15
    Likes Received:
    6
Well, now there are two of us (980 Ti and 960) with the same behavior on 355.82 when setting the TdrLevel flag to 0, restarting the computer, and then running the test.
     
  9. PadyEos

    Joined:
    Sep 1, 2015
    Messages:
    15
    Likes Received:
    6
This graph actually uses the data from TDR on, not TDR off. With TDR off it is a steadily increasing value, with no drops.
     
  10. Devnant

    Newcomer

    Joined:
    Sep 3, 2015
    Messages:
    10
    Likes Received:
    7
    Yes.
     
  11. ka_rf

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    12
    Likes Received:
    19
    You're right, my mistake, grabbed it from the wrong file.

The difference between Forceman's results and yours, though, is that with TDR either on or off, his show async results consistent with the others' (at least in the two logs he posted), whereas yours show weird behavior halfway through. I'm not saying your results aren't valid based on the test, just that something specific to your system may be causing different results.
     
  12. Toysrme

    Joined:
    Sep 1, 2015
    Messages:
    1
    Likes Received:
    0
Comparing overclocked vs. stock: mine scales nicely.
Heavily overclocked (1654 MHz GPU / 4150 MHz RAM) vs. stock MSI (1351 MHz / 3505 MHz RAM): going from stock to the overclock, completion times drop by ~18%, matching the ~18% clock difference linearly.
Compute only @ 512 finishes around 66 ms vs. 81 ms
Graphics only: 19.70 ms (85.15 G pixels/s) vs. 23.45 ms (71.55 G pixels/s)
Graphics + compute: 85.05 ms (19.73 G pixels/s) {89.47 G pixels/s} vs. 104.33 ms (16.08 G pixels/s) {74.11 G pixels/s}
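The linear-scaling claim is easy to sanity-check against the numbers above; a quick back-of-the-envelope (assuming completion time scales inversely with core clock):

```python
# Clock ratio between the stock and overclocked core (from the post above)
stock_clk, oc_clk = 1351.0, 1654.0
clock_ratio = stock_clk / oc_clk     # ~0.82, i.e. stock runs ~18% lower clocks

# If time scales inversely with clock, OC times should be ~0.82x stock times
compute_ratio = 66.0 / 81.0          # compute only, OC time / stock time
graphics_ratio = 19.70 / 23.45       # graphics only, OC time / stock time

print(f"clock {clock_ratio:.3f}  compute {compute_ratio:.3f}  graphics {graphics_ratio:.3f}")
```

Both time ratios track the clock ratio to within a couple of percent, which is the point being made.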

Graphics, compute single commandlist:
Overclocked, it finishes all 512 every time: 2759.19 ms (0.61 G pixels/s).
Stock refuses to finish; at iteration 487 it reports 0.24 ms (7076.43 G pixels/s) vs. overclocked at iteration 487: 2667.79 ms (0.63 G pixels/s).

Can someone explain that to me (or did I miss it in the thread)?

Repeated it several times. Watching Afterburner, CPU load idles throughout the task. The main GTX 980 clocks up; the second GTX 980 remains at its reference clock bin (1190 MHz) with the "normal" trivial amount of GPU load (sometimes it has 5-15% load, which is common for the second card in SLI).


    i7-5930K @ 4.5ghz (27x166.67)
    32gb G.Skill DDR4-3001 T1, 13-14-14-34-278
    2x MSI GTX-980 4G Frozr
    Samsung XP941 256gb M.2 (OS/App drive, 800mb/s read 600mb/s write)
    2x RAID-0 Mushkin Skorpeon Delux 512gb (4.26gb/s read 3.86gb/s write)
     

    Attached Files:

  13. PadyEos

    Joined:
    Sep 1, 2015
    Messages:
    15
    Likes Received:
    6
Pfff... running out of ideas now... The last one I have: did you create a DWORD or a QWORD value? I used DWORD. Other than that, I see nothing else that could be different.


I have no explanation for that. My numbers don't seem that far off: the sum of both series results is almost always equal to the async time. Similar story with Forceman's 960. Maybe knowing the formulas behind the graph would help.
     
  14. ka_rf

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    12
    Likes Received:
    19
Toysrme, the single commandlist causes a TDR (timeout) on Maxwell because of how long each iteration starts to take. It seems that, heavily overclocked, it gets done just fast enough to avoid the timeout, I guess.
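To make the mechanism concrete, here is a toy model of the watchdog (assuming the Windows defaults of TdrLevel = 3 and a 2-second TdrDelay; the growth rate is made up for illustration):

```python
# Windows resets the GPU if a single packet runs longer than TdrDelay (~2 s default)
TDR_DELAY_MS = 2000.0

def survives_tdr(packet_times_ms):
    """True if every GPU packet completes before the watchdog fires."""
    return all(t < TDR_DELAY_MS for t in packet_times_ms)

# A workload whose per-iteration time keeps growing eventually trips the
# watchdog; clocking the GPU higher shrinks every time and postpones the trip.
growing = [100.0 * 1.1 ** i for i in range(40)]
print(survives_tdr(growing[:30]), survives_tdr(growing))  # True False
```

With TdrLevel = 0 the check is simply skipped, which is why the benchmark runs to completion instead of crashing.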
     
  15. ka_rf

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    12
    Likes Received:
    19
Your results with TDR on are fine; it's with TDR off that they get a little screwy (compared to other Maxwell cards). These are your results with TDR off from earlier in the thread; I didn't plot your new results, but they look similar, unless I picked the wrong folder again. Series execution is graphics + compute added together. Asynchronous is the result of the asynchronous test. These values are normalized to whichever of graphics or compute is slower, so that if the orange line is at 1, it is running the two loads (graphics and compute) perfectly asynchronously and in parallel. Dips below 1 shouldn't happen and are probably just an artifact of the timestamps or something.
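In other words, my reading of the normalization (a sketch, not the benchmark's actual code):

```python
def normalize(graphics_ms, compute_ms, async_ms):
    """Normalize series and async times by the slower standalone workload.

    A perfectly parallel async run finishes in max(graphics, compute), so its
    normalized value is 1.0; a fully serialized run lands at
    (graphics + compute) / max(graphics, compute), which is above 1.
    """
    slower = max(graphics_ms, compute_ms)
    series = (graphics_ms + compute_ms) / slower
    async_norm = async_ms / slower
    return series, async_norm

# Hypothetical timings: 20 ms graphics, 15 ms compute
print(normalize(20.0, 15.0, 20.0))  # perfect overlap: (1.75, 1.0)
print(normalize(20.0, 15.0, 35.0))  # no overlap:      (1.75, 1.75)
```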

[attached benchmark graph]
     
  16. PadyEos

    Joined:
    Sep 1, 2015
    Messages:
    15
    Likes Received:
    6
So with TDR on, my graphs are consistent with the others with TDR on.

But with TDR off, they aren't. May I ask everyone who contributed TDR-off graphs: did you still get the 0-100-0-100 GPU load switching with it off? If yes, I suspect your TDR wasn't actually off and you were still hitting the 2-second TDR time-out, even if the test finished successfully without the driver crashing.

Can I get specific confirmation that Forceman's TDR-off graph (he was the only one able to reproduce TDR off without the 0-100-0 GPU load switching) is also not matching mine?
     
  17. June31

    Joined:
    Sep 3, 2015
    Messages:
    1
    Likes Received:
    0
AMD Fury async results look disappointing compared to other GCN cards. Could it be related to Fiji's power limit? Async compute -> more work done in parallel -> more watts -> Fiji reducing shader frequencies, or stalling some shaders, or whatever. A good way to find out would be to run the tests again with an increased power limit.
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I thought you might have seen how it compiled.

    I now see that you have linked: http://forum.ixbt.com/topic.cgi?id=10:61506-91#3150

    Code:
    BB0_1:
    add.f32  %f9, %f57, %f56;
    mul.f32  %f10, %f9, 0f3F000011;
    add.f32  %f11, %f10, %f57;
    fma.rn.f32  %f12, %f11, 0f3F000011, %f10;
    mul.f32  %f13, %f12, 0f3F000011;
    fma.rn.f32  %f14, %f11, 0f3F000011, %f13;
    fma.rn.f32  %f15, %f14, 0f3F000011, %f13;
    mul.f32  %f16, %f15, 0f3F000011;
    fma.rn.f32  %f17, %f14, 0f3F000011, %f16;
    fma.rn.f32  %f18, %f17, 0f3F000011, %f16;
    mul.f32  %f19, %f18, 0f3F000011;
    fma.rn.f32  %f20, %f17, 0f3F000011, %f19;
    fma.rn.f32  %f21, %f20, 0f3F000011, %f19;
    mul.f32  %f22, %f21, 0f3F000011;
    fma.rn.f32  %f23, %f20, 0f3F000011, %f22;
    fma.rn.f32  %f24, %f23, 0f3F000011, %f22;
    mul.f32  %f25, %f24, 0f3F000011;
    fma.rn.f32  %f26, %f23, 0f3F000011, %f25;
    fma.rn.f32  %f27, %f26, 0f3F000011, %f25;
    mul.f32  %f28, %f27, 0f3F000011;
    fma.rn.f32  %f29, %f26, 0f3F000011, %f28;
    fma.rn.f32  %f30, %f29, 0f3F000011, %f28;
    mul.f32  %f31, %f30, 0f3F000011;
    fma.rn.f32  %f32, %f29, 0f3F000011, %f31;
    fma.rn.f32  %f33, %f32, 0f3F000011, %f31;
    mul.f32  %f34, %f33, 0f3F000011;
    fma.rn.f32  %f35, %f32, 0f3F000011, %f34;
    fma.rn.f32  %f36, %f35, 0f3F000011, %f34;
    mul.f32  %f37, %f36, 0f3F000011;
    fma.rn.f32  %f38, %f35, 0f3F000011, %f37;
    fma.rn.f32  %f39, %f38, 0f3F000011, %f37;
    mul.f32  %f40, %f39, 0f3F000011;
    fma.rn.f32  %f41, %f38, 0f3F000011, %f40;
    fma.rn.f32  %f42, %f41, 0f3F000011, %f40;
    mul.f32  %f43, %f42, 0f3F000011;
    fma.rn.f32  %f44, %f41, 0f3F000011, %f43;
    fma.rn.f32  %f45, %f44, 0f3F000011, %f43;
    mul.f32  %f46, %f45, 0f3F000011;
    fma.rn.f32  %f47, %f44, 0f3F000011, %f46;
    fma.rn.f32  %f48, %f47, 0f3F000011, %f46;
    mul.f32  %f49, %f48, 0f3F000011;
    fma.rn.f32  %f50, %f47, 0f3F000011, %f49;
    fma.rn.f32  %f51, %f50, 0f3F000011, %f49;
    mul.f32  %f52, %f51, 0f3F000011;
    fma.rn.f32  %f53, %f50, 0f3F000011, %f52;
    fma.rn.f32  %f54, %f53, 0f3F000011, %f52;
    mul.f32  %f56, %f54, 0f3F000011;
    fma.rn.f32  %f55, %f53, 0f3F000011, %f56;
    mul.f32  %f57, %f55, 0f3F000011;
    add.s32  %r6, %r6, 32;
    setp.ne.s32  %p1, %r6, 0;
    @%p1 bra  BB0_1;
This is a partially unrolled loop (not fully unrolled), which will be vastly faster than AMD's naive compilation. It's not as advanced as what I wrote about back in post 167, but that would need the compiler to spot the opportunity.
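For readers unfamiliar with the term: partial unrolling repeats the loop body several times per trip, so the compare/branch overhead is paid once per batch instead of once per iteration. A generic sketch of the idea (not the benchmark's actual kernel; the constant is a stand-in for the PTX's 0f3F000011):

```python
C = 0.5000011  # stand-in constant

def rolled(x, n):
    # One dependent multiply-add per trip, plus loop control every iteration
    for _ in range(n):
        x = x * C + 1.0
    return x

def unrolled4(x, n):
    # Identical arithmetic, but loop control runs only n/4 times
    assert n % 4 == 0
    for _ in range(n // 4):
        x = x * C + 1.0
        x = x * C + 1.0
        x = x * C + 1.0
        x = x * C + 1.0
    return x
```

The PTX above advances its counter by 32 per trip (`add.s32 %r6, %r6, 32`), i.e. roughly 32 original iterations execute per compare-and-branch.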

    There is no co-issue here.

A combination of the compiler being stupid and, I think, the fact that FMA isn't strictly the same.
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
You asserted: "If the compiler is good, [...] emitting both scalar + vector in an interleaved "dual issue" way (the CU can issue both at the same cycle, doubling the throughput)." Which is clearly not the same as the core scheduling an SALU op from one hardware thread as well as a VALU op from another. The compiler hasn't emitted "dual issue" in this case.

The important case is that all work-group synchronisations cost 0 cycles with 64 work items: barriers are elided. For example, it makes heavy-duty LDS usage that bit faster.
     
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    The graphs made by ka_rf are very nice.

The graph actually shows that the compute-only line (blue) diverges from the overall trend, slowing down in that region. The orange line hasn't shifted downwards from its trend, which is what would indicate that async compute is occurring; instead the blue line has wobbled into slowness. The blue line returns to its trend near 360.

    There's no async here.

    Could that be the start of the single command list run?
     