DX12 Performance Discussion And Analysis Thread

Discussion in 'Rendering Technology and APIs' started by A1xLLcqAgt0qc2RyMz0y, Jul 29, 2015.

  1. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Missed this one. Why is Fiji getting similar results to Maxwell 2, then?
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I can only blame myself.
    My brain was undergoing poorly implemented context switching at that time.
     
  3. ahezard

    Joined:
    Sep 2, 2015
    Messages:
    3
    Likes Received:
    0
    My bad, I missed that.

    I do not understand the result of this new test. How can I get similar results to a 290 with my 7790? The pixels pushed/s figure is lower, but the execution time is similar.
     

    Attached Files:

    • perf.zip
      File size:
      135.3 KB
      Views:
      20
  4. It's not.

    You'll probably resort to a lot of mental gymnastics to claim it is, though.
    Good luck with that, I guess. ;-)
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Shame you didn't flesh out your thoughts.

    I've been pondering your comment about dual-issue and I still don't understand how that's relevant. Maybe you can explain that. Do you know how this code compiles on Maxwell 2?
     
  6. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    If the compiler is good, it could either skip the vector units completely by emitting pure scalar unit code (saving power) or emit both scalar + vector in an interleaved "dual issue" way (the CU can issue both in the same cycle, doubling the throughput).

    Benchmarking thread groups that are under 256 threads on GCN is not going to lead to any meaningful results, as you would (almost) never use smaller thread groups in real (optimized) applications. I would suspect a performance bug if a kernel thread count doesn't belong to {256, 384, 512}. Single lane thread groups result in less than 1% of meaningful work on GCN. Why would you run code like this on a GPU (instead of using the CPU)? Not a good test case at all. No GPU is optimized for this case.
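
    As a rough illustration of the thread group size point, here is a minimal sketch that counts how many SIMD lanes carry a real thread, assuming only that GCN schedules work in 64-lane wavefronts. The function name is made up for this example. Lane occupancy alone puts a single-thread group at about 1.6%; anything that pushes the useful fraction lower (such as latency hiding lost with so few wavefronts in flight) is not modeled here.

```python
import math

WAVEFRONT_SIZE = 64  # GCN schedules threads in 64-lane wavefronts


def lane_utilization(threads_per_group: int, wavefront_size: int = WAVEFRONT_SIZE) -> float:
    """Fraction of allocated SIMD lanes that carry a real thread.

    Hypothetical helper for illustration only; it models lane occupancy,
    not scheduling, memory latency, or clock behaviour.
    """
    if threads_per_group < 1:
        raise ValueError("a thread group needs at least one thread")
    # A group is padded up to a whole number of wavefronts.
    wavefronts = math.ceil(threads_per_group / wavefront_size)
    return threads_per_group / (wavefronts * wavefront_size)


if __name__ == "__main__":
    for size in (1, 64, 100, 256):
        print(f"{size:>3} threads -> {lane_utilization(size):.2%} of lanes busy")
```

    Group sizes that are multiples of 64 fill every lane in this model, which is consistent with the 256/384/512 sizes mentioned above.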

    Also I question the need to run test cases with tens of (or hundreds of) compute queues. Biggest gains can be had with one or two additional queues (running work that hits different bottlenecks each). More queues will just cause problems (cache thrashing, etc).
     
    #366 sebbbi, Sep 3, 2015
    Last edited: Sep 3, 2015
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I think you have the best AMD results so far for the compute portion of this test.

    I did have a theory that the single shader engine might make the Cape Verde chip better, but that was based on my belief that the compute should finish much faster. Perhaps there is something there, but it's not blatantly obvious to me right now...

    I suspect your card is clocked at more than 1GHz since my 7770 is 1GHz.
     
  8. Sinistar

    Sinistar I LIVE
    Regular Subscriber

    Joined:
    Aug 11, 2004
    Messages:
    660
    Likes Received:
    74
    Location:
    Indiana
    Seems to like 50% GPU usage on Fury X

    [​IMG]
     

  9. ahezard

    Joined:
    Sep 2, 2015
    Messages:
    3
    Likes Received:
    0
    Indeed, my card is clocked at 1.1 GHz (R7 260X reference clock). So it seems the result on the compute part of the second test for GCN depends only on the clock and does not depend at all on the number of compute units you have.
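
    A minimal sketch of why a result could track clock speed and ignore compute unit count, assuming the compute part is a chain of dependent instructions confined to a single wavefront (suggested by the "single lane" description in this thread, but not confirmed against the test's source). It uses the fact that a 64-lane GCN wavefront occupies a 16-lane SIMD for 4 cycles per vector instruction. The function name and the instruction count are invented for illustration.

```python
CYCLES_PER_WAVE_INSTRUCTION = 4  # 64-lane wavefront on a 16-lane SIMD


def serial_kernel_time_ms(dependent_instructions: int,
                          clock_hz: float,
                          compute_units: int = 1) -> float:
    """Lower-bound run time of one wavefront executing a dependent chain.

    Hypothetical model: compute_units is accepted but deliberately unused,
    because a single wavefront runs on one SIMD of one CU no matter how
    many CUs the chip has.
    """
    cycles = dependent_instructions * CYCLES_PER_WAVE_INSTRUCTION
    return cycles / clock_hz * 1e3


if __name__ == "__main__":
    chain = 1_000_000  # invented instruction count, not taken from the test
    for label, clock, cus in (("1.0 GHz", 1.0e9, 10), ("1.1 GHz", 1.1e9, 14)):
        ms = serial_kernel_time_ms(chain, clock, cus)
        print(f"{label}, {cus} CUs: {ms:.3f} ms")
```

    Under these assumptions the only hardware parameter that changes the result is the clock, so two chips with very different CU counts would land close together.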
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The test case is whether graphics processing time can overlap with compute. Is there an expectation that a 256 thread group would change the verdict for the GPUs running it?
    A single lane seems like a base case that is generally equivalent to GPUs with differing SIMD widths when it comes to testing if the two types of threads can overlap their lifespans without customizing the code.
    There was puzzlement when it came to why the latency was that disparate for GCN for the lowest cases, which is largely explained by the omission of the 4-cycle wavefront.


    I am curious about the exact placement of the inflection points for the timings, since they don't necessarily line up with some of the most obvious resource limits.

    The current testing does not do this, although a reason to test it would be similar to why people climb mountains: because it's there.
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I believe SALU and VALU ops have to come from different hardware threads, so this specific kernel couldn't be sped up that way.

    I strongly disagree as I have some code that runs fastest with 64, but I come from an OpenCL perspective (can't get more than 256 work items into a work-group, apart from anything else :razz:)...

    Fillrate tests aren't meaningful work either. This test does reveal serial versus async behaviours, so it's a success on those terms.
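
    One way to read serial versus async behaviour out of such timings is sketched below: if the combined run takes about as long as the slower of the two workloads, they overlapped; if it takes about the sum, they ran back to back. This is an assumption about how to interpret the numbers, not the test's own logic. The function name, the 10% tolerance, and the sample timings are all invented for illustration.

```python
def classify_overlap(graphics_ms: float,
                     compute_ms: float,
                     combined_ms: float,
                     tolerance: float = 0.10) -> str:
    """Label a combined timing as 'overlapped', 'serial', or 'partial'.

    Hypothetical helper: the tolerance is an arbitrary allowance for
    measurement noise, not a value taken from the benchmark.
    """
    ideal = max(graphics_ms, compute_ms)   # perfect overlap
    serial = graphics_ms + compute_ms      # no overlap at all
    # The overlap check comes first, so when one workload is tiny and the
    # two bounds nearly coincide, the result is reported as overlapped.
    if combined_ms <= ideal * (1 + tolerance):
        return "overlapped"
    if combined_ms >= serial * (1 - tolerance):
        return "serial"
    return "partial"


if __name__ == "__main__":
    # Invented sample timings in milliseconds, not results from this thread.
    for combined in (21.0, 25.0, 30.0):
        print(combined, classify_overlap(20.0, 10.0, combined))
```

    A rising combined time is therefore not conclusive by itself; what matters is where it sits between the larger individual time and the sum of both.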

    The results presented demonstrate async compute, when it's available. The test does so with one or two queues as far as I can tell. It appears that even a single queue, the command list test, results in async compute on GCN. EDIT: erm, actually I don't think that last sentence is correct.

    What I'm curious to see is whether NVidia hardware will gain async behaviour once NVidia has spent some time on this. Maybe it truly won't work on the current chips, but I'm not convinced either way as yet.
     
  12. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    805
    Likes Received:
    1,634
    Well, I am not a native speaker, so it's a PITA to flesh out thoughts in the right way.

    Should I explain how ILP, or the lack of ILP due to instruction dependencies, affects performance?

    Nope. Moreover, I've not seen the code; the whole thing was obvious starting from MDolenc's "single lane" words, though instruction dependencies, arithmetic instruction latencies and instruction pairing could affect performance too, but such things really require code analysis.
     
  13. RedditUserB

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    24
    Likes Received:
    1
    Why does the Async test on GCN do nothing to CPU usage or temps, while on Maxwell CPU usage/temps spike very high and GPU usage drops to 0%?

    Does this not indicate that there is software emulation of Async Compute "support" going on? This is what Oxide has referred to directly: high CPU usage due to driver emulation. I'd imagine in a real gaming workload the compute task would be a lot more complex; if it's offloaded to the CPU, it will hammer the CPU into a stall.
     
  14. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    http://nubleh.github.io/async/#18

    http://nubleh.github.io/async/#38

    No mental exercise, just looking up the two Fiji tests; that might be too much for you. In any case, can we say Fiji doesn't have hardware-supported async because of this test? The results are poor compared to older GCN architectures, although better than nV's hardware, but they mirror the up and down cycles of lower and higher latency at certain points and the step-by-step increase based on load.
     
    #374 Razor1, Sep 3, 2015
    Last edited: Sep 3, 2015
  15. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    805
    Likes Received:
    1,634
  16. RedditUserB

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    24
    Likes Received:
    1
    You should read what sebbbi has to say a few posts earlier: the tool is not fit to analyze performance for GCN because it vastly under-utilizes the architecture.

    Also in your example, look at the Forced Async, massive improvement for GCN. Massive degradation for Maxwell.
     
  17. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    What forced async? There is no forced async; the single command list is not forced async.

    I read what sebbbi stated, but with the same number of ACEs (8) in Hawaii and Fiji, we shouldn't see a data plot similar to Maxwell 2 where latency is concerned. I would expect to see something similar to Hawaii or Tonga; you can see spikes, yes, but that is about it.

    This is latency, not end performance.

    And the theory about offloading to the CPU when doing async doesn't make sense: the latency should increase drastically when doing this, and the plot wouldn't follow the step-by-step pattern anymore; it should look like what the single command list looks like.
     
    #377 Razor1, Sep 3, 2015
    Last edited: Sep 3, 2015
  18. RedditUserB

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    24
    Likes Received:
    1
    [​IMG]

    Do you guys notice anything different in Fiji/Fury?

    Look at the ACEs. There are normally 8 in other GCN chips; Fiji has 4 ACEs and 2 HWS...

    Some radically different way to handle queues?
     
  19. juanchotazo99

    Joined:
    Sep 1, 2015
    Messages:
    6
    Likes Received:
    1
    http://forums.anandtech.com/showpost.php?p=37656793&postcount=204

    Interesting. Does Tonga have the same behavior as Fiji under this test?
     
  20. Phyxsyus

    Joined:
    Sep 2, 2015
    Messages:
    3
    Likes Received:
    0
    I just ran the new AsyncCompute benchmark, and the results are kinda different from the last one. Could anyone kindly explain my results? If ms are going up, does this mean it's not doing async?

    R9 280 OC
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.