DX12 Performance Discussion And Analysis Thread

Discussion in 'Rendering Technology and APIs' started by A1xLLcqAgt0qc2RyMz0y, Jul 29, 2015.

  1. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    Shame on me. I messed up with the data on GCN 1.2. :-(

    It is not the HWS which is capable of doing balancing. It's the Graphics Command Processor which got doubled in width on GCN 1.2. That thing is able to hold 128 grids in flight, as opposed to 64 in 1.1. The ACE/HWS still only manages 64 grids each, no more. That's why GCN 1.2 was much faster in "single commandlist" mode (="Everything in the 3D queue") than it was in pure compute or split mode.

    That means I also misinterpreted the data on Maxwell. It actually has 32 shader slots which are all active while in compute mode. But it has only a single(!) compute shader slot while in graphics mode, while the other slots are reserved (hard-wired) for other shader types, which is why it failed so badly. And yes, it does need to switch between graphics and compute queue mode; it can't do both in parallel. This is unrelated to the Hyper-Q feature, which operates independently of these regular 32 slots, which is why dwm.exe and the like can cut ahead.
    There are no parallel, DX12-compatible compute and 3D paths in hardware. Only one 3D queue, which can switch between compute and graphics mode.
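    To make the terminology concrete, here is a minimal D3D12 sketch of the two submission paths being discussed (my own illustration, not code from either benchmark; it only assumes an already-created ID3D12Device):
    Code:
    #include <windows.h>
    #include <d3d12.h>
    
    // Sketch only: create the "3D queue" (DIRECT) plus a separate compute queue.
    // Whether work submitted to the two queues actually overlaps on the GPU is
    // entirely up to the hardware/driver - which is the point under discussion.
    void CreateQueues(ID3D12Device* device,
                      ID3D12CommandQueue** gfxQueue,
                      ID3D12CommandQueue** computeQueue)
    {
        D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
        gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;      // graphics + compute
        device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(gfxQueue));
    
        D3D12_COMMAND_QUEUE_DESC computeDesc = {};
        computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE; // compute only
        device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(computeQueue));
    
        // "Everything in the 3D queue": submit every command list to *gfxQueue.
        // "Split mode": submit the compute command lists to *computeQueue instead.
    }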

    I failed at interpreting "single commandlist" correctly, and never gave it a second thought.
     
    drSeehas likes this.
  2. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    I think at this stage there is likely no concept of a wavefront. It is likely a descriptor of the size of a workgroup, with the program pointers, the VMID, the resources it needs (LDS allocation size) and perhaps also the data it may need to inject (for pixel/vertex shaders?). The SPI will then "break" a workgroup into wavefronts after receiving it.
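    For reference, the HSA AQL kernel dispatch packet is roughly that kind of workgroup-level descriptor. A simplified version of the structure from the hsa.h runtime header is below (the VMID is not part of the packet itself; it is assigned by the hardware, and the signal/pointer types are flattened to integers here for brevity):
    Code:
    #include <stdint.h>
    
    // Simplified from hsa.h (HSA 1.0): one packet describes a whole grid of
    // workgroups, not individual wavefronts - those are formed later in hardware.
    typedef struct hsa_kernel_dispatch_packet_s {
        uint16_t header;               // packet type, barrier/acquire/release bits
        uint16_t setup;                // number of grid dimensions
        uint16_t workgroup_size_x;     // workgroup size, in work-items
        uint16_t workgroup_size_y;
        uint16_t workgroup_size_z;
        uint16_t reserved0;
        uint32_t grid_size_x;          // grid size, in work-items
        uint32_t grid_size_y;
        uint32_t grid_size_z;
        uint32_t private_segment_size; // per-work-item scratch, in bytes
        uint32_t group_segment_size;   // LDS allocation per workgroup, in bytes
        uint64_t kernel_object;        // "program pointer": handle to the kernel code
        uint64_t kernarg_address;      // pointer to the kernel arguments
        uint64_t reserved2;
        uint64_t completion_signal;    // signalled when the dispatch has completed
    } hsa_kernel_dispatch_packet_t;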

    I think this is expected behaviour. Commands from the same queue should likely be committed in program order, since we already have concurrency in the form of multiple queues. HSA's platform specification also states that:
    Code:
    All preceding packets in the queue must have completed their launch phase.
    Carrizo is going to make this more complicated, since you now have mid-wave preemption (for the CUs) and also potentially mid-dispatch preemption (for the ACEs).
     
  3. X-AleX

    Newcomer

    Joined:
    May 20, 2005
    Messages:
    72
    Likes Received:
    9
  4. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,911
    Likes Received:
    1,608
    Glad to see Lionhead's comments! It should take care of some of the guessing that's been going around in this thread.
     
  5. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    So it's using async compute for GI as well; that is good to know.
     
  6. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    582
    Likes Received:
    285
    I would like to know if they are going to use features like CR, ROV, PS stencil ref, or even TR (I am still not aware of a single game using D3D TR...). Anyway, did they use a private UE implementation or did they write the D3D12 backend on their own? The current UE DX12 RHI implementation is quite primitive (but it looks like some performance optimizations will appear in UE 4.10...).
     
  7. monstercameron

    Newcomer

    Joined:
    Jan 9, 2013
    Messages:
    127
    Likes Received:
    101
  8. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    The way the licensing is set up for Unreal Engine 4, if they like something a dev did they can add it back into the main branch of the UE4 engine. So I wouldn't be surprised if many of the features of Fable Legends are introduced back into UE4 at a later date.
     
  9. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,911
    Likes Received:
    1,608
    http://wccftech.com/fable-legends-dx12-benchmark-results/

    I think we'll probably only get a good performance indication once the game is released and benchmarked.
     
    #889 pharma, Oct 3, 2015
    Last edited: Oct 3, 2015
  10. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,583
    Likes Received:
    703
    Location:
    Guess...
    From that link:

    "Microsoft and Lionhead Studios have assured us that asynchronous compute is indeed activated and on for the test, across the board, and that it doesn’t turn off based on the presence of a card that doesn’t support it.

    Read more: http://wccftech.com/asynchronous-compute-investigated-in-fable-legends-dx12-benchmark/#ixzz3nUuiuGAQ"

    So that suggests the previous claim from the Anandtech forums, that there is a second build which allows async compute to be toggled on and off (resulting in much higher performance on AMD cards), was false.

    Also, it sounds like they aren't doing a huge amount concurrently, which is where AMD should/might have the advantage over Nvidia. i.e...

    "Multi-engine is the official D3D12 term for that feature. We’re using it quite heavily across the scene, including dynamic GI compute shaders, GPU-based culling for instanced foliage, Forward Plus compute shaders (light gathering). In addition, all the foliage physics in the scene (bushes, grass, etc.) is simulated on the GPU that also runs concurrently with the graphics.

    Read more: http://wccftech.com/lionhead-dx12-features-fable-legends/#ixzz3nUvFWICF"
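    For anyone wondering how that "runs concurrently with the graphics" part is expressed at the API level, here is a minimal fence sketch. It is purely my own illustration with made-up names, not Lionhead's code:
    Code:
    #include <windows.h>
    #include <d3d12.h>
    
    // Hypothetical sketch: the compute queue signals a fence when the foliage
    // simulation is done, and the direct (graphics) queue waits on it before
    // drawing the results; until that wait point the two queues are free to
    // overlap. All parameter names are illustrative only.
    void SubmitFrame(ID3D12Device* device,
                     ID3D12CommandQueue* computeQueue, ID3D12CommandQueue* gfxQueue,
                     ID3D12CommandList* foliageSimList,  // compute: foliage physics
                     ID3D12CommandList* gbufferList,     // graphics: main scene
                     ID3D12CommandList* drawFoliageList) // graphics: uses sim output
    {
        ID3D12Fence* fence = nullptr;
        device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
    
        computeQueue->ExecuteCommandLists(1, &foliageSimList); // async compute work
        computeQueue->Signal(fence, 1);                         // mark completion
    
        gfxQueue->ExecuteCommandLists(1, &gbufferList);         // may overlap with it
        gfxQueue->Wait(fence, 1);                                // GPU-side wait
        gfxQueue->ExecuteCommandLists(1, &drawFoliageList);      // consumes the results
    
        fence->Release();
    }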
     
  11. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    This ain't correct. It's not running concurrently on Nvidia's hardware. That just isn't possible, at all.

    What is working, though, is that compute commands which are committed in a batch to the compute engine are in fact running concurrently with each other. While compute commands in the 3D engine are only executed in sequential order, up to 32 grids (depending on how large the chip is) can be in flight while the scheduler is running in compute mode, which naturally increases utilization of the GPU even when not running concurrently with a 3D workload.
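    To illustrate what "committed in a batch" means at the API level, a rough sketch with made-up names (it just assumes a compute queue and some pre-recorded compute command lists):
    Code:
    // Rough sketch only: three pre-recorded compute command lists handed over in
    // a single ExecuteCommandLists call - the "batch" the compute engine can pull
    // grids from and keep in flight concurrently. Recording the same dispatches
    // into a direct (3D) command list instead would execute them strictly in
    // order, subject to every barrier on the 3D path.
    ID3D12CommandList* batch[] = { computeList0, computeList1, computeList2 };
    computeQueue->ExecuteCommandLists(_countof(batch), batch);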

    GPUView only shows which queue the workload was submitted to, but not which engine mode was used for a specific command list.

    And no matter what, no Nvidia GPU will be able to execute 3D and compute shaders concurrently, and attempting to do so will result in various stalls.

    Nope. The numbers were from AMD's press deck, not from a build Anandtech had access to. And you can be sure that AMD was able to turn off async compute by forcefully rerouting all commands to the 3D engine, which is more than just rerouting everything to the 3D queue, which Nvidia's driver does either way.

    And it also makes sense. Even though GCN is even capable of executing compute shaders from the 3D engine concurrently (yeah, surprise!), it still complicates scheduling on the GPU, partially because it means that all barriers used for the 3D path are also applied to the compute commands.

    And as for Nvidia: See above.
     
    Xuper and drSeehas like this.
  12. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    582
    Likes Received:
    285
    Thank you.
    Finally someone is starting to put up some meaningful graphs. I hope to see deeper analysis in the future, because it's unclear what is really going on in the Titan X profiling session; it looks like all compute work is executed from the default queue (i.e. it is serialized directly by the application).
    This also looks strange: "but is there for another background task not associated to the benchmark at all."... and it is also quite irregular, while the GCN gen 3 session is the opposite, generating a pretty regular graph.
     
  13. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,911
    Likes Received:
    1,608
    Since the AOS benchmark came out, new GPU reviews at Guru3D have included it as part of their testing regimen. It's interesting to note that performance on GCN hardware is basically the same regardless of which model is used. You do see more variation with Maxwell 2, especially with factory OC'd models, which is one of the advantages of having an OC-friendly architecture. I can only assume Maxwell 2 performance will improve once Nvidia releases drivers optimized for async compute, but at this point it seems to be holding its own.

     
  14. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,583
    Likes Received:
    703
    Location:
    Guess...
    I read the quote more as a statement that that is how the game is configured to be used by GPUs that support concurrent compute/graphics, not necessarily that all GPUs were behaving that way. So in effect, the advantage that AMD has over Nvidia in this case is specific to vegetation physics.

    This was in reference to a user on the Anandtech forums who claimed he had access to a build of the benchmark different from the press one, which had a toggle to enable or disable async compute, and he was apparently seeing a massive boost in performance on AMD hardware as a result, i.e. Fury X easily beating Titan X.

    A dodgy claim to be sure, but some claimed he was a reliable source.
     
  15. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,583
    Likes Received:
    703
    Location:
    Guess...
    Something to do with GeForce Experience? Shadow Play perhaps? Just a (very) wild guess.
     
  16. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    582
    Likes Received:
    285
    Do people really use such software when profiling/benchmarking? o_O
     
  17. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,583
    Likes Received:
    703
    Location:
    Guess...
    It's potentially just turned on and running all the time; they may not even have realised. Although yeah, it would be good practice to make sure it's not running.
     
  18. Genotypical

    Newcomer

    Joined:
    Sep 25, 2015
    Messages:
    38
    Likes Received:
    11
    Some of these things might not be as taxing in a canned benchmark. Foliage physics is the clearest example: one might not see a difference between going async and doing what Nvidia does.
     
  19. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland

     
  20. Genotypical

    Newcomer

    Joined:
    Sep 25, 2015
    Messages:
    38
    Likes Received:
    11
    I'm not familiar with GPUView, but doesn't that 18.91% represent a percentage of the frametime? To figure out how much is actually going on there relative to the entire scene, you would have to take into account how fast the hardware completes certain amounts of work, as well as the type of work.

    Can they conclude much from 18.91%? There could be a crap ton of compute going on but being done really quickly. Or very little being done relatively slowly.

    What about the 5% claim attributed to Lionhead?

    I wish Lionhead would be as transparent as Oxide were.

    For the Nano it looks like there are some compute tasks in the graphics queue - maybe nothing. The tasks in the compute queue also seem to show up at predictable intervals - just the same thing happening at regular intervals. (Not familiar with GPUView.)

    http://cdn.wccftech.com/wp-content/uploads/2015/10/ViewNano.png
     