DX12 Performance Discussion And Analysis Thread

Discussion in 'Rendering Technology and APIs' started by A1xLLcqAgt0qc2RyMz0y, Jul 29, 2015.

  1. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    Well, each HWS unit is supposed to work like two ACEs, so I don't think they can do it either.
     
  2. Ext3h

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    428
    Likes Received:
    497
    They can do a lot more than that - emulating 2 legacy ACEs is just ONE possible option. I'm not sure that simplification is even remotely accurate regarding the "behaves like two ACEs" part, though - not with all the new hardware features like preemption and the like which were added with Tonga/Fiji.

    http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/10/si_programming_guide_v2.pdf
    Just search for "micro engine" inside that document. That document assumes that you have loaded a firmware with PM4 decoding capabilities, but there are other options (queue formats, and corresponding firmwares) as well.

    For the more recent GCN architectures, there is not only a "micro engine", but also a "micro engine compute" which is programmable as well and handles 8x8 queues. (Recent APUs even have 2 MEC units, each handling only 4x8, but in return they may run different firmwares.)

    Alternatively, try the source code of the Linux radeon driver; it contains a little explanation as well:
    https://cgit.freedesktop.org/~airlied/linux/tree/drivers/gpu/drm/radeon/cik.c?h=drm-next#n3895
    (Thanks to @CarstenS for finding that one. The comment describes an APU, not a dGPU.)
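
    As a quick arithmetic sketch of the queue counts mentioned above - the 8x8 (single MEC) and 2x4x8 (APU) configurations are taken from this post, not from any official spec:

```python
# Hedged sketch: compute-queue topology as described in the post above.
# The numbers (8 pipes x 8 queues for one MEC, or 2 MECs at 4x8 each on
# recent APUs) are illustrative, not authoritative for any specific ASIC.

def total_queues(mecs, pipes_per_mec, queues_per_pipe=8):
    """Total hardware compute queues exposed by the MEC block(s)."""
    return mecs * pipes_per_mec * queues_per_pipe

dgpu = total_queues(mecs=1, pipes_per_mec=8)  # single MEC, 8 pipes
apu = total_queues(mecs=2, pipes_per_mec=4)   # two MECs, 4 pipes each

# Both layouts end up with the same total queue capacity.
assert dgpu == apu == 64
```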
     
    Razor1 and Nemo like this.
  3. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Well, the micro engines seem to me like they are there to work across more than one block. Fiji has more blocks than Hawaii, and with the same amount of ACEs (ACE+HWS) it was a way to save space while retaining the same overall "number" of ACEs.
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The front ends that function as HWS are ACEs with an additional processor that apparently has a role in mapping and unmapping which software queues are being handled by the hardware. This would permit a program to allocate more queues than are present in hardware, or the GPU to juggle multiple programs that in aggregate exceed the non-HWS queue capacity.
    This came up in the context of AMD's SR-IOV, where HWS can map queues from multiple address spaces to hardware. Remapping within an address space would seem to fall out from that.
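
    The map/unmap idea can be illustrated with a toy sketch: more software queues (possibly from several processes/address spaces) than hardware slots, with an HWS-like scheduler swapping the mappings over time. The slot count of 8 and the round-robin policy are assumptions for illustration only; the actual firmware policy is not public.

```python
# Toy model of queue oversubscription: 12 software queues from two
# processes share 8 hardware slots. Each timeslice, the scheduler maps
# the first 8 pending queues and then rotates so unmapped queues get
# their turn. Slot count and policy are invented for illustration.
from collections import deque

HW_SLOTS = 8

def schedule(software_queues, timeslices):
    """Return the list of hardware mappings chosen for each timeslice."""
    pending = deque(software_queues)
    history = []
    for _ in range(timeslices):
        mapped = list(pending)[:HW_SLOTS]  # map up to 8 queues
        history.append(mapped)
        pending.rotate(-HW_SLOTS)          # unmap, give others a turn
    return history

queues = [("procA", q) for q in range(8)] + [("procB", q) for q in range(4)]
slices = schedule(queues, timeslices=3)

# Every software queue gets hardware time within the first two slices,
# even though only 8 can be resident at once.
assert len(slices[0]) == HW_SLOTS
assert set(slices[0]) | set(slices[1]) == set(queues)
```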
     
    Ext3h and Razor1 like this.
  5. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    I've been given to understand that a HWS basically is a little dumbed-down ACE, mainly lacking the wavefront dispatch capability of a fully configured ACE. A HWS in its current configuration supposedly manages 8 queues, just like an ACE.
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    8 hardware queues perhaps, but with AMD's hardware virtualization, and with its compute driver documenting a mode that permits oversubscribing queues, the capacity to dynamically handle more than 8 per front end would need to be present.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Quick response queue:

    https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved

     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I wonder if that's a combination of wavefront prioritization and wavefront preemption.
    Prioritization can get a desired ratio of wave launches, but it would be less able to achieve the (marketing) graph's idealized transitions if there is an unlucky run of long-lived graphics wavefronts.
    However, I don't think fully preemptible graphics is discussed prior to Carrizo's generation.
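
    The distinction can be made concrete with a toy model: prioritization only affects which new wavefront launches when a slot frees up; it cannot evict a long-lived graphics wavefront that already occupies a slot. All numbers below are made up for illustration.

```python
# Toy model: with prioritization but no preemption, the transition to
# high-priority compute completes only once every resident graphics
# wave has drained on its own, so it is bounded by the longest one.

def drain_time(resident_wave_lengths):
    """Cycles until ALL slots are free, assuming no preemption."""
    return max(resident_wave_lengths)

# One unlucky long-lived graphics wave delays the full transition:
assert drain_time([10, 12, 11, 500]) == 500

# With preemption, the transition cost would instead be the (small)
# context-save latency, independent of the remaining wave lifetime.
```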
     
  9. Ext3h

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    428
    Likes Received:
    497
    Wavefront preemption only arrived with the 3rd generation (Fiji and Tonga). And I'm not sure it's being utilized by the driver yet.
     
  10. bridgman

    Newcomer Subscriber

    Joined:
    Dec 1, 2007
    Messages:
    62
    Likes Received:
    123
    Location:
    Toronto-ish
    HWS isn't a separate block, just an optional mode that an MEC pipe can use. By default each pipe round-robins across its own 8 queues, but you can enable HWS which instead takes a "runlist" (list of processes plus per-process queues) and schedules those processes/queues onto the queues of the other MEC pipes.

    We enabled it in the Sept 2015 HSA stack for Carrizo, and IIRC just enabled it for Fiji in the latest ROC stack.
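
    The runlist idea described above can be sketched roughly as follows. The pipe/slot counts and the simple first-fit assignment are illustrative assumptions, not the actual firmware behavior:

```python
# Toy model of HWS runlist mode: one pipe running HWS firmware takes a
# runlist (list of processes plus per-process queues) and spreads the
# entries over the queue slots of the remaining MEC pipes. Counts and
# assignment policy are invented for illustration.

PIPES = 4           # pipes available for user queues in this sketch
SLOTS_PER_PIPE = 8  # hardware queue slots per pipe

def map_runlist(runlist):
    """runlist: [(process_id, [queue_ids])] -> {(pipe, slot): (proc, q)}"""
    mapping = {}
    slot_iter = ((p, s) for p in range(PIPES) for s in range(SLOTS_PER_PIPE))
    for proc, queues in runlist:
        for q in queues:
            pipe_slot = next(slot_iter, None)
            if pipe_slot is None:
                break  # oversubscribed: remaining queues wait for a remap
            mapping[pipe_slot] = (proc, q)
    return mapping

runlist = [("game", [0, 1, 2]), ("compute_app", [0, 1])]
m = map_runlist(runlist)
```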
     
    CarstenS, Razor1 and Ext3h like this.
  11. Ext3h

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    428
    Likes Received:
    497
    @bridgman There is something I've been wondering about for quite a while now - are the MEC units of any of the GCN generations by chance capable of monitoring the global data share for synchronization flags, rather than relying on a CPU-side scheduler?

    More specifically, how do the MECs behave when they encounter a PM4 WAIT_REG_MEM packet in one of the queues? That would not block the entire MEC but only a single queue, wouldn't it?
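
    For reference, here is how the header DWORD of such a packet is built, following the radeon kernel driver's PACKET3() macro. The opcode value (0x3C for WAIT_REG_MEM) and the count field (body DWORDs minus one) are taken from the driver sources; treat this as an illustration, not a spec.

```python
# Hedged sketch of a PM4 type-3 packet header, modeled on the Linux
# radeon driver's PACKET3() macro:
#   [31:30] packet type, [29:16] count (body DWORDs - 1), [15:8] opcode.

PACKET_TYPE3 = 3
WAIT_REG_MEM = 0x3C  # opcode per the radeon driver headers

def packet3(opcode, count):
    """Encode a PM4 type-3 header DWORD."""
    return (PACKET_TYPE3 << 30) | ((count & 0x3FFF) << 16) | ((opcode & 0xFF) << 8)

# WAIT_REG_MEM carries 6 body DWORDs (function/space flags, address
# lo/hi, reference value, mask, poll interval), so count = 5:
header = packet3(WAIT_REG_MEM, 5)
assert header == 0xC0053C00
```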
     
  12. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    Quick question, guys, and I'm sorry to go off on a point tangential to the discussion, but what level of concurrency are we talking about here?

    The whole discussion regarding Maxwell and its concurrent multi-engine capabilities seems to gravitate around concurrency within one SMM (or is it SMX? a cluster of 'cores'!), and the argument is that it is impossible to execute commands from multiple queues concurrently (when I say concurrently, I am categorically excluding parallelism).

    Is there anything preventing asynchronous, concurrent execution of graphics + compute across the entirety of the GPU?
     
  13. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    For nV, from what I have read, there is none. For AMD, no and yes: because of the way AMD's cache is accessed by each ACE, depending on L1 or L2 cache, there will be penalties when trying to pull instructions from a block not adjacent to the ACE block that is working on that specific queue. But again, there should be no reason to use the L2 cache if done right.
     
  14. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    Are you the same razor1 I'm talking to on HardOCP?

    I was under the impression that the main advantage of AMD's design was that the ACEs can address any execution unit, independent of adjacency. It was also my understanding that a global data share (I think that's what they called it) enables preemption at a small cost (1 cycle).
     
  15. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Yes I am.

    Hmm, yeah, that is the penalty from using the different cache levels.

    For nV hardware I say no, but there might be something there, because it's not well documented...
     
  16. Ext3h

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    428
    Likes Received:
    497
    Yes - only a single hardware queue is exposed to DX12 applications, as the second Hyper-Q command processor sporting the independent compute queues is, for some unknown reason, not used.

    So the application is always subject to cooperative software scheduling (default fallback solution) for concurrent, asynchronous execution.
     
    Razor1 likes this.
  17. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    Oh yes, I remember this from your blog post :)

    Right, let's assume for a minute that Hyper-Q, and the GMU it holds, isn't coming back.

    Under DX11, Hyper-Q is working at the driver level, so if my performance under DX12 is pretty much identical to performance under DX11, it is safe to assume the GPU is being utilized to the same extent.

    This then raises the question: under what circumstances would async shaders even be beneficial to Maxwell/Kepler, other than the obvious advantages for VR with fine-grained preemption?
     
  18. Ext3h

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    428
    Likes Received:
    497
    Not only "would", but "are", as we can see from examples like (the canceled) Fable Legends.
    IMHO, I think that there are cases in which the scheduler does a better job at aligning the render pipeline stall free, compared to what developers at Lionhead Studios achieved when attempting to do it manually / statically, using only the 3D queue.

    Apart from that: none.
    It should even out at ±0 with the use of async shaders. Slight gains when the scheduler can avoid a stall, slight losses when the scheduler messes up. And that's about the same for every architecture not supporting parallel execution in hardware; it's not limited to Maxwell and Kepler.

    Actually, there are two aspects where the hardware can aid with / profit from async shaders. One is the obvious parallel execution of independent command lists; the other would be hardware support for queue synchronization, avoiding the CPU roundtrip for scheduling. Both require support for multiple queues in hardware, but the features can be provided independently.
    The first theoretically brings increased resource utilization, at the possible risk of cache thrashing. The second one greatly reduces the penalty for synchronization.

    If I'm not mistaken, GCN provides both features for the compute queues, while synchronization from the 3D queue to the compute queues is limited to barriers and fences in the compute queue.
    Kepler and Maxwell provide neither of these features unless the GMU is brought back to life, in which case they should actually behave quite similarly to GCN. (The lack of mixed-mode operation on single SMM units aside.)
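
    The second feature (hardware queue synchronization) can be reduced to a toy cost model: a consumer queue waits on a fence value signaled by a producer queue. With a hardware-resident wait, the dependency resolves as soon as the fence is signaled; with a CPU-side scheduler, a roundtrip latency is added per dependency. The latency numbers below are invented for illustration.

```python
# Toy cost model for cross-queue fence waits. CPU_ROUNDTRIP is an
# assumed, made-up cost (arbitrary time units), only meant to show
# where the penalty of software scheduling comes from.

CPU_ROUNDTRIP = 50

def completion_time(signal_time, hw_sync):
    """Time at which the waiting queue can resume after its fence."""
    return signal_time if hw_sync else signal_time + CPU_ROUNDTRIP

# Hardware-resident wait: resume immediately at the signal.
assert completion_time(100, hw_sync=True) == 100
# CPU-scheduled wait: the roundtrip is added on top.
assert completion_time(100, hw_sync=False) == 150
```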
     
    ieldra likes this.
  19. Alessio1989

    Regular

    Joined:
    Jun 6, 2015
    Messages:
    614
    Likes Received:
    321
    AMD GDC 2016 slides available for download:

    Practical DirectX 12 – Programming Model and Hardware Capabilities – Gareth Thomas (AMD), Alex Dunn (NVIDIA)
    Vulkan Fast Paths – Graham Sellers (AMD), Timothy Lottes (AMD), Matthaeus Chajdas (AMD)
    Let Your Game Shine – Optimizing DirectX 12 and Vulkan Performance with AMD CodeXL – Doron Ofek (AMD)
    Right on Queue: Advanced DirectX 12 Programming – Stephan Hodes (AMD), Dave Oldcorn (AMD), Dan Baker (Oxide)
    D3D12 & Vulkan: Lessons Learned – Matthaeus Chajdas (AMD)
    Advanced Techniques and Optimization of HDR Color Pipelines – Timothy Lottes (AMD)
    LiquidVR™ Today and Tomorrow – Guennadi Riguer (AMD)
    Multiplatform GPU Ray-Tracing Solutions With FireRender and FireRays – Takahiro Harada (AMD), Dmitry Kozlov (AMD)
    GPUOpen – Unlocking Game Development with Open Source – Nicolas Thibieroz (AMD), Jason Stewart (AMD), Dmitry Kozlov (AMD), Doron Ofek (AMD), Jean-Normand Bucci (Eidos Montreal) + others

    Regarding the last slide deck, I guess some people around here will be happy to read pages 48-49.
     
    Kej, Jawed, Lightman and 5 others like this.
  20. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    homerdog, Malo and Razor1 like this.