DX12 Performance Discussion And Analysis Thread

Depends on the revision of the GCN architecture.

1.0 certainly isn't; the ACEs weren't programmable at all back then.
1.1 is programmable, but the space is limited, and it is needed for the queue decoding logic.
1.2 might be able to do such a thing, but I'm not entirely sure what the HWS / "new" ACE units on Tonga and Fiji are actually doing right now.


Well, each HWS unit is supposed to work like 2 ACEs, so I don't think they can do it either.
 
Well, each HWS unit is supposed to work like 2 ACEs, so I don't think they can do it either.
They can do a lot more than that - emulating 2 legacy ACEs is just ONE possible option. Although I'm not sure that simplification is even remotely accurate regarding the "behaves like two ACEs" part. Not with all the new hardware features like preemption and the like which were added with Tonga/Fiji.

http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/10/si_programming_guide_v2.pdf
Just search for "micro engine" inside that document. The document assumes that you have loaded a firmware with PM4 decoding capabilities, but there are other options (queue formats, and corresponding firmwares) as well.

For the more recent GCN architectures, there is not only a "micro engine", but also a "micro engine compute" (MEC) which is programmable as well and handles 8x8 queues. (Recent APUs even have 2 MEC units, each handling only 4x8, but in return they may run different firmwares.)

Alternatively, have a look at the source code of the Linux radeon driver, which contains a short explanation as well:
https://cgit.freedesktop.org/~airlied/linux/tree/drivers/gpu/drm/radeon/cik.c?h=drm-next#n3895
(Thanks to @CarstenS for finding that one. The comment describes an APU, not a dGPU.)
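
Purely as an illustration of that queue layout (reading "8x8" as 8 pipes with 8 queues each is my assumption, as are all the names below - none of this is taken from the driver), flattening a (MEC, pipe, queue) triple into a single queue index would look roughly like this:

[code]
// Illustrative only: flattening (mec, pipe, queue) into a single queue index,
// using the layouts described above (assumed, not taken from the driver):
//   dGPU: 1 MEC  x 8 pipes x 8 queues
//   APU:  2 MECs x 4 pipes x 8 queues
#include <cassert>

struct ComputeQueueLayout {
    int mecs;            // number of "micro engine compute" units
    int pipes_per_mec;   // hardware pipes per MEC
    int queues_per_pipe; // hardware queues per pipe
};

constexpr ComputeQueueLayout kDgpuLayout = {1, 8, 8}; // 64 queues total
constexpr ComputeQueueLayout kApuLayout  = {2, 4, 8}; // 64 queues total

// Flatten (mec, pipe, queue) into a single global queue index.
constexpr int GlobalQueueIndex(const ComputeQueueLayout& l,
                               int mec, int pipe, int queue) {
    return (mec * l.pipes_per_mec + pipe) * l.queues_per_pipe + queue;
}

int main() {
    // Second pipe, third queue on the dGPU-style layout -> index 10.
    assert(GlobalQueueIndex(kDgpuLayout, 0, 1, 2) == 10);
    // Last queue of the second MEC on the APU-style layout -> index 63.
    assert(GlobalQueueIndex(kApuLayout, 1, 3, 7) == 63);
    return 0;
}
[/code]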
 
Well, to me the micro engines seem to be there to work across more than one block. Fiji has more blocks than Hawaii, and with the same number of ACEs (ACE + HWS) it was a way to save space while retaining the same overall "number" of ACEs.
 
The front ends that function as HWS are ACEs with an additional processor that apparently has a role in mapping and unmapping which software queues are being handled by the hardware. This would permit a program to allocate more queues than are present in hardware, or the GPU to juggle multiple programs that in aggregate exceed the non-HWS queue capacity.
This came up in the context of AMD's SR-IOV, where HWS can map queues from multiple address spaces to hardware. Remapping within an address space would seem to fall out from that.
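
To make the map/unmap idea a bit more concrete, here is a completely hypothetical sketch (none of these names correspond to real driver or firmware interfaces) of oversubscribed software queues being time-sliced onto a fixed set of hardware slots:

[code]
// Hypothetical sketch: more software queues than hardware slots, with a
// scheduler that maps/unmaps them onto the limited hardware capacity.
// Names and numbers are invented for illustration only.
#include <cstdio>
#include <vector>

constexpr int kHardwareSlots = 8; // e.g. one front end's worth of queues

struct SoftwareQueue {
    int process_id; // queues may come from different address spaces (SR-IOV case)
    int queue_id;
};

int main() {
    // 12 software queues oversubscribe the 8 hardware slots.
    std::vector<SoftwareQueue> sw_queues;
    for (int i = 0; i < 12; ++i)
        sw_queues.push_back({i / 4, i % 4}); // 3 processes with 4 queues each

    // Each "scheduling period" unmaps the previous window of software queues
    // and maps the next one onto the hardware slots.
    for (int period = 0; period < 3; ++period) {
        std::printf("period %d:\n", period);
        for (int slot = 0; slot < kHardwareSlots; ++slot) {
            const SoftwareQueue& q =
                sw_queues[(period * kHardwareSlots + slot) % sw_queues.size()];
            std::printf("  hw slot %d <- process %d, queue %d\n",
                        slot, q.process_id, q.queue_id);
        }
    }
    return 0;
}
[/code]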
 
I've been given to understand that an HWS is basically a slightly dumbed-down ACE, lacking mainly the wavefront dispatch capability of a fully configured ACE. An HWS in its current configuration supposedly manages 8 queues, just like an ACE.
 
8 hardware queues perhaps, but with AMD's hardware virtualization, and its compute driver documenting a mode that permits oversubscribing queues, the capacity to dynamically handle more than 8 per front end would need to be present.
 
Quick response queue:

https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved

Tasks submitted into this special queue get preferential access to GPU resources while running asynchronously, so they can overlap with other workloads. Because the Asynchronous Compute Engines in the GCN architecture are programmable and can manage resource scheduling in hardware, this feature can be enabled on existing GPUs (2nd generation GCN or later) with a driver update.
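
On the API side, the most likely hook for this is simply the queue priority in D3D12; that the driver routes a high-priority compute queue onto this "quick response queue" mechanism is my assumption, not something the blog post spells out. A minimal sketch:

[code]
// Minimal sketch: creating a high-priority compute queue in D3D12.
// Whether the driver maps this onto the "quick response queue" described
// above is an assumption on my part. Error handling kept to a minimum.
#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

ComPtr<ID3D12CommandQueue> CreateHighPriorityComputeQueue(ID3D12Device* device) {
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH; // ask for preferential scheduling
    desc.Flags    = D3D12_COMMAND_QUEUE_FLAG_NONE;

    ComPtr<ID3D12CommandQueue> queue;
    if (FAILED(device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue))))
        return nullptr;
    return queue;
}
[/code]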
 
I wonder if that's a combination of wavefront prioritization and wavefront preemption.
Prioritization can get a desired ratio of wave launches, but it would be less able to get the (marketing) graph's idealized transitions if there is an unlucky run of long-lived graphics wavefronts.
However, I don't think fully preemptable graphics is discussed prior to Carrizo's generation.
 
I've been given to understand that an HWS is basically a slightly dumbed-down ACE, lacking mainly the wavefront dispatch capability of a fully configured ACE. An HWS in its current configuration supposedly manages 8 queues, just like an ACE.

HWS isn't a separate block, just an optional mode that an MEC pipe can use. By default each pipe round-robins across its own 8 queues, but you can enable HWS which instead takes a "runlist" (list of processes plus per-process queues) and schedules those processes/queues onto the queues of the other MEC pipes.
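
Conceptually (the structures below are invented for illustration, not the actual firmware format), such a runlist might be thought of as something like this:

[code]
// Hypothetical illustration of a "runlist": a list of processes, each with a
// set of software queues, handed to the HWS-enabled pipe, which then decides
// which process/queue pairs get mapped onto the other pipes' hardware queues.
// The structures are invented for illustration; the real format differs.
#include <cstdint>
#include <vector>

struct QueueEntry {
    uint64_t ring_base; // GPU address of the queue's ring buffer
    uint32_t ring_size; // size of the ring buffer, in dwords
    uint32_t doorbell;  // doorbell offset used to notify the scheduler
};

struct ProcessEntry {
    uint32_t pasid;                 // process address space ID
    std::vector<QueueEntry> queues; // software queues owned by this process
};

// The whole runlist: everything the HWS pipe needs in order to schedule
// processes/queues onto the remaining pipes' 8-queue slots.
using Runlist = std::vector<ProcessEntry>;
[/code]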

Wavefront preemption only arrived with the 3rd generation (Fiji and Tonga). And I'm not sure it's being utilized by the driver yet.

We enabled it in the Sept 2015 HSA stack for Carrizo, and IIRC just enabled it for Fiji in the latest ROC stack.
 
@bridgman There is something I've been wondering for quite a while now - are the MEC units of any of the GCN generations by chance capable of monitoring the global data share for synchronization flags, rather than relying on a CPU-side scheduler?

More specifically, how do the MECs behave when they encounter a PM4 WAIT_REG_MEM packet in one of the queues? That would block only a single queue, not the entire MEC, wouldn't it?
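
For context, WAIT_REG_MEM is the PM4 packet that makes a queue poll a register or memory location until a compare condition is met. A rough sketch of how such a packet gets encoded, with the field layout as I read it from the open-source radeon/amdgpu drivers (treat the exact encodings as illustrative, not authoritative):

[code]
// Rough sketch of emitting a PM4 WAIT_REG_MEM packet that polls a memory
// location until it equals a reference value. Field layout follows my reading
// of the open-source radeon/amdgpu drivers; exact encodings are illustrative.
#include <cstdint>
#include <vector>

constexpr uint32_t kPacketType3  = 3u;
constexpr uint32_t kOpWaitRegMem = 0x3C; // PACKET3_WAIT_REG_MEM

constexpr uint32_t Packet3(uint32_t op, uint32_t count) {
    // type in bits [31:30], (payload dwords - 1) in [29:16], opcode in [15:8]
    return (kPacketType3 << 30) | ((count & 0x3FFF) << 16) | ((op & 0xFF) << 8);
}

void EmitWaitMemEqual(std::vector<uint32_t>& ring, uint64_t gpu_addr,
                      uint32_t reference) {
    ring.push_back(Packet3(kOpWaitRegMem, 5));      // 6 payload dwords follow
    ring.push_back((3u << 0) |                      // FUNCTION  = "equal to reference"
                   (1u << 4));                      // MEM_SPACE = memory, not a register
    ring.push_back(static_cast<uint32_t>(gpu_addr) & 0xFFFFFFFCu); // poll address, low bits
    ring.push_back(static_cast<uint32_t>(gpu_addr >> 32));         // poll address, high bits
    ring.push_back(reference);                      // value to compare against
    ring.push_back(0xFFFFFFFFu);                    // mask applied before comparing
    ring.push_back(4);                              // poll interval
}
[/code]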
 
Quick question guys, and I'm sorry to go off on a point tangential to the discussion, but what level of concurrency are we talking about here?

The whole discussion regarding Maxwell and its concurrent multi-engine capabilities seems to gravitate around concurrency within one SMM (or is it SMX? a cluster of 'cores'!), and the argument is that it is impossible to execute commands from multiple queues concurrently (when I say concurrently, I am categorically excluding parallelism).

Is there anything preventing asynchronous, concurrent execution of graphics + compute across the entirety of the GPU?
 
For NV, from what I have read, there is nothing; for AMD, no and yes. Because of the way AMD's cache is accessed by each ACE, depending on whether the L1 or L2 cache is involved, there will be penalties when trying to pull instructions from a block not adjacent to the ACE block that is working on that specific queue. But again, there should be no reason to hit the L2 cache if done right.
 
Are you the same razor1 I'm talking to on hardocp?

I was under the impression the main advantage of AMD's design was that the ACEs can address any execution unit, independent of adjacency. It was also my understanding that a global data share (I think that's what they called it) enables preemption at a small cost (1 cycle).
For NV, from what I have read, there is nothing; for AMD, no and yes. Because of the way AMD's cache is accessed by each ACE, depending on whether the L1 or L2 cache is involved, there will be penalties when trying to pull instructions from a block not adjacent to the ACE block that is working on that specific queue. But again, there should be no reason to hit the L2 cache if done right.
 
Yes I am.

Hmm yeah that is the penalty from using the different cache levels.

For NV hardware I'd say no, but there might be something there, because it's not well documented...
 
The whole discussion regarding Maxwell and its concurrent multi-engine capabilities seems to gravitate around concurrency within one SMM (or is it SMX? a cluster of 'cores'!), and the argument is that it is impossible to execute commands from multiple queues concurrently (when I say concurrently, I am categorically excluding parallelism).

Is there anything preventing asynchronous, concurrent execution of graphics + compute across the entirety of the GPU?
Yes, having only a single hardware queue exposed to DX12 applications, as the second Hyper-Q command processor sporting the independent compute queues is for some unknown reason not used.

So the application is always subject to cooperative software scheduling (default fallback solution) for concurrent, asynchronous execution.
 
Yes, having only a single hardware queue exposed to DX12 applications, as the second Hyper-Q command processor sporting the independent compute queues is for some unknown reason not used.

So the application is always subject to cooperative software scheduling (default fallback solution) for concurrent, asynchronous execution.

Oh yes, I remember this from your blog post :)

Right, let's assume for a minute that Hyper-Q, and the GMU it holds, isn't coming back.

Under DX11 Hyper-Q works at the driver level, so if my performance under DX12 is pretty much identical to performance under DX11, it is safe to assume the GPU is being utilized to the same extent.

This then raises the question: under what circumstances would async shaders even be beneficial to Maxwell/Kepler, other than the obvious advantages for VR with fine-grained preemption?
 
This then raises the question: under what circumstances would async shaders even be beneficial to Maxwell/Kepler, other than the obvious advantages for VR with fine-grained preemption?
Not only "would", but "are", as we can see from examples like (the canceled) Fable Legends.
IMHO there are cases in which the scheduler does a better job at keeping the render pipeline stall-free than what the developers at Lionhead Studios achieved when attempting to do it manually/statically, using only the 3D queue.

Apart from that: none.
It should even out at ±0 with the use of async shaders. Slight gains if the scheduler can avoid a stall, slight losses when the scheduler messes up. And that's about the same for every architecture not supporting parallel execution in hardware, not limited to Maxwell and Kepler.

Actually, there are two aspects where the hardware can aid with / profit from async shaders. One is the obvious parallel execution of independent command lists; the other would be hardware support for queue synchronization, avoiding the CPU roundtrip for scheduling. Both require support for multiple queues in hardware, but the features can be provided independently.
The first theoretically brings increased resource utilization, at the possible risk of cache thrashing. The second greatly reduces the penalty for synchronization.

If I'm not mistaken, GCN provides both features for the compute queues; synchronization from the 3D queue to a compute queue is limited to barriers and fences in the compute queue.
Kepler and Maxwell provide neither of these features unless the GMU is brought back to life, in which case it should actually behave quite similar to GCN. (The lack of mixed mode operation on single SMM units aside.)
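
For reference, the queue-to-queue synchronization bit is what D3D12 expresses through Signal/Wait on a fence between two command queues; whether a given GPU resolves that on its own or needs a CPU roundtrip is exactly the hardware support discussed above. A minimal sketch:

[code]
// Minimal sketch of cross-queue synchronization in D3D12: the compute queue
// waits for a fence value signaled by the graphics queue. On hardware with
// queue synchronization support this can be resolved without a CPU roundtrip;
// otherwise the software scheduler has to step in. Error handling omitted.
#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

void SyncComputeAfterGraphics(ID3D12Device* device,
                              ID3D12CommandQueue* graphicsQueue,
                              ID3D12CommandQueue* computeQueue) {
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    const UINT64 fenceValue = 1;
    // The graphics queue signals the fence once its submitted work has completed...
    graphicsQueue->Signal(fence.Get(), fenceValue);
    // ...and the compute queue stalls (GPU-side) until that value is reached,
    // so command lists submitted to it afterwards only start after that point.
    computeQueue->Wait(fence.Get(), fenceValue);
}
[/code]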
 
AMD GDC 2016 slides available for download:

Practical DirectX 12 – Programming Model and Hardware Capabilities Gareth Thomas (AMD), Alex Dunn (NVIDIA)
Vulkan Fast Paths Graham Sellers (AMD), Timothy Lottes (AMD), Matthaeus Chajdas (AMD)
Let Your Game Shine – Optimizing DirectX 12 and Vulkan Performance with AMD CodeXL Doron Ofek (AMD)
Right on Queue: Advanced DirectX 12 Programming Stephan Hodes (AMD), Dave Oldcorn (AMD), Dan Baker (Oxide)
D3D12 & Vulkan: Lessons Learned Matthaeus Chajdas (AMD)
Advanced Techniques and Optimization of HDR Color Pipelines Timothy Lottes (AMD)
LiquidVR™ Today and Tomorrow Guennadi Riguer (AMD)
Multiplatform GPU Ray-Tracing Solutions With FireRender and FireRays Takahiro Harada (AMD), Dmitry Kozlov (AMD)
GPUOpen – Unlocking Game Development with Open Source Nicolas Thibieroz (AMD), Jason Stewart (AMD), Dmitry Kozlov (AMD), Doron Ofek (AMD), Jean-Normand Bucci (Eidos Montreal) + others

About the last slides, I guess some people around here will be happy to read pages 48-49
 