Some AMD guys have mentioned cache hits, texturing metrics, and simple registers when this was discussed before. Answers have been deliberately vague beyond confirming that performance counters exist. The ALU:fetch ratio of all scheduled waves would seem like a good start and somewhat easy to implement. Idle and stall counts for all hardware units would open up many possibilities. Basically, any of the bottlenecks AMD mentioned in their async pairings are probably valid.
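As a rough illustration of the idea (hypothetical counter names and thresholds, not AMD's actual counter interface), a scheduler could turn those counters into a coarse bottleneck hint along these lines:

```python
# Illustrative sketch only: hypothetical counter names and thresholds,
# not AMD's real performance-counter interface. Shows how an ALU:fetch
# ratio and a stall fraction could be turned into a coarse bottleneck hint.

def bottleneck_hint(alu_busy_cycles, fetch_busy_cycles, stall_cycles, total_cycles):
    alu_to_fetch = alu_busy_cycles / max(fetch_busy_cycles, 1)
    stall_fraction = stall_cycles / max(total_cycles, 1)
    if stall_fraction > 0.5:
        return "memory/export bound"   # units spend most cycles stalled
    if alu_to_fetch < 4.0:
        return "fetch bound"           # too few ALU ops per fetch to hide latency
    return "ALU bound"                 # good candidate to pair with a fetch-heavy kernel

# Example: 12000 ALU-busy cycles, 6000 fetch-busy cycles, and 3000 stalled
# cycles out of 20000 total would be flagged as fetch bound.
print(bottleneck_hint(12000, 6000, 3000, 20000))
```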
That does not speak to whether the programmer's desired responsiveness level was reached. The best results for the most progress, as measured by CU-level hardware counters, would come from a system that is completely unresponsive until all wavefronts have completed. The HWS could have made massive progress on all its shaders within 33ms, yet missed an audio wavefront's time budget by 1ms.
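A toy illustration of that distinction, with made-up numbers: aggregate throughput can look excellent while a latency-sensitive wavefront still misses its deadline.

```python
# Toy numbers only: shows how aggregate progress and per-wave deadlines can
# disagree. All queued shader work finishes comfortably inside 33 ms, yet the
# audio wavefront still completes 1 ms past its budget.

frame_budget_ms = 33.0
shader_work_done_ms = 30.0        # HWS finished all queued shader work early

audio_budget_ms = 2.0             # hypothetical time budget for the audio wave
audio_completion_ms = 3.0         # it actually completed 1 ms late

print("frame throughput ok:", shader_work_done_ms <= frame_budget_ms)  # True
print("audio deadline met:", audio_completion_ms <= audio_budget_ms)   # False
```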
It's not clear whether the HWS has had that level of access to the CU counters, or the visibility or intelligence to rate global QoS, and it's a modest core that is already running a specific workload.
The various QoS measures, reservations, and latency optimizations actually make the GPU less capable of maximizing utilization: harder timing constraints make the software and hardware more pessimistic in their metrics, since their job is to maintain quality of service in the face of uncertain future activity and non-zero adjustment latency.
If they suspend and dump active work as shown with their actual preemption, I'd consider that real-time.
Where are they showing that, the preemption diagram in their asynchronous compute marketing? That's the full preemption of the GPU's graphics context, the option AMD cites as the least desirable. It's disruptive, and it can take some time for one workload to fully vacate the GPU.
A primitive is a wave, to my understanding, and suspending a wave seems rather straightforward. It might need to flush the ROPs and caches, though.
Compute has a more deterministic wavefront count, per-CU residency, and limited per-kernel context and side effects, for which AMD freely provides preemption at an instruction level.
Geometry that takes up a large amount of screen space while running some convoluted shader is a primitive that can span most of the GPU. It depends not just on the local CU's context, but on more global graphics context information and deeply buffered context changes and versions. Switching out a wavefront there means having some way of bringing it back with all that global state still valid.
Interrupting in the middle of a triangle's processing at an instruction level is what Nvidia claims to have implemented. AMD says it has implemented some kind of preemption of undisclosed granularity, and only somewhat recently. Concerns about denial of service of the GPU due to buggy or malicious graphics code were brought up for Kaveri, and it was only with Carrizo that something was promised to have changed.
I can't imagine that with a mixture of graphics and compute we're talking about more than microseconds here.
I don't know about imagining, but the DX12 performance thread included a graphics-and-compute shader synthetic that purposely took both types beyond a few microseconds. Saying an architecture can offer real-time, or something closer to hard real-time, responsiveness is more about what it can guarantee regardless of how well-behaved the other workloads are.
What AMD seems to offer as real-time or timing-critical real-time includes having ways of removing "surprises" that affect not just the time it takes to dispatch, but the time it takes for a kernel to complete.
Preemption at a wave or kernel level doesn't stop ill-behaved concurrent workloads from interfering, hence CU reservation. If AMD's graphics preemption is coarser, then a static CU reservation avoids the need to invoke it for timing-critical loads.
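A sketch of that idea, with a made-up CU-mask layout rather than any real driver interface: reserving a couple of CUs for a timing-critical queue just means the other queues are never scheduled onto those CUs, so preemption never has to fire to protect them.

```python
# Illustrative only: a made-up CU-mask scheme, not a real AMD driver API.
# Reserving CUs for a timing-critical queue means the background queues
# simply never run on those CUs, so coarse graphics preemption never has
# to be invoked to protect the latency-sensitive work.

NUM_CUS = 64

def cu_mask(cu_indices):
    mask = 0
    for cu in cu_indices:
        mask |= 1 << cu
    return mask

reserved_for_audio = cu_mask(range(0, 2))           # CUs 0-1 held back
background_compute = cu_mask(range(2, NUM_CUS))     # everything else

assert reserved_for_audio & background_compute == 0  # no overlap, no contention
print(f"audio mask:      {reserved_for_audio:#018x}")
print(f"background mask: {background_compute:#018x}")
```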
That part confuses me, though. Simply gating off a lane should do that, as it lowers the burden on the supply. It also leaves all the lanes out of sync, which would seemingly leave only the temporal-SIMD idea of running the same instruction for up to 64 cycles. It's hard to increase clocks without taking a shared RF and other shared hardware along for the ride.
Clock gating at a lane level was noted for the original Larrabee.
Power gating or power distribution changes at a SIMD lane level haven't been noted. Clock gating can switch states more readily, but there are costs to power gating that usually leave it at a larger block level and with higher thresholds for use.
If you spend more power in the lanes that are on, then the total power usage is the same, and this apparently counts as lower stress. "Stress" apparently means any change in overall power consumption (presumably at the SIMD or maybe CU level?). The stress angle might be a red herring... For me, the key concept here is that spending more power on the remaining lanes means those lanes can run at a higher clock.
The claim seems to be in the context of reducing current spikes, and the stress is on the rail voltage between Vdd and Vss, so one possible interpretation is moderating the amount of current flow within the limits of the circuit and keeping the voltage from drooping below safe limits.
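A back-of-the-envelope version of either reading, with made-up numbers and the usual dynamic-power relation P ~ activity * C * V^2 * f: gating half the lanes frees budget that can be spent clocking the remaining lanes higher while the total draw on the rail stays roughly flat.

```python
# Back-of-the-envelope sketch with made-up numbers, using the standard
# dynamic-power relation P ~ activity * C * V^2 * f. Gating half the lanes
# frees budget that can go into a higher clock for the rest while total
# power (and so the load on the supply) stays roughly the same.

def dynamic_power(active_lanes, volts, freq_ghz, c_per_lane=1.0):
    return active_lanes * c_per_lane * volts**2 * freq_ghz

budget = dynamic_power(active_lanes=64, volts=1.0, freq_ghz=1.0)   # all lanes at 1 GHz

# Gate 32 lanes and raise frequency on the remaining 32 until the budget is used.
# With C and V held fixed, power scales linearly with f, so f can roughly double;
# in practice a higher f usually also needs a higher V, so the real gain is smaller.
boosted = dynamic_power(active_lanes=32, volts=1.0, freq_ghz=2.0)

print(budget, boosted)   # 64.0 64.0 -> same total power, higher per-lane clock
```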
There seems to be a local and regional hierarchy, and a need for some level of pipelining, prediction, or compiler-level information to determine whether gating or voltage adjustments are warranted. Excessively toggling the power gating status would increase the number of spikes and the instability on the power supply.
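One way to picture the kind of decision involved, with purely hypothetical thresholds: only gate a unit when the predicted idle time clears a break-even point, and rate-limit the toggling so the gating itself doesn't become a new source of supply noise.

```python
# Purely hypothetical thresholds: a break-even style rule for deciding when
# gating is worth it. Gating and ungating cost energy and cause their own
# current transients, so a unit is only gated when it is predicted to stay
# idle long enough to cover that cost, and toggles are rate-limited.

GATE_BREAK_EVEN_US = 50.0           # made-up: idle time needed to recoup gating overhead
MIN_GAP_BETWEEN_TOGGLES_US = 20.0   # made-up: rate limit to avoid supply ringing

def should_power_gate(predicted_idle_us, time_since_last_toggle_us):
    if time_since_last_toggle_us < MIN_GAP_BETWEEN_TOGGLES_US:
        return False                 # toggling too often just adds spikes
    return predicted_idle_us > GATE_BREAK_EVEN_US

print(should_power_gate(predicted_idle_us=10.0, time_since_last_toggle_us=100.0))   # False
print(should_power_gate(predicted_idle_us=200.0, time_since_last_toggle_us=100.0))  # True
```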