AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

I felt tempted to consider custom FF hardware for this. Using just one lane seems like a complete waste to me. :D In a way x86 instruction coding is also abstract and becomes fixed-length instructions inside; the instruction decoder does that. Why not in a CU? Using a compute shader to transform the command-stream has the disadvantage that you need a temporary distinct command-buffer, which is ... really inconvenient for an implementer (edit: of Driver/ExecuteIndirect ofc, not the app guy).
Regardless, I also think it's a compute shader (currently).
Why would you go through the command list with a single lane? You would obviously process each command with its own lane and do an in-place prefix sum based on the storage each command needs. AMD GCN even has a custom GDS intrinsic for an in-place ordered sum (DS_ORDERED_COUNT). Nvidia had a good paper on how to implement the same efficiently on their GPU, but the link is now dead (http://research.nvidia.com/sites/default/files/publications/nvr-2016-002.pdf).
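For illustration, here is a minimal CPU-side sketch of that idea in C++, assuming a hypothetical variable-size command format: an exclusive prefix sum over each command's storage requirement gives every command its output offset, so all commands can be expanded in parallel (one per lane on the GPU, with the running sum kept in GDS via DS_ORDERED_COUNT instead of a host-side scan).

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical packed indirect command; field names are made up for the sketch.
struct IndirectCommand {
    uint32_t opcode;
    uint32_t payloadBytes;   // storage this command needs once expanded
};

// Exclusive scan: offsets[i] = sum of payloadBytes[0..i-1].
// On GCN the equivalent running sum could live in GDS (DS_ORDERED_COUNT),
// letting each wavefront append in submission order without a second pass.
std::vector<uint32_t> computeOutputOffsets(const std::vector<IndirectCommand>& cmds) {
    std::vector<uint32_t> sizes(cmds.size()), offsets(cmds.size());
    for (size_t i = 0; i < cmds.size(); ++i)
        sizes[i] = cmds[i].payloadBytes;
    std::exclusive_scan(sizes.begin(), sizes.end(), offsets.begin(), 0u);
    return offsets;
}
```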
 
Why would you go through the command list with a single lane? You would obviously process each command with its own lane and do an in-place prefix sum based on the storage each command needs. AMD GCN even has a custom GDS intrinsic for an in-place ordered sum (DS_ORDERED_COUNT). Nvidia had a good paper on how to implement the same efficiently on their GPU, but the link is now dead (http://research.nvidia.com/sites/default/files/publications/nvr-2016-002.pdf).
I think this might be the link.
 
That's all wrong. First of all, ACEs are not meant to resolve such situations; there is a CU reservation feature for this, which allows dedicating a particular number of CUs to timewarp in order to get it completed in time. SM reservation is the same method that has been implemented for parallel compute and graphics execution on NV's GPUs since Maxwell. So all modern GPUs, including Maxwell, can do it. Unlike reservation, fine-grained warp-level scheduling of async dispatches can't guarantee timewarp execution in time. Moreover, Pascal can also use its mid-triangle preemption feature to ensure that timewarp is executed in time, which I guess is less expensive than dedicating a number of SMs to this task, but still should guarantee timewarp execution in time for every frame.
Why not dedicate all the CUs to the task? Why limit yourself if it's time-critical? ACEs are there to suspend dispatching while another ACE starts, as already scheduled work flushes. Occupancy is artificially lowered if space needs to be reserved.

What you're describing only occurs on Nvidia due to async limitations tied to ACEs, as most hardware flushes around draw call boundaries. Go look at what AMD presented on prioritization with high-priority compute. So yes, ACEs resolve the situation quite nicely. CU reservation, while it would work the way Nvidia implemented ATW, is rather inefficient. It does nothing to minimize execution time or enable overlap. AMD was mostly using it for audio DSP work, where the load is extremely predictable and sensitive to latency.

Why would you go through the command list with a single lane? You would obviously process each command with its own lane and do an in-place prefix sum based on the storage each command needs. AMD GCN even has a custom GDS intrinsic for an in-place ordered sum (DS_ORDERED_COUNT). Nvidia had a good paper on how to implement the same efficiently on their GPU, but the link is now dead (http://research.nvidia.com/sites/default/files/publications/nvr-2016-002.pdf).
Indirect calls originate from the shader and it may be streaming them. A single lane operating as a game loop, handling dependencies, prioritization, or memory management would be possible. I doubt there is much of a gain there, but any command could be generated. It would make sense if actively polling hardware counters or something we haven't seen. A programmable command processor, perhaps?
 
Why not dedicate all the CUs to the task? Why limit yourself if it's time-critical? ACEs are there to suspend dispatching while another ACE starts, as already scheduled work flushes. Occupancy is artificially lowered if space needs to be reserved.

What you're describing only occurs on Nvidia due to async limitations tied to ACEs, as most hardware flushes around draw call boundaries. Go look at what AMD presented on prioritization with high-priority compute. So yes, ACEs resolve the situation quite nicely. CU reservation, while it would work the way Nvidia implemented ATW, is rather inefficient. It does nothing to minimize execution time or enable overlap. AMD was mostly using it for audio DSP work, where the load is extremely predictable and sensitive to latency.
The GCN scheduler could stop scheduling remaining waves of the currently running compute dispatch to make room for new waves of another compute dispatch and then simply continue scheduling waves from the previous dispatch. This does the job most of the time. However, this doesn't work in the case where the existing compute dispatch has waves already running that take a long time to finish. For example, these waves have long loops (several milliseconds to finish a single wave). In this case, you need to either reserve some room on each CU (limit waves) or reserve whole CUs. Or, if the GPU supports it, you could also write the current wave registers to memory to forcefully stop the waves and restore them later. This would however be a performance hit (it costs memory bandwidth and GPU cycles both before and after starting to do something else).
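To make that tradeoff concrete, here is a toy calculation with invented numbers (slot count and wave runtime are assumptions): with every wave slot occupied by long-running waves, a newly arrived high-priority wave has to wait for a slot to drain, while reserving slots removes the wait at the cost of occupancy.

```cpp
#include <cstdio>

int main() {
    const int    slotsPerCU = 40;   // assumed wave slots per CU
    const int    reserved   = 4;    // slots held back for high-priority work
    const double waveMs     = 3.0;  // assumed remaining runtime of each resident wave

    // No reservation: earliest issue is when the first resident wave retires.
    std::printf("no reservation  : wait ~%.1f ms, occupancy %d/%d\n",
                waveMs, slotsPerCU, slotsPerCU);

    // Reservation: issue immediately, but the reserved slots otherwise sit idle.
    std::printf("%d slots reserved: wait ~0.0 ms, occupancy %d/%d\n",
                reserved, slotsPerCU - reserved, slotsPerCU);
    return 0;
}
```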
 
The GCN scheduler could stop scheduling remaining waves of the currently running compute dispatch to make room for new waves of another compute dispatch and then simply continue scheduling waves from the previous dispatch. This does the job most of the time. However, this doesn't work in the case where the existing compute dispatch has waves already running that take a long time to finish. For example, these waves have long loops (several milliseconds to finish a single wave). In this case, you need to either reserve some room on each CU (limit waves) or reserve whole CUs. Or, if the GPU supports it, you could also write the current wave registers to memory to forcefully stop the waves and restore them later. This would however be a performance hit (it costs memory bandwidth and GPU cycles both before and after starting to do something else).
That's pretty much my understanding of what they do. ACEs being active queues and the HWS selecting which ones to dispatch waves from according to some algorithm. Over the course of a few frames any need for reservation should be apparent and the HWS probably adapts dynamically.

For time-critical scheduling they should be able to cover all scheduling appropriately. Suspend current tasks for highest priority. Prefer dispatching AND scheduling high priority over normal for some concurrency. Let the HWS pair kernels for normal concurrency, with round-robin scheduling working around bottlenecks somewhat naturally. Then still have low priority, which runs in the background, probably taking forever, and getting scheduled when no other work exists. Lines up with GCN's four priority levels at least. Vulkan uses normalized 0-1 priorities and DX12 uses normal/high/real-time, according to the queue creation documentation. The actual implementation is left to the vendors.

https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved

I'm guessing Pre-Emption is realtime, "Async Shaders" normal, and "Async with QRQ" high. QRQ is a DX11 construct as DX12/Vulkan should be able to configure queues directly.
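For reference, this is how those priorities are exposed to applications at queue creation (host-side sketch; how the driver maps them onto ACE/HWS behaviour is the vendor-specific part):

```cpp
#include <d3d12.h>
#include <vulkan/vulkan.h>

// D3D12: priority is an enum on the command queue description.
D3D12_COMMAND_QUEUE_DESC MakeHighPriorityComputeQueueDesc() {
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;  // NORMAL / HIGH / GLOBAL_REALTIME
    return desc;                                        // pass to ID3D12Device::CreateCommandQueue
}

// Vulkan: priority is a normalized float in [0,1] per queue at device creation.
VkDeviceQueueCreateInfo MakeQueueCreateInfo(uint32_t family,
                                            const float* priorities, uint32_t count) {
    VkDeviceQueueCreateInfo info = {};
    info.sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    info.queueFamilyIndex = family;
    info.queueCount       = count;
    info.pQueuePriorities = priorities;                 // e.g. { 1.0f, 0.5f }
    return info;
}
```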
 
That's pretty much my understanding of what they do. ACEs being active queues and the HWS selecting which ones to dispatch waves from according to some algorithm. Over the course of a few frames any need for reservation should be apparent and the HWS probably adapts dynamically.
Reservation is something explicitly controlled by the programmer.

I'm guessing Pre-Emption is realtime, "Async Shaders" normal, and "Async with QRQ" high. QRQ is a DX11 construct as DX12/Vulkan should be able to configure queues directly.
http://gpuopen.com/amd-trueaudio-next-and-cu-reservation/

Going by this, AMD considers CU reservation real-time.
Higher priority dispatch does not provide control over whether dispatch is possible in a timely fashion.
Preemption is important in maintaining responsiveness, but it would also help to narrow down which form of preemption is being discussed. Compute, graphics, command-list, dispatch, instruction, primitive and other levels of granularity exist with varying levels of effectiveness and cost. I am not sure in this regard if AMD has that much of a lead in some aspects. For example, graphics preemption within a triangle is something Nvidia has explicitly mentioned.

There is some delay before it kicks in and it doesn't provide determinism in how quickly the newly brought-in task completes unless AMD's preemption scheme completely switches out concurrent wavefronts.
Switching in a high-priority task with hard time constraints could still fail if it is switched into a CU that has a conflicting workflow.

CU reservation provides more predictable and global control over what is available, and how quickly it can spin up and complete. Some global resources like memory bandwidth remain out of its grasp, although AMD's positing a use case that is less bandwidth-constrained.
 
Reservation is something explicitly controlled by the programmer.
Not necessarily; the driver or OS could partition the device for whatever reason. VMs, for instance, which are out of the scope of prioritization.

Higher priority dispatch does not provide control over whether dispatch is possible in a timely fashion.
I was referring to execution resources and not CUs for concurrency. Say 6/10 waves for graphics and 4/10 reserved for compute if the HWS determined those numbers optimal. Prioritization resulting in only the high-priority waves scheduled, as opposed to the round robin of all 10. If all high-priority tasks stall, schedule lower priority.
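Something like the following is what I have in mind for the issue policy (names and the priority encoding are assumptions, not how GCN actually encodes it): runnable high-priority waves always win, otherwise it round-robins over whatever else is runnable.

```cpp
#include <optional>
#include <vector>

struct Wave { int priority; bool stalled; int id; };   // priority > 0 == "high" here

// Pick the next wave to issue this cycle from the resident set.
std::optional<int> pickNextWave(const std::vector<Wave>& resident, size_t& rrCursor) {
    // Pass 1: any runnable high-priority wave is issued first.
    for (const Wave& w : resident)
        if (w.priority > 0 && !w.stalled) return w.id;
    // Pass 2: plain round-robin over the remaining runnable waves.
    for (size_t n = 0; n < resident.size(); ++n) {
        const Wave& w = resident[(rrCursor + n) % resident.size()];
        if (!w.stalled) { rrCursor = (rrCursor + n + 1) % resident.size(); return w.id; }
    }
    return std::nullopt;   // everything stalled this cycle
}
```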

Switching in a high-priority task with hard time constraints could still fail if it is switched into a CU that has a conflicting workflow.
Could fail, but a programmer would have decided the risk was reasonable. High priority across the entire GPU may be better than real-time across a subset of reserved CUs. Real-time may not be exclusive to reserved CUs either. That could be the preemption resulting in active waves being suspended to RAM. I'd think that is equivalent to Nvidia's triangle. Still, that's a bit extreme and inefficient if full occupancy wasn't required for high utilization or if current waves were rather short.
 
Not necessarily; the driver or OS could partition the device for whatever reason. VMs, for instance, which are out of the scope of prioritization.
Unclear if that's what would be done generally, and items like VMs would throw a lot into question concerning latency or real-time behavior.
This scenario also contradicts the idea that the HWS is dynamically making these determinations.

I was referring to execution resources and not CUs for concurrency. Say 6/10 waves for graphics and 4/10 reserved for compute if the HWS determined those numbers optimal.
What metric would the HWS use for optimality, and what visibility does it have to even use that metric?

Prioritization resulting in only the high-priority waves scheduled, as opposed to the round robin of all 10. If all high-priority tasks stall, schedule lower priority.
What if letting the lower-priority dispatch through consumes enough resources that the high-priority dispatches still cannot find sufficient resources in the next evaluation period?

Could fail, but a programmer would have decided the risk was reasonable. High priority across the entire GPU may be better than real-time across a subset of reserved CUs.
It could be better and in many cases it is preferred, but it wouldn't be called real-time.

Real-time may not be exclusive to reserved CUs either. That could be the preemption resulting in active waves being suspended to RAM.
I'd think that is equivalent to Nvidia's triangle.
Nvidia made the point of indicating it could preempt within a primitive.
AMD hasn't said as much, and there's a fair amount of patent or other compute discussion where it continues to cite coarser granularities, which doesn't help if the blockage for progress is one pathological primitive.

Without committing to a precise description of the granularity of preemption and its policies, it doesn't provide the sort of increased determinism in dispatch and execution time that AMD's CU reservation provides.
 
Voltage regulation (and therefore frequency variation) per lane in a SIMD is pretty crazy. But if it's to obviate dynamic wavefront formation, using multiple SIMDs of varying widths and the horror of fragmentation, then it seems to me like a transformational step that's worth taking.
Well, there it is:

METHOD AND APPARATUS FOR PERFORMING INTER-LANE POWER MANAGEMENT


A method and apparatus for performing inter-lane power management includes de-energizing one or more execution lanes upon a determination that the one or more execution lanes are to be predicated. Energy from the predicated execution lanes is redistributed to one or more active execution lanes.

I like the idea that power from down-clocked (or off?) lanes is diverted to other lanes: to reduce stress...

Also the document recognises timeliness as a problem.

I think it pretty much covers all we've discussed. Of course that's not the same as saying any of this is coming to a GPU, soon or ever....
 
What metric would the HWS use for optimality, and what visibility does it have to even use that metric?
Some AMD guys have mentioned cache hits, texturing metrics, and simple registers when this was discussed before. Answers have been deliberately vague beyond performance counters existing. ALU:fetch of all scheduled waves would seem a good start and somewhat easy to implement. Idle and stall counts for all hardware units would open many possibilities. Basically any bottlenecks AMD mentioned in their async pairings are probably valid.
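Purely as a sketch of that kind of heuristic (none of these counter names are a documented HWS interface, and the thresholds are invented): an ALU:fetch ratio plus stall counts would be enough to guess whether the resident work pairs better with fetch-heavy or ALU-heavy compute.

```cpp
// Hypothetical per-CU counters; names and units are assumptions for the sketch.
struct CuCounters {
    unsigned long long aluBusyCycles;
    unsigned long long fetchBusyCycles;    // texture/memory pipe busy
    unsigned long long fetchStallCycles;
};

enum class Pairing { PreferFetchHeavy, PreferAluHeavy, Neutral };

Pairing suggestPairing(const CuCounters& c) {
    const double aluToFetch = double(c.aluBusyCycles) / double(c.fetchBusyCycles + 1);
    if (aluToFetch > 4.0)
        return Pairing::PreferFetchHeavy;   // ALU-bound: memory pipe has slack
    if (aluToFetch < 1.0 || c.fetchStallCycles > c.fetchBusyCycles / 2)
        return Pairing::PreferAluHeavy;     // fetch-bound: ALUs have slack
    return Pairing::Neutral;
}
```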

What if letting the lower-priority dispatch through consumes enough resources that the high-priority dispatches still cannot find sufficient resources in the next evaluation period?
Could be a concern, but through trial and error they could dial the reservation up/down each frame and test results. Even then they might be able to determine the maximum ideal occupancy for a kernel. So many variables they could use for decisions it's hard to say. With HWS, max occupancy may not exist anymore.

It could be better and in many cases it is preferred, but it wouldn't be called real-time.
If they suspend and dump active work as shown with their actual preemption I'd consider that real-time.

Nvidia made the point of indicating it could preempt within a primitive.
AMD hasn't said as much, and there's a fair amount of patent or other compute discussion where it continues to cite coarser granularities, which doesn't help if the blockage for progress is one pathological primitive.
A primitive is a wave to my understanding, and suspending a wave seems rather straightforward. Might need to flush ROPs and cache though. Perhaps a limitation of suspending graphics for more graphics? Maybe a requirement for two GCPs like the Xbox, similar to ACEs with compute? Even without that granularity, how many threads are spawning from a single primitive that it couldn't simply flush? As easily as GCN switches focus, primitive granularity doesn't seem advantageous. I can't imagine with a mixture of graphics and compute we're talking about more than microseconds here.

I like the idea that power from down-clocked (or off?) lanes is diverted to other lanes: to reduce stress..
That part confuses me though. Simply gating off a lane should do that as it lowers the burden on the supply. It also leaves all the lanes out of sync, which would seemingly leave only the temporal SIMD idea running the same instruction up to 64 cycles. Hard to increase clocks without taking a shared RF and hardware along for the ride.

Best guess is an internal class D supply in multiple stages. Voltages naturally rise as less energy is expended due to predication, with some sort of dropout circuit driving clock speed. Otherwise I'm not sure how this is any different from typical power gating and quickly ramping the power state.

If they were lowering voltages without predication I'd assume they were just equalizing performance per lane to save some power. The patent mentions pipelined predication and a few cycles, so I like my switch mode idea.
 
Are you talking about the DX-API-exposed preemption granularity for compute and graphics?
 
That part confuses me though. Simply gating off a lane should do that as it lowers the burden on the supply.
If you spend more power in the lanes that are on, then the total power usage is the same - and this apparently means less stress. "Stress" apparently meaning any change in overall power consumption (presumably at the SIMD or maybe CU level?). This stress thing might be a red herring... For me the key concept here is that spending more power on the other lanes means those lanes can run at a higher clock.
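Back-of-the-envelope version of that, using the usual dynamic-power relation P ≈ C·V²·f and made-up numbers; at fixed voltage the freed per-lane budget translates roughly linearly into clock headroom (in practice voltage has to rise with frequency, so the real gain is smaller):

```cpp
#include <cstdio>

int main() {
    const int    lanes        = 16;    // assumed SIMD width
    const int    activeLanes  = 8;     // half the lanes predicated off
    const double perLanePower = 1.0;   // arbitrary units at baseline V and f

    // Power freed by the predicated lanes is redistributed to the active ones.
    const double budgetPerActive = (lanes * perLanePower) / activeLanes;  // 2.0x

    // Naive headroom at fixed voltage, since dynamic power ~ C * V^2 * f.
    const double clockScale = budgetPerActive / perLanePower;

    std::printf("per-lane budget: %.1fx, naive clock headroom: %.1fx\n",
                budgetPerActive, clockScale);
    return 0;
}
```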

It also leaves all the lanes out of sync, which would seemingly leave only the temporal SIMD idea running the same instruction up to 64 cycles.
I linked the post containing my theory as to why that's not needed.
 
Some AMD guys have mentioned cache hits, texturing metrics, and simple registers when this was discussed before. Answers have been deliberately vague beyond performance counters existing. ALU:fetch of all scheduled waves would seem a good start and somewhat easy to implement. Idle and stall counts for all hardware units would open many possibilities. Basically any bottlenecks AMD mentioned in their async pairings are probably valid.
That does not speak to whether the programmer's desired responsiveness level was reached. The best results for the most progress as measured by CU-level hardware counters would be a system that is completely unresponsive until all wavefronts have completed. The HWS could have made massive progress on all its shaders within 33ms, but failed to meet an audio wavefront's time budget by 1ms.
It's not clear whether the HWS has had that level of access for the CU counters, or the visibility or intelligence to rate the global QoS, and it's a modest core that is already running a specific workload.

The various QoS measures, reservations, and latency optimizations actually make the GPU less capable of maximizing utilization, as the harder timing constraints mean the software or hardware are more pessimistic in their metrics since their job is to maintain quality of service in the face of uncertain future activity and non-zero adjustment latency.

If they suspend and dump active work as shown with their actual preemption I'd consider that real-time.

Where are they showing that, the preemption diagram in their asynchronous compute marketing? That's the full preemption of the GPU's graphics context, which is the one AMD cites as the least desirable choice. It's disruptive and it can take some time for one workload to fully vacate the GPU.

A primitive is a wave to my understanding and suspending a wave seems rather straightforward. Might need to flush ROPs and cache though.
Compute has a more deterministic wavefront count, per-CU residency, and limited per-kernel context and side effects, which AMD freely provisions preemption for at an instruction level.
Geometry that takes up a large amount of screen space while running some convoluted shader is a primitive that can span most of the GPU. It depends not just on the local CU's context, but on more global graphics context information and deeply buffered context changes and versions. Switching out a wavefront there means having some way of bringing it back with all that global state still valid.
Interrupting in the middle of a triangle's processing at an instruction level is what Nvidia claims to have implemented. AMD says it's implemented some kind of preemption of undisclosed granularity, and only somewhat recently. Concerns about a denial-of-service of the GPU due to buggy or malicious graphics code were brought up for Kaveri, and it was only promised with Carrizo that something might have changed.

I can't imagine with a mixture of graphics and compute we're talking about more than microseconds here.
I don't know about imagining, but the DX12 performance thread included a graphics and compute shader synthetic that purposefully took both types beyond a few microseconds. Saying an architecture can offer real-time or closer to hard real-time responsiveness is more about what it can guarantee regardless of how well-behaved the other workloads are.
What AMD seems to offer as being real-time or timing-critical real-time includes having ways of removing "surprises" that affect not just the time it takes to dispatch, but the time it takes for a kernel to complete.

Preemption at a wave or kernel level doesn't make ill-behaved concurrent workloads not interfere, hence CU reservation. If AMD's graphics preemption is more coarse, then a static CU reservation bypasses it needing to be invoked for timing-critical loads.

That part confuses me though. Simply gating off a lane should do that as it lowers the burden on the supply. It also leaves all the lanes out of sync, which would seemingly leave only the temporal SIMD idea running the same instruction up to 64 cycles. Hard to increase clocks without taking a shared RF and hardware along for the ride.
Clock gating at a lane level was noted for the original Larrabee.

Power gating or power distribution changes at a SIMD lane level hasn't been noted. Clock gating can more readily switch states, but there are costs to power gating that usually leave it at a larger block level and with higher thresholds for use.

If you spend more power in the lanes that are on, then the total power usage is the same - and this apparently means less stress. "Stress" apparently meaning any change in overall power consumption (presumably at the SIMD or maybe CU level?). This stress thing might be a red herring... For me the key concept here is that spending more power on the other lanes means those lanes can run at a higher clock.
The claim seems to be in the context of reducing current spikes and the stress is on the rail voltage from Vdd to Vss, so one possible interpretation is moderating the amount of current flow per the limits of the circuit, and keeping voltage from drooping below safe limits.

There seems to be a local and regional hierarchy, and a need for some level of pipelining, prediction, or compiler-level information to determine whether gating or voltage adjustments are warranted. Excessively toggling the power gating status would increase the amount of spikes and instability on the power supply.
 
Somebody is getting fired...

https://videocardz.com/70465/msi-damn-rx-vega-needs-a-lot-of-power

At the Tweakers forums, you can find a message from MSI's marketing director, who admits to having seen the RX Vega specs. Here's what he had to say:

[screenshot of the original post]


which translates to this:

"I’ve seen the specs of Vega RX. It needs a damn lot of power. We’re working on it, which is a start so launch is coming closer :)"​

That is consistent with the 300-375W TDP rumors for the two FE cards.

 
The various QoS measures, reservations, and latency optimizations actually make the GPU less capable of maximizing utilization, as the harder timing constraints mean the software or hardware are more pessimistic in their metrics since their job is to maintain quality of service in the face of uncertain future activity and non-zero adjustment latency.
There will always be tradeoffs. The QoS measures should be easy to disable if there is no need for them. Reserve CUs for guaranteed latency, but risk them being idle. Same with reserving waves, but you only lower occupancy as opposed to idling hardware. It's ultimately up to the developer to decide which seems the proper route. Options never hurt.

Where are they showing that, the preemption diagram in their asynchronous compute marketing? That's the full preemption of the GPU's graphics context, which is the one AMD cites as the least desirable choice. It's disruptive and it can take some time for one workload to fully vacate the GPU.
Yes, and full preemption would always be undesirable. That said, there are cases where it's the best choice. It could be compute preemption as well.

Preemption at a wave or kernel level doesn't make ill-behaved concurrent workloads not interfere, hence CU reservation. If AMD's graphics preemption is more coarse, then a static CU reservation bypasses it needing to be invoked for timing-critical loads.
I'm not saying CU reservation isn't a good option for some cases. It definitely has its uses, but will likely come at the cost of overall throughput. Throwing the entire GPU at a problem with prioritization and some contention would likely outperform a 20% reservation with a guarantee. Predictable execution that takes longer isn't ideal. Regardless, all of these options exist so a developer can decide what is best.

Power gating or power distribution changes at a SIMD lane level hasn't been noted. Clock gating can more readily switch states, but there are costs to power gating that usually leave it at a larger block level and with higher thresholds for use.
We haven't seen it, but gating off execution units in general isn't exactly new. Even clock gating should reduce power and allow higher clocks from lower demand on the system.

There seems to be a local and regional hierarchy, and a need for some level of pipelining, prediction, or compiler-level information to determine whether gating or voltage adjustments are warranted. Excessively toggling the power gating status would increase the amount of spikes and instability on the power supply.
I wonder if this is an extension of the Hovis method used with the Xbox. If they go for chiplets, multiple supply rails would be regional. I'm still of the mindset they're engineering supplies into the chips at a finer level. A FinFET and metal-layer cap could be an adequate internal supply for a lane/local rail, allowing per-lane adjustments and minimizing spikes. Having 4096+ power caps would seemingly take up some space, but they never hurt.
 
Well I guess based on the Fury X that power req is entirely plausible. I'm not sure a liquid cooled version is going to be in my price range which means focus will be on IHV designs. Glad I bought a new PSU though, my Seasonic Platinum 720w should be enough.
 