Can AMD GPUs implement 'hardware' fixed function pipelines through firmware?

Thanks. Actually, the first phrase that came to my mind when trying to think of a way to describe this was "QoS mechanism". Would you mind going into the bolded a little more?
Preemption has a non-zero cost: time, plus non-workload consumption of resources for bookkeeping and for running the special subroutines that move data and execution context out of the way and later ramp them back up. It effectively injects a second startup and flush in the middle of the workload.
For the graphics pipeline, the priority queue slides show where the graphics workload slopes down, leaving resources it could otherwise be using idle, but the compute portion cannot start until the last bit of graphics execution is out of the way.
Context switching for compute isn't as global a switch, but individual wavefronts and kernels must move data in and out rather than run their own code, and can tie up a CU for a while, even for unrelated wavefronts on the same CU.
In either case, if it weren't for time pressure, the GPU probably would have waited a while and filled in slots as they eventually opened up. This assumes there aren't extremely long-lived wavefronts, or, in another scenario, malicious ones trying to DoS the system.

Reserving CUs keeps the GPU from using all its resources for the problem at hand, in favor of keeping them free for a workload that might not need them for a while.

Prioritization does mean the GPU starts picking winners and losers when it comes to competing for a CU, so the losers will see their wavefront launch rate drop.
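
For a concrete feel for that last trade-off, here is a minimal sketch using CUDA stream priorities as a stand-in (an analogous mechanism, not AMD's implementation; the kernel names are hypothetical). Note that priorities only influence which pending work gets free slots as they open up; running wavefronts are not preempted:

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-ins for the two competing workloads.
__global__ void latencySensitiveKernel() { /* e.g. audio or reprojection work */ }
__global__ void bulkKernel()             { /* long-running background work */ }

int main() {
    // Query the supported priority range (a smaller number = higher priority).
    int leastPrio, greatestPrio;
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

    // One high-priority and one low-priority stream. When both have pending
    // work, the scheduler hands free execution slots to the high-priority
    // stream first, so the low-priority stream's launch rate drops.
    cudaStream_t hiPrio, loPrio;
    cudaStreamCreateWithPriority(&hiPrio, cudaStreamNonBlocking, greatestPrio);
    cudaStreamCreateWithPriority(&loPrio, cudaStreamNonBlocking, leastPrio);

    bulkKernel<<<4096, 256, 0, loPrio>>>();           // saturate the machine
    latencySensitiveKernel<<<32, 256, 0, hiPrio>>>(); // fills slots as they open

    cudaDeviceSynchronize();
    cudaStreamDestroy(hiPrio);
    cudaStreamDestroy(loPrio);
    return 0;
}
```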
 

I had a vague sense of this, but didn't really have a complete picture. Thanks again for the detailed explanation.
 
Just like the last time I was banned, this came from a post I made in a different thread: my post was taken out of that thread and made into its own thread by a mod, without full context, which caused confusion.

Given the responses already given in this thread, it might be helpful for you to state your current understanding of what, specifically, these schedulers enable. Fair?
 
The HSA Architected Queuing Language is a great illustration of the expected role of an ACE. It is true that they run microcode and hence can be reprogrammed. But if you look closely at the hardware, it appears to be more about not hardwiring the command packet parser, so that it remains patchable for whatever reason (AQL support, and hey - CPU microcode :-]). It is supposedly still bound to the limited set of hardware controls you have, as compared to the graphics front-end. For the HWS, however, it is likely about the ability to update the scheduling algorithm.
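
For reference, this is the sort of thing that packet parser consumes - the AQL kernel dispatch packet, a fixed 64-byte format from the public HSA runtime headers (sketched here with pointer and signal fields shown as raw 64-bit handles):

```cuda
#include <stdint.h>

// AQL kernel dispatch packet, 64 bytes, per the public HSA runtime headers.
// A microcoded parser means this format can be extended or fixed up with a
// firmware update rather than a silicon respin.
typedef struct hsa_kernel_dispatch_packet_s {
    uint16_t header;                // packet type, barrier bit, fence scopes
    uint16_t setup;                 // number of grid dimensions
    uint16_t workgroup_size_x;
    uint16_t workgroup_size_y;
    uint16_t workgroup_size_z;
    uint16_t reserved0;
    uint32_t grid_size_x;
    uint32_t grid_size_y;
    uint32_t grid_size_z;
    uint32_t private_segment_size;  // per-work-item scratch, in bytes
    uint32_t group_segment_size;    // LDS allocation, in bytes
    uint64_t kernel_object;         // handle of the code object to execute
    uint64_t kernarg_address;       // pointer to the kernel argument buffer
    uint64_t reserved2;
    uint64_t completion_signal;     // hsa_signal_t handle, signaled when done
} hsa_kernel_dispatch_packet_t;
```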

In the case of VR reprojection, async timewarp or ray-traced audio, these are all built upon the compute capabilities AFAIK. Since what they demand from the GPU is the same - prioritisation and QoS - supporting such features by patching the ACE/HWS should not be considered making the ACE/HWS "emulate fixed-function pipelines". They are still general-purpose units that are designed to take anything (incl. the stuff you mentioned) thrown at them, just perhaps with some difference in QoS parameters.
 
Why are people trying to say programmable hardware somehow makes it fixed function, when they are the complete opposite of each other?
 
In the case of VR reprojection, async timewarp or ray-traced audio, these are all built upon the compute capabilities AFAIK. Since what they demand from the GPU is the same - prioritisation and QoS - supporting such features by patching the ACE/HWS should not be considered making the ACE/HWS "emulate fixed-function pipelines". They are still general-purpose units that are designed to take anything (incl. the stuff you mentioned) thrown at them, just perhaps with some difference in QoS parameters.
Yeah. Async timewarp shouldn't need any hardware patching. All you need is (properly working) high priority compute queues. Same for audio.
 
In general, FF HW is easily more efficient than programmable HW at the specific task it was designed for. Programmable HW can still win if it is used to introduce smarter algorithms and/or to address under-provisioned FF HW. The perfect example of this issue is GCN's weak geometry pipeline, which developers are fixing using compute + improved triangle culling schemes. In fact, AMD fixed some of the GCN culling issues in Polaris.
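
A minimal sketch of what such a compute culling pass looks like, written in CUDA purely for illustration (buffer names are hypothetical; real GCN implementations use the platform's own shader languages and typically add cluster and sub-pixel culling on top):

```cuda
#include <cuda_runtime.h>

// Each thread tests one triangle in screen space and compacts survivors into
// an output index list, so the fixed-function front-end never sees the rest.
// Assumes counter-clockwise front faces; area <= 0 also drops degenerates.
__global__ void cullTriangles(const float2* screenPos,   // post-transform positions
                              const uint3*  tris,        // input vertex indices
                              uint3*        outTris,     // surviving triangles
                              unsigned int* outCount,    // atomic survivor counter
                              int numTris)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTris) return;

    uint3  tri = tris[t];
    float2 a = screenPos[tri.x], b = screenPos[tri.y], c = screenPos[tri.z];

    // Signed area of the triangle: negative = back-facing, zero = degenerate.
    float area = (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
    if (area <= 0.0f) return;

    outTris[atomicAdd(outCount, 1u)] = tri;   // stream-compact the survivors
}
```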
 
Yeah. Async timewarp shouldn't need any hardware patching. All you need is (properly working) high priority compute queues. Same for audio.

Which is exactly why they use hardware scheduling:



Quick Response Queue
Today’s graphics renderers provide many opportunities to take advantage of asynchronous processing, but for some applications the lack of determinism in terms of when certain tasks are executed could diminish the benefits. In these cases the renderer needs to know that a given task will be able to start and complete within a certain time frame.

In order to meet this requirement, time-critical tasks must be given higher priority access to processing resources than other tasks. One way to accomplish this is using pre-emption, which works by temporarily suspending other tasks until a designated task can be completed. However the effectiveness of this approach depends on when and how quickly an in-process task can be interrupted; task switching overhead or other delays can impact responsiveness, and potentially manifest as stuttering or lag in graphics applications.

To address this problem, we have introduced the idea of a quick response queue. Tasks submitted into this special queue get preferential access to GPU resources while running asynchronously, so they can overlap with other workloads. Because the Asynchronous Compute Engines in the GCN architecture are programmable and can manage resource scheduling in hardware, this feature can be enabled on existing GPUs (2nd generation GCN or later) with a driver update.

https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
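
Building on the earlier priority-stream sketch, the determinism angle can be checked directly: time the latency-critical dispatch with events while bulk work is in flight and compare against the frame budget (again a CUDA analogue of a quick response queue, not AMD's actual feature; kernel names are hypothetical):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bulkKernel()     { /* stand-in for a heavy background pass */ }
__global__ void timewarpKernel() { /* stand-in for the time-critical pass */ }

int main() {
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t bulk, urgent;
    cudaStreamCreateWithPriority(&bulk,   cudaStreamNonBlocking, least);
    cudaStreamCreateWithPriority(&urgent, cudaStreamNonBlocking, greatest);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    bulkKernel<<<4096, 256, 0, bulk>>>();       // keep the GPU busy

    cudaEventRecord(start, urgent);
    timewarpKernel<<<64, 256, 0, urgent>>>();   // overlaps instead of preempting
    cudaEventRecord(stop, urgent);

    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // At 90 Hz the whole frame is ~11.1 ms, so the time-critical pass has to
    // land well inside that even with the bulk workload still running.
    printf("time-critical pass: %.3f ms alongside bulk work\n", ms);
    return 0;
}
```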
 
Which would count as an optimisation, but not creating a fixed-function pipeline, nor emulating a fixed-function pipeline, nor achieving the same performance/efficiency as a fixed-function processor.
 
Given the responses already given in this thread, it might be helpful for you to state your current understanding of what, specifically, these schedulers enable. Fair?
@onQ please respond to the above, and do so without using out of context snippets/quotes.
 

This thread has a discussion going about the subject. Do you want to participate in that discussion or not? If your statements are being presented out of context, provide context. If people are misinterpreting the ideas you're trying to convey based on how you've worded things in the past, word them differently now so that we can properly interpret them. Show that you understand how what you initially posted and what this actually is are different, based on the responses people have given you in this thread. Then we can move forward.
 
so I am nowadays just a lazy reader here but... WTF does this thread mean?
Is it something like "can AMD reprogram their microcode engine(s) to let them do something weird and different via a software update"?
Or something more like "can AMD reprogram their additional components with maybe FF stuff to behave differently - i.e. add H.265 to cards that support only H.264"?
Or anything else???
 
Why are people trying to say programmable hardware somehow makes it fixed function, when they are the complete opposite of each other?
Because they are identical in operation, with the only difference being the number of transistors to implement the circuit. Not sure why you'd think they are the complete opposite of each other. For all intents and purposes they are hardwired circuits with concurrent execution occurring in a single clock cycle. They just have the ability to be modified at device initialization.

so I am nowadays just a lazy reader here but... WTF does this thread mean?
Is it something like "can AMD reprogram their microcode engine(s) to let them do something weird and different via a software update"?
Or something more like "can AMD reprogram their additional components with maybe FF stuff to behave differently - i.e. add H.265 to cards that support only H.264"?
Or anything else???
Possible, but the hardware requirements to do something that complex likely wouldn't make it worthwhile. The decoding logic likely stays the same beyond the new formats; there may be changes to the bitrate, or new constraints that would not fit in the hardware. In most cases such programmable blocks are used for industrial control systems, where each application is different: an engineer can reprogram them to suit his needs without having to fabricate a chip for every situation. Real-time scheduling of GPUs is another use for them.
 
Because they are identical in operation, with the only difference being the number of transistors to implement the circuit. Not sure why you'd think they are the complete opposite of each other. For all intents and purposes they are hardwired circuits with concurrent execution occurring in a single clock cycle. They just have the ability to be modified at device initialization.
Not sure you know what fixed function means.
 
Not sure you know what fixed function means.
Unable to be changed, performing one specific task. Just like these things are doing. Unless you think flashing a BIOS a billion times a second counts as programmable? For all intents and purposes these things become hardwired circuits once the device is initialized. They won't be bouncing between different capabilities at runtime.
 
Unable to be changed, performing one specific task. Just like these things are doing. Unless you think flashing a BIOS a billion times a second counts as programmable? For all intents and purposes these things become hardwired circuits once the device is initialized. They won't be bouncing between different capabilities at runtime.
What makes you think these things require a BIOS flash? They are just multipurpose hardware abstracted by a driver layer.
 
What makes you think these things require a BIOS flash? They are just multipurpose hardware abstracted by a driver layer.
They wouldn't require a BIOS flash, but that's a rough equivalent. The microcode could be substituted the same way it is for CPUs and many other devices when initialized. In most cases I've seen, that occurs when booting the operating system. The point being, it likely won't be occurring during operation or frequently. Any change would likely require resetting the device.
 
They wouldn't require a BIOS flash, but that's a rough equivalent. The microcode could be substituted the same way it is for CPUs and many other devices when initialized. In most cases I've seen, that occurs when booting the operating system. The point being, it likely won't be occurring during operation or frequently. Any change would likely require resetting the device.
Why would you need to change the microcode to do any of the functions you'd want to do with this? The hardware isn't fixed function, so just use it as such. Or do you want to partition half the gpu to do pixel shading and half to do vertex shading for no reason?
 
Not sure you know what fixed function means.
Anarchist4000 is right. Fixed function can still be programmed, like an FPGA. Once established, it functions as a static processor.

The issue with the AMD concept is that the processing of the workload, happening in the shaders, is programmable. It's the HWS that's 'fixed function', although programmable via update. The initial posit was/is unclear about what exactly is meant by 'fixed function', with the implication definitely being that the workload processing (graphics tasks) is operating as if on fixed-function specialist processors, and presently OnQ has declined to clarify his position.
 
Why would you need to change the microcode to do any of the functions you'd want to do with this? The hardware isn't fixed function, so just use it as such. Or do you want to partition half the gpu to do pixel shading and half to do vertex shading for no reason?
Again, Anarchist4000 is talking about the HWS being 'flashable' yet fixed function. You're talking about the GPU shaders being programmable.
 