Can AMD GPUs implement 'hardware' fixed function pipelines through firmware?

onQ

Veteran
Mod: This is a thread started in the console space in 2013. A recent new post led to a new conversation. I think it's worth posting here so the AMD experts can explain what their hardware can and can't do. I will copy over some of the arguments from PM.

I know this sounds silly, but it seems like it's exactly what Sony is planning to do with the 8 ACEs.

There are a few things I have read over the last year or so that lead me to believe this is what they are doing. I'll try to go back and find all the quotes later, but for now I have a question.

If Sony were to configure the 64 command queues to make the pipelines emulate real fixed function pipelines, could they work just as efficiently as real fixed function hardware?

Update June 2016:

The PS4 has Hardware Schedulers, and they are the reconfigurable processors that have been talked about since even before the PS4 was revealed. This is the reason why reprojection for VR can be done with little effect on the GPU: they are able to reconfigure the HWS to run reprojection on the GPU as if it were a processor made for reprojection, by controlling how it runs on the GPU. Look back at the quotes I put in bold from Cerny in this thread.

[Image: ix3mSby.jpg, a slide focused on the HWS unit]


Those aren't fixed-function pipelines. They are 1) schedulers, not pipelines, with no application in graphics rendering other than organising the rendering pieces, and 2) programmable, not fixed-function.

For normal code it's fixed function, but the functions can be changed by low-level microcode. That's not something game devs will be able to do freely; it will be done by Sony, for adding functions like reprojection and ray-traced audio.

A single Graphics Command Processor up front is still responsible for dispatching graphics queues to the Shader Engines. So too are the Asynchronous Compute Engines tasked with handling compute queues. Only now AMD says its command processing logic consists of four ACEs instead of eight, with two Hardware Scheduler units in place for prioritized queues, temporal/spatial resource management and offloading CPU kernel mode driver scheduling tasks. These aren't separate or new blocks per se, but rather an optional mode the existing pipelines can run in. Dave Nalasco, senior technology manager for graphics at AMD, helps clarify their purpose:

"The HWS (Hardware Workgroup/Wavefront Schedulers) are essentially ACE pipelines that are configured without dispatch controllers. Their job is to offload the CPU by handling the scheduling of user/driver queues on the available hardware queue slots. They are microcode-programmable processors that can implement a variety of scheduling policies. We used them to implement the Quick Response Queue and CU Reservation features in Polaris, and we were able to port those changes to third-generation GCN products with driver updates."

Quick Response Queues allow developers to prioritize certain tasks running asynchronously without preempting other processes entirely. In case you missed Dave's blog post on this feature, you can check it out here. In short, though, flexibility is the point AMD wants to drive home. Its architecture allows multiple approaches to improving utilization and minimizing latency, both of which are immensely important in applications like VR.

http://www.tomshardware.com/reviews/amd-radeon-rx-480-polaris-10,4616.html
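To make the Quick Response Queue idea above a little more concrete, here is a toy sketch (the queue names, slot count, and priority scheme are invented, and this is not AMD microcode or any real driver API) of a scheduler mapping many user/driver queues onto a small number of hardware queue slots while favoring one latency-sensitive queue:

```cpp
// Toy sketch only: an HWS-like scheduler mapping many user/driver queues onto
// a small number of hardware queue slots, giving preference to a
// latency-sensitive "quick response" style queue. All names and numbers are
// invented for illustration.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct UserQueue {
    std::string name;
    int priority;      // higher value = more urgent
    int pendingWork;   // queued dispatch packets
};

int main() {
    // Many software queues can exist; only a few hardware slots are live at once.
    std::vector<UserQueue> queues = {
        {"graphics-frame", 0, 12},
        {"physics-async", 0, 30},
        {"vr-timewarp", 10, 1},   // latency-sensitive, quick-response style
    };
    const int hardwareSlots = 2;

    // The "scheduling policy" lives here: sort by priority and fill the slots.
    // Real HWS microcode also deals with runlists, per-process queues, and CU
    // reservation, none of which is modeled.
    std::sort(queues.begin(), queues.end(),
              [](const UserQueue& a, const UserQueue& b) { return a.priority > b.priority; });

    for (int slot = 0; slot < hardwareSlots && slot < (int)queues.size(); ++slot)
        std::printf("slot %d <- %s (%d packets pending)\n",
                    slot, queues[slot].name.c_str(), queues[slot].pendingWork);
    return 0;
}
```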
 
I am of the opinion that the HWS can only schedule work and cannot change the function of the hardware. All it can do is increase efficiency of utilisation for given tasks. OnQ is saying the above -
For normal code it's fixed function, but the functions can be changed by low-level microcode. That's not something game devs will be able to do freely; it will be done by Sony, for adding functions like reprojection and ray-traced audio.

Is there any AMD engineer who'd like to set one of us straight?
 
I am of the opinion that the HWS can only schedule work and cannot change the function of the hardware. All it can do is increase efficiency of utilisation for given tasks. OnQ is saying the above -

Is there any AMD engineer who'd like to set one of us straight?

Dave, in the quotes I sent you, is saying the same thing I said.


My Words

"Could Sony create fixed function pipelines for the PS4 even after release?

There are a few things I have read over the last year or so that lead me to believe this is what they are doing. I'll try to go back and find all the quotes later, but for now I have a question.
If Sony were to configure the 64 command queues to make the pipelines emulate real fixed function pipelines, could they work just as efficiently as real fixed function hardware?"



Dave's Words
"The HWS (Hardware Workgroup/Wavefront Schedulers) are essentially ACE pipelines that are configured without dispatch controllers."

My Words
"By creating the fixed function pipelines at the driver level once you figure out just what fixed functions you want the pipelines to be used for"

Dave's Words

"They are microcode-programmable processors that can implement a variety of scheduling policies. We used them to implement the Quick Response Queue and CU Reservation features in Polaris, and we were able to port those changes to third-generation GCN products with driver updates."
 
My Words

"Could Sony create fixed function pipelines for the PS4 even after release?

There are a few things I have read over the last year or so that lead me to believe this is what they are doing. I'll try to go back and find all the quotes later, but for now I have a question.
If Sony were to configure the 64 command queues to make the pipelines emulate real fixed function pipelines, could they work just as efficiently as real fixed function hardware?"



Dave's Words
"The HWS (Hardware Workgroup/Wavefront Schedulers) are essentially ACE pipelines that are configured without dispatch controllers."

Assuming that Sony has updated the PS4 to have HWS microcode loaded, which is a point I will be addressing later:
He's telling you that the HWS are ACE pipelines that are configured without dispatch controllers. I think in this context that means they do not perform actual command read and kernel launch themselves. They instead perform higher-level shifting around of the queues and runlists for the rest of the front ends.
I think he's trying to tell you that for the purposes of emulating a function (fixed or not), they don't do any of said "function".

My Words
"By creating the fixed function pipelines at the driver level once you figure out just what fixed functions you want the pipelines to be used for"

Dave's Words

"They are microcode-programmable processors that can implement a variety of scheduling policies. We used them to implement the Quick Response Queue and CU Reservation features in Polaris, and we were able to port those changes to third-generation GCN products with driver updates."
He's stating that they are processors that can help determine where a generic queue command will be read, when it will be read, and when it will launch.
Also, by saying those features can be back-ported to GCN3, he's not addressing GCN2 of the consoles. Perhaps it can go back that far, although that would be up to Sony to decide and disclose.
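One way to picture "ACE pipelines that are configured without dispatch controllers" is the same front-end block running in one of two modes: with a dispatch controller it reads packets and launches work; without one it only decides which queues the dispatching front ends should service next. A toy sketch along those lines (the types, mode names, and policy are invented for illustration, not a description of the silicon):

```cpp
// Conceptual sketch only: two configurations of the same front-end pipeline.
// Names and structures are invented; this is not how the hardware is organized.
#include <cstdio>
#include <deque>
#include <vector>

struct Packet { int kernelId; };          // points at a shader program elsewhere
struct SwQueue { std::deque<Packet> packets; int priority = 0; };

enum class Mode { DispatchController, SchedulerOnly };

struct FrontEnd {
    Mode mode;

    // ACE-like configuration: actually reads packets and launches them.
    void service(SwQueue& q) {
        if (mode != Mode::DispatchController) return;
        while (!q.packets.empty()) {
            std::printf("launching kernel %d\n", q.packets.front().kernelId);
            q.packets.pop_front();
        }
    }

    // HWS-like configuration: never opens a packet, only picks which queue
    // the dispatching front ends should look at next.
    SwQueue* pickNext(std::vector<SwQueue>& queues) {
        if (mode != Mode::SchedulerOnly) return nullptr;
        SwQueue* best = nullptr;
        for (auto& q : queues)
            if (!q.packets.empty() && (!best || q.priority > best->priority))
                best = &q;
        return best;
    }
};

int main() {
    std::vector<SwQueue> queues(2);
    queues[0].packets.push_back({1});
    queues[1].packets.push_back({2});
    queues[1].priority = 5;

    FrontEnd hws{Mode::SchedulerOnly};
    FrontEnd ace{Mode::DispatchController};

    if (SwQueue* q = hws.pickNext(queues))  // scheduler chooses...
        ace.service(*q);                    // ...dispatcher does the launching
    return 0;
}
```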
 
Assuming that Sony has updated the PS4 to have HWS microcode loaded, which is a point I will be addressing later:
He's telling you that the HWS are ACE pipelines that are configured without dispatch controllers. I think in this context that means they do not perform actual command read and kernel launch themselves. They instead perform higher-level shifting around of the queues and runlists for the rest of the front ends.
I think he's trying to tell you that for the purposes of emulating a function (fixed or not), they don't do any of said "function".


He's stating that they are processors that can help determine where a generic queue command will be read, when it will be read, and when it will launch.
Also, by saying those features can be back-ported to GCN3, he's not addressing GCN2 of the consoles. Perhaps it can go back that far, although that would be up to Sony to decide and disclose.


Cerny already said that it was done in hardware.


"Once we have this vision of asynchronous compute in the middle of the console lifecycle, the question then becomes, 'How do we create hardware to support it?'"

"This concept grew out of the software Sony created, called SPURS, to help programmers juggle tasks on the CELL's SPUs -- but on the PS4, it's being accomplished in hardware."

The team, to put it mildly, had to think ahead. "The time frame when we were designing these features was 2009, 2010. And the timeframe in which people will use these features fully is 2015? 2017?" said Cerny.

"Our overall approach was to put in a very large number of controls about how to mix compute and graphics, and let the development community figure out which ones they want to use when they get around to the point where they're doing a lot of asynchronous compute."
 
Cerny already said that it was done in hardware.

The "it" you are using and the concepts being discussed by the two individuals may not be the same.
HWS in particular is a more recently finalized and validated method that needed a new microcode version that could be loaded by the subset of GPUs capable of taking it via driver update. Sony writes its own software, so if or when is up to them.

The methods Cerny discussed have various vaguely defined features that might align with what HWS became, although it's not exclusive of the non-HWS method that existed for years.
That Cerny went on about 64 queues, which is something that fully-featured HWS can exceed, hints that something as complex as HWS was not in the hardware (and technically it would be in the microcode rather than the actual hardware).
Similarly, Sony's audio engineer discussed the use of the GPU as an HSA audio device back then, and it was clear at the time that there were very serious objections to using it for anything remotely latency sensitive. Additional developer disclosures for launch titles indicated rough edges that HWS and the newest management methods we have now would have handled readily. The foundational elements could have been there from the beginning, but it seems like the actual system might not have fully come together for years and so would not exist in the PS4 until it is put there--if it can.
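As a side note on the 64-queue figure: 8 ACEs with 8 queues each gives 64 hardware queue slots, and part of what an HWS-style scheduler adds is the ability to oversubscribe that limit by rotating a larger set of software queues through the slots over time. A toy illustration (the counts and the round-robin policy are made up):

```cpp
// Toy illustration of queue oversubscription: more software queues exist than
// the hardware can hold at once, so a scheduler rotates which ones occupy the
// slots over time. Slot counts and the round-robin policy are made up.
#include <cstdio>
#include <vector>

int main() {
    const int hardwareSlots = 64;    // e.g. 8 ACEs x 8 queues each
    const int softwareQueues = 128;  // more queues than fit at once

    std::vector<int> resident;       // software queues currently mapped to slots
    int next = 0;

    // Each "scheduling pass" swaps a fresh window of software queues into the slots.
    for (int pass = 0; pass < 2; ++pass) {
        resident.clear();
        for (int i = 0; i < hardwareSlots; ++i)
            resident.push_back((next + i) % softwareQueues);
        next = (next + hardwareSlots) % softwareQueues;
        std::printf("pass %d: software queues %d..%d occupy the %d hardware slots\n",
                    pass, resident.front(), resident.back(), hardwareSlots);
    }
    return 0;
}
```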
 
How does the question "Can AMD GPUs implement fixed function pipelines?"
differ from
"Can AMD GPUs run code written for fixed function pipelines?"
 
An important point of clarification we would need is a shared definition of "fixed-function".

I think this bit clarifies what OnQ is speculating.

"The PS4 has Hardware Schedulers, and they are the reconfigurable processors that have been talked about since even before the PS4 was revealed. This is the reason why reprojection for VR can be done with little effect on the GPU: they are able to reconfigure the HWS to run reprojection on the GPU as if it were a processor made for reprojection, by controlling how it runs on the GPU."
 
I think this bit clarifies what OnQ is speculating.

"The PS4 has Hardware Schedulers, and they are the reconfigurable processors that have been talked about since even before the PS4 was revealed. This is the reason why reprojection for VR can be done with little effect on the GPU: they are able to reconfigure the HWS to run reprojection on the GPU as if it were a processor made for reprojection, by controlling how it runs on the GPU."

In that case, the HWS is not being reconfigured. Its job is to take a set of commands related to what programs need to be scheduled and how important they are, then tells another front end whether it needs to start looking at that set or part of it.
That ACE will then evaluate various commands as it comes to them, many of which will involve dispatching wavefronts for a program that is pointed to by the command.

That program will be composed of standard instructions that the HWS never sees or cares about, and that the ACE does not care about.

It might happen to be a reprojection kernel.

It will probably have a minor impact because it is not a massively intensive operation relative to everything else the GPU is doing.

And versus some hypothetical dedicated hardware device for reprojection--it would in many ways not be as efficient.
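Restating that flow as a toy sketch (the types and the single-queue policy are invented; this is not real microcode or a real API): the scheduling layer only picks which queue gets attention, the dispatch layer launches whatever a packet points at, and the kernel itself, which might happen to be reprojection, is ordinary code that neither layer ever inspects.

```cpp
// Toy restatement of the flow above. The scheduler never opens a packet; the
// dispatcher launches whatever the packet points at; the kernel body is the
// only place the "function" (e.g. reprojection) exists.
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// An opaque work packet: the "function" lives entirely in the kernel body.
struct DispatchPacket {
    std::function<void()> kernel;   // e.g. a reprojection kernel, or anything else
};

using Queue = std::queue<DispatchPacket>;

// "HWS" role: decide which queue a front end should look at next.
// It never executes kernel instructions.
Queue* scheduleNextQueue(std::vector<Queue>& queues) {
    for (auto& q : queues)
        if (!q.empty()) return &q;
    return nullptr;
}

// "ACE" role: read packets from the chosen queue and launch them.
void dispatchFrom(Queue& q) {
    while (!q.empty()) {
        q.front().kernel();         // conceptually, wavefronts run on the CUs here
        q.pop();
    }
}

int main() {
    std::vector<Queue> queues(2);
    queues[1].push({[] { std::printf("reprojection kernel ran on the CUs\n"); }});

    if (Queue* q = scheduleNextQueue(queues))
        dispatchFrom(*q);
    return 0;
}
```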
 
Are we even in a GPU "thread" or a GPU thread?

Also, we need to define what "can" means. Does it mean theoretically, or does it mean "should and will", on the premise that Sony (and NOT AMD) will do so?

The original premise is that Sony will do such and such. The pedantic reality, and the only one that matters, is that it would never be Sony doing so; it would be AMD.
 
In that case, the HWS is not being reconfigured. Its job is to take a set of commands related to what programs need to be scheduled and how important they are, then tells another front end whether it needs to start looking at that set or part of it.
That ACE will then evaluate various commands as it comes to them, many of which will involve dispatching wavefronts for a program that is pointed to by the command.

That program will be composed of standard instructions that the HWS never sees or cares about, and that the ACE does not care about.

It might happen to be a reprojection kernel.

It will probably have a minor impact because it is not a massively intensive operation relative to everything else the GPU is doing.

And versus some hypothetical dedicated hardware device for reprojection--it would in many ways not be as efficient.

My layman's interpretation would be that the practical effect of these capabilities is that a GPU being sent multiple workloads would be able to perform a specific task, or class of tasks, from the total available work nearly as well as (or as well as) a GPU that was being tasked with *nothing* but that task or class of tasks. Developers have the choice to either give certain tasks the ability to preempt other, lower-priority tasks, allowing them to potentially take over all of the available processing resources, or to dedicate a certain portion of the available processing resources to a single task or class of tasks.

Now how much of that did I get wrong?
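A toy way to picture those two options (the CU count, work sizes, and reserved slice are invented numbers, not real figures): either the urgent task briefly preempts and uses every compute unit, or a small slice of compute units is set aside for it while everything else keeps the remainder.

```cpp
// Conceptual sketch of the two choices described above, with made-up numbers:
// either the urgent task preempts and briefly uses every compute unit, or a
// few compute units are reserved for it while background work keeps the rest.
#include <cstdio>

int main() {
    const double totalCUs = 18.0;          // illustrative, PS4-like CU count
    const double reservedCUs = 2.0;        // slice set aside for the urgent task
    const double urgentWork = 4.0;         // arbitrary work units
    const double backgroundWork = 180.0;   // arbitrary work units

    // Option 1: priority/preemption. The urgent task grabs all CUs first,
    // then the background work gets them back.
    double urgentPreempt = urgentWork / totalCUs;
    double backgroundPreempt = urgentPreempt + backgroundWork / totalCUs;

    // Option 2: CU reservation. The urgent task always owns its slice; the
    // background work only ever sees the remaining CUs.
    double urgentReserved = urgentWork / reservedCUs;
    double backgroundReserved = backgroundWork / (totalCUs - reservedCUs);

    std::printf("preempt : urgent %.2f, background %.2f (time units)\n",
                urgentPreempt, backgroundPreempt);
    std::printf("reserve : urgent %.2f, background %.2f (time units)\n",
                urgentReserved, backgroundReserved);
    return 0;
}
```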
 
In that case, the HWS is not being reconfigured. Its job is to take a set of commands related to what programs need to be scheduled and how important they are, then tells another front end whether it needs to start looking at that set or part of it.
That ACE will then evaluate various commands as it comes to them, many of which will involve dispatching wavefronts for a program that is pointed to by the command.

That program will be composed of standard instructions that the HWS never sees or cares about, and that the ACE does not care about.

It might happen to be a reprojection kernel.

It will probably have a minor impact because it is not a massively intensive operation relative to everything else the GPU is doing.

And versus some hypothetical dedicated hardware device for reprojection--it would in many ways not be as efficient.

Last line!

[Image: ix3mSby.jpg, a slide focused on the HWS unit]
 
My layman's interpretation would be that the practical effect of these capabilities is that a GPU being sent multiple workloads would be able to perform a specific task, or class of tasks, from the total available work nearly as well as (or as well as) a GPU that was being tasked with *nothing* but that task or class of tasks.
The concurrent scheduling and dispatch prior to all of this was about allowing for handling of multiple workloads and increasing utilization of spare resources. The resources are abstracted away from any given task, and the GPU is generally self managing internally.
HWS has functions relating to mapping those tasks to a front end, while allowing for quality of service between tasks and different virtual memory spaces.
The older methods, with limited prioritization, no preemption, and no reserved resources, meant that some workload types simply could not accept that the GPU would not get to them until it decided to, regardless of time-sensitivity. With those newer mechanisms, the GPU is no longer completely unacceptable for such work, or as vulnerable to an OS timeout, but individual tasks are going to perceive some amount of degradation as a result.

Last line!
That's a slide specifically focused on the HWS unit, in the context of coordinating, scheduling, and prioritization.
Physically, it is a set of custom or semi-custom processors that run a compact program or set of programs whose job is to coordinate, schedule, and prioritize tasks. They aren't going to load a reprojecting shader into their limited local memory and run it. They do not share the functional capabilities or ISA needed to do so. Anyone can write such a shader, and there are plenty of architectures that do so without HWS. A game can write a shader and ask the GPU to run it at a given priority with some possibly pre-allocated resources.
If there is a modification to the functionality that HWS provides, it is probably new ways for them to coordinate, schedule, and prioritize tasks whose actual nature they do not know so that the ACEs can pass those tasks (that they also do not really know or care about the actual nature of) for dispatch to the CUs.
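As a toy illustration of the quality-of-service point a few posts up (the durations are invented and there is no real API here): with a purely in-order front end, a late-arriving, time-critical job waits behind everything already queued, while with prioritization it only waits until the next scheduling boundary and the jobs it jumped ahead of slip slightly.

```cpp
// Toy model (invented numbers, no real API) of the quality-of-service point:
// FIFO-only service makes a late, time-critical job wait behind everything
// queued; prioritization lets it run at the next boundary, while the earlier
// jobs finish slightly later.
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    // Durations (ms) of work already sitting in the queue when the
    // time-critical job (say, reprojection before vsync) shows up.
    std::vector<double> queuedMs = {3.0, 5.0, 4.5, 2.5};
    const double urgentMs = 0.5;

    // Old behaviour: strictly in order, so the urgent job waits for all of it.
    double fifoWait = std::accumulate(queuedMs.begin(), queuedMs.end(), 0.0);

    // With prioritization: the urgent job starts after the current job's next
    // boundary (here, the first queued item), and everyone behind it slips by
    // roughly the urgent job's duration.
    double priorityWait = queuedMs.front();
    double othersDelay = urgentMs;

    std::printf("FIFO     : urgent waits %.1f ms\n", fifoWait);
    std::printf("priority : urgent waits %.1f ms, other jobs slip ~%.1f ms\n",
                priorityWait, othersDelay);
    return 0;
}
```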
 
The concurrent scheduling and dispatch prior to all of this was about allowing for handling of multiple workloads and increasing utilization of spare resources. The resources are abstracted away from any given task, and the GPU is generally self managing internally.
HWS has functions relating to mapping those tasks to a front end, while allowing for quality of service between tasks and different virtual memory spaces.
The older methods, with limited prioritization, no preemption, and no reserved resources, meant that some workload types simply could not accept that the GPU would not get to them until it decided to, regardless of time-sensitivity. With those newer mechanisms, the GPU is no longer completely unacceptable for such work, or as vulnerable to an OS timeout, but individual tasks are going to perceive some amount of degradation as a result.

Thanks. Actually, the first phrase that came to my mind when trying to think of a way to describe this was "QoS mechanism". Would you mind going into the bolded a little more?
 
The clarification on terms comes from the posit that Sony is using the HWS to enable...
this is the reason why reprojection for VR can be done with little effect on the GPU: they are able to reconfigure the HWS to run reprojection on the GPU as if it were a processor made for reprojection, by controlling how it runs on the GPU.
Therefore, fixed function means a custom ASIC purposefully created for the task of reprojection (or audio raytracing).
 
ACEs, HWS, etc. are programmable logic devices. So they get initialized as a specific circuit, effectively fixed function. Their program is effectively concurrent in execution. Since GCN 1.2 they all appear roughly identical in capabilities, with some combination of devices: 2 ACEs = 1 HWS = something for HSA, etc.
 