Can AMD GPUs implement 'hardware' fixed function pipelines through firmware?

Discussion in 'Architecture and Products' started by onQ, Oct 18, 2013.

Tags:
  1. Esrever

    Regular Newcomer

    Joined:
    Feb 6, 2013
    Messages:
    594
    Likes Received:
    298
Why would that make it fixed function when the HWS is still doing work on programmable shaders, just with a different algorithm? In the scope of fixed function in hardware terms, this doesn't make much sense. If this counts as fixed function to you, then a CPU is fixed function too, because it allows changing of microcode as well.
     
  2. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,502
    Likes Received:
    10,875
    Location:
    Under my bridge
The HWS is a fixed-function scheduler whose program cannot be changed on the fly. The GPU, the work it does, and the work scheduled are programmable. The entire processing pipeline, such as performing a framebuffer reprojection, is not fixed function AFAIK because it's just a shader program. That's the area really in dispute here, and I don't think anyone other than onQ is suggesting that the operation of the GPU on scheduled work constitutes fixed function.
     
  3. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    856
    Likes Received:
    260
Indeed. A CPU's instructions are its fixed functions. And they can be changed (to some degree).
     
    Grall likes this.
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,118
    Likes Received:
    2,860
    Location:
    Well within 3d
One difference is that the CPU would generally be expected to have free rein in terms of what it can access in memory for operands and instructions.
    The CUs executing the actual shaders have a general address space that they can use in a Turing-complete manner to perform arbitrary functions.

I'm not sure how much leeway is given to the microcode engines in the front end, or to other blocks generally considered fixed-function from the standpoint of system use. They can access their microcode store and the specific regions of memory allocated by the system for the queues they monitor, but I'm not sure how much more they can access, even for reading.
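The restriction being described can be sketched as an address-range check. This is a conceptual model only, not actual AMD firmware behavior; all ranges and names are invented for illustration:

```python
# Conceptual sketch: a front-end microcode engine that may only touch
# its microcode store and the queue regions the system allocated to it,
# versus a CU with a general address space. Ranges are made up.

ALLOWED_RANGES = [
    (0x0000_0000, 0x0000_4000),  # microcode store (illustrative)
    (0x1000_0000, 0x1001_0000),  # queue region assigned by the driver
]

def engine_can_access(addr: int) -> bool:
    """True only if addr falls inside an allowed window."""
    return any(lo <= addr < hi for lo, hi in ALLOWED_RANGES)

def cu_can_access(addr: int) -> bool:
    """A CU running a shader sees a general (here, 48-bit) address space."""
    return 0 <= addr < 2**48
```

The point of the contrast: the CU check accepts almost anything, while the engine check whitelists a handful of windows.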
     
  5. Esrever

    Regular Newcomer

    Joined:
    Feb 6, 2013
    Messages:
    594
    Likes Received:
    298
But changing them to build more complicated functions, from a programmer's perspective, doesn't make any sense. If this is the definition, then why even bother specifically asking about AMD GPUs, when pretty much all modern silicon can do this? That is why drivers and BIOSes exist in the first place.
     
  6. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
They aren't executing code in the way you think they are. In operation you have to think of them as a hardwired circuit doing one thing and one thing only. There are no idle transistors, memory banks, etc. like you would encounter in a CPU. They will perform in a single cycle what a CPU likely takes many cycles to complete. The concurrency is really the big point here. Theoretically, if large enough, you'd do the equivalent of executing an entire shader in a single clock cycle. Then you'd only be able to execute that one shader.

All these things really do is a bunch of comparisons, likely not even math, outputting an outcome for another circuit or processor to make a decision. They won't even have math units or anything you would typically associate with a processor, although yes, they could emulate that functionality. Arrays of logic gates are a better way to think of these. You likely don't want them touching anything over 8 bits in size.
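A software analogy of such a comparator array might look like this. It is purely illustrative; in hardware the comparisons below would be a tree of comparators settling concurrently within a cycle, not a sequential loop:

```python
# Illustrative model of a small combinational block: a handful of
# comparisons whose single output feeds a downstream decision.

def priority_select(ages: list) -> int:
    """Pick the index of the oldest pending request.
    In silicon this is one comparator per pair, evaluated in parallel;
    here it is written as a reduction purely for readability."""
    best = 0
    for i in range(1, len(ages)):
        if ages[i] > ages[best]:  # one "comparator"
            best = i
    return best
```

Note there is no arithmetic at all, only comparisons producing an index for the next stage to act on.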
     
  7. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    856
    Likes Received:
    260
The instructions represent an abstract architecture (ISA is Instruction Set Architecture). This architecture is invented before the hardware architecture, and the hardware can implement it in any way the designers want. Some instructions have direct representations, but most don't. So you have instructions which are super fast, one silicon operation, and then some which are like 100 silicon operations long; one instruction is really an entire "program". This is all to use the hardware resources in the most efficient way, and to be able to change the underlying silicon layer independently of the abstract ISA layer.
So at the beginning of a silicon architecture you may have only 10 native mappings, and by the 5th generation you have 120 native ones, for example, while another 5 of the previously native ones are now longer because nobody used them.
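The ISA-versus-silicon split described above can be sketched as a lookup from architectural instructions to internal operation sequences. Instruction and micro-op names here are entirely made up; the point is only that the mapping can change between generations while the ISA stays fixed:

```python
# Sketch: each architectural instruction expands to one or more internal
# micro-ops, and the expansion can differ per hardware generation.

GEN1 = {
    "ADD": ["uop_add"],                       # native: one silicon op
    "DIV": ["uop_recip_est", "uop_mul",       # emulated: a small internal
            "uop_refine", "uop_mul"],         # "program" of micro-ops
}

GEN5 = {
    "ADD": ["uop_add"],
    "DIV": ["uop_div"],                       # later silicon made it native
}

def micro_op_count(isa_map: dict, instr: str) -> int:
    """How many internal operations one ISA instruction costs."""
    return len(isa_map[instr])
```

Software compiled against the ISA is untouched by this change; only the cost of `DIV` moved.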
     
  8. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    127
It is programmable, but the design contract binds it to a specific purpose, for encapsulation and security. Taking the ACE as an example: a dispatch command from user land does not need to acknowledge the fact that this unit is as programmable as your Raspberry Pi, but merely says "DISPATCH pointer_to_my_shiny_kernel 1920*1080*1". The ACEs parse it and do their work in the way defined by the microcode, without leaking any details.

I am not exactly sure about your wording, but HWS/ACE/GCP do not "do work on programmable shaders" - they know merely a pointer to the shader object code, and some configuration parameters. Particularly, the HWS is only responsible for scheduling units of work to the compute units, which are responsible for running the shaders or compute kernels. That's why it is a fixed function: to the external world, it is a self-running black box, despite it in reality perhaps being a RISC microprocessor.

Its programmability is not for general-purpose processing, but for maintenance purposes, just like your plain old BIOS or the secure boot chains in the on-chip security enclaves. They may use a full-blown microprocessor, but by design contract they only expose an interface to the outside.
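The "design contract" idea reads naturally as an interface sketch: the public surface is just a pointer and a grid size, while everything behind it stays hidden. The class and method names below are invented for illustration, not any real AMD API:

```python
# Sketch: an ACE-like unit whose entire public surface is one dispatch
# call, regardless of how programmable the machinery behind it is.

class Ace:
    def __init__(self):
        self._queue = []  # internal state, invisible to callers

    def dispatch(self, kernel_ptr: int, grid: tuple) -> None:
        """The whole contract: 'DISPATCH pointer 1920*1080*1'.
        How the work is parsed and scheduled is defined by microcode
        the caller never sees."""
        self._queue.append((kernel_ptr, grid))

ace = Ace()
ace.dispatch(0xDEADBEEF, (1920, 1080, 1))
```

From the caller's side this is indistinguishable from a hardwired block, which is exactly the encapsulation argument being made.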
     
    #48 pTmdfx, Jul 1, 2016
    Last edited: Jul 1, 2016
  9. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    127
Oh, gonna make a correction.

Particularly, the HWS is conceptually only responsible for scheduling software queues to the ACEs' hardware queue slots. The ACEs are then responsible for pushing work to the CUs (in AQL), which are in the end the guys running the shaders/kernels with a general purpose ISA.
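The corrected model (many software queues, few hardware queue slots, with the HWS multiplexing the former onto the latter) can be sketched like this. The slot count and round-robin policy are illustrative assumptions, not documented AMD behavior:

```python
# Sketch: an HWS-like mapper assigning many software queues onto a
# fixed number of hardware queue slots. Policy and counts are made up.

HW_SLOTS = 8

def map_queues(software_queues: list) -> dict:
    """Round-robin software queues onto hardware slots; each slot may
    end up time-sharing several software queues."""
    mapping: dict = {}
    for i, q in enumerate(software_queues):
        mapping.setdefault(i % HW_SLOTS, []).append(q)
    return mapping
```

With 20 software queues and 8 slots, some slots carry three queues and must be time-sliced, which is the scheduling work the HWS exists to do.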
     
  10. onQ

    onQ
    Veteran

    Joined:
    Mar 4, 2010
    Messages:
    1,540
    Likes Received:
    55
Does this make it clear, or do we still play the 'pretend onQ is crazy' game? TrueAudio has been moved from a DSP to a compute pipeline.

[image: slide]
     
  11. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,432
    Likes Received:
    261
    Implementing audio using compute is the opposite of fixed function. It's a software library. You don't do a good job explaining what your point is.
     
    ieldra and BRiT like this.
  12. onQ

    onQ
    Veteran

    Joined:
    Mar 4, 2010
    Messages:
    1,540
    Likes Received:
    55
You could always have done audio physics in compute, but because of the customized scheduling/fixed-function pipeline for asynchronous compute it can be done with fewer draw calls & at low latency.


To be clear, you're able to emulate a fixed-function co-processor/accelerator using the HWS & compute units.


    See my 1st post in the original thread from 3 years ago:


    TrueAudio being moved from a DSP to compute gives the answer to my question.
     
    #52 onQ, Jul 6, 2016
    Last edited: Jul 6, 2016
  13. Otto Dafe

    Regular

    Joined:
    Aug 11, 2005
    Messages:
    400
    Likes Received:
    59
    Well, with that wrapped up, I propose to lock the thread.
     
    Grall likes this.
  14. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Programmable hardware is never as efficient (perf/watt and/or silicon area) as fixed function hardware. But programmable hardware makes it easier to get full utilization of those transistors as it is not limited to executing single kind of work.

I wouldn't call ACEs fixed function hardware. GPU front ends are fully programmable processors with programmable memory access; you just don't have access to program them yourself, the driver team does. Traditionally, fixed function hardware tends to have fixed data inputs and outputs: for example the texture sampler, ROP (blend, depth test, HiZ), DXT block decompressor, delta color compressor, triangle backface culling, etc. These are highly performance-critical parts of the chip, making it a big perf/watt win to implement them with fixed function hardware. Hard-wiring also reduces latency compared to a programmable pipeline.

On the other hand, a GPU front end processor only needs to launch a couple of draws/dispatches per microsecond. That's a huge number of cycles. It is not worth optimizing its throughput at cycle precision, thus programmable hardware makes sense.
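The cycle-budget argument can be made concrete with back-of-the-envelope arithmetic. The 1 GHz clock figure is an assumption for illustration, not a quoted spec:

```python
# Rough arithmetic behind "that's a huge number of cycles":
# at ~1 GHz, one microsecond is ~1000 cycles, so a front end issuing a
# couple of dispatches per microsecond has hundreds of cycles per dispatch.

clock_hz = 1.0e9             # assumed 1 GHz front-end clock
dispatches_per_us = 2        # "a couple of draws/dispatches"

cycles_per_us = clock_hz * 1e-6          # 1000 cycles per microsecond
cycles_per_dispatch = cycles_per_us / dispatches_per_us
print(cycles_per_dispatch)               # 500.0
```

Hundreds of spare cycles per launch is why shaving single cycles off the front end buys nothing, while shaving cycles off a texture sampler or ROP, exercised millions of times per frame, buys a lot.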
     
    ieldra, Heinrich04, Grall and 5 others like this.
  15. onQ

    onQ
    Veteran

    Joined:
    Mar 4, 2010
    Messages:
    1,540
    Likes Received:
    55
    I also covered that in the original thread.

     
  16. onQ

    onQ
    Veteran

    Joined:
    Mar 4, 2010
    Messages:
    1,540
    Likes Received:
    55
Not really, because now the HWS are starting to be used & ACEs can be upgraded/downgraded (however you look at it) to HWS, & Neo/Scorpio is coming.

Why do I point out Neo/Scorpio? Because they will be closed platforms with HWS & extra compute units, so those can be used to create virtual co-processors/accelerators.



With the new information, do people now see what I meant about the PS4 APU being MIMD, or about a GPU being a DPU?
     
    #56 onQ, Jul 6, 2016
    Last edited: Jul 6, 2016
  17. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,502
    Likes Received:
    10,875
    Location:
    Under my bridge
I think I do, and once again your language and communication are terrible. If you hadn't used the words 'fixed function' (because they don't apply) and had instead talked about scheduling efficiency and latency, then you'd have been talking about what's going on here. Although I'm still unconvinced this is exactly what you meant to begin with, because it doesn't match your words. There's still no such thing as 'virtual coprocessors/accelerators' except in how you are personally conceptualising what's going on. For the rest of us, there's a workload, a set of programmable processors that can process it, and a means to make compute low-latency.
     
    BRiT likes this.
  18. milk

    Veteran Regular

    Joined:
    Jun 6, 2012
    Messages:
    2,897
    Likes Received:
    2,435
If this is what you meant by your original question, the answer has always been 'of course'. It should be so obvious it shouldn't be worth asking.
     
    BRiT likes this.
  19. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,171
    Location:
    La-la land
Not sure what you are trying to illustrate, or what you think that slide illustrates, in relation to your original point.

So they cut out the DSP and made TrueAudio a shader program - so what? It has nothing to do with fixed-function-ness; quite the opposite, really.
     
    BRiT likes this.
  20. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    7,715
    Likes Received:
    6,006
I think I finally understand where onQ is going with all of this. It's actually quite similar to an article I once read about how GPUs and sound chips would no longer be needed, because CPUs would continually scale to higher core counts such that the extra cores would take over the work of these dedicated hardware accelerators.

    But the reality has been the opposite and we have been moving towards more dedicated hardware accelerators and not less for our heavy lifting.

It's like saying GPUs are excellent at Bitcoin mining, but compared to an ASIC they are nowhere in the same realm of price/performance.

What you're effectively asking is whether GPUs can take on more functionality, which they could. ACEs aren't required for that to happen; they may happen to improve the saturation of the hardware, but they're not some magic bullet. Your argument would not be much different from implying that if CPUs picked up ACEs we would all of a sudden have GPU-like performance running on our CPUs, which is false.

Being able to perform a function does not imply replacement or substitution. It implies flexibility at most, at the cost of heavy inefficiency.



    Sent from my iPhone using Tapatalk
     
    milk likes this.