Mixed type shader units in GPUs....

Paranoid

Newcomer
First of all, hi everybody. I've been lurking here for years, and now I've finally overcome my laziness and decided to register! :D

Disclaimer: I know nothing about programming except the very basics (little more than hello-world kind of stuff) and certainly nothing about hardware design, though over time I've formed a vague and simplified idea of many terms and concepts relating to 3D graphics.

If you answer, try to explain it as siiiimply as possible. :)

On to my question:

The GeForce FX had both PS 1.1 and (rather crappy) PS 2.0 shading units. NV has since abandoned this approach, but at the time it seemed like a good idea (well, in theory) to me.
Why wouldn't it be desirable (I'm thinking mainly about consoles here) to make a hybrid GPU that has different types of shader units, maybe each even with its own precision?

For example, a lot of FP24 PS 2.0 units and a smaller pool of FP32 PS 3.0 ones for shaders that actually need them.
It seems to me that such a configuration could, in most circumstances, with the kind of techniques and shaders in use now or (I think) in the next few years, have higher performance per transistor.

I can imagine that this might make things a bit more difficult for programmers, but especially in a console, where the exact resources on the GPU are known and there are no drivers messing with the code, it could be a way to raise efficiency, since the coder can choose how to allocate the shaders: have a shader run on a PS 3.0 unit, or write a different PS 2.0 version that runs a little slower but has the larger pool of faster PS 2.0 units ready to work on it.

Or even make two versions of some shaders and run each on whichever unit type is less occupied at the moment.
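To make that last idea a bit more concrete, here's a toy sketch of what I mean (keeping in mind I'm not a programmer, so all the names, pool sizes and job counts are made up, and a real driver or GPU scheduler surely wouldn't look like this): if a shader exists in both a 2.0 and a 3.0 version, each batch of work just goes to whichever unit pool currently has fewer queued jobs per unit.

// Toy sketch of "run it on whichever unit type is less busy".
// Everything here (pool sizes, job counts) is made up for illustration.
#include <iostream>
#include <queue>
#include <string>

struct UnitPool {
    std::string name;
    int units;                 // how many shader units of this type
    std::queue<int> pending;   // jobs waiting on this pool
    // Rough "how busy am I" metric: queued jobs per unit.
    double load() const { return static_cast<double>(pending.size()) / units; }
};

int main() {
    UnitPool ps20{"PS2.0 pool", 12, {}};  // many cheaper FP24 units
    UnitPool ps30{"PS3.0 pool", 4, {}};   // few full-featured FP32 units

    // Pretend we have 20 batches of pixels whose shader exists in both versions.
    for (int job = 0; job < 20; ++job) {
        UnitPool& target = (ps20.load() <= ps30.load()) ? ps20 : ps30;
        target.pending.push(job);
        std::cout << "job " << job << " -> " << target.name << "\n";
    }
    std::cout << ps20.name << " queued: " << ps20.pending.size() << ", "
              << ps30.name << " queued: " << ps30.pending.size() << "\n";
}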

Maybe this kind of load balancing would be too problematic? Or do programmers simply not WANT to have to deal with this (not to mention that shader development seems to be moving away from coders and toward artists, who might not be able to deal with these issues)?

Sure, it'd be simpler to have a single shader model, but I think it'd be a pragmatic way of increasing efficiency and performance in real world situations in the near future.

Or maybe there are also difficulties from a hardware design point of view? Or am I even making any sense here? I'm not so sure. :oops:

Anyway, in your opinion, is there any chance that we might see such a design decision in one of the next gen console GPUs?
 
Essentially, as it is, a pixel pipeline renders a single pixel. And it is a SIMD architecture: all pipelines run the same program. If you split the pipelines in half, each half using its own set of features, you can only render half as many pixels at a time, and the other half of your pipelines sit idle.
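A crude way to see the cost is a little counting model like the one below; the pipeline and batch counts are invented purely for illustration, not taken from any real chip.

// Crude model of the point above: all pipelines run the same program (SIMD),
// so if half of them only understand "type A" shaders and half only "type B",
// a batch of one type leaves the other half idle. Numbers are made up.
#include <iostream>

int main() {
    const int pipelines = 16;
    const int batches   = 100;   // batches of pixels, all using the same shader type

    // Unified design: every pipeline can run the batch.
    long long unified_pixels = static_cast<long long>(batches) * pipelines;

    // Split design: only 8 of the 16 pipelines match the batch's shader type,
    // the other 8 sit idle while this batch is processed.
    long long split_pixels = static_cast<long long>(batches) * (pipelines / 2);

    std::cout << "unified: " << unified_pixels << " pixels processed\n";
    std::cout << "split:   " << split_pixels << " pixels processed "
              << "(other half of the pipelines idle)\n";
}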
 
DiGuru said:
Essentially, as it is, a pixel pipeline renders a single pixel. And it is a SIMD architecture: all pipelines run the same program. If you split the pipelines in half, each half using its own set of features, you can only render half as many pixels at a time, and the other half of your pipelines sit idle.

So the PS1.1 and PS2.0 units on the GFFX do NOT work in parallel? Bummer.

Well, if it was a console chip it might be a very different architecture. For example, would it be conceivable to decouple the pixel shaders from the rest of the rasterizer pipeline so that there'd be a deferred 'shading pass' that can benefit from a dynamic allocation of shader resources like the one I mentioned? Would this be too inefficient?

Bah, nevermind. I don't know what I'm talking about! :)
Thanks for the answer.

On a completely unrelated note: does anyone know if and where I might find some benchmarks comparing current GPUs with older chips, such as a GF3 Ti 200?
 
Paranoid said:
So the PS1.1 and PS2.0 units on the GFFX do NOT work in parallel? Bummer.
They do -- at least on the NV30, NV31, and NV34. There's an old thread that looked at the ALUs of those cards and showed that there were two integer units (PS1.1-1.3) and one floating point unit (PS2.x) that doubled as a texture sampling unit. Of course, they can only work in parallel if the shader program uses both integer and floating point precisions. This is only possible through NVidia's OpenGL extensions for fragment processing.

Later discussions on this forum concluded that the integer units were replaced with "mini" floating point units for the NV35.
 
I think NVIDIA has adopted this approach in the design of the NV40.

We all know that neither of the two ALUs in the NV40 is fully functional from a feature-set point of view; they work in conjunction to provide the whole feature set of SM3.0 (with some overlapping functionality). The situation becomes a little complicated with dynamic branching. Say you have ALU0 responsible for simple math ops such as mul/add/mad, and ALU1 for complex math ops and other functionality such as rsq/rcp/sincos/tex. Now you have some code like this:

do ops in ALU0 (or ALU1)
if (condition)
    do ops in ALU0;
else
    do ops in ALU1;

Now, how would you implement a design that copes with this situation efficiently? A fragment could take either of the two branches, depending on the condition it meets, so chances are you'll need to switch ALUs after the if statement is evaluated. And since each ALU has its own FIFO to support multithreading, you may also need to transfer data back and forth between them. Although I don't know exactly how costly that is, I have a bad feeling it would consume a lot of internal bandwidth and thus degrade performance.
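Just to put a rough number on that feeling, here's a little counting model (the instruction trace and the byte count are invented, not based on any real chip data): tag each instruction with the ALU it needs, count how often consecutive instructions need different ALUs, and treat every such switch as some amount of per-fragment state that has to move from one ALU's FIFO to the other's.

// Back-of-the-envelope model of the ALU-switching cost described above.
// Each instruction is tagged with the ALU that can execute it; whenever the
// next instruction needs the other ALU, the fragment's state has to move from
// one ALU's FIFO to the other's. All sizes here are invented for illustration.
#include <iostream>
#include <vector>

enum class Alu { ALU0, ALU1 };   // ALU0: mul/add/mad, ALU1: rsq/rcp/sincos/tex

int main() {
    // A shader as seen by one particular fragment after its branch is resolved:
    // which ALU each of its instructions ends up on.
    std::vector<Alu> trace = {Alu::ALU0, Alu::ALU0, Alu::ALU1,   // math + tex
                              Alu::ALU0, Alu::ALU1, Alu::ALU0};  // if/else tail

    const int state_bytes = 64;  // invented: per-fragment registers to move
    int switches = 0;
    for (size_t i = 1; i < trace.size(); ++i)
        if (trace[i] != trace[i - 1])
            ++switches;

    std::cout << switches << " ALU switches for this fragment, roughly "
              << switches * state_bytes << " bytes shuffled between FIFOs\n";
}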

Even without dynamic branching, things only get slightly better. Suppose you have a shader like this:
do ops in ALU0
do ops in ALU1
do ops in ALU0
do ops in ALU1
...(a worst-case scenario, not likely to happen in the real world, but you get the idea)
Even if the GPU can reorder the ops to avoid some of the switching, the overhead is still there. And the conclusion about splitting functionality is:
Pros: you have more units working on different parts of the program, so parallelism is improved.
Cons: the requirement for internal communication, data transfer and management also grows; how much depends on how finely you split the pipeline.
 
It made some sense to have separate PS1.1 and PS2.0 units because the former used fixed-point data and the latter used floating-point data. PS2.0 and PS3.0 both use floating point, so that removes one reason to have separate logic.

Another reason not to have separate units for multiple shader models is efficiency, or the lack thereof. There might be times when multiple shader models could be used simultaneously, but I doubt this is very useful in practice. I expect games to use one shader model at a time, so the extra hardware would just sit idle.
 
In response to 991060:

That's not the way it works in the NV40, AFAIK. Both ALUs, ALU0 (scalar SFU and vector MUL) and ALU1 (vector MAD/DP2ADD/DP3/DP4), are part of one long pipeline with a flow control unit at its end. Every quad passes both units, and either the TMU or the bypass FIFO in between, before it reaches the FCU, which then decides whether the shader continues (the quad is sent back to the start of the pipeline), the shader is done (the quad is sent to the ROP FIFO), or there is a flow control instruction and the FCU has to create batches of quads for conditional execution. You can consider both units as one big single unit that can execute a MUL+TEX+MAD per cycle (throughput).

So there is no need to "switch ALU" because every quad sequentially passes both ALUs every time.
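If it helps, here's a very rough toy model of that loop (simplified, obviously not the real hardware; the pass count is made up): every pass through the pipeline touches ALU0, then the TMU or the bypass FIFO, then ALU1, and the FCU at the end decides whether the quad loops again or retires.

// Toy model of the "one long pipeline" view above: every quad passes ALU0,
// then the TMU (or the bypass FIFO), then ALU1, and finally the flow control
// unit, which either loops the quad back or retires it. No ALU "switching" is
// needed because every pass touches both ALUs anyway. Details are simplified.
#include <iostream>

struct Quad {
    int remaining_passes;   // how many more loops through the pipeline it needs
};

int main() {
    Quad quad{3};           // pretend this shader takes 3 passes per quad
    int pass = 0;

    while (quad.remaining_passes > 0) {
        ++pass;
        std::cout << "pass " << pass << ": ALU0 (SFU / vector MUL)";
        std::cout << " -> TMU or bypass FIFO";
        std::cout << " -> ALU1 (vector MAD/DP)";
        // Flow control unit at the end of the pipeline decides what happens next.
        --quad.remaining_passes;
        if (quad.remaining_passes > 0)
            std::cout << " -> FCU: loop back to start of pipeline\n";
        else
            std::cout << " -> FCU: shader done, send quad to ROP FIFO\n";
    }
}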
 
Somewhat on the same topic: I haven't written too many shaders, but my gut feeling is that for advanced shaders there would be a use for integer ALUs alongside the FPUs in pixel shader pipelines. I haven't got the slightest idea what that would do to the control logic, but the ALUs themselves wouldn't be that big of a transistor hog. Also, AFAICS they could very well sit in parallel with the FPUs, as opposed to being another stage in the pipeline requiring larger FIFOs to hide the latency.

[edit] Any word on whether the next DX spec will include a requirement for integer processing?
 