ALUs per quad on future chips

Frank

There is no easy way to stop using quads on graphics chips without suffering a major performance loss when rendering models that don't use flow control. But when we look beyond the current chips to DXNext and OGL2.0, we see something that looks a lot like a general-purpose vector processor, and not at all like the fixed-but-programmable design that made the current SIMD, shaders-and-quads approach such a hit.

The largest piece of screen real-estate in current and near-future visuals is actually covered by some kind of low-poly and/or simple shader model, which benefits tremendously from the quad pipeline design of the chips. And a quad pipe with two full ALUs generally doubles the throughput.

To go from here to a fully programmable model requires flow control, which seems to require single-pixel pipelines. But when we see that the next generation unifies the vertex and pixel shaders into general vector units, there is another possibility: use four ALUs per quad.

That gives tremendous throughput for any shader that uses simple models on large triangles, while it enables each pixel to run its own program at reduced speed.

Looking at it from a hardware (transistor) perspective, ALUs are expensive. But so are all the buffers needed to keep the pipes optimally filled! Add to that the observation that the best current quad design consists of two full ALUs and two mini-ALUs, and it is easy to see that we are close to the point where four full ALUs per quad, combined with drastically reduced buffers, becomes quite an interesting way to go, especially when unifying the vertex and pixel shaders.

What do you think?
 
DiGuru, it was my understanding that currently 4 pipes are grouped to make a quad, so that the NV40 for example actually has 8 full ALUs per quad.

Furthermore, if you increase pipeline throughput x4 and texturing latency remains roughly constant, then you need to buffer 4x as many quads to hide that latency.

Also, if you want each pixel in a quad to be truly independent of the others, you still need to duplicate control logic (and possibly instruction caches).

So as far as I can tell, your suggestion would actually increase on-chip storage requirements.

Edit: I should add that this does depend on the frequency of off-chip memory accesses. As the ratio of arithmetic to texture ops goes up, fewer quads will have to be buffered to cover a 100+ cycle off-chip memory access latency. However, my guess is that increased computation per texture access will also result in increased register usage, thus increasing per-quad storage requirements. I have no idea how these factors compare, however.
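
To put rough numbers on that trade-off, here's a toy C++ estimate. Every figure in it is invented for illustration, not taken from any real chip:

    #include <iostream>

    // Toy latency-hiding estimate (all numbers invented): to keep the ALUs
    // busy across a texture fetch, the pipe must hold enough quads in
    // flight to cover the whole fetch latency at the new issue rate.
    int main() {
        const int fetch_latency = 100;  // cycles for an off-chip texture fetch
        const int alu_per_tex   = 5;    // arithmetic instructions per texture op
        const int quads_per_clk = 4;    // issue rate after a 4x throughput bump

        // Each quad does alu_per_tex cycles of useful work, then waits
        // fetch_latency cycles; multiply by the issue rate to fill every slot.
        int quads_in_flight = (fetch_latency / alu_per_tex) * quads_per_clk;

        const int regs_per_frag = 4;    // live FP32 vec4 registers per fragment
        int bytes = quads_in_flight * 4 * regs_per_frag * 16;  // 4 fragments/quad

        std::cout << quads_in_flight << " quads in flight, "
                  << bytes / 1024 << " KiB of register storage\n";
    }

With those made-up numbers, the 4x issue rate already means 80 quads and ~20 KiB of live registers just to cover a single fetch.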
 
What you need to do is combine all the ALU units in the quad and allow the pixel shaders for each pixel to use all available units. So if one pixel shader is manipulating a texture, another pixel shader can use the free ALU units.
 
psurge said:
DiGuru, it was my understanding that currently 4 pipes are grouped to make a quad, so that the NV40 for example actually has 8 full ALUs per quad.

They have effectively 8 almost-full ALUs per quad, as long as both ALUs do the same thing for each fragment in the quad: Single Instruction, Multiple Data.
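
To be concrete about what that lockstep means, here's a minimal C++ sketch (the register names and quad layout are my own invention):

    #include <array>

    // Minimal sketch of the lockstep constraint: one decoded instruction
    // drives all four fragments of a quad, so control logic is paid for
    // once per quad rather than once per fragment.
    struct Quad { std::array<float, 4> r0, r1, r2; };  // one slot per fragment

    void mul_r0_r1_r2(Quad& q) {
        // Single Instruction, Multiple Data: the same MUL for every fragment.
        for (int frag = 0; frag < 4; ++frag)
            q.r0[frag] = q.r1[frag] * q.r2[frag];
    }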

Furthermore, if you increase pipeline throughput x4 and texturing latency remains roughly constant, then you need to buffer 4x as many quads to hide that latency.

True. This only works if you mostly use shader programs to define the surface of each pixel. But that latency is here now; it won't get worse. As we saw with the NV3x, better shader throughput delivers a much larger gain than more or better texture fetches.

Also, if you want each pixel in a quad to be truly independent of the others, you still need to duplicate control logic (and possibly instruction caches).

Not if you use 4 ALUs. You'll need the same control logic for each ALU. The difference is that you can use each ALU to process a single fragment; you just cannot process multiple fragments with the same ALU at the same time any more when you do that.

As of now, not every shader program can fully use both ALUs (and the mini-ALUs) all the time. So your speed per pixel is actually about half the throughput you have now.

Even better: when you process a branch, you can still use two ALUs for all the fragments that take one path and the other two for the fragments that take the other path. Only when (after multiple branches) all pixels follow their own path does the maximum throughput drop below half of the current throughput.
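
A toy C++ sketch of that 2+2 split (the scheduler and the shader stubs are hypothetical, just to show the grouping):

    #include <array>
    #include <vector>

    // Toy model of the 2+2 split (hypothetical scheduler, invented shader
    // stubs): fragments are grouped by which side of the branch they took,
    // and each group gets its own ALUs, so both paths run concurrently
    // instead of being serialized behind a mask as on a pure SIMD quad.
    struct Fragment { bool cond; float r0; };

    void then_path(Fragment& f) { f.r0 *= 2.0f; }   // stand-in shader code
    void else_path(Fragment& f) { f.r0 += 1.0f; }

    void run_branch(std::array<Fragment, 4>& quad) {
        std::vector<Fragment*> taken, not_taken;
        for (auto& f : quad) (f.cond ? taken : not_taken).push_back(&f);

        // Two ALUs walk the taken group while the other two walk the rest;
        // throughput only drops below half once all four fragments diverge.
        for (auto* f : taken)     then_path(*f);
        for (auto* f : not_taken) else_path(*f);
    }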

So as far as I can tell, your suggestion would actually increase on-chip storage requirements.

You only have to expand the mini-ALUs to full ALUs, while you can reduce the buffers quite a bit, as the throughput won't drop very much compared to current hardware, which is forced to switch to serial execution and creates a lot of stalls doing that.
 
How many instructions can a single ALU execute each clock? It depends.

One. But if you follow a computation with a move, that move will probably be done in the same clock pulse, as the result has to be stored anyway. Two? Well, preloading constants can be optimized as well. Three? More or less, if the right sequence of instructions is issued. Although the mini-ALUs are less sophisticated.
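
Here's a toy issue model of that counting game in C++. The folding rules are my speculation matching the description above, not a documented design:

    #include <cstdio>

    enum Op { MAD, MOV, DEF };
    struct Instr { Op op; };

    // Toy issue model (invented folding rules, not a real chip): a MOV
    // right after a computation folds into its writeback, and constant
    // DEFs are preloaded for free, so one ALU can "retire" 2-3
    // instructions per clock while doing one real computation.
    int cycles(const Instr* prog, int n) {
        int clk = 0;
        for (int i = 0; i < n; ++i) {
            if (prog[i].op == DEF) continue;                    // preloaded
            if (prog[i].op == MOV && i > 0 && prog[i - 1].op == MAD)
                continue;                                       // folded writeback
            ++clk;                                              // real ALU slot
        }
        return clk;
    }

    int main() {
        Instr prog[] = {{DEF}, {MAD}, {MOV}, {MAD}};
        std::printf("4 instructions in %d clocks\n", cycles(prog, 4));
    }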

rwolf said:
What you need to do is combine all the ALU units in the quad and allow the pixel shaders for each pixel to use all available units. So if one pixel shader is manipulating a texture, another pixel shader can use the free ALU units.

Yes, the compiler can issue the instructions out of order so that there are no dependencies between adjacent ones, as is done already to make use of multiple ALUs. But the texel has to be in the cache (L1 or L2) for the latency to be within bounds.
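
As a sketch of what that reordering looks like, here's a toy C++ scheduler that hoists fetch-independent work into the latency gap (the instruction strings are a made-up shader fragment):

    #include <cstdio>
    #include <vector>

    // Toy version of the reordering: work that doesn't depend on the TEX
    // result is hoisted into the fetch's latency gap, so the ALUs aren't
    // idle while the texel is in flight.
    struct Instr { const char* txt; bool needs_tex_result; };

    void schedule(std::vector<Instr> prog) {
        std::vector<Instr> early, late;
        for (auto& i : prog)
            (i.needs_tex_result ? late : early).push_back(i);
        for (auto& i : early) std::printf("%s\n", i.txt);  // fetch + independent ops
        for (auto& i : late)  std::printf("%s\n", i.txt);  // first use comes last
    }

    int main() {
        schedule({{"TEX r0, t0", false},     // start the fetch as early as possible
                  {"MUL r1, r0, c0", true},  // depends on the fetched texel
                  {"ADD r2, v0, c1", false},
                  {"MAD r3, r2, c2", false}});
    }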
 
OK, I think I misunderstood what you were saying.

Here is my new understanding:
- You are suggesting 1 ALU for each fragment in a quad (not 4)
- Separate control logic (instruction pointer/call stack) for each ALU
- Instead of operating in data-parallel fashion, you are proposing that each ALU executes instructions for a single fragment until it hits a long-latency operation (e.g. a texture fetch), at which point it switches to a different fragment.

Is this correct?
 
psurge said:
OK, I think I misunderstood what you were saying.

Here is my new understanding:
- You are suggesting 1 ALU for each fragment in a quad (not 4)
- Separate control logic (instruction pointer/call stack) for each ALU
- Instead of operating in data-parallel fashion, you are proposing that each ALU executes instructions for a single fragment until it hits a long-latency operation (e.g. a texture fetch), at which point it switches to a different fragment.

Is this correct?

Yes. And as it is just a small step from the current designs, it can still be SIMD, so all the ALUs can work on the whole quad as well. That gives all the speed that can possibly be reached when each fragment runs the same shader as all the others (on all 4 ALUs at the same time, if the shader program allows it), while it can fall back to executing each fragment separately without stalling parts of the pipeline.
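
Something like this toy C++ quad sequencer is how I picture it (the instruction stand-in and the "decodes" counter are inventions, just to show where SIMD ends and per-fragment sequencing begins):

    #include <array>

    struct Fragment { int pc = 0; bool done = false; };

    // Stand-in for real shader execution: run one instruction, advance the pc.
    void execute_at(Fragment& f) { f.pc += 1; f.done = (f.pc >= 8); }

    // Hybrid mode: while all four fragments share one program counter the
    // quad runs as pure SIMD (one decode feeds four ALUs); after they
    // diverge, each ALU sequences its own fragment at reduced speed.
    int run_quad(std::array<Fragment, 4>& quad) {
        int decodes = 0;  // rough stand-in for control-logic work
        while (!(quad[0].done && quad[1].done && quad[2].done && quad[3].done)) {
            bool lockstep = quad[0].pc == quad[1].pc && quad[1].pc == quad[2].pc
                         && quad[2].pc == quad[3].pc;
            if (lockstep) {
                decodes += 1;                     // SIMD: one decode, four ALUs
                for (auto& f : quad) execute_at(f);
            } else {
                for (auto& f : quad)              // MIMD fallback: each ALU
                    if (!f.done) {                // sequences its own fragment
                        decodes += 1;
                        execute_at(f);
                    }
            }
        }
        return decodes;
    }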
 
I'm not sure if this is the right thing to do, for two reasons:

1. I don't see why IHVs cannot implement a MIMD design in the pixel shader part with little effort; nVIDIA already did it in their vertex shaders. Especially considering that vertex and pixel shaders are going to converge in the not-too-far future, I personally think it's quite possible to have MIMD in both the vertex and pixel processing models.
2. Making one ALU dedicated to one pixel at a time is good, but that doesn't necessarily extend to two or four ALUs. To keep them all busy, you need to break the code into 2 or 4 parts of roughly equal size so that they can be executed in parallel. That is a pretty hard job for the compiler, and sometimes the algorithm itself is just so dependent that you can't split it at all (see the sketch below).

edit: even if SIMD is still going to dominate for a while, the chance that the pixels in a quad each take a different path is quite rare; we can tolerate pipeline stalls at that low a rate, IMHO.
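
A contrived C++ example of such an unsplittable algorithm for point 2: every iteration consumes the previous result, so no compiler can carve it into 2 or 4 roughly equal parallel parts:

    // Contrived example of an unsplittable chain: every iteration consumes
    // the previous result, so there is no way to break this into 2 or 4
    // roughly equal parts that run in parallel on separate ALUs.
    float iterate(float x, int n) {
        for (int i = 0; i < n; ++i)
            x = x * x + 0.25f;  // each step needs the last step's output
        return x;
    }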
 
I thought the vertex engine of the NV40 was SIMD, while the pixel pipes were considered MIMD.

edit: whoops, got that backwards
 