triangle- and pixel-independent pipelines

psurge

This was originally mentioned in one of those huge NV30 threads. I believe the claim was that having pixel pipelines capable of
rendering arbitrary pixels inside a triangle, or possibly even pixels from different triangles, would 1) cause texture cache efficiency to go down, and 2) require lots of extra logic over a K-pipeline block set up to render an NxM pixel grid.

Example - 4 pipelines set up to render 2x2 tiles, or 3dlabs' P10, which looks like its 64 texture coordinate processors and 64 integer pixel shader units are set up to render 8x8 blocks of pixels.

Well, what happens to these units when you render really small triangles, ~1 pixel, with complex shaders (i.e. if you are using your [V/G/?]PU to accelerate offline rendering)? It seems like you'd basically be wasting all but 1 pipeline for however many cycles the shader takes to execute.
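To put a rough number on that waste, here's a trivial sketch (Python, numbers invented purely for illustration):

```python
# Illustrative only: utilization of a block of `pipelines` lockstepped
# pixel pipelines when a tiny triangle covers only `covered` pixels.
# The block sizes below are taken from the examples in this thread.

def quad_utilization(covered, pipelines=4):
    """Fraction of pipeline-cycles doing useful work on one block."""
    return covered / pipelines

# A ~1-pixel triangle on a 2x2 (4-pipeline) block: 3 of 4 pipes idle
# for however many cycles the shader runs.
print(quad_utilization(1, 4))   # 0.25
# An 8x8 (64-unit) arrangement like the P10 example is hit even harder:
print(quad_utilization(1, 64))  # 0.015625
```

The point being that the waste scales with block size and shader length, not with anything the application can control.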

So I was thinking about how to allow pipelines to be a little more flexible. I'm no hardware designer, but it doesn't seem like it would be all that difficult:

The rasterizer still outputs NxM blocks of pixels - for each pixel covered by the triangle, do Z and stencil testing. If the pixel passes, store the pixel and its associated shader inputs in a FIFO buffer (maybe in some sort of grid arrangement if possible). Beef up the rasterization stage so that it can output more than 1 NxM pixel block per cycle. So long as different tris share the same pixel program, AFAICS it doesn't matter which tri these blocks come from - just insert the pixel program inputs into the buffer. Every time the K (=NxM) pipelines finish a batch of pixels, they grab up to K new pixels from the buffer for processing. Seems like this would help pipeline utilization significantly...
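A toy simulation of that scheme (Python; my own sketch of the idea, not any real hardware, with invented cycle counts):

```python
from collections import deque

def cycles_quad_locked(quads, shader_cycles, k=4):
    """Baseline: each NxM block occupies all k pipelines for the full
    shader duration, regardless of how many of its pixels are covered.
    `quads` is a list of covered-pixel counts, one per block."""
    return len(quads) * shader_cycles

def cycles_fifo(quads, shader_cycles, k=4):
    """Proposed scheme: covered pixels from any block (same pixel
    program) go into one FIFO; the k pipelines grab up to k pixels
    per batch."""
    fifo = deque()
    for covered in quads:
        fifo.extend(range(covered))   # enqueue each surviving pixel
    batches = -(-len(fifo) // k)      # ceiling division
    return batches * shader_cycles

# 16 one-pixel triangles, a 20-cycle shader, 4 pipelines:
quads = [1] * 16
print(cycles_quad_locked(quads, 20))  # 320 cycles
print(cycles_fifo(quads, 20))         # 80 cycles (4 pixels per batch)
```

In this (deliberately worst-case) example the FIFO version approaches full utilization; real gains would obviously depend on triangle sizes and how deep the buffer can be.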

It would also help if you went so far as to allow data-dependent loops/branches in the pixel shader stage - I think you would need an i-cache per pipeline - but you wouldn't have to wait for all pixels in the block you are processing to finish before starting on a fresh set of pixels.

Comments?

Regards,
Serge

P.S. I'm no hardware designer, I'd definitely be interested to hear about why this isn't a great idea...
 
As I've mentioned before, this comes down to a tradeoff: transistor cost to implement vs. the performance benefit realised by the implementation.

The problem is that, yes, in software terms this is a reasonably simple thing to do.
Hardware is unfortunately somewhat different, and implementing any algorithm that requires conditional execution has a tendency to greatly increase the complexity of the logic.

There are also some calculation benefits that can be exploited if you know you'll always be working on a block of pixels.

I think a more likely approach is allowing for a number of different configurations based on coverage, e.g. allowing 4x2 x 1 texture unit, or 2x2 x 2 texture units, or 2x1 x 4 texture units, or 1x1 x 8 texture units, and selecting the appropriate one based on the number of pixels discarded.
This wouldn't save you anything in the simple case (1 texture), but as shaders become more complex it would keep the texture pipelines busy, without losing the brute-force throughput when it's needed for, say, stencil fills.
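For what it's worth, the selection step might look something like this (a sketch only; the threshold policy is my guess at what "based on the number of pixels discarded" could mean):

```python
# Hypothetical: pick a (pixels, TMUs-per-pixel) configuration for an
# 8-TMU part based on how many pixels of the block survive coverage/Z.
# The configurations are the four listed above; the policy is a guess.

CONFIGS = [(8, 1), (4, 2), (2, 4), (1, 8)]  # (pixels, TMUs per pixel)

def pick_config(covered_pixels):
    """Choose the widest pixel configuration that still has work for
    every lane, trading pixels for TMUs per pixel as coverage drops."""
    for pixels, tmus in CONFIGS:
        if covered_pixels >= pixels:
            return (pixels, tmus)
    return CONFIGS[-1]

print(pick_config(8))  # (8, 1): full block, brute-force fill
print(pick_config(3))  # (2, 4): most pixels discarded
print(pick_config(1))  # (1, 8): one pixel gets all 8 TMUs
```

Every configuration keeps all 8 TMUs nominally busy; the open question Kristof raises below is whether the data flow behind them can actually sustain that.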
 
ERP, for the sake of argument (perhaps you can educate me here):

Where is there any sort of conditional element in the algorithm? Basically the rasterizer idles when there are no free slots in the pixel buffer to fill, and the pipelines idle when they can't get anything from the pixel buffer. Isn't this a bog-standard way of decoupling execution units in a chip? I'm probably missing something, but I don't see where the conditional part of the algorithm really is...

You are paying in transistors for the pixel buffer, and also for the duplication of any logic which can efficiently output results for all pixels in an NxM block, but you gain quite a lot - IMO even over the configurable blocks you propose - and that is utilization of the programmable units of each pipeline. Allowing one pipe to access 4 textures per clock is very nice, but how often is this going to happen when most instructions in some longish program process texture coordinates or the results of a texture lookup?

just my 2c,
Serge

Edit : What I'm saying is wrong if there isn't a distinction between texture unit and shading unit... But then if you can dedicate the computational power of 8 "texture/shader" units to a single pixel, you would still need to find ILP of 8 in the shader instructions to fully utilize the units, which IMO is unlikely. Am I making sense?
 
I think a more likely approach is allowing for a number of different configurations based on coverage, e.g. allowing 4x2 x 1 texture unit, or 2x2 x 2 texture units, or 2x1 x 4 texture units, or 1x1 x 8 texture units, and selecting the appropriate one based on the number of pixels discarded.

Interesting - isn't that close to what the Rampage was going to do?

Now how much efficiency would that bring to the table, in addition to a performance boost? Also, it seems like in a high-polygon scene this would have the most benefit, since you'd have a more flexible shading pipeline. It might help with anisotropic filtering as well, by using the fillrate capability more efficiently.
 
ERP said:
I think a more likely approach is allowing for a number of different configurations based on coverage, e.g. allowing 4x2 x 1 texture unit, or 2x2 x 2 texture units, or 2x1 x 4 texture units, or 1x1 x 8 texture units, and selecting the appropriate one based on the number of pixels discarded.

You have to take into account that the data flow and processing needed for this are very different. Fetching data from 1 texture and sending "related" data to 4 or 8 pipelines is one thing; fetching data from 8 completely different textures at completely different locations in memory is something very, very different. Optimising for one approach: OK, but optimising for x different cases that can potentially change dynamically is not going to be easy or effective. Your processing needs to adapt as well: combining 8 TMU results in an efficient way is different from processing 8 x 1 TMU results. The first is like CISC (a very complex operation where you get 8 input sets that you need to "combine" using a complex ALU operation), while the second is like SIMD (8 independent data sets that all execute the same "simple" instruction).

Decoupling pipelines is a very nice idea, but it will be pretty expensive in silicon area; still, with decreasing triangle size it might become a must. Going to a completely flexible array of processing elements is probably even nicer (on paper), but is going to be even more complex in dataflow, instruction flow and control (silicon overhead - non-effective logic that's just there to make sure that everything goes to the right spot and executes the right things). I guess it's a bit of a case of brute force (with some waste) against "clever" processing with little/no waste.

K-
 
Kristof -

Making the processing elements completely independent results in something that sounds almost exactly like the CELL chip - do you think this is what gfx chips will converge towards?

What I'm describing (pipelines operating in lockstep on independent pixels, no flow control) has already been done in even more general form:
Imagine (paper here).

At 400 MHz they peak at 16 single-precision GFLOPS. Apparently their chip is 2.6 cm^2 on a .15u process, and this includes 48 ALUs and 128 KB of cache. Again, I don't know very much about these things, but unless I'm mistaken that's a fraction of the die size/transistor count of the R300, for example...

Serge
 