Techniques used in future GPUs ?

eSa

Newcomer
I thought I would start a new thread about possible ways to continue the exponential growth in gfx processing power. Let me quickly set up an example situation.

Let's assume we have nv30, running at 400 MHz. We can process an average of 50 assembler instructions / pixel in our application and achieve real-time performance. This is supported by a peak rate of 4 instructions / pixel in every pipe for each cycle.

Now, we set a goal; we want to have the same performance but with "photorealistic" 1024 assembler instructions / pixel. We also believe what nVidia is telling us ;) and we can expect this performance level in about 3.5 years. Spring 2006 is our release time.

How can we achieve this goal ?

For example, we can double the clock rate to 800 MHz. We can also double the pipelines to 16. Maybe we also have to make the pipelines independent of each other, so that we can process small ~one pixel polygons. Memory bandwidth gets raised as well.

BUT pixel shader performance is still roughly 20 / 4 = 5 times slower than the goal: we need about 1024 / 50 ≈ 20 times more instructions per pixel, and the clock and pipeline doubling together give only 2 × 2 = 4. Now, our nv30 can do a peak of 4 instructions in parallel. This is about as good as the best _general_ superscalar processors can do today. Also, if our fantasy hw has fully Turing-complete pixel shaders, we have 16 general superscalar processors...
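The arithmetic behind that "5 times" can be checked with a short sketch. Every input is one of the post's own hypothetical figures rather than a measurement, the starting count of 8 pipes is an assumption implied by "double the pipelines to 16", and the function name is made up for illustration.

```python
# Back-of-envelope check of the figures in the post.
# All inputs are the post's hypotheticals, not measured values.

def remaining_per_pipe_speedup(current_ipp, target_ipp, clock_scale, pipe_scale):
    """Speedup each pipe must still deliver after clock and pipe-count gains."""
    required = target_ipp / current_ipp          # overall speedup needed
    return required / (clock_scale * pipe_scale)

current_ipp = 50          # instructions / pixel today (from the post)
target_ipp = 1024         # "photorealistic" goal (from the post)
clock_scale = 800 / 400   # 400 MHz -> 800 MHz
pipe_scale = 16 / 8       # assumption: 8 pipes today, doubled to 16
peak_per_pipe = 4         # the post's guessed peak instructions / cycle / pipe

gap = remaining_per_pipe_speedup(current_ipp, target_ipp, clock_scale, pipe_scale)

print(f"required overall speedup: {target_ipp / current_ipp:.2f}x")
print(f"left for each pipe:       {gap:.2f}x")
print(f"implied peak per pipe:    {peak_per_pipe * gap:.2f} instructions / cycle")
```

Under these assumptions each pipe would need to issue roughly 20 instructions per cycle, which is the gap the rest of the thread is asking about.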

So, any insights / ideas about how the performance can be increased five-fold for each pixel pipe? :) All suggestions are welcome!
 
eSa said:
This is supported by a peak rate of 4 instructions / pixel in every pipe for each cycle.

...

Now, our nv30 can do a peak of 4 instructions in parallel. This is about as good as the best _general_ superscalar processors can do today.

Just curious, did you pull this "4 instructions per pixel" out of the air or is there some fact behind this ? In which case the question is what kind of 4 instructions are we talking about... e.g. texture address, texture blending, scalar, vector, etc etc ?

K~
 
Kristof said:
eSa said:
This is supported by a peak rate of 4 instructions / pixel in every pipe for each cycle.

...

Now, our nv30 can do a peak of 4 instructions in parallel. This is about as good as the best _general_ superscalar processors can do today.

Just curious, did you pull this "4 instructions per pixel" out of the air or is there some fact behind this ? In which case the question is what kind of 4 instructions are we talking about... e.g. texture address, texture blending, scalar, vector, etc etc ?

K~

I'm sorry Kristof, but it's just a somewhat "educated guess". R300 can do 3 instructions in parallel (sorry, I don't remember what the instructions were), and both P4 and Athlon can do (in real-life situations) roughly 3 instructions / cycle at best.

My point really was not the exact instruction count. I would just like to hear any ideas on how to get the pixel shader instruction count per pixel past the performance of current general-purpose processors. Maybe with VLIW à la Itanium?

It seems to me that as nVidia and ATI produce more "generic" processors, the advantage from the parallel nature of gfx is also diminishing.
 
GeForce3/4 can do 2 register combiner ops per pixel per cycle. Each register combiner is roughly equivalent to 2 pixel shader instructions (1 vector, 1 scalar). It doesn't quite fully match because of the way the combiners work, but essentially in DX8 you can dispatch "4" instructions per cycle per pixel, except that 2 of them are in the alpha pipeline and don't really "count".

I would expect the NV30 to at least improve this to 4 vector ALU ops per cycle by doubling up on the combiner stages and cutting latency.
 
I think I've seen it mentioned somewhere (I think in an NV statement) that NV30 has up to 4x the pixel shader throughput of GF3/4. Assuming that NV30 has 8 pipes, this means that it has 2x the per-pipe performance; given that GF3/4 does up to 2 per clock, 4 per clock makes sense. However, it is unclear whether this is at half or full precision (again, NV have stated they process 16-bit floats at 2x rate).

John.
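The per-pipe reasoning in the post above can be sketched as follows. The 4 pipes for GF3/4 is an assumption (it is what the post's "2x per pipe" conclusion implies), the 8 NV30 pipes is the post's own assumption, and clock-speed differences are ignored just as the post ignores them. The function and constant names are made up for illustration.

```python
# Rough per-pipe throughput estimate from the claimed "up to 4x" figure.
# Every number here is either quoted in the thread or a labeled assumption.

def per_pipe_gain(total_gain, new_pipes, old_pipes):
    """Split a total throughput gain into the part that comes from each pipe."""
    return total_gain / (new_pipes / old_pipes)

GF34_PIPES = 4            # assumption: implied by the post's 2x conclusion
NV30_PIPES = 8            # assumption stated in the post
GF34_OPS_PER_CLOCK = 2    # "up to 2 per clock" from the post
CLAIMED_TOTAL_GAIN = 4.0  # "up to 4x the pixel shader throughput"
FP16_RATE = 2.0           # 16-bit floats processed at 2x rate, per the post

gain = per_pipe_gain(CLAIMED_TOTAL_GAIN, NV30_PIPES, GF34_PIPES)
nv30_ops = GF34_OPS_PER_CLOCK * gain

# Case A: the 4x claim refers to half precision, so full precision runs at half of it.
full_if_claim_is_fp16 = nv30_ops / FP16_RATE
# Case B: the 4x claim refers to full precision, so half precision doubles it.
half_if_claim_is_full = nv30_ops * FP16_RATE

print(f"per-pipe gain: {gain:.1f}x, ops per clock per pipe: {nv30_ops:.1f}")
print(f"case A, full precision: {full_if_claim_is_fp16:.1f} ops / clock")
print(f"case B, half precision: {half_if_claim_is_full:.1f} ops / clock")
```

The two cases differ by a factor of four at the extremes, which is why the precision question matters for the estimate.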
 