DaveBaumann said:
I believe the GFFX (and some 3DLabs P10 VS) is already a step in that direction, with its 32 FP calculators it dynamically allocates (however, so little is known about it that I might be 100% wrong on that. Any info about how it really works?)
WTF are you talking about?
Well, it's a *very* small step in that direction. But it's worth noting.
Here's what I meant by the 32 FP calculators the GFFX dynamically allocates:
The GFFX has 32 FP calculators for the Pixel Shader, which are dynamically allocated across 8 pipelines.
http://www.extremetech.com/article2/0,3973,713549,00.asp
As Dave Kirk mentioned, thirty-two 128-bit floating-point processors that handle shader calculations are the heart and soul of GeForceFX. These processors aren't hard-wired up four to a pipe, but instead are dynamically allocated as shader programs dictate.
While the article doesn't say so, those FP processors are only there for the PS, so it seems the VS doesn't use the same ones.
The idea seems to be that data is sent to those FP processors, and that they then try to find an available pipeline to output the pixel. So it can output 8 pixels/clock, but it can actually be working on up to 32 pixels during a given clock.
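To put rough numbers on why pooling could matter, here's a toy model in Python. It is only a sketch of my guess, not how the hardware is known to work: the 32 units and 8 pipelines come from the article, but the function names, the made-up op counts, and the assumption that every operation is independent and takes one clock are all mine.

```python
from math import ceil

UNITS = 32  # FP processors quoted in the ExtremeTech article
PIPES = 8   # pixel pipelines

def clocks_fixed(ops_per_pixel):
    """Hypothetical hard-wired case: each pipe owns UNITS // PIPES units
    and works on one pixel, so the slowest pixel sets the pace."""
    per_pipe = UNITS // PIPES
    return max(ceil(n / per_pipe) for n in ops_per_pixel)

def clocks_pooled(ops_per_pixel):
    """Hypothetical pooled case: any unit may take any pending operation
    (assumes operations are independent, which real shaders are not)."""
    return ceil(sum(ops_per_pixel) / UNITS)

# Made-up workload: one pixel needs 8 operations, the other seven need 2.
batch = [8, 2, 2, 2, 2, 2, 2, 2]
print(clocks_fixed(batch), clocks_pooled(batch))
```

In this model the hard-wired layout is held back by the one expensive pixel, while the pool absorbs it. Real dependencies between instructions would narrow that gap.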
That's why you could consider the NV30 as 4x8 and the R300 as 3x8 (ATI claims an R300 pipeline can work on three instructions in parallel), but then again, there are probably differences in the architectures that make them not directly comparable. So only benchmarks can tell the whole story.
So here's how it's similar to what I said about using the same calculators for the whole GPU: the units that calculate don't do the outputting. It would be exactly the same if the same calculators were used everywhere, except that you wouldn't limit them to the PS.
As I said, I might be confused on this. It's the part of the GFFX architecture I'm most confused about, mostly in how it compares to the R300.
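Spelled out, the 4x8 versus 3x8 shorthand is just units per pipeline times pipelines. A trivial Python sketch, where the function name is mine and the figures are only the marketing numbers above, not measured throughput:

```python
def peak_fp_ops_per_clock(pipelines, units_per_pipeline):
    """Paper peak: every unit on every pipeline busy on every clock."""
    return pipelines * units_per_pipeline

nv30 = peak_fp_ops_per_clock(pipelines=8, units_per_pipeline=4)
r300 = peak_fp_ops_per_clock(pipelines=8, units_per_pipeline=3)
print(nv30, r300)
```

These are paper peaks only, and they say nothing about how often either chip actually keeps its units busy.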
As for the 3DLabs P10 comparison, it's a lot less similar there. The P10 VS doesn't waste anything when an instruction uses fewer than 4 elements. So it's the same goal: less waste in suboptimal cases. I probably shouldn't have included that part of the comparison; it's way too ambiguous and it barely makes sense even after the explanation.
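To show what I mean by waste, here's a small Python sketch. It assumes a conventional 4-wide vector unit that occupies all 4 lanes for every instruction no matter how many elements are used. The function names and the instruction mix are made up for illustration.

```python
def idle_lane_slots(component_counts, width=4):
    """Lane-slots left idle when each instruction occupies a full
    `width`-wide unit regardless of how many elements it uses."""
    return sum(width - n for n in component_counts)

def utilization(component_counts, width=4):
    """Fraction of lane-slots doing useful work on a `width`-wide unit."""
    return sum(component_counts) / (width * len(component_counts))

# Made-up mix: one 4-element, one 3-element, one scalar, one 2-element op.
mix = [4, 3, 1, 2]
print(idle_lane_slots(mix), utilization(mix))
```

A design that only spends work on the elements actually used would have zero idle slots on the same mix, which is the waste reduction I was getting at.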
Uttar