Why such an architecture? First, let's explain the 4 color writes design choice.
Because nVidia expected their FP32 performance to be sufficient. Considering the chip can only do 4 FP32 operations/clock, a 4x2 design makes a lot of sense.
The problem, now, is that their FP32 performance is way below R300 FP24 performance. Way, way below. Here are the theoretical numbers (operations/clock x core clock in MHz):
R300: 8x325 = 2600
NV30: 4x500 = 2000
That's about 23% slower, in a perfect theoretical situation. In practice, early benchmarks show it could be even worse than that.
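For anyone who wants to redo the arithmetic, here is a minimal sketch. It only multiplies the per-clock figures and clock speeds quoted above; the function name is made up for illustration, and real throughput depends on far more than this product.

```python
# Peak throughput = operations per clock x core clock (MHz).
# The per-clock figures and clocks are the ones quoted in the post;
# peak_mops is a hypothetical helper, not anything from a real tool.
def peak_mops(ops_per_clock, clock_mhz):
    """Return the theoretical peak in millions of operations per second."""
    return ops_per_clock * clock_mhz

r300 = peak_mops(8, 325)   # R300 at FP24
nv30 = peak_mops(4, 500)   # NV30 at FP32

# How far NV30 falls short of R300, as a fraction of R300's peak.
deficit = 1 - nv30 / r300

print(r300, nv30, f"{deficit:.1%}")  # 2600 2000 23.1%
```

Seen from the other side, the same numbers make R300 30% faster on paper.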
The idea behind 4 color outputs probably is that you'd waste much if you had 8 color outputs and you were using FP32. The problem with 4 color outputs, however, is that it's not optimal when using FP16.
As Wavey says, figuring out which instructions aren't impacted by one another isn't easy at all. It's even very hard. On the plus side for nVidia, however, ATI advises developers to put space between instructions that do not depend on each other. That's because the R300 architecture *also* benefits from this.
So, even though the reasons for nVidia & ATI aren't exactly the same, both agree on it. So one won't try to educate the developers in a way which would ruin the other's architecture. And that's a good thing.
But it's probably near optimal with FP32, since it's basically the same thing as with the NV20 4x2 design.
Too bad FP32 is so slow and nVidia is forcing FP16 nearly everywhere.
Poor nVidia!
Also, the huge 3DMark 2003 score increase could be attributed to *hardcoding* which instructions are independent, instead of using the default algorithm... So, the GFFX would use FP16 nearly everywhere, and its efficiency in 3DMark 2003 would be near perfect. Makes sense suddenly, doesn't it?
The real question is whether the IQ difference between 90% FP16/10% FP32 & 100% FP24 is that huge... Why hasn't anyone used the 3DMark screenshot utility yet? Are people too cheap to buy the Pro version?
Now, why 8 Z/Stencil writes/clock?
The first reason, obviously, is the performance gain with the Z Pass: such a pass doesn't even need a color write, and it needs a LOT of Z writes. A fast Z pass also gets the PS engine working sooner, because it sits waiting during the Z Pass.
But then, why can't the NV30 do 8 color writes when there's no Z Write?
Well, a first explanation would be that most people running that test use FP32. But let's assume such a test was done at FP16, and also suggested a 4x2 configuration ( a verification would be nice, too... )
Such a case would be rare, but it does exist ( HUDs might do that, for example ). So it's unlikely it's a driver bug and nVidia didn't implement it yet. Who knows... But let's assume not.
Let's see... 4 Color Writes, and 8 Z Writes? Where would this be highly optimal?
I'm going to give you a hint: what writes more different Z values than Color values?
You guessed it: MultiSampling.
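To make the hint concrete, here is a rough sketch of the write counts. It rests on a simplifying assumption that is mine, not a known hardware detail: each pixel produces one distinct color value and one distinct Z value per sample. The function name is made up for illustration.

```python
# Assumption (not a confirmed hardware detail): under multisampling, each
# pixel is shaded once, so it yields one distinct color value, while every
# sample gets its own Z value. writes_per_clock is a hypothetical helper.
def writes_per_clock(pixels_per_clock, msaa_samples):
    """Count distinct color and Z values produced per clock."""
    return {
        "color": pixels_per_clock,
        "z": pixels_per_clock * msaa_samples,
    }

# 4 pixels/clock, as in the 4 color writes discussed in the post.
for samples in (1, 2, 4):
    print(samples, writes_per_clock(4, samples))
```

Under that assumption, 4 pixels/clock with 2 samples each gives exactly 4 color values and 8 Z values per clock, which is the ratio in question.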
"But wait!" , I hear you say. "If the GFFX could already do 2 Z/pipeline without MSAA, and it can do 4Z/pipeline with MSAA, why doesn't it have a native 8x MSAA mode?"
Because the 4Z/pipeline with MSAA thing probably includes the 2 Z/pipeline without MSAA. But then... Why can't the GFFX use the 4Z/pipeline without MSAA, to do the Z Pass even faster when there's no MSAA? Why does it only use 2Z/pipeline?
Well, a first explanation is bandwidth limitation. As pixelpipes says, the limit is 4GP/s when writing 32 bits/pixel on the GFFX.
An interesting question, thus, would be whether the GFFX is capable of doing 8GP/s when only writing a 16 bit Z value/pixel. If it is, then it all makes sense - they've simply been sufficiently smart to use the MSAA Z Output capabilities in all situations.
Or anyway, probably. They could be using something even more complex, but that's unlikely.
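Here is the back-of-the-envelope version of that bandwidth argument. It assumes the 4GP/s limit is purely a memory bandwidth limit and ignores any Z or color compression, so treat it as a sanity check and nothing more. The function name is made up for illustration.

```python
# Assumption: the fill-rate ceiling comes only from memory bandwidth,
# with no compression. GP/s = 10^9 pixels/s, GB/s = 10^9 bytes/s.
# max_fill_rate_gpixels is a hypothetical helper.
def max_fill_rate_gpixels(bandwidth_gb_s, bits_per_pixel):
    """Return the highest fill rate (GP/s) a given write bandwidth allows."""
    bytes_per_pixel = bits_per_pixel / 8
    return bandwidth_gb_s / bytes_per_pixel

# Bandwidth implied by the quoted limit: 4 GP/s at 32 bits (4 bytes) per pixel.
implied_bandwidth = 4 * 32 / 8

print(implied_bandwidth)                             # 16.0
print(max_fill_rate_gpixels(implied_bandwidth, 32))  # 4.0
print(max_fill_rate_gpixels(implied_bandwidth, 16))  # 8.0
```

So if the limit really is bandwidth alone, halving the bits written per pixel would allow 8GP/s, which is what the question above is asking about.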
My conclusion, thus, is quite simple:
1. The GFFX is 4x2
2. The GFFX is capable of using its 4Z/pipeline capability ( or at least half of that capability ) even when it isn't using MSAA, which is a really good design idea ( although the benefits are all gone when using 4x FSAA, but using 4x FSAA in Doom 3 on an NV30 is kinda unrealistic anyway )
3. The GFFX design is smarter than the NV25's design, and more efficient. It is thus unfair to say it's a 4x2 "just like the NV25" - but it's still less powerful than the R300's design.
Please note that this conclusion is not correct if my reasoning is not correct. As always, I'd appreciate being corrected if I'm wrong - after all, if I write this type of thing, it's in the hope of learning more...
Uttar
P.S. : About the NV30 T&L unit. Couldn't it be possible that the T&L unit is simply using the integer path, which is also usable in the shaders, but which is used a lot less in them because FP is available and developers love FP? While I don't think nVidia would hesitate to change FP32 to FP16 through driver "optimizations" to gain performance, I'd be surprised if they dared to change anything to integer when the program asks for FP...