I disagree. The quote says:
Where the NV30 differs from the 9700 is that instead of a single 128bit colour call, it can perform two 64bit operations in the same time, providing what Adam called a 'sweet spot' between performance and colour.
This says nothing about pipeline combining. I could write the same quote for 16 vs 32-bit color or single precision vs double precision math on a CPU. All it says is that in the same span of time it takes to execute a 128-bit op, it can issue two 64-bit ops, which is exactly how I explained it in my original post. This same reasoning works for vectorized instructions on CPUs as well. If you drop back in precision, you can pack more ops per cycle.
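To put that packing argument in concrete terms, here is a rough sketch. The function name and the fixed 128-bit datapath are my own illustrative assumptions, not anything NVidia or ATI has published:

```python
def ops_per_cycle(datapath_bits: int, op_bits: int) -> int:
    """Number of ops of a given width that fit in one pass through a fixed-width unit."""
    if op_bits <= 0 or op_bits > datapath_bits:
        raise ValueError("op width must be between 1 and the datapath width")
    return datapath_bits // op_bits


# Hypothetical 128-bit unit: one full-precision op, or two half-precision ops.
for width in (128, 64):
    print(f"{width}-bit ops per cycle: {ops_per_cycle(128, width)}")
```

Nothing in this requires borrowing a unit from a neighbouring pipeline; the same hardware simply gets split two ways.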
You are proceeding from the theory that each pipeline cannot do 128-bit ops on its own, but must steal an execution unit from another pipeline. You have no evidence for this theory, and it doesn't jibe with the following quote:
The 9700, however, can only call one 64 bit operation in the same time as a 128 bit, so decreasing the colour level will not enhance performance.
Let's proceed on the theory that each R300 pipeline can do a 128-bit color op in 1 cycle, for 8 pixels in total. It can also do a 64-bit color op in 1 cycle for all 8 pixels (NVidia's claim: 128-bit = 64-bit speed for R300). According to your theory, NVidia would take 2 cycles to do the same 128-bit color op for 8 pixels because half the pipelines are shut down, but could output 8 64-bit pixels in 1 cycle.
So in reality, the R300 would be no slower than the NV30 at 64-bit math, both outputting 8 pixels @ 64-bit in 1 cycle. So why would NVidia claim an execution advantage if the ATI card can still equal it at 64-bit? Let's assume the marketing guy is not talking about stuff he is clueless about (otherwise, we can't have this discussion), so NV30 must be faster than R300 at 64-bit.
We'd have to revise these conjectures in one of three ways to make them rational:
Theory #1: NVidia can do 8 128-bit ops in 1 cycle, and therefore 16 64-bit ops in 1 cycle. Therefore in 64-bit, NV30 is 2x R300, and it is equal to R300 @ 128-bit. This contradicts the pipeline combining "50% slower" theory.
Theory #2: NVidia takes 2 pipelines/cycles to do 128-bit, but R300 takes 2 cycles as well for both 128-bit and 64-bit. NV30 only takes 1 cycle for 64-bit, therefore 2x the speed at 64-bit, but R300 == NV30 @ 128-bit (2 cycles).
Theory #3: Marketing guy doesn't know what he is talking about. R300 64-bit speed == NV30 64-bit speed, but NV30 128-bit speed is 50% slower than R300 128-bit speed.
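The three theories can be laid side by side with a quick sketch. The pixels-per-cycle figures below are just my reading of each theory, normalized to 8 pixels of work; they are assumptions for the sake of argument, not measured or published numbers:

```python
# Pixels per cycle for each chip under each theory (assumed, per the text above).
THEORIES = {
    "Theory #1": {"NV30": {128: 8, 64: 16}, "R300": {128: 8, 64: 8}},
    "Theory #2": {"NV30": {128: 4, 64: 8}, "R300": {128: 4, 64: 4}},
    "Theory #3": {"NV30": {128: 4, 64: 8}, "R300": {128: 8, 64: 8}},
}


def nv30_vs_r300(theory: str, bits: int) -> float:
    """NV30 throughput relative to R300 at the given color precision."""
    rates = THEORIES[theory]
    return rates["NV30"][bits] / rates["R300"][bits]


for name in THEORIES:
    ratios = {bits: nv30_vs_r300(name, bits) for bits in (64, 128)}
    print(name, ratios)
```

Theories #1 and #2 give the same ratios (2x at 64-bit, parity at 128-bit) and differ only in absolute throughput. Theory #3 is the only one where NV30 has no 64-bit advantage at all, which is exactly the case where the marketing claim makes no sense.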
No matter what you decide, can you really rationalize a 120M transistor NV30 that is "economical" with its transistors and, like an XP4, steals units from sibling pixel pipelines? If, for example, they only included 50% of the floating point logic of the R300 in their pipelines, then what the hell are all those other transistors doing? Exotic deferred rendering architecture? I doubt it.
I bet in the end, the NV30 will be a straightforward IMR design like the R300, and the performance of the two will be mostly equivalent in many areas, with each core having its own strengths in others.
If the NV30 really has a 128-bit bus, and it doesn't have some truly exotic HSR, then NVidia made an unfathomable error in their design, hitching 120M transistors (for what?!?) and bandwidth-hungry 128-bit/64-bit FP pipelines to a tiny amount of memory bandwidth.
Would they be that incompetent, that idiotic?