The 12 ops per clock cycle fits what has already been established for the NV30.
The best case for that would be if it can distribute those calculation opportunities over 8 pixels.
The big question is the figure of "12 ops per clock cycle".
Assuming it really can apply colors for 8 pixels at a time (which I tend to believe), that's fine for calling it an 8x1 part, and for getting a boost in odd textured performance (but not non-textured, unless zixel (TM Joe DeFuria) performance is expanded again, which I don't expect at the moment).
However, I think there would still be significant limitations if that number is real.
The performance boost possibility that I see in that number for fp is going up to four 4-component fp32 ops per cycle. That's a very good thing for ARB_fragment and DX 9 (non-FX12), but nVidia isn't touting ARB_fragment and DX 9 benchmark results (and still wouldn't do well against the R300), so maybe not.
More important to marketing, it could also be potentially good in allowing the peak full PS 1.3 performance (8*(non dependent tex op + register combiners 1 op)) to happen more often if the register combiners follow 8x1 (which seems to me to go together with 8x1 color writes).
The problem is that with only 4 fp ops allowed, the intermixed peak (4*(op+up to 2 ops)) would still be just as rare, and scheduling would effectively reduce it to NV30 behavior when trying to take advantage of the fp units for intermixing.
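To put rough numbers on the two peaks above, here is a quick sketch. It assumes every texture op, register combiner op, and fp op counts as exactly one "op" per clock, which is my simplification and not anything nVidia has stated; the function name is made up for illustration.

```python
def peak_ops_per_clock(pixels_per_clock, ops_per_pixel):
    """Peak op count per clock: pixels in flight times ops issued per pixel.

    Assumption: each op (tex, combiner, fp) counts as one unit.
    """
    return pixels_per_clock * ops_per_pixel


# PS 1.3 peak if the register combiners follow 8x1:
# 8 * (non dependent tex op + 1 register combiner op)
ps13_peak = peak_ops_per_clock(8, 1 + 1)

# Intermixed peak with only 4 fp ops allowed:
# 4 * (op + up to 2 ops)
intermixed_peak = peak_ops_per_clock(4, 1 + 2)

print("PS 1.3 peak:", ps13_peak)
print("Intermixed peak:", intermixed_peak)
```

Under that counting the intermixed case tops out at 12 ops per clock and the PS 1.3 case at 16, but both are best cases that depend on the scheduling actually lining up.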
For this reason, I think these figures indicate 8x1 in fixed function; the same peak intermixed and fp performance as the NV30, so effectively or actually a 4 pipeline part; possibly significant improvement in fp32 (as fast as fp16); and possibly significant enhancement of PS 1.3 (non dependent texture reads, FX12) speed.
What this would be is the NV35 acting like a 9800 of the same clock speeds only in fixed function pixel output (sans filtering, and with some distinctions in fixed function T&L performance) and PS 1.3. This could be a winner in UT2k3 and earlier games at Ultra clock speeds, but once vertex shading and pixel shading complexity increase (and they already seem to be for games), AFAICS it would act more like an NV30 (i.e., run into the same limits).
Again, if that 12 ops per cycle number is correct (it strikes me that because of that we are discussing the worst case).
I don't see a guarantee of high RAM clock speeds. It could have 32 GB/s bandwidth, but so could the 9800. The question seems simply whether nVidia will achieve more success (or spend more money on production) with the 256-bit bus to allow a higher bandwidth figure.
Nor do I see an inherent problem with DDR I... you pick the clock speed target and pick the cheapest effective technology to achieve it. DDR II or DDR I doesn't matter; the effect on performance does.
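For reference, the bandwidth arithmetic behind that can be sketched as below. This assumes peak bandwidth is simply bus width times effective data rate, with 1 GB taken as 10^9 bytes; the function names are my own, and no actual NV35 or 9800 memory clocks are implied.

```python
def bandwidth_gb_s(bus_bits, effective_mhz):
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes, an assumption).

    effective_mhz is the data rate, i.e. twice the memory clock
    for DDR I and DDR II alike.
    """
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * effective_mhz * 1e6 / 1e9


def required_effective_mhz(bus_bits, target_gb_s):
    """Effective data rate in MHz needed to reach a target bandwidth."""
    bytes_per_transfer = bus_bits / 8
    return target_gb_s * 1e9 / bytes_per_transfer / 1e6


# What a 256-bit bus needs in order to hit the 32 GB/s figure.
needed = required_effective_mhz(256, 32)
print("Effective data rate needed (MHz):", needed)
print("Memory clock needed (MHz):", needed / 2)
```

The memory type doesn't enter the calculation at all, only the clock it can reach, which is the point: a 256-bit bus needs a 500 MHz memory clock (1000 MHz effective) for 32 GB/s, whichever technology gets there.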
I think if the fp32, PS 1.3, and 8x1 fixed function performance are true, it would be quite the improvement (not the improvement that I had theorized, though). I think the PS 1.3 and 8x1 fixed function are pretty easily achievable (but I don't know the internals of the FX12 architecture to be sure), and the major question is the fp32 changes.
EDIT: reworded the PS 1.3/non dependent mention.