We all know that the NV30 is described as a 4x2 architecture, and the NV31 is often described as a 2x2 architecture, with something odd going on in the NV34.
I'd like to suggest something different. What if, instead of giving the NV31 and NV34 fewer pipelines, nVidia just removed math units, so that these chips take more clock cycles to complete a single SIMD instruction?
This would explain a number of irregularities that a reduced number of pipelines does not explain:
1. Sometimes the chips act like they have more pipelines than a first analysis suggests. Under this theory, that just means nVidia couldn't shrink every part of the functional units, so for some functions the NV31 and NV34 must have the same throughput as the NV30 (most likely operations not directly linked to math such as that done in shaders, filtering, or blending).
2. The lower-cost chips still support the DDX/DDY instructions, which require the state information for four pixels at once.
3. nVidia opted not to support a vec3 + scalar architecture. Achieving scalability by increasing the number of clock cycles needed to complete a single SIMD instruction may make such a split much harder to implement.
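To make the idea concrete, here is a toy model of what I mean. It is only a sketch under my own assumptions: every chip always works on a 2x2 quad of pixels, and the cheaper chips simply have fewer scalar math lanes to spread a vec4 instruction over. The lane counts (16, 8, 4) and the function names are made up for illustration and are not measured or documented figures for any NV3x part.

```python
from math import ceil

QUAD_PIXELS = 4        # a 2x2 block of pixels processed together
VEC_COMPONENTS = 4     # one SIMD instruction operates on x, y, z, w


def cycles_per_simd_op(math_lanes):
    """Clocks needed to run one vec4 instruction across a whole quad."""
    lane_ops = QUAD_PIXELS * VEC_COMPONENTS   # 16 scalar operations
    return ceil(lane_ops / math_lanes)


def math_pixels_per_clock(math_lanes):
    """Apparent pixel throughput when the workload is math-bound."""
    return QUAD_PIXELS / cycles_per_simd_op(math_lanes)


def ddx(quad):
    """Horizontal difference per row of a quad: right pixel minus left pixel."""
    return tuple(right - left for left, right in quad)


# Hypothetical lane counts, chosen only to show the scaling.
for name, lanes in [("full", 16), ("half", 8), ("quarter", 4)]:
    print(name, cycles_per_simd_op(lanes), math_pixels_per_clock(lanes))
```

In this model, a part with half the lanes looks like a two-pipeline chip on math-bound tests, yet it still holds all four pixels of the quad. That is why anything that does not go through the math lanes could still run at the full four pixels per clock, and why a quad-based difference like `ddx` stays available on the cut-down parts.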