"If" is the key here. I see no reason why GPU's should do this, considering they want to hide a whole heck of a lot of latency anyway.
Where did I say they _should_ do it that way? And you provide no reason they should go any smaller than this (latency is not a reason; I'll explain below).
IF is the key point. If they targeted an FP add as the gating factor, it would be slower. They absolutely don't make a pipeline's unit of work smaller than that; it would cost a ton of die space.
Lengthening pipelines and splitting work costs die space, especially for small operations like a single FP add. Not so much for large composite operations with clear "halfway" points.
You are making the claim that somehow they would need to mask latency WITHIN A SINGLE ADD? Honestly, WTF? What could they possibly have to wait for between the beginning and end of a single add? You have two numbers, and now you have to add them and get a result. Do you need any extra info in the middle of that calculation? NO!
And since you are talking about masking latency, you do realize that adding extra stages and breaking a single stage apart into two (faster-clocking) ones is an utter waste, since latency is measured in TIME and not clocks. If they want to mask latency, they can just add a cycle of waiting in the pipeline.
In simple terms: if at the end of the add you have to wait for something else, add a dummy pipe stage that does nothing, and voila, masked latency! Don't split the add in two; it's way more work!
Splitting up a stage that is not clock-limiting is pointless, costly work.
I'd like to hear one single good case for making a pipe stage smaller than an FP add. I still think the stages are larger than that, but even if not, what would it gain them? You have 1/500 MHz of time, and a single add fits in that time. Why bother making it take two cycles? To mask latency? LOL, just wait or do some other work. The "mask latency" argument for splitting work up into multiple stages is bogus.
You split work up into multiple stages when the work is too much to fit in the target clock period, and that's it. If you target 2.5 GHz, then yes, you'll pipeline that FP add, or make it complete in more than one clock at a throughput of less than one result per clock.
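To put numbers on that rule (the FP-add latency below is an illustrative assumption, not a measured figure for any real chip), a quick sketch of "split only when the work doesn't fit in the clock period":

```python
import math

def stages_needed(op_latency_ns: float, clock_hz: float) -> int:
    """Minimum pipeline stages so each stage fits in one clock period."""
    period_ns = 1e9 / clock_hz
    return math.ceil(op_latency_ns / period_ns)

FP_ADD_NS = 2.0  # assumed combinational FP-add latency; varies by process

# At 500 MHz the period is 2 ns, so the whole add fits in one stage.
print(stages_needed(FP_ADD_NS, 500e6))  # 1
# At 2.5 GHz the period is 0.4 ns, so the same add must span 5 stages.
print(stages_needed(FP_ADD_NS, 2.5e9))  # 5
```

Same operation, same silicon speed: only the target clock changes the stage count.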
If you raise your target clock, then every stage everywhere must do less work, but the result is increased overall area and power.
As I said, the cost would be high in area and power, and those are fixed resources, so a higher-clocking chip (say, 2x faster) would probably need 3x to 5x the area, and thus have less than half the GPU pixel and vertex pipes, making it even slower in aggregate throughput (though each individual pipe would be faster). Three pipes at 1 GHz or eight at 500 MHz: take your pick.
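Back-of-the-envelope, using the pipe counts above (and assuming the simplification of one result per pipe per clock):

```python
def throughput(pipes: int, clock_hz: float) -> float:
    """Aggregate results per second, assuming one result per pipe per clock."""
    return pipes * clock_hz

fast_narrow = throughput(3, 1e9)    # 3 pipes at 1 GHz   -> 3e9 results/s
slow_wide   = throughput(8, 500e6)  # 8 pipes at 500 MHz -> 4e9 results/s
print(slow_wide > fast_narrow)  # True: the wider, slower chip wins
```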
The relationship between area needed and clock is highly nonlinear. At CPU speeds it is extremely steep: slight clockability improvements lead to very large growth in area and power consumption. At GPU speeds it is gentler. The reason is that in CPUs you are breaking many small things, such as an FP add or multiply, apart into multiple pieces. These are small operations that are hard to parallelize and break apart, and they require a lot of area and complexity to work at that speed.
Clearly, breaking apart something larger like a matrix transform is easier, since there are several obvious boundaries in the process at which to break it into stages. Clear and simple boundaries mean hardly any increase in area (complexity).
Those clear and simple boundaries give way to much messier ones once you cross below the fundamental units of work into sub-units.
The constraints are different, and thus the design decisions are different.
Exactly.
GPUs clock lower because they have pipeline stages that do more work.
This is so because the design decisions are different.
Such decisions are the area/power tradeoffs and the fundamental units of work.
A circular relationship, of course, as the design influences the targeted clock, and the achieved clock is a result of the design.
In a GPU you can add pixel and vertex pipelines; each new one adds approximately linear area, and you get linear improvements in performance. Thus, being able to cut the area used in half is JUST AS VALUABLE as increasing clockability by a factor of two (performance-wise).
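A toy calculation of that equivalence (the die budget and per-pipe area are made-up numbers; the point is the linear scaling):

```python
DIE_AREA = 100.0  # hypothetical fixed die budget, arbitrary units

def perf(area_per_pipe: float, clock_hz: float) -> float:
    """Aggregate performance when pipes scale linearly into a fixed die."""
    pipes = int(DIE_AREA // area_per_pipe)
    return pipes * clock_hz

base         = perf(10.0, 500e6)  # 10 pipes at 500 MHz
half_area    = perf(5.0, 500e6)   # 20 pipes, same clock
double_clock = perf(10.0, 1e9)    # 10 pipes, twice the clock
print(half_area == double_clock == 2 * base)  # True
```

Halving per-pipe area and doubling the clock land on exactly the same aggregate number; this only holds because pipes scale linearly, which is the GPU case, not the CPU case.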
In a CPU your limiting factor is a single execution stream, and extra pipelines can only parallelize that stream a little. Each extra pipeline (superscalar pipe) requires nonlinear extra area and power, because the pipelines must coordinate on a common set of registers and a single instruction stream, and the logic to organize and coordinate this activity is expensive (it is actually most of the logic in a CPU).
Cutting the area of the execution resources (which are now less than 50% of the chip) is hardly valuable. Increasing clock is extremely valuable, so long as instruction execution efficiency (IPC) doesn't get too bad.
Thus, the optimal designs limit the number of pipes and go for high clock speeds at good IPC, with the smallest common instructions (int add, boolean, logical, and other simple ops) as the clock-limiting factors. This has been the primary CPU design philosophy since circa 1990, give or take a few years depending on the ISA, when transistor budgets first allowed for pipelining at all (more proof that pipelining an action and splitting it up takes more transistors/area/power).
For a CPU that expects to have many simultaneously parallelizable instruction streams, the design constraints are different, and lo and behold, Sun's Niagara is designed to have many simple, parallel, independent pipelines that are only moderately clock-tuned. With throughput the goal, it would make little sense to quadruple each mini-CPU pipeline's size for more clock, since they're cramming lots of them onto one chip.