FP32 is not inherently slower than FP24, FP16, or FPwhatever. It just takes more silicon to support it natively. This means more transistors in your ALUs (about 2x as many for a multiplier); more transistors spent on registers to hold the same number of elements; and wider datapaths. On the other hand, note that R3xx already supports FP32 throughout most of its pipeline--the vertex shader is of course all FP32, and the pixel shaders load and store in FP32 format, so you're already taking the hit in terms of bandwidth, and possibly cache (?), up until you get to the pixel shaders themselves.
As for NV3x, FP16 and FP32 run at exactly the same speed, except for the issue of register file usage. NV3x's pixel shader pipeline is rather ridiculously poor in full-speed temp registers; indeed, testing shows a shader can only address 256 bits of registers without taking a performance hit. This equates to 4 FP16 values (16 bits * 4 components = 64 bits per value), which is bad enough, but only 2 FP32 values, which is downright awful. In fact, it's so awful that the best explanation I've seen for it is that they must have something buggy in their implementation which is being worked around by using a number of should-be GPRs as special-purpose registers, thus taking them off the table. But the point is, if this presumed bug were fixed, or (if NV3x really were this register-poor by design) the design were a bit less braindead, NV3x's FP32 performance would be the same as its FP16 performance in the majority of cases. Still not great, to be sure, but nothing to do with FP32 being inherently slow.
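Just to spell out that register math (the 256-bit figure and the 4-component temps are from the testing mentioned above; the snippet itself is only my illustration):

```python
# NV3x full-speed temp register budget, per the testing described above:
# roughly 256 bits of register file addressable before the performance hit.
REGISTER_FILE_BITS = 256

def max_full_speed_temps(bits_per_component):
    # Shader temps are 4-component vectors.
    return REGISTER_FILE_BITS // (4 * bits_per_component)

print(max_full_speed_temps(16))  # FP16: 4 temps
print(max_full_speed_temps(32))  # FP32: 2 temps
```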
As with all matters of computer performance, this is all about tradeoffs. And as with most matters of graphics performance (indeed, probably most matters of computer performance in general), the balance of the tradeoffs is mostly a function of contemporary process technology. As Moore's Law rolls along, the proper tradeoff inevitably shifts from one side to the other. That is, the primary fact of hardware engineering is that your transistor budget roughly doubles every 1.5 years; tradeoffs you decided against because you couldn't justify the transistor expense eventually become worthwhile.
Of course, since graphics is an embarrassingly parallel problem, there is almost always a good default way to use up one's ever-increasing transistor budget: just slap on more pixel pipelines, TMUs or vertex units. This is quite a different situation from CPUs, where most problems are not embarrassingly parallel and thus extra transistors are used in periodic (every 5 years or so) redesigns for ever-more-complicated control to try to extract as much parallelism as possible from code, and in between are just donated to more and more cache. Unfortunately, the default uses of extra transistors--more pipes on a GPU, and more cache on a CPU--are subject to diminishing returns on many applications; with GPUs, eventually adding more pipes will get you nothing because you are bandwidth limited. (And indeed most GPUs already have more or less enough pipes for their available bandwidth.)
So there is a space in which such tradeoffs are evaluated: transistor budget (which is itself a tradeoff of performance vs. functionality vs. manufacturing cost) vs. the benefits of the feature vs. the benefits of the default use for extra transistors (i.e. extra pipelines, or perhaps some other worthy feature). So, getting back to FP32 vs. FP24: we already know the transistor cost (2x bigger multipliers, and 33% bigger registers); what are the benefits?
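For the curious, here's a back-of-the-envelope sketch of where those two numbers come from. The assumptions are mine, not from the thread: an array multiplier's area grows roughly with the square of the mantissa width (hidden bit included), register and datapath cost grows linearly with total width, and the formats are s10e5 (FP16), s16e7 (FP24), and s23e8 (FP32):

```python
# Back-of-the-envelope FP32 vs. FP24 hardware cost. Assumptions (mine, not
# from the thread): multiplier area scales roughly with the square of the
# mantissa width (hidden bit included); register/datapath cost scales with
# total bit width.

FORMATS = {
    # name: (total bits, stored mantissa bits)
    "FP16": (16, 10),   # s10e5
    "FP24": (24, 16),   # s16e7
    "FP32": (32, 23),   # s23e8
}

def mantissa_width(fmt):
    return FORMATS[fmt][1] + 1  # +1 for the implicit leading bit

mult_ratio = (mantissa_width("FP32") / mantissa_width("FP24")) ** 2
reg_ratio = FORMATS["FP32"][0] / FORMATS["FP24"][0]

print("multiplier area, FP32 vs FP24: %.2fx" % mult_ratio)  # ~1.99x
print("register bits,   FP32 vs FP24: %.2fx" % reg_ratio)   # ~1.33x
```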
There are basically four issues:
- color fidelity: essentially no need for anything more than FP24, or indeed for anything more than FP16. After all, the colors are going to be output on an FX8 monitor for the foreseeable future (maybe FX10 sometime soon)
- texture addressing: a 2048x2048 texture at 32 bits per texel runs 16MB (which is to say, nothing larger will be used for quite some time); FP24 can accumulate 4 bits of error and still address such a texture with 2 bits of subpixel precision. On the other hand, FP16 is not sufficient to address large textures with subpixel accuracy; it is for precisely this reason that PS 2.0 and ARB_fragment_program both require at least FP24 support. (Sireric makes exactly this point early in the thread; see also the bit-budget sketch after this list)
- world-space coordinates: these basically need to be FP32 for any sort of accuracy over large distances (which is why they are FP32 in the vertex pipeline). To the extent you want to use positional coordinates as input to a pixel shader (Rev discusses a perfect example here), you may be able to get away with FP24 with some hacks or restrictions, but in many situations you will get artifacts.
- accumulated error: the longer a shader is, the more error can build up; FP24 shaders of moderate length may start to give incorrect answers for texture addressing, and eventually even for color output (although in practice you'd need a really long shader for that).
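To make the texture-addressing budget concrete, here's a rough sketch. The accounting is mine: a texel coordinate needs log2(texture width) integer bits, plus the desired subpixel bits, plus however many bits of accumulated error you want to tolerate, and the total has to fit in the format's mantissa (stored bits plus the hidden bit):

```python
# Rough mantissa-bit budget for addressing a large texture, as discussed in
# the texture-addressing point above.
from math import ceil, log2

MANTISSA_BITS = {"FP16": 10 + 1, "FP24": 16 + 1, "FP32": 23 + 1}  # incl. hidden bit

def bits_needed(texture_width, subpixel_bits, error_bits):
    # Integer texel bits + subpixel precision + headroom for accumulated error.
    return ceil(log2(texture_width)) + subpixel_bits + error_bits

need = bits_needed(2048, subpixel_bits=2, error_bits=4)  # 11 + 2 + 4 = 17
for fmt, have in MANTISSA_BITS.items():
    ok = "enough" if have >= need else "not enough"
    print(f"{fmt}: {have} bits available, {need} needed -> {ok}")
```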
The last one is the most interesting, because it brings up another important point about hardware tradeoffs: they have to take into account the prevailing performance environment they will be used in. This is particularly important in realtime graphics, because there is a very narrow target you are shooting for: a realized fillrate of 40-150 million pixels per second. Anything more than that is essentially wasted; much less, and you might as well not bother.
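Just to show where that range comes from (the resolutions and refresh rates here are my own illustrative picks): realized fillrate is simply pixels per frame times frames per second.

```python
# Realized fillrate = pixels per frame * frames per second (overdraw ignored).
def realized_fillrate_mpix(width, height, fps):
    return width * height * fps / 1e6

print(realized_fillrate_mpix(1024, 768, 60))   # ~47 Mp/s, the low end
print(realized_fillrate_mpix(1600, 1200, 75))  # 144 Mp/s, near the high end
```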
Given that target range, and given the throughput of today's pixel shader implementations, shaders long enough to bring out precision artifacts in FP24 are pretty unlikely to arise in realtime use for the next few years. Which is not to say never: a shader might be particularly poorly behaved, or a game might get away with a couple really long shaders if they're used on a relatively small portion of the screen. And it certainly doesn't address non-realtime use, where the range of useful performance is much wider.
So what's the conclusion from all this rambling? In a sentence, FP24 is probably the best choice for current generation GPUs, but FP32 will be the best choice soon enough. Right now about the only thing FP24 can't handle well is positional coordinates; a couple generations down the road, however, GPUs will have the shader processing power to allow those long shaders which will bring out FP24's limitations in other uses as well. (After all, they won't be using their extra power for more pixels, because above ~150 Mp/s, there's no point.)
Plus, while the transistor overhead FP32 requires over FP24 might be a bad tradeoff in .15u and even .13u, as process technology improves it will look better and better; at .09u it's probably a shoo-in. Remember, it's not slower, it just takes more transistors; and transistor budgets are skyrocketing all the time. And there are other advantages to full FP32 support, most notably that it allows the unified pixel/vertex shader model Uttar and Demalion are always talking about.
As always, a particular hardware feature is almost never "good" or "bad" in isolation, but only when considered as a tradeoff between the manufacturing constraints and end-user environment it will spend its life in. The age-old "CISC vs. RISC" debate is a perfect example of this. Which is better? Neither: each was a function of the prevailing environment in its time.
"CISC" was the best choice throughout the 70s and early 80s, with a heyday in the late 70s, for a number of reasons: primarily that core memory was too expensive, so minimizing code size was of primary importance; but also because the process technology of the time didn't allow for significant on-chip register files, and the compiler technology of the time wasn't good enough for high-level languages to be a win over assembly for most uses.
RISC was the clear best choice from the mid 80s until recent times, but particularly in the early 90s. That's because memory became cheap enough that code bloat wasn't much of a problem; compilers became good enough for high-level languages to become the obvious choice, and to do the simple optimizations necessary for decent in-order RISC performance; and process technology was good enough to allow first large register files, then ever increasing levels of pipelining, then superscalar designs and then out-of-order designs, all of which were more easily realized with RISC than CISC architectures.
In the late 90s, CISC ISAs (well, x86) became increasingly competitive with RISCs, because transistor budgets had increased to the point where CISC-to-"RISC" decoders could be stuck on the front-end, thus allowing all the design benefits of RISC (easy pipelining, superscalar, and OoO) for an increasingly negligible silicon cost; and because the increasing importance of system bandwidth as a bottleneck meant CISC's code-size advantage counted for something again. Looking to the future, it appears that compiler advances will indeed bring Intel's much-maligned EPIC (plus the VLIWs that are increasingly moving into the media-processor space) significant implementation-normalized advantages over competing architecture philosophies.
No approach is "better"; rather they can all only be judged in terms of the times they were designed for. During crossover periods there is certainly much valid debate over the best solution for the time; but for the most part such discussions are more a matter of "when" and not "if". Which is not to say that having an "ahead of its time" design is a good thing; in hardware, being ahead of your time is just as much of a sin as being behind the times.