FP - DX9 vs IEEE-32

Deflection said:
K.I.L.E.R said:
I would LOVE to see an instance where FP32 has an advantage over FP24.

Humus's Mandelbrot demo. It's basically a worst-case-scenario demo where precision errors can "spiral" out of control (pardon the pun :)). Even there you really have to zoom in to see it. Kind of like the SS rotating floors for AF on the Radeon. The stuff we've seen so far suggests that FP24 can handle pretty much everything out there to a very acceptable level.

What I'm not sure of is whether the same can be said of FP16. ATI and MS don't seem to think so, but Humus's demo can't really be used to judge that because it is a worst-case scenario. The 3DMark demo did show differences too under close examination. The question is, does it fall more on the side of "worst-case scenario" or "real games will see results like these"? The framerates are rather low, which implies intensive shaders that might not make it into DX9 games. Some people on this forum have said textures need the extra precision, but I don't have the knowledge to judge that.

In any case, we're just now starting to see DX8 pixel shader games. I think it's safe to say the R300 is the best DX9 design so far, but it's tough to say by how much without the games to compare.

Is there any other demo? Like a small level with smoke, etc... that would also show the difference between FP32 and FP24?
 
One thing I don't understand about this entire discussion is that even though Nvidia currently supports FP32, you can hardly say it's a practical and usable feature. The question of whether FP32 is even needed has been brought up a few times.

How can one say that offering FP32 to developers is the better option if that feature is completely impractical and too slow for anything but a flash-in-the-pan effect in a game? Why is ATI's choice of FP24 any different from Nvidia's choice to only partially support FP32?
This comment specifically concerned me (though I have read every word of this thread):
More instruction slots and more registers are needed right now, not really more precision. Of course, GFFX has all three, so Nvidia at least did something right with it.
How do you figure??? If ATI is limited (in R300) by instruction count, then Nvidia is clearly limited by precision, because they can't offer that added precision in a way that can be widely used in a game. They have to revert to FP16 or even lower modes than that. What are you possibly coding that requires more instructions than what R300 currently supports?? Hell, look at HL2 and some of its effects, yet it is running great on an R350.

Take any PS 2.0 based application or demo currently existing and ATI's hardware runs it faster. Can you tell me that the GFFX you are developing on is actually handling a large number of instructions at playable frame rates currently? What about the FX5600 or 5200? How are they supposed to deal with these really long shaders you seem to be working on? The evidence so far is not there to back it up, from what I can see. Hell, Nvidia is even complaining that 3DMark03 is coded with too many redundant instructions.

I guess the big issue for me is: how can one complain that ATI is perhaps holding everyone back by going with FP24, when at the same time the competition just released an entire line of cards that might as well be FP24 for all practical purposes?
 
I'm not a techie like you guys, and I'm sure more is always better when all things are considered. But when you can't use something, it becomes worthless. IMHO Nvidia is doing more harm than good. They included support for something that can't run at any decent speed. Knowing this, they added support for something much lower than the spec called for, but left in the slow one to be compliant. Wouldn't it have been much better for the advancement of games if Nvidia had supported 24-bit and offered it at a decent speed?
 
FP32 is not inherently slower than FP24, FP16, or FPwhatever. It just takes more silicon to support it natively. This means more transistors in your ALUs (about 2x as many for a multiplier); more transistors spent on registers to hold the same number of elements; and wider datapaths. On the other hand, note that R3xx already supports FP32 throughout most of its pipeline--the vertex shader is of course all FP32, and the pixel shaders load and store in FP32 format, so you're already taking the hit in terms of bandwidth, and possibly cache (?), up until you get to the pixel shaders themselves.
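
A quick back-of-the-envelope check on the "about 2x as many for a multiplier" figure (my arithmetic, not Dave H's, assuming ATI's FP24 is 1 sign + 7 exponent + 16 mantissa bits, and an array multiplier whose cost scales with the square of the significand width):

```python
# Back-of-the-envelope check on the transistor-cost claims. Assumed
# formats: ATI FP24 = 1 sign + 7 exponent + 16 mantissa bits, and
# IEEE FP32 = 1 + 8 + 23. An array multiplier's partial-product count
# grows with the square of the significand width.

fp24_sig = 16 + 1   # explicit mantissa bits plus the implicit leading 1
fp32_sig = 23 + 1

print(f"multiplier cost ratio FP32/FP24: {fp32_sig**2 / fp24_sig**2:.2f}x")  # ~1.99x
print(f"register width ratio  FP32/FP24: {32 / 24:.2f}x")                    # ~1.33x
```

The same two ratios show up below as "2x bigger multipliers, and 33% bigger registers".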

As for NV3x, FP16 and FP32 run at exactly the same speed, except for the issue of register file usage. NV3x's pixel shader pipeline is rather ridiculously poor in full-speed temp registers; indeed, testing shows a shader can only address 256 bits of registers without taking a performance hit. This equates to 4 FP16 values (16 bits * 4 components = 64 bits per value), which is bad enough, but only 2 FP32 values, which is downright awful. In fact, it's so awful that the best explanation I've seen is that they must have something buggy in their implementation which is being worked around by using a number of should-be GPRs as special-purpose registers, thus taking them off the table. But the point is, if this presumed bug were fixed, or (if NV3x really were this register-poor by design) the design were a bit less braindead, NV3x's FP32 performance would be the same as its FP16 performance in the majority of cases. Still not great, to be sure, but nothing to do with FP32 being inherently slow.

As with all matters of computer performance, this is all about tradeoffs. And as with most matters of graphics performance (indeed, probably most matters of computer performance in general), the balance of the tradeoffs is mostly a function of contemporary process technology. As Moore's Law rolls along, the proper tradeoff inevitably changes from one side to the other. That is, the primary fact of hardware engineering is that your transistor budget roughly doubles every 1.5 years. At some point, tradeoffs you decided against because you couldn't justify the transistor expense eventually become worthwhile.

Of course, since graphics is an embarrassingly parallel problem, there is almost always a good default way to use up one's ever-increasing transistor budget: just slap on more pixel pipelines, TMUs or vertex units. This is quite a different situation from CPUs, where most problems are not embarrassingly parallel and thus extra transistors are used in periodic (every 5 years or so) redesigns for ever-more-complicated control to try to extract as much parallelism as possible from code, and in between are just donated to more and more cache. Unfortunately, the default uses of extra transistors--more pipes on a GPU, and more cache on a CPU--are subject to diminishing returns on many applications; with GPUs, eventually adding more pipes will get you nothing because you are bandwidth limited. (And indeed most GPUs already have more or less enough pipes for their available bandwidth.)

So there is a space in which such tradeoffs are evaluated: transistor budget (which is itself a tradeoff of performance vs. functionality vs. manufacturing cost) vs. the benefits of the feature vs. the benefits of the default use for extra transistors (i.e. extra pipelines, or perhaps some other worthy feature). So, getting back to FP32 vs. FP24: we already know the transistor cost (2x bigger multipliers, and 33% bigger registers); what are the benefits?

There are basically four issues (a rough numerical sketch of the second and fourth follows the list):
  • color fidelity: essentially no need for anything more than FP24, or indeed anything more than FP16. After all, the colors are going to be output on an FX8 monitor for the foreseeable future (maybe FX10 sometime soon)
  • texture addressing: a 2048x2048 texture at 32 bits per texel runs 16MB (which is to say, nothing larger will be used for quite some time); FP24 can accumulate 4 bits of error and still address such a texture accurate to 2 subpixel precision bits. On the other hand, FP16 is not sufficient to address large textures with subpixel accuracy; it is for precisely this reason that PS 2.0 and ARB_fragment_program both require at least FP24 support. (Sireric makes exactly this point early in the thread)
  • world-space coordinates: these basically need to be FP32 for any sort of accuracy over large distances (which is why they are FP32 in the vertex pipeline). To the extent you want to use positional coordinates as input to a pixel shader (Rev discusses a perfect example here), you may be able to get away with FP24 with some hacks or restrictions, but in many situations you will get artifacts.
  • accumulated error: the longer a shader is, the more error can build up; FP24 shaders of moderate length may start to give incorrect answers for texture addressing, and eventually even for color output (although in practice you'd need a really long shader for that).
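
To put rough numbers on the second and fourth issues, here is a quick Python sketch (my illustration, not from the thread). It models each format by explicit mantissa width alone -- 10 bits for FP16, 16 for FP24 (assuming ATI's layout), 23 for FP32 -- and ignores exponent range and real rounding hardware:

```python
import math

def quantize(x, mantissa_bits):
    """Round x to a float with the given number of explicit mantissa bits."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                 # x = m * 2**e, 0.5 <= |m| < 1
    scale = 2.0 ** (mantissa_bits + 1)   # +1 for the implicit leading bit
    return math.ldexp(round(m * scale) / scale, e)

# Issue 2, texture addressing: a coordinate scaled to a 2048-texel texture.
# FP16's worst-case error is ~0.5 texel (no subpixel accuracy at all);
# FP24 is still accurate to better than 1/100 of a texel.
u = 0.9371                               # arbitrary normalized coordinate
for name, bits in (("FP16", 10), ("FP24", 16), ("FP32", 23)):
    addr = quantize(u, bits) * 2048.0
    print(f"{name}: {addr:.5f} (error {abs(addr - u * 2048.0):.5f} texels)")

# Issue 4, accumulated error: 200 dependent multiply-adds drift apart
# at different precisions.
acc = {bits: 1.0 for bits in (10, 16, 23)}
for _ in range(200):
    for bits in acc:
        acc[bits] = quantize(acc[bits] * 1.003 - 0.001, bits)
print({bits: f"{value:.6f}" for bits, value in acc.items()})
```
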
The last one is the most interesting, because it brings up another important point about hardware tradeoffs: they have to take into account the prevailing performance environment they will be used in. This is particularly important in realtime graphics, because there is a very narrow target you are shooting for: a realized fillrate of 40-150 million pixels per second. Anything more than that is essentially wasted; much less, and you might as well not bother.
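
For what it's worth, that window falls out of common resolutions multiplied by playable frame rates (my arithmetic, assuming roughly one shaded pixel per screen pixel per frame, i.e. no overdraw):

```python
# Where a 40-150 Mpix/s target can come from: resolution x frame rate,
# assuming ~1 shaded pixel per screen pixel per frame (no overdraw).
for w, h, fps in ((1024, 768, 60), (1600, 1200, 75)):
    print(f"{w}x{h} @ {fps} fps = {w * h * fps / 1e6:.0f} Mpix/s")
# -> 47 Mpix/s and 144 Mpix/s, bracketing the range above
```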

Given that target range, and given the throughput of today's pixel shader implementations, shaders long enough to bring out precision artifacts in FP24 are pretty unlikely to arise in realtime use for the next few years. Which is not to say never: a shader might be particularly poorly behaved, or a game might get away with a couple really long shaders if they're used on a relatively small portion of the screen. And it certainly doesn't address non-realtime use, where the range of useful performance is much wider.

So what's the conclusion from all this rambling? In a sentence, FP24 is probably the best choice for current generation GPUs, but FP32 will be the best choice soon enough. Right now about the only thing FP24 can't handle well is positional coordinates; a couple generations down the road, however, GPUs will have the shader processing power to allow those long shaders which will bring out FP24's limitations in other uses as well. (After all, they won't be using their extra power for more pixels, because above ~150 Mp/s, there's no point.)

Plus, while the transistor overhead FP32 requires over FP24 might be a bad tradeoff in .15u and even .13u, as process technology improves it will look better and better; at .09u it's probably a shoo-in. Remember, it's not slower, it just takes more transistors; and transistor budgets are skyrocketing all the time. And there are other advantages to full FP32 support, most notably that it allows the unified pixel/vertex shader model Uttar and Demalion are always talking about.

As always, a particular hardware feature is almost never "good" or "bad" in isolation, but only when considered as a tradeoff between the manufacturing constraints and end-user environment it will spend its life in. The age-old "CISC vs. RISC" debate is a perfect example of this. Which is better? Neither: each was a function of the prevailing environment in its time.

"CISC" was the best choice throughout the 70s and early 80s, with a heyday in the late 70s, for a number of reasons: primarily that core memory was too expensive, so minimizing code size was of primary importance; but also because the process technology of the time didn't allow for significant on-chip register files, and the compiler technology of the time wasn't good enough for high-level languages to be a win over assembly for most uses.

RISC was the clear best choice from the mid 80s until recent times, but particularly in the early 90s. That's because memory became cheap enough that code bloat wasn't much of a problem; compilers became good enough for high-level languages to become the obvious choice, and to do the simple optimizations necessary for decent in-order RISC performance; and process technology was good enough to allow first large register files, then ever-increasing levels of pipelining, then superscalar designs and then out-of-order designs, all of which were more easily realized with RISC than with CISC architectures.

In the late 90s, CISC ISAs (well, x86) became increasingly competitive with RISCs, because transistor budgets had increased to the point where CISC-to-"RISC" decoders could be stuck on the front-end, thus allowing all the design benefits of RISC (easy pipelining, superscalar, and OoO) for an increasingly negligible silicon cost; and because the increasing importance of system bandwidth as a bottleneck meant CISC's code-size advantage counted for something again. Looking to the future, it appears that compiler advances will indeed bring Intel's much-maligned EPIC (plus the VLIWs that are increasingly moving into the media-processor space) significant implementation-normalized advantages over competing architecture philosophies.

No approach is "better"; rather they can all only be judged in terms of the times they were designed for. During crossover periods there is certainly much valid debate over the best solution for the time; but for the most part such discussions are more a matter of "when" and not "if". Which is not to say that having an "ahead of its time" design is a good thing; in hardware, being ahead of your time is just as much of a sin as being behind the times.
 
Dave H said:
FP32 is not inherently slower than FP24, FP16, or FPwhatever. It just takes more silicon to support it natively.
Are you sure that it's possible to keep the calculation time constant just by throwing even more silicon at the problem?
I'm not convinced - surely it must still be slower.
 
Dave H said:
[*]world-space coordinates: these basically need to be FP32 for any sort of accuracy over large distances (which is why they are FP32 in the vertex pipeline). To the extent you want to use positional coordinates as input to a pixel shader (Rev discusses a perfect example here), you may be able to get away with FP24 with some hacks or restrictions, but in many situations you will get artifacts.
This may make FP32 a shoo-in for the gen4 designs (NV4x, R4xx), if they have unified shader architectures. That is, they really need at least FP32 for vertex positions, so if the same hardware is used for PS ops...
 
Dave H said:
In the late 90s, CISC ISAs (well, x86) became increasingly competitive with RISCs, because transistor budgets had increased to the point where CISC-to-"RISC" decoders could be stuck on the front-end, thus allowing all the design benefits of RISC (easy pipelining, superscalar, and OoO) for an increasingly negligible silicon cost; and because the increasing importance of system bandwidth as a bottleneck meant CISC's code-size advantage counted for something again. Looking to the future, it appears that compiler advances will indeed bring Intel's much-maligned EPIC (plus the VLIWs that are increasingly moving into the media-processor space) significant implementation-normalized advantages over competing architecture philosophies.

Not graphics related, so OT, but this is a pet peeve of mine.

CISC has *zero* code size advantage over RISC. Compare ARM Thumb or MIPS16 to x86 and you'll find the latter losing. The average instruction size of the new x86-64 is 5 bytes. Yes, you can have a memory operand in there, but at the same time you only have a 2-address instruction format and fewer registers, so you'll end up with more instructions shuffling data around than on a typical RISC.

Also, decoding IA-32 into uOps does not take negligible resources. Decoders are either big and power-hungry (Athlon) or less power-hungry but even bigger (P4, with its trace cache). A 21264 core is half the die size of the P4 in a similar process and yet has higher performance. The success of x86 is solely due to economy of scale, which has allowed the companies behind the MPUs to pour $$$$ into process and uarch development while still maintaining a price/performance edge.

Finally: the compiler advancements that will benefit EPIC (VLIW) will also benefit every single other architecture out there. The only thing EPIC has going for it is the large register file, and with SMT becoming ever more popular even that is looking likely to be a liability (big-ass context -> fewer contexts juggled at the same time -> lower throughput).

Cheers
Gubbi
 
Simon F said:
Dave H said:
FP32 is not inherently slower than FP24, FP16, or FPwhatever. It just takes more silicon to support it natively.
Are you sure that it's possible to keep the calculation time constant just by throwing even more silicon at the problem?
I'm not convinced - surely it must still be slower.

Well, for addition, it can certainly be made constant. Addition is O(n), and with n adders you can do it in constant time. For multiplication, it depends on how much logic you want to burn. Simple multiplication is O(n^2), but that can be lowered to O(n^lg 3), or O(n lg n) if you use fast Fourier techniques. (However, those tricks only pay off on truly large numbers, useful for huge number-theoretic math and cryptography.)

What this means is that multiplication requires a quadratic increase in circuit complexity if you want to preserve constant speed. I recall that there are two ways this is implemented today: Booth encoding with arrays of adders, and Wallace trees. You can preserve constant speed as long as your critical path doesn't get too long. 32-bit multiplication with single cycle throughput is a mature technology, and so yes, throwing silicon at the problem has resulted in constant speed.
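
For the curious, the O(n^lg 3) scheme mentioned above is Karatsuba multiplication. A minimal sketch (my illustration): three half-width multiplies replace four, a trick which, as noted, only pays off for operands far wider than any GPU significand:

```python
# Minimal Karatsuba multiply on Python ints, the O(n^lg 3) scheme.
def karatsuba(x, y):
    if x < 16 or y < 16:                      # small base case: use hardware
        return x * y
    n = max(x.bit_length(), y.bit_length())
    half = n // 2
    xh, xl = x >> half, x & ((1 << half) - 1)  # split into high/low halves
    yh, yl = y >> half, y & ((1 << half) - 1)
    a = karatsuba(xh, yh)                      # high * high
    b = karatsuba(xl, yl)                      # low * low
    c = karatsuba(xh + xl, yh + yl) - a - b    # cross terms, one multiply
    return (a << (2 * half)) + (c << half) + b

assert karatsuba(123456789, 987654321) == 123456789 * 987654321
```
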

Whether this could be extended further (say, FP64 or FP128) is unknown.


Edit: BTW, I just found this on Google; it should be educational: http://www-2.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15828-s98/lectures/0126/sld001.htm
 
There's also the problem of power. An FP32 FMADD unit will use twice the power of an FP24 one, and with current (and future) power densities, this is likely to impact performance negatively.

Cheers
Gubbi
 
DemoCoder said:
Simon F said:
Dave H said:
FP32 is not inherently slower than FP24, FP16, or FPwhatever. It just takes more silicon to support it natively.
Are you sure that it's possible to keep the calculation time constant just by throwing even more silicon at the problem?
I'm not convinced - surely it must still be slower.

Well, for addition, it can certainly be made constant. Addition is O(n), and with n adders you can do it in constant time.

I'm still not quite following you. With an N-bit carry-save adder you can do MOST of the add in constant time, but surely you eventually have to resolve the carries, which is going to be at least an O(log n) operation, or perhaps even linear.

For multiplication, it depends on how much logic you want to burn. Simple multiplication is O(n^2), but that can be lowered to O(n^lg 3), or O(n lg n) if you use fast Fourier techniques. (However, those tricks only pay off on truly large numbers, useful for huge number-theoretic math and cryptography.)
I'm aware of the other methods, but I don't see that you can completely trade off time vs. area....

What this means is that multiplication requires a quadratic increase in circuit complexity if you want to preserve constant speed. I recall that there are two ways this is implemented today: Booth encoding with arrays of adders, and Wallace trees. You can preserve constant speed as long as your critical path doesn't get too long.
Ahh, there you have it: "as long as your critical path doesn't get too long". The time is not constant.

I'll have a look shortly.
 
Constant speed is O(1), not O(n). With adders, multipliers, and barrel shifters alike (the basic circuits from which FP units are made), since the MSB of the result potentially depends on all the bits of the inputs, you can't get lower than O(log n) gate delays (and O(n) interconnect delay) no matter how you design the circuit, which is hardly constant.

FP24 will always be faster than FP32 if you spend equal effort on optimizing them, but the speed difference will be small, perhaps 5% or so in present-day processes. Similarly, FP16 will be only slightly faster than FP24 in turn. These small differences often disappear once you align the circuit timings to a clock.
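
To illustrate both points with a toy (mine, not arjan's): a Kogge-Stone-style parallel-prefix carry network resolves all carries in ceil(log2 n) combining levels, and notably a 17-bit and a 24-bit significand (FP24 vs FP32, assuming ATI's layout) need the same number of levels, which fits the "small difference" observation:

```python
# Parallel-prefix (Kogge-Stone-style) carry resolution: O(log n) levels.
# Widths are illustrative; this is a software model, not a real design.
from math import ceil, log2

def prefix_add(a, b, n):
    g = [(a >> i & 1) & (b >> i & 1) for i in range(n)]  # generate bits
    p = [(a >> i & 1) ^ (b >> i & 1) for i in range(n)]  # propagate bits
    levels, d = 0, 1
    while d < n:                      # each level combines pairs 2^k apart
        g = [g[i] | (p[i] & g[i - d]) if i >= d else g[i] for i in range(n)]
        p = [p[i] & p[i - d] if i >= d else p[i] for i in range(n)]
        d *= 2
        levels += 1
    carries = [0] + g[:-1]            # carry into bit i = generate of [0..i-1]
    s = sum(((a >> i & 1) ^ (b >> i & 1) ^ carries[i]) << i for i in range(n))
    return s, levels

for n in (17, 24):                    # FP24 vs FP32 significand widths
    s, levels = prefix_add(123, 456, n)
    assert s == (123 + 456) % (1 << n)
    print(f"{n}-bit add: {levels} prefix levels (ceil(log2 {n}) = {ceil(log2(n))})")
```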
 
What is the expected time it takes to calculate an FP32 addition or multiplication?
Can one expect that a 0.15 micron circuit is able to execute 150M, 300M or 600M of such operations per second?

I ask this because we usually know only the throughput, not the latency. Latency is usually hidden in GPUs by processing multiple {vertices / fragments} in parallel.

I was surprised to find out that on the GF3 it takes 6 cycles to execute a VS instruction.

Can it be the same for FP pixel shader architectures?
Can it be that the NV30 is not so sensitive to texture lookup latency because the FP units have similar latency, so it doesn't really matter whether you use one or the other?
 
arjan de lumens said:
Constant speed is O(1), not O(n). With adders, multipliers, and barrel shifters alike (the basic circuits from which FP units are made), since the MSB of the result potentially depends on all the bits of the inputs, you can't get lower than O(log n) gate delays (and O(n) interconnect delay) no matter how you design the circuit, which is hardly constant.
Wouldn't pipelining make the delays inconsequential? If you have a 4-stage tree, calculate each stage per clock. You have higher latency, but you're generally working with a long stream of data, so it shouldn't matter.
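
A toy model of this point (my sketch): total time is pipeline depth plus one cycle per item, so over a long stream the throughput approaches one result per clock no matter how many stages you add:

```python
# Pipelining hides per-op latency over a long stream of independent work:
# total cycles = pipeline depth + items - 1, so throughput -> 1 op/cycle.
def pipeline_cycles(items, stages):
    return stages + items - 1

for stages in (1, 4, 8):
    total = pipeline_cycles(1_000_000, stages)
    print(f"{stages}-stage pipe: {total} cycles (~{1_000_000 / total:.6f} ops/cycle)")
```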
 
Not 100% sure how fast an FP32 unit is in the various processes, but I can do some guesswork: in highly custom logic on a high-leakage process, the Athlon XP @ 0.13 can do an FP32 mul or add with a latency of 4 clock cycles @ 2.25 GHz = ~1.8 ns. For standard logic at TSMC/UMC 0.13, I would estimate it takes about twice as long, giving about 3.5 ns. If you are doing fused multiply-add, you may add perhaps 50% to the delay for ~5-6 ns. If you are doing DOT3, RCP or other composite operations, I would expect the delay of a fused-multiply-add + a standard add + a little bit more, giving ~8-11 ns. Numbers get even worse for 0.15 micron. In any case, these operations will have a latency of several clock cycles if you want to reach a remotely reasonable clock speed. These are latency numbers; you can get the throughput as high as you want with sufficiently deep pipelining.

As for texture lookup, this operation is in turn far slower and more complex than any FP32 arithmetic operation - the operations for a texture lookup go approximately as follows (a toy Python model follows the list):
  • Compare texture coordinates and divide the smaller ones by the larger one (if doing cube-mapping)
  • Determine mipmap level (measure differences in texture coordinates between adjacent pixels, then do calculations involving many multiplies and at least one logarithm)
  • Scale and wrap/clamp the texture coordinates (simple)
  • Look up the needed texels from the texture cache. The delay through this stage will be much larger than going through a standard CPU cache (on the order of ~1 external memory latency), or else you won't be able to mask memory latency and overlap cache line fills, ruining performance.
  • Perform bi/trilinear interpolation on the resulting texels.
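
To make steps 2-5 concrete, here's a toy Python model (my sketch; real hardware uses fixed-point datapaths and fancier LOD math, and the texel fetch below stands in for step 4 without modeling the cache):

```python
import math

def mip_level(du_dx, dv_dx, du_dy, dv_dy, tex_size):
    # Step 2: mip level from coordinate derivatives -- the "many
    # multiplies and at least one logarithm" (simple isotropic formula).
    rho = max(math.hypot(du_dx * tex_size, dv_dx * tex_size),
              math.hypot(du_dy * tex_size, dv_dy * tex_size))
    return max(0.0, math.log2(rho)) if rho > 0 else 0.0

def bilinear(texture, u, v):
    # Step 3: scale the coordinates to texel space (wrap handled below).
    h, w = len(texture), len(texture[0])
    x, y = u * w - 0.5, v * h - 0.5
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0                   # subpixel weights

    def texel(ix, iy):                        # step 4: fetch, with wrap
        return texture[iy % h][ix % w]

    # Step 5: bilinear interpolation of the four neighbouring texels.
    top = texel(x0, y0) * (1 - fx) + texel(x0 + 1, y0) * fx
    bot = texel(x0, y0 + 1) * (1 - fx) + texel(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bot * fy

tex = [[(x ^ y) & 0xFF for x in range(8)] for y in range(8)]  # toy texture
print(bilinear(tex, 0.3, 0.7), mip_level(0.25, 0, 0, 0.25, 8))
```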
 
Would that mean that the FP32/texture unit (the FPU cannot work on shader and texture ops concurrently) present within the NV3x architecture is most likely composed of much more than 4 FMADs/inverse logic units (it can achieve 2 texture lookups per clock cycle)? In fragment program mode, NV3x does not compute texture LOD with the common tex instruction, and requires txd (a texture lookup which references computed partial derivatives) for proper LOD (?).
 
RussSchultz said:
Wouldn't pipelining make the delays inconsequential? If you have a 4-stage tree, calculate each stage per clock. You have higher latency, but you're generally working with a long stream of data, so it shouldn't matter.
But that's another tradeoff. You are right in that the likely way to do it would be to add more pipe stages to absorb the added propagation delays, but then the architecture has to be retuned to absorb the added latencies, which may add additional costs and/or impact performance.
 
arjan de lumens said:
If you are doing DOT3, RCP or other composite operations, I would expect the delay of a fused-multiply-add + a standard add + a little bit more, giving ~8-11 ns.

For DOT3 I'd be surprised if it took that much.

FP add contains the following ops:
1. Compare the exponents and determine the amount of mantissa shift.
2. Shift the mantissa of one of the numbers.
3. Add the two mantissas together.
4. Find the highest set bit in the result.
5. Shift the mantissa to renormalize.

DOT3 requires a three-parameter add, instead of the two-parameter add used in MAD operations.
Stages 2, 4 and 5 should take exactly the same time; only stages 1 and 3 take more time, but I'm not sure it matters as much as "a standard add + a little bit more".
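
To make the five stages concrete, here's a toy walk-through on positive (mantissa, exponent) pairs (my sketch, assuming a 17-bit significand per ATI's FP24 layout; signs, rounding and specials are skipped, and DOT3's three-input add would widen stage 3):

```python
SIG = 17  # significand bits incl. the implicit leading 1 (FP24 assumption)

def fp_add(m1, e1, m2, e2):
    # 1. Compare the exponents and determine the mantissa shift amount.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    shift = e1 - e2
    # 2. Shift the mantissa of the smaller number into alignment.
    m2 >>= shift
    # 3. Add the two mantissas together.
    m, e = m1 + m2, e1
    # 4. Find the highest set bit in the result.
    top = m.bit_length()
    # 5. Shift the mantissa to renormalize (adjusting the exponent).
    if top > SIG:
        m >>= top - SIG
        e += top - SIG
    return m, e

# 1.0 * 2^0 + 1.0 * 2^-3 with a 17-bit significand
m, e = fp_add(1 << (SIG - 1), 0, 1 << (SIG - 1), -3)
print(m, e, m / (1 << (SIG - 1)) * 2.0 ** e)   # -> 1.125
```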

OTOH, I don't know how RCP is implemented - how much of it is a table lookup, and how much more work is done.
 
arjan de lumens said:
Not 100% sure how fast an FP32 unit is in the various processes, but I can do some guesswork
I doubt that's really comparable - there are so many uncertainties involved. One major point omitted is that the x86 FP unit on the Athlon is FP80, not FP32, for that 4-clock latency.
 