FP - DX9 vs IEEE-32
Reverend
03-Jun-2003, 03:31
I had been playing around much with a R300 and a NV34 and something that has continuously stayed at the back of my mind was the apparent murky differences between DX9's FP "specs" and that of IEEE-32.
The thing that primarily stood out to me was the R300's 24-bit FP internal pixel pipeline vs IE3-32. The more I think about it, the more I favour the thought that the R300's designers wanted a DX9-compliant feature yet they definitely do know what IE3 is. Maybe it's a combination of transistors, a trusted process technology they feel they must go with and, well, performance. The R300 should've been IE3-32, not DX9-minimum-spec-compliancy IMHO.
Why?
IEEE defines all basic arithmetic operations (add, subtract, multiply, divide, square root) as producing the properly rounded 32-bit version of the (theoretical) exact result. No problems here with DX9.
But when you get into compound operations like mulad, dp4, etc., the DX9 spec doesn't specify an "order of operations", so the result may differ depending on (for example) whether mulad is round(a+round(b*c)) or round(a+b*c), etc. Further, the trig operations aren't well-defined at all, so implementations may differ. In this day and age, when much of the functionality you'd expect (such as blending of floating-point textures) remains unimplemented, this is not at all a limiting factor. But in a few years, when the more dire problems have been solved, this one will start cropping up as a problem. Eventually, people will expect basic hardware operations to produce precisely-defined results, regardless of hardware manufacturer, control panel settings, time of day, etc.
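A minimal sketch of the mulad ambiguity, assuming a 24-bit significand (IEEE single precision) and round-half-to-even. The helper names are made up for illustration, exact rationals stand in for the hardware datapath, and exponent range is not modelled at all.

```python
from fractions import Fraction

def round_to_bits(x, bits):
    """Round an exact value to `bits` significant bits (round-half-to-even)."""
    if x == 0:
        return x
    sign = -1 if x < 0 else 1
    m = abs(x)
    # Find e such that 2**e <= m < 2**(e + 1).
    e = m.numerator.bit_length() - m.denominator.bit_length()
    if Fraction(2) ** e > m:
        e -= 1
    scale = Fraction(2) ** (bits - 1 - e)
    # round() on a Fraction rounds halves to the even integer.
    return sign * Fraction(round(m * scale)) / scale

def mad_two_roundings(a, b, c, bits):
    """round(a + round(b*c)): the product is rounded before the add."""
    return round_to_bits(a + round_to_bits(b * c, bits), bits)

def mad_one_rounding(a, b, c, bits):
    """round(a + b*c): only the final result is rounded."""
    return round_to_bits(a + b * c, bits)

BITS = 24  # significand width of IEEE single precision
b = c = 1 + Fraction(1, 2**12)
a = -(1 + Fraction(1, 2**11))
print(mad_two_roundings(a, b, c, BITS))  # 0
print(mad_one_rounding(a, b, c, BITS))   # 1/16777216
```

Same inputs, same format, and the two readings of mulad disagree: one returns zero, the other 2^-24.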
Obviously, the R300 is a good part and ATi probably felt that their decision to go with FP24 internally is a good choice and I wouldn't really disagree with that because it is the first available "DX9 part". Hey, we gotta start somewhere! However, one has to wonder why they didn't have full FP32 internally and via drivers do some possible, er, optimizations depending on the scenarios, but leave the options open.
Am I asking for too much from the R300 given the current timeframe? Am I being a bit critical of DX9 in this aspect?
AndrewM
03-Jun-2003, 04:21
I remember that ATI actually did some similar funk with the Rage. They seemed to take less of a performance hit going from 16bit to 32bit (framebuffer), but it turned out that they were cutting corners on depth precision. It's not really a _big_ issue, but it just highlights that they have done similar things in the past.
FP24 is a good middle ground (and other parts of the architecture are good performance wise), but you're right about the IEEE thing.
Edit: 16->32bit compared to nvidia's chips
Doomtrooper
03-Jun-2003, 04:24
Talk to M$..
Chalnoth
03-Jun-2003, 04:25
I'd just like to take this time to gripe on a major problem of DX9:
The documentation is crap! Ambiguities like you just spelled out should not exist. It would be just so much easier for everybody involved not to have to rely upon the reference rasterizer to check for adherence to the spec.
I hate me too posts, but I have to say that I agree.
There are a lot of gotchas in floating point programming, and IEEE at least hides some of them. It isn't needed for standard scanline graphics really, but it enables the gfx card to do other fun algorithms like matrix solvers and ray tracing. It turns it into an old school math co-processor. I wish that ATI could have crammed 32bit into the R300, but I would rather have R300 out when it was rather than wait until they could fit 32bit in as well.
I must say that even R300 was more than I was expecting. I expected 16bit/channel (64bit framebuffer) to be an intermediate step before "real" floating point came along.
If you think about it, there weren't too many artifacts at 8bpp, did we really need 16.7M times more accuracy? or even 65K times?
Edit: Oops. I meant 8bpc, not 8bpp. :oops:
AndrewM
03-Jun-2003, 05:46
If you think about it, there weren't too many artifacts at 8bpp, did we really need 16.7M times more accuracy? or even 65K times?
We do, but some people are getting their knickers in a knot over precision and what it means for Image Quality in terms of the games they're playing (or the benchmarks they're playing *snicker* :lol: ).
We don't have floating point framebuffers yet, but render-to-texture stuff is just as useful..
bloodbob
03-Jun-2003, 06:47
Umm, with a fixed palette, 8bpp images go downhill pretty quickly; when you can define the palette entries yourself, 8bpp images look pretty good. Still, with a 24bpp framebuffer (Z/stencil don't count towards image/alpha) I still see banding in games at times, which annoys me. The biggest place I see it is with fog (eww, it's horrible), so I'm still waiting for those 12-16 bit DACs and 12+ bit framebuffers.
I think full IEEE-754 compliancy is a bit too much for pixel shaders. The traps, for example, are not too useful for most pixel shaders. However, I think it's important to have a good precision (24 bits is good, but 32 bits is better, of course).
16 bits FP is enough for computing colors. However, you need to store something other than colors in pixel shaders. The most important example is perhaps position. 16 bits FP is also not enough for texture address.
To ensure good behaviour, I think it's important to have a guideline (or specification) for every instruction in shaders. IEEE 754 has only four basic operations (add, sub, mul, and div). 754R added fma, but shaders have more (such as dp3/dp4). The shader specification should specify how they round, their precision, what to do on overflow or underflow, etc. These are specified in 754 but we don't need them to be that complete. However, we also have to extend 754 for other "basic" operations such as dp3.
Currently it seems not that important to specify shader operations precisely. However, if you don't want to limit shaders to the simplest tasks, you'll need to do so and force all IHVs to obey these guidelines.
Ilfirin
03-Jun-2003, 08:28
whether mulad is round(a+round(b*c)) or round(a+b*c), etc.
Why would there be rounding at all? If a, b, and c, along with all the intermediate results are 24-bit then a+b*c = a+(b*c). But if you're saying that the r300 does the intermediate stuff at 32-bit and converts the results to 24-bit then yes, all the math would stop making numerical sense (granted probably never enough to notice).
I guess what I'm asking is, care to explain the quoted line further? :)
Chalnoth
03-Jun-2003, 08:33
If you think about it, there weren't too many artifacts at 8bpp, did we really need 16.7M times more accuracy? or even 65K times?
For two things:
1. Non-color data
2. Very long shaders that don't hide errors well (Mandelbrot sets, for example).
OpenGL guy
03-Jun-2003, 08:37
whether mulad is round(a+round(b*c)) or round(a+b*c), etc.
Why would there be rounding at all? If a, b, and c, along with all the intermediate results are 24-bit then a+b*c = a+(b*c). But if you're saying that the r300 does the intermediate stuff at 32-bit and converts the results to 24-bit then yes, all the math would stop making numerical sense (granted probably never enough to notice).
I guess what I'm asking is, care to explain the quoted line further? :)
If a and b are 24-bits, then a+b can require more than 24 bits of precision. Here's a real (dumb) example. a and b are both 1 bit. If a = b= 1. Then a + b = 10, which cannot be represented in a single bit. Here's another example. a and b are 2 bits, a = b = 11. Then a * b = 1001.
This is why most (many/all?) floating point units usually carry a little extra precision for intermediate results. However, when the data is written to memory, that extra precision is lost.
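The two toy examples above can be checked with plain integers. This is only a sketch of how the width grows; the function name is made up for illustration.

```python
def sum_and_product_widths(n):
    """Bits needed for the sum and the product of two maximal n-bit values."""
    largest = (1 << n) - 1
    return (largest + largest).bit_length(), (largest * largest).bit_length()

print(bin(0b1 + 0b1))              # 0b10   (1-bit inputs, 2-bit sum)
print(bin(0b11 * 0b11))            # 0b1001 (2-bit inputs, 4-bit product)
print(sum_and_product_widths(24))  # (25, 48)
```

So an n-bit add can need n+1 bits and an n-bit multiply up to 2n bits, which is where the extra intermediate precision goes.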
Ilfirin
03-Jun-2003, 08:45
If a and b are 24-bits, then a+b can require more than 24 bits of precision. Here's a real (dumb) example. a and b are both 1 bit. If a = b= 1. Then a + b = 10, which cannot be represented in a single bit. Here's another example. a and b are 2 bits, a = b = 11. Then a * b = 1001.
This is why most (many/all?) floating point units usually carry a little extra precision for intermediate results. However, when the data is written to memory, that extra precision is lost.
Oh right, I knew there was something, I just couldn't think of it (was thinking in terms of fractions, not overflows) while reading (hey, it's 3:45AM here).
Well, with that aside, shouldn't a MAD, or any other instruction (macros would of course be different) be done totally at full precision, with only the final answer rounded down to 24-bit?
And to answer the last part:
DX9 fp32 is stored in IEEE 754 format in framebuffers/textures. (Fp24 is converted to fp32 before it's stored.)
The calculations are not IEEE 754 compliant.
Simon F
03-Jun-2003, 09:03
I had been playing around much with a R300 and a NV34 and something that has continuously stayed at the back of my mind was the apparent murky differences between DX9's FP "specs" and that of IEEE-32.
Are you discussing the VS, PS, or both? The PS would need to be a lot faster (i.e. more highly parallel) than the VS and so is more likely to use lower precision (for silicon reasons). Besides, the accuracy of the VS is probably more critical anyway.
The more I think about it, the more I favour the thought that the R300's designers wanted a DX9-compliant feature yet they definitely do know what IE3 is.
Rev, I think (but don't quote me!) that today's silicon compilers have all the IEEE mul/add etc. functionality built in if you want to use it, so it's probably just an issue of how much silicon area you want to use.
IEEE defines all basic arithmetic operations (add, subtract, multiply, divide, square root) as producing the properly rounded 32-bit version of the (theoretical) exact result. No problems here with DX9.
I think DX9 specifies a lower minimum accuracy so that the chips don't get ridiculously large.
But when you get into compound operations like mulad, dp4, etc., the DX9 spec doesn't specify an "order of operations", so the result may differ depending on (for example) whether mulad is round(a+round(b*c)) or round(a+b*c), etc.
This is certainly one area where you don't want to follow the ref-rast. Its DP4 order of operations is exactly the one you wouldn't want to use in a hardware implementation!
Anyway, it's not a problem to have HW 'A' and HW 'B' use slightly different ordering provided each always uses a consistent ordering. For example, a little while ago, a leading manufacturer's drivers used different transform operations depending on whether lighting was turned off or on, which led to Z sorting issues.
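To make the ordering point concrete, here is a rough sketch with two hypothetical DP4 orderings, both emulated in IEEE single precision. Neither is claimed to be what the ref-rast or any particular chip does.

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def dp4_sequential(a, b):
    """((a0*b0 + a1*b1) + a2*b2) + a3*b3, rounding after every operation."""
    acc = f32(a[0] * b[0])
    for i in range(1, 4):
        acc = f32(acc + f32(a[i] * b[i]))
    return acc

def dp4_pairwise(a, b):
    """(a0*b0 + a1*b1) + (a2*b2 + a3*b3), rounding after every operation."""
    p = [f32(a[i] * b[i]) for i in range(4)]
    return f32(f32(p[0] + p[1]) + f32(p[2] + p[3]))

a = (1.0, 2.0**-24, 2.0**-24, -1.0)
b = (1.0, 1.0, 1.0, 1.0)
print(dp4_sequential(a, b))            # 0.0
print(dp4_pairwise(a, b) == 2.0**-24)  # True
```

Each ordering is perfectly repeatable on its own; they just don't agree with each other.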
Further, the trig operations aren't well-defined at all, so implementations may differ. In this day and age, when much of the functionality you'd expect (such as blending of floating-point textures) remains unimplemented, this is not at all a limiting factor. But in a few years, when the more dire problems have been solved, this one will start cropping up as a problem. Eventually, people will expect basic hardware operations to produce precisely-defined results,
I'm sure in a few years the next generation of specifications will probably tighten up the specification. I don't think it's necessary now.
Obviously, the R300 is a good part and ATi probably felt that their decision to go with FP24 internally is a good choice and I wouldn't really disagree with that because it is the first available "DX9 part". Hey, we gotta start somewhere! However, one has to wonder why they didn't have full FP32 internally and via drivers do some possible, er, optimizations depending on the scenarios, but leave the options open.
IANAHE, but I suspect that because multiplication is possibly an O(n^2) operation, a change from, say, a 16bit to 24 bit mantissa probably represents a 2.25x increase in area. That's a big jump.
The shader spec should IMO be left as flexible as possible. It shouldn't matter what happens on the least significant bits. A shader that relies on exact operation is faulty. Everyone knows that floating point has some twitchiness, even on CPUs with all the IEEE standards crapping down on it. You typically don't do exact compares, such as "if (a == b)" statements.
I want the driver and hardware to have as much flexibility as possible. If I write
float a = b / 3;
then I'm perfectly fine with it compiling that into a
MUL a, b, 0.3333333333
I would be disappointed if it inserted RCP instructions and slowed down my performance.
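For what it's worth, the two forms are not bit-identical. A quick sketch, emulating single precision on the CPU (the function names are just for illustration, and real shader hardware need not round like this):

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

ONE_THIRD = f32(1.0 / 3.0)  # the constant a compiler could fold in

def div_by_three(b):
    """b / 3 with a correctly rounded single-precision divide."""
    return f32(f32(b) / 3.0)

def mul_by_reciprocal(b):
    """b * 0.3333..., i.e. the MUL form."""
    return f32(f32(b) * ONE_THIRD)

print(div_by_three(3.0) == mul_by_reciprocal(3.0))  # True
print(div_by_three(5.0) == mul_by_reciprocal(5.0))  # False
```

For b = 5 the two results differ by one unit in the last place, which is exactly the kind of difference I'm saying a shader should not depend on.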
There's also all the nan (not a number) stuff in IEEE that is completely useless in a shader since you don't want pixels to be void, but rather just saturate values to max, min or round to zero. This is true for both vertex and pixel shaders.
Cheers
Gubbi
antlers
03-Jun-2003, 17:20
Yes, even though the FX uses the IEEE float format for FP32, I'd be really surprised if it did all its calculations to the IEEE standards.
sireric
03-Jun-2003, 17:25
About some misconceptions, and some comments:
1) IE^3 standards do not specify what should be returned for transcendental functions (sin, cos, etc...). They specify the format of data (including nans, infinities, denorms) and the internal roundings for results -- This rounding is not the f2i conversion, but how to compute the lsbs of the results. Different HW can return different results. People have learned to live with this. If you need 24b of mantissa precision, FP32 is not enough for you anyway.
2) IE^3 does not guarantee, in any way, that operations are order independent. For example, consider:
result = a + b + c
The above, generally, needs to be broken down into 2 operations. If we select a = 1.0, b = -1.0 and c = 2^(-30), then the result can be 0 or c. Depends on the implementation. IE^3 does not "specify" anything. The programmer needs to specify what he wants (i.e. (a+b)+c).
2.5) IE^3 support for nans, inf and denorms is just not needed in PS. In the final conversion to [0,1.0] range, inf and 2.0 would give the same output. For that matter, it's not needed in VS either.
3) FP24 has less precision than FP32, but is no worse in other characteristics (well, range is reduced by 2^63 as well). Order of operations would be no "worse" than fp32, beyond the precision limits.
4) What are the outputs of the shader? There are two: The 10b color or 11b texture address (actually, with subtexel precision for filters, you could expect the texture address to be up to 15b). With FP24, you get 17b of denormalized precision, which would allow you to have up to 8 (assuming 1/2 lsb of error per operation) ALU instructions at maximum error rate before even noticing any change in the texture address or texture filtering. Until texture sizes increase significantly (2kx2k right now, will give you a 16MB texture -- I don't know of many apps that even use that now), or texture filtering on FP textures exists, there really is no need for added precision. On the other hand, it's obvious that FP16 is not good enough, unless you have smallish textures (i.e. < 512x512).
5) The only exception to 4 above is for long procedural texture generation. In those cases you could expect that some fp24 limits would come into play. There's no real use of that out there right now, and we are probably years from seeing it in mainstream. Nevertheless, our Ashli program can take procedural texture generation code from Renderman, and generate long pixel shaders. We've generated shaders that are thousands of instructions long. What we found is that it looks perfect, compared to the original image. No one would ever complain about that quality -- It's actually quite amazing. So, empirically, we found that even in these cases, FP24 is a nearly perfect solution.
6) FP32 vs. FP24 would not only be 30% larger from a storage standpoint, the multipliers and adders would be nearly twice as big as well. That added area would have increased cost and given no quality benefits. We could have implemented a multi-pass approach to give higher precision, but we felt that simplicity was a much higher benefit. Given the NV3x PP fiasco, we even more strongly believe now that a single precision format is much better from a programming standpoint (can anyone get better than FP16 on NV3x?).
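The arithmetic behind the "up to 8 instructions" figure in point 4 can be sketched as below. It assumes errors simply add up linearly at the stated worst case per operation, and it assumes 11 and 24 significant bits for FP16 and FP32 (stored mantissa plus the implied leading bit); the function name is made up for illustration.

```python
def ops_before_visible(significand_bits, needed_bits, error_lsb_per_op=0.5):
    """Worst-case ALU ops before accumulated error reaches one step of the output."""
    # One step of the output is 2**(significand_bits - needed_bits) lsbs of the format.
    headroom_lsb = 2.0 ** (significand_bits - needed_bits)
    return int(headroom_lsb / error_lsb_per_op)

ADDRESS_BITS = 15  # 11b texture address plus subtexel bits, as in point 4
print(ops_before_visible(17, ADDRESS_BITS))  # FP24: 8
print(ops_before_visible(11, ADDRESS_BITS))  # FP16: 0
print(ops_before_visible(24, ADDRESS_BITS))  # FP32: 1024
```

A different error model (for example, errors that grow with exponent differences) would give different numbers; this only reproduces the simple budget described in point 4.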
At the end of it all, we believe we made the right engineering decision. We weighed all the factors and picked the best solution. DX9 did not come before the R300 and specify FP24. We felt that FP24 was the "right" thing, and DX9 agreed.
LeStoffer
03-Jun-2003, 19:20
6) FP32 vs. FP24 would not only be 30% larger from a storage standpoint, the multipliers and adders would be nearly twice as big as well.
Interesting info. The FP24 choice obviously had a lot to do with your silicon budget, but I didn't realize that FP32 is so demanding over FP24.
DegustatoR
03-Jun-2003, 19:33
sireric
Santa Clara, CA you say?.. Hmm... ;)
6) FP32 vs. FP24 would not only be 30% larger from a storage standpoint, the multipliers and adders would be nearly twice as big as well.
Interesting info. The FP24 choice obviously had a lot to do with your silicon budget, but I didn't realize that FP32 is so demanding over FP24.
Though hardly an expert in the subject I would maybe not have guessed it to be nearly twice as big, but easily > 50% larger. A quick thought over the operations involved in floating point math kinda hints that transistor count would grow closer to a square function of the bit-count rather than linearly.
sireric
Santa Clara, CA you say?.. Hmm... ;)
Like that: http://www.ati.com:80/companyinfo/careers/santaclara.html
LeStoffer
03-Jun-2003, 19:59
A quick thought over the operations involved in floating point math kinda hints that transistor count would grow closer to a square function of the bit-count rather than linearly.
Thanks, Humus. It kinda gives us a hint about what all those transistors went into on the NV3X architecture. Expensive stuff, some of those FP32 ALU units.
sireric
03-Jun-2003, 19:59
The multipliers area is roughly proportional to the product of the mantissa sizes (i.e. 16x16 vs. 24x24 is 256 vs. 576).
Adder area scales with mantissa size, plus overhead, depending on the final shifter (which is closer to the sum of the mantissa sizes).
The 30% for storage is simply 32 vs 24. That's a minimum across the board. The DP area is between 30% and 100% area growth.
Overall, that would translate to extra $$ that users would pay for, and they would get *nothing* in return.
Not a good deal.
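As a back-of-envelope check of those ratios, assuming multiplier area scales exactly with the product of the significand widths, and taking 17 and 24 significant bits for FP24 and FP32 (stored mantissa plus the implied leading bit). The function names are only for illustration; real layouts will not follow this exactly.

```python
def multiplier_area_ratio(old_bits, new_bits):
    """Relative multiplier area if area ~ (significand width) squared."""
    return (new_bits * new_bits) / (old_bits * old_bits)

def storage_ratio(old_total_bits, new_total_bits):
    """Relative register/storage size for the whole value."""
    return new_total_bits / old_total_bits

print(multiplier_area_ratio(16, 24))           # 2.25 (16x16 vs. 24x24)
print(f"{multiplier_area_ratio(17, 24):.2f}")  # 1.99 (FP24 vs. FP32)
print(f"{storage_ratio(24, 32):.2f}")          # 1.33 (24 vs. 32 bits)
```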
LeStoffer
03-Jun-2003, 20:04
sireric, thanks for the info!
Sometimes I just love this forum. Information at your fingertips indeed! 8)
RussSchultz
03-Jun-2003, 20:13
Overall, that would translate to extra $$ that users would pay for, and they would get *nothing* in return.
That's sounding a little bit like the old 3dfx "16 bit is good enough".
You'd get something in return, it just might not be useful now; or worth what you pay for it.
Surely sometime in the future FP32 will be required, and even then, somebody will be bitching for more precision.
sireric
03-Jun-2003, 20:20
Don't get me wrong. Certainly at some point it will be required. But, from the analysis I showed, it will require larger textures, filtering on FP texture, probably native FP frame buffer as well as the development of procedural type shading.
Those things aren't here yet, so adding 32b SPFP will really not give anything to anyone. For that matter, even FP24 is overkill for all DX7 and DX8 apps (i.e. NV3x forces FP16 for all shaders, without any trouble). R300, with FP24, shoots to really benefit DX9 apps, which will come this year and in two to three years.
Beyond that, there will be more graphics chips and different requirements.
Deflection
03-Jun-2003, 20:48
Don't get me wrong. Certainly at some point it will be required. But, from the analysis I showed, it will require larger textures, filtering on FP texture, probably native FP frame buffer as well as the development of procedural type shading.
Those things aren't here yet, so adding 32b SPFP will really not give anything to anyone. For that matter, even FP24 is overkill for all DX7 and DX8 apps (i.e. NV3x forces FP16 for all shaders, without any trouble). R300, with FP24, shoots to really benefit DX9 apps, which will come this year and in two to three years.
Beyond that, there will be more graphics chips and different requirements.
Thanks for the posts and congratulations on the excellent dx9 Radeon designs.
Luminescent
03-Jun-2003, 21:06
I find it very admirable that someone with such an important position within the 3D hardware and technology community has taken the time to explain and clarify what are often undisclosed, behind-the-scenes facts.
Thank you, sireric. :D
Another thing about IEEE 754 compliant floats. They require muls and adds to give the result of a correctly rounded exact calculation. That definition gives repeatability between different hardware, but at a steep price.
The maximum rounding error after a multiplication is then 0.5*least significant bit in mantissa. But if you allow that error to increase as little as to 0.8*lsb in mantissa, you can save ~30% of the gates in the multiplier (24 bit mantissa).
The same approximation with 16 bit mantissa can save ~25% gates if the error is allowed to be 0.85*lsb in mantissa.
Since the DX9 definition says at least fp24, we will (already do) have hardware that use different precision. So all code written should never rely on precision in any other way than "at least as good as...". Or in other words, pixel programs should not depend on exact repeatability between different hardware.
So I think the approximation I said above would be good in a PS unit.
[Edit]
Corrected some numbers above.
Chalnoth
04-Jun-2003, 00:24
The maximum rounding error after a multiplication is then 0.5*least significant bit in mantissa. But if you allow that error to increase as little as to 0.8*lsb in mantissa, you can save ~30% of the gates in the multiplier (24 bit mantissa).
...
So I think the approximation I said above would be good in a PS unit.
I'm not so sure. There are many different ways to deal with the error. It is, for example, quite feasible for the error in these calculations to always be additive.
Such error would be absolutely devastating for the accuracy of anything resembling a long program.
Chalnoth
04-Jun-2003, 00:27
Those things aren't here yet, so adding 32b SPFP will really not give anything to anyone. For that matter, even FP24 is overkill for all DX7 and DX8 apps (i.e. NV3x forces FP16 for all shaders, without any trouble). R300, with FP24, shoots to really benefit DX9 apps, which will come this year and in two to three years.
I'm not sure about that. Graphics cards have supported 2048x2048 textures for some time now. Such textures require 11 bits of integer data for addressing, plus buffer bits to ensure accuracy. FP16 isn't enough for this (With its 10-bit mantissa, FP16 would probably work well for up to about 8-bit addressing, or 256x256 texture sizes).
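A quick sketch of what that means for one coordinate, using CPU-side IEEE half and single precision as stand-ins (shader hardware may round differently; the texel chosen is arbitrary):

```python
import struct

def f16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def f32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

SIZE = 2048
u = 1500.5 / SIZE  # normalized coordinate of the centre of texel 1500

print(abs(f16(u) * SIZE - 1500.5))  # 0.5 (lands on a texel edge)
print(abs(f32(u) * SIZE - 1500.5))  # 0.0
```

With FP16 the coordinate is already half a texel off before any arithmetic has been done on it, so there are no bits left over for filtering.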
Reverend
04-Jun-2003, 01:33
As some have said, thanks for your comments Eric. However...
About some misconceptions, and some comments:
1) IE^3 standards do not specify what should be returned for transcendental functions (sin, cos, etc...). They specify the format of data (including nans, infinities, denorms) and the internal roundings for results -- This rounding is not the f2i conversion, but how to compute the lsbs of the results. Different HW can return different results. People have learned to live with this.
Correct but here you have a fine alternative: if you need a reproducible version of cos, you can implement it yourself as a Taylor series, using floating point add and mul, if add and mul are deterministic. But if add and mul aren't deterministic, it's impossible to implement anything deterministically at all. The basic arithmetic ops are the building blocks.
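Something along these lines, as a rough sketch: single precision is emulated on the CPU, the term count is an arbitrary choice, and there is no range reduction, so it is only sensible for small x.

```python
import math
import struct

def f32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def cos_taylor(x, terms=8):
    """cos(x) from its Taylor series, using only float32 mul and add in a fixed order."""
    x2 = f32(f32(x) * f32(x))
    term = 1.0
    total = 1.0
    for k in range(1, terms):
        # Next term = previous term * x^2 * (-1 / ((2k-1) * 2k)).
        # The reciprocal is a precomputed float32 constant, so no divide is needed.
        coeff = f32(-1.0 / ((2 * k - 1) * (2 * k)))
        term = f32(f32(term * x2) * coeff)
        total = f32(total + term)
    return total

print(abs(cos_taylor(0.5) - math.cos(0.5)) < 1e-6)  # True
```

If every mul and add rounds the same way everywhere, this returns the same bits everywhere, whatever the built-in cos does.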
If you need 24b of mantissa precision, FP32 is not enough for you anyway.
FP32 is 1.8.23
2) IE^3 does not guarantee, in any way, that operations are order independent. For example, consider:
result = a + b + c
The above, generally, needs to be broken down into 2 operations. If we select a = 1.0, b = -1.0 and c = 2^(-30), then the result can be 0 or c. Depends on the implementation. IE^3 does not "specify" anything. The programmer needs to specify what he wants (i.e. (a+b)+c).
Er...no, not according to my understanding. IEEE doesn't have any such operation as "a+b+c" whose order is undefined. IEEE only has "a+b", so to add three numbers you need to either specify "(a+b)+c" or "a+(b+c)", either of which is reproducible. Or use a language like C which has well-defined precedence and associativity rules, so that "a+b+c" is defined as being exactly the same as "(a+b)+c" and not "a+(b+c)". Any violation of this in modern languages is an optional compiler optimization that defaults to "off" and is called something like "optimize floating-point operations aggressively". We don't want this even if it's already happening ;) :)
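Eric's own numbers show it, in a small sketch that emulates single-precision rounding after each add:

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

a, b, c = 1.0, -1.0, 2.0**-30

left = f32(f32(a + b) + c)   # (a+b)+c
right = f32(a + f32(b + c))  # a+(b+c)

print(left == c)     # True
print(right == 0.0)  # True
```

Each grouping gives one well-defined answer; the answers differ only because they are different expressions.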
2.5) IE^3 support for nans, inf and denorms is just not needed in PS. In the final conversion to [0,1.0] range, inf and 2.0 would give the same output. For that matter, it's not needed in VS either.
They're not needed if and only if you don't care about reproducibility. But if you're going to do like NVidia does and sometimes run VS operations on the CPU for load-balancing, then you get different results along both paths, which is bad. This is exactly the kind of problem the IEEE spec was designed to remedy, so why not use it?
Saying "this device is IEEE compliant, with this small set of exceptions" is like saying "I am a virgin, with this small set of exceptions". Either Tagrineth is a virgin, or she's not :lol: .
Again it all comes down to whether you see 3D hardware as a deterministic computational device which produces well-defined output for any input, or it's just some black box that you feed polygons into to produce some sort of random approximation of your scene.
3) FP24 has less precision than FP32, but is no worse in other characteristics (well, range is reduced by 2^63 as well). Order of operations would be no "worse" than fp32, beyond the precision limits.
You're implying that the effect of precision loss on FP24 vs FP32 is linear, a sort of one-time penalty -- that's not the case at all in my books. In the worst case, cascading loss-of-precision errors can increase exponentially as a function of instruction count divided by mantissa bit count.
4) What are the outputs of the shader? There are two: The 10b color or 11b texture address (actually, with subtexel precision for filters, you could expect the texture address to be up to 15b). With FP24, you get 17b of denormalized precision, which would allow you to have up to 8 (assuming 1/2 lsb of error per operation) ALU instructions at maximum error rate before even noticing any change in the texture address or texture filtering. Until texture sizes increase significantly (2kx2k right now, will give you a 16MB texture -- I don't know of many apps that even use that now), or texture filtering on FP textures exists, there really is no need for added precision. On the other hand, it's obvious that FP16 is not good enough, unless you have smallish textures (i.e. < 512x512).
"Assuming 1/2 lsb of error per operation" is not realistic. The number of lsb's of error in the worst case can be equal to the difference between the two exponents in the computation, so you can easily have 2, 4, 8, or 16 lsb's of error in any given computation. For example, in a shader that says something like 1/square(magnitude(LightPosition-TexelPosition)), unless your light and texel are real close together, that subtract can easily have many bits of lsb error, and squaring that quantity then doubles the number of error bits.
5) The only exception to 4 above is for long procedural texture generation. In those cases you could expect that some fp24 limits would come into play. There's no real use of that out there right now, and we are probably years from seeing it in mainstream.
If Intel applied this kind of "gee, there's no real use for this combination of instructions" when designing the x86, it would be impossible to write the kind of programs people wrote.
Nevertheless, our Ashli program can take procedural texture generation code from Renderman, and generate long pixel shaders. We've generated shaders that are thousands of instructions long. What we found is that it looks perfect, compared to the original image. No one would ever complain about that quality -- It's actually quite amazing. So, empirically, we found that even in these cases, FP24 is a nearly perfect solution.
Sure, it's easy to come up with a 1000-instruction shader that looks perfect with FP24, and easy to come up with a 3-instruction shader that looks like crap with FP24 then looks great with FP32, and then a 3-instruction shader that looks like crap with FP32 but is great with FP64. Floating point is like that. :)
6) FP32 vs. FP24 would not only be 30% larger from a storage standpoint, the multipliers and adders would be nearly twice as big as well. That added area would have increased cost and given no quality benefits. We could have implemented a multi-pass approach to give higher precision, but we felt that simplicity was a much higher benefit. Given the NV3x PP fiasco, we even more strongly believe now that a single precision format is much better from a programming standpoint (can anyone get better than FP16 on NV3x?).
FP24 was a reasonable decision for the R3x0, which was available basically a year before NV30. It's a lot better than 8-bit integer and gave everyone a sneak peek at DX9's capabilities. But it should be considered a stepping stone, to be phased out as soon as FP32 is commercially viable, rather than being considered a long-term solution. And FP32 may be becoming viable now with NV35 (haven't got one, can't really say for sure). It's just like 3dfx's situation with 16-bit: it was the right solution in 1997, but when 1999 came and they were arguing that it was good enough and nobody needed 32-bit, well, that was not a realistic view.
[edit]Eh, didn't see Russ already mentioned the 3dfx-16-bit stance
At the end of it all, we believe we made the right engineering decision. We weighed all the factors and picked the best solution. DX9 did not come before the R300 and specify FP24. We felt that FP24 was the "right" thing, and DX9 agreed.
Like I said, the R3x0 is a fine part, a good start wrt FP. I just hope it doesn't become set in stone.
Like Simon F said, IANAHE so this is just my understanding.
[edit]minor touch-ups
Dave Baumann
04-Jun-2003, 01:50
Rev, where is the suggestion that this will or won't be replaced? We're only on the second iteration of shader architectures and we've jumped from 8-12 bit integer to fast 24bit FP - this is quite a leap. More than likely the precision will increase on a general level but this is unlikely to happen until newer silicon processes (90nm) allow the footprint needed for it to occur at higher performance levels and we see newer APIs.
"Assuming 1/2 lsb of error per operation" is not realistic. The number of lsb's of error in the worst case can be equal to the difference between the two exponents in the computation, so you can easily have 2, 4, 8, or 16 lsb's of error in any given computation. For example, in a shader that says something like 1/square(magnitude(LightPosition-TexelPosition)), unless your light and texel are real close together, that subtract can easily have many bits of lsb error, and squaring that quantity then doubles the number of error bits.
I think you'd better think some more about the above example. The error bits you refer to are beyond the precision of your numbers so they basically don't count and you end up with a 1 bit error on your subtraction. In other words, if one number is of much larger magnitude than the other then the result of your subtraction is the larger number with the lsb being an error bit.
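A small sketch of that claim, assuming both inputs are already exact single-precision values and the subtract is correctly rounded (emulated on the CPU here; the helper names are made up for illustration):

```python
import math
import struct
from fractions import Fraction

def f32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def ulp32(x):
    """Spacing of single-precision values around a normal, non-zero x."""
    return Fraction(2) ** (math.frexp(x)[1] - 24)

def sub_error_in_ulps(a, b):
    """Error of a single-precision a - b, in units of the result's last place."""
    a, b = f32(a), f32(b)
    rounded = f32(a - b)
    exact = Fraction(a) - Fraction(b)
    return abs(Fraction(rounded) - exact) / ulp32(rounded)

# Very different magnitudes: still at most half an lsb of the result.
print(sub_error_in_ulps(1000.0, 0.001) <= Fraction(1, 2))  # True
# Nearly equal magnitudes: the subtraction itself is exact.
print(sub_error_in_ulps(1000.0, 999.999) == 0)             # True
```

This only covers the error added by the subtract itself; any error already present in the inputs is a separate matter.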
FP24 was a reasonable decision for the R3x0, which was available basically a year before NV30. It's a lot better than 8-bit integer and gave everyone a sneak peek at DX9's capabilities. But it should be considered a stepping stone, to be phased out as soon as FP32 is commercially viable, rather than being considered a long-term solution. And FP32 may be becoming viable now with NV35 (haven't got one, can't really say for sure).
Let's ask nVidia how viable FP32 is on the NV35. Answer: They'd rather run games/benchmarks in DX8 mode than DX9 mode. Oops! I guess that means that FP16 isn't even viable. Now, please tell us again who is giving the "sneak peek" at DX9 functionality?
-FUDie
Reverend
04-Jun-2003, 01:58
Dave, ATi is on a roll. Some of their representatives have said "FP24 is good enough" many a time. I'm not saying this won't improve... I'm just worried it may not for a long time to come. I don't want, for instance, R400 to still have this. It's a possible scenario, given the comments by ATi representatives paraphrased above. I guess I'm just a pessimist.
Who knows -- perhaps if the R3x0 is full FP32, it may encounter the same performance problems we're seeing with the NV3x, so, like I said, it is a good initial step forward. When you mess around with code, you're likely to get a little more frustrated than if you don't mess around.
Like I said, it's just my views.
PS. I'm leaving for work now. Will respond if the need arises when I'm at the office. Ta!
Dave Baumann
04-Jun-2003, 02:03
Dave, ATi is on a roll. Some of their representatives have said "FP24 is good enough" many a time.
And that's probably a reasonable statement most of the time for the types and sizes of shaders available in DX9. I'd guess that by the time DX10 gets here FP32 may well be the baseline, but that's also probably a reasonable time scale for the FP24 baseline to have been around and the shader specifications available in DX10 will also probably make it more worthwhile more of the time.
RussSchultz
04-Jun-2003, 02:37
And that's probably a reasonable statement most of the time for the types and sizes of shaders available in DX9. I'd guess that by the time DX10 gets here FP32 may well be the baseline
I think what Reverend was worrying about was that an ATI engineer is here today saying "FP32 is too much" when they're likely well into the design cycle for their R400/DX10 card--which means their next generation won't support it.
Dave Baumann
04-Jun-2003, 02:49
Yes, and he seems to be talking in relation to DX9, not any future revisions, since his NDA prevents him talking about things that may occur in the future. "R400" (or whatever is next) is likely still to be a DX9 part so that will probably stay FP24 as well, I would expect DX10 parts to have different base precisions.
Reverend
04-Jun-2003, 03:21
Dave, ATi is on a roll. Some of their representatives have said "FP24 is good enough" many a time.
And that's probably a reasonable statement most of the time for the types and sizes of shaders available in DX9. I'd guess that by the time DX10 gets here FP32 may well be the baseline, but that's also probably a reasonable time scale for the FP24 baseline to have been around and the shader specifications available in DX10 will also probably make it more worthwhile more of the time.
No arguments there with you -- it's a reasonable statement and perfectly acceptable to me given the current industry progress/state.
I think I probably shouldn't be mentioning IHVs and their parts here -- I am just concerned about FP32 and used ATI and their R3x0 as an example for expressing my concerns about the possibility of FP24 being here to stay for a while to come, when I have my own conceptions about DX9 and IEEE.
Let's focus on FP24, FP32, DX9 and IEEE and what they all mean, without mentioning IHVs and parts.
"R400" (or whatever is next) is likely still to be a DX9 part so that will probably stay FP24 as well, I would expect DX10 parts to have different base precisions.
I hope this is not the case -- ATi is the leader now, with a lot of folks buying their parts, which is something developers always take into consideration... ATi should start pushing things, like NVIDIA did with 32-bit, static TnL and DX8 shaders back then. I'm sure nobody will complain. But then, time and market penetration is a very important factor so if R400 still has a FP24 "limit", I'd understand the decision but I personally would wish for more.
You seem to speak from the POV of "what's available and, therefore, what should be the case" while I wanted this very specific topic to be about going forward when FP24 made its debut while FP32 was already understood to be available to many developers as evangelized by ATi devrel wrt R3x0. Sorry if this intention of mine in starting this thread wasn't clear. It's kinda like 3DMark03 (oh no, not that word!) -- it's useless to look at it as representative of the way things are at the moment, in terms of both HW availability and, hence, SW implementation based on HW availability... it's more a look at what should come. You've always viewed 3DMark03 as primarily a glimpse of what is to come -- I am just expanding on that but not necessarily purely on a SW POV. Something like that, I think :)
I think we should not continue this line of discussion and just focus on what FP24 vs FP32 would mean to programmers. The difference is that I know what IEEE-32 avails to me that FP24 cannot regardless of HW availability at this time. Hopefully you will understand what I mean in this aspect.
"R400" (or whatever is next) is likely still to be a DX9 part so that will probably stay FP24 as well, I would expect DX10 parts to have different base precisions.
I hope this is not the case -- ATi is the leader now, with a lot of folks buying their parts, which is something developers always take into consideration... ATi should start pushing things, like NVIDIA did with 32-bit, static TnL and DX8 shaders back then. I'm sure nobody will complain. But then, time and market penetration is a very important factor so if R400 still has a FP24 "limit", I'd understand the decision but I personally would wish for more.
Given the recent rumours re: R400 program, we can be almost certain that what will now be marketed as R4XX will be FP24 limited. Sireric's (many thanks) view on FP framebuffers, etc, put the real limitations of FP24 into context. I, too, suspect no real change until the next process technology shift.
sireric
04-Jun-2003, 05:18
Correct but here you have a fine alternative: if you need a reproducible version of cos, you can implement it yourself as a Taylor series, using floating point add and mul, if add and mul are deterministic. But if add and mul aren't deterministic, it's impossible to implement anything deterministically at all. The basic arithmetic ops are the building blocks.
By “reproducible” and “deterministic”, I assume you mean “identical in all implementations”. Otherwise, yes, operations are always reproducible and are deterministic. We did not add a random number generator :)
On the other hand, even if floating point adds and muls have identical implementations, in general, you would still not be able to guarantee that two implementations of the same Taylor expansion would always give the same results, with the same inputs. You would also need to have not only identical hw, but identical compilers, identical source code and identical OS/APIs. IE^3 makes no guarantees on a sequence of operations. Floating point numbers are inexact and so are operations performed on them. Programmers have learned to live with that.
FP32 is 1.8.23
The 23 bits is a normalized mantissa form of a 24b number. You have 24b of precision.
Er...no, not according to my understanding. IEEE doesn't have any such operation as "a+b+c" whose order is undefined. IEEE only has "a+b", so to add three numbers you need to either specify "(a+b)+c" or "a+(b+c)", either of which is reproducible. Or use a language like C which has well-defined precedence and associativity rules, so that "a+b+c" is defined as being exactly the same as "(a+b)+c" and not "a+(b+c)". Any violation of this in modern languages is an optional compiler optimization that defaults to "off" and is called something like "optimize floating-point operations aggressively". We don't want this even if it's already happening ;) :)
All I meant is that more complex operations do not have results that are guaranteed by IE^3. The implementation details influence the results. If a PowerPC implements an FMAD with higher precision than MUL/ADD combo, that will lead to slight differences between that HW and other. Doesn’t seem to offend most programmers.
Though there are some that require the exact same results. But they aren't programming pixel shaders :-)
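To make the FMAD point concrete, here is a minimal sketch (not anything from ATI, PowerPC or the DX9 spec) of how a fused multiply-add, round(a*b + c), can differ from the separate version, round(round(a*b) + c). It runs in ordinary double precision; the helper names `mul_then_add` and `fused_mul_add` are made up for illustration, and the fused path is emulated with exact rationals rather than real FMA hardware.

```python
from fractions import Fraction


def mul_then_add(a: float, b: float, c: float) -> float:
    """Unfused: the product is rounded, then the sum is rounded again."""
    return a * b + c


def fused_mul_add(a: float, b: float, c: float) -> float:
    """Emulated fused op: compute a*b + c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # Fraction -> float rounds to nearest, once


# a*b is exactly 1 - 2**-60, which does not fit in a double significand.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = mul_then_add(a, b, c)   # product rounds to 1.0, so the sum is 0.0
fused = fused_mul_add(a, b, c)    # keeps the -2**-60 that the product lost

print("mul then add:", unfused)
print("fused       :", fused)
```

Both answers are "correct" for their own definition of the operation, which is the whole problem when a spec does not say which definition applies.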
They're not needed if and only if you don't care about reproducibility. But if you're going to do like NVidia does and sometimes run VS operations on the CPU for load-balancing, then you get different results along both paths, which is bad. This is exactly the kind of problem the IEEE spec was designed to remedy, so why not use it?
One has to judge the cost of an item and make a call. Certainly if one is offloading VPU activity to the CPU, having identical implementations is required. On the other hand, PS shaders do not have the luxury of being offloadable. They could be used to offload the CPU, but there’s no API available to do that. Would you increase the cost of the product for something that could not be used?
Saying "this device is IEEE compliant, with this small set of exceptions" is like saying "I am a virgin, with this small set of exceptions". Either Tagrineth is a virgin, or she's not :lol: .
Sure. Quite colorful.
Again it all comes down to whether you see 3D hardware as a deterministic computational device which produces well-defined output for any input, or it's just some black box that you feed polygons into to produce some sort of random approximation of your scene.
Again, 3D hardware is very deterministic. You plug in something, and the same thing comes out. Every time. However, the HW is just part of the whole. The system HW, the OS, the application, the API, the drivers, etc… are all changing. Expecting exactly the same output on all systems is not realistic.
You're implying that the effect of precision loss on FP24 vs FP32 is linear, a sort of one-time penalty -- that's not the case at all in my books. In the worst case, cascading loss-of-precision errors can increase exponentially as a function of instruction count divided by mantissa bit count.
I did not say that it was linear. I was saying that it has the same properties. Yes, the error ranges are larger, but the properties are the same. Going to FP24 does not require a “Change” of philosophy. It just needs the programmer to be aware of the ranges, and to code the applications taking that into account.
"Assuming 1/2 lsb of error per operation" is not realistic. The number of lsb's of error in the worst case can be equal to the difference between the two exponents in the computation, so you can easily have 2, 4, 8, or 16 lsb's of error in any given computation. For example, in a shader that says something like 1/square(magnitude(LightPosition-TexelPosition)), unless your light and texel are real close together, that subtract can easily have many bits of lsb error, and squaring that quantity then doubles the number of error bits.
No, for each operation, assuming ½ lsb of error is correct. Your example is a composite operation. A ½ lsb of error in one operation can be amplified in the next operation, regardless of the precision. But, you are correct that the errors do not add; you can certainly construct operations that magnify errors. However, when coding PS operations, one should strive for stable code. In most shader codes I’ve seen, things are simple and errors are stable.
If Intel applied this kind of "gee, there's no real use for this combination of instructions" when designing the x86, it would be impossible to write the kind of programs people wrote.
Sure. VPUs really aren’t replacements for CPUs. If ATI gets in that business, we will lose. Intel and AMD are much better at it. Given that VPU outputs eventually get truncated down to 10b for color and 11~15b for texture addresses, things are good (for now).
Sure, it's easy to come up with a 1000-instruction shader that looks perfect with FP24, and easy to come up with a 3-instruction shader that looks like crap with FP24 then looks great with FP32, and then a 3-instruction shader that looks like crap with FP32 but is great with FP64. Floating point is like that. :)
Sure, FP is inexact. However, I was using empirical evidence to show that our assumptions appear justified.
FP24 was a reasonable decision for the R3x0, which was available basically a year before NV30. It's a lot better than 8-bit integer and gave everyone a sneak peek at DX9's capabilities. But it should be considered a stepping stone, to be phased out as soon as FP32 is commercially viable, rather than being considered a long-term solution. And FP32 may be becoming viable now with NV35 (haven't got one, can't really say for sure). It's just like 3dfx's situation with 16-bit: it was the right solution in 1997, but when 1999 came and they were arguing that it was good enough and nobody needed 32-bit, well, that was not a realistic view.
Never said that FP24 is the end. Neither is FP32, for that matter. At SGI, on GE11 (IR, Impact), we had double precision ALUs, just to compute higher order geometries (circles, spheres). But my point is that FP24 is still brand new and there are no applications yet showing up that push it at all. I explained that FP32 (at full speed) is significantly more expensive than FP24. I also noted that other items (Larger textures, FP displays, etc…) need to kick in as well to justify FP32. One has to weigh the cost and the benefits. I stand by our decision to use FP24. It’s fast and it’s high precision; nobody else can claim those things.
Like I said, the R3x0 is a fine part, a good start wrt FP. I just hope it doesn't become set in stone.
You’re being silly. It’s obvious that FP32 will come, when it’s needed and cost effective. I don’t really understand what you meant to do by all this; FP24 is justified and makes sense, for now; FP32 is not. Why not enjoy the benefits of what is available now? Anyway, I’m glad you think R300 is a “Fine” part.
I'm not so sure. There are many different ways to deal with the error. It is, for example, quite feasible for the error in these calculations to always be additive.
Such error would be absolutely devastating for the accuracy of anything resembling a long program.
Are you assuming that what I said would be more prone to biased rounding? I don't see why.
I can put it in a more "marketing worthy" way. :) Is it worth 40% more gates to gain ~0.7 bit higher precision in the case with 24 bit mantissa (fp32)?
Reverend
04-Jun-2003, 06:23
Addressing the last point first since I don't like being called silly :) :
Like I said, the R3x0 is a fine part, a good start wrt FP. I just hope it doesn't become set in stone.
You're being silly. It's obvious that FP32 will come, when it's needed and cost effective. I don't really understand what you meant to do by all this; FP24 is justified and makes sense, for now; FP32 is not. Why not enjoy the benefits of what is available now?
I should've said "long term" instead of "set in stone". [edit]Note that R300 debuted in, what, Sept last? And the R4xx, say, this year and presumably still FP24? Let's assume R300-based boards are the majority consumers own this time next year. Do you think FP32 will be really important to developers if this is the scenario next year? Is it "needed" then based on this scenario? Certainly most developers now, with the R300 being the first DX9 board commercially available, and having the timeframe advantage over the NV30, won't be doing stuff requiring FP32 to work or run without looking shitty... not until it's needed according to you, which means not until we see FP32 hardware being the majority. Yes?
But I find it strange to hear you say "...FP32 will come, when it's needed and cost effective". I would argue with the "when it's needed part" but I can understand the cost effective part.
"when it's needed" sounds awfully familiar to me and usually for the wrong reasons in my books -- surely you won't fault me for comparing this to 32-bit color depth back when the TNT debuted it... it "wasn't needed" back then but I could think of instances where it is definitely needed even before it made its debut. For that matter, many other things are not "needed"... until the hardware arrives, right? :) Why is AA and AF needed now? Because it was made available and folks got to see the difference. I think, for the few reasons I expressed either here or here (http://www.beyond3d.com/forum/viewtopic.php?p=91712#91712), that once folks see on display what I expressed, it will be "needed". Hey, wait a minute, if I want to do this already, it is already needed! Of course, I don't work for a AAA development house... maybe that's one of the main points :)
I think I have expressed why I think FP32 is needed, even now, but that's probably because I know what IEEE-32 is, and what DX9 is not, compared to it. I don't know the entire history of DX9, from conception to fruition, but my understanding was that IEEE-32 was a basis for it wrt precision. Perhaps that is the entire summary of my thoughts in this thread! :)
I don't really understand what you meant to do by all this; FP24 is justified and makes sense, for now; FP32 is not.
"What I meant to do by all this" is to give examples of why I think FP32 is a necessity and why FP24 will not be good enough. Having brought R3x0 and ATi into the discussion probably wasn't a good idea but I needed existing hardware with FP support but not quite full FP32 support to illustrate why I know FP32 is important. If this is redundant or a "duh" point, excuse me... I'm not so sure many others would know the important difference between FP24 and FP32.
Why not enjoy the benefits of what is available now?
I do enjoy seeing what FP24 gives me now via the R3x0. I would enjoy it more if I see full FP32, that's all.
Anyway, I'm glad you think R300 is a "Fine" part.
Aw, so sorry I didn't say a "very good part". :) It is as fine a part as NV20 was when it made its debut along with DX8, that's what I meant basically.
As for the rest, I think things are a little clearer between us. Thanks for your comments. In the end, I gave my views in the hope of learning things that may show me I'm wrong in my understanding, that's all. I'm not here to argue the virtues or bad stuff about any one chip from any IHV. If anything, show me why you think it's wrong for me to think FP32 is that much more important than FP24 per se, leaving aside any IHV favouritisms or chip favouritisms or who you work for.
I'm in a hurry for lunch so excuse any inconsistent or dumb ramblings!
OpenGL guy
04-Jun-2003, 06:44
I do enjoy seeing what FP24 gives me now via the R3x0. I would enjoy it more if I see full FP32, that's all.
You would? What would justify the cost? What application would require it? We've barely even touched the limits of FP24 and already you are complaining it's not enough. Sounds very similar to what I read before the NV30 launch. "FP24 is not enough, you gotta have FP32. Don't buy the R300, the NV30 will be better in all respects because it has real DX9 support." Yeah right.
Show me an interesting application that requires FP32 right now. Hell, show me an interesting application within the next year that will require FP32.
Reverend
04-Jun-2003, 07:04
OpenGL guy, calm down. This isn't really about NVIDIA or ATI or some chip by either. Please read what I wrote. Your "show me an app" comment is really a chicken-and-egg situation, isn't it?
I'm not talking about current apps, please. I'm talking about specific instances where FP32 has an advantage over FP24, that's it, period. Is it wrong to wish for more than the R300 offers? You make it sound like I'm bashing the R300 when that is most definitely not what I'm doing.
Analogy - what's the point of discussing better AA algorithms and its benefits than what current hardware offers? There's no point in discussing it then?
[edit]I started talking about IEEE and DX9 and thought that using the R300 as an example (since it was the first FP supporting hardware) would be a good basis for discussions. sireric in his first posts made some excellent points but he somehow had to bring his "I'm an ATi employee" mindset into the discussion. I countered with my own points. sireric again provided additional comments but still feels the need to be "defensive" about ATi (since he works for them). Why? Where can I go to talk about the need for progress and better solutions? Is it wrong to do so here without coming out like I'm criticizing a hardware or company unjustifiably? I want FP32. I appreciate FP24. I gave my views on why I think FP32 is more important than FP24 in specific instances. I did not say any app currently requires FP32 but I gave instances where such may be required in the future. Is that wrong?
OpenGL guy
04-Jun-2003, 08:05
I'm not talking about current apps, please. I'm talking about specific instances where FP32 has an advantage over FP24, that's it, period. Is it wrong to wish for more than the R300 offers? You make it sound like I'm bashing the R300 when that is most definitely not what I'm doing.
As I said before, FP24 isn't a limitation now, and won't be for a while. What use is more precision? Sure, you may want it, just like I want a new sports car, but do you need it? Not that I can see.
Analogy - what's the point of discussing better AA algorithms and its benefits than what current hardware offers? There's no point in discussing it then?
Have we reached the limits of current AA algorithms? I don't know. I do know that I would like more samples, and I can see a need for that. I can't see a need for more than FP24 in the near future.
And I don't think FP24 is holding back anyone's development right now.
I gave my views on why I think FP32 is more important than FP24 in specific instances. I did not say any app currently requires FP32 but I gave instances where such may be required in the future. Is that wrong?
IHVs design chips with specific goals in mind. If FP24 meets those goals, then that's what will be designed. There's no reason to go overboard (i.e. over engineer) with FP32 because you won't be able to justify the extra cost. Again, you have to balance what you want vs. what you need, and it can be a tough balancing act.
In my opinion, ATi chose wisely and that's why the R300 products are doing well.
P.S. Don't worry, I'm not arguing against FP32, I'm just saying that right now, and in the near future, there is no need for it.
Reverend
04-Jun-2003, 08:43
I suppose what you're saying is that my wants and needs and facts (presumably facts, since you didn't comment on them specifically, unlike sireric's attempts) about FP24 limitations vis-a-vis FP32 aren't terribly important right now :) (wait a sec, that should be a :( ).
Completely understandable.
Now, can we get back on topic? :) Can you tell me if I'm right or wrong in the examples (FP24 not being good enough compared to FP32 per se regardless of hardware availability) I gave, regardless of whether you think FP24 is good enough for now or near future? I'm not trying to be clever here -- I am sincerely not sure if I'm right or wrong in my thinking and experimentations (I'm still learning). Just the facts as per what I laid out re FP24 vs FP32. Forget you work for ATI :)
BoardBonobo
04-Jun-2003, 08:49
P.S. Don't worry, I'm not arguing against FP32, I'm just saying that right now, and in the near future, there is no need for it.
Not until ATI get their implementation right? ;)
OpenGL guy
04-Jun-2003, 08:50
I suppose what you're saying is that my wants and needs and facts (presumably facts, since you didn't comment on them specifically, unlike sireric's attempts) about FP24 limitations vis-a-vis FP32 aren't terribly important right now :) (wait a sec, that should be a :( ).
Completely understandable.
Yay, so we're past that ;)
Now, can we get back on topic? :) Can you tell me if I'm right or wrong in the examples (FP24 not being good enough compared to FP32 per se regardless of hardware availability) I gave, regardless of whether you think FP24 is good enough for now or near future? I'm not trying to be clever here -- I am sincerely not sure if I'm right or wrong in my thinking and experimentations (I'm still learning). Just the facts as per what I laid out re FP24 vs FP32. Forget you work for ATI :)
Eventually, you can expose limitations of any level of precision. For example, I really wish I had (fast) FP128 for my fractal computations. Maybe I should put pressure on Intel and AMD for better FPUs :D Or maybe sireric will give me FP128 in the pixel shader ;)
K.I.L.E.R
04-Jun-2003, 08:54
I would LOVE to see an instance where FP32 has an advantage over FP24.
Simon F
04-Jun-2003, 08:58
6) FP32 vs. FP24 would not only be 30% larger from a storage standpoint, the multipliers and adders would be nearly twice as big as well.
Interesting info. The FP24 choice obviously had a lot to do with your silicon budget, but I didn't realize that FP32 is so demanding over FP24.
Just do an N digit x N digit multiply by hand and count the steps needed. (=> O(N^2) complexity) The algorithm is going to be approximately the same for binary in silicon.
(Actually, there is an algorithm that does it faster than O(N^2) (possibly N.log(N)??) but it's complicated)
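The "count the steps" argument can be tried directly. The sketch below is a plain schoolbook binary multiply that counts one step per 1-bit partial product, as a rough stand-in for multiplier array size. The significand widths are assumptions for illustration: 17 bits for FP24 (16 stored bits plus an implied leading 1) and 24 bits for FP32 (23 stored plus the implied 1). Real silicon multipliers are more clever than this, so treat the ratio as a ballpark only.

```python
def schoolbook_multiply(a: int, b: int, n_bits: int):
    """Multiply two n_bits-wide unsigned integers the long-hand way.

    Returns (product, steps), where steps counts every 1-bit partial
    product examined, i.e. n_bits * n_bits of them.
    """
    product = 0
    steps = 0
    for i in range(n_bits):          # each bit of a
        for j in range(n_bits):      # against each bit of b
            steps += 1
            if (a >> i) & 1 and (b >> j) & 1:
                product += 1 << (i + j)
    return product, steps


# Assumed significand widths (stored mantissa bits + implied leading 1).
FP24_SIGNIFICAND_BITS = 17
FP32_SIGNIFICAND_BITS = 24

_, fp24_steps = schoolbook_multiply(0x1FFFF, 0x1ABCD, FP24_SIGNIFICAND_BITS)
_, fp32_steps = schoolbook_multiply(0xFFFFFF, 0xABCDEF, FP32_SIGNIFICAND_BITS)

print("FP24 mantissa multiply steps:", fp24_steps)
print("FP32 mantissa multiply steps:", fp32_steps)
print("ratio:", fp32_steps / fp24_steps)
```

Under these assumptions the ratio comes out just under 2, which lines up with the "nearly twice as big" figure quoted above.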
As some have said, thanks for your comments Eric. However...
About some misconceptions, and some comments:
1) IE^3 standards do not specify what should be returned for transcendental functions (sqrt, sin, etc...). They specify the format of data (including nans, infinites, denorms) and the internal roundings for results -- This rounding is not the f2i conversion, but how to compute the lsbs of the results. Different HW can return different results. People have learned to live with this.
Correct but here you have a fine alternative: if you need a reproducible version of cos, you can implement it yourself as a Taylor series, using floating point add and mul, if add and mul are deterministic. But if add and mul aren't deterministic, it's impossible to implement anything deterministically at all. The basic arithmetic ops are the building blocks.
Maybe I'm misinterpreting what you've written, but the ops are certainly deterministic, i.e. a*b always produces c on hardware x. (assuming it's implemented properly!!!). It's just that floating point is not a mathematical group, so the usual maths rules you expect, eg associativity, (a+b)+c = a + (b + c), or the distributive law, a*(b+c)=a*b + a*c, aren't guaranteed.
Actually, this raises an interesting point - presumably C compilers are not allowed to make optimisations of this nature in FP calculations simply because they could change the behaviour of the code.
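For anyone who wants to see those two rules fail, here is a small sketch in ordinary double precision (nothing GPU-specific; the values are just picked so the rounding is easy to follow by hand). Each individual operation is deterministic and correctly rounded, yet regrouping changes the answer, which is why a compiler that reorders FP expressions can change results.

```python
# Associativity: same three numbers, different grouping.
left_assoc = (0.1 + 0.2) + 0.3
right_assoc = 0.1 + (0.2 + 0.3)
print("(a+b)+c =", left_assoc)
print("a+(b+c) =", right_assoc)

# Distributive law: a*(b+c) versus a*b + a*c.
a = 1.0 + 2.0**-30
b = 1.0 + 2.0**-30
c = -1.0

factored = a * (b + c)      # b+c is exact, and so is the final product
expanded = a * b + a * c    # a*b needs 61 bits, so its last bit is rounded off

print("a*(b+c)   =", factored)
print("a*b + a*c =", expanded)
print("difference:", factored - expanded)
```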
Quote:
If you need 24b of mantissa precision, FP32 is not enough for you anyway.
FP32 is 1.8.23
But you are forgetting that, except for denormalised numbers, there is an implied '1' in the MSB, and so the mantissa is 24b.
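The 1.8.23 layout and the implied bit are easy to poke at. This is a minimal sketch using Python's standard `struct` module to split an IEEE single into its sign, exponent and stored fraction fields, plus a round trip through FP32 to show where 24 bits of significand run out. The helper names `fp32_fields` and `to_fp32` are invented for this example.

```python
import struct


def fp32_fields(x: float):
    """Return (sign, biased_exponent, stored_fraction) of x as an IEEE single."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, bias 127
    fraction = bits & 0x7FFFFF        # 23 stored bits; the leading 1 is implied
    return sign, exponent, fraction


def to_fp32(x: float) -> float:
    """Round a Python float (a double) to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


print(fp32_fields(1.0))   # (0, 127, 0): the "1." is not stored anywhere
print(fp32_fields(1.5))   # (0, 127, 4194304): only the ".5" is stored

# 2**24 - 1 needs 24 significant bits and survives; 2**24 + 1 needs 25.
print(to_fp32(16777215.0))  # 16777215.0
print(to_fp32(16777217.0))  # 16777216.0
```

So the format stores 23 bits but behaves as a 24-bit significand for every normalised number, which is the point both posts are making.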
"Assuming 1/2 lsb of error per operation" is not realistic. The number of lsb's of error in the worst case can be equal to the difference between the two exponents in the computation, so you can easily have 2, 4, 8, or 16 lsb's of error in any given computation.
Actually you can have all of the precision lost in a couple of calcs if you program things badly, eg Inaccurate = (Big + Small) - Big But if we assume sensible calculations, losing 1/2 a bit per calc is not a bad rule of thumb. I always liked the quote (can't remember who said it and this might be a misquotation)
Floating Point numbers are like piles of sand; every time you move one you lose a little sand, but you pick up a little dirt
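The (Big + Small) - Big case in runnable form, as a quick sketch in double precision. The specific values are just chosen so that Small sits entirely below the last bit of Big; the same thing happens at smaller ratios in FP24 or FP32.

```python
big = 2.0**60
small = 1.0

# Small is below the last bit of Big, so the add throws all of it away.
inaccurate = (big + small) - big

# Same three values, reordered so the large terms cancel first.
reordered = (big - big) + small

print("(Big + Small) - Big =", inaccurate)   # 0.0
print("(Big - Big) + Small =", reordered)    # 1.0
```

Every single operation here is within half an lsb of its exact result; it is the composite that loses everything, which is why the order of a shader's arithmetic matters more than the raw bit count.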
It's arguable that double precision in the vertex shader could be substantially more useful than FP32 in the pixel shader.
There's probably quite a few guys in the professional (flightsim, etc. rather than film) space who would appreciate it already...
andypski
04-Jun-2003, 10:05
Let's focus on FP24, FP32, DX9 and IEEE and what they all mean, without mentioning IHVs and parts.
Ok - sounds like a good approach.
"R400" (or whatever is next) is likely still to be a DX9 part so that will probably stay FP24 as well, I would expect DX10 parts to have different base precisions.
I hope this is not the case -- ATi is the leader now, with a lot of folks buying their parts, which is something developers always take into consideration... ATi should start pushing things, like NVIDIA did with 32-bit, static TnL and DX8 shaders back then. I'm sure nobody will complain. But then, time and market penetration is a very important factor so if R400 still has a FP24 "limit", I'd understand the decision but I personally would wish for more.
Hmmm... interesting.
The situation here as I see it is this - the next clear step up from PS2.0 is really PS3.0, and my understanding is that PS3.0 is defined in the DX9 spec as having the same precision requirements as PS2.0.
Here is the quote from Microsoft's clarifying statement on the DXDev mailing list -
[from ps_2_0 section]
---Begin Paste---
Internal Precision
- All hardware that support PS2.0 needs to set
D3DPTEXTURECAPS_TEXREPEATNOTSCALEDBYSIZE.
- MaxTextureRepeat is required to be at least (-128, +128).
- Implementations vary precision automatically based on precision of
inputs to a given op for optimal performance.
- For ps_2_0 compliance, the minimum level of internal precision for
temporary registers (r#) is s16e7** (this was incorrectly s10e5 in spec)
- The minimum internal precision level for constants (c#) is s10e5.
- The minimum internal precision level for input texture coordinates (t#)
is s16e7.
- Diffuse and specular (v#) are only required to support [0-1] range, and
high-precision is not required.
---End Paste ---
For ps_3_0 the requirements are the same, however interpolated input
registers are now defined by semantic names. Inputs here behave like t#
registers in ps_2_0: they default to s16e7 unless _pp is specified (s10e5).
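As a rough guide to what those sXeY names buy you, the sketch below turns a stored mantissa width into a few rules of thumb. It assumes each format has an implied leading 1 on top of the stored mantissa bits (true of IEEE single; assumed here for s10e5 and s16e7), and the `describe` helper and the FP16/FP24/FP32 labels are only for illustration, not part of the spec text.

```python
import math

# name -> (stored mantissa bits, exponent bits)
FORMATS = {
    "s10e5 (FP16)": (10, 5),
    "s16e7 (FP24)": (16, 7),
    "s23e8 (FP32)": (23, 8),
}


def describe(mantissa_bits: int):
    """Return (significand_bits, epsilon, max_exact_int, decimal_digits)."""
    significand = mantissa_bits + 1            # assumes an implied leading 1
    epsilon = 2.0 ** -mantissa_bits            # gap between 1.0 and the next value
    max_exact_int = 2 ** significand           # integers up to here are all exact
    decimal_digits = significand * math.log10(2)
    return significand, epsilon, max_exact_int, decimal_digits


for name, (mantissa_bits, _exponent_bits) in FORMATS.items():
    sig, eps, max_int, digits = describe(mantissa_bits)
    print(f"{name}: {sig}-bit significand, eps={eps:.3g}, "
          f"exact ints to {max_int}, ~{digits:.1f} decimal digits")
```

The "exact ints" column is the one to watch for texture addressing: once texel index times sub-texel precision exceeds it, neighbouring addresses start to collapse onto each other.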
IHVs will be releasing PS3.0 parts, and those parts may well be making the step up to higher precisions that you desire, but as I see it here is the bind. Even if this is the case, developers still should not be assuming higher than 24 bit precision when coding, because that is what is specified.
You may get higher than 24-bit precision, which will be good, but we are in the situation that for the future of DX9 things are fixed at 24 bits and it is important that this is recognised by all parties so that the spec does not fragment.
As to the benefits of FP32 vs. FP24 in the short term I won't make any comment for the moment. My personal feeling is that it's the right balance for the current time (but then I would think that, wouldn't I ;)).
It will be interesting as we move forward to hear more from people developing shaders as and when they manage to run into any significant limitations caused by the use of 24-bit FP. At the moment it would seem to me to be rather early to be criticising the precision - after all, we stayed with at most 8 bits of guaranteed precision from well before DirectX even existed right through to DX8, which is really quite a long time. Even in DX8 the recommendation only went up to at least 9 bits for internal operations as I recall. Now in a single DX version we have made a huge step, so it seems appropriate to wait a bit for shader coding to catch up.
- Andy.
For example, in a shader that says something like 1/square(magnitude(LightPosition-TexelPosition)), unless your light and texel are real close together, that subtract can easily have many bits of lsb error, and squaring that quantity then doubles the number of error bits.
You got it the wrong way. It's when the light and the texel are close to each other relative to the distance to the origin that you've got problems. And squaring it doesn't double the number of error bits, but it can double the error (meaning adding one error bit).
But it's bad (wrt performance and precision) to do the calc that way. It's better to write square(magnitude(X)) as dot(X,X), or to do the latter in an inline function magnitude2(X).
The subtraction loses as many bits of precision as are needed to store the ratio between the light-to-texel distance and the light/texel-to-origin distance. I.e., if the light-tex distance is one unit, and they are 32 units away from the origin, then you'll lose 5 bits. If you lose too much precision (which certainly is possible if you're not careful), you can gain it back by moving part of the subtraction to the VS (using its higher precision). The VS can make sure that the PS gets a local coordinate system for (lightPos-texPos).
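A rough sketch of the 5-bit figure above, not a model of any actual hardware. It assumes FP24 is s16e7 with a 17-bit significand (16 stored bits plus the implicit leading 1), ignores the exponent range, and uses made-up 1D positions. The helpers round_to_bits, ulp and bits_lost are hypothetical names for illustration only.

```python
import math

FP24_SIG_BITS = 17  # assumption: s16e7, 16 stored mantissa bits + implicit 1


def round_to_bits(x, sig_bits=FP24_SIG_BITS):
    """Round x to sig_bits significand bits (exponent range is ignored)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x == m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** sig_bits
    return math.ldexp(round(m * scale) / scale, e)


def ulp(x, sig_bits=FP24_SIG_BITS):
    """Spacing between neighbouring representable values around x."""
    _, e = math.frexp(x)
    return math.ldexp(1.0, e - sig_bits)


def bits_lost(dist_to_origin, light_to_texel):
    """How many low bits of the difference are noise before the subtract."""
    return math.log2(ulp(dist_to_origin) / ulp(light_to_texel))


# Hypothetical positions: about 32 units from the origin, about 1 unit apart.
light, texel = 32.7001, 33.7004
exact = texel - light

# World space: quantize both positions, then subtract.
world = round_to_bits(round_to_bits(texel) - round_to_bits(light))

# Local space: subtract a nearby reference point at higher precision first
# (the job the post gives to the vertex shader), then quantize.
ref = 33.0
local = round_to_bits(round_to_bits(texel - ref) - round_to_bits(light - ref))

world_err = abs(world - exact)
local_err = abs(local - exact)
print(bits_lost(32.0, 1.0))   # 5.0
print(world_err, local_err)   # the local-space error is the smaller one
```

With these numbers the world-space subtraction lands on exactly 1.0 and loses the fractional part of the distance, while the local-space version keeps it to within a few units of the last FP24 bit.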
PS:
I'm interested in what the ATI (and other HW) developers say about not doing perfect rounding of the exact calculation.
Maybe I'm misinterpreting what you've written, but the ops are certainly deterministic, i.e. a*b always produces c on hardware x. (assuming it's implemented properly!!!). It's just that floating point is not a mathematical group, so the usual maths rules you expect, eg associativity, (a+b)+c = a + (b + c), or the distributive law, a*(b+c)=a*b + a*c, aren't guaranteed.
Actually, this raises an interesting point - presumably C compilers are not allowed to make optimisations of this nature in FP calculations simply because they could change the behaviour of the code.
This is a core part of the discussion IMO. I don't want shader compilers to have to live under the same restrictions as C compilers have. Placing all this IEEE crap on the shader part will just be a huge roadblock in the development of graphics technology. If someone would seriously propose something similar to be written into the API spec I would actively oppose it. Or expressed in Carmackian terms, keep dumbass ideas that will haunt us for years out of the API. Reproducibility is only guaranteed by the API in one way: on the same hardware and driver you should get the same results if you do the exact same operations. This kind of repeatability is useful. Expecting mathematical operations to always return the exact same results is not very useful. In fact, I would support getting rid of all these kinds of restrictions on C compilers too. Since 99.99% of the applications don't care about it I think the default should be that these kinds of optimisations are valid. If your code requires that level of repeatability then you either have very odd needs or you just don't know how to write proper code.
As for the fragment pipeline, we have never had this level of repeatability there, and I certainly don't think we should introduce it, ever.
About fp24 vs f32, sure I want more precision if possible. But 32 is an arbitrary number, why not 33, it's even better? Fp24 is pretty good for now, if I get more in the next generation I'm going to be happy, but I would be fine with it if it stayed at 24. They may choose to go 28 too, sounds like a good middle solution that's not too expensive but adds a little more precision for the few apps that would need it.
Edit:
Was also going to say, caring about the last bit or two of precision is kinda silly. Graphics is only half engineering, the other half is artistry. It's like bringing a microscope to the art museum and complaining about the quality of Van Gogh's work.
Joe DeFuria
04-Jun-2003, 15:17
What have I gotten out of all of this? I'm betting the Series5 has 24 bit precision. 8)
Doomtrooper
04-Jun-2003, 15:34
Why not ?? A 'spec' is a spec
Joe DeFuria
04-Jun-2003, 15:43
Agreed. As long as it meets the minimum DX9 specs, that's good enough for me. Didn't mean to imply anything else. Just making the observation.
I fully agree with Humus. Shader precision is about producing artifact-less images, not about getting reproducible results up to the lsb.
If you need full IEEE compliance and have control over evaluation order, you will have to do it in software and live with the fact that it's slower. Of course there are much better alternatives than the reference rasterizer: swShader (http://sw-shader.sourceforge.net). :P
Else just be content. Your eyes can't see the difference anyway. Well I should be speaking for myself... :roll:
Simon F
04-Jun-2003, 16:47
Maybe I'm misinterpreting what you've written, but the ops are certainly deterministic, i.e. a*b always produces c on hardware x. (assuming it's implemented properly!!!). It's just that floating point is not a mathematical group, so the usual maths rules you expect, eg associativity, (a+b)+c = a + (b + c), or the distributive law, a*(b+c)=a*b + a*c, aren't guaranteed.
Actually, this raises an interesting point - presumably C compilers are not allowed to make optimisations of this nature in FP calculations simply because they could change the behaviour of the code.
This is a core part of the discussion IMO. I don't want shader compilers to have to live under the same restrictions as C compilers have.
Do you mean allowing the shader compiler to re-order the operations, eg assume associativity or distributive law? That's risky. As I said, a certain IHV appeared to be using different calcs in the 'fixed T&L' part of the drivers depending on whether shading was on or off and it definitely caused some major rendering errors. (eg Z values changing and making objects flicker).
Entropy
04-Jun-2003, 18:19
A small comment from the scientific computing field.
Code that critically depends on the minutiae that Reverend brings up is effectively broken. You should never, ever write anything which makes those kinds of assumptions.
Assuming rounded rather than truncated results is pretty much as far as you can hope for. If you _need_ control, you should explicitly code for it, never leave it to the system to take care of for you.
Now, in scientific computing, codes tend to have very long lives and get ported all over the place, and are thus probably a worst case, but generally the experience should carry over.
Sireric explained nicely why FP24 is a good compromise for the tasks we ask of this hardware. If you do something else though and need fp32, by all means buy whatever supports it. But making the product significantly slower/costlier for some hypothetical benefit just doesn't make sense. The very same tradeoffs have been made on the CPUs you are currently running on.
BTW, the above should in no way be construed as endorsing general sloppiness when defining computational tools. From personal experience, I do however endorse extreme suspiciousness on the part of programmers as far as these issues are concerned. "Just don't count on it."
Entropy
Maybe I'm misinterpreting what you've written, but the ops are certainly deterministic, i.e. a*b always produces c on hardware x. (assuming it's implemented properly!!!). It's just that floating point is not a mathematical group, so the usual maths rules you expect, eg associativity, (a+b)+c = a + (b + c), or the distributive law, a*(b+c)=a*b + a*c, aren't guaranteed.
Actually, this raises an interesting point - presumably C compilers are not allowed to make optimisations of this nature in FP calculations simply because they could change the behaviour of the code.
This is a core part of the discussion IMO. I don't want shader compilers to have to live under the same restrictions as C compilers have.
Do you mean allowing the shader compiler to re-order the operations, eg assume associativity or distributive law? That's risky. As I said, a certain IHV appeared to be using different calcs in the 'fixed T&L' part of the drivers depending on whether shading was on or off and it definitely caused some major rendering errors. (eg Z values changing and making objects flicker).
The only place where such optimizations would cause problems is on the vertex position output, since it affects fragment depths. Otherwise it should be pretty safe. In the fragment pipeline I see no reason why it should ever be a problem.
darkblu
04-Jun-2003, 19:44
The only place where such optimizations would cause problems is on the vertex position output, since it affects fragment depths. Otherwise it should be pretty safe. In the fragment pipeline I see no reason why it should ever be a problem.
what's with the 'fragment pipeline' and 'vertex pipeline'? - it all comes down to 'real' data ending up discrete (actually it's discrete data getting 'grossly more' discrete, so to say, but nevermind). so, until the very final color output of the very last 'pass' of the algorithm at hand you'd want as high error-proofness as possible (in consumer's terms - 'as money can buy'). saying that poeple don't scrutinize an artist's work under a microscope is not quite the analogy - microscopes deal w/ spatial not so w/ spectral precision, and with the latter you don't know if the artist wouldn't have liked the means to express his vision of a particular color even further than what the 'present art' allowed him. humans strive for perfection - and they wouldn't give it up if they had the means to achieve it (resources & time). in this regard, i'm perfectly fine with the dx9 ps/vs specs, but that does not mean i'm set with those for the rest of my life (any life span expectations aside).
ps: a pretty please w/ sugar on top goes to the well-respected ati employees who spend their well-deserved but sparse spare time to post on these forums - could you (arbitrarily) improve on the aniso algo for the next parts currently in design? i believe i'm speaking for those ppl of the mindset 'aniso should be rather costlier but nicer'. thank you.
pps: before anybody gets the wrong idea, my humble opinion is that r3xx is the best dx9 implementation by far for the time being. i just wish it could be a bit better ;)
antlers
04-Jun-2003, 20:45
Is ATI going with FP24 more akin to 3dfx going with 16-bit only in the Voodoo 3, or 3dfx going 16-bit only in the Voodoo1?
I think it's more akin to going 16-bit only with the Voodoo1. Sure, FP32 is nice to have, but the applications that would demand it and the technology to support it at fast speeds aren't here yet (I've yet to be convinced that the NV35 can do FP32 shaders at adequate speeds).
Also, when it comes to color precision, there are diminishing returns. The visible difference between FP24 and FP32 would be much less than between FP16 and FP24.
Deflection
04-Jun-2003, 21:04
I would LOVE to see an instance where FP32 has an advantage over FP24.
Humus's Mandelbrot demo. It's basically a worst case scenario demo where precision errors can "spiral" out of control (pardon the pun:) Even there you really have to zoom to see it. Kind of like the SS rotating floors for AF on the radeon. The stuff we've seen so far seems to be that FP24 can handle pretty much all that's out there to a very acceptable level.
Where I'm not sure, is that the same can't be said of FP16. ATI and MS don't seem to think so, but Humus's demo can't really be used to judge that because it is a worst case scenario. The 3Dmark demo did show differences too under close examination. The question is, does it fall more on the side of "worst case scenario" or "real games will see results like these". The framerates are rather low which implies intensive shaders that might not make it in to DX9 games. Some people on this forum have said textures need the extra precision, but I don't have the knowledge to judge that.
In any case, we're just now starting to see DX8 pixel shader games. I think it's safe to say the r300 is the best DX9 design so far, but it's tough to say by how much without the games to compare.
Textures already do need extra precision.
If you consider the concept of a 'location' in a texture - well, the biggest textures are 2048x2048. That's 11 bits. But for smooth bilinear filtering, you have to have subtexel precision (because the bilinear interpolation factor is the fractional part of the texture coordinate). That's at least four more bits to be acceptable, and might be more like six.
Luminescent
04-Jun-2003, 22:04
That is exactly what sireric referred to when he wrote this (http://www.beyond3d.com/forum/viewtopic.php?p=93873#93873) a while back, in reference to R3*:
We don't have the 2^127 as the largest number, it's 1.999*2^63 -- Smallest is 2^-64. The range was deemed large enough for most items (1.8*10^19), while giving us 17b of mantissa, which is more than enough for texture lookup (2k texture requires 11b, plus 4b subprecision takes you to 15b -- The extra two bits improve precision in computations and reduce the probability of introducing errors in the max texture addressing computation) Our choice of 24b total was based on this -- enough to cover all texture addresses and most numerical items as well; a "good" balance, imho.
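The bit budget in that quote can be checked with a few lines of arithmetic. This sketch assumes a normalized texture coordinate in [0.5, 1) (the coarsest spacing below 1.0) and counts the implicit leading 1 as part of the significand. subtexel_bits is a hypothetical helper, and the FP16 and FP32 rows are added only for comparison.

```python
import math


def subtexel_bits(sig_bits, texture_size):
    """Sub-texel bits left for a normalized coordinate in [0.5, 1).

    In that range the spacing between representable values is 2**-sig_bits,
    so one step covers texture_size * 2**-sig_bits texels.
    """
    return sig_bits - math.log2(texture_size)


# Significand widths including the implicit leading 1 (assumed layouts).
formats = {"FP16 (s10e5)": 11, "FP24 (s16e7)": 17, "FP32 (s23e8)": 24}

for name, sig_bits in formats.items():
    print(name, subtexel_bits(sig_bits, 2048))
```

For FP24 this gives 6 bits at 2048 texels, i.e. the 4 bits of subprecision from the quote plus the two extra bits it mentions. By the same count FP16 leaves no sub-texel bits at that texture size.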
Reverend
05-Jun-2003, 03:02
As some have said, thanks for your comments Eric. However...
About some misconceptions, and some comments:
1) IE^3 standards do not specify what should be returned for transcendental functions (sqrt, sin, etc...). They specify the format of data (including nans, infinites, denorms) and the internal roundings for results -- This rounding is not the f2i conversion, but how to compute the lsbs of the results. Different HW can return different results. People have learned to live with this.
Correct, but here you have a fine alternative: if you need a reproducible version of cos, you can implement it yourself as a Taylor series, using floating point add and mul, if add and mul are deterministic. But if add and mul aren't deterministic, it's impossible to implement anything deterministically at all. The basic arithmetic ops are the building blocks.
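A minimal sketch of that idea, written in Python rather than shader code. The series coefficients are precomputed constants, so the evaluation itself uses only multiply and add in a fixed order (Horner's scheme). The names COEFFS and taylor_cos are hypothetical, the term count is an arbitrary choice, and there is no range reduction, so it is only meant for small |x|.

```python
import math

# Coefficients (-1)^k / (2k)! of the cosine series, computed once up front.
COEFFS = [(-1.0) ** k / math.factorial(2 * k) for k in range(10)]


def taylor_cos(x):
    """Approximate cos(x) with a fixed sequence of multiplies and adds."""
    x2 = x * x
    acc = 0.0
    # Horner evaluation in x^2, highest-order coefficient first.
    for c in reversed(COEFFS):
        acc = acc * x2 + c
    return acc


print(taylor_cos(0.5), math.cos(0.5))
```

The result is only bit-for-bit reproducible across machines if each individual add and mul rounds the same way everywhere, which is exactly the condition being argued about.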
Maybe I'm misinterpreting what you've written, but the ops are certainly deterministic, i.e. a*b always produces c on hardware x. (assuming it's implemented properly!!!).
On a single machine, it's deterministic.
On all machines supporting DX9, no. NVIDIA's * function and ATI's * function are not the same function because NVIDIA's a*b and ATI's a*b differ.
It's just that floating point is not a mathematical group, so the usual maths rules you expect, eg associativity, (a+b)+c = a + (b + c), or the distributive law, a*(b+c)=a*b + a*c, aren't guaranteed.
Correct, the theoretical thing going on here is that floating point numbers form a "semi-field", rather than a field, because certain laws fail, such as associativity (a semifield is a data type equipped with addition, negation, multiplication, inverse, zero and one; a field is a semifield where all of the operations obey all of the associative, distributive, etc., laws). But at least IEEE defines the operations deterministically across machines.
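The failure of associativity is easy to reproduce. The sketch below uses ordinary IEEE double precision on the CPU (Python floats), not shader hardware, and the three values are chosen purely for illustration.

```python
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c    # a + b is exactly 0.0, so the result is 1.0
right = a + (b + c)   # b + c rounds back to -1e16, so the result is 0.0

print(left, right, left == right)
```

Each individual addition is correctly rounded, yet the two groupings disagree, which is why a compiler that regroups the expression can change the answer.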
Actually, this raises an interesting point - presumably C compilers are not allowed to make optimisations of this nature in FP calculations simply because they could change the behaviour of the code.
Whoa, so true. So, C compilers tend to have optimization options that you can turn on to let the compiler pretend that identities like (a+b)+c = a+(b+c) are true so it can rearrange your code to make it faster. Like most compilers' "assume no aliasing" optimization flag, this isn't strictly safe, but is usually good enough for most tasks. The difference here is that with C, the programmer can choose whether to do things precisely or quickly, whereas with DirectX9, the hardware has already decided for you.
Reverend
05-Jun-2003, 03:13
For example, in a shader that says something like 1/square(magnitude(LightPosition-TexelPosition)), unless your light and texel are real close together, that subtract can easily have many bits of lsb error, and squaring that quantity then doubles the number of error bits.
You got it the wrong way. It's when the light and the texel are close to each other relative to the distance to the origin that you've got problems.
My bad, you're right. This case actually occurs, for example, when a player is in a small room with the lightsource, and the room is far away from the origin of the world.
And squaring it doesn't double the number of error bits, but it can double the error (meaning adding one error bit).
Yup, you're right.
But it's bad (wrt performance and precision) to do the calc that way. It's better to write square(magnitude(X)) as dot(X,X), or to do the latter in an inline function magnitude2(X).
The approaches are pretty similar. The problem occurs just as much in 1D as in 3D, for example (a-b)^2 where a and b are both large numbers that are almost equal.
The subtraction loses as many bits of precision as are needed to store the ratio between the light-to-texel distance and the light/texel-to-origin distance. I.e., if the light-tex distance is one unit, and they are 32 units away from the origin, then you'll lose 5 bits. If you lose too much precision (which certainly is possible if you're not careful), you can gain it back by moving part of the subtraction to the VS (using its higher precision). The VS can make sure that the PS gets a local coordinate system for (lightPos-texPos).
Yes, you can definitely reduce the amount of error by arranging calculations as carefully as possible, and moving certain things into the vertex shader (or doing them on the CPU in double-precision and passing the final results down to a VS). This all requires more programming effort of course. It also limits the generality of what you can set up. When you are writing a single pixel shader, you can look at the overall algorithm and manage its precision carefully.
But for example if you're writing a bunch of shader components that can be combined together to form pixel shaders (for example, a specular lighting module, a spherical harmonic module, attenuation modules, etc), you can't be so sure about how much precision will be lost as data is passed between the different routines, given that they can be plugged together arbitrarily, by artists. This is the essence of what engines are meant to do, not to provide a single shader or single feature, but a bunch of shaders that the content creators can piece together to achieve the effect they want.
I'm sure for all that I've written thus far, there is a spirit of the arguments for FP24 (or other hardcoded hardware limitations in general) being always something like "for all the shaders we can think of, this isn't a problem. If you think there's a problem, send us a shader and we'll show you how to work around our limitations with it.". The flaw in that logic is that it assumes isolated pieces of shader code matter, but what really matters is the set of all possible shaders an engine can generate. If you look at Max or Maya's lighting models and material systems, they're all along these lines, not a single shader with a few knobs you can twiddle with, but general frameworks for combining arbitrary other shader functionality.
Reverend
05-Jun-2003, 03:48
Is ATI going with FP24 more akin to 3dfx going with 16-bit only in the Voodoo 3, or 3dfx going 16-bit only in the Voodoo1?
I think it's more akin to going 16-bit only with the Voodoo1. Sure, FP32 is nice to have, but the applications that would demand it and the technology to support it at fast speeds aren't here yet (I've yet to be convinced that the NV35 can do FP32 shaders at adequate speeds).
Also, when it comes to color precision, there are diminishing returns. The visible difference between FP24 and FP32 would be much less than between FP16 and FP24.
That's the thing... the entire chicken-and-egg scenario. Sure, the apps that demand it aren't here yet but if we have a long timeframe where FP24 hardware is the majority of the video cards out there, it will be even longer before we see such apps than if FP32 hardware had debuted instead of FP24 in the first place. That's logical and makes business sense to developers who sell games.
Obviously, it all comes down to performance when you make a piece of hardware. But the point of my starting this thread really isn't about slower FP32 performance compared to FP24 -- it was simply about instances where I think FP24 has definite disadvantages compared to FP32 and I wanted others to confirm if my understanding and thinking about this is correct or not because I have never had much faith in myself when I see and know there are so many folks here more knowledgeable about coding and hardware than myself :)
Colourless
05-Jun-2003, 06:53
Even though I get a feeling this comment is going to bite me in the ass sometime in the future, Precision be damned! The biggest limitation that I'm facing with the R300 is purely instruction counts!
More instruction slots and more registers are needed right now, not really more precision. Of course, GFFX has all three, so Nvidia at least did something right with it.
Luminescent
05-Jun-2003, 07:06
It seems the R3xx's latest incarnation (R350) supports an unlimited number of instructions (in the fragment shader) via f-buffer, although I'm not sure if the functionality is currently exposed in drivers.
LeStoffer
05-Jun-2003, 10:49
That's the thing... the entire chicken-and-egg scenario. Sure, the apps that demand it aren't here yet but if we have a long timeframe where FP24 hardware is the majority of the video cards out there, it will be even longer before we see such apps than if FP32 hardware had debuted instead of FP24 in the first place. That's logical and makes business sense to developers who sell games.
Yes, but I prefer to look at it from a much more practical view: Going from FP24 to FP32 for developers can't really take much of an effort when you look at how much they had to upgrade their skill to write shaders in the first place (PS 1.1 - 1.4) and working with FP (PS 2.0) the second time around.
And then Colourless brings up the crucial point of what you want the IHV to include in their silicon budget (gotta love that word): Do you really want them to use up so much space for FP32 when we are still in the very start of cinematic rendering (like Colourless mentions, more instruction slots and registers)?
In other words: I sincerely doubt that the industry will stop in its tracks if we don't see all IHVs doing FP32 before DX10. :wink: Just for the record: I think ATI made the right decision with R300 for all us non-developers, while I can see why nVidia wanted the developers to have the opportunity to mess around with the future today.
I know this isn't the point you're making - I don't care about IEEE standards in my games :P - but I just like to keep part of the discussion within the constraints of reality (the given silicon budget). IMHO.
On a single machine, it's deterministic.
On all machines supporting DX9, no. NVIDIA's * function and ATI's * function are not the same function because NVIDIA's a*b and ATI's a*b differ.
And that's the way it should be. We have never had any more determinism and we should not enforce it because it's basically useless and a heck of a burden to put on the shoulder of IHVs and in the end on the customers.
Whoa, so true. So, C compilers tend to have optimization options that you can turn on to let the compiler pretend that identities like (a+b)+c = a+(b+c) are true so it can rearrange your code to make it faster. Like most compilers' "assume no aliasing" optimization flag, this isn't strictly safe, but is usually good enough for most tasks. The difference here is that with C, the programmer can choose whether to do things precisely or quickly, whereas with DirectX9, the hardware has already decided for you.
In OGL2 there have been talks about providing ways to turn optimizations off, but I don't know the status of that. That should satisfy everyone. For shaders, optimisations should default to on.
Either way Reverend, you haven't explained why just 32 bits is significant. It's an arbitrary number just like every other. Assume ATI had provided fp32 already, this whole discussion would still apply, except all numbers += 8. The same argumentation could be made that "why don't we have fp40, there are applications that could use it".
Reverend
05-Jun-2003, 12:01
That's the thing... the entire chicken-and-egg scenario. Sure, the apps that demand it aren't here yet but if we have a long timeframe where FP24 hardware is the majority of the video cards out there, it will be even longer before we see such apps than if FP32 hardware had debuted instead of FP24 in the first place. That's logical and makes business sense to developers who sell games.
Yes, but I prefer to look at it from a much more practical view: Going from FP24 to FP32 for developers can't really take much of an effort when you look at how much they had to upgrade their skill to write shaders in the first place (PS 1.1 - 1.4) and working with FP (PS 2.0) the second time around.
There is no additional effort (FP24 -> FP32) if you know exactly what you aim for -- FP32 is available to me, I know what it offers and what its limitations are and I work on that from the very start... this isn't about "upgrading". All of my postings in this thread are based on using FP32 -- I can't do this (http://www.beyond3d.com/forum/viewtopic.php?p=91712#91712) (which is important to me, for what I have in mind, which as OpenGL guy pointed out in a hidden way, doesn't matter) with FP24. I don't know if what I want is important nor what a game developer may want to do, of course.
And then Colourless brings up the crucial point of what you want the IHV to include in their silicon budget (gotta love that word): Do you really want them to use up so much space for FP32 when we are still in the very start of cinematic rendering (like Colourless mentions, more instruction slots and registers)?
Do I want them to? Yes I do. But I don't have/need to consider competition and I don't work for an IHV :)
In other words: I sincerely doubt that the industry will stop in its tracks if we don't see all IHVs doing FP32 before DX10. :wink:
This is rather silly -- of course the "industry" won't stop because of this.
Just for the record: I think ATI made the right decision with R300 for all us non-developers, while I can see why nVidia wanted the developers to have the opportunity to mess around with the future today.
Perhaps all that I have written is based on the fact that the R300 is a resounding success -- and usually when I see a resounding success, I start thinking "Why didn't they do this in the first place?" Kinda like asking for a mile when I am given an inch :)
Reverend
05-Jun-2003, 12:09
Either way Reverend, you haven't explained why just 32 bits is significant. It's an arbitrary number just like every other. Assume ATI had provided fp32 already, this whole discussion would still apply, except all numbers += 8. The same argumentation could be made that "why don't we have fp40, there are applications that could use it".
The entire point of starting this thread is based on DX9 and IEEE-32, both available standards. It's not about "XX bits" nor an additional 1 bit -- it's about the two standards I know of, which I offered as the basis for this discussion. If I followed your way of thought, this thread wouldn't exist -- nothing is ever enough.
You appear to not know the basis of my wanting to start this discussion, which was very specific -- it's about FP24 and FP32, nothing more than that -- and you have digressed onto "But what is enough for you Rev?", which isn't what I want to talk about. I gave specific examples of why I want FP32, and not FP24. Not why I always want more. I have explained why FP32 is significant to me (to me, to me, TO ME ALONE! :) ) compared to the availability of FP24. I have not explained why FP32 is enough as a distinct floating point spec (32bit) because that would be pointless -- as I said, nothing is ever enough when you get more creative. I am simply working on FP32 and 32bits alone, compared to FP24 and 24-bits. Hope this is clear.
If you want me to stick to talking about "what's available", I would have nothing to say and live with what's available because, well, that's all I can do, right?
Then what's this talk about reproducibility all about? If someone goes to fp40, then any reproducibility is once again kicked out of the window. Arguing for a particular precision is odd IMO, be it 32, 24 or anything else. It's more precision => better (assuming same performance).
Reverend
05-Jun-2003, 12:59
As is usually the case in any thread, things get sidetracked -- I didn't bring up reproducibility. Well, actually I did but I had to, in response to sireric's first post in this thread, hehe :)
I can tell you one thing though -- I already know why I want more than FP32... but that'll have to be in another thread. And another time where I'll be damned for wanting more than what is the "API" standard. :)
K.I.L.E.R
05-Jun-2003, 18:30
I would LOVE to see an instance where FP32 has an advantage over FP24.
Humus's Mandelbrot demo. It's basically a worst case scenario demo where precision errors can "spiral" out of control (pardon the pun:) Even there you really have to zoom to see it. Kind of like the SS rotating floors for AF on the radeon. The stuff we've seen so far seems to be that FP24 can handle pretty much all that's out there to a very acceptable level.
Where I'm not sure, is that the same can't be said of FP16. ATI and MS don't seem to think so, but Humus's demo can't really be used to judge that because it is a worst case scenario. The 3Dmark demo did show differences too under close examination. The question is, does it fall more on the side of "worst case scenario" or "real games will see results like these". The framerates are rather low which implies intensive shaders that might not make it in to DX9 games. Some people on this forum have said textures need the extra precision, but I don't have the knowledge to judge that.
In any case, we're just now starting to see DX8 pixel shader games. I think it's safe to say the r300 is the best DX9 design so far, but it's tough to say by how much without the games to compare.
Is there any other demo? Like a small level with smoke, etc... that would also show the difference between FP32 and FP24?
Hellbinder
13-Jun-2003, 03:05
One thing I don't understand about this entire Discussion is the fact that even though Nvidia is Currently supporting FP32 you can hardly say it's a practical and usable feature. The Subject has been brought up a few times whether FP32 is needed.
How can one say that offering FP32 to developers to use in games is a better option if that feature is completely impractical and slow for anything but a flash pan effect in a game? Why is ATi's Choice with fP24 any different than Nvidias choice to only Partially support FP32?
This comment specifically concerned me (though i have read every word of this thread)
More instruction slots and more registers are needed right now, not really more precision. Of course, GFFX has all three, so Nvidia at least did something right with it.
How do you figure??? If ATi is limited (in R300) by instruction then Nvidia is Clearly limited by Precision. Because they can't offer that added precision in a way that can be mass used in a game. They have to revert to FP16 or even lower modes than that. What are you possibly coding that requires more instructions than what R300 currently supports?? Hell look at HL2 and some of their effects.. yet it is running great on an R350.
You take any PS 2.0 based application or demo currently existing and ATi's hardware is running it faster. Can you tell me that the GFFX you are developing on is actually handling a large number of instructions at playable Frame rates currently? What about the FX5600 or 5200. How are they supposed to deal with these really long shaders you seem to be working on. The evidence so far is not there to back it up from what I can see. Hell, Nvidia is even complaining about the way 3dmark03 is coded with too many redundant instructions.
I guess the big issue for me is how can one complain that ATi is perhaps holding everyone back by going with FP24, when at the same time the competition just released an entire line of cards that might as well be FP24 for all practical purposes.
I'm not a techie like you guys and I'm sure that more is always better when all things are considered. But when you can't use something it becomes worthless. Imho nvidia is doing more harm than good. They included support for something that can't run with any decent speed. Knowing this they added support for something much lower than the spec called for but left in the slow one to be compliant. Wouldn't it have been much better for the advancement of games if nvidia supported 24bit and offered it at a decent speed ?
FP32 is not inherently slower than FP24, FP16, or FPwhatever. It just takes more silicon to support it natively. This means more transistors in your ALUs (about 2x as many for a multiplier); more transistors spent on registers to hold the same number of elements; and wider datapaths. On the other hand, note that R3xx already supports FP32 throughout most of its pipeline--the vertex shader is of course all FP32, and the pixel shaders load and store in FP32 format, so you're already taking the hit in terms of bandwidth, and possibly cache (?), up until you get to the pixel shaders themselves.
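To put rough numbers on that silicon cost, here is a back-of-the-envelope sketch. It assumes R300's FP24 is laid out as 1 sign, 7 exponent and 16 mantissa bits, that FP32 is the IEEE single layout (1/8/23), and that a plain array multiplier grows with the square of the significand width (stored mantissa plus the implicit leading 1). The function names are made up for illustration.

```python
# Rough silicon-cost comparison between shader float formats.
# Layouts are (sign, exponent, mantissa) bits; the FP24 split is an assumption.
FORMATS = {"FP16": (1, 5, 10), "FP24": (1, 7, 16), "FP32": (1, 8, 23)}

def significand_bits(fmt):
    """Stored mantissa bits plus the implicit leading 1."""
    return FORMATS[fmt][2] + 1

def total_bits(fmt):
    """Storage width of one value in bits."""
    return sum(FORMATS[fmt])

def multiplier_area_ratio(big, small):
    """Assumes an array multiplier whose area scales as n^2 partial-product bits."""
    return (significand_bits(big) / significand_bits(small)) ** 2

def register_size_ratio(big, small):
    """Ratio of register storage needed to hold the same number of values."""
    return total_bits(big) / total_bits(small)

if __name__ == "__main__":
    print(f"multiplier area FP32/FP24: {multiplier_area_ratio('FP32', 'FP24'):.2f}x")
    print(f"register size   FP32/FP24: {register_size_ratio('FP32', 'FP24'):.2f}x")
```

Under those assumptions the multiplier ratio comes out just under 2x and the register ratio at 1.33x, which lines up with the "about 2x" multiplier figure above.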
As for NV3x, FP16 and FP32 run at exactly the same speed, except for the issue of register file usage. NV3x's pixel shader pipeline is rather ridiculously poor in full-speed temp registers; indeed, testing shows a shader can only address 256 bits of registers without taking a performance hit. This equates to 4 FP16 values (FP16 * 4 components = 64 bits per value), which is bad enough, but only 2 FP32 values, which is downright awful. In fact, it's so awful that the best explanation I've seen for it is that they must have something buggy in their implementation which is being worked around by using a number of should-be GPRs as special-purpose registers, and thus taking them off the table. But the point is, if this presumed bug were fixed, or (if NV3x really were this register-poor by design) the design were a bit less braindead, NV3x's FP32 performance would be the same as its FP16 performance in the majority of cases. Still not great, to be sure, but nothing to do with FP32 being inherently slow.
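The register arithmetic in that paragraph is easy to check. A minimal sketch, taking the 256-bit full-speed budget from the post as given and assuming every temp register holds a 4-component vector (the helper name is hypothetical):

```python
FULL_SPEED_REGISTER_BITS = 256  # figure quoted in the post, not measured here
COMPONENTS_PER_REGISTER = 4     # assumed: each temp holds an RGBA/XYZW vector

def full_speed_registers(bits_per_component,
                         budget_bits=FULL_SPEED_REGISTER_BITS,
                         components=COMPONENTS_PER_REGISTER):
    """How many vector temps fit in the budget before the slowdown kicks in."""
    return budget_bits // (bits_per_component * components)

if __name__ == "__main__":
    for name, width in (("FP16", 16), ("FP32", 32)):
        print(name, full_speed_registers(width))
```

That gives 4 temps at FP16 and 2 at FP32, matching the numbers above.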
As with all matters of computer performance, this is all about tradeoffs. And as with most matters of graphics performance (indeed, probably most matters of computer performance in general), the balance of the tradeoffs is mostly a function of contemporary process technology. As Moore's Law rolls along, the proper tradeoff inevitably changes from one side to the other. That is, the primary fact of hardware engineering is that your transistor budget roughly doubles every 1.5 years. At some point, tradeoffs you decided against because you couldn't justify the transistor expense eventually become worthwhile.
Of course, since graphics is an embarrassingly parallel problem, there is almost always a good default way to use up one's ever-increasing transistor budget: just slap on more pixel pipelines, TMUs or vertex units. This is quite a different situation from CPUs, where most problems are not embarrassingly parallel and thus extra transistors are used in periodic (every 5 years or so) redesigns for ever-more-complicated control to try to extract as much parallelism as possible from code, and in between are just donated to more and more cache. Unfortunately, the default uses of extra transistors--more pipes on a GPU, and more cache on a CPU--are subject to diminishing returns on many applications; with GPUs, eventually adding more pipes will get you nothing because you are bandwidth limited. (And indeed most GPUs already have more or less enough pipes for their available bandwidth.)
So there is a space in which such tradeoffs are evaluated: transistor budget (which is itself a tradeoff of performance vs. functionality vs. manufacturing cost) vs. the benefits of the feature vs. the benefits of the default use for extra transistors (i.e. extra pipelines, or perhaps some other worthy feature). So, getting back to FP32 vs. FP24: we already know the transistor cost (2x bigger multipliers, and 33% bigger registers); what are the benefits?
There are basically four issues:
color fidelity: essentially no need for anything more than FP24 or indeed anything more than FP16. After all, the colors are going to be output on an FX8 monitor for the foreseeable future (maybe FX10 sometime soon).
texture addressing: a 2048x2048 texture at 32 bits per texel runs 16MB (which is to say, nothing larger will be used for quite some time); FP24 can accumulate 4 bits of error and still address such a texture accurate to 2 subpixel precision bits. On the other hand, FP16 is not sufficient to address large textures with subpixel accuracy; it is for precisely this reason that PS 2.0 and ARB_fragment_program both require at least FP24 support. (Sireric makes exactly this point early in the thread)
world-space coordinates: these basically need to be FP32 for any sort of accuracy over large distances (which is why they are FP32 in the vertex pipeline). To the extent you want to use positional coordinates as input to a pixel shader (Rev discusses a perfect example here (http://www.beyond3d.com/forum/showpost.php?p=75722&postcount=38)), you may be able to get away with FP24 with some hacks or restrictions, but in many situations you will get artifacts.
accumulated error: the longer a shader is, the more error can build up; FP24 shaders of moderate length may start to give incorrect answers for texture addressing, and eventually even for color output (although in practice you'd need a really long shader for that).
The last one is the most interesting, because it brings up another important point about hardware tradeoffs: they have to take into account the prevailing performance environment they will be used in. This is particularly important in realtime graphics, because there is a very narrow target you are shooting for: a realized fillrate of 40-150 million pixels per second. Anything more than that is essentially wasted; much less, and you might as well not bother.
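For a feel of where a 40-150 Mpixel/s window sits, here is a tiny sketch that turns a display mode into a required fillrate. The resolutions, refresh rates and the overdraw parameter are illustrative assumptions, not figures from the post.

```python
def required_fillrate_mpix(width, height, fps, overdraw=1.0):
    """Millions of shaded pixels per second needed for a given display mode.

    overdraw is the assumed average number of times each pixel is shaded
    per frame; 1.0 means every pixel is shaded exactly once.
    """
    return width * height * fps * overdraw / 1e6

if __name__ == "__main__":
    for mode in ((1024, 768, 60), (1600, 1200, 75)):
        print(mode, round(required_fillrate_mpix(*mode), 1), "Mpix/s")
```

With no overdraw, 1024x768 at 60 fps lands near the bottom of that window and 1600x1200 at 75 fps near the top.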
Given that target range, and given the throughput of today's pixel shader implementations, shaders long enough to bring out precision artifacts in FP24 are pretty unlikely to arise in realtime use for the next few years. Which is not to say never: a shader might be particularly poorly behaved, or a game might get away with a couple really long shaders if they're used on a relatively small portion of the screen. And it certainly doesn't address non-realtime use, where the range of useful performance is much wider.
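The way error builds up with shader length can be illustrated with a toy model. The sketch below rounds every intermediate result to a fixed number of significand bits and compares against double precision. It is only a stand-in: the harmonic sum is an arbitrary chain of operations, not a real shader, and 17 significand bits for FP24 is an assumption (16 stored mantissa bits plus the implicit 1).

```python
import math

def quantize(x, significand_bits):
    """Round x to the nearest value with the given number of significand bits."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    scaled = round(m * 2 ** significand_bits)
    return math.ldexp(float(scaled), e - significand_bits)

def harmonic_sum(n_ops, significand_bits):
    """Sum 1/k for k = 1..n_ops, rounding after every divide and every add."""
    total = 0.0
    for k in range(1, n_ops + 1):
        term = quantize(1.0 / k, significand_bits)
        total = quantize(total + term, significand_bits)
    return total

def accumulated_error(n_ops, significand_bits):
    """Absolute difference from the same sum carried out in double precision."""
    exact = sum(1.0 / k for k in range(1, n_ops + 1))
    return abs(harmonic_sum(n_ops, significand_bits) - exact)

if __name__ == "__main__":
    for n_ops in (8, 64, 512):
        for name, bits in (("FP16", 11), ("FP24", 17), ("FP32", 24)):
            print(n_ops, name, accumulated_error(n_ops, bits))
```

Running it for a few lengths shows how the gap to the reference moves as the chain gets longer; how fast it grows in a real shader depends heavily on the operations involved.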
So what's the conclusion from all this rambling? In a sentence, FP24 is probably the best choice for current generation GPUs, but FP32 will be the best choice soon enough. Right now about the only thing FP24 can't handle well is positional coordinates; a couple generations down the road, however, GPUs will have the shader processing power to allow those long shaders which will bring out FP24's limitations in other uses as well. (After all, they won't be using their extra power for more pixels, because above ~150 Mp/s, there's no point.)
Plus, while the transistor overhead FP32 requires over FP24 might be a bad tradeoff in .15u and even .13u, as process technology improves it will look better and better; at .09u it's probably a shoo-in. Remember, it's not slower, it just takes more transistors; and transistor budgets are skyrocketing all the time. And there are other advantages to full FP32 support, most notably that it allows the unified pixel/vertex shader model Uttar and Demalion are always talking about.
As always, a particular hardware feature is almost never "good" or "bad" in isolation, but only when considered as a tradeoff between the manufacturing constraints and end-user environment it will spend its life in. The age-old "CISC vs. RISC" debate is a perfect example of this. Which is better? Neither: each was a function of the prevailing environment in its time. "CISC" was the best choice throughout the 70s and early 80s, with a heyday in the late 70s, for a number of reasons: primarily that core memory was too expensive, so minimizing code size was of primary importance; but also because the process technology of the time didn't allow for significant on-chip register files, and the compiler technology of the time wasn't good enough for high-level languages to be a win over assembly for most uses. RISC was the clear best choice from the mid 80s until recent times, but particularly in the early 90s. That's because memory became cheap enough that code bloat wasn't much of a problem; compilers became good enough for high-level languages to become the obvious choice, and to do the simple optimizations necessary for decent in-order RISC performance; and process technology was good enough to allow first large register files, then ever increasing levels of pipelining, then superscalar designs and then out-of-order designs, all of which were more easily realized with RISC than CISC architectures. In the late 90s, CISC ISAs (well, x86) became increasingly competitive with RISCs, because transistor budgets had increased to the point where CISC-to-"RISC" decoders could be stuck on the front-end, thus allowing all the design benefits of RISC (easy pipelining, superscalar, and OoO) for an increasingly negligible silicon cost; and because the increasing importance of system bandwidth as a bottleneck meant CISC's code-size advantage counted for something again.
Looking to the future, it appears that compiler advances will indeed bring Intel's much-maligned EPIC (plus the VLIWs that are increasingly moving into the media-processor space) significant implementation-normalized advantages over competing architecture philosophies.
No approach is "better"; rather they can all only be judged in terms of the times they were designed for. During crossover periods there is certainly much valid debate over the best solution for the time; but for the most part such discussions are more a matter of "when" and not "if". Which is not to say that having an "ahead of its time" design is a good thing; in hardware, being ahead of your time is just as much of a sin as being behind the times.
Dave Baumann
13-Jun-2003, 11:09
R3x0 uses separate texture address processors which are at FP32 (AFAIK).
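For reference, the texture-addressing arithmetic quoted earlier in the thread (2048-wide texture, 4 bits of accumulated error, 2 subpixel bits left) can be checked with a quick sketch. It assumes coordinates normalized to [0, 1), the worst case of a coordinate near 1.0, and mantissa widths of 10/16/23 stored bits for FP16/FP24/FP32 (the FP24 split is an assumption). The function name is hypothetical.

```python
def subtexel_bits(mantissa_bits, texture_width, error_bits=0):
    """Bits of sub-texel precision left when addressing a texture.

    A coordinate in [0.5, 1) has mantissa_bits + 1 fractional bits of
    resolution. Selecting one of texture_width texels uses log2(width) of
    them, and error_bits more are assumed lost to accumulated rounding.
    """
    significand = mantissa_bits + 1
    texel_bits = (texture_width - 1).bit_length()  # ceil(log2(width)) for width >= 1
    return significand - texel_bits - error_bits

if __name__ == "__main__":
    for name, mant in (("FP16", 10), ("FP24", 16), ("FP32", 23)):
        print(name, subtexel_bits(mant, 2048, error_bits=4))
```

With these assumptions FP24 keeps 2 sub-texel bits after 4 bits of error, while FP16 has none to spare even before any error accumulates.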
Simon F
13-Jun-2003, 11:14
FP32 is not inherently slower than FP24, FP16, or FPwhatever. It just takes more silicon to support it natively.
Are you sure that it's possible to keep the calculation time constant just by throwing even more silicon at the problem?
I'm not convinced - surely it must still be slower.
Chalnoth
13-Jun-2003, 11:29
world-space coordinates: these basically need to be FP32 for any sort of accuracy over large distances (which is why they are FP32 in the vertex pipeline). To the extent you want to use positional coordinates as input to a pixel shader (Rev discusses a perfect example here (http://www.beyond3d.com/forum/showpost.php?p=75722&postcount=38)), you may be able to get away with FP24 with some hacks or restrictions, but in many situations you will get artifacts.
This may make FP32 a shoo-in for the gen4 designs (NV4x, R4xx), if they have unified shader architectures. That is, they really need at least FP32 for vertex positions, so if the same hardware is used for PS ops...
In the late 90s, CISC ISAs (well, x86) became increasingly competitive with RISCs, because transistor budgets had increased to the point where CISC-to-"RISC" decoders could be stuck on the front-end, thus allowing all the design benefits of RISC (easy pipelining, superscalar, and OoO) for an increasingly negligible silicon cost; and because the increasing importance of system bandwidth as a bottleneck meant CISC's code-size advantage counted for something again. Looking to the future, it appears that compiler advances will indeed bring Intel's much-maligned EPIC (plus the VLIWs that are increasingly moving into the media-processor space) significant implementation-normalized advantages over competing architecture philosophies.
Not graphics related, so OT, but this is a pet peeve of mine.
CISC has *zero* code size advantage over RISC. Compare ARM thumb or MIPS16 to x86 and you'll find the latter losing. The average instruction size of the new x86-64 is 5 bytes per instruction, -yes you can have a memory operand in there, but at the same time you only have a 2-address instruction format, -and fewer registers, so you'll end up with more instructions shuffling data around than in a typical RISC.
Also decoding ia32 into uOps does not take negligible resources. Decoders are either big and power hungry (Athlon) or less power hungry but even bigger (P4; trace cache). A 21264 core is half the die size of the P4 in a similar process and yet has higher performance. The success of x86 is solely due to economy of scale, which has allowed the companies behind the MPUs to pour $$$$ into process and uarch developments while still maintaining a price/performance edge.
Finally: The compiler advancements that will benefit EPIC (VLIW) will also benefit every single other architecture out there. The only thing EPIC has going for it is the large register file, -and with SMT becoming ever more popular even that is looking likely to be a liability (big ass context-> fewer contexts juggled at the same time->lower throughput).
Cheers
Gubbi
DemoCoder
13-Jun-2003, 11:59
FP32 is not inherently slower than FP24, FP16, or FPwhatever. It just takes more silicon to support it natively.
Are you sure that it's possible to keep the calculation time constant just by throwing even more silicon at the problem?
I'm not convinced - surely it must still be slower.
Well, for addition, it can certainly be made constant. Addition is O(n), and with n adders you can do it in constant time. For multiplication, it depends on how much logic you want to burn. Simple multiplication is O(n^2) (but that can be lowered to O(n^lg(3)), or O(n lg n) if you use fast Fourier techniques; however, those tricks only pay off on truly large numbers useful for huge number-theoretic math and cryptography).
What this means is that multiplication requires a quadratic increase in circuit complexity if you want to preserve constant speed. I recall that there are two ways this is implemented today: Booth encoding with arrays of adders, and Wallace trees. You can preserve constant speed as long as your critical path doesn't get too long. 32-bit multiplication with single cycle throughput is a mature technology, and so yes, throwing silicon at the problem has resulted in constant speed.
Whether this could be extended further (say, FP64 or FP128) is unknown.
Edit: BTW, I just found this on google, should be educational http://www-2.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15828-s98/lectures/0126/sld001.htm
There's also the problem with power. An FP32 FMADD unit will use twice the amount of power of an FP24 one, and with current (and future) power densities, this is likely to impact performance in a negative way.
Cheers
Gubbi
Reverend
13-Jun-2003, 12:27
I'd like to say thanks to Dave H for his post above.
Simon F
13-Jun-2003, 12:29
FP32 is not inherently slower than FP24, FP16, or FPwhatever. It just takes more silicon to support it natively.
Are you sure that it's possible to keep the calculation time constant just by throwing even more silicon at the problem?
I'm not convinced - surely it must still be slower.
Well, for addition, it can certainly be made constant. Addition is O(n), and with n adders you can do it in constant time.
I'm still not quite following you. With an N-bit carry-save adder you can do MOST of the add in constant time, but surely you eventually have to resolve the carries, which is going to be at least an O(log(n)) operation or perhaps even linear.
For multiplication, it depends on how much logic you want to burn. Simple multiplication is O(n^2) (but that can be lowered to O(n^lg(3)), or O(n lg n) if you use fast Fourier techniques; however, those tricks only pay off on truly large numbers useful for huge number-theoretic math and cryptography).
I'm aware of the other methods, but I don't see that you can completely trade-off time vs area....
What this means is that multiplication requires a quadratic increase in circuit complexity if you want to preserve constant speed. I recall that there are two ways this is implemented today: Booth encoding with arrays of adders, and Wallace trees. You can preserve constant speed as long as your critical path doesn't get too long.
Ahh.. there you have it. "as your critical path doesn't get too long". The time is not constant.
Edit: BTW, I just found this on google, should be educational http://www-2.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15828-s98/lectures/0126/sld001.htm
I'll have a look shortly.
arjan de lumens
13-Jun-2003, 13:18
Constant speed is O(1), not O(n). With both adders, multipliers, and barrel shifters (the basic circuits from which FP units are made), since the MSB of the result potentially depends on all the bits of the inputs, you can't get lower than O(log n) gate delays (and O(n) interconnect delay) no matter how you design the circuit, which is hardly constant.
FP24 will always be faster than FP32 if you spend equal effort on optimizing them, but the speed difference will be small, perhaps 5% or so in present-day processes. Similarly, FP16 will be only slightly faster than FP24 in turn. These small differences often disappear once you align the circuit timings to a clock.
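The O(log n) gate-delay point can be made concrete with a software model of a parallel-prefix adder. The sketch below is a Kogge-Stone style adder, picked here only as one well-known log-depth design; nobody in the thread says which adder any GPU actually uses. It returns the sum together with the number of prefix levels, which is a rough proxy for gate depth.

```python
def kogge_stone_add(a, b, n):
    """Add two n-bit unsigned integers with a Kogge-Stone style prefix network.

    Returns (sum modulo 2**n, number of prefix levels).
    """
    # Per-bit generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    p_bit = p[:]  # keep the original propagate bits for the final sum

    levels = 0
    d = 1
    while d < n:
        # Each level combines a bit's (g, p) pair with the pair d positions below.
        new_g, new_p = g[:], p[:]
        for i in range(d, n):
            new_g[i] = g[i] | (p[i] & g[i - d])
            new_p[i] = p[i] & p[i - d]
        g, p = new_g, new_p
        d *= 2
        levels += 1

    # After the prefix network, g[i] is the carry out of bit i.
    result = 0
    for i in range(n):
        carry_in = g[i - 1] if i > 0 else 0
        result |= (p_bit[i] ^ carry_in) << i
    return result, levels

if __name__ == "__main__":
    # Significand widths: 11 (FP16), 17 (assumed FP24), 24 (FP32).
    for width in (11, 17, 24):
        print(width, "bits ->", kogge_stone_add(1, 1, width)[1], "prefix levels")
```

In this model the 17-bit and 24-bit significands both need five prefix levels and the 11-bit one needs four, which fits the remark that the differences are small and tend to vanish once timings are aligned to a clock. Wire delay and the rest of the FP datapath are not modelled.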
What is the expected time it takes to calculate an FP32 addition or multiplication?
Can one expect that a 0.15 circuit is able to execute 150M, 300M or 600M of such operations?
I ask this because we usually know the throughput only, but not the latency. Latency is usually hidden in GPUs by processing multiple {vertices / fragments} in parallel.
I was surprised to find out that on the GF3 it takes 6 cycles to execute a VS instruction.
Can it be the same for FP pixel shader architectures?
Can it be that the NV30 is not so sensitive to texture lookup latency because the FP units have similar latency, so it doesn't really matter whether you use one or the other?
RussSchultz
13-Jun-2003, 14:39
Constant speed is O(1), not O(n). With both adders, multipliers, and barrel shifters (the basic circuits from which FP units are made), since the MSB of the result potentially depends on all the bits of the inputs, you can't get lower than O(log n) gate delays (and O(n) interconnect delay) no matter how you design the circuit, which is hardly constant.
Wouldn't pipelining make the delays inconsequential? If you have a 4 stage tree, calculate each stage per clock. You have higher latency, but you're generally working with a long stream of data so it shouldn't matter.
arjan de lumens
13-Jun-2003, 15:22
Not 100% sure how fast an FP32 unit is in the various processes, but I can do some guesswork: in highly custom logic on a high-leakage process, the AthlonXP @ 0.13 can do an FP32 mul or add with a latency of 4 clock cycles @ 2.25 GHz = ~1.8 ns. For standard logic at TSMC/UMC 0.13, I would estimate it takes about twice as long, giving about 3.5 ns. If you are doing fused multiply-add, you may add perhaps 50% to the delay for ~5-6 ns. If you are doing DOT3, RCP or other composite operations, I would expect the delay of a fused-multiply-add + a standard add + a little bit more, giving ~8-11 ns. Numbers get even worse for 0.15 micron. In any case, these operations will have a latency of several clock cycles if you want to reach a remotely reasonable clock speed. These are latency numbers, you can get the throughput as high as you want with sufficiently deep pipelining.
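The latency guesses above translate into pipeline depth with simple arithmetic. A minimal sketch follows; the 500 MHz clock is just an illustrative GPU clock chosen for the example, and the helper names are made up.

```python
import math

def latency_from_cycles(cycles, clock_ghz):
    """Latency in nanoseconds of an operation taking `cycles` at `clock_ghz`."""
    return cycles / clock_ghz

def pipeline_stages(latency_ns, clock_mhz):
    """Minimum whole pipeline stages needed to hide latency_ns at clock_mhz.

    With the unit fully pipelined, throughput stays at one result per clock
    no matter how many stages this returns.
    """
    return math.ceil(latency_ns * clock_mhz / 1000.0)

if __name__ == "__main__":
    print("Athlon FP op:", round(latency_from_cycles(4, 2.25), 2), "ns")
    for label, ns in (("mul/add", 3.5), ("fmad", 6.0), ("dot3/rcp", 11.0)):
        print(label, pipeline_stages(ns, 500), "stages at 500 MHz")
```

So at an assumed 500 MHz, the guessed 5-6 ns multiply-add needs about three stages and the 8-11 ns composite ops four to six, consistent with "several clock cycles".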
As for texture lookup, this operation is far slower and more complex than any FP32 arithmetic operation in turn - the operations for a texture lookup go approximately as follows:
Compare texture coordinates and divide the smaller ones by the larger one (if doing cube-mapping)
Determine mipmap level (measure differences in texture coordinates between adjacent pixels, then do calculations involving many multiplies and at least one logarithm)
Scale and wrap/clamp the texture coordinates (simple)
Look up the needed texels from the texture cache. The delay through this stage will be much larger than going through a standard CPU cache (on the order of ~1 external memory latency), or else you won't be able to mask memory latency and overlap cache line fills, ruining performance.
Perform bi/trilinear interpolation on the resulting texels.
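As an illustration of the mipmap-level step in the list above, here is a sketch using the commonly described isotropic formula: take the texture-coordinate differences between adjacent pixels, scale them to texel units, and take log2 of the longer footprint axis. This is a textbook-style approximation and not a description of what any particular chip does; the function and parameter names are hypothetical.

```python
import math

def mip_level(dudx, dvdx, dudy, dvdy, tex_w, tex_h, num_levels):
    """Pick a mip level from screen-space texture coordinate derivatives.

    dudx, dvdx: change in normalized (u, v) between horizontally adjacent pixels.
    dudy, dvdy: change in normalized (u, v) between vertically adjacent pixels.
    """
    # Footprint of one pixel step in texel units along each screen axis.
    len_x = math.hypot(dudx * tex_w, dvdx * tex_h)
    len_y = math.hypot(dudy * tex_w, dvdy * tex_h)
    rho = max(len_x, len_y)
    if rho <= 1.0:
        return 0.0  # magnification: stay on the base level
    return min(math.log2(rho), float(num_levels - 1))

if __name__ == "__main__":
    # A 256x256 texture (9 mip levels) minified by 2x in both directions.
    print(mip_level(2 / 256, 0.0, 0.0, 2 / 256, 256, 256, 9))
```

Even this simplified version needs several multiplies, two square roots and a logarithm per lookup, which gives a sense of why the step is described as expensive.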
Luminescent
13-Jun-2003, 15:34
Would that mean that the fp32/texture unit (the fpu cannot function on shader and texture ops concurrently), present within the NV3x architecture, is most likely composed of much more than 4 fmads/inverse logic units (it can achieve 2 texture lookups per clock cycle)? In fragment program mode, NV3x does not compute texture lod with the common tex instruction, and requires txd (a texture lookup which references computed partial derivatives) for proper lod (?).
Wouldn't pipelining make the delays inconsequential? If you have a 4 stage tree, calculate each stage per clock. You have higher latency, but you're generally working with a long stream of data so it shouldn't matter.
But that's another tradeoff. You are right in that the likely way to do it would be to add more pipe stages to absorb the added propagation delays, but then the architecture has to be retuned to absorb the added latencies, which may add additional costs and/or impact performance.
If you are doing DOT3, RCP or other composite operations, I would expect the delay of a fused-multiply-add + a standard add + a little bit more, giving ~8-11 ns.
For DOT3 I'd be surprised if it took that much.
FP add contains the following ops:
1. Compare the exponents and determine the amount of mantissa shift.
2. Shift the mantissa of one of the numbers.
3. Add the two mantissas together.
4. Search the highest bit in the result
5. Shift the mantissa for renormalization.
DOT3 requires a 3 parameter add, instead of the 2 parameter add used in MAD operations.
Stages 2, 4 and 5 should take exactly the same time; only stages 1 and 3 take more time, but I'm not sure it matters as much as (a standard add + a little bit more)
OTOH, I don't know how RCP is implemented, how much of it is a table lookup, and how much more work is done.
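The five FP add steps listed above can be walked through in software. The sketch below is a toy model: values are (exponent, signed significand) pairs, extra bits are truncated rather than rounded, and there is no handling of infinities, NaNs or denormals. The 16-bit mantissa default is the assumed FP24 width, and all names are made up for illustration.

```python
import math

MANT_BITS = 16  # stored mantissa bits (assumed FP24 width)

def from_float(v, mant_bits=MANT_BITS):
    """Encode v as (exponent, signed significand), truncating extra bits."""
    if v == 0.0:
        return (0, 0)
    m, e = math.frexp(abs(v))            # abs(v) = m * 2**e, 0.5 <= m < 1
    sig = int(m * 2 ** (mant_bits + 1))  # 2**mant_bits <= sig < 2**(mant_bits + 1)
    return (e - 1, sig if v > 0 else -sig)

def to_float(x, mant_bits=MANT_BITS):
    exp, sig = x
    return math.ldexp(float(sig), exp - mant_bits)

def fp_add(a, b, mant_bits=MANT_BITS):
    (ea, sa), (eb, sb) = a, b
    if sa == 0:
        return b
    if sb == 0:
        return a
    # 1. Compare the exponents and determine the amount of mantissa shift.
    if ea < eb:
        (ea, sa), (eb, sb) = (eb, sb), (ea, sa)
    shift = ea - eb
    # 2. Shift the mantissa of the smaller-exponent operand (truncating).
    mag_b = abs(sb) >> shift
    aligned_b = mag_b if sb > 0 else -mag_b
    # 3. Add the two mantissas together.
    total = sa + aligned_b
    if total == 0:
        return (0, 0)
    # 4. Search for the highest set bit in the result.
    top = abs(total).bit_length() - 1
    # 5. Shift the mantissa for renormalization and adjust the exponent.
    adjust = top - mant_bits
    mag = abs(total)
    mag = mag >> adjust if adjust >= 0 else mag << -adjust
    return (ea + adjust, mag if total > 0 else -mag)

if __name__ == "__main__":
    print(to_float(fp_add(from_float(1.5), from_float(2.25))))
    print(to_float(fp_add(from_float(1.0), from_float(-0.75))))
```

The second call subtracts nearly equal magnitudes, which is the case where steps 4 and 5 have real work to do; adding operands of very different magnitude instead makes step 2 the large shift.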
Not 100% sure how fast an FP32 unit is in the various processes, but can do some guesswork
I doubt if that's really too comparable - there are so many uncertainties involved. One major point omitted is that the x86 unit on the Athlon is FP80, not FP32, for a 4-clock latency.
arjan de lumens
13-Jun-2003, 18:35
If you are doing DOT3, RCP or other composite operations, I would expect the delay of a fused-multiply-add + a standard add + a little bit more, giving ~8-11 ns.
For DOT3 I'd be surprised if it took that much.
FP add contains the following ops:
1. Compare the exponents and determine the amount of mantissa shift.
2. Shift the mantissa of one of the numbers.
3. Add the two mantissas together.
4. Search the highest bit in the result
5. Shift the mantissa for renormalization.
True enough, but you can do 2-input FP adds faster than that sequence of operation indicates. Note that if the two input numbers are about the same magnitude, then step 2 can be reduced to a 1-bit shift. Otherwise, if the two input numbers are not of similar magnitude, then the renormalization in step 5 can be reduced to a 1-bit shift. Split the FP adder in two paths - one for each of the two cases - and you will end up with a substantially faster 2-input FP adder. Most CPU makers do this for extra speed these days (dunno about GPU makers; this way of designing an FP adder can be expensive in terms of transistor count)
DOT3 requires a 3 parameter add, instead of the 2 parameter add used in MAD operations.
Stages 2, 4 and 5 should take exactly the same time; only stages 1 and 3 take more time, but I'm not sure it matters as much as (a standard add + a little bit more)
For DOT3, you can overlap stage 1 with the multiplication part of the operation. But:
with a 3-input add, you can no longer use the adder trick I described above, and you cannot determine the sign of the mantissa until after you have done the addition step, so you get a potentially expensive negation step as well, so the time needed to perform the addition goes up by 60-80% over a 2-input add. A 4-input add (for DOT4) is, however, only slightly more expensive than a 3-input add.
As for the Athlon: it does both FP32 (for 3dnow) and FP80 (for x87) with the same number of cycles (4), although IIRC the units are separate from each other.
It's worth looking up the carry-save adder technique. I found this link:
http://www.geoffknagge.com/fyp/carrysave.shtml
CSA allows you to add 3 values together into 2 values, but critically has no carry propagation, and so the propagation time of the adder is just one or two gates.
Chaining these you can trivially convert the addition of 4, 5, 6, etc. numbers into a single carry-propagate adder - so a dot4 operation just needs a couple of extra propagation delays compared to a MAD - certainly nothing like huge gate propagation for a conventional CPA.
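The 3-to-2 reduction described above is short enough to show directly. This is a sketch on plain non-negative Python integers, so it models the integer/fixed-point case only; it says nothing about the alignment and normalization a floating-point dot product needs around it.

```python
def carry_save_add(a, b, c):
    """Reduce three addends to two without propagating any carries.

    Each bit position is handled independently: the XOR gives the sum bit
    and the majority function gives the carry into the next position.
    """
    partial_sum = a ^ b ^ c
    carries = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, carries

def add_many(values):
    """Sum non-negative integers with CSAs plus one final carry-propagate add.

    Returns (total, number of CSA steps used).
    """
    vals = list(values)
    csa_steps = 0
    while len(vals) > 2:
        a, b, c = vals.pop(), vals.pop(), vals.pop()
        vals.extend(carry_save_add(a, b, c))
        csa_steps += 1
    return sum(vals), csa_steps  # sum() stands in for the single real adder

if __name__ == "__main__":
    print(add_many([3, 5, 7, 11]))
```

Four addends take two CSA steps before the one carry-propagate add. The loop here chains the CSAs one after another for simplicity; hardware would arrange them as a tree to keep the depth down when there are many addends.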
arjan de lumens
13-Jun-2003, 19:46
I am aware of how a carry-save adder works - the fastest available multiplier designs (wallace-trees, 4-2 trees) are just trees of carry-save adders. For an integer/fixedpoint DOT4, you can use a tree of carry-save adders to get an operation latency of roughly 1 multiply + 2 CSAs. For floating-point dot3/4, the problem is that you need normalization circuits before and after the addition when adding 3 or more numbers (which can be partially avoided if you add just 2 numbers).
It wasn't aimed at you :) you sounded far more informed than I was anyway, and confirmed it!
DemoCoder
13-Jun-2003, 22:06
Constant speed is O(1), not O(n). With both adders, multipliers, and barrel shifters (the basic circuits from which FP units are made), since the MSB of the result potentially depends on all the bits of the inputs, you can't get lower than O(log n) gate delays (and O(n) interconnect delay) no matter how you design the circuit, which is hardly constant.
I didn't claim O(n) was constant; I said addition was fundamentally O(n) if you take your model of computation to be the typical Turing or decision tree model. It's constant only if you ignore delays. Once you move to a parallel model of computation, you must take communication delays into account.
However, when we are dealing with small, fixed inputs, where the differences between 16, 24, and 32 aren't large, I don't find O() analysis that informative. After all, merge sort may be O(n log n), but insertion sort is going to be quicker if you're sorting 8 numbers.
arjan de lumens
13-Jun-2003, 23:24
Umm, if neither size nor delay is constant, then what is constant about n O(n) adders? It's not like you need as many as n adders to maintain constant throughput - there are several adder designs with O(n) size and O(log n) delays, so you need only log(n) adders. (These adders are only barely larger than ripple-carry adders and are twice as fast already at ~8-12 bits operand size).
OT
CISC has *zero* code size advantage over RISC. Compare ARM thumb or MIPS16 to x86 and you'll find the latter losing
Well sure, but condensed ISAs are hardly what one means when one says RISC. Of course Thumb and MIPS16 deserve the term "RISC" ISAs, because they are variations on "classic RISC" ISAs (and SuperH because it is so similar to Thumb and MIPS16), and incorporate many of the design insights of the RISC revolution. But they pointedly fail to have many of the features shared by all general-purpose RISC ISAs: fixed-length instructions (specifically fixed at 32 bits); 32 GPRs; and three operands on all arithmetic instructions.
ARM Thumb (in 16-bit mode), for example, only provides 8 GPRs, only 8-bits of offset on a conditional branch :!: and 11 on a jump, and so on. It's a nice ISA for many embedded applications, but it is completely unusable for general-purpose computing. And if it wasn't clear in my post, I was talking about general-purpose computing, where 8-bit immediate fields don't cut it. If you want to play this game, I can come up with an 8-bit microcontroller that blows Thumb or any other narrow-RISC ISA out of the water when it comes to code density. But it's pretty obviously irrelevant to whether CISC has a code density advantage over RISC in general-purpose use.
The average instruction size of the new x86-64 is 5 bytes per instruction
If you're referring to this Paul DeMone post (http://realworldtech.com/forums/index.cfm?action=detail&PostNum=1410&Thread=1&entryID=17866&roomID=11), then you follow his postings even more closely than I do. :) But he later makes it quite clear (http://realworldtech.com/forums/index.cfm?action=detail&PostNum=1410&Thread=3&entryID=18017&roomID=11) that this almost 5 bytes/instruction figure is anomalous even for the brand-new x86-64 under GCC, and certainly for normal-case x86 under a real compiler. I'm not going to take the time to come up with a real figure, but it is obviously significantly less than 5 bytes.
-yes you can have a memory operand in there, but at the same time you only have a 2-address instruction format, -and fewer registers, so you'll end up with more instructions shuffling data around than in a typical RISC.
The bottom line is that x86 has a significant code size advantage over a traditional RISC in general-purpose code, roughly 20% in the case of the SPEC suite. A quick search found me this paper on dictionary compression of RISC ISAs (http://www.eecs.umich.edu/~tnm/compress/publications/micro30.compress.pdf). (Interestingly, IBM does a similar thing for some embedded RISC MPUs, rather than moving to a hybrid 16/32-bit ISA ala Thumb or MIPS16.) Check out page 9: uncompressed x86 code size averages 18% smaller than ARM and 29% :!: smaller than PowerPC for a large sample of SPEC95 subtests. Admittedly the figures aren't perfect, as AFAICT they represent (unlinked) binary size, rather than runtime code path size, which is what really matters. But they give a general idea.
In any case, the fact that x86 has a smaller runtime code size than all general-purpose RISCs is well established.
Also decoding ia32 into uOps does not take negligible resources. Decoders are either big and power hungry (Athlon) or less power hungry but even bigger (P4; trace cache).
I didn't say "negligible", I said "increasingly negligible silicon cost". This is just a simple consequence of Moore's Law. Of course, as more resources are available, more will be given to the task of decoding. It is certainly fair to charge the extra footprint of the expanded instructions in the trace cache to the decoding cost, but it's worth noting that a trace cache is a worthwhile feature in and of itself; the idea was developed in academia for reasons having nothing to do with taking a CISC-"RISC" decoder out of the critical path.
Obviously the x86 tax is still too great to pay when backwards compatibility isn't worth anything, and power/heat are important issues, as is the case with most embedded systems. But that doesn't mean x86 can't do the high-end of low-power reasonably well. Pentium M offers pretty remarkable performance/power-consumption considering how high the performance really is. Yeah, it would be even better if it were a RISC with the same design resources poured into it; but in the meantime it sure wipes the floor with the G3/G4 in both performance and battery life.
A 21264 core is half the die size of the P4 in a similar process and yet has higher performance.
Er, no. Let's compare the last similar process that both chips have topped out: .18um bulk Al. EV68 (833 MHz) tops out at 518/643 SPECint/fp base, while Willamette (2 GHz) hit 681/735. So nice try, Alpha, but no cigar. And I'm sure the Alpha's 8 MB L2 had nothing to do with anything, as it's always perfectly fair to compare a $500 chip to a $20,000 one.
Moreover, the 21264B had a die size of 153mm^2 (http://www.microprocessor.sscc.ru/alpha-21264/), hardly "half the die size" of the 217mm^2 Willamette, and that's disregarding the fact that 21264B has about half the on-die cache of Willamette.
Ok, so you were probably talking about the 21264C, which is in .18um Cu. (SOI? I forget.) Not quite "a similar process", but we'll let that slide. 21264C's die size is 125mm^2, so you're a bit closer there, although again considering the missing 128Kb of on-die cache (or, alternatively, the 8MB off-die cache) I'm disinclined to give the benefit of the doubt. 21264C is shipping at 1250 MHz--50% faster than the 833 MHz 21264B--so obviously it could turn in higher SPEC scores than the 2 GHz Willy, if HPaq would only submit them.
While we're on the subject, the .18um Cu/SOI Power4 (IIRC only <= 1.3 GHz; the faster ones are .13um) scores higher as well, although the fricking 128MB L3 can't hurt.
But that doesn't contradict what I said. This is the most important point, so I'll make it quite clear: I never said CISC could now beat RISC in process-normalized performance; I said it has become "increasingly competitive". I think our "disagreement" arises mainly from the fact that you don't realize just how much of an advantage RISC provided over CISC around 15 years ago. There was a famous study carried out at DEC where they pitted their own VAX 8700 against the MIPS M2000; the chips were chosen because they were built on an extremely similar process. Just as with your EV6 vs. P4 example, the RISC chip had about half the core complexity of its rival. The difference is, instead of being around .8x as fast, it was 2.66x as fast. Here's a nice Powerpoint presentation (http://www.cs.virginia.edu/~skadron/cs654/cs654_01/slides/hua.ppt) discussing the results (download OpenOffice if you don't have Powerpoint), although you can find gazillions of less in-depth mentions of it as it is featured in Hennessy and Patterson and thus in the curriculum of every college MPU architecture course in the nation.
Now, let's dwell on this a second. Obviously the reason the P4 is doing so well is not because of x86 but in spite of it. Clearly the Alpha was hampered by fewer development resources, an older design optimized for older chip geometries; process technology that, while pretty decent (IBM), was not tailored to the MPU as Intel's is; not quite as good a compiler as Intel's, and so on. This is all a function of the huge installed base of x86 and the money their captivity buys. Fine.
Problem is, Alpha was still a ton better off than any of the other RISC architectures. Alpha at least had a design team with the talent (if not the resources or company backing) to challenge Intel's; indeed, the Alpha core has the advantage of being more hand-tweaked than even Intel's designs. None of the RISC vendors owns their own fab (except IBM, but they don't target their fab to their own chips, as the fab is run as a completely separate entity), and many are worse off on this front (Sun uses TI for example). In compilers, too, Alpha was the only group that could even compete with Intel. And so on. Finally, Alpha was particularly strong in SPEC (and particularly weaker in TPC), so the comparison is made on reasonably favorable terms for Alpha. I mean, think about comparing the 2 GHz Willamette to the best .18um process chips from Sun (USII I believe), HP, or SGI in single-threaded SPEC. P4 at .18um will probably beat what PA-RISC or MIPS achieve at .13um on SPEC at least. (If SGI never bumps the R16000 past 700 MHz, it won't even be remotely close.)
Ok, so obviously even Alpha and Power4 are in many ways victims of an unfair comparison with P4. (OTOH, they do have those huge off-die caches, which SPEC loves, and the benefit of IBM's more advanced process technology, even if that only brings them even with Intel's bulk Al .18um.) Obviously if Intel were to put the same amount of resources they dedicated to P4 into a RISC chip it would be faster. Probably a lot faster. Maybe as much as 30-40% faster at similar cost and process.
Thing is, that doesn't begin to compare to 166% faster, which is what the MIPS M2000 did to the VAX 8700 in the late 80s. And, while I don't have process-normalized information, this sort of dominance, or even greater, continued throughout the early to mid-90s (i.e. RISCs compared to 486 and then Pentium). It was only with the PPro that x86 could be mentioned in the same breath as RISC chips in SPECint performance (but not SPECfp); and with P4 that x86 took a constant place at or near the top of the SPEC standings. (One that it will by all indications lose for good to Itanium when Madison launches in the coming weeks.)
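For anyone who wants to check the arithmetic, here is a quick sketch using only the SPEC base scores and the speedup figure quoted in this thread. The variable and function names are made up for illustration; nothing here is an official SPEC calculation.

```python
# SPEC base scores as quoted earlier in the thread: (SPECint, SPECfp).
ev68_833mhz = (518, 643)       # .18um bulk Al Alpha
willamette_2ghz = (681, 735)   # .18um bulk Al P4

# How fast the Alpha is relative to the P4 on each suite.
int_ratio = ev68_833mhz[0] / willamette_2ghz[0]
fp_ratio = ev68_833mhz[1] / willamette_2ghz[1]


def percent_faster(speedup: float) -> float:
    """Convert an 'N times as fast' figure into 'percent faster'."""
    return (speedup - 1.0) * 100.0


# The VAX 8700 vs. MIPS M2000 figure: 2.66x as fast.
mips_vs_vax = percent_faster(2.66)

print(f"EV68 / Willamette, SPECint: {int_ratio:.2f}x")
print(f"EV68 / Willamette, SPECfp:  {fp_ratio:.2f}x")
print(f"2.66x as fast = {mips_vs_vax:.0f}% faster")
```

The two ratios bracket the "around .8x" figure used above, and 2.66x as fast is the same claim as 166% faster.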
There are lots of reasons for this, among them the fact that serious development of big-iron RISC chips stalled out of the fear of Itanium (except at IBM and Sun, with the latter being too behind to matter much). But by far the biggest reason is that there has been a huge secular increase in the competitiveness of CISC architectures compared to RISC in the last decade. And this is due to Moore's Law, first offering the mere possibility of a CISC->"RISC" translating design, and then making it ever-cheaper in relative silicon cost.
The success of x86 is solely due to economy of scale, which has allowed the companies behind the MPUs to pour $$$$ into process and uarch developments while still maintaining a price/performance edge.
Quite true, inasmuch as x86 was still extremely successful back in the days when it wasn't anywhere near performance-competitive with RISC MPUs. If you mean "success" as "marketplace success".
If you mean technological success, you're entirely wrong. All the engineering talent and R&D money in the world couldn't give a CISC the cost/performance of a RISC 10-15 years ago; now it's made P4 competitive on an absolute performance basis, to say nothing of manufacturing cost.
Finally: The compiler advancements that will benefit EPIC (VLIW) will also benefit every single other architecture out there.
EPIC is infinitely more dependent on good compilers for high performance than CISC or RISC, and particularly out-of-order implementations of CISC or RISC. Moreover, the other general-purpose architectures don't have features like full predication, branch hints (with poison bits to preserve correctness), or memory reference speculation. Plus their smaller visible register set limits how aggressive the compiler can be in terms of software pipelining or trace scheduling.
The only thing EPIC has going for it is the large register file
Totally wrong. For one thing, simply giving a classic RISC 128 GPRs without significantly changing the rest of the ISA would barely improve performance at all. (After all, OoO RISCs get most of the benefit of a large visible register set by having a similarly large renaming register set.) For another, among all the other bits I mentioned above, you're somehow forgetting the little bit about the explicit parallelism...
and with SMT becoming ever more popular even that is looking likely to be a liability (big ass context-> fewer contexts juggled at the same time->lower throughput).
So because SMT is now "popular", IPF is going to have to use it?? :lol:
Apparently you're not as big a Paul DeMone fan as I thought. There are other forms of multithreading, you know. (Or maybe not.)
Look, the main challenge facing MPU architects is extracting enough parallelism to keep busy the increased number of functional units Moore's Law affords them.
For a while, ILP found via OoO was enough to keep things going. Unfortunately, that method has pretty much played itself out: increasing the reordering window size is one of the most important ways of extracting more ILP, but the silicon required increases quadratically.
So now we have two new approaches. The first is extracting thread-level parallelism via SMT. Unfortunately, we won't get to see what would have been the best early exemplar of this, EV8. It's certainly a viable approach, although it obviously relies on having multiple threads competing for CPU-time.
The second is to extract ILP at compile time. There are obvious disadvantages, but the amount of ILP left unclaimed by current methods is enormous, and while not all is knowable at compile-time, we can do a lot better than what can be practically extracted dynamically.
Of course the proof is in the pudding, and after a pretty awful debut (but then again, most 1st-generation processors never make it out of the lab for the world to see how bad they really are), Itanium has become quite impressive performance-wise. With Madison that will turn to "quite dominant performance wise"; now that EV8 is no longer, nothing is going to challenge Madison's SPEC numbers in .13um.
Going back to the discussion earlier: does this mean Intel couldn't have put up similar numbers if they'd poured the same resources into a RISC design? On .13um, probably they could have. (On .18um, definitely.) A couple process generations from now, it's looking more dubious. While it's certainly not a perfect test, IBM's Power roadmap looks ambitious enough that we should get to see how a serious RISC competitor stacks up to EPIC.
(Although ironically, Power4 does a crude form of on-chip RISC->"semi-VLIW" encoding, so that it can reap some of the control benefits of a bundled instruction ISA. Remind you of anything?)
P.S. - As you said, this is quite OT; I'll take further responses to PM if any are necessary.
FP32 is not inherently slower than FP24, FP16, or FPwhatever. It just takes more silicon to support it natively.
Are you sure that it's possible to keep the calculation time constant just by throwing even more silicon at the problem?
I'm not convinced - surely it must still be slower.
Others have taken this discussion farther than I will (or could) here, but...
Sure, FP32 arithmetic requires either more levels of logic, or more loops through the logic you've got, than does FP24; simplified a bit, it's a matter of doing a 24-bit integer multiply and several 24-bit adds instead of doing 17-bit multiplies and adds (plus assorted shifts in both cases). I didn't mention it because GPUs are ridiculously pipelined anyways, and their FMADs should be no different. In case we've forgotten, FMADs are generally done on vec4 inputs, so it's not like previously we were getting our results in a single cycle or anything.
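To make the 24-bit vs. 17-bit figures concrete, here is a rough sketch. It assumes the commonly described FP24 layout of 1 sign bit, 7 exponent bits and 16 stored mantissa bits (an assumption, not something stated in this thread), and it only emulates the narrower mantissa by rounding an FP32 value; exponent range, denormals, infinities and NaNs are deliberately ignored. The helper names are hypothetical.

```python
import struct


def significand_bits(stored_mantissa_bits: int) -> int:
    """Width of the integer multiply: stored bits plus the implicit leading 1."""
    return stored_mantissa_bits + 1


def f32_to_bits(x: float) -> int:
    """Round a Python float to FP32 and return its raw 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(b: int) -> float:
    """Interpret a 32-bit pattern as an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]


def quantize_mantissa(x: float, stored_bits: int) -> float:
    """Keep only `stored_bits` of FP32's 23 stored mantissa bits.

    Uses round-half-to-even on the dropped bits. Only meant for finite,
    normal values; the exponent field is left at FP32's width.
    """
    b = f32_to_bits(x)
    drop = 23 - stored_bits
    if drop <= 0:
        return bits_to_f32(b)
    half = 1 << (drop - 1)
    lsb = (b >> drop) & 1          # lowest bit that survives, for tie-breaking
    b = ((b + half - 1 + lsb) >> drop) << drop
    return bits_to_f32(b)


fp32_width = significand_bits(23)   # FP32: 23 stored mantissa bits
fp24_width = significand_bits(16)   # assumed FP24 layout: 16 stored bits

third_fp32 = bits_to_f32(f32_to_bits(1.0 / 3.0))
third_fp24ish = quantize_mantissa(1.0 / 3.0, 16)

print(fp32_width, fp24_width)
print(third_fp32, third_fp24ish)
```

The narrower multiplier is where the silicon saving comes from; the price is the coarser rounding visible in the last line of output.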
(EDIT: Ok, obviously the multiplies are all independent, so the fact that you're doing 3 FMADs and a multiply "in sequence" really only leaves you with an extra FP add if you dedicate the requisite hardware (4 independent multiplies, then add the results in pairs, then add those results). So, maybe it is plausible to do a vec4 FP24 FMAD in one cycle at ~500 MHz. Heck, maybe it's plausible with FP32. I don't know if I'm even in the ballpark. Still, there's no good reason not to have it all pipelined if need be; so much of a GPU is about latency hiding I can't see why it would be a problem here.)
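A small sketch of the two orderings described in that EDIT: a sequential chain of multiply-adds versus four independent multiplies summed in pairs. This is only a software model that rounds every intermediate to FP32; it is not a claim about how any particular GPU wires its dp4, and the function names are invented for the example. It also shows why the order of operations matters for the exact result.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (a double) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dp4_chain(a, b):
    """One multiply, then three multiply-adds in sequence (4 dependent steps)."""
    acc = f32(a[0] * b[0])
    for i in range(1, 4):
        acc = f32(acc + f32(a[i] * b[i]))
    return acc


def dp4_tree(a, b):
    """Four independent multiplies, then pairwise adds (1 multiply + 2 add levels)."""
    p = [f32(x * y) for x, y in zip(a, b)]
    return f32(f32(p[0] + p[1]) + f32(p[2] + p[3]))


# Inputs chosen so that the small terms get absorbed one at a time in the
# chain but survive when they are first added to each other in the tree.
a = [16777216.0, 1.0, 1.0, 1.0]   # 2**24, the point where FP32 spacing becomes 2
b = [1.0, 1.0, 1.0, 1.0]

chain_result = dp4_chain(a, b)
tree_result = dp4_tree(a, b)

print(chain_result, tree_result)
```

The tree needs one multiply level and two add levels instead of four dependent steps, which is the latency saving the EDIT is talking about; the two forms need not agree bit for bit.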
So yes, moving to FP32 could increase latencies of arithmetic operations a bit. But I would be quite surprised if these latencies were not completely hidden from the point of view of a pixel shader program. You may need to increase the number of pixels you have in-flight down the shader pipeline, but, as I said, a matter of throwing silicon at the problem.
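As a back-of-the-envelope sketch of "throwing silicon at the problem": to keep a pipeline full, the number of pixels in flight has to cover latency times issue rate. The latencies and pipe count below are hypothetical numbers picked purely for illustration, not figures for any real part.

```python
def pixels_in_flight(latency_cycles: int, pixels_per_cycle: int) -> int:
    """Pixels that must be in flight to hide an arithmetic latency completely."""
    return latency_cycles * pixels_per_cycle


# Hypothetical figures, for illustration only.
fp24_latency = 4     # cycles for a vec4 FMAD at FP24 (assumed)
fp32_latency = 6     # cycles for the same op at FP32 (assumed)
pixel_pipes = 8      # pixels issued per cycle (assumed)

extra_pixels = (pixels_in_flight(fp32_latency, pixel_pipes)
                - pixels_in_flight(fp24_latency, pixel_pipes))

print(f"extra pixels needed in flight: {extra_pixels}")
```

Under these made-up numbers the cost is extra buffering for a handful of pixels, with throughput per cycle unchanged.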
In my understanding at least. If there is some reason why moving to FP32 would necessarily impact cycle time, or why a little extra arithmetic latency would negatively impact shader performance, I'd be interested to hear it.
21264C is shipping at 1250 MHz--50% faster than the 833 MHz 21264B--so obviously it could turn in higher SPEC scores than the 2 GHz Willy, if HPaq would only submit them.
21264C scores have been submitted: it scores 845/928 base/peak in SPECint and 1019/1365 in SPECfp.
Whoops: I was only looking through the scores submitted by Compaq! :oops: :oops: :oops:
Thanks for the heads-up.
P.S. - As you said, this is quite OT; I'll take further responses to PM if any are necessary.
Agree, I wrote a response here (http://www.beyond3d.com/forum/viewtopic.php?p=131860#131860)
Cheers
Gubbi