arjan de lumens said:
16 and 24-bit FP types are not IEEE standards, whereas 32 and 64-bit are. The exponent, mantissa and range of each type are:
- FP16 (as used in GFFX): 5-bit exponent, 11-bit mantissa, range 2^-24 to 2^16 (as found here; supports denormalized numbers; search for 'fp16' in the document)
- FP24 (as used in R300): 8-bit exponent, 16-bit mantissa, range 2^-126 to 2^128 (IIRC, it is essentially FP32 with the lowest 8 mantissa bits ripped off and no support for denormalized numbers. Cannot remember the source of this information, though)
- FP32 ('single precision' IEEE-754 standard, look here or google for 'IEEE 754'): 8-bit exponent, 24-bit mantissa, range 2^-149 to 2^128.
- FP64 ('double precision' IEEE-754 standard): 11-bit exponent, 53-bit mantissa, range 2^-1074 to 2^1024.
- FP80 ('extended precision' quasi-standard, supported by x86 and 68k series processors): 15-bit exponent, 64-bit mantissa, range 2^-16445 to 2^16384
- FP128 ('quadruple precision' IEEE-754 standard): 15-bit exponent, 113-bit mantissa, range 2^-16494 to 2^16384. Rarely used.
Except FP80, the most significant bit of the mantissa is not stored explicitly (as it is always 1). All formats also have an extra sign bit.
To clarify - the sign bit is not 'extra'; it is part of the 32 bits (otherwise the alignment issues with accessing floats would be horrific). The space for it comes from the implicit mantissa bit, which is not stored.
So in terms of storage -
32-bit is S23E8 - Sign bit, 23 mantissa bits, 8 exponent bits
64-bit is S52E11 - Sign bit, 52 mantissa bits, 11 exponent bits
I believe that nVidia's half format is S10E5
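For anyone who wants to check the storage layouts above, here is a rough Python sketch that pulls the raw sign, exponent and mantissa fields out of a value. The names `LAYOUTS` and `split_fields` are made up for illustration. The 'half' row uses Python's IEEE binary16 packing, which is S10E5; that nVidia's half format matches it bit for bit is an assumption here, not something confirmed in this thread.

```python
import struct

# kind -> (float pack format, same-width unsigned int format,
#          exponent bits, stored mantissa bits)
LAYOUTS = {
    "half":   ("<e", "<H", 5, 10),   # S10E5 (assumed to match nVidia's half)
    "single": ("<f", "<I", 8, 23),   # S23E8
    "double": ("<d", "<Q", 11, 52),  # S52E11
}

def split_fields(value, kind):
    """Return (sign, exponent_field, mantissa_field) as raw integers."""
    float_fmt, int_fmt, exp_bits, man_bits = LAYOUTS[kind]
    # Reinterpret the float's bytes as an unsigned integer of the same width.
    (bits,) = struct.unpack(int_fmt, struct.pack(float_fmt, value))
    mantissa = bits & ((1 << man_bits) - 1)                # low bits, implicit 1 not stored
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)  # biased exponent
    sign = bits >> (man_bits + exp_bits)                   # top bit
    return sign, exponent, mantissa

if __name__ == "__main__":
    for kind in LAYOUTS:
        print(kind, split_fields(-1.5, kind))
```

In each case sign + mantissa + exponent bits add up to the full width (16, 32 or 64), so the sign bit sits inside the word rather than on top of it. The exponent field comes back biased (127 for 2^0 in single precision), and the leading 1 of the mantissa never shows up in the stored bits.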
[edit] I misunderstood the wording in the initial response [/edit]