Multiplying two doubles, each with 52-bit mantissas (implicitly 53-bit numbers) results in at most 106 bits of result, which would become a 105-bit encoding if IEEE went that far, so bits have to be chopped off the end of the process (I'm not saying the hardware actually computes all 106 bits, I don't know what it does). If you wanted the result of this multiply you'd only use the top 53 bits of that 105-bit result.

"adds the lower 64 bits of the result" implies rounding to me.
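For what it's worth, here's a tiny C sketch (my own example values, nothing from the docs or the patent) that shows the chopped-off low half of a double*double product being recovered with C99's fma():

#include <math.h>
#include <stdio.h>

/* build: cc -std=c99 prod_bits.c -lm */
int main(void)
{
    /* The exact product of these two needs about 60 mantissa bits,
       so the stored double product keeps the top 53 and drops the rest. */
    double a = 1.0 + ldexp(1.0, -30);
    double b = 1.0 + ldexp(1.0, -29);

    double hi = a * b;            /* the product rounded to 53 bits        */
    double lo = fma(a, b, -hi);   /* exact a*b - hi, i.e. the dropped bits */

    printf("hi = %.17g\n", hi);   /* 1.0000000027939677...                 */
    printf("lo = %.17g\n", lo);   /* 2^-59, about 1.73e-18                 */
    return 0;
}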
http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf

Where'd you find that, by the way? It's a nightmare trying to find info on AMD's stuff.
No GPU can do double-precision transcendentals intrinsically - they effectively use macros. In the ATI case currently you have to write your own. Only, what, 19 months since the first double-precision GPU arrived.

I know at one point AMD couldn't do certain transcendental functions with DP, but I believe that was a software limitation. I don't know if that's still true or not (or if it really was a software limitation).
No GPU can do double-precision transcendentals intrinsically - they effectively use macros. In the ATI case currently you have to write your own. Only, what, 19 months since the first double-precision GPU arrived.
NVidia, pretty reasonably, appears to put high value on precision (if you're gonna bother with double, then you prolly care), with lots of double-precision functions (e.g. log or sin) being 1 or 2 ULP. Best of luck with your ATI versions...
OpenCL only optionally supports double precision - but the specification is quite thorough about what's required when it's enabled, though fused multiply-add is optional. The required precisions are quite loose, generally.
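As an aside, a rough host-side sketch (first platform, first GPU device, no error checking) of testing for cl_khr_fp64 before relying on doubles; a kernel that uses them then needs #pragma OPENCL EXTENSION cl_khr_fp64 : enable.

#include <stdio.h>
#include <string.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    char extensions[8192];

    /* First platform, first GPU device - no error checking here. */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* CL_DEVICE_EXTENSIONS is a space-separated list of extension names. */
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof extensions, extensions, NULL);

    printf("cl_khr_fp64: %s\n",
           strstr(extensions, "cl_khr_fp64") ? "supported" : "not reported");
    return 0;
}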
Jawed
Multiplying two doubles, each with 52-bit mantissas (implicitly 53-bit numbers) results in at most 106 bits of result, which would become a 105-bit encoding if IEEE went that far, so bits have to be chopped off the end of the process (I'm not saying the hardware actually computes all 106 bits, I don't know what it does). If you wanted the result of this multiply you'd only use the top 53 bits of that 105-bit result.
patent said:
In mantissa path 516 (FIG. 8), multiplier 802 computes the 106-bit product Am*Bm and provides the product to 168-bit adder 804
Jawed said:
If you wanted the result of just the multiply, then you'd take the top 53-bits, encode as sign+52-bit+exponent, then be done. So clearly doing anything with more than 53-bits (i.e. 64-bits) implies more precision is being used.

Why would you only need the top 53-bits?
That's if you want to do full-speed denormals it seems, where all 106+52=168 bits are required.

It's quite possible for the thrown-away bits to affect the result of the addition. The relevant patent explicitly indicates that the full 106-bit product flows down to the adder, no truncation involved.
If you wanted the result of just the multiply, then you'd take the top 53-bits, encode as sign+52-bit+exponent, then be done. So clearly doing anything with more than 53-bits (i.e. 64-bits) implies more precision is being used.
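For reference, the sign + exponent + 52-bit encoding being talked about is just standard binary64; a throwaway C snippet (illustration only) pulling the fields apart:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    double d = -1.5;
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);                /* view the 64-bit encoding */

    uint64_t sign     = bits >> 63;                /*  1 bit                   */
    uint64_t exponent = (bits >> 52) & 0x7FF;      /* 11 bits, bias 1023       */
    uint64_t mantissa = bits & ((1ULL << 52) - 1); /* 52 stored bits, leading 1 implicit */

    printf("sign = %llu, exponent = %llu, mantissa = 0x%013llx\n",
           (unsigned long long)sign,
           (unsigned long long)exponent,
           (unsigned long long)mantissa);
    /* -1.5 -> sign = 1, exponent = 1023 (i.e. 2^0), mantissa = 0x8000000000000 */
    return 0;
}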
Minor point, but CUDA is only proprietary in that nVidia developed it and is the only IHV to support it. There isn't, in principle, anything preventing any other IHV from producing their own compiler for CUDA.

As much as I think proprietary languages are total bullshit,
I think if you look at the CUDA spec, it's clear that NV anticipates making cards with no DP support. Those would probably be consumer-oriented mainstream cards where you don't want to spend die space on unused features.

...or maybe cards where you want to disable DP to force CUDA users to buy more expensive parts for production use.
I read that 64-bit as being the full shebang - mantissa + exponent. So I'm not sure there's any extra precision there.

In theory, for a full-precision result, the mantissas are adjusted in their alignment:
                                                                                                         1
               1         2         3         4         5         6         7         8         9         0
      1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456
A*B = 10101101010111010101011010101010101010101010101010101010111001101011011001101011111010101 e27
C   = 11100011100101111001110001011011110100101110 e-33

Aligned to the common exponent (C's mantissa shifted right by 27 - (-33) = 60 bit positions):

                                                                                                   1
         1         2         3         4         5         6         7         8         9         0
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456
10101101010111010101011010101010101010101010101010101010111001101011011001101011111010101 e27
                                                            11100011100101111001110001011011110100101110 e27
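To see why the low bits of the product matter once the addend overlaps them, here's a tiny C example (numbers picked purely for illustration): fma() adds the full-width product and rounds once, while a*b + c rounds the product first, and the two answers differ.

#include <math.h>
#include <stdio.h>

/* build: cc -std=c99 -ffp-contract=off fma_vs_mul_add.c -lm
   (-ffp-contract=off stops the compiler fusing a*b + c itself) */
int main(void)
{
    /* a*b is exactly 1 - 2^-54, which needs 54 mantissa bits. */
    double a = 1.0 + ldexp(1.0, -27);
    double b = 1.0 - ldexp(1.0, -27);
    double c = -1.0;

    double separate = a * b + c;    /* product rounded to 1.0 first, then add */
    double fused    = fma(a, b, c); /* full product added, single rounding    */

    printf("a*b + c      = %.17g\n", separate); /* 0                    */
    printf("fma(a, b, c) = %.17g\n", fused);    /* -2^-54 ~= -5.55e-17  */
    return 0;
}

Which is essentially what feeding the full 106-bit product into the adder buys you.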
That was the plan for GT200 initially, but they changed their mind and left DP enabled on all cards.
I wonder about GT21x: are they similar to G92, or GT200? If the latter, they would be "CUDA level 1.2" GPUs. (1.0 is G80, 1.1 is G84/86 and G9x, 1.3 is GT200.)
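If you'd rather not memorise the list, the CUDA runtime reports the compute capability directly, and doubles need 1.3 or higher; a minimal sketch (no error handling):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int i = 0; i < count; ++i) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);

        /* Double precision first appeared at compute capability 1.3 (GT200). */
        int has_dp = (prop.major > 1) || (prop.major == 1 && prop.minor >= 3);
        printf("device %d: %s, compute capability %d.%d, double precision: %s\n",
               i, prop.name, prop.major, prop.minor, has_dp ? "yes" : "no");
    }
    return 0;
}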
Minor point, but CUDA is only proprietary in that nVidia developed it and is the only IHV to support it. There isn't, in principle, anything preventing any other IHV from producing their own compiler for CUDA.
Of course, CUDA is developed with nVidia's architectures in mind, so there may be performance-oriented disincentives for other IHVs to produce their own CUDA compilers.
Not exactly the same situation. The language level is far, far above the machine level.

If you think CUDA is anything other than a proprietary language, you have a serious case of rectal cranial inversion.
Someone could in theory produce a CPU that is binary compatible with IBM mainframes using BT (binary translation) or other tricks. Nonetheless, IBM mainframes are still quite proprietary and expensive as hell.
DK
I wonder about GT21x: are they similar to G92, or GT200?

AFAIK, all GT21x support DP in the same way GT200/b does.
If there's no such thing as 'GT300', I'm glad I still think of it as NV60, because that doesn't get confusing:
...
G98 / GeForce 9800 = "NV50" or "NV51"
...
http://www.eetimes.com/news/latest/showArticle.jhtml;?articleID=218900011
So, GT300 should be around 5x the performance of G80 (3 years).
Jawed