Double precision comparison
codedivine
04-Mar-2009, 07:31
Can someone comment on the comparative ability of the GTX 280 vs RV770 for double-precision floating point (DPFP)?
Correct me if I am wrong:
1. Radeon 4870 has a higher theoretical peak for DPFP than GTX 285. 240 GFLOPS vs about 90?
2. Nvidia's implementation is however more complete. AMD doesn't have a DP sqrt, cos etc. Nvidia has implemented most common functions.
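For reference, the peak figures in point 1 are just units x flops per clock x clock. A rough C sketch of that arithmetic follows. The unit counts and clocks are not from this thread; they are the commonly quoted specs (RV770: 160 DP-capable units at 750 MHz; GT200: 30 DP units at a shader clock of 1296 MHz on the GTX 280 and 1476 MHz on the GTX 285), so treat them as assumptions.

```c
/* Theoretical peak double-precision throughput in GFLOPS:
 * number of DP units x flops each unit retires per clock x clock in GHz. */
double peak_gflops(int dp_units, int flops_per_unit_per_clock, double clock_ghz)
{
    return dp_units * flops_per_unit_per_clock * clock_ghz;
}

/* Assumed specs (not taken from this thread). A multiply-add counts as 2 flops. */
double peak_hd4870(void) { return peak_gflops(160, 2, 0.750); }
double peak_gtx280(void) { return peak_gflops(30, 2, 1.296); }
double peak_gtx285(void) { return peak_gflops(30, 2, 1.476); }
```

Under those assumptions this gives 240 for the 4870, roughly 78 for the GTX 280 and roughly 89 for the GTX 285, which agrees with the "240 vs about 90" above.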
vasionok
29-Apr-2009, 12:50
NVIDIA doesn't have DP sqrt and cos in hardware either. However, you can still use them in CUDA, in which case they are emulated in software.
codedivine
29-Apr-2009, 16:29
Ok thanks. Does Nvidia have fp64 divide in hw?
The only intrinsic double-precision math functions in NVidia are add, mul and fma, according to appendix C of the CUDA Programming Guide 2.2.
Jawed
vasionok
29-Apr-2009, 17:23
Apparently, they added only a simple DP multiply-and-add pipeline. Nothing like special function units in double precision.
The only intrinsic double-precision math functions in NVidia are add, mul and fma, according to appendix C of the CUDA Programming Guide 2.2.
Jawed
These intrinsic functions are just for different rounding modes (other than round-to-nearest-even). Since NVIDIA claims that fdiv in double precision is accurate, which is very difficult to simulate with only add and mul without additional precision, I think they do have fdiv hardware for double precision. It's also noted that double precision fdiv and sqrt only support round-to-nearest-even.
Note that there is apparently no fdiv hardware for single precision, as single precision fdiv is computed with reciprocal (1/x * y), which is accurate as 754 required.
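The "additional precision" point can be illustrated on the host with single precision, where double is available as the wider format. This is only a sketch in plain C, not what NVIDIA's hardware or library actually does, and the function names are made up for the example. To correct a quotient q that approximates y/x you need the residual y - q*x, and that residual is lost if the product is rounded to the working precision first.

```c
/* Residual y - q*x with the product rounded to single precision first.
 * The volatile store forces q*x to be rounded to float and keeps the
 * compiler from fusing the multiply and subtract. */
float residual_single(float y, float x, float q)
{
    volatile float p = q * x;
    return y - p;
}

/* Same residual computed in double. A 24-bit by 24-bit product fits in
 * double's 53-bit significand, so q*x is exact here. */
double residual_wide(float y, float x, float q)
{
    return (double)y - (double)q * (double)x;
}
```

With y = 1, x = 3 and q = 1.0f/3.0f, the single precision residual comes out as exactly 0 while the wide one is -2^-25, so only the wide version can tell that q is slightly too large. For double precision there is no wider hardware format to fall back on; a fused multiply-add, which rounds a*b+c only once, is the usual way to get those extra bits.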
vasionok
30-Apr-2009, 08:57
(1/x * y), which is available as the __fdividef intrinsic in CUDA, is not as accurate as 754 requires. First, it has up to 2 ulp of error, since the errors of 1/x and of the multiply add up. Second, if 1/x overflows or underflows but y/x does not, the result is incorrect. So they say it has 2 ulp error only if x is in the range [2^-126, 2^126].
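The overflow case is easy to reproduce on a CPU. Below is a minimal host-side sketch in plain C, not CUDA; recip_mul_div and ieee_div are hypothetical names for this example, and the GPU's handling of denormals may differ from the host's. It picks an x below 2^-126 so that 1/x overflows single precision even though y/x is an ordinary number.

```c
/* Division done as reciprocal-then-multiply, i.e. (1/x) * y.
 * The volatile store forces 1/x to be rounded to single precision. */
float recip_mul_div(float y, float x)
{
    volatile float r = 1.0f / x;
    return r * y;
}

/* Plain division; on an IEEE 754 host this is correctly rounded. */
float ieee_div(float y, float x)
{
    return y / x;
}
```

For y = 2^-100 and x = 2^-128, the true quotient is 2^28, which ieee_div returns. In recip_mul_div the intermediate 1/x = 2^128 is already too large for a float, so it becomes infinity and the final result is infinity as well.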
CUDA also offers single precision IEEE-compliant division as the __fdiv_rn intrinsic, which is implemented with a lot of software support. You can see that by the size of the .cubin file and/or by disassembling it using decuda.
Similarly, if you compile (double)a/(double)b in CUDA and check the .cubin file, you'll see that it is huge, which is an indication that the operation is emulated in software. Disassembling it with decuda shows that it calls a subprogram, but you can't read much detail since decuda does not support any double precision arithmetic at all.
The point is: double precision division is emulated just as single precision IEEE-compliant division.
(1/x * y), which is available as the __fdividef intrinsic in CUDA, is not as accurate as 754 requires. First, it has up to 2 ulp of error, since the errors of 1/x and of the multiply add up. Second, if 1/x overflows or underflows but y/x does not, the result is incorrect. So they say it has 2 ulp error only if x is in the range [2^-126, 2^126].
Sorry, I meant "not" as accurate as 754 requires.
Similarly, if you compile (double)a/(double)b in CUDA and check the .cubin file, you'll see that it is huge, which is an indication that the operation is emulated in software. Disassembling it with decuda shows that it calls a subprogram, but you can't read much detail since decuda does not support any double precision arithmetic at all.
The point is: double precision division is emulated just as single precision IEEE-compliant division.
It's possible, but being completely IEEE-compliant (for example, NVIDIA supports denormalized double precision numbers, and div.f64 supports them as well) is not easy to "emulate" in software without some hardware assistance.
That said, I'd like to see a more complete comparison between NVIDIA's and ATI's floating point implementations. The CUDA programming guide has a small section on how NVIDIA's hardware behaves with regard to 754 compliance, but I didn't find similar text in ATI's Stream SDK documentation; maybe I just didn't look hard enough. To be fair, though, I think fully accurate round-to-nearest-even add/sub/mul/div is pretty much good enough for most uses.
It's very hard to find anything from AMD on precision. The R700 Family ISA says that DP is IEEE round-to-nearest (i.e. for ADD, MUL, MAD), not round-to-nearest-even.
Jawed