Nvidia GT300 core: Speculation

OK. The issue here is that I am talking specifically (and only) about FMA and not about other fp functionality. The rest of the stuff is obviously important and relevant.
 
"adds the lower 64 bits of the result" implies rounding to me.
Multiplying two doubles, each with 52-bit mantissas (implicitly 53-bit numbers) results in at most 106-bits of result, which would become a 105-bit encoding if IEEE went that far, so bits have to be chopped off the end of the process (I'm not saying the hardware actually computes all 106-bits, I don't know what it does). If you wanted the result of this multiply you'd only use the top 53 bits of that 105-bit result.
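
A quick sketch of that bit-counting with plain Python integers (made-up values, purely illustrative - not what the hardware does):

Code:
# two 53-bit significands: implicit leading 1 plus 52 stored bits (low bits arbitrary)
a = (1 << 52) | 0xF00DDEADBEE
b = (1 << 52) | 0x31415926535
prod = a * b
print(prod.bit_length())                  # 105 for these values (at most 106 in general)
top53 = prod >> (prod.bit_length() - 53)  # a plain double MUL keeps roughly this much (plus rounding)
print(top53.bit_length())                 # 53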

64-bits is clearly more than the 53-bit precision requirement. I can't tell if the wording is sloppy (since the absolute precision of the initial MUL is not stated) or contains some implications I can't discern.

Maybe someone can work out what's going on?

Where'd you find that by the way? It's a nightmare trying to find info on AMD's stuff.
http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf

Jawed
 
I know at one point AMD couldn't do certain transcendental functions with DP, but I believe that was a software limitation. I don’t know if that’s still true or not (or if it really was a software limitation).
No GPU can do double-precision transcendentals intrinsically - they effectively use macros. In the ATI case currently you have to write your own :rolleyes: Only, what, 19 months since the first double-precision GPU arrived.

NVidia, pretty reasonably, appears to put high value on precision (if you're gonna bother with double, then you prolly care), with lots of double-precision functions (e.g. log or sin) being 1 or 2 ULP. Best of luck with your ATI versions...

OpenCL only optionally supports double precision - but the specification is quite thorough in what's required if enabled, though fused multiply-add is optional. Precisions are quite loose, generally.

Jawed
 
No GPU can do double-precision transcendentals intrinsically - they effectively use macros. In the ATI case currently you have to write your own :rolleyes: Only, what, 19 months since the first double-precision GPU arrived.

NVidia, pretty reasonably, appears to put high value on precision (if you're gonna bother with double, then you prolly care), with lots of double-precision functions (e.g. log or sin) being 1 or 2 ULP. Best of luck with your ATI versions...

OpenCL only optionally supports double precision - but the specification is quite thorough in what's required if enabled, though fused multiply-add is optional. Precisions are quite loose, generally.

Jawed

Seems to make sense. I haven't actually had to deal with DP stuff yet, so I'm by no means an expert in this area.
 
Multiplying two doubles, each with 52-bit mantissas (implicitly 53-bit numbers) results in at most 106-bits of result, which would become a 105-bit encoding if IEEE went that far, so bits have to be chopped off the end of the process (I'm not saying the hardware actually computes all 106-bits, I don't know what it does). If you wanted the result of this multiply you'd only use the top 53 bits of that 105-bit result.

Why would you only need the top 53-bits? It's quite possible for the thrown-away bits to affect the result of the addition. The relevant patent explicitly indicates that the full 106-bit product flows down to the adder, no truncation involved.
patent said:
In mantissa path 516 (FIG. 8), multiplier 802 computes the 106-bit product Am*Bm and provides the product to 168-bit adder 804
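
A tiny illustration (plain Python, values picked to make the effect obvious) of how bits discarded after the multiply can change the final sum:

Code:
from fractions import Fraction

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(a * b + c)                  # 0.0 - the product rounds to 1.0 first, losing the -2**-60
exact = Fraction(a) * Fraction(b) + Fraction(c)  # what an FMA feeds into its single rounding
print(float(exact))               # ~ -8.67e-19, i.e. -2**-60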

Jawed said:

Thanks!
 
Why would you only need the top 53-bits?
If you wanted the result of just the multiply, then you'd take the top 53-bits, encode as sign+52-bit+exponent, then be done. So clearly doing anything with more than 53-bits (i.e. 64-bits) implies more precision is being used.

It's quite possible for the thrown-away bits to affect the result of the addition. The relevant patent explicitly indicates that the full 106-bit product flows down to the adder, no truncation involved.
That's if you want to do full-speed denormals, it seems, where the full 106-bit product plus the aligned 53-bit addend have to be accommodated without truncation - hence the 168-bit adder.

I guess they decided that since double-precision throughput would be so low they might as well go the whole hog with precision. Perhaps roused by the 80-bit extended precision of x86, an extra 11-bits of mantissa...

Since ATI doesn't do denormals (they should really be called subnormals), what I'm guessing is a 4x27-bit path - 108 bits - is all that's required. I just don't get the whole thing, though.

Thanks for the patent, hadn't seen it on my last trawl.

Jawed
 
I think if you look at the CUDA spec, it's clear that NV anticipates making cards with no DP support. Those would probably be consumer-oriented mainstream cards where you don't want to spend die space on unused features... or maybe cards where you want to disable DP to force CUDA users to buy more expensive parts for production use.

Regarding DP support - first of all, it's very nebulous anyway. DP in NV GPUs isn't IEEE compliant, even if they have the IEEE data types, so you can't just do a straight mindless port anyway. I don't know about ATI GPUs, but GPGPU is even less of a priority for them, so go figure...

Doing DP calculations in SW on SP hardware is quite fine, it's basically the equivalent of microcode, but for GPUs. There is a performance hit, but it's tolerable...and as someone pointed out, it's the end result that matters (GFLOPs/watt or per $), not how you get there. ATI has the performance to be interesting (and in theory, they could probably have better IEEE compliance since it's mostly in SW), but their ecosystem is pretty weak.
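
For concreteness, one flavour of that software trick is "double-float" arithmetic, where a value is carried as an unevaluated sum of two single-precision numbers. A rough Python sketch of the product-splitting step (on a GPU the error term would come from a single-precision FMA; here float64 is wide enough to recover it exactly):

Code:
import numpy as np

def two_prod_f32(a, b):
    """Split a float32 product into hi + lo (both float32), with hi = round(a * b)."""
    a, b = np.float32(a), np.float32(b)
    hi = np.float32(a * b)
    # the exact product of two float32s fits in a float64, so the rounding
    # error of hi can be recovered exactly and carried as the low word
    lo = np.float32(np.float64(a) * np.float64(b) - np.float64(hi))
    return hi, lo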

As much as I think proprietary languages are total bullshit, NV at least has the start of an ecosystem for their developers. I don't really think ATI does. That's definitely a big problem, but I think ATI's strategy is the same as what AMD does in the CPU world - let the leader forge the path, and then follow in their footsteps. My guess is ATI will rely on OpenCL for their ecosystem, which makes more sense to me (long term), but it does mean that short term they will have big challenges.

Anyway, what you really want in HW is some sort of vector functional unit that can operate on both SP and DP input, but with DP performance = 1/2 SP performance (like SSE units). GPUs may move towards that, but there are many steps along the way (SW libraries, microcode, dedicated DP, etc.).

DK
 
If you wanted the result of just the multiply, then you'd take the top 53-bits, encode as sign+52-bit+exponent, then be done. So clearly doing anything with more than 53-bits (i.e. 64-bits) implies more precision is being used.

I read that 64-bit as being the full shebang - mantissa + exponent. So I'm not sure there's any extra precision there.
 
As much as I think proprietary languages are total bullshit,
Minor point, but CUDA is only proprietary in that nVidia developed it and is the only IHV to support it. There isn't, in principle, anything preventing any other IHV from producing their own compiler for CUDA.

Of course, CUDA is developed with nVidia's architectures in mind, so there may be performance-oriented disincentives for other IHVs to produce their own CUDA compilers.
 
I think if you look at the CUDA spec, it's clear that NV anticipates making cards with no DP support. Those would probably be consumer-oriented mainstream cards where you don't want to spend die space on unused features... or maybe cards where you want to disable DP to force CUDA users to buy more expensive parts for production use.

That was the plan for GT200 initially, but they changed their mind and left DP enabled on all cards.
I wonder about GT21x: are they similar to G92, or to GT200? If the latter, they would be "CUDA level 1.2" GPUs. (1.0 is G80, 1.1 is G84/G86 and G9x, 1.3 is GT200.)

Do you refer to further specs tailored for GT3xx?
 
I read that 64-bit as being the full shebang - mantissa + exponent. So I'm not sure there's any extra precision there.
In theory, for a full-precision result, the mantissas are adjusted in their alignment:

Code:
                                                                                                         1
               1         2         3         4         5         6         7          8        9         0
      1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456
A*B = 10101101010111010101011010101010101010101010101010101010111001101011011001101011111010101 e27 
C   = 11100011100101111001110001011011110100101110 e-33

exponent difference=60, shifts to:
Code:
                                                                                                         1
               1         2         3         4         5         6         7          8        9         0
      1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456 
      10101101010111010101011010101010101010101010101010101010111001101011011001101011111010101 e27
                                                                  11100011100101111001110001011011110100101110 e27
The addend in this case effectively becomes subnormal (denormal), with 60 leading zeroes (which I didn't bother to type). Obviously NVidia's implementation with a 168-bit adder is just a wider version of this.
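
Expressed as integer arithmetic (just a sketch of the numerics, not a claim about NVidia's datapath), the fused operation keeps the whole product and rounds only once at the end:

Code:
def fma_significands(am, bm, cm, exp_diff):
    """am, bm, cm: 53-bit integer significands; exp_diff: how many bit positions
    the addend lines up to the right of the product (60 in the example above).
    Returns the exact sum - a fused multiply-add rounds this just once, at the end."""
    prod = am * bm                       # full 106-bit product, nothing chopped off
    if exp_diff >= 0:
        return (prod << exp_diff) + cm   # addend sits exp_diff places below the product
    return prod + (cm << -exp_diff)      # addend's exponent is larger: shift it up instead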

The trouble is working out which 64-bits of what kind of result the R700 documentation is referring to. The description just seems incomplete. Need to sleep on it...

Jawed
 
That was the plan for GT200 initially, but they changed their mind and left DP enabled on all cards.

Wonder if this has to do with yields? Better to build lower-end cards from partly-broken higher-end cards that had DP? DP on the lower end for free?

I wonder about GT21x: are they similar to G92, or to GT200? If the latter, they would be "CUDA level 1.2" GPUs. (1.0 is G80, 1.1 is G84/G86 and G9x, 1.3 is GT200.)

Also been wondering about this. Might be in the CUDA 2.3 docs, will have to check later...
 
Minor point, but CUDA is only proprietary in that nVidia developed it and is the only IHV to support it. There isn't, in principle, anything preventing any other IHV from producing their own compiler for CUDA.

Of course, CUDA is developed with nVidia's architectures in mind, so there may be performance-oriented disincentives for other IHVs to produce their own CUDA compilers.

If you think CUDA is anything other than a proprietary language, you have a serious case of rectal cranial inversion.

Someone could in theory produce a CPU that is binary compatible with IBM mainframes using BT or other tricks. Nonetheless, IBM mainframes are still quite proprietary and expensive as hell.

DK
 
If you think CUDA is anything other than a proprietary language, you have a serious case of rectal cranial inversion.

Someone could in theory produce a CPU that is binary compatible with IBM mainframes using BT or other tricks. Nonetheless, IBM mainframes are still quite proprietary and expensive as hell.

DK
Not exactly the same situation. The language level is far, far above the machine level.
 
His predictions for the 2015 timeframe seem extremely optimistic. As of now it's still up in the air whether mighty Intel can achieve 11nm by then, let alone third parties like TSMC and GF.
 
I think that would be 345G*5 = ~1.7T. Now if we assume a 1.5GHz clock then we get 1.725T / 2 (for FMA) / 1.5G ≈ 575 ALUs, which is in the right ballpark for the 512 ALU rumours for GT300.
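
Spelling that out (taking the 345G and the x5 from the post above as given, and counting an FMA as 2 flops):

Code:
flops_sp = 345e9 * 5         # ~1.725e12 flops/s
fma_rate = flops_sp / 2      # FMA counted as 2 flops -> ~862.5e9 FMA/s
alus     = fma_rate / 1.5e9  # assumed 1.5GHz clock
print(alus)                  # 575.0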
 