NVIDIA GT200 Rumours & Speculation Thread

DP would be useful for SATs and to filter exponential shadow maps on 'long' ranges without employing log-filtering.
I skimmed Andy's GPU Gems 3 article the other day and noticed it mentioned double precision being useful to ensure numerical stability.
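To make the SAT point concrete, here's a toy host-side sketch (everything in it is mine rather than from the article) of why fp32 runs out of bits:

#include <cstdio>

// A SAT entry is a prefix sum over up to width*height texels, and fp32
// rounding error grows with the magnitude of that running sum. The filter
// result is a *difference* of four large entries, so those absolute errors
// swamp the small signal you actually want.
int main()
{
    const int n = 2048 * 2048;   // texels feeding a far-corner SAT entry
    float  sumF = 0.0f;
    double sumD = 0.0;
    for (int i = 0; i < n; ++i) {
        float texel = 0.5f + 1e-4f * (i % 7);  // arbitrary depth-ish data
        sumF += texel;   // fp32 prefix sum, SAT-style
        sumD += texel;   // fp64 reference
    }
    printf("fp32: %.2f  fp64: %.2f  error: %.2f\n", sumF, sumD, sumD - sumF);
    return 0;
}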
 
C2Q can do a mul and an add in a single cycle? Last I checked they didn't have a MAD/FMA instruction, though I haven't kept up with the latest SSE flavors.
Another point in that respect would be theoretical GFLOPS vs. sustained throughput. I must admit that I don't have much of a clue about this kind of stuff, but am under the impression that GPUs generally get much closer to theoretical peaks than CPUs.
 
I wonder if NVIDIA are gambling that folks will be smart when writing their CUDA code and only use DP where it's really necessary, using SP everywhere else.

78 GFLOPS isn't much; arguably it's not even worth the hassle of porting code to CUDA when £5k gets you a 16-core Opteron box, and fifteen minutes with an OpenMP manual is much simpler and cheaper on the old manpower front.
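If that is the gamble, the canonical pattern would presumably be to keep storage and most of the math in fp32 and promote only the accumulation - a minimal sketch (kernel, names and launch config are all mine):

__global__ void dotMixed(const float* a, const float* b, double* out, int n)
{
    // Hypothetical mixed-precision dot product: fp32 multiplies, fp64 only
    // in the accumulator and reduction, so the lone DP unit per SM sees a
    // small fraction of the work. Assumes one block of 256 threads.
    __shared__ double acc[256];
    double s = 0.0;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        s += (double)(a[i] * b[i]);   // fp32 mul, fp64 accumulate
    acc[threadIdx.x] = s;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            acc[threadIdx.x] += acc[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        *out = acc[0];   // launched as e.g. dotMixed<<<1, 256>>>(a, b, out, n)
}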
 
btw, is this supposed separate DP MAD in addition to the 8 single-precision MADs (thus helping with the single-precision flops too), or replacing one of them?
I don't see how you could make one of the 8 lanes in a SIMD unit really different.

Clearly you can perform a double-precision operation on two SP operands - so there's a theoretical bump in SP performance (yay, 1TFLOP) but I guess the rounding behaviour will be a bit different...
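If the rumoured numbers hold up, the arithmetic even works out: 240 SP ALUs x 3 flops (MAD + MUL) x ~1.3GHz ≈ 933 GFLOPS, and letting the DP unit in each of the 30 SMs co-issue an SP MAD adds 30 x 2 x ~1.3GHz ≈ 78 GFLOPS on top - nudging it just past the 1TFLOP mark (the same 78 GFLOPS as the DP peak, it being the same unit).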

With the throughput on this unit being so low, you'd only want to use it if the result has 8 clocks of latency before being required by another instruction - otherwise it's going to stall other SP operations.

Jawed
 
I don't see how you could make one of the 8 lanes in a SIMD unit really different.
Yes, you're right, this looks odd. It doesn't really make sense that one of the MADs would just do an (implicit) double-precision op while the other 7 do the same single-precision op...

With the throughput on this unit being so low, you'd only want to use it if the result has 8 clocks of latency before being required by another instruction - otherwise it's going to stall other SP operations.
I've got to wonder - it looks like a strange decision to have a unit explicitly for DP ops, instead of changing the SP units (and the control logic, if necessary) to also be able to handle DP ops. I don't know of any other chip that has separate execution units for SP and DP.
 
C2Q can do a mul and an add in a single cycle? Last I checked they didn't have a MAD/FMA instruction, though I haven't kept up with the latest SSE flavors.
Yes, C2Q can do a mul and an add in the same cycle. So can Phenom, P4 and K8... All these chips have separate execution units for FMUL and FADD (P4 and K8 only get half the throughput per clock because their units are only 64-bit wide internally, so 128-bit instructions take 2 cycles).
So while a single MAD would take 2 cycles (because the chip couldn't issue the mul and add parts simultaneously - actually more than 2 cycles, given the dependency and the mul latency), the peak throughput with two separate units for mul and add is indeed the same as with a single MAD unit. In theory mul + add has benefits because the chip can issue independent muls and adds, while a single MAD should be easier for dependency tracking and the like, since only one instruction needs to be issued.
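To sketch it (intrinsics version - the port assignments are the well-known Core 2 ones, the numbers are mine):

#include <xmmintrin.h>

// d = a*b + c on Core 2: no fused op, so this compiles to separate
// MULPS / ADDPS. The mul and the add execute on different ports, so with
// enough independent work in flight the core sustains one 4-wide mul plus
// one 4-wide add per clock = 8 SP flops/clock/core, ~96 GFLOPS for a 3GHz quad.
__m128 madd(__m128 a, __m128 b, __m128 c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}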
 
Another point in that respect would be theoretical GFLOPS vs. sustained throughput. I must admit that I don't have much of a clue about this kind of stuff, but am under the impression that GPUs generally get much closer to theoretical peaks than CPUs.
I think this depends a bit on what you do. CPUs should be able to reach their theoretical performance too with "simple" math kernels, but they often get limited by memory bandwidth (possibly requiring you to rewrite algorithms to be more cache-friendly) - Nehalem should be much better in that area than a Penryn C2Q, but obviously has nowhere near the memory bandwidth of a GTX280. OTOH, GPUs only come close to their peak performance if you can process data in batches equal to (or larger than) the chip's batch size (though actually with one DP unit per shader multiprocessor, the batch size for DP ops should be only 2? Or maybe I'm misunderstanding something here).
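Back-of-envelope for the bandwidth point (numbers mine): a streaming kernel like y[i] = a*x[i] + y[i] does 2 flops per 12 bytes moved, so even at the GTX280's rumoured ~140GB/s you're capped at roughly 23 GFLOPS - a few percent of peak - no matter how many ALUs are idle. That same ratio is what forces the cache-blocking rewrites on the CPU side.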
 
Yes, you're right, this looks odd. It doesn't really make sense that one of the MADs would just do an (implicit) double-precision op while the other 7 do the same single-precision op...


I've got to wonder - it looks like a strange decision to have a unit explicitly for DP ops, instead of changing the SP units (and the control logic, if necessary) to also be able to handle DP ops. I don't know of any other chip that has separate execution units for SP and DP.

Yes, it does have sort of a tacked-on feel, almost as though it were an afterthought...

There is one advantage to having separate DP units though - your SP units aren't tied up so you can still get full rate there (barring setup issues).
 
I skimmed Andy's GPU Gems 3 article the other day and noticed it mentioned double precision being useful to ensure numerical stability.
Incidentally I just prototyped a SAT + EVSMs implementation the other day to judge the feasibility... the good news is that the precision artifacts are on the same order as SAT + standard VSMs... the bad news is the same :)

I may be able to pull off something with mod arithmetic like I did with the int32 stuff in the chapter/demo, but it's not as obvious how to make that work well with the exponential warp. It'd be a really fun thing to play with using doubles.
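For anyone following along, the crux is the warp itself - a rough sketch (the names and the constant handling are mine):

// EVSMs filter moments of exp(c*z) linearly. expf overflows fp32 once
// c*z exceeds ~88, so it's the 8 exponent bits rather than the 24-bit
// mantissa that run out first - hence the talk of "more exponent bits" below.
__device__ float2 evsmWarp(float z, float c)
{
    float ez = expf(c * z);            // warped depth
    return make_float2(ez, ez * ez);   // first and second moments
}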
 
The other day I was actually thinking about using extended precision instead of double precision, since to filter exponential values we already have enough mantissa bits - what we need is more exponent bits. Theoretically, extended precision implemented with half floats might be better than single-precision values (range-wise).
 
The other day I was actually thinking about using extended precision instead of double precision, since to filter exponential values we already have enough mantissa bits - what we need is more exponent bits. Theoretically, extended precision implemented with half floats might be better than single-precision values (range-wise).
Sounds like you could do your own math then, using, say, an int8 for the mantissa and an int24 for the exponent.
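Something along these lines, say - a variation using a float mantissa for the arithmetic rather than an int8 (names and layout are mine, just to show the add, which is all a box filter needs):

struct XFloat { float m; int e; };   // value = m * 2^e; int e gives far more range than fp32's 8 exponent bits

// Add two extended-range values: align the smaller exponent, sum, renormalise.
__device__ XFloat xadd(XFloat a, XFloat b)
{
    if (a.e < b.e) { XFloat t = a; a = b; b = t; }  // a holds the larger exponent
    float m = a.m + ldexpf(b.m, b.e - a.e);         // b.m * 2^(b.e - a.e)
    int e;
    m = frexpf(m, &e);                              // renormalise m into [0.5, 1)
    XFloat r = { m, a.e + e };
    return r;
}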

Jawed
 
Sounds like you could do your own math then, using, say, an int8 for the mantissa and an int24 for the exponent.
If the math is non-linear, though, you can't use hardware filtering (or SATs, for that matter) :(

But yes, G(T)200 ftw ;) That said, my guess is that they won't bother to expose DP until the next major API update. Perhaps they'll add it to the GL extensions, but I dunno whether I'm keen enough to recode everything in GL again! ;)
 