Fornowagain
> depends on reviews i suppose. if a 4870x2 costs less and performs better than the GTX 280 prices will go down.

Factor in the constrained supply rumor too.
Depends how you define "weak". A current Core2Quad delivers 2 (vec2 fp64) * 4 (cores) * 2 (mul+add) * clock FLOPS.
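That formula is easy to turn into a worked example. A quick sketch (the 3.0 GHz clock below is a made-up placeholder, not a specific SKU):

```python
# Theoretical peak DP FLOPS for a Core2Quad, per the formula above:
# 2 (vec2 fp64 SSE lanes) * 4 (cores) * 2 (separate mul + add per cycle) * clock
def peak_dp_gflops(clock_ghz, simd_width=2, cores=4, ops_per_cycle=2):
    """Theoretical peak, double precision, in GFLOPS."""
    return simd_width * cores * ops_per_cycle * clock_ghz

# e.g. a hypothetical 3.0 GHz part:
print(peak_dp_gflops(3.0))  # -> 48.0
```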
> DP would be useful for SATs and to filter exponential shadow maps on 'long' ranges without employing log-filtering.

I skimmed Andy's GPU Gems 3 article the other day and noticed it mentioned double precision being useful to ensure numerical stability.
> C2Q can do a mul and an add in a single cycle? Last I checked they didn't have a MAD/FMA instruction, though I haven't kept up with the latest SSE flavors.

Another point in that respect would be theoretical GFLOPS vs. sustained throughput. I must admit that I don't have much of a clue about this kind of stuff, but I'm under the impression that GPUs generally get much closer to their theoretical peaks than CPUs do.
> btw is this supposed separate dp mad in addition to the 8 single precision mads (thus helping with the single precision flops too?) or replacing one of them?

I don't see how you could make one of the 8 lanes in a SIMD unit really different.
> I don't see how you could make one of the 8 lanes in a SIMD unit really different.

Yes, you're right, this looks odd. It doesn't really make sense that one of the MADs would just do an (implicit) double precision op while the other 7 do the same single precision op...
> With the throughput on this unit being so low you'd only want to use it if the resultant has an 8 clock latency before being required by another instruction - otherwise it's going to stall other SP operations.

I've got to wonder - it looks like a strange decision to have a unit explicitly for DP ops, instead of changing the SP units (and the control logic, if necessary) to also be able to handle DP ops. I don't know of any other chip having separate execution units for SP and DP.
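The throughput gap the thread is circling around is easy to quantify. A rough sketch, assuming (as discussed here) 8 single-precision MAD lanes and one dedicated DP MAD unit per shader multiprocessor; the 1.3 GHz shader clock is an illustrative placeholder:

```python
# Rough peak-throughput comparison per shader multiprocessor (SM),
# assuming 8 SP MAD lanes vs. 1 DP MAD unit (a thread assumption,
# not a confirmed spec).
SP_LANES = 8
DP_UNITS = 1
FLOPS_PER_MAD = 2  # one multiply-add counts as two flops

def sm_peak_gflops(clock_ghz, units):
    return units * FLOPS_PER_MAD * clock_ghz

sp = sm_peak_gflops(1.3, SP_LANES)  # placeholder shader clock
dp = sm_peak_gflops(1.3, DP_UNITS)
print(sp / dp)  # -> 8.0: DP peak is 1/8 of SP peak per SM
```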
> C2Q can do a mul and an add in a single cycle? Last I checked they didn't have a MAD/FMA instruction, though I haven't kept up with the latest SSE flavors.

Yes, C2Q can do a mul and an add in the same cycle. So can Phenom, P4 and K8... All these chips have separate execution units for FMUL and FADD (P4 and K8 only get half the throughput per clock because their units are only 64-bit wide internally, so they need 2 cycles per instruction).
> Another point in that respect would be theoretical GFLOPS vs. sustained throughput. I must admit that I don't have much of a clue about this kind of stuff, but am under the impression that GPUs generally get much closer to theoretical peaks than CPUs.

I think this depends a bit on what you do. CPUs should be able to reach their theoretical performance too with "simple" math kernels, but they might often get limited by memory bandwidth (possibly requiring you to rewrite algorithms to be more cache friendly) - Nehalem should be much better in that area than a Penryn C2Q, but obviously has nowhere near the memory bandwidth of a GTX 280. OTOH, GPUs only come close to their peak performance if you can process data in batches equal to (or larger than) the chip's batch size (though actually with one DP unit per shader multiprocessor the batch size for DP ops should be only 2? Or maybe I'm misunderstanding something here).
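The peak-vs-sustained point can be sketched with a simple roofline-style model: sustained throughput is bounded by either compute peak or bandwidth times arithmetic intensity, whichever is lower. All numbers below are illustrative placeholders, not measured figures for any of the chips mentioned:

```python
# Roofline-style sketch: sustained rate = min(compute peak,
# memory bandwidth * flops performed per byte moved).
def sustained_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# A streaming kernel doing ~1 flop per byte is bandwidth-bound on
# both a GPU-like and a CPU-like configuration (placeholder numbers):
print(sustained_gflops(peak_gflops=900.0, bandwidth_gbs=140.0, flops_per_byte=1.0))  # -> 140.0
print(sustained_gflops(peak_gflops=50.0,  bandwidth_gbs=12.0,  flops_per_byte=1.0))  # -> 12.0
```

This is why the bandwidth comparison above matters: a chip with a huge GFLOPS peak but a low flops-per-byte workload ends up limited by memory, not ALUs.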
> I skimmed Andy's GPU Gems 3 article the other day and noticed it mentioned double precision being useful to ensure numerical stability.

Incidentally, I just prototyped a SAT + EVSM implementation the other day to judge the feasibility... the good news is that the precision artifacts are on the same order as SAT + standard VSMs... the bad news is the same.
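The SAT precision problem is easy to reproduce offline: a summed-area table is just a 2-D prefix sum, and single-precision rounding error accumulates with every texel added. A quick numpy sketch (texture size and data are arbitrary):

```python
import numpy as np

# Build a SAT (2-D prefix sum) over random depth-like data in fp32
# and fp64, then compare. The fp32 error grows with texture size,
# which is why DP is attractive for SATs.
rng = np.random.default_rng(0)
tex = rng.random((1024, 1024), dtype=np.float32)

sat32 = tex.cumsum(axis=0, dtype=np.float32).cumsum(axis=1, dtype=np.float32)
sat64 = tex.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

err = np.abs(sat32 - sat64).max()
print(err)  # largest absolute fp32 error over the whole table
```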
> Any idea what the perf. penalty would be going to EP?

It really depends on what you actually need to compute and what shortcuts you can take. Anyway... we are going OT, I'm going to stop here before Rys tells me off!
> The other day I was actually thinking about using extended precision instead of double precision, as to filter exponential values we already have enough mantissa bits, while we need more exponent bits. Theoretically extended precision implemented with half floats might be better than single precision values (range wise).

Sounds like you could do your own math then, using say an int8 for the mantissa and an int24 for the exponent.
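A toy sketch of that "own math" idea (names and bit widths are illustrative, and this is host-side Python, not a GPU-ready implementation): a positive value is stored as a small fixed-point mantissa plus a wide integer exponent, and addition aligns exponents by hand, which is the operation filtering needs:

```python
import math

MANT_BITS = 8  # few mantissa bits, per the suggestion; exponent is a plain int

def encode(x):
    """Split positive x into (8-bit mantissa, integer exponent)."""
    m, e = math.frexp(x)               # x = m * 2**e, with m in [0.5, 1)
    return round(m * (1 << MANT_BITS)), e

def decode(m, e):
    return (m / (1 << MANT_BITS)) * 2.0 ** e

def add(a, b):
    """Add two encoded values by shifting the smaller one into alignment."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:                        # ensure a holds the larger exponent
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    shift = ea - eb
    m = ma + (mb >> shift if shift < 32 else 0)  # tiny term may vanish entirely
    e = ea
    while m >= (1 << MANT_BITS):       # renormalize back into 8 mantissa bits
        m >>= 1
        e += 1
    return m, e

x, y = encode(3.0), encode(5.0)
print(decode(*add(x, y)))  # ~8, to within 8-bit mantissa precision
```

The point of the format is that the exponent field can be made as wide as needed (the int24 above), so huge exponential ranges survive even though each individual value keeps only 8 bits of mantissa.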
> Sounds like you could do your own math then, using say an int8 for the mantissa and int24 for the exponent.

If the math is non-linear, though, you can't use hardware filtering (or SATs, for that matter).