Umm... no.
According to Nvidia, the Fermi architecture is capable of achieving typical double precision (DP) performance of 1.5 GFLOPS per watt.
DP FPUs use the SP FPUs as well. On Fermi, where there are more DP FPUs, there are fewer SP FPUs idling, and you will be running "closer to the TDP" when running DP FPU code.
Are you sure there aren't at least a few applications that do? After all, some HPC-oriented products weren't very good in DP. For instance IBM's Cell (first version), GT200, or even RV770. Granted, future versions of these products saw their DP performance greatly improved (except for RV770, so far anyway) but still.
hkultala's point is legit above, but I'm obviously missing something in that statement too. Shouldn't it be more like 2.5-3 GFLOPs/Watt on Fermi?
Maybe they used the real power figures (and not the public ones) when they calculated that.
ROFL. Jokes aside, whichever way I turn it, the 1.5 GFLOPS/W doesn't make sense. A C2050 has 515 GFLOPS DP peak performance; it would need, what, ~344W to reach that DP performance?
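For anyone who wants to redo that arithmetic, here is a quick sketch. It only uses the two numbers from the posts above (the claimed 1.5 GFLOPS/W and the 515 GFLOPS DP peak); the variable names are mine, not anything from NVIDIA.

```python
# Back-of-the-envelope check of the 1.5 GFLOPS/W claim against the C2050 peak.
claimed_gflops_per_watt = 1.5   # efficiency figure attributed to NVIDIA above
c2050_peak_dp_gflops = 515.0    # C2050 DP peak quoted in the post above

# Power the board would have to draw if both numbers were true at once.
implied_watts = c2050_peak_dp_gflops / claimed_gflops_per_watt

print(f"Implied board power: {implied_watts:.0f} W")  # about 343 W
```

That is the same ballpark as the ~344W above, and well beyond what a single card like this would plausibly be rated for, which is the whole point of the objection.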
Well, unless you are going to count 0.0001% of HPC as nonzero, HPC is DP only.
Although in many places SP could be fine. People just use DP because it means fewer worries about accuracy, and at half rate there's not much upside in risking low-precision work.
I'm no expert in these things, but I don't think full speed DP makes much sense. If you can do N DP FLOPs per cycle, it probably doesn't require much additional hardware to be able to do 2N SP FLOPs per cycle.
I think this isn't that clear cut. Full speed DP makes no sense indeed. You need twice the operand fetch width, and for mul basically (more than) four times the multipliers. But 2N SP could make sense if the cost of fetching operands is similar to the cost of the multipliers themselves, since it requires the same fetch bandwidth (this is the SSE approach, obviously). Also, for adds it's easy to do half speed instead of quarter speed DP. Actually, you could do 4N SP flops easily.
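To put rough numbers on the "twice the fetch width" and "(more than) four times the multipliers" bits, here is a small sketch. The significand widths (24 bits for SP, 53 bits for DP, counting the implicit leading bit) are the IEEE 754 values; the idea that multiplier cost grows with the square of the significand width is my simplifying assumption for a plain array multiplier, not a statement about any real GPU design.

```python
# Rough cost ratios for a DP multiply versus an SP multiply.
sp_operand_bits, dp_operand_bits = 32, 64          # storage width per operand
sp_significand_bits, dp_significand_bits = 24, 53  # IEEE 754, incl. implicit bit

# Operand fetch scales with the storage width.
fetch_ratio = dp_operand_bits / sp_operand_bits

# Assumption: multiplier array area ~ (significand width)^2.
multiplier_ratio = (dp_significand_bits / sp_significand_bits) ** 2

print(f"Operand fetch width: {fetch_ratio:.0f}x")
print(f"Multiplier array:    {multiplier_ratio:.2f}x")  # a bit under 5x
```

So under that assumption one DP multiplier costs somewhat more than four SP multipliers, while the operand bandwidth only doubles.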
You are right. With 250W for power, 15 SMs and a 1.4 GHz clock, I get ~2.7 GFLOPS/W.
Maybe this is their way of "achieving" a 4x increase in DP flops/W from Fermi to Kepler. :smile:
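Spelling out how the ~2.7 figure comes about, as a sketch: the 250W, 15 SMs and 1.4 GHz are from the post above. The per-SM throughput of 16 DP FMAs per clock (counted as 2 flops each) is my assumption about a full-rate Fermi part and is not stated in this thread.

```python
# Peak DP throughput and efficiency from the figures in the post above.
sm_count = 15
clock_ghz = 1.4
board_power_w = 250.0

# Assumptions (not from the thread): 16 DP FMAs per SM per clock, 2 flops per FMA.
dp_fma_per_sm_per_clock = 16
flops_per_fma = 2

peak_dp_gflops = sm_count * dp_fma_per_sm_per_clock * flops_per_fma * clock_ghz
gflops_per_watt = peak_dp_gflops / board_power_w

print(f"Peak DP:    {peak_dp_gflops:.0f} GFLOPS")
print(f"Efficiency: {gflops_per_watt:.1f} GFLOPS/W")
```

With those inputs the peak works out to about 672 GFLOPS and roughly 2.7 GFLOPS/W, so the number only matches if the part really runs DP at the assumed rate.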
Maybe we're just over-analyzing NVIDIA's bullsh*t…
Like? There are a number of HPC apps with single precision.
DK
Like?
And as I said, if you have half rate DP, then what is the upside of using single precision?
Also, in my experience, people just tend to shrug and use DP throughout out of habit. Maybe you can share some of your experiences.
Where does your 190W for GT200b come from? TDP?
If you are running only DP FPU code on GT200, most of the execution units (8/9) are idling, and you are not consuming that much power with just the DP FPUs.
On Fermi, where there are more DP FPUs, there are fewer SP FPUs idling, and you will be running "closer to the TDP" when running DP FPU code.
So you have an invalid comparison.