NVIDIA Kepler speculation thread

Umm.., no.

Are you sure there aren't at least a few applications that do? After all, some HPC-oriented products weren't very good in DP. For instance IBM's Cell (first version), GT200, or even RV770. Granted, future versions of these products saw their DP performance greatly improved (except for RV770, so far anyway) but still.

Hell, there was even a G80-based Tesla card, even though G80 had no DP support at all (if I recall correctly).
 
On Fermi, where there are more DP FPUs, fewer SP FPUs sit idle, and you will be running "closer to the TDP" when running DP FPU code.

DP FPUs use the SP FPUs as well.

And TDP is supposed to be the maximum power drawn, not something that depends on the workload.
 
Are you sure there aren't at least a few applications that do? After all, some HPC-oriented products weren't very good in DP. For instance IBM's Cell (first version), GT200, or even RV770. Granted, future versions of these products saw their DP performance greatly improved (except for RV770, so far anyway) but still.

Well, unless you are going to count 0.0001% of HPC as non zero, HPC is DP only.

Although in many places SP could be fine. People just use DP because it means fewer worries about accuracy, and at half rate, there's not much upside in risking low-precision work.
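To make the accuracy worry concrete, here is a minimal sketch (not from any post in this thread) that adds 0.1 a hundred thousand times, once with every intermediate result rounded to single precision and once in plain double precision. The helper names `to_f32` and `accumulate` are made up for this illustration, and the SP arithmetic is emulated on the CPU with the standard `struct` module rather than run on a GPU.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step: float, n: int, single: bool) -> float:
    """Add `step` to a running total n times.

    With single=True the step and every partial sum are rounded to
    binary32, which emulates doing the whole loop in SP.
    """
    if single:
        step = to_f32(step)
    total = 0.0
    for _ in range(n):
        total += step
        if single:
            total = to_f32(total)
    return total


N = 100_000
exact = N / 10  # the mathematically exact sum of N copies of 0.1

err_sp = abs(accumulate(0.1, N, single=True) - exact)
err_dp = abs(accumulate(0.1, N, single=False) - exact)

print(f"SP absolute error: {err_sp:.3e}")
print(f"DP absolute error: {err_dp:.3e}")
```

The SP error comes out many orders of magnitude larger than the DP error. Whether that matters depends entirely on the application, which is the point being argued here.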
 
hkultala's point is legit above, but I'm obviously missing something in that statement too. Shouldn't it be more like 2.5-3 GFLOPs/Watt on Fermi?

Maybe they used the real power figures (and not the public ones) when they calculated that. :p
 
Maybe they used the real power figures (and not the public ones) when they calculated that. :p

ROFL :LOL: Jokes aside, whichever way I turn it, the 1.5 GFLOP/W figure doesn't make sense. A C2050 has 515 GFLOP/s peak DP performance; it would need, what, ~343 W to deliver that at 1.5 GFLOP/W?
 
ROFL :LOL: Jokes aside, whichever way I turn it, the 1.5 GFLOP/W figure doesn't make sense. A C2050 has 515 GFLOP/s peak DP performance; it would need, what, ~343 W to deliver that at 1.5 GFLOP/W?

You are right. With 250 W of power, 15 SMs and a 1.4 GHz clock, I get ~2.7 GFLOPS/W.

Maybe this is their way of "achieving" a 4x increase in DP flops/W from Fermi to Kepler. :smile:
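For anyone who wants to check the arithmetic, a rough sketch follows. The 250 W, 15 SMs, 1.4 GHz, 515 GFLOP/s and 1.5 GFLOP/W figures are the ones quoted in this thread; the 32 cores per SM, the 2 flops per FMA and the half-rate DP are assumptions about GF100 that the posts do not spell out, and the function name `peak_dp_gflops` is made up for the example.

```python
def peak_dp_gflops(sms: int, cores_per_sm: int, clock_ghz: float,
                   dp_rate: float = 0.5, flops_per_fma: int = 2) -> float:
    """Paper peak DP throughput in GFLOP/s.

    dp_rate is the DP:SP throughput ratio (0.5 = half rate, assumed here),
    flops_per_fma counts a fused multiply-add as two floating-point ops.
    """
    return sms * cores_per_sm * clock_ghz * flops_per_fma * dp_rate


# Hypothetical full-clock Fermi part as described in the post above.
fermi_peak = peak_dp_gflops(sms=15, cores_per_sm=32, clock_ghz=1.4)
fermi_gflops_per_watt = fermi_peak / 250.0

# Power a C2050 would have to draw for 515 GFLOP/s to equal 1.5 GFLOP/W.
c2050_implied_watts = 515.0 / 1.5

print(f"peak DP:       {fermi_peak:.0f} GFLOP/s")
print(f"efficiency:    {fermi_gflops_per_watt:.2f} GFLOP/s per W")
print(f"implied power: {c2050_implied_watts:.0f} W")
```

These are paper peak numbers divided by board power, so they say nothing about sustained throughput or real power draw.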
 
:???: Too.

As for the "big GPUs do good GPUs" statement, it's not like Nvidia has much choice. The thing is likely to ship in late 2011 or 2012; by then discrete GPUs in the low end will have disappeared, and we don't know what AMD's and Intel's attitudes will be toward the low/mid segment and the development of their fusion chips. Nvidia had better have something impressive in the high end, on both the graphics and compute fronts.

Anyway, I have a tough time believing those statements. Smells fishy.
 
Well, unless you are going to count 0.0001% of HPC as non zero, HPC is DP only.

Although in many places SP could be fine. People just use DP because it means fewer worries about accuracy, and at half rate, there's not much upside in risking low-precision work.

OK, thanks!
 
I'm no expert in these things, but I don't think full speed DP makes much sense. If you can do N DP FLOPs per cycle, it probably doesn't require much additional hardware to be able to do 2N SP FLOPs per cycle.

Actually, you could do 4N sp flops easily.
I think this isn't that clear cut. Full speed DP indeed makes no sense: you need twice the operand fetch width and, for mul, basically (more than) four times the multipliers. But 2N SP could make sense if the cost of fetching operands is similar to the cost of the multipliers themselves, since it requires the same fetch bandwidth (this is the SSE approach, obviously). Also, for adds it's easy to do half speed instead of quarter speed DP.
So really, AMD's approach makes sense. DP adds at half speed (same fetch hw required as with twice the number of SP adds, not much more logic needed for the adders), DP mul (and hence mad) at quarter speed (this "wastes" fetch hw, but gives very significant savings in the multipliers). If you have a really strong emphasis on DP, you can beef up your multipliers to be able to do half speed DP too, so you can utilize the fetch hardware to the maximum extent. That is what Nvidia has done for GF100 (and only GF100, not other Fermi chips). More than half speed DP makes absolutely no sense, and no one is going to try to do it (well, pretty sure of that, it would be crazy).
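The "(more than) four times the multipliers" figure can be sanity-checked with a back-of-the-envelope sketch. It assumes, as a simplification not stated in the post, that multiplier cost scales with the number of partial-product bits (significand width squared) and that fetch cost scales with operand width; real designs differ. The significand widths are the IEEE 754 ones, including the implicit leading bit.

```python
# IEEE 754 significand widths, counting the implicit leading bit.
SP_SIGNIFICAND_BITS = 24   # binary32
DP_SIGNIFICAND_BITS = 53   # binary64

SP_OPERAND_BITS = 32
DP_OPERAND_BITS = 64


def multiplier_cost(significand_bits: int) -> int:
    """Crude cost model: partial-product bits of an n x n array multiplier."""
    return significand_bits * significand_bits


# How much more multiplier hardware one DP mul needs than one SP mul.
mul_ratio = multiplier_cost(DP_SIGNIFICAND_BITS) / multiplier_cost(SP_SIGNIFICAND_BITS)

# How much more operand fetch width one DP op needs than one SP op.
fetch_ratio = DP_OPERAND_BITS / SP_OPERAND_BITS

print(f"multiplier cost, DP vs SP: {mul_ratio:.2f}x")
print(f"fetch width, DP vs SP:     {fetch_ratio:.0f}x")
```

Under this model a DP multiply costs a bit under 5x an SP multiply while needing only 2x the fetch width, which is why quarter-rate DP mul saves so much multiplier area and half-rate DP keeps the fetch path fully used.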
 
You are right. With 250 W of power, 15 SMs and a 1.4 GHz clock, I get ~2.7 GFLOPS/W.

Maybe this is their way of "achieving" a 4x increase in DP flops/W from Fermi to Kepler. :smile:

Could it be that they were talking about "sustained" FLOPS/W and not peak? Maybe we're just over-analyzing NVIDIA's bullsh*t…
 
Well, unless you are going to count 0.0001% of HPC as non zero, HPC is DP only.

Although in many places SP could be fine. People just use DP because it means fewer worries about accuracy, and at half rate, there's not much upside in risking low-precision work.

There are a number of HPC apps with single precision.

DK
 
There are a number of HPC apps with single precision.

DK
Like?

And as I said, if you have half rate DP, then what is the upside of using single precision?

Also, in my experience, people tend to shrug and use DP throughout just out of habit. Maybe you can share some of your experiences.
 
Where does your 190W for GT200b come from? TDP?

If you are running only DP FPU code on GT200, most of the execution units (8/9) are idling, and you are not consuming that much power with just the DP FPUs.

On Fermi, where there are more DP FPUs, fewer SP FPUs sit idle, and you will be running "closer to the TDP" when running DP FPU code.

So you have an invalid comparison.

Actually, the whole idea of paper flops per watt is weird if you follow that line of thinking.
 
Would you consider distributed computing projects as HPC, or only centralized systems? If the former, DC projects such as Folding@home make significant use of SP FLOPs, at least when using GPUs.
 
That's one data point. HPC's bigger than that.

I thought there was a class of HPC apps which used single precision. Perhaps air quality simulations are fine with SP.

BTW, in your discussions with the authors, did you ask about why they used SP? What did they say? And did they consider the possible accuracy issues?
 