NVIDIA Kepler speculation thread

Ailuros · Sep 22, 2010

Good point :!:

Alexko · Sep 22, 2010

rpg.314 said:
Umm.., no.

Are you sure there aren't at least a few applications that do? After all, some HPC-oriented products weren't very good in DP. For instance IBM's Cell (first version), GT200, or even RV770. Granted, future versions of these products saw their DP performance greatly improved (except for RV770, so far anyway) but still.

Hell, there was even a G80-based Tesla card, even though G80 had no DP support at all (if I recall correctly).

Ailuros · Sep 22, 2010

Errr wait a second....

http://www.xbitlabs.com/news/video/...epler_and_Maxwell_Architectures_Incoming.html

According to Nvidia, Fermi architecture is capable of achieving typical double precision (DP) performance of 1.5GFLOPS per watt.

hkultala's point is legit above, but I'm obviously missing something in that statement too. Shouldn't it be more like 2.5-3 GFLOPs/Watt on Fermi?

rpg.314 · Sep 22, 2010

hkultala said:
On Fermi, where there are more DP FPUs, fewer SP FPUs sit idle, and you will be running "closer to the TDP" when running DP FPU code.

DP FPUs use the SP FPUs as well.

And TDP is expected to be the maximum power drawn, and not be workload dependent.

rpg.314 · Sep 22, 2010

Alexko said:
Are you sure there aren't at least a few applications that do? After all, some HPC-oriented products weren't very good in DP. For instance IBM's Cell (first version), GT200, or even RV770. Granted, future versions of these products saw their DP performance greatly improved (except for RV770, so far anyway) but still.

Well, unless you are going to count 0.0001% of HPC as non zero, HPC is DP only.

Although in many places SP could be fine. People just use DP as it gives fewer worries about accuracy, and at half rate there's not much upside in risking low-precision work.
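
To make the accuracy worry concrete, here is a minimal sketch of how single precision silently drops a small addend that double precision keeps. The helper names `to_float32` and `add_sp` are made up for this illustration, and SP arithmetic is emulated by rounding through `struct`, which is an assumption of the sketch rather than anything a poster used.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (IEEE 754 double) to the nearest IEEE 754 single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add_sp(a: float, b: float) -> float:
    """Emulate a single-precision add: round the inputs and the result to float32."""
    return to_float32(to_float32(a) + to_float32(b))


big = 2.0 ** 24            # 16777216, the last point where float32 counts every integer

sp_sum = add_sp(big, 1.0)  # 2**24 + 1 needs 25 significand bits, float32 has 24
dp_sum = big + 1.0         # float64 has 53 significand bits, so this is exact

print(sp_sum == big)       # True: the +1 was rounded away in SP
print(dp_sum == big)       # False: DP kept it
```

A long running sum or an ill-conditioned solver can hit this kind of absorption many times over, which is one reason people default to DP when the speed penalty is only 2x.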

rpg.314 · Sep 22, 2010

Ailuros said:
hkultala's point is legit above, but I'm obviously missing something in that statement too. Shouldn't it be more like 2.5-3 GFLOPs/Watt on Fermi?

Maybe they used the real power figures (and not the public ones) when they calculated that.

Ailuros · Sep 22, 2010

rpg.314 said:
Maybe they used the real power figures (and not the public ones) when they calculated that.

ROFL

Jokes aside, whichever way I turn it, the 1.5 GFLOPS/W figure doesn't make sense. A C2050 has 515 GFLOPS DP peak performance; it would need what, ~343W, to reach that DP performance?
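
The arithmetic behind that objection can be sketched as follows. The 515 GFLOPS peak and the 1.5 GFLOPS/W claim come from the posts above; the 250 W board power is only an assumed round number for comparison, and the function names are made up for this illustration.

```python
def watts_implied(peak_gflops: float, gflops_per_watt: float) -> float:
    """Power a card would have to draw for a given peak and efficiency figure."""
    return peak_gflops / gflops_per_watt


def efficiency(peak_gflops: float, watts: float) -> float:
    """Peak GFLOPS per watt for a given board power."""
    return peak_gflops / watts


C2050_PEAK_DP_GFLOPS = 515.0   # figure quoted in the thread
CLAIMED_GFLOPS_PER_WATT = 1.5  # Nvidia's figure as reported by X-bit labs
ASSUMED_BOARD_WATTS = 250.0    # assumption for illustration, not a measured value

implied = watts_implied(C2050_PEAK_DP_GFLOPS, CLAIMED_GFLOPS_PER_WATT)
print(f"{implied:.0f} W implied by the 1.5 GFLOPS/W claim")

paper = efficiency(C2050_PEAK_DP_GFLOPS, ASSUMED_BOARD_WATTS)
print(f"{paper:.2f} GFLOPS/W at an assumed 250 W")
```

The implied draw comes out at roughly 343 W, in line with the figure quoted above and well beyond what a single board of that class is rated for, which is why the 1.5 GFLOPS/W number looks too low for a peak figure.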

rpg.314 · Sep 22, 2010

Ailuros said:
ROFL Jokes aside, whichever way I turn it, the 1.5 GFLOPS/W figure doesn't make sense. A C2050 has 515 GFLOPS DP peak performance; it would need what, ~343W, to reach that DP performance?

You are right. With 250W for power, 15 SMs and a 1.4 GHz clock, I get ~2.7 GFLOPS/W.

Maybe this is their way of "achieving" a 4x increase in DP flops/W from Fermi to Kepler. :smile:
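
The ~2.7 figure can be reproduced with a short sketch. It assumes each GF100 SM retires 16 DP FMAs per clock, counted as 2 flops each (32 DP flops per clock per SM); that per-SM rate is not stated in the thread. The C2050 configuration of 14 SMs at 1.15 GHz is likewise an assumption added here as a cross-check against the 515 GFLOPS figure, and the function name is made up.

```python
DP_FLOPS_PER_CLOCK_PER_SM = 32  # assumed: 16 DP FMA units per SM x 2 flops per FMA


def peak_dp_gflops(sms: int, clock_ghz: float,
                   flops_per_clock_per_sm: int = DP_FLOPS_PER_CLOCK_PER_SM) -> float:
    """Theoretical peak DP throughput in GFLOPS."""
    return sms * flops_per_clock_per_sm * clock_ghz


# Full chip as in the post above: 15 SMs, 1.4 GHz, 250 W.
full_chip = peak_dp_gflops(15, 1.4)
print(f"{full_chip:.1f} GFLOPS, {full_chip / 250.0:.2f} GFLOPS/W")  # about 672 and 2.69

# Cross-check against the C2050 peak, assuming 14 SMs at 1.15 GHz.
c2050 = peak_dp_gflops(14, 1.15)
print(f"{c2050:.1f} GFLOPS")  # about 515
```

Both results are paper peaks divided by a nominal power figure, so they say nothing about sustained throughput or real power draw.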

liolio · Sep 22, 2010

Too.

In regard to the "big GPUs do good GPUs" statement, it's not like Nvidia has much choice. The thing is likely to ship in late 2011 or 2012; by then discrete GPUs in the low end will have disappeared, and we don't know what AMD's and Intel's attitudes will be toward the low/mid segment and the development of their fusion chips. Nvidia had better have something impressive in the high end, on both the graphics and compute fronts.

Anyway, I have a tough time believing those statements. Smells fishy.

Alexko · Sep 22, 2010

rpg.314 said:
Well, unless you are going to count 0.0001% of HPC as non zero, HPC is DP only.

Although in many places SP could be fine. People just use DP as it gives fewer worries about accuracy, and at half rate there's not much upside in risking low-precision work.

OK, thanks!

mczak · Sep 22, 2010

Alexko said:
I'm no expert in these things, but I don't think full speed DP makes much sense. If you can do N DP FLOPs per cycle, it probably doesn't require much additional hardware to be able to do 2N SP FLOPs per cycle.

rpg.314 said:
Actually, you could do 4N sp flops easily.

I think this isn't that clear-cut. Full-speed DP indeed makes no sense: you need twice the operand fetch width and, for mul, basically (more than) four times the multipliers. But 2N SP could make sense if the cost of fetching operands is similar to the cost of the multipliers themselves, since it requires the same fetch bandwidth (this is the SSE approach, obviously). Also, for adds it's easy to do half-speed instead of quarter-speed DP.

So really, AMD's approach makes sense: DP adds at half speed (same fetch hw required as with twice the number of SP adds, not much more logic needed for the adders), DP mul (and hence mad) at quarter speed (this "wastes" fetch hw, but gives very significant savings in the multipliers). If you have a really strong emphasis on DP, you can beef up your multipliers to do half-speed DP too, so you can utilize the fetch hardware to the maximum extent. That is what Nvidia has done for GF100 (and only GF100, not other Fermi chips). More than half-speed DP makes absolutely no sense, and no one is going to try to do it (well, pretty sure of that; it would be crazy).
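
A rough first-order model of that trade-off, assuming multiplier cost grows with the square of the significand width (24 bits for SP, 53 for DP) and operand fetch cost grows linearly with operand width. Both scaling rules and the function names are simplifying assumptions for illustration, not measurements of any real design.

```python
SP_SIGNIFICAND_BITS = 24  # IEEE 754 single: 23 stored bits + 1 implicit
DP_SIGNIFICAND_BITS = 53  # IEEE 754 double: 52 stored bits + 1 implicit


def multiplier_cost_ratio(dp_rate: float) -> float:
    """Multiplier hardware needed for DP at `dp_rate` of the SP rate,
    relative to the SP multipliers, under an n^2 partial-product model."""
    per_unit = (DP_SIGNIFICAND_BITS / SP_SIGNIFICAND_BITS) ** 2
    return per_unit * dp_rate


def fetch_width_ratio(dp_rate: float) -> float:
    """Operand fetch width needed for DP at `dp_rate`, relative to SP
    (DP operands are twice as wide)."""
    return 2.0 * dp_rate


for name, rate in [("full", 1.0), ("half", 0.5), ("quarter", 0.25)]:
    print(f"{name:>7}-rate DP: multipliers x{multiplier_cost_ratio(rate):.2f}, "
          f"fetch x{fetch_width_ratio(rate):.2f}")
```

Under this model full-rate DP needs nearly five times the multiplier hardware and twice the fetch width, half rate matches the SP fetch width exactly, and quarter rate leaves half the fetch hardware unused, which lines up with the reasoning in the post above.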

Alexko · Sep 22, 2010

rpg.314 said:
You are right. With 250W for power, 15 SMs and a 1.4 GHz clock, I get ~2.7 GFLOPS/W.

Maybe this is their way of "achieving" a 4x increase in DP flops/W from Fermi to Kepler. :smile:

Could it be that they were talking about "sustained" FLOPS/W and not peak? Maybe we're just over-analyzing NVIDIA's bullsh*t…

Ailuros · Sep 22, 2010

Alexko said:
Maybe we're just over-analyzing NVIDIA's bullsh*t…

dkanter · Sep 22, 2010

rpg.314 said:
Well, unless you are going to count 0.0001% of HPC as non zero, HPC is DP only.

Although in many places SP could be fine. People just use DP as it gives fewer worries about accuracy, and at half rate there's not much upside in risking low-precision work.

There are a number of HPC apps with single precision.

DK

rpg.314 · Sep 22, 2010

dkanter said:
There are a number of HPC apps with single precision.

DK

Like?

And as I said, if you have half rate DP, then what is the upside of using single precision?

Also, in my experience, people just tend to shrug and use DP throughout out of habit. Maybe you can share some of your experiences.

dkanter · Sep 22, 2010

rpg.314 said:
Like?

And as I said, if you have half rate DP, then what is the upside of using single precision?

Also, in my experience, people just tend to shrug and use DP throughout out of habit. Maybe you can share some of your experiences.

http://www.realworldtech.com/page.cfm?ArticleID=RWT050310102525

That's just one I can reference off the top of my head.

David

Blazkowicz · Sep 22, 2010

hkultala said:
Where does your 190W for GT200b come? TDP?

If you are running only DP FPU code on GT200, most of the execution units (8/9) are idling, and you are not consuming that much power with just the DP FPUs.

On Fermi, where there are more DP FPUs, fewer SP FPUs sit idle, and you will be running "closer to the TDP" when running DP FPU code.

So you have invalid comparison.

Actually, the whole idea of paper FLOPS per watt is weird if you follow that line of thinking.

ShaidarHaran · Sep 22, 2010

Would you consider distributed computing projects as HPC, or only centralized systems? If the former, DC projects such as Folding @ Home make significant use of SP FLOPs, at least when using GPUs.

larrabee · Sep 22, 2010

The code for F@H is modified GROMACS, which is (or, more arguably, was) HPC oriented. They don't use MPI anymore, and SSE is no longer exclusive to Linux.

rpg.314 · Sep 23, 2010

That's one data point. HPC's bigger than that.

I thought there was a class of HPC apps which used single precision. Perhaps air quality simulations are fine with SP.

BTW, in your discussions with the authors, did you ask about why they used SP? What did they say? And did they consider the possible accuracy issues?
