Nvidia BigK GK110 Kepler Speculation Thread

K20 and K20X NDA has been lifted. Titan claims #1 on the top500 supercomputer list.

http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last

K20X has a TDP of 235 watts. GF110's Tesla had a TDP of 250 watts, and the GTX 580 was able to increase its core clocks by 19% while staying within the same power envelope. I know that comparing Kepler @ 28nm to Fermi @ 40nm is not apples to apples, but that bodes well for potential GeForce core clocks.
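To put a rough number on that intuition, here's a back-of-envelope sketch using published core clocks (M2090 at 650 MHz, GTX 580 at 772 MHz, K20X at 732 MHz; treat these as assumptions, and obviously any GeForce GK110 clock is pure speculation at this point):

```python
# Back-of-envelope: apply the Fermi Tesla -> GeForce clock uplift to GK110.
# Clocks are assumed from public specs; the GeForce figure is speculative.
tesla_fermi_mhz = 650    # Tesla M2090 (GF110) core clock
geforce_fermi_mhz = 772  # GTX 580 (GF110) core clock
tesla_kepler_mhz = 732   # Tesla K20X (GK110) core clock

uplift = geforce_fermi_mhz / tesla_fermi_mhz  # ~1.19, the 19% cited above
print(f"Fermi Tesla -> GeForce uplift: {uplift:.1%}")
print(f"GK110 GeForce at the same uplift: {tesla_kepler_mhz * uplift:.0f} MHz")
```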

Interestingly enough, even though Oak Ridge National Laboratory has not yet had time to optimize the Titan supercomputer's efficiency due to the very recent installation of its components, Titan appears to have the best performance/watt (using measured Linpack performance and measured power consumption) of any supercomputer on the Top 500 list, including the well-regarded BlueGene/Q Sequoia supercomputer:

http://www.top500.org/list/2012/11/

At the moment, Titan has a Linpack efficiency (i.e. measured-to-theoretical performance) of ~65%, compared to ~81% for the BlueGene/Q Sequoia. If Oak Ridge National Laboratory can improve Titan's efficiency to anywhere close to 80%, then the "heterogeneous" Titan supercomputer will further distance itself from the pack in measured performance and performance/watt, all while achieving similar efficiency to the best-of-breed "homogeneous" supercomputers on the list.
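For reference, Linpack efficiency is just measured Rmax divided by theoretical Rpeak. A minimal sketch with the November 2012 list figures (petaflops, rounded; treat them as approximate):

```python
# Linpack efficiency = Rmax / Rpeak, using rounded Nov 2012 Top 500 figures.
titan_rmax, titan_rpeak = 17.59, 27.11      # PFLOPS
sequoia_rmax, sequoia_rpeak = 16.32, 20.13  # PFLOPS

print(f"Titan:   {titan_rmax / titan_rpeak:.1%}")      # ~64.9%
print(f"Sequoia: {sequoia_rmax / sequoia_rpeak:.1%}")  # ~81.1%

# What Titan would deliver at Sequoia-like efficiency:
print(f"Titan at 80%: {0.80 * titan_rpeak:.1f} PFLOPS")  # ~21.7 PFLOPS
```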

On a side note, what happened to the measured power consumption of the Intel Xeon Phi-equipped Stampede supercomputer? This data point is completely missing, whereas every other supercomputer in the Top 10 has it included...
 
On a side note, what happened to the measured power consumption of the Intel Xeon Phi-equipped Stampede supercomputer? This data point is completely missing, whereas every other supercomputer in the Top 10 has it included...

I noted that also. Its absence adds to speculation that Intel may be having issues with power usage on the Xeon Phi.
 
Double performance while reducing power

K20 and K20X NDA has been lifted. Titan claims #1 on the top500 supercomputer list.

http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last

K20X has a TDP of 235 watts. GF110's Tesla had a TDP of 250 watts, and the GTX 580 was able to increase its core clocks by 19% while staying within the same power envelope. I know that comparing Kepler @ 28nm to Fermi @ 40nm is not apples to apples, but that bodes well for potential GeForce core clocks.

Also note that Nvidia was able to roughly double DP performance (1.31 DP TFLOPS for the K20X vs. 0.665 DP TFLOPS for the M2090) while reducing power by 6% (235 watts vs. 250 watts).

That is impressive.
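The perf/W gap is easy to quantify from those figures (using 0.665 TFLOPS as the M2090's peak DP):

```python
# Peak-DP perf/W from the numbers quoted above.
k20x_tflops, k20x_watts = 1.31, 235
m2090_tflops, m2090_watts = 0.665, 250

k20x = k20x_tflops * 1000 / k20x_watts     # ~5.6 GFLOPS/W
m2090 = m2090_tflops * 1000 / m2090_watts  # ~2.7 GFLOPS/W
print(f"K20X:  {k20x:.2f} GFLOPS/W")
print(f"M2090: {m2090:.2f} GFLOPS/W -> {k20x / m2090:.1f}x improvement")
```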
 
I noted that also. Its absence adds to speculation that Intel may be having issues with power usage on the Xeon Phi.

According to an Intel slide, a supercomputer with a prototype Knights Corner installation (aka Xeon Phi, presumably manufactured on Intel's 22nm fabrication process) achieved 1.381 GFLOPS/W:

http://www.techpowerup.com/img/12-0...otchips_architecture_presentation_page_05.jpg

The Titan supercomputer in its unoptimized state achieves 2.143 GFLOPS/W, which would make it the most energy-efficient supercomputer in the world based on the prior quarter's Green 500 list (in addition to being the most powerful supercomputer in the world):

http://www.green500.org/lists/green201206

If Oak Ridge gets closer to 80% Linpack efficiency (and that is a big "if"), then Titan's performance/watt would be closer to 2.642 GFLOPS/W, which would be more than double the performance/watt of the Barcelona supercomputing system using Tesla M2090 cards based on the Fermi architecture.
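Both figures fall straight out of Titan's Top 500 entry (Rmax 17,590 TFLOPS, Rpeak ~27,112 TFLOPS, 8,209 kW measured; treat the inputs as approximate):

```python
# Reproducing the 2.143 and 2.642 GFLOPS/W figures from Top 500 data.
rmax_tflops, rpeak_tflops, power_kw = 17590, 27112, 8209

measured = rmax_tflops * 1000 / (power_kw * 1000)
print(f"Measured: {measured:.3f} GFLOPS/W")  # ~2.143

at_80_pct = 0.80 * rpeak_tflops * 1000 / (power_kw * 1000)
print(f"At 80% Linpack efficiency: {at_80_pct:.3f} GFLOPS/W")  # ~2.642
```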
 
According to an Intel slide, a supercomputer with a prototype Knights Corner installation (aka Xeon Phi, presumably manufactured on Intel's 22nm fabrication process) achieved 1.381 GFLOPS/W:

http://www.techpowerup.com/img/12-0...otchips_architecture_presentation_page_05.jpg

I noted that also. Its absence adds to speculation that Intel may be having issues with power usage on the Xeon Phi.

Isn't Xeon Phi also a 300 watt part?
 
Also from the Anandtech article: "Interestingly NVIDIA tells us that their yields are terrific..."

With 2 SMXs and 1 MC disabled for the mainstream product, this should come as a surprise to no one with a brain.

Logically speaking, you are correct. But some people believe that because NVIDIA has yet to release a full 15-SMX Tesla/Quadro/GeForce variant, yields must be poor even on the cut-down parts with 14 or 13 SMXs. Go figure :)
 
Isn't Xeon Phi also a 300 watt part?

Possibly, but it's not known for sure at this time. What is interesting is that the Linpack efficiency of the Xeon Phi-equipped Stampede supercomputer is ~67%, which is only about 2 points higher than the unoptimized Titan. Of course, Stampede may be unoptimized too, but we won't know for sure until the next round of Top 500 supercomputer results in June 2013.
 
The Titan supercomputer in its unoptimized state achieves 2.143 GFLOPS/W, which would make it the most energy-efficient supercomputer in the world based on the prior quarter's Green 500 list (in addition to being the most powerful supercomputer in the world):

http://www.green500.org/lists/green201206

If Oak Ridge gets closer to 80% Linpack efficiency (and that is a big "if"), then Titan's performance/watt would be closer to 2.642 GFLOPS/W, which would be more than double the performance/watt of the Barcelona supercomputing system using Tesla M2090 cards based on the Fermi architecture.
I don't think Titan's benchmark numbers qualify as "unoptimized". NV obviously supplies some library for the GPU kernels which is basically as good as it gets (they claim >90% of peak flops in the DGEMM kernel). And the scaling from that, to the fully implemented HPL code on a single node (76% efficiency), to the large-scale efficiency of Titan (~65%), is completely in line with what other heterogeneous clusters achieve.

The comparison point here is a cluster with Cypress GPUs (AMD GPUs have had >90% efficiency in the DGEMM kernel for years, so it is somewhat comparable): it got 75.54% efficiency with the full HPL code on a single node, 70.6% on 4 nodes, and 69.7% on 550 nodes. The scaling from a few nodes to a large number of nodes is obviously relatively flat with a good network and appropriate code, and I doubt Cray-built clusters show subpar network performance.

Furthermore, they had plenty of time to test and tune this using the 960 nodes with Fermi Tesla cards that they have had for quite some time already. For me, the ~65% number we see for Titan appears to be right on target for expectations.
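The cascade is easy to see if you line up the single-node-to-full-system retention for both machines (all figures are the ones quoted above; illustrative only):

```python
# Node -> full-system HPL efficiency retention for the two clusters discussed.
cypress_node, cypress_550 = 0.7554, 0.697  # HPL on 1 node vs. 550 nodes
titan_node, titan_full = 0.76, 0.65        # HPL on 1 node vs. 18,688 nodes

print(f"Cypress cluster retention: {cypress_550 / cypress_node:.1%}")  # ~92%
print(f"Titan retention:           {titan_full / titan_node:.1%}")     # ~86%
# Titan retains somewhat less, but it also scales to ~34x as many nodes.
```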
 
I don't think Titan's benchmark numbers qualify as "unoptimized".

According to Oak Ridge National Laboratory scientific computing chief Jeff Nichols, the lab team did not have time to fully optimize Titan for the benchmark tests:

http://www.knoxnews.com/news/2012/nov/12/ornl-unveils-worlds-fastest-computer-151-titan/

So when I said "unoptimized", I meant "not fully optimized yet". Since Kepler has some notable new compute features aimed at improving efficiency relative to Fermi, I wouldn't be surprised to see some efficiency gains from Titan later down the road as the programming team learns how to take better advantage of these new features.
 
I noted that also. Its absence adds to speculation that Intel may be having issues with power usage on the Xeon Phi.

At #52, Discover comes with Xeon Phi and has an efficiency of 66.3% and about 1.93 GFLOPS/W.
http://www.top500.org/system/177993

Intel's own Endeavor (#57) weighs in at 75.5% efficiency but only 1.26 GFLOPS/W.
http://www.top500.org/system/176908

There's another K20-equipped system at #90, called Todi: 69.7% efficiency; 2.25 GFLOPS/W.
http://www.top500.org/system/177472

But what's wrong with AMD's FirePro? There's only one system in the Top 100, and its efficiency is an abysmal 23 percent, with no power figure given.
http://www.top500.org/system/177996 -> see here for additional info: http://forum.beyond3d.com/showpost.php?p=1679422&postcount=4172. It seems the entry was borked at the time of my posting.
 
Makes you wonder why no one equipped a Top 500 system with decent and recent FirePros.
 
K20X has a TDP of 235 watts. GF110's Tesla had a TDP of 250 watts, and the GTX 580 was able to increase its core clocks by 19% while staying within the same power envelope.

No, it did NOT stay in the same power envelope. The 244W TDP sticker was more marketing than reality (while I'd guess the Tesla held to its specs).

At the moment, Titan has a Linpack efficiency (i.e. measured-to-theoretical performance) of ~65%, compared to ~81% for the BlueGene/Q Sequoia. If Oak Ridge National Laboratory can improve Titan's efficiency to anywhere close to 80%, then the "heterogeneous" Titan supercomputer will further distance itself from the pack in measured performance and performance/watt

Still, Linpack is a pretty GPU-friendly workload, so it's still kind of theoretical.
 
No, it did NOT stay in the same power envelope. The 244W TDP sticker was more marketing than reality (while I'd guess the Tesla held to its specs).

Yes, it did for its intended workload, namely gaming. TDP is thermal design power, not power consumption, and thus an average value; spikes in consumption are okay as long as the average consumption stays below the TDP. These measurements are card-only and from real-world games, not Furmark etc.:

http://www.3dcenter.org/artikel/ein...auchs/eine-neubetrachtung-des-grafikkarten-st

223W on average across at least 6 games (I don't know how many games Heise tests with).
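A minimal sketch of the point: TDP is a budget the average has to respect, not a cap on any single reading. The per-game draws below are made up purely for illustration (only the 223W average and the 244W TDP come from the posts above):

```python
# TDP vs. average draw: individual titles can sit near or above the average,
# as long as the mean stays under the thermal budget.
tdp_watts = 244
game_draw = {"game A": 210, "game B": 238, "game C": 215,   # hypothetical
             "game D": 231, "game E": 212, "game F": 232}   # readings

avg = sum(game_draw.values()) / len(game_draw)
print(f"Average: {avg:.0f} W vs. TDP {tdp_watts} W")  # 223 W
print("Within envelope:", avg <= tdp_watts)
```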
 
Makes you wonder why no one equipped a Top 500 system with decent and recent FirePros.

AMD's first, and probably only, real presence in HPC is through its x86 line. Its heart was never in the GPU-compute programming model, which it never really cared for, especially since it had repeatedly embarrassed itself with its default product line already.
You can tell from how it trumpets the AMD-powered Titan supercomputer.

I'm going to set up a little math joke, although doubtless everyone in this thread has already caught the punchline.

The Opteron 6274 provides 70.4 GFLOPS per chip.
Per AMD's own web announcement:
The DOE’s ORNL supercomputer contains 18,688 nodes, each holding a 16-core AMD Opteron 6274 processor, for a total of almost 300,000 cores at 20 petaFLOPS.

I like to stare at this, and just marvel at it. There are so many things that whirl around in my mind to do or say.

But just look at it.

Look at it.

Look.

.
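For anyone who'd rather see the punchline worked out than stare at it, the arithmetic from the numbers quoted above:

```python
# How much of the quoted "20 petaFLOPS" the Opterons themselves can supply.
nodes = 18688
opteron_gflops = 70.4  # peak DP per Opteron 6274 chip, as quoted above

cpu_pflops = nodes * opteron_gflops / 1e6
print(f"Opteron contribution: {cpu_pflops:.2f} PFLOPS")          # ~1.32
print(f"Share of the claimed 20 PFLOPS: {cpu_pflops / 20:.1%}")  # ~6.6%
# The balance is delivered by Titan's NVIDIA K20X GPUs.
```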
 