Nvidia BigK GK110 Kepler Speculation Thread

drive-by to say that although I left NV in July, I was the CUDA SW lead for Titan for a long time (over a year). B3D taught me well :) (this is the second #1 machine I was heavily involved with, I worked on Tianhe-1A as well)

also, it should be noted that it's hard to get efficiency close to BG/Q on any x86 machine--even Jaguar was running at 75% efficiency.
 
drive-by to say that although I left NV in July, I was the CUDA SW lead for Titan for a long time (over a year). B3D taught me well :) (this is the second #1 machine I was heavily involved with, I worked on Tianhe-1A as well)

also, it should be noted that it's hard to get efficiency close to BG/Q on any x86 machine--even Jaguar was running at 75% efficiency.

Why did you leave Nvidia? Where are you at now?
 
So when I said "unoptimized", I meant "not fully optimized yet".
No such thing as "fully optimized"...


Since Kepler has some notable new compute features aimed at improving efficiency relative to Fermi, I wouldn't be surprised to see some efficiency gains later down the road from Titan as the programming team learns how to take better advantage of these new features.

They have had silicon in house for at least a year...
 
Why did you leave Nvidia? Where are you at now?
Why I left: lots of reasons, but the main one is that at this point I don't think HPC is where I want to spend the majority of my career. It was an unbelievable first job to have in terms of advancement and learning opportunities, but it was time to move on.

Working on the Android RenderScript team at Google now. (did you know: we shipped GPU compute on Nexus 10)
 
At #52, Discover comes with Xeon Phi and has an efficiency of 66.3% and about 1.93 GFLOPS/watt.
http://www.top500.org/system/177993

Intel's own Endeavor (#57) weighs in at 75.5% efficiency but only 1.26 GFLOPS/watt.
http://www.top500.org/system/176908

There's another K20-equipped system at #90 called Todi: 69.7%; 2.25 GFLOPS/watt.
http://www.top500.org/system/177472

But what's wrong with AMD's FirePro? There's only one system in the Top 100, and its efficiency is an abysmal 23 percent with no power figure given.
http://www.top500.org/system/177996 -> see here for additional info: http://forum.beyond3d.com/showpost.php?p=1679422&postcount=4172. Seems like the entry was borked at the time of my posting.

When running Linpack, the Titan system overall achieves ~ 2.1428 GFLOPS/w, but GK110 by itself achieves ~ 7 GFLOPS/w (per Bill Dally @ NVIDIA in his SC12 presentation). So GK110 appears to be very energy efficient (relatively speaking). With Titan, there is a 1:1 ratio of CPU-to-GPU, a very high total number of cores, a very high performance network, and other things that contribute to the overall power consumption and reduced energy efficiency relative to running GK110 by itself.
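The system-level figure falls straight out of the published Top500 numbers for Titan; a minimal sanity check, using the reported Rmax of 17.59 PFLOPS and the 8.209 MW average power:

```python
# Rough sanity check of Titan's system-level Linpack perf/W.
# Inputs are the publicly reported Top500 values for Titan (Nov 2012).
rmax_gflops = 17_590_000   # Linpack Rmax: 17.59 PFLOPS
power_watts = 8_209_000    # average power during the Linpack run: 8.209 MW

perf_per_watt = rmax_gflops / power_watts
print(round(perf_per_watt, 4))  # ~2.1428 GFLOPS/W, matching the quoted figure
```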

With respect to the King Abdulaziz supercomputing system, the data has been revised/updated to include power consumption. The Linpack performance/watt for the system is quite good (relatively speaking), but the Linpack efficiency is still really low, at ~ 38.4% of the theoretical peak performance. Any thoughts on why the efficiency would be so low?
 
When running Linpack, the Titan system overall achieves ~ 2.1428 GFLOPS/w, but GK110 by itself achieves ~ 7 GFLOPS/w (per Bill Dally @ NVIDIA in his SC12 presentation).

I wonder where the line is drawn, to get that incredibly precise 2.1428 figure.
There's the power used by the rack cabinets themselves (maybe 208V DC or some combination of DC voltages). Then power used by the transformers/PSU that generate that supply from whatever comes to the building. The UPS batteries too, and then cooling that big server room.
 
With respect to the King Abdulaziz supercomputing system, the data has been revised/updated to include power consumption. The Linpack performance/watt for the system is quite good (relatively speaking), but the Linpack efficiency is still really low, at ~ 38.4% of the theoretical peak performance. Any thoughts on why the efficiency would be so low?
The data in the Top500 list is still wrong, as already noted in the SI thread. The cluster has a theoretical peak of just 674.7 TFLOP/s and runs at a Linpack efficiency of 62.4%. The predecessor of that cluster (using Cypress cards) attained a Linpack efficiency of 69.6% iirc. The difference is that they now put two much faster cards in a single node (instead of a single Cypress card) while the network basically stayed the same, so it may be starting to limit a bit in comparison. The efficiency of the DGEMM kernels on the GPU is in both cases close to or even above 90%.
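A quick way to see that the "abysmal" ~38.4% and the corrected 62.4% describe the same Linpack run: only the Rpeak in the denominator differs.

```python
# The two quoted efficiency figures differ only in which Rpeak sits in the
# denominator; the measured Rmax is the same either way.
rmax = 0.624 * 674.7        # ~421 TFLOP/s, from the corrected peak and 62.4%
listed_rpeak = 1098.0       # TFLOP/s, the (apparently wrong) Top500 entry

apparent_efficiency = rmax / listed_rpeak
print(round(apparent_efficiency * 100, 1))  # ~38.3%, close to the puzzling ~38.4%
```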
 
An erroneous peak performance data point would certainly explain the strangely low Linpack efficiency. That is a pretty big error for Top 500 to make, because the peak system performance goes from 1.098 Petaflops to 0.6747 Petaflops.
 
That is a pretty big error for Top 500 to make, because the peak system performance goes from 1.098 Petaflops to 0.6747 Petaflops.
The core count is also completely wrong. Top500 says there are 33600 accelerator cores and 38400 in total, while there are in fact 3360 CPU cores and 840 GPUs (on 420 S10000 cards). Looks like they should train that copy paste stuff a bit more. :LOL:

If we count CUs as cores (with nV GPUs SMx are usually counted as cores) it would be 23520 accelerator cores and 26880 in total.
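The corrected counts above are just multiplication; a sketch, treating each CU as a "core" and assuming each FirePro S10000 card carries two GPUs with 28 CUs apiece (per the figures in this post):

```python
# Reproducing the corrected core counts for the cluster above.
# Assumes each S10000 card carries two GPUs with 28 CUs each.
cards = 420
gpus = cards * 2                  # 840 GPUs
cus_per_gpu = 28
accel_cores = gpus * cus_per_gpu  # 23520, counting CUs as "cores"
cpu_cores = 3360
total = accel_cores + cpu_cores   # 26880 in total
print(gpus, accel_cores, total)
```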
 
I wonder where the line is drawn, to get that incredibly precise 2.1428 figure.
There's the power used by the rack cabinets themselves (maybe 208V DC or some combination of DC voltages). Then power used by the transformers/PSU that generate that supply from whatever comes to the building. The UPS batteries too, and then cooling that big server room.

Arthur Bland from OLCF/ORNL gave some details about this: the 8.209 MW figure includes all cooling, PSUs and everything (at the wall, so to say), but is the average over the whole Linpack run. Peak power was about 8.9 MW. I hope those measurements are standardized across Top500 entries; otherwise the Green500 would be reduced to some sort of Excel joke, with green500 = sort[Rmax/Power] as the formula.
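The "Excel joke" formula really is that simple; a minimal sketch, with made-up system names and numbers purely for illustration (note that TFLOPS/kW comes out directly in GFLOPS/W):

```python
# Minimal sketch of "green500 = sort[Rmax/Power]". The entries below are
# illustrative placeholders, not real Top500 data.
systems = [
    ("system_a", 17_590.0, 8_209.0),  # (name, Rmax in TFLOPS, power in kW)
    ("system_b", 110.5, 47.3),
    ("system_c", 421.0, 180.0),
]

# Sort descending by Rmax/Power; TFLOPS per kW equals GFLOPS per W.
green = sorted(systems, key=lambda s: s[1] / s[2], reverse=True)
for name, rmax, power in green:
    print(name, round(rmax / power, 3), "GFLOPS/W")
```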
 
According to the podcast guys at HPCWire, Tesla K20/K20X is more energy-efficient than both Xeon Phi and the FirePro S10000 when compared on its own:

http://www.hpcwire.com/hpcwire/2012-11-16/podcast:_amd_troubles_sc12_winners_and_losers.html

The podcast guys even hint that it would be possible to create a "trick" system to gain top honors on the Green 500 list. They suggest that the Beacon (with Xeon Phi) and SANAM (with FirePro S10000) supercomputing systems are propelled to the top of the Green 500 list for two reasons:

1) The ratio of accelerators to CPUs in the system is relatively high. For instance, in the Beacon system there are four Xeon Phis for every two Xeon CPUs per node. Since accelerators have relatively high performance/watt compared to CPUs, the Green 500 performance/watt score is significantly boosted.

2) The size and scope of the system is relatively small compared to the Top 10 supercomputing systems. For instance, the Beacon system contains only 144 Xeon Phi accelerators (compared to 18,688 Tesla K20X accelerators in Titan and 1,875 Xeon Phi accelerators in Stampede). Performance scaling tends to worsen as the number of cores in a system increases, so it is much easier to achieve high performance/watt with smaller systems.
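Point 1) can be sketched as a weighted average. This is a toy model only: the ~7 GFLOPS/W accelerator figure echoes the Dally GK110 quote upthread, while the CPU figure and per-device wattages are rough guesses, not measured values.

```python
# Toy model: blended system perf/W as a weighted average of accelerator and
# CPU contributions. Per-device figures are illustrative assumptions
# (~7 GFLOPS/W for an accelerator, ~1 GFLOPS/W for a CPU), not measurements.
def blended_perf_per_watt(accel_per_cpu, accel_w=225.0, cpu_w=115.0,
                          accel_gflops_per_w=7.0, cpu_gflops_per_w=1.0):
    flops = accel_per_cpu * accel_w * accel_gflops_per_w + cpu_w * cpu_gflops_per_w
    watts = accel_per_cpu * accel_w + cpu_w
    return flops / watts

print(round(blended_perf_per_watt(0.5), 2))  # 1:2 accelerator:CPU ratio
print(round(blended_perf_per_watt(2.0), 2))  # 4:2 ratio, as in Beacon's nodes
```

Under these assumptions, raising the accelerator:CPU ratio pushes the blended figure toward the accelerator's perf/W, which is the effect the podcast describes.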

I do believe that the Green 500 score for Beacon (with Xeon Phi) in particular is a bit suspect. None of the other systems equipped with Xeon Phi are even close to Beacon's Green 500 score. I suspect that when running Linpack on the Beacon system, the Xeon CPUs were turned off while the Xeon Phis were used exclusively for Linpack. Since Xeon Phi has x86 functionality and can essentially operate autonomously in the system, it is possible to do this. And since accelerators tend to have higher performance/watt than CPUs, that would boost the Green 500 performance/watt score substantially. But this is a bit misleading too, because anyone with a Xeon CPU + Xeon Phi supercomputing system would never realistically use it with the Xeon CPUs turned off, as the Xeon Phis would be far less efficient than the Xeon CPUs when executing serial portions of the code.

On the flip side, the Titan supercomputing system with K20X accelerators appears to achieve very high Green 500 performance/watt in its standard (and not yet fully optimized) configuration, while also achieving extremely high overall Top 500 performance, so that is quite an achievement.
 
Here is an interesting fact that highlights the momentum behind GPU-accelerated heterogeneous high-performance computing: only one year ago, the total peak throughput of all Top 500 systems combined (excluding Sequoia) was less than 30 Petaflops. Within the last month alone, 30 Petaflops' worth of Tesla K20/K20X cards have shipped:

http://www.youtube.com/watch?v=50AzXbrvtmg
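A back-of-the-envelope check on what that shipment figure implies, assuming the 30 Petaflops refers to aggregate peak double-precision throughput and using K20X's ~1.31 TFLOPS peak DP rating (K20 is somewhat lower, so the real card count would be a bit higher):

```python
# Back-of-the-envelope: how many cards does "30 Petaflops shipped" imply?
# Assumes the figure means aggregate peak double-precision throughput and
# uses K20X's ~1.31 TFLOPS peak DP rating.
shipped_pflops = 30.0
k20x_peak_tflops = 1.31

cards = shipped_pflops * 1000 / k20x_peak_tflops
print(round(cards))  # roughly 22,900 cards
```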

This will be a very interesting and exciting space to monitor over the next few years...
 
I'm not sure you can turn off the CPUs in a system with Phi cards to get that data... it's a coprocessor unit; they can't work alone (as mentioned by Anandtech, if I remember correctly).
 
Actually, Xeon Phi can boot Linux and run x86 software on its own (remember that Xeon Phi consists of a few dozen relatively simple x86 CPU cores placed together on one piece of silicon). I believe that Intel has pulled the wool over people's eyes by running Linpack on the Beacon supercomputing system without turning on the Xeon CPUs, primarily to get on top of the Green 500 list. The reality is that Xeon Phi is supposed to be used as a "co-processor", and no one in their right mind would use it in a supercomputing system without some high-performance CPUs. Fortunately for NVIDIA, Project Denver will integrate CPU and GPU cores so that their card will be able to boot Linux too. The same presumably goes for AMD with their next-gen card.
 
What is K20C?
Maybe it just denotes the board or cooling variation. I've seen an air-cooled K20 card bearing the designation K20SUC, and for the K10 there is also a version named K10-RL-SUC (specifying an air-cooled version with airflow from left to right, however that is defined). There are probably a few versions around with different cooling layouts, or even coming without a heatsink (for watercooled installations).
 