Tianhe-1A installation - 7168 NVIDIA Tesla M2050s

Discussion in 'GPGPU Technology & Programming' started by Rys, Oct 29, 2010.

  1. Rys

    Rys Graphics @ AMD
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,182
    Likes Received:
    1,579
    Location:
    Beyond3D HQ
    The Tianhe-1A installation at the NSC in Tianjin is a 2.5PF (measured w/ LINPACK, peak is supposedly over 4PF) machine running Xeons, 7168 M2050s and "Fei-Teng 1000" processors. It should take #1 in the upcoming Top500 refresh, and looks to be open for almost anything, but from what I can tell it's mostly bio-med R&D and oil-gas sims so far.

    It's probably a really big deal for NV due to the size. Tim, can you tell us any more?
     
  2. Tim Murray

    Tim Murray the Windom Earle of mobile SOCs
    Veteran

    Joined:
    May 25, 2003
    Messages:
    3,278
    Likes Received:
    66
    Location:
    Mountain View, CA
    7168 M2050s, custom interconnect between nodes, running a lot of software and driver code that I wrote (hooray!). I wasn't personally super involved with it, despite having my fingers in every Tesla software pie.

    Not sure how much I can actually say, but if anyone has specific questions I'll see what I can do.
     
  3. Florin

    Florin Merrily dodgy
    Veteran Subscriber

    Joined:
    Aug 27, 2003
    Messages:
    1,707
    Likes Received:
    345
    Location:
    The colonies
    According to Wikipedia:
    Congrats on being part of the foundation of such a prestigious project :)

    I imagine keeping such a beast fully utilised can be a challenge, and the hybrid nature probably means you would use multiple programming models in parallel? Something like OpenMP for the CPU side of things and specific CUDA jobs to utilise the Teslas?
     
  4. Tim Murray

    Tim Murray the Windom Earle of mobile SOCs
    Veteran

    Joined:
    May 25, 2003
    Messages:
    3,278
    Likes Received:
    66
    Location:
    Mountain View, CA
    Thanks, it's pretty cool. :D

    All of these clusters are generally MPI in some form to handle inter-node (and inter-core) communication and CUDA to offload parts of that to the GPU. You might see some other parallel libraries and the like (e.g., that Linpack run probably ran MKL on the CPUs in parallel to CUBLAS on the GPUs), but the ten-thousand foot view is MPI + CUDA.
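    The division of labour Tim describes can be sketched with a toy example. This is a plain-Python stand-in, not real MPI or CUDA code: the "scatter/reduce" function mimics what MPI does across nodes, and `local_offload` mimics the per-node kernel (CUBLAS, a CUDA kernel, etc.) that crunches one rank's slice. All names here are illustrative.

```python
# Toy stand-in for the MPI + CUDA pattern: partition work across ranks
# (MPI's job), offload each rank's slice to a local accelerator (CUDA's
# job), then reduce the partial results. Both layers are simulated
# serially in plain Python; this is a sketch, not a real cluster API.

def local_offload(slice_):
    # Stand-in for a CUDA/CUBLAS kernel working on one node's local data.
    return sum(x * x for x in slice_)

def run_cluster(data, n_ranks):
    # Stand-in for MPI: scatter contiguous chunks to ranks, then reduce.
    chunk = (len(data) + n_ranks - 1) // n_ranks
    partials = [local_offload(data[r * chunk:(r + 1) * chunk])
                for r in range(n_ranks)]
    return sum(partials)  # the MPI_Reduce step

total = run_cluster(list(range(10)), n_ranks=4)  # sum of squares 0..9
```

    The point of the sketch is that the answer is independent of how many ranks the data is split across, which is exactly why the MPI decomposition and the per-node offload can be developed as separate layers.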
     
  5. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,802
    Likes Received:
    3,921
    Location:
    Germany
    What I find interesting - and possibly subject to an upcoming CD-rant - is that the first installation of Tianhe was powered by RV770 GPUs. Now, I can see how upgrading to a more flexible model, as offered by the latest graphics processors, would make all the sense in the world, especially if you also get (way) more FLOPS out of it. But isn't it kind of unusual for an existing supercomputer to switch the source of its main crunching power?

    Or is it a new installation in parallel to the original Tianhe-1, even though you could still use the same infrastructure and only switch CPUs and PCIe cards (power should be roughly within the same per-node budget, I guess)?

    Anyway, this system should have 0.515 × 7168 + 14336 × 0.14 TFLOPS of theoretical peak if I am not mistaken, and 2.5 PFLOPS of achieved LINPACK throughput is a small step up from Nebulae's 42.5% efficiency, but still not catching up to Tianhe-1's 46.7%.
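    That back-of-the-envelope arithmetic checks out. Using the post's own assumed per-part numbers (0.515 TFLOPS DP per M2050, 0.14 TFLOPS per CPU - these are the poster's estimates, not official specs):

```python
# Peak and LINPACK efficiency estimate for Tianhe-1A, using the figures
# assumed in the post above (not official vendor specifications).
GPU_COUNT, GPU_TFLOPS = 7168, 0.515   # Tesla M2050, DP peak per card
CPU_COUNT, CPU_TFLOPS = 14336, 0.14   # per-CPU estimate from the post

peak_tflops = GPU_COUNT * GPU_TFLOPS + CPU_COUNT * CPU_TFLOPS  # ~5698.6
linpack_tflops = 2500.0               # 2.5 PFLOPS measured LINPACK

efficiency = linpack_tflops / peak_tflops  # ~0.439, i.e. ~43.9%
```

    So on these assumptions the machine would land at roughly 43.9% LINPACK efficiency, which is indeed between the Nebulae (42.5%) and Tianhe-1 (46.7%) figures quoted above.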

    I wonder, why did they do the upgrade then? Better use of resources cannot be the reason, unless LINPACK is not representative of typical workloads - in which case they should omit the benchmark.
    They could have gotten the same theoretical DP GFLOPS by using HD 5870s, or almost double that with HD 5970s (they seem to have used consumer-level boards before) - and without upgrading the CPUs either. Plus, that would have cleared much of AMD's (and possibly their partners') inventory of Cypress products.
     
  6. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189
    Maybe because without ECC, you can't actually use a big cluster for serious work?
     
  7. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,802
    Likes Received:
    3,921
    Location:
    Germany
    But didn't they know that before building Tianhe-1?
     
  8. Psycho

    Regular

    Joined:
    Jun 7, 2008
    Messages:
    746
    Likes Received:
    41
    Location:
    Copenhagen
    Seems strange if they have replaced both CPUs and GPUs, and probably a lot of software, in a one-year-old machine instead of building a new one next to it.
     
  9. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    LINPACK is, IMHO, only representative of very specific workloads. Actually, since it's almost the easiest benchmark to optimize and parallelize while still putting reasonable pressure on multiple parts of a computer (unlike, for example, some easy-to-optimize benchmarks which depend only on ALU throughput or memory bandwidth), one can see it as an "upper limit" on the performance of a computer. That is, for typical workloads, you'd be very lucky to reach LINPACK performance. This makes LINPACK a better performance measure than the so-called peak performance, because it can be seen as a "real-world peak."

    However, anything more complex is very difficult to say. Some have tried to develop broader HPC benchmark suites, but they didn't take off (mostly because it's already very difficult and expensive to dedicate a whole supercomputer to one specific benchmark, let alone several probably irrelevant ones - and if not enough supercomputers run your benchmark, it becomes irrelevant anyway).
     
  10. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    I think a short answer would be that CUDA was the way to go for HPC from the beginning, but GT200's DP performance was so dreadful that they had to pick AMD.

    Now, with Fermi, DP is half-rate, and you get ECC on top of it.


    A longer answer would also involve the fact that these deals don't usually bring much money to chip manufacturers, they're more of a prestige and PR thing. And while NVIDIA is pushing HPC very hard and probably sitting on quite a bit of GF100 inventory, AMD is less aggressive on that front, and tight on supply. So NVIDIA must have made a better offer, which probably includes extensive support.
     
  11. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    And yet, with all that peak DP performance, the GPUs can only pull off ~60ish% of the flops in perhaps the world's easiest-to-optimize workload: LINPACK.

    Considering the considerable research now showing that GPUs do significantly worse in other workloads, it looks like, outside of running LINPACK, the GPUs will largely take a back seat to the CPUs.
     
  12. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,802
    Likes Received:
    3,921
    Location:
    Germany
    How do current GPUs fare in that respect? The latest I've seen were comparisons based on quad-core Xeons vs. GT200, which was just described above as having dreadful DP performance.
     
  13. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    NVIDIA claimed that 2X Tesla C2050 + 2X Xeon X5550 produces 656.1 GFLOPS in LINPACK. 2X Xeon X5550 alone produces 80.1 GFLOPS.
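    Those claimed numbers can be turned into rough efficiency figures, which is where the "~60ish%" above comes from. The peak figures assumed here (515 GFLOPS DP per C2050; Xeon X5550 = 4 cores × 2.66 GHz × 4 DP FLOPs/cycle) are my estimates, not from the thread:

```python
# Rough efficiency derived from NVIDIA's claimed node-level LINPACK
# results. Peak figures are assumptions: ~515 GFLOPS DP per Tesla C2050,
# and 4 cores x 2.66 GHz x 4 DP FLOPs/cycle per Xeon X5550.
gpu_peak = 2 * 515.0           # two C2050s, GFLOPS DP
cpu_peak = 2 * 4 * 2.66 * 4    # two X5550s, GFLOPS DP (~85.1)

hybrid_linpack = 656.1         # claimed: 2x C2050 + 2x X5550
cpu_only_linpack = 80.1        # claimed: 2x X5550 alone

hybrid_eff = hybrid_linpack / (gpu_peak + cpu_peak)  # ~0.59
cpu_eff = cpu_only_linpack / cpu_peak                # ~0.94
```

    On these assumptions the CPU-only run hits roughly 94% of peak, while the hybrid node manages only about 59% - the gap behind the GPUs' theoretical FLOPS that the following posts argue about.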
     
  14. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,802
    Likes Received:
    3,921
    Location:
    Germany
    What I meant were the "other workloads" that Aaron was speaking of.
     
  15. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    FFT and SpMV. Lots of data, results, and comparisons are available from the various PRACE reports in Europe.
     
  16. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,802
    Likes Received:
    3,921
    Location:
    Germany
    Right, thanks a lot. I'll read through the gazillions of papers there in order to see if my rather simple question gets answered.

    So far, my latest knowledge on recent GPUs consisted of what was discussed over here: http://forum.beyond3d.com/showthread.php?t=57839 - wrt PRACE, especially this one: http://www.prace-project.eu/documents/christadler.pdf

    Now, that is still comparing GT200-type chips - far less DP-powerful, far less programmable, far less memory. So, IMHO, it does not reflect the abilities of current hardware.

    edit: the PDF you linked in the other thread also deals only with last-generation hardware. CPUs have gone up what, 50% since then? And GPUs? 660%, plus better on-die infrastructure for compute. I hope this makes it a bit clearer why I asked about performance numbers for current hardware.
     
  17. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    949
    Likes Received:
    419
    Branch-free crunching power only gets you so far. If you could show that long, unpredictable, erratic, complex code runs faster on GPUs than on CPUs (which mostly is still not the case), you'd have a point. CPUs can chew through enormous amounts of instructions; instruction fetch is now almost on par with data fetch. GPUs are simply not made for that (think a 20 MB framebuffer with a 20 MB instruction stream, or in general situations where the program is bigger than the data), and neither are the GPU compilers.

    To me (for my research) the turning point would be if I could process random Markov fields faster on GPUs than on CPUs, and I still can't see that happening any time soon. And when it happens, I don't want to mangle my already complex code in such obtrusive ways that after one day passes I can't remember how and why the code works (on switch(GPU) case x: case y: case z: exit(666)). :-|
     
  18. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,802
    Likes Received:
    3,921
    Location:
    Germany
    ... and that's why I was asking if there are any real-world (or not) performance numbers besides the old ones from previous generations and the purely theoretical FLOPS.
     
  19. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    949
    Likes Received:
    419
    The advocacy for GPUs is really, let's call it: abundant. Because you know the FLOPs are there; if you think you're a good programmer, you think there must be a way to get them out; if you don't get them, you just don't tell anybody; if you have success, you spread the religion. It's not that the disappointment makes us CPU advocates, telling everybody how nice the CPUs are. We do want the FLOPS, they are there, we just don't get them - like the donkey and the carrot. Why would you write a paper on failure? ;)

    Apart from a very few people, or CPU-only manufacturers (Intel occasionally, for example, or possibly Sun in the past with DB servers), you won't find any papers on that, I suppose. I'd appreciate, just for the fun of reading, a "Why does LINPACK reach X% utilization on GPU" vs. a "Why does LINPACK reach X% utilization on CPU". What you will find is a lot of papers about when the GPU works (namely "Beyond Programmable Shading"), so it's up to you to find out when it does not.

    As far as my experience with the 1950 FireStream and the 5870 goes, only the most trivial algorithms translate well into GPU territory. Essentially it means only the inner hot-spot loops go over at all. Dade (Lux) also has first-hand experience with that; the executable is decidedly not just a bootstrapper for one big GPU process.

    Don't get me wrong, I'm not saying the algorithms running on GPUs are trivial in an absolute sense, or easy math; but they are algorithmically trivial compared to the extent of complexity you can reach with, e.g., state machines (which don't run well on GPUs at all - zlib, anyone? HTML DOM tree-walk on GPU?), which perform well on CPUs and which honestly form the vast majority of all code. And with the forthcoming implementation of transactional memory and the connected speculative execution (with speculative writes), GPUs will have it even harder to compete in the fast-executed, low-throughput, high-algorithmic-complexity space.

    Anyway, there is a lot of enthusiasm, and a lot of disappointment. And we still have a long way to go. :)
     