Tokyo Tech Builds First Tesla GPU Based Heterogeneous Cluster To Reach Top 500


Jawed
18-Nov-2008, 01:04
SC08—AUSTIN, TX—NOVEMBER 17, 2008—The Tokyo Institute of Technology (Tokyo Tech) today announced a collaboration with NVIDIA to use NVIDIA® Tesla™ GPUs to boost the computational horsepower of its TSUBAME supercomputer. Through the addition of 170 Tesla S1070 1U systems, the TSUBAME supercomputer now delivers nearly 170 TFLOPS of theoretical peak performance, as well as 77.48 TFLOPS of measured Linpack performance, placing it, again, amongst the top ranks in the world’s Top 500 Supercomputers.

“Tokyo Tech is constantly investigating future computing platforms and it had become clear to us that to make the next major leap in performance, TSUBAME had to adopt GPU computing technologies,” said Satoshi Matsuoka, division director of the Global Scientific Information and Computing Center at Tokyo Tech. “In testing our key applications, the Tesla GPUs delivered speed-ups that we had never seen before, sometimes even orders of magnitude – a tremendous competitive boost for our scientists and engineers in reducing their time to solution.”

Speaking to the ease of implementation, Matsuoka continued, “The entire upgrade was carried out in 1 week, and the TSUBAME supercomputer remained live throughout. This is an unprecedented feat in top-level supercomputing.”

“We are honored to partner with Tokyo Tech – world famous for their supercomputing expertise and success,” said Andy Keane, general manager of the GPU Computing business at NVIDIA. “NVIDIA Tesla breaking into the Top 500 marks a milestone in supercomputing history. The massively parallel GPU is now essential for supercomputing centers worldwide.”

The first to achieve Top 500 ranking with an NVIDIA Tesla based GPU cluster, Tokyo Tech is one of hundreds of distinguished universities and supercomputing centers that have adopted GPU based solutions for research. Other leading centers include the National Center of Supercomputing Applications (NCSA) at the University of Illinois, Rice University, University of Heidelberg, University of Maryland, Max Planck Institute and University of North Carolina.

The Tesla S1070 1U GPU system is based on the NVIDIA CUDA™ parallel architecture. This architecture is accessible through an industry standard C language programming environment that allows developers and researchers to tap into the parallel architecture of the GPU more quickly and easily than any other solution shipping today.
For more information on NVIDIA Tesla S1070, please visit: http://www.nvidia.com/object/tesla_s1070

---

Pretty groovy, huh?

TiT has a load of Clearspeed processors, too, lurking somewhere within Tsubame. Wonder if they have much of a lifetime left.

http://www.clearspeed.com/newsevents/news/pressreleases/ClearSpeed_Nissho_TokyoTech.php

Jawed

rpg.314
18-Nov-2008, 01:57
How long do you think it will be before nearly all of the Top 500 are GPU accelerated? I'd say 2-3 years.

RudeCurve
18-Nov-2008, 04:54
It's interesting they did not go with Clearspeed this time around. I'm thinking Nvidia probably gave them a better deal. Clearspeed does have the new CATS 700 1U systems with comparable performance to Nvidia's 1U solution.

Rufus
18-Nov-2008, 07:10
RudeCurve: hadn't heard about that system before, but it makes for a very interesting architecture comparison.

CATS 700 (http://www.clearspeed.com/products/cats_700/): 1,100GFLOPS DP (? SP), 24GB RAM, 96GB/sec bandwidth
Tesla S1070 ( http://www.nvidia.com/object/tesla_s1070.html): ~333GFLOPS DP (~4,000GFLOPS SP), 16GB RAM, 400GB/sec bandwidth

The only thing that is roughly equivalent between the two is the amount of RAM. Clearspeed has a huge DP FLOPS advantage, while NV has a huge bandwidth advantage and probably has a large SP FLOPS advantage.

Real world numbers for these two architectures will be extremely different depending on whether the workload is compute bound or bandwidth bound.
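
To put rough numbers on the compute-bound versus bandwidth-bound point, here is a minimal roofline-style sketch. It assumes the peak DP and bandwidth figures quoted above are accurate, and it ignores everything else (host link, memory capacity, achievable versus peak rates). The `Accelerator` class and its method names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    name: str
    peak_dp_gflops: float   # peak double-precision rate, GFLOPS
    bandwidth_gb_s: float   # memory bandwidth, GB/s

    def balance(self) -> float:
        """FLOPs the box can execute per byte it can fetch (FLOP/byte)."""
        return self.peak_dp_gflops / self.bandwidth_gb_s

    def attainable_gflops(self, intensity: float) -> float:
        """Upper bound for a kernel doing `intensity` FLOPs per byte moved."""
        return min(self.peak_dp_gflops, intensity * self.bandwidth_gb_s)


# Figures as quoted in the post above; not independently verified.
CATS_700 = Accelerator("CATS 700", 1100.0, 96.0)
TESLA_S1070 = Accelerator("Tesla S1070", 333.0, 400.0)

if __name__ == "__main__":
    for dev in (CATS_700, TESLA_S1070):
        print(f"{dev.name:12s} balance = {dev.balance():.2f} FLOP/byte")
    for intensity in (0.25, 1.0, 4.0, 16.0):
        for dev in (CATS_700, TESLA_S1070):
            bound = dev.attainable_gflops(intensity)
            print(f"{dev.name:12s} at {intensity:5.2f} FLOP/byte: "
                  f"at most {bound:7.1f} GFLOPS")
```

Under these assumptions a kernel doing 0.25 FLOP per byte is capped at 24 GFLOPS on the CATS 700 and 100 GFLOPS on the S1070, while at 16 FLOP per byte the caps are the DP peaks, 1,100 versus 333 GFLOPS. The crossover depends entirely on the arithmetic intensity of the workload.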

Edit: also I wonder how long before AMD gets ATI onto the list.

pcchen
18-Nov-2008, 10:19
I don't know about the architecture of Clearspeed processors. But for GPUs, because they have relatively small internal memory (including registers and shared memory), many workloads are going to be more bandwidth bound than on a normal CPU. If Clearspeed suffers from a similar problem, then I'd say the bandwidth advantage is probably quite important.

bowman
18-Nov-2008, 12:47
Ooh, Linpack on GPUs! I didn't even know they had a package for that. I wish they'd make it available, just for novelty's sake..

I thought the lack of ECC memory was an obstacle to using GPUs in production?

ShaidarHaran
18-Nov-2008, 13:28
Ooh, Linpack on GPUs! I didn't even know they had a package for that. I wish they'd make it available, just for novelty's sake..

That'd be quite the novelty, producing incorrect results and then burning out the GPU in minutes :p

I thought the lack of ECC memory was an obstacle to using GPUs in production?

It is an obstacle, but not an insurmountable one.

Look at the GPU client for FAH. Frequent checkpoints with result verification are the answer, here. Some performance is lost of course.
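
A minimal sketch of what "checkpoint plus result verification" can look like, assuming the work splits into deterministic steps so that two runs from the same state can be compared bit for bit. This is not how the FAH client actually does it; `run_with_checkpoints` and `make_flaky_step` are hypothetical names, and the "fault" is injected artificially.

```python
import copy
from typing import Callable, Dict, List, Set, Tuple

State = List[float]
Step = Callable[[State], State]


def run_with_checkpoints(state: State, step: Step, n_steps: int,
                         max_retries: int = 3) -> State:
    """Advance `state` by n_steps, accepting a step only if two runs agree."""
    checkpoint = copy.deepcopy(state)            # last verified state
    for _ in range(n_steps):
        for _attempt in range(max_retries):
            first = step(copy.deepcopy(checkpoint))
            second = step(copy.deepcopy(checkpoint))
            if first == second:                  # runs agree: accept result
                checkpoint = first
                break
            # Runs disagree: discard both and redo from the checkpoint.
        else:
            raise RuntimeError("step failed verification after retries")
    return checkpoint


def make_flaky_step(bad_calls: Set[int]) -> Tuple[Step, Dict[str, int]]:
    """Hypothetical step that adds 1.0 to every element, but corrupts its
    output on the listed call numbers (a stand-in for a memory error)."""
    calls = {"n": 0}

    def step(s: State) -> State:
        calls["n"] += 1
        out = [x + 1.0 for x in s]
        if calls["n"] in bad_calls:
            out[0] += 0.5                        # simulated silent corruption
        return out

    return step, calls


if __name__ == "__main__":
    step, calls = make_flaky_step({3})
    print(run_with_checkpoints([0.0, 10.0], step, n_steps=4))
    print("step evaluations:", calls["n"])
```

The cost is visible in the call count: every step runs at least twice, plus a redo whenever the two runs disagree, which is where the lost performance comes from.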

bowman
18-Nov-2008, 15:50
That'd be quite the novelty, producing incorrect results and then burning out the GPU in minutes :p

Really? The so-called 'Intel Burn Test' (Linpack) runs on a single processor just fine, and reports the correct FLOPS results along with correct mathematical results as long as it's not overclocked beyond stability (so a nice stability test). Even if it's made for clusters, shouldn't it be possible to run this on a single board as well?

I really want to test the 8800GTX and get the real numbers now :lol: But I guess, perhaps this is a 64-bit GT200 only implementation..

ShaidarHaran
19-Nov-2008, 02:13
Really? The so-called 'Intel Burn Test' (Linpack) runs on a single processor just fine, and reports the correct FLOPS results along with correct mathematical results as long as it's not overclocked beyond stability (so a nice stability test). Even if it's made for clusters, shouldn't it be possible to run this on a single board as well?

I really want to test the 8800GTX and get the real numbers now :lol: But I guess, perhaps this is a 64-bit GT200 only implementation..

My point is that GPUs were not designed with such precision in mind. Where rounding errors are unacceptable in the CPU arena, they are a fact of life for GPUs.

pcchen
19-Nov-2008, 08:19
Whether the GPU was designed with that precision in mind is not important. What's important is whether it is designed with that precision.

For example, NVIDIA claims that GT200 supports full IEEE 754 precision when doing double precision, and some operations (basically add and multiply) have full precision when doing single precision. These are probably good enough for some purposes, depending on what you want to do and on your algorithms.
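
As a rough illustration of why the choice of precision matters even when every individual add is correctly rounded, the sketch below accumulates 0.1 repeatedly in single and in double precision. It runs on the CPU and simulates single precision by rounding through `struct` after each add; it does not model GT200 specifically, and the helper names are invented for this example.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (IEEE 754 double) to the nearest single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(value: float, n: int, single: bool) -> float:
    """Add `value` to a running total n times, in single or double precision."""
    v = to_f32(value) if single else value
    total = 0.0
    for _ in range(n):
        total += v
        if single:
            total = to_f32(total)   # round after every add, as float32 would
    return total


if __name__ == "__main__":
    n = 100_000
    exact = n / 10                  # 0.1 added n times, mathematically
    for label, single in (("single", True), ("double", False)):
        total = accumulate(0.1, n, single)
        print(f"{label}: total = {total!r}, abs error = {abs(total - exact):.3e}")
```

Whether the single-precision error is acceptable depends on the algorithm, which is the point being made above: a naive running sum suffers, while compensated summation or mixed-precision refinement can tolerate single precision much better.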

ECC is another problem when you are using a lot of devices in parallel and operating them continuously for a long time. This is actually something NVIDIA can do to differentiate Tesla from consumer level hardware.

3dilettante
19-Nov-2008, 16:39
It is an obstacle, but not an insurmountable one.

Look at the GPU client for FAH. Frequent checkpoints with result verification are the answer, here. Some performance is lost of course.

F@H operates in a way that makes it quite different from an HPC center.
The computation is done on a wide swath of hardware, for which:
1) maintenance of said hardware is not handled by F@H
2) maintenance of said hardware is not paid for by F@H
3) there's mostly no physical plant, and utility costs are lower
4) the economic model, such as it is, is different from the one a supercomputer for a proprietary client might have

I haven't seen figures for how much of F@H's FLOPs finally resolve to verified-result FLOPs.
I suppose we'll see. Some workloads won't mind the error, and in some cases the lowered yield due to checkpointing might not be prohibitive.
A university with academic interest in GPGPU is not the primary test of the extendability of the concept.

A flood of crap FLOPs that happens to be free doesn't look the same once it is constrained by physical and financial limits, has to be maintained, and ceases to be free.