Supercomputers - What's the current state?

Where's a good place to learn about supercomputers? Since Cray and its famous line "you can't fake bandwidth" (regarding why the Cray-1 had no cache), what's been happening?

Cray seems to be no longer number one; the Earth Simulator has taken that trophy. IBM's ASCI Purple and Blue Gene/L are meant to take it back though, at 100 and 360 teraflops respectively.

What I'd like to learn more about is how these computers differ organisationally from PCs, workstations and commercial servers. How do Intel-based supercomputers fare against ones which employ custom chips, e.g. POWER4 or Alpha?

What about the software? On PCs, SMP doesn't really scale linearly for most apps, but I guess the scientific workloads of supercomputers are different. What kind of algorithms and methodology do they employ to get supercomputers to work optimally on data in SMP?

Finally, the whole 'on demand'/pervasive/autonomous/utility/grid computing model. How will this affect supercomputers? How realistic is the vision of an 'internet' that not only serves data but acts as a transparent grid of massive computing power?

Looking forward to hearing from some 'key' people. ;)

Thanks!
 
I've been doing a bit of reading in the NEC architecture documentation for our SX-5 at work, and in essence (in addition to some really quite clever inter-node communication / crossbar stuff) it's a very nice vector machine (which somehow reminds me of reconfigurable logic chips).
Imagine something like SSE3 with registers that are 128K entries wide and which you can link together (with operations) to perform custom computations...

The compiler also does a big chunk of the vectorisation work (it rewrites / shuffles the C code :?), but a lot is still left to the programmer (e.g. inter-node parallelism using MPI).
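To give a flavour of what the vectoriser is after, here's a toy pair of loops (my own sketch in plain C, not from the NEC docs): the first has no loop-carried dependencies and unit-stride accesses, so it maps straight onto vector load/multiply/add/store operations; the second carries a result from one iteration to the next, which straightforward vectorisation can't cope with.

```c
/* A loop a vectorising compiler loves: independent iterations,
 * unit-stride accesses, simple arithmetic.  The whole body becomes a
 * few vector instructions working on long chunks of the arrays. */
void daxpy(long n, double a, const double *x, double *y)
{
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* A loop it struggles with: each iteration needs the result of the
 * previous one (a loop-carried dependency), so the element-at-a-time
 * semantics can't simply be widened. */
double prefix_sum(long n, double *x)
{
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        s += x[i];
        x[i] = s;   /* depends on the sum up to i-1 */
    }
    return s;
}
```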

The current fad obviously is clustered Linux machines, as they certainly give better bang for the buck, but to match the peak performance of even a relatively modest 4-node SX-5/6 you'd need quite a few machines (and thus more communication overhead).
 
Vector computers are a lot like clusters built from COTS parts ... both are relatively cheap and easy to build, and for both that comes at a cost. One can't handle very irregular problems, the other bottlenecks on communication.
 
My 0.02€.

Read the usenet groups comp.sys.super and comp.arch.

The comp.sys.super FAQ is a good place to start (info and references to the literature).

Cheers
Gubbi
 
JF_Aidan_Pryde said:
What about the software? On PCs, SMP doesn't really scale linearly for most apps, but I guess the scientific workloads of supercomputers are different. What kind of algorithms and methodology do they employ to get supercomputers to work optimally on data in SMP?

The fundamental problem with SMP on the PC is that (for Intel at least) the CPUs share a common bus. This leads to bandwidth starvation for memory-bound problems (and a lot of scientific applications are memory bound, especially if poorly written!).
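To put a number on "memory bound" (my own sketch, figures are ballpark): a STREAM-style triad does about 2 flops per 24 bytes of memory traffic, so on a shared bus the CPUs end up queueing for memory rather than computing.

```c
/* STREAM-style "triad" kernel: roughly 2 flops per iteration against
 * ~24 bytes of traffic (two 8-byte loads, one 8-byte store).  Run one
 * copy per CPU on a shared-bus SMP and the bus saturates long before
 * the CPUs run out of arithmetic -- adding processors barely helps. */
void triad(long n, double a, const double *b, const double *c, double *out)
{
    for (long i = 0; i < n; i++)
        out[i] = b[i] + a * c[i];
}
```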

Most large single-system-image machines these days are NUMA architectures; you basically have nodes of maybe 2/4/8 processors connected by some other interconnect technology. This means you have more busses to go around, but it makes cache coherence more tricky. Sun and SGI have certainly gone this way, as has HPQ AFAIK. The mid-range IBM P-series machines are still true SMP IIRC.

The AMD/Opteron way of doing things is much more like NUMA than bus-snoop SMP, so in principle should scale much better than Xeon-based systems. It's going to be interesting watching how well Opteron scales to 8x, 16x and 32x systems (all of which are being developed to some degree by certain companies).

Either way, in theory you have to pay quite close attention to your data placement within the machine if you want top-notch performance.
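For instance (a minimal sketch, assuming a first-touch page-placement policy as Linux and IRIX use by default, with OpenMP for the threading): if the threads that will do the work also initialise the data, each thread's slice of the array ends up in its own node's local memory.

```c
#include <stdlib.h>

/* "First touch" placement on a NUMA box: malloc() only reserves address
 * space, and each page is physically allocated on the node of the CPU
 * that first writes it.  Initialising in parallel, with the same static
 * schedule the compute loops will use, puts each thread's slice of the
 * array in that thread's local memory. */
double *alloc_and_place(long n)
{
    double *x = malloc(n * sizeof *x);
    if (!x)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        x[i] = 0.0;   /* first touch decides which node owns the page */

    return x;
}
```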

Finally, the whole 'on demand'/pervasive/autonomous/utility/grid computing model. How will this affect supercomputers? How realistic is the vision of an 'internet' that not only serves data but acts as a transparent grid of massive computing power?

IMO the problem with the Grid is not technical, it's sociological. What sort of organisations are going to spend lots of money on computing kit which they then effectively give away?

Furthermore the Grid is, and for the foreseeable future will be, network bound. Grid-like processing works well for problems where you need a lot of CPU cycles and only small data transfers (think of any of the @Home projects). For problems where data volumes are large, I've yet to be convinced that Grid computing will take off outside of organisations that can afford a connection to Internet2 :)
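Back-of-envelope (my own made-up figures, purely illustrative): compare the time to ship a job's input over a typical home link against the time the job takes to run, and the @Home style of workload is the only one that comes out ahead.

```c
#include <stdio.h>

/* Is it worth shipping a job to a remote Grid node?  Compare transfer
 * time against compute time.  All figures are illustrative assumptions,
 * not measurements. */
int main(void)
{
    const double link_Bps   = 1e6 / 8.0;  /* ~1 Mbit/s link, in bytes/s */
    const double node_flops = 1e9;        /* ~1 Gflop/s worker node */

    /* @Home-style unit: a few hundred kB of data, hours of crunching */
    double small_bytes = 350e3, small_work = 4e13;
    /* data-heavy job: gigabytes in, comparatively little work per byte */
    double big_bytes = 2e9, big_work = 1e12;

    printf("@Home-style: %6.0f s to transfer, %6.0f s to compute\n",
           small_bytes / link_Bps, small_work / node_flops);
    printf("data-heavy : %6.0f s to transfer, %6.0f s to compute\n",
           big_bytes / link_Bps, big_work / node_flops);
    return 0;
}
```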
 
PC-Engine said:
Cray is working on Red Storm which is based on 10,000 Opterons.

Yeah, as far as I can tell that machine is distributed memory; more so than some of the other ASCI machines (eg. ASCI White is a large cluster of 16-way SMP nodes IIRC).

Where things get really interesting is if someone can make shared-memory (by which I mean either UMA or NUMA) Opteron systems with 16 or 32 processors. In principle these could give stonking price-performance. I wish SGI would do this, their interconnect is pretty amazing stuff, but I fear they're too wedded to Itanium2.
 
nutball said:
PC-Engine said:
Cray is working on Red Storm which is based on 10,000 Opterons.

Yeah, as far as I can tell that machine is distributed memory; more so than some of the other ASCI machines (eg. ASCI White is a large cluster of 16-way SMP nodes IIRC).

Where things get really interesting is if someone can make shared-memory (by which I mean either UMA or NUMA) Opteron systems with 16 or 32 processors. In principle these could give stonking price-performance. I wish SGI would do this, their interconnect is pretty amazing stuff, but I fear they're too wedded to Itanium2.

The problem with Opterons and large single-system-image machines is that the Opteron uses broadcasts to maintain cache coherence. When an Opteron needs to obtain ownership of a cache line (i.e. it misses in its caches), it sends a request to the memory controller for that line. However, there is the possibility that the requested chunk of data is sitting in the caches (or memory) of another CPU in the system, so it also broadcasts to the other CPUs and asks whether they hold the cache line. If one of them does, it sends the cache line and the local memory request is cancelled.

This means that broadcast traffic rises sharply as you add CPUs to the system: every CPU broadcasts to every other CPU. AMD has obviously chosen 8 CPUs as the pain threshold above which scalability suffers.
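Rough numbers (my own back-of-envelope arithmetic, not AMD's): per miss, a broadcast scheme probes every other CPU, so total probe traffic grows roughly as N squared, whereas a directory only involves the home node plus the actual sharers and grows roughly linearly.

```c
#include <stdio.h>

/* Back-of-envelope coherence traffic: broadcast snooping probes every
 * other CPU on every miss (total ~N*(N-1)), while a directory scheme
 * sends a bounded number of messages per miss (home-node lookup plus
 * the few real sharers, total ~N*const).  Illustrative only. */
int main(void)
{
    const double misses_per_cpu = 1.0;  /* normalised miss rate */
    const int avg_sharers = 2;          /* assumed typical sharer count */

    for (int n = 2; n <= 64; n *= 2) {
        double broadcast = n * misses_per_cpu * (n - 1);
        double directory = n * misses_per_cpu * (1 + avg_sharers);
        printf("%2d CPUs: broadcast %6.0f probes, directory %6.0f messages\n",
               n, broadcast, directory);
    }
    return 0;
}
```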

What this means is that the bridge chips used to glue the 1-8 CPU subsystems together need to use a cache-coherency method that scales better as CPUs are added, something like directory-based cache coherence (Alpha EV7 and the SGI Origin 2000/3000s).

So you end up with boutique hardware gluing the system together and your cost advantage goes out the window.

As for the chances of AMD adding directory based CC to the Opteron, I consider those very slim. The reason the Opteron is so competitive in price/performance terms is because it shares everything (core, process, infrastructure) with the mass market Athlon 64/FX series. Adding cost to those to lower compound cost for the über high end is probably a losing proposition.

Cheers
Gubbi
 
Gubbi said:
So you end up with boutique hardware gluing the system together and your cost advantage goes out the window.

As for the chances of AMD adding directory based CC to the Opteron, I consider those very slim. The reason the Opteron is so competitive in price/performance terms is because it shares everything (core, process, infrastructure) with the mass market Athlon 64/FX series. Adding cost to those to lower compound cost for the über high end is probably a losing proposition.

Yeah, you may be right. One thing I'm not certain about is whether the high costs associated with ccNUMA interconnects (e.g. NUMAlink, or Craylink as it was) are intrinsic to the technology, or merely driven by the size of the market for the machines they've traditionally been used in.
 
A glueless 8-processor node is pretty good already ... it's not like their competition can build completely glueless cache-coherent systems. That said, personally I don't think of CC systems when I think of supercomputers.
 
Well, all the Opteron-based supercomputers are built from 1- or 2-way nodes because that's where the price/performance metric is optimal.

I'd consider the big Origins (128+ CPUs) supercomputers. Clusters don't count... AT ALL.

Machines like Red Storm are borderline: they have a single memory space but no cache coherence, so they still rely on MPI (or similar) to tie the nodes together. But of course, since packet routing is done directly on the memory addresses, it gets excellent latency and bandwidth.
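For flavour (a minimal sketch, not Red Storm's actual code): with no cache coherence between nodes, nothing is shared implicitly, so data only moves when the program says so, e.g. with an explicit MPI collective like this.

```c
#include <mpi.h>
#include <stdio.h>

/* Each rank computes a partial sum over its own share of the indices,
 * then an explicit collective combines the results on rank 0.  If a
 * node wants another node's data, a message has to be sent -- there is
 * no coherent shared memory doing it behind the scenes. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = 0.0;
    for (int i = rank; i < 1000000; i += nprocs)  /* this rank's share */
        local += 1.0 / (1.0 + i);

    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f (from %d ranks)\n", total, nprocs);

    MPI_Finalize();
    return 0;
}
```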

Cheers
Gubbi
 
MfA said:
Meh, if clusters are out then so are vector computers :)

Surely you're not putting an SX-6 and a bunch of Macs laced together with shoestring in the same league :)

Cheers
Gubbi
 
AFAIK Red Storm is a kind of MPP machine, like the Earth Simulator. They have lower latency than most clusters.

Currently the largest single-image system is probably the SGI Altix 3000 system at NASA Ames Research Center, which has 512 CPUs. It has 1 TB of memory and over 1 TB/s aggregate memory bandwidth on the STREAM benchmark.
 
Yes, Ames also has a 2048-processor Origin 3800 I believe, which runs a single system image. That's based on the older MIPS chips though, so depending on whether you want to talk about number of processors or performance, one of those two is the biggest publicly acknowledged machine.
 