Will future supercomputers use GPUs?


Techno+
13-Nov-2006, 18:43
Hi guys,

Do you think future supercomputers will use GPUs? Are they more programmable than the vector units in current supercomputers?

Please answer.

Thanks.

archie4oz
13-Nov-2006, 19:58
Partially, perhaps, on a few experimental ones, and no.

Rufus
13-Nov-2006, 20:06
Depends on what you call a "supercomputer". If you consider a Beowulf cluster of cheap servers a "supercomputer", then yes. Especially now with CUDA, it's easy to imagine having a cluster with as many GPUs as possible cranking out some massively parallel computation.

However, this won't come anywhere close to affecting "real" supercomputers from people like Cray. They are all about having a single, coherent image across the entire system with massive bandwidth between everything.

Maybe once (if?) there are cache-coherent GPUs from the Torrenza initiative, they could go into a large coherent system.

3dilettante
13-Nov-2006, 20:34
Hi guys,

Do you think future supercomputers will use GPUs? Are they more programmable than the vector units in current supercomputers?


It would be interesting to see a supercomputing cluster based on off-the-shelf parts with GPUs running, though I think some vector processors are more programmable in some respects than GPUs.

I think there are some possible issues in using large numbers of graphics boards.

1) Memory reliability: Graphics cards can stand to have a few transient errors. In graphics, it would probably amount to one bad pixel on one frame.
Data corruption while working on a scientific computation may or may not be acceptable, depending on what it is the GPU is working on.
Large supercomputers rely on ECC memory and checkpointing to catch memory errors.
While the error rate for a single memory module is incredibly low, systems with terabytes of RAM may experience a memory error in a matter of hours.

2) Precision: Graphics processors can get away with 32-bit precision per component. A huge amount of scientific computation likes to go with 64-bit or more.
There are ways to get around this, but they sap performance.

3) Communications: GPUs don't talk to each other all that well, at least at present. This isn't always necessary, but in cases where fast interprocessor communication is needed, it limits the utility of a large number of GPUs.

4) Specialization: GPUs aren't purely general purpose, they have a lot of hardware dedicated to doing graphics well. That's either an expense or additional power draw that may not be needed. In cases where supercomputers use custom vector engines, the GPU is specialized, but it is likely in the wrong specialty.

5)Trustworthiness/consistency: CPU implementations have variances in how they handle certain computations or corner cases, but there is a serious amount of effort made to keep them consistent with previous chips using the same ISA.
GPUs usually rely on a software layer to abstract a lot of their changes. Even then, there have been a number of cases where bugs or different mathematical behavior showed through. That would be unacceptable for those in charge of building and programming for a large system.
GPU manufacturers also have a history of sneaking in "optimizations", which would probably frighten a lot of supercomputer users, who need a lot of assurance that their numbers come out right.
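On point 2, here is a rough sketch of what "getting around" 32-bit precision can cost. It is only an illustration under stated assumptions: the helper `f32` emulates single-precision rounding in plain Python, and compensated (Kahan) summation is one workaround I picked for the example, not something any particular GPU or library is known to use. The compensated loop recovers the lost digits but spends about four floating-point operations per element instead of one.

```python
import struct

def f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def naive_sum_f32(values):
    """Accumulate in emulated 32-bit precision, one add per element."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total

def kahan_sum_f32(values):
    """Compensated summation in emulated 32-bit precision.

    Carries the rounding error of each add in `comp`, at the cost of
    four floating-point operations per element.
    """
    total = 0.0
    comp = 0.0
    for v in values:
        y = f32(f32(v) - comp)          # apply the carried correction
        t = f32(total + y)              # low-order bits of y may be lost here
        comp = f32(f32(t - total) - y)  # recover what was lost
        total = t
    return total

# Hypothetical workload: one large term followed by many tiny ones.
values = [1.0] + [1e-8] * 1000
naive = naive_sum_f32(values)   # the tiny terms vanish entirely
kahan = kahan_sum_f32(values)   # close to the true sum of 1.00001
print(naive, kahan)
```

Each 1e-8 term is smaller than half the spacing between 32-bit floats near 1.0, so the naive loop never moves off 1.0.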

MulciberXP
14-Nov-2006, 00:41
Those are valid points, but
1) I think this wouldn't be a difficult engineering hurdle to clear, and it could go into a specialty GPGPU line above the Quadro.
2) I believe this is already in the pipe for future GPUs.
3) Should be helped dramatically when Torrenza comes out (along with whatever Intel implements, if they can ever figure that out :roll: ).
4) Unneeded portions of the GPU, such as ROPs, could be disabled or underclocked.
5) I think this is a real oversimplification, considering the new direction GPUs will be taking with DX10.

Entropy
14-Nov-2006, 01:42
Hi guys,

Do you think future supercomputers will use GPUs? Are they more programmable than the vector units in current supercomputers?

Please answer.

Thanks.

No, and absolutely not.

Rufus
14-Nov-2006, 02:25
Good post 3dilettante. I was thinking along the same lines, just didn't expound on it.

In response to MulciberXP:
1) Quadros and FireGLs are at their heart identical to the consumer chips. Remember the various resistor hacks to make a consumer card think it was a workstation card. Something like making a brand new chip with a memory controller that supports ECC simply isn't feasible.
2) Future might be a year or two out. Are we talking G90/R700, or the generation after that?
3) Again, Torrenza GPUs are a year or two out, assuming they will happen.
4) I agree with MulciberXP on this. Disabling portions of a chip already exists, though who knows how fine-grained it is. If you can turn off the entire 3D pipe when sitting at the XP desktop, it shouldn't be hard to turn off the ROPs, zcull, triangle setup, or whatever else when they're unneeded.
5) I don't think you understand 3dilettante's point. Anandtech's review states "DX10 has very nearly IEEE 754 requirements" and I think I read that G80 is along those lines. There's a very good reason that the full IEEE FP spec exists with all the crazy corner cases: it actually does make a difference in certain real-world algorithms. Clamping to 0 or MAX makes perfect sense in graphics, but will destroy the results of some programs that otherwise would be perfect fits. Basically, every time you port an FP algorithm to a new platform you have to figure out if the edge cases matter for your algorithm, and if they do, how they align with the hardware you're running it on.
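To make point 5 concrete, here is a small sketch of one such corner case: clamping tiny results to zero. The function `flush_to_zero` is a hypothetical model of clamp-to-0 hardware written for this example, not the behavior of any specific GPU. With IEEE 754 gradual underflow, two different numbers always have a nonzero difference; with clamping, that guarantee is lost and a later division by the difference blows up.

```python
import sys

# Smallest positive normal double, 2**-1022.
SMALLEST_NORMAL = sys.float_info.min

def flush_to_zero(v):
    """Hypothetical model of hardware that clamps subnormal results to 0."""
    return 0.0 if abs(v) < SMALLEST_NORMAL else v

x = 1.5 * SMALLEST_NORMAL
y = 1.0 * SMALLEST_NORMAL

ieee_diff = x - y                # subnormal but nonzero under IEEE 754
ftz_diff = flush_to_zero(x - y)  # clamped to exactly 0.0

print(x != y, ieee_diff != 0.0, ftz_diff == 0.0)

# Code that guards a division with `if x != y:` is safe with ieee_diff
# but would divide by zero with ftz_diff.
```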

Basically I can see clusters of GPUs being used in the near term for two types of embarrassingly parallel algorithms (the only type that'll scale on a GPU cluster): ones where absolute precision is not required (like off-line / movie rendering) and ones where the speedup is so huge that it's worth simply running everything twice or doing some sort of extra checking.

When / if the issues listed above get resolved, I can see GPUs expanding into more traditional supercomputing architectures. However, there's one huge thing that could hold it back: the fact that they're GPUs. I'm sure it'll take a while to convince a nuclear scientist at the national labs that no, really, this thing that their son just bought to play Doom 7 really is good enough to do his calculations reliably. That's entirely a marketing issue, but not something to overlook.

MulciberXP
14-Nov-2006, 03:02
Good post 3dilettante. I was thinking along the same lines, just didn't expound on it.

In response to MulciberXP:
1) Quadros and FireGLs are at their heart identical to the consumer chips. Remember the various resistor hacks to make a consumer card think it was a workstation card. Something like making a brand new chip with a memory controller that supports ECC simply isn't feasible.


It's plenty feasible if there's a profit to be made in that market, a market which nVidia does want to expand AND dominate. And if the added complexity wasn't too much, it could again be something implemented in all chips, but only enabled on the models designed for its use. I never said I thought they'd make an entirely different chip, but they would if they could make a profit from it. Consider that most of the R&D costs would already have been spent designing the consumer model.


2) Future might be a year or two out. Are we talking G90/R700, or the generation after that?


I've only heard rumors of double-precision here on this forum


3) Again, Torrenza GPUs are a year or two out, assuming they will happen.


Well the question was about the future :P


5) I don't think you understand 3dilettante's point. Anandtech's review states "DX10 has very nearly IEEE 754 requirements" and I think I read that G80 is along those lines. There's a very good reason that the full IEEE FP spec exists with all the crazy corner cases: it actually does make a difference in certain real-world algorithms. Clamping to 0 or MAX makes perfect sense in graphics, but will destroy the results of some programs that otherwise would be perfect fits. Basically, every time you port an FP algorithm to a new platform you have to figure out if the edge cases matter for your algorithm, and if they do, how they align with the hardware you're running it on.


Again, with an open-ended question about "future" supercomputers, I don't think ruling it out completely is wise, because I don't think the obstacles are that great when looking at how advanced the G80 and DX10 architectures already are. Like Anand said in your quote, they're nearly there already.


*edit* I'd like to clarify I'm mostly thinking about this in terms of cheap clusters. But if you were designing a supercomputer and found that some G90 or R700 chip performed better in some cost/benefit analysis than a custom vector processor, you'd start thinking about it.

3dilettante
14-Nov-2006, 15:16
Those are valid points, but
1) I think this wouldn't be a difficult engineering hurdle to clear, and it could go into a specialty GPGPU line above the Quadro.
2) I believe this is already in the pipe for future GPUs.
3) Should be helped dramatically when Torrenza comes out (along with whatever Intel implements, if they can ever figure that out :roll: ).
4) Unneeded portions of the GPU, such as ROPs, could be disabled or underclocked.
5) I think this is a real oversimplification, considering the new direction GPUs will be taking with DX10.

I was more focused on the use of large numbers of graphics boards. I think it's possible that future GPUs will be more attractive for supercomputing, I just think they will have some limits in the markets they can target.

1) ECC could be added to the memory controller, though the boards with it would have to be very niche products. It's almost pointless currently to have ECC on graphics cards, which might make using a GPU less attractive.

2) 64-bit precision would be helpful; I haven't seen the roadmaps with it mentioned.

3) That's a mixed blessing. Torrenza would make cache coherence and communications faster, but it would strangle bandwidth. A chip like the G80 would be saying goodbye to maybe 3/4 of its bandwidth, depending on what the motherboard's slots are populated with.

4) A chip that can have so much turned off on a whim will probably need much more complicated scheduling hardware and internal organization.
There are also significant design trade-offs that are very targeted at accelerating graphics work, so making the hardware broader could lead to overdesigned functional units.
After all that, if a significant fraction of every chip is turned off, it begs the question: "why not just get a chip that doesn't have all that stuff in the first place?"

I'm sure streaming processors that are related distantly to GPUs could make a significant impact, but that's not enough to say they are GPUs if they don't do graphics very well.

Supercomputers that already use specialized vector engines are also rather immune, unless you want GPUs that have about a thousand specialized units that are almost always turned off.

Unless GPUs can find a way to completely swamp special-purpose vector processors with somewhat wasteful peak resources (possible, if GPU volumes in other markets can fund very brute-force chips), that portion of supercomputing will be off-limits.

The high power draw of GPUs can also be a hindrance. If specialized vector engines can manage competitive performance without the rather high power draw of modern and future GPUs, then system builders can either pack more of them into a system or allocate more money from the operating budget for initial outlays.

5) GPUs rely heavily on a driver layer that abstracts a lot of their quirks. They are also rather poorly documented.

The fact that a large portion of B3d's article on the G80 is about how they wrote shaders that they used to profile the latencies and shared units in the GPU is a sign that GPU manufacturers have their priorities elsewhere when it comes to serious computation.

If supercomputers are to be a priority, all that information from that article would have been released months ahead of time. It would be bad for B3d, but good for the clients.

It also takes a significant amount of time and investment to make a name for oneself in that market, and it does mean much more stringent adherence to some kind of standard.
DX10 is way too forgiving in this regard. The GPU designers will have to set down some kind of hardware/ISA-level standard that they will adhere to, something they haven't bothered to do in the past.

Techno+
14-Nov-2006, 17:02
Don't you think that using CPUs with on-die GPUs along with vector processors would solve all those problems?

3dilettante
14-Nov-2006, 18:49
Don't you think that using CPUs with on-die GPUs along with vector processors would solve all those problems?

It would help some problems, not really matter for some, and make others worse.

ECC could be used if the processor used ECC, but that really depends on whether the manufacturer targets that market.

Precision is dependent on the design, and it doesn't really matter if the CPU is nearby or not.
ISA or design consistency may or may not be affected, it's more of a manufacturer's decision.

Communication would be faster, but memory bandwidth would go way down.

The problem with too much unused silicon/power draw is something that would likely get worse when the CPU, GPU, and other units are put on the same die, because it's not easy to manufacture a thousand different chips that are all just a little bit different. That leaves a small number of bigger chips with parts that will only occasionally get used.

DemoCoder
15-Nov-2006, 00:51
You don't really need ECC as long as you're willing to burn extra nodes on redundancy, or you checkpoint frequently. Even the most robust supercomputer CPUs aren't fully protected against soft errors from cosmic rays et al. Even those that have ECC on L1/L2/L3 and main memory still run the risk of a soft error hitting a register file, data FIFO, or other part of the system. The only difference is the length of time before you'll see the error. IIRC, a system with 1TB of memory has an expected single-bit ECC event on the order of every 10-100 hrs. With on-chip memory typically a factor of 1,000 smaller, you might expect a cache failure every 10,000-100,000 hrs, which at the low end is about once a year. That is pretty much what people saw on Sun E4500/10000 systems back when they didn't have ECC-protected cache. But for long-running supercomputer simulations, where the failure rate on cache memory is so low, you could get by with checkpointing and verification, especially since many of the exponential algorithms being run have a slow search but a fast certificate.

Putting that aside, if you're willing to burn a small constant factor of performance, you can compute in the presence of errors or untrustworthy nodes. This is exactly how the distributed computing cluster projects work, and it's how Google runs their internal clusters. Thus, one could run a cluster of 1,000 G80s, as long as one is willing to accept a factor of 1/N drop for verification. This may not be such a big deal if your cluster is a factor of 100,000 cheaper than a traditional supercomputer.
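A minimal sketch of that verification idea, assuming each work unit is simply run on N replicas and accepted on a strict majority. The function names and the voting rule are illustrative assumptions, not how any particular distributed computing project or Google's clusters actually do it.

```python
from collections import Counter

def verified_result(replica_results):
    """Return the strict-majority answer among replicas, or None if there is none."""
    if not replica_results:
        return None
    winner, votes = Counter(replica_results).most_common(1)[0]
    return winner if votes > len(replica_results) / 2 else None

def effective_throughput(nodes, replicas):
    """Useful node-equivalents left when every work unit runs `replicas` times."""
    return nodes / replicas

# One replica suffered a bit flip; the other two outvote it.
print(verified_result([42, 42, 41]))
# No agreement at all: the work unit has to be rerun.
print(verified_result([1, 2, 3]))
# 1,000 nodes with every unit computed twice behave like 500 trusted nodes.
print(effective_throughput(1000, 2))
```

With only two replicas a mismatch can be detected but not resolved, so the unit is recomputed; three replicas let the cluster keep going at a third of the raw throughput.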

silent_guy
15-Nov-2006, 04:34
1) Memory reliability: Graphics cards can stand to have a few transient errors. In graphics, it would probably amount to one bad pixel on one frame.

1 bad pixel per frame would be completely unacceptable. A GPU can context switch its state to memory when switching between threads, so it also stores state in external memory.

If memory were as unreliable as 1 pixel per frame, it would also make the saved state unreliable. Upon swapping the context back into the GPU, mayhem would be unavoidable: in general, it is virtually impossible to prevent deadlocks in hardware statemachines in the presence of uncertainty, unless error recovery mechanisms are in place.

It's of course not impossible that GPUs have Reed-Solomon-like error recovery algorithms built in to avoid this, but it's much easier to simply design a memory interface that's as reliable as, say, a CPU's, because that's usually where trouble arises. Memory corruption due to cosmic rays is really not that common.

Rufus
15-Nov-2006, 06:52
1 bad pixel per frame would be completely unacceptable.
Note he said "one bad pixel on one frame", not on every frame. You're right that memory errors aren't very common, and I'd expect it to be on the order of 1 corrupt pixel every year. The problem is when you have a cluster of 1000 units and that becomes 1 corrupt result in your scientific calculation roughly every 8 hours. A consumer's GPU randomly locking up due to memory corruption once a year goes basically unnoticed. It's when you have a large number of them, as you would need in a cluster, that it becomes a problem.
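The arithmetic behind that estimate, as a small sketch. It assumes errors on different units are independent and arrive at a constant rate (a Poisson model), which is my simplification for illustration; the one-error-per-year figure is the guess from the post above, and the 24-hour job length is a hypothetical value.

```python
import math

HOURS_PER_YEAR = 8760

def cluster_mtbe_hours(per_unit_mtbe_hours, units):
    """Mean time between errors across the whole cluster."""
    return per_unit_mtbe_hours / units

def prob_error_during_job(job_hours, mtbe_hours):
    """Chance of at least one error during a job, under the Poisson assumption."""
    return 1.0 - math.exp(-job_hours / mtbe_hours)

# One error per unit per year, spread over 1000 units.
mtbe = cluster_mtbe_hours(HOURS_PER_YEAR, 1000)
# Hypothetical 24-hour job on that cluster.
p = prob_error_during_job(24, mtbe)
print(mtbe, p)
```

So 8760 hours divided across 1000 units is about 8.76 hours between errors, and a day-long run is more likely than not to be hit at least once.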

nutball
15-Nov-2006, 10:54
The problem with too much unused silicon/power draw is something that would likely get worse when the CPU, GPU, and other units are put on the same die, because it's not easy to manufacture a thousand different chips that are all just a little bit different. That leaves a small number of bigger chips with parts that will only occasionally get used.

From the perspective of flexibility and commoditization it seems to me that the 'Torrenza' philosophy is more attractive than on-die vector-style processors, at least in the immediate future.

For starters it allows the vector unit to have its own memory bus and local memory whilst remaining a first-class peer of the general-purpose CPU as far as a uniform shared-memory address space is concerned. More aggregate bandwidth and less contention is always a good thing it seems to me.

Secondly, it obviates the need for any Special Edition CPUs. Lastly, it allows more flexibility in the configuration of the nodes in a cluster, e.g. in a quad-socket box you could choose to have 1 CPU + 3 VPUs, 2+2, or 3+1, depending on the customer's workload.

3dilettante
15-Nov-2006, 14:29
You don't really need ECC as long as you're willing to burn extra nodes on redundancy, or you checkpoint frequently.

ECC isn't absolutely necessary, but it's very desirable. When Virginia Tech (I think that was them) got their G5-based supercomputer cluster from Apple, the first round of systems came without ECC DRAM. The contract specifically called for an upgrade with ECC support once the necessary chipset came out.

Reducing the frequency of checkpointing and increasing the maximum utilization of all nodes is very desirable.


Even the most robust supercomputer CPUs aren't fully protected against soft errors from cosmic rays et al. Even those that have ECC on L1/L2/L3 and main memory still run the risk of a soft error hitting a register file, data FIFO, or other part of the system. The only difference is the length of time before you'll see the error.


Cosmic ray hits and alpha decay in the substrate are the reasons why ECC is so desirable.

It's all about containment and extending the mean time before errors. Most CPUs only have parity protection on-chip. Upon detecting an error, they go back to RAM, which is why it is helpful to have ECC on the memory pool.


IIRC, a system with 1TB of memory has an expected single-bit ECC event on the order of every 10-100 hrs. With on-chip memory typically a factor of 1,000 smaller, you might expect a cache failure every 10,000-100,000 hrs, which at the low end is about once a year.

A cluster of off-the-shelf CPUs might go as far as quad-SLI 1GB cards per board. With a thousand nodes, that's 4TB. This could mean a cosmic ray will flip a bit in something like 2-25 hours.
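For what it's worth, the scaling behind that estimate can be written out as below. This is a back-of-the-envelope sketch that assumes the error rate grows linearly with the amount of memory and takes the quoted 10-100 hours per terabyte as given; the function name is made up for the example.

```python
def hours_between_bit_errors(memory_tb, per_tb_hours=(10.0, 100.0)):
    """Scale the quoted per-terabyte interval to a larger memory pool.

    Assumes the error rate is proportional to the amount of memory.
    """
    low, high = per_tb_hours
    return (low / memory_tb, high / memory_tb)

cards_per_node = 4   # quad-SLI
gb_per_card = 1
nodes = 1000
total_tb = cards_per_node * gb_per_card * nodes / 1000  # decimal terabytes

print(total_tb, hours_between_bit_errors(total_tb))
```

Four terabytes gives an interval of 2.5 to 25 hours, in line with the rough 2-25 hour figure.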

I don't think checkpointing is free in wall-clock terms, and the minimum time to failure shrinks so fast that it becomes more than just an annoyance.
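One way to put a number on that cost is Young's classic first-order approximation for the checkpoint interval, sqrt(2 x checkpoint cost x mean time between failures). That formula is brought in here purely as an illustration and is not something from this thread; the 5-minute checkpoint cost is a hypothetical value, and the overhead figure below ignores the time lost redoing work after a failure.

```python
import math

def young_interval_hours(checkpoint_cost_hours, mtbf_hours):
    """Young's approximation for the compute time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)

def checkpoint_overhead_fraction(checkpoint_cost_hours, interval_hours):
    """Share of wall-clock time spent writing checkpoints (rework not counted)."""
    return checkpoint_cost_hours / (interval_hours + checkpoint_cost_hours)

cost = 5.0 / 60.0  # hypothetical: 5 minutes to write one checkpoint

for mtbf in (25.0, 2.5):  # the two ends of the 2-25 hour range above
    interval = young_interval_hours(cost, mtbf)
    overhead = checkpoint_overhead_fraction(cost, interval)
    print(mtbf, round(interval, 2), round(overhead, 3))
```

Under these assumptions, dropping from 25 hours to 2.5 hours between failures roughly triples the share of time spent on checkpoints, from about 4% to about 11%.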


Putting that aside, if you're willing to burn a small constant factor of performance, you can compute in the presence of errors or untrustworthy nodes. This is exactly how the distributed computing cluster projects work, and it's how Google runs their internal clusters. Thus, one could run a cluster of 1,000 G80s, as long as one is willing to accept a factor of 1/N drop for verification. This may not be such a big deal if your cluster is a factor of 100,000 cheaper than a traditional supercomputer.

With a system that supports ECC, you can avoid a lot of that 1/N drop in performance. It seems like a worthwhile investment.