Nvidia GT300 core: Speculation

And if you've got much more than a 3x performance improvement by using the GPUs, it's not a problem, is it?

It can still be a problem.

GPUs don't exist in isolation.

Each additional board is going to mean additional networking, CPU, rack space, power, and system maintenance overhead.

It's not just ECC at that point, though it would go a long way in protecting against common transient errors and also really speed up the process of isolating and replacing failing boards.
 
My point was this: if you're comparing against an ECC system and aiming for a specific performance target, it is entirely conceivable that, even with a 3x vote, a GPU-accelerated system could hit that target while being smaller, more power-efficient, and cheaper than a system that doesn't use GPUs (and relies on ECC instead of a software-based error-correction setup).
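To make that arithmetic concrete, here is a quick back-of-envelope sketch (Python; every number in it is a made-up assumption, only the shape of the calculation matters):

# Back-of-envelope sizing sketch: all figures below are hypothetical.
TARGET = 1000.0          # required throughput, in arbitrary "CPU-node units"
cpu_node_perf = 1.0      # baseline: one ECC CPU node
gpu_speedup = 30.0       # assumed raw GPU speedup per node for this workload
redundancy = 3.0         # 3x vote: every chunk computed three times
vote_overhead = 0.05     # assumed 5% lost to the compare/vote step

gpu_node_perf = cpu_node_perf * gpu_speedup / redundancy * (1.0 - vote_overhead)
print(TARGET / cpu_node_perf)   # ~1000 CPU-only nodes
print(TARGET / gpu_node_perf)   # ~105 GPU nodes

With these invented numbers the GPU cluster needs roughly a tenth of the nodes, so even if a GPU node costs a few times more and draws more power, the total system can still come out smaller, cheaper, and lower-power.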
 

An ECC system meaning what: a system with ECC GPUs, or a CPU-only system with ECC? (In either case, the system RAM is likely to be ECC.)

If the former, there is no conceivable way the non-ECC GPU setup comes out ahead.

If the latter, it still might not work if the system as a whole is difficult to maintain and has serious reliability issues.
 
Obviously it would be better, in principle, to have ECC-enabled GPUs. But that's not available. So in the meantime it's GPU + software error correction vs. CPU with ECC. As long as you get enough of a performance boost out of the GPU acceleration, you can still come out ahead of using a CPU with ECC to do all the processing.

Well, obviously you need to have a stable system to have a chance of it working. Presumably the best way would be to not bother with low-level error checking, but instead break the processing up into chunks that take a reasonable amount of time to produce, compute each chunk twice, and compare the two results against one another: if there's a difference, re-run the chunk at least once more.

This will obviously only work as long as the chunks are short enough such that you only get an error in a single chunk once in a great while (say, once in every 100 chunks or better), and as long as the errors are almost never identical in nature (which shouldn't be a problem). You also need software that is amenable to such breakups, but if it's run on a parallel machine, that should be the case anyway.
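A minimal sketch of that scheme (Python; process_chunk here is just a stand-in for the real, presumably GPU-accelerated computation, and the retry limit is arbitrary):

def process_chunk(chunk):
    # stand-in for the real (GPU-accelerated) computation on one chunk
    return sum(chunk)

def verified(chunk, max_retries=3):
    # compute the chunk twice and accept the result only if both runs agree;
    # two runs hitting an identical transient error is assumed to be vanishingly rare
    for _ in range(max_retries):
        a, b = process_chunk(chunk), process_chunk(chunk)
        if a == b:
            return a
    raise RuntimeError("chunk keeps failing verification; suspect bad hardware")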
 
That's begging the question.
Your claim is that as long as the GPU solution is faster, then the workload benefits.
It is not necessarily faster without ironing out details like:

How much hardware is duplicated?
How much additional network, CPU, power, and monetary cost will this duplication incur?
How much more maintenance effort is needed for the additional hardware?
How much overhead does the serializing voting step incur?
How frequently do we expect mismatched output, and how intensive is the revote?

It may not be worth it if the section of processing the GPUs improve is only one of many, and the CPU power that could have been used on the whole process is lost because GPU nodes have taken its place.
 
It seems to me you still have in mind a somewhat low-level voting system. Here's the system that I would try to build, were I given the chance:

I would have a single large cluster with a large number of compute nodes (depending upon the task). Data would be processed in chunks, with each data chunk processed twice, followed by a verification step. Presumably the optimal situation would be to have the two data chunks for each step processed by separate nodes, which would require a bit of careful programming for appropriate queuing of the verification step, but nothing too horrible. Any verification step that fails, then, would just re-insert a pair of chunks into the queue.

This all requires, of course, that the data can be divided into wholly discrete chunks that are completely independent of one another. If this is not the case, the verification obviously gets dramatically more challenging. But I doubt that it would ever be impossible.
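Something along these lines, as a toy model (Python, single process, with a thread pool standing in for separate compute nodes; all names, sizes, and retry counts are purely illustrative):

from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # placeholder for the computation that would run on one compute node
    return sum(chunk)

def run_verified(chunks, workers=8, max_rounds=5):
    results = {}
    pending = list(chunks.items())              # (chunk_id, chunk) pairs
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_rounds):
            if not pending:
                break
            # submit each chunk twice; a real scheduler would pin the two
            # copies to different physical nodes
            futures = {cid: (pool.submit(process_chunk, c),
                             pool.submit(process_chunk, c))
                       for cid, c in pending}
            retry = []
            for cid, c in pending:
                a, b = (f.result() for f in futures[cid])
                if a == b:
                    results[cid] = a            # verification passed
                else:
                    retry.append((cid, c))      # mismatch: back into the queue
            pending = retry
    if pending:
        raise RuntimeError("chunks repeatedly failed verification: %r"
                           % [cid for cid, _ in pending])
    return results

chunks = {i: list(range(i * 100, (i + 1) * 100)) for i in range(10)}
print(len(run_verified(chunks)), "chunks verified")

Any verification failure just puts the pair back in the queue, which is the behaviour described above; the hard part in practice is only the bookkeeping to keep the two copies on different nodes.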
 
My point was this: if you're comparing against an ECC system and aiming for a specific performance target, it is entirely conceivable that, even with a 3x vote, a GPU-accelerated system could hit that target while being smaller, more power-efficient, and cheaper than a system that doesn't use GPUs (and relies on ECC instead of a software-based error-correction setup).

But the problem is, you don't know where the bottleneck is once you're outputting 3x the data and then doing a comparison. For all we know, moving the data and doing the comparison could take as long as it normally takes to just do it on a pool of CPUs.

Then there is the additional heat and power to consider: does adding the GPUs mean you have to significantly scale back the number of machines?

Nor do we know whether the speed ups are for the whole workload or a small part of it. Nor do we know how correct the results from the GPU portion are.
 
Presumably if you're disk limited, there's no reason to bother with GPU acceleration in the first place, whether ECC or no.

Sure, but as I said: it's still perfectly plausible. GPU acceleration can result in speedups of an order of magnitude or more over performing the operations on a CPU. If it just so happens that this particular processing task is amenable to such acceleration, then the lack of ECC on GPUs is at most a complicating factor that prevents full use of the GPUs and makes it a tiny bit more challenging to set up the thread scheduling. Other than that, there really is no problem.
 
Maintaining distributed clusters these days is mostly a solved problem, and there are well-known algorithms like Byzantine Paxos which allow you to build your cluster out of really cheap components and deal with unreliable nodes. It's pretty much no work if you're using CPUs, as there are a number of out-of-the-box systems for C and Java for setting up large-scale self-healing clusters.

The issue really is:
a) needing to do it for GPU
b) how does node+GPU vs node affect your data center's power consumption and cooling systems
c) whether node+GPU delivers enough of a speedup vs buying N regular CPU nodes.
d) whether the algorithm can even be meaningfully GPU accelerated

There's just not enough information to know in this thread, but Google's entire infrastructure sits on top of Paxos and it is extremely efficient for what it does, not just for search, but for virtually everything else they do, like hosting people's applications on AppEngine. It's all super-distributed across millions of cheap PC nodes.
 
It seems to me you still have in mind a somewhat low-level voting system. Here's the system that I would try to build, were I given the chance:

I would have a single large cluster with a large number of compute nodes (depending upon the task). Data would be processed in chunks, with each data chunk processed twice, followed by a verification step. Presumably the optimal situation would be to have the two data chunks for each step processed by separate nodes, which would require a bit of careful programming for appropriate queuing of the verification step, but nothing too horrible. Any verification step that fails, then, would just re-insert a pair of chunks into the queue.

This all requires, of course, that the data can be divided into wholly discrete chunks that are completely independent of one another. If this is not the case, the verification obviously gets dramatically more challenging. But I doubt that it would ever be impossible.

Doubling the node count is the one case I saw that would have the highest guaranteed penalty for a GPU-system.

If we go with the 200x speedup claim for OPC (ignoring the "up to"):
Let's use a CPU cluster as the baseline, and assume an equivalent GPU system would be 200 times faster at the same footprint, or would only need 1/200th the footprint for the same performance.

Doubling nodes automatically means it's 100x.
It's still very good.

GPGPU is not without other extremely large scalar divisors, unfortunately.

With just one GPU, various GPGPU loads are already CPU-limited just by the computations and bookkeeping needed to control the slave board.
The speedup with verification overhead is going to be 100*(1-(per-chunk vote overhead + re-vote cost*rate of mismatch)).
Some additional performance would be lost with driver overheads and the more complicated software stack.
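Plugging purely illustrative numbers into that expression (the vote overhead, re-vote cost, and mismatch rate below are guesses, not measurements):

base_speedup = 100.0    # 200x claim already halved by doubling the nodes
vote_overhead = 0.10    # fraction of time spent on the per-chunk compare/vote
revote_cost = 1.0       # a mismatch costs roughly one extra chunk's worth of work
mismatch_rate = 0.01    # one chunk in a hundred disagrees

effective = base_speedup * (1 - (vote_overhead + revote_cost * mismatch_rate))
print(effective)        # ~89x, before driver and software-stack overheads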

I don't know how high the penalty could get.
Maybe speedup is between 50x and 90x.
Still, good...

If we couple this with the probable case that the 200x was derived from code nobody would trust a multi-million dollar chip tape-out to without serious elaboration, and that the test would not have included all the necessary processes that a full HPC installation would have, there might be another loss of performance.

It still might work for the narrow case of problems that have two orders of magnitude improvement on GPUs--assuming the 200x speedup is enough to cover potential losses of performance in other parts of the process.
We'd also be ignoring the extreme fragility of GPU performance, which can cut things down by significant factors as well.

Anything even slightly more modest would be right out.

If it's 200x reduced to something like 50-80x speedup and we throw in DP (assuming GPU FLOPs weren't crap FLOPs to start with), Nvidia would be right out.
AMD might be slightly better, or also a waste of time, depending on whatever variables interfere, even with a theoretical 200x advantage.

GPUs would be much more compelling with ECC and other RAS features.
(They'd be even more compelling if they weren't slave cards, but I digress.)

This leaves out the practicalities of handling more error-prone hardware with very poor monitoring capabilities and fault detection.
For a small compute section, it might not matter. An HPC installation may make it less practical.
 
There is a somewhat major difference between 1x load and 2-3x load.
But, as I said, if you're disk limited, there's no point in bothering with GPU acceleration at all. It's just a question of whether or not you can get significant performance improvements from GPU acceleration (a 10x real performance improvement should be a good minimum, given the extra cost/space/power of nodes with GPUs and the added verification overhead).

And besides, at a worst case scenario, this would just mean you'd have to beef up your storage infrastructure (e.g. having larger RAID, dividing up data among multiple storage servers, etc.).
 
The solution is simple: cut the die in half with an axe and voila, their yields are automatically better... (yes, it's a joke, heh...)
 
Is that 20% of chips with no dot defects whatsoever, or 20% of chips that are functional with respect to the full target specs (including clock rate, full SIMD deployment, full ROPs, etc.), or 20% of chips that can populate even the lowest-specced SKU?

That would make a hell of a difference.
 

Seems like the latter:
The current situation is that three faulty chips are made in the process to yield one working one, and that is much too much, since those faulty chips aren't exactly "GTX 360" or "slower Quadro FX" grade material.
 