The issue isn't just hardware problems, but cosmic rays flipping your bits as well.
I know. I guess you've not been following the discussion on ECC and measured error rates on GPUs without ECC.
Soak testing will do nothing to stop that. Build a big enough cluster and run it long enough, and the probability of failure becomes non-trivial.
Actually, the only test to date with GPUs shows no failures (though Aaron reckons the methodology of that test is flawed). What it does show is that graphics cards shipping with faulty memory are a serious problem.
The estimate for cosmic ray bit flips is about 1 event per 256MB of memory per month. Amazon was taken down for 24 hours in the 90s by a cosmic ray event.
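Taking that 1-event-per-256MB-per-month figure at face value, expected events scale linearly with memory size and time; a quick back-of-envelope (the 1.5GB card size here is just an assumed example, not a measurement):

```python
# Back-of-envelope using the (disputed) estimate of ~1 soft-error
# event per 256 MB of DRAM per month.
EVENTS_PER_MB_MONTH = 1.0 / 256.0

def expected_events(memory_mb, months):
    """Expected number of bit-flip events for this much memory over this long."""
    return EVENTS_PER_MB_MONTH * memory_mb * months

# A hypothetical 1.5 GB card left running for one month:
print(expected_events(1536, 1))  # 6.0 expected events
```

Which is exactly why the lack of actual measurements is so frustrating - if the rate were really that high, it'd be trivial to observe.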
Yeah, you've really been out of the loop on this subject:
http://forum.beyond3d.com/showthread.php?t=54676
No-one has demonstrated a need for video memory to be protected by ECC. GPU on-die memory? Not that either. Fact is, the error rates in contemporary systems do not match with "received wisdom".
People building HPC clusters are going to be using several hundred cards and running them on jobs that could run for weeks or months and consume huge $$$ in power costs and time, so having the results fscked up halfway through is a bitch.
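If soft errors really arrive at anything like the quoted rate, the chance a long cluster job sees at least one is easy to estimate with a Poisson model (all the cluster figures below are assumed for illustration only):

```python
import math

# The disputed estimate: ~1 soft-error event per 256 MB per month.
EVENTS_PER_MB_MONTH = 1.0 / 256.0

def p_at_least_one_event(cards, mb_per_card, months):
    """Poisson probability that a job sees at least one bit-flip event."""
    expected = EVENTS_PER_MB_MONTH * cards * mb_per_card * months
    return 1.0 - math.exp(-expected)

# A hypothetical cluster: 200 cards x 1.5 GB each, job running ~2 weeks.
# Expected events = 200 * 1536 * 0.5 / 256 = 600, so this is effectively 1.0.
print(p_at_least_one_event(200, 1536, 0.5))
```

Of course the whole argument hinges on that rate being real for GDDR, which - as below - nobody has actually measured.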
In the end, ECC isn't terribly expensive to implement in hardware, the way NVidia's implemented it. The performance loss isn't a deal breaker either. NVidia took the easy way forward, relatively speaking. And NVidia will be marketing it purely on FUD, as there is no evidence that GPU video memory suffers from cosmic ray events.
By the way, I'm not saying GPU video memory can't suffer from cosmic ray events - I'm saying the evidence one way or the other (or any measurements of failure rate) doesn't exist.
Even if fears are overrated, the people in the position of purchasing huge amounts of equipment, especially for government laboratories, are risk averse and like to buy safety.
Yeah, it's why "no-one ever got fired for buying IBM" became a paradigm. Of course, the cost of the FLOPS can make one quite pragmatic about risk. Like these guys:
http://forum.beyond3d.com/showthread.php?p=1353418#post1353418
using GPUs that are obviously not ECC protected. I wonder if they'll be reporting about their cosmic ray problems...
Honestly, I don't think it really matters - the option's there for those who "need" it and the reality is that NVidia has, effectively, not priced ECC as a premium option (since DP throughput and memory amount are the dominant facets of the premium option). But the science on cosmic ray events is sorely, or is that "hilariously", lacking.
Of course, now that NVidia's built ECC, it will be possible to compare experimentally: run one test with ECC turned on and one with it turned off, and see how they fare. Though as I understand it, the on-die ECC can't be turned off - so it's not an entirely controlled experiment.
Jawed