Error correction in modern GPUs

Raqia

GPUs now deliver multi-teraflop performance, run hotter than ever, and are manufactured on ever-finer processes. I imagine that at some point a calculation error is bound to crop up due to random physical fluctuations.

This doesn't matter too much in gaming; who cares if you get a miscolored pixel every once in a while? However, it matters quite a bit if you're simulating hyperbolic dynamics. Are hardware manufacturers implementing more stringent error correction in their GPUs now? I've heard of ECC memory in compute products, but are, say, error-correcting codes getting longer internally as well? Also, what's the best practice for modern simulation code to take the inevitable physical error into account?
 
If you are not worried about miscolored pixels, then you should be worried about geometry corruption, since the same units that compute pixel colors also compute geometry positions and the associated interpolants. That would be far more apparent than a miscolored pixel.

In games, most artifacts like this are transient, meaning they won't persist from frame to frame.

If you are performing computations on a GPU without ECC and require correctness, then you can perform redundant computations to help detect errors. Note that errors in registers are pretty rare unless you are working at the scale of a large compute farm. Boards with GDDR5 already have link-level error detection for transmission errors, which is a great feature.
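A minimal CUDA sketch of that kind of redundancy, assuming a deterministic kernel launched with identical inputs and configuration (so the two passes should match bit-for-bit); the kernel body, function names, and buffers are placeholders of my own, not anything from this thread:

```
#include <cuda_runtime.h>

__global__ void step(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 1.000001f + 0.5f;  // stand-in for the real update
}

// Run the same kernel twice into separate buffers and compare on the host.
// Any mismatch flags a transient fault; the caller can then redo the step.
// Error checking on the CUDA calls is omitted for brevity.
bool redundant_step(const float* d_in, float* d_a, float* d_b,
                    float* h_a, float* h_b, int n) {
    int threads = 256, blocks = (n + threads - 1) / threads;
    step<<<blocks, threads>>>(d_in, d_a, n);  // first pass
    step<<<blocks, threads>>>(d_in, d_b, n);  // redundant second pass
    cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        if (h_a[i] != h_b[i]) return false;  // mismatch: rerun this step
    return true;
}
```

The exact-match comparison is valid only because both passes run the same deterministic code on the same inputs; it roughly doubles the cost, so in practice you'd reserve it for the steps where correctness matters most.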
 
There's also a fair amount of hidden state managed by the command processors and dedicated hardware. Something like the context status getting corrupted could have a long-running impact that would not be reliably detectable by a running kernel, assuming the GPU doesn't just crash.

There might be ECC in the command processor; not that we'd really know. It might be more needless work than it's worth to build a separate non-ECC version, even for GPUs that expose no ECC for compute.
 
The biggest concern is ECC for the on-chip SRAMs (register files, caches, etc.).

Real compute-oriented GPUs have this feature, although they are quite a bit more expensive than their graphics brethren.

Note that GK104 does not have this feature, and it also lags significantly in double-precision performance.
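For what it's worth, whether ECC is active is visible from software. A minimal sketch using the CUDA runtime (cudaDeviceProp::ECCEnabled is a real field; the rest of the program is just illustrative scaffolding):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // ECCEnabled is nonzero when ECC is turned on for this device.
        printf("Device %d (%s): ECC %s\n", dev, prop.name,
               prop.ECCEnabled ? "enabled" : "disabled");
    }
    return 0;
}
```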

DK
 
Or you could use Tahiti-based boards since they have ECC for the SRAMs and pretty good double precision performance too ;)
 
Too bad AMD has yet to capitalize on this by putting it into a FirePro.


Yep, and I don't really understand what they're waiting for, because Tahiti is an extremely good compute competitor...

Especially for double precision: if we take the raw numbers of the dual-GK104 board found in the Tesla K10, a single 7970 gives it a hard time (about 5x the double-precision throughput, and just under it in single precision). Of course there's the software side, CUDA and so on, which makes a big difference; I'm really just talking about raw numbers.

I don't even know who will buy the Tesla K10: the Tesla K20 is around the corner, and enterprises don't like buying a product that will be superseded in six months. (I say that, but I could be wrong.)
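As a sanity check on those ratios, using the commonly cited peak figures (my numbers, added for reference, not the poster's):

$$\frac{947~\text{GFLOPS (HD 7970, FP64 peak)}}{190~\text{GFLOPS (Tesla K10, FP64 peak)}} \approx 5.0, \qquad \frac{3789~\text{GFLOPS (HD 7970, FP32 peak)}}{4577~\text{GFLOPS (Tesla K10, FP32 peak)}} \approx 0.83$$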
 
Validation and drivers for professional products must take a really long time. Maybe there are also bugs in current products, mostly relevant to mission-critical tasks, that require a re-spin before prime time.
 