Old 08-Jun-2012, 16:09   #1
Raqia
Member
 
Join Date: Oct 2003
Posts: 320
Error correction in modern GPUs

GPUs now have multi-teraflop performance, stew in more of their heat than ever before, and are manufactured with finer and finer processes. I imagine that at some point a calculation error is bound to crop up due to random physical fluctuations.

This doesn't matter too much in gaming; who cares if you get a mis-colored pixel every once in a while? However, it matters quite a bit if you're simulating hyperbolic dynamics. Are hardware manufacturers implementing more stringent error correction in their GPUs now? I've heard of ECC memory in compute products, but are the error-correcting codes used internally getting longer as well? Also, what's the best practice for modern simulation code to take the inevitable physical errors into account?
Old 08-Jun-2012, 16:29   #2
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,291

If you are not worried about miscolored pixels, then you should be worried about geometry corruption as the same units used to compute pixel colors are used to compute geometry positions and associated interpolants. This would be far more apparent than a miscolored pixel.

In games, most artifacts like this are transient, meaning they won't persist from frame to frame.

If you are performing computations on a GPU without ECC and require correctness, then you can perform redundant computations to help detect errors. Note that errors in registers are pretty rare unless you are working on a large compute farm. Boards with GDDR5 already have error detection (EDC, a CRC on the memory interface) to catch transmission errors, which is a great feature.
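As a rough illustration of the redundant-computation idea, here is a minimal host-side sketch in Python. It assumes the kernel is deterministic, so repeated runs should be bit-identical. The names `run_redundant`, `make_flaky` and `step` are hypothetical and exist only for this example; `make_flaky` merely simulates a transient fault and is not how real hardware errors behave.

```python
import math
from collections import Counter

def run_redundant(kernel, inputs, runs=3):
    """Run `kernel` several times on the same inputs and take a majority vote.

    Returns (result, error_detected). Results must be hashable (e.g. tuples)
    and the kernel must be deterministic for the comparison to be meaningful.
    """
    results = [kernel(inputs) for _ in range(runs)]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= runs // 2:
        # No strict majority: the mismatch is detected but cannot be resolved.
        raise RuntimeError("no majority among redundant runs")
    return value, votes < runs

def make_flaky(kernel, bad_call):
    """Hypothetical fault injector: corrupt the output of one chosen call."""
    calls = {"n": 0}
    def wrapped(inputs):
        calls["n"] += 1
        out = kernel(inputs)
        if calls["n"] == bad_call:
            out = (out[0] + 1.0,) + out[1:]  # simulate a corrupted value
        return out
    return wrapped

def step(state):
    """Stand-in for one simulation step (assumption: any pure function works)."""
    return tuple(math.sin(x) * 2.0 for x in state)

if __name__ == "__main__":
    state = (0.1, 0.2, 0.3)
    print(run_redundant(step, state))
    print(run_redundant(make_flaky(step, bad_call=2), state))
```

With two runs you can only detect a mismatch; three runs let a single bad result be outvoted, at the cost of tripling the work. Running the copies on different boards, rather than repeating on the same one, would also help against a persistent fault.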
__________________
I speak only for myself.
Old 08-Jun-2012, 16:54   #3
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,141

There's also a fair amount of hidden state managed by the command processors and dedicated hardware. Something like the context status getting corrupted could have a long-running impact that would not be reliably detectable from within a running kernel, assuming the GPU doesn't just crash.

There might be ECC in the command processor, not that we'd really know. It might be too much needless work to build a non-ECC version, even for GPUs with no ECC for compute.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 08-Jun-2012, 18:35   #4
dkanter
Regular
 
Join Date: Jan 2008
Posts: 354

The biggest concern is ECC for the on-chip SRAMs (e.g. register files, caches, etc.).

Real compute-oriented GPUs have this feature, although they are quite a bit more expensive than their graphics brethren.

Note that GK104 does not have this feature, and it also significantly lacks in double precision performance.
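For readers who have not seen how ECC on a stored word works, here is a minimal Python sketch of the textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. This is only an illustration of the principle: the thread does not say which codes any particular GPU uses, and real SRAM ECC typically protects much wider words. The function names are made up for this example.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Codeword layout (1-based positions): p1 p2 d1 p4 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(codeword):
    """Correct up to one flipped bit and return (data_bits, syndrome).

    The syndrome is the 1-based position of the corrected bit, or 0 if the
    codeword was already consistent.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome

if __name__ == "__main__":
    word = [1, 0, 1, 1]
    stored = hamming74_encode(word)
    stored[5] ^= 1  # simulate a single-bit upset in the stored codeword
    print(hamming74_decode(stored))
```

Note that this code cannot tell a double-bit error from a single-bit one; practical memory ECC usually adds an extra overall parity bit so that double-bit errors are at least detected.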

DK
__________________
www.realworldtech.com
Old 08-Jun-2012, 18:54   #5
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,291

Quote:
Originally Posted by dkanter
The biggest concern is ECC for the on-chip SRAMs (e.g. register files, caches, etc.).

Real compute-oriented GPUs have this feature, although they are quite a bit more expensive than their graphics brethren.

Or you could use Tahiti-based boards, since they have ECC for the SRAMs and pretty good double-precision performance too.
__________________
I speak only for myself.
Old 08-Jun-2012, 19:21   #6
dkanter
Regular
 
Join Date: Jan 2008
Posts: 354

Quote:
Originally Posted by OpenGL guy
Or you could use Tahiti-based boards, since they have ECC for the SRAMs and pretty good double-precision performance too.

Yes, Tahiti is a real compute-oriented product. Not so much GK104 and its derivatives.

DK
__________________
www.realworldtech.com
Old 08-Jun-2012, 19:24   #7
Alexko
Senior Member
 
Join Date: Aug 2009
Posts: 2,024

Quote:
Originally Posted by dkanter
Yes, Tahiti is a real compute-oriented product. Not so much GK104 and its derivatives.

DK

Too bad AMD has yet to capitalize on this by putting it into a FirePro.
__________________
"Well, you mentioned Disneyland, I thought of this porn site, and then bam! A blue Hulk." —The Creature
My (currently dormant) blog: Teχlog
Old 09-Jun-2012, 16:02   #8
lanek
Member
 
Join Date: Mar 2012
Location: Switzerland
Posts: 660

Quote:
Originally Posted by Alexko
Too bad AMD has yet to capitalize on this by putting it into a FirePro.

Yep, and I don't really understand what they are waiting for, because Tahiti is an extremely good compute competitor...

Especially for double-precision floating point: if we take the raw numbers of the dual GK104 found in the Tesla K10, a single 7970 will give it a hard time (5x more double-precision throughput, and just under it in single precision). (Of course there's the software, CUDA etc., which makes a big difference; I'm really just talking about the numbers.)

I don't even know who will buy the Tesla K10: the Tesla K20 is around the corner (and enterprises don't like buying a product that will only last 6 months). (I say that, but I could be wrong.)

Last edited by lanek; 09-Jun-2012 at 16:09.
Old 09-Jun-2012, 22:36   #9
Raqia
Member
 
Join Date: Oct 2003
Posts: 320

Validation and driver work for professional products must take a really long time. Maybe current products also have bugs, mostly relevant to mission-critical tasks, that require a re-spin before they are ready for prime time.
Old 11-Jun-2012, 00:49   #10
dkanter
Regular
 
Join Date: Jan 2008
Posts: 354

I bet they will launch a professional product soon. AFDS will be a logical venue.

DK
__________________
www.realworldtech.com
All times are GMT +1.


Powered by vBulletin® Version 3.8.6
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.