#1 |
|
Member
Join Date: Oct 2003
Posts: 320
|
GPUs now deliver multi-teraflop performance, run hotter than ever before, and are manufactured on finer and finer processes. I imagine that at some point a calculation error is bound to crop up due to random physical fluctuations.
This doesn't matter much in gaming; who cares if you get a miscolored pixel every once in a while? It matters quite a bit, however, if you're simulating hyperbolic dynamics, where small errors grow exponentially. Are hardware manufacturers implementing more stringent error correction in their GPUs now? I've heard of ECC memory in compute products, but are, say, the error-correcting codes used internally getting longer as well? Also, what's the best practice for modern simulation code to account for the inevitable physical errors?
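To make the concern about hyperbolic dynamics concrete, here is a minimal sketch of how one flipped mantissa bit spreads through a chaotic iteration. Everything in it is an illustrative assumption rather than anything from a real simulation code: Arnold's cat map stands in for "hyperbolic dynamics", and the starting point, the flipped bit index (16), and the helper names are made up for the demonstration.

```python
import struct


def flip_bit(x, bit):
    """Return x with one bit of its IEEE-754 double encoding inverted."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def cat_map(x, y):
    """One step of Arnold's cat map, a textbook hyperbolic map of the unit torus."""
    return (2.0 * x + y) % 1.0, (x + y) % 1.0


def torus_gap(a, b):
    """Distance between two coordinates on the unit circle (handles wrap-around)."""
    d = abs(a - b)
    return min(d, 1.0 - d)


def divergence(x0, y0, bit, steps):
    """Gap in x, step by step, between a clean run and a run with one flipped bit."""
    clean = (x0, y0)
    faulty = (flip_bit(x0, bit), y0)  # simulated soft error in the initial state
    gaps = []
    for _ in range(steps):
        clean = cat_map(*clean)
        faulty = cat_map(*faulty)
        gaps.append(torus_gap(clean[0], faulty[0]))
    return gaps


gaps = divergence(0.3, 0.7, bit=16, steps=60)
first_bad = next(i for i, g in enumerate(gaps, start=1) if g > 0.1)
print(f"gap after 5 steps: {gaps[4]:.3e}")
print(f"first step with gap > 0.1: {first_bad}")
```

The perturbation starts far below anything you would notice and grows by roughly a constant factor per step until the two runs are unrelated, which is why a single silent bit error can invalidate a long chaotic simulation while leaving a rendered frame looking fine.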
#2 |
|
Senior Member
|
Even if you are not worried about miscolored pixels, you should be worried about geometry corruption, because the same units that compute pixel colors also compute geometry positions and the associated interpolants. That would be far more apparent than a miscolored pixel.
In games, most artifacts like this are transient, meaning they won't persist from frame to frame. If you are performing computations on a GPU without ECC and require correctness, you can perform redundant computations to help detect errors. Note that errors in registers are pretty rare unless you are working on a large compute farm. GDDR5 already includes error detection (EDC, a CRC on the data link) to catch transmission errors, which is a great feature.
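A rough sketch of what "redundant computations" could look like in host code: run the same kernel twice, compare, and retry on a mismatch. This is only an illustration under assumptions. `saxpy` is a plain-Python stand-in for a GPU kernel, and `FlakyOnce` and `run_checked` are hypothetical helpers invented here, not part of any real GPU API.

```python
def saxpy(a, xs, ys):
    """Stand-in for a GPU kernel: elementwise a * x + y."""
    return [a * x + y for x, y in zip(xs, ys)]


class FlakyOnce:
    """Hypothetical fault injector: corrupts one element of the first result only."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.calls = 0

    def __call__(self, *args):
        out = self.kernel(*args)
        self.calls += 1
        if self.calls == 1:
            out[0] += 1.0  # simulated transient error
        return out


def run_checked(kernel, *args, max_retries=3):
    """Run the kernel twice per attempt; accept the result only if both runs agree."""
    for attempt in range(max_retries + 1):
        first = kernel(*args)
        second = kernel(*args)
        if first == second:
            return first, attempt  # attempt = number of retries that were needed
    raise RuntimeError("results never agreed; suspect a persistent fault")


flaky = FlakyOnce(saxpy)
result, retries = run_checked(flaky, 2.0, [1.0, 2.0], [3.0, 4.0])
print(result, "retries:", retries)
```

Two caveats apply to this kind of check. It only catches transient faults, since a stuck bit would corrupt both runs identically, and an exact comparison assumes the kernel is deterministic (for example, no reductions whose summation order varies between runs).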
__________________
I speak only for myself. |
#3 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,141
|
There's also a fair amount of hidden state managed by the command processors and dedicated hardware. Something like the context status getting corrupted could have a long-running impact that a running kernel could not reliably detect, assuming the GPU doesn't just crash.
There might be ECC in the command processor, not that we'd really know. It might be too much needless work to build a non-ECC version, even for GPUs with no ECC for compute.
__________________
Dreaming of a .065 micron etch-a-sketch. |
#4 |
|
Regular
Join Date: Jan 2008
Posts: 354
|
The biggest concern is ECC for the on-chip SRAMs (i.e. register files, caches, etc.).
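For readers unfamiliar with what ECC on a small memory word actually does, here is a toy single-error-correct, double-error-detect (SECDED) sketch using an extended Hamming(8,4) code. It is an illustration only: the codes actually used in GPU register files and caches are not described in this thread and protect much wider words, and the function names below are made up.

```python
from functools import reduce
from itertools import combinations

DATA_POSITIONS = (3, 5, 6, 7)  # positions 1, 2, 4 hold Hamming parity bits


def syndrome(bits):
    """XOR of the positions (1..7) that hold a 1; zero for a valid codeword."""
    return reduce(lambda acc, p: acc ^ p, (p for p in range(1, 8) if bits[p]), 0)


def encode(nibble):
    """Encode 4 data bits into 8 bits; index 0 is the overall parity bit."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    s = syndrome(bits)
    for p in (1, 2, 4):
        bits[p] = 1 if s & p else 0  # choose parity bits so the syndrome becomes 0
    bits[0] = sum(bits[1:]) & 1      # make the total number of ones even
    return bits


def decode(bits):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    s = syndrome(bits)
    parity_even = (sum(bits) & 1) == 0
    if s == 0 and parity_even:
        status = "ok"
    elif not parity_even:
        bits[s] ^= 1                 # single flip; s == 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"  # nonzero syndrome with even parity: two flips
    nibble = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status


word = encode(0b1011)
word[5] ^= 1                         # simulated single-bit upset in the SRAM cell
print(decode(word))
```

The design trade-off is the storage overhead: this toy code spends four check bits on four data bits, whereas codes over wider words need proportionally far fewer check bits for the same single-error protection.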
Real compute-oriented GPUs have this feature, although they are quite a bit more expensive than their graphics brethren. Note that GK104 does not have this feature, and it is also significantly lacking in double-precision performance.
DK
__________________
www.realworldtech.com |
#5 | |
|
Senior Member
|
Quote:
__________________
I speak only for myself. |
#6 | |
|
Regular
Join Date: Jan 2008
Posts: 354
|
Quote:
DK
__________________
www.realworldtech.com |
#7 |
|
Senior Member
|
Too bad AMD has yet to capitalize on this by putting it into a FirePro.
__________________
"Well, you mentioned Disneyland, I thought of this porn site, and then bam! A blue Hulk." —The Creature My (currently dormant) blog: Teχlog |
#8 | |
|
Member
Join Date: Mar 2012
Location: Switzerland
Posts: 660
|
Quote:
Yep, and I don't really understand what they are waiting for, because Tahiti is an extremely good compute competitor, especially for double-precision floating point. If we take the raw numbers of the dual GK104 found in the Tesla K10, a single 7970 will give it a hard time (5x the double-precision throughput, and just under it on single precision). Of course there's the software side, CUDA etc., which makes a big difference; I'm really just talking about the numbers. I don't even know who will buy the Tesla K10: the Tesla K20 is around the corner, and enterprises don't like buying a product that will only last six months. (That's my take; I could be wrong.)
Last edited by lanek; 09-Jun-2012 at 16:09.
#9 |
|
Member
Join Date: Oct 2003
Posts: 320
|
Validation and driver work for professional products must take a really long time. Maybe current products also have bugs, mostly relevant to mission-critical tasks, that require a re-spin before they are ready for prime time.
#10 |
|
Regular
Join Date: Jan 2008
Posts: 354
|
I bet they will launch a professional product soon. AFDS will be a logical venue.
DK
__________________
www.realworldtech.com |