Memory Errors in GPUs
On Testing GPU Memory for Hard and Soft Errors:
http://saahpc.ncsa.illinois.edu/papers/Shi_paper.pdf
Couldn't detect any soft errors over a substantial test. Did find that 1.8% of their GPUs had permanent errors. Seems NVidia didn't test them properly and has replaced them.
Software-Based ECC for GPUs:
http://saahpc.ncsa.illinois.edu/papers/Maruyama_paper.pdf
These papers come from:
http://saahpc.ncsa.illinois.edu/agenda.html
Jawed
Damn ... and those are their bloody Teslas, makes you wonder about the memory integrity on consumer cards.
Since consumer cards are manufactured by third parties, I think this would depend on whether NVIDIA provides a good tool to test for memory defects. I would not hold too much confidence though. I've seen weird effects on consumer cards (from both NVIDIA and ATI) which look like some sort of memory error.
Blazkowicz
25-Jul-2009, 10:34
Is software ECC applicable to CPUs as well? As pointed out by the article, regular PCs don't use ECC memory despite providing a lot of CPU power, memory capacity and memory bandwidth (with DDR3) for real cheap.
That may be non trivial to achieve system wide, though.
CarstenS
25-Jul-2009, 12:33
You left out an important detail:
Memory errors are quite common - hence ECC and the like.
"But even with 1.8% of systems confirmed to have memory issues, the memory failure rate is below of what has been reported on non-GPU clusters, e.g., Li et al. [6] discovered hardware memory faults on 9 out of 212 (or 4.5%) Ask.com servers."
Is software ECC applicable to CPUs as well? As pointed out by the article, regular PCs don't use ECC memory despite providing a lot of CPU power, memory capacity and memory bandwidth (with DDR3) for real cheap.
That may be non trivial to achieve system wide, though.
It could be more complicated on a CPU because a CPU's memory hierarchy is more complex. GPUs normally have no cache, but CPUs generally have multiple levels of cache, some of which are protected by ECC.
I think if you really care about the correctness of your operations, it's probably better to just do some redundant (or verification) calculations.
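A minimal sketch of the redundant-calculation idea, in plain Python rather than GPU code. The names `run_redundantly` and `dot` are made up for illustration; the point is only that a result is accepted when independent runs agree.

```python
def run_redundantly(compute, *args, copies=2):
    """Run `compute` several times and accept the result only if all copies agree."""
    results = [compute(*args) for _ in range(copies)]
    if any(r != results[0] for r in results[1:]):
        raise RuntimeError("redundant results disagree; possible memory or compute fault")
    return results[0]


def dot(xs, ys):
    # Stand-in workload; a real kernel would run on the device.
    return sum(x * y for x, y in zip(xs, ys))


checked = run_redundantly(dot, [1, 2, 3], [4, 5, 6])
print(checked)  # 32
```

This only detects a fault; it cannot say which copy was wrong unless you run three or more copies and vote.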
You left out an important detail:
Memory errors are quite common - hence ECC and the like.
"But even with 1.8% of systems confirmed to have memory issues, the memory failure rate is below of what has been reported on non-GPU clusters, e.g., Li et al. [6] discovered hardware memory faults on 9 out of 212 (or 4.5%) Ask.com servers."
NVidia's testing of Tesla cards (since it seems as if NVidia does do some testing) may well be having a dramatic effect on the error rate.
Were the Ask.com servers ever tested before Li came along? Were they tested by the manufacturer? By the installer? By the people running it? I don't know if Li's report is a useful control. It doesn't sound like it, frankly.
Products that are faulty, when brand new, are quite common.
Jawed
dkanter
26-Jul-2009, 05:19
Very interesting papers.
I strongly suspect that their testing methodology for SERs has a hole in it if they didn't detect any (over a sufficient period of time). I couldn't say how to fix it...but I don't have any reason to believe that GDDRx is less susceptible to SERs than DDRx.
Another related issue is the problem of SERs on-chip. SRAM arrays are a known problem (hence the use of ECC for most caches). Register files are generally more robust against SERs, but hardly infallible (some high-end CPUs have ECC on their integer reg files).
Unfortunately, I don't know how much SRAM and RF* is used in a GPU, whether there is parity, etc. etc. I also don't know what the design of their register files is like. As always, you can improve your resilience to errors by increasing cell size, reducing frequency, etc. etc.
* We do know that there's at least 88KB of constant, register and shared memories in each SM, and 30 SMs/GT200 so there's at least 2640KB per chip (assuming all three use RFs). In reality, there's going to be substantially more as I'm sure there's all sorts of RFs and SRAMs floating around in the TMUs, ROPs and memory controllers.
I also have no idea how to measure the SERs in your register files or SRAMs without parity or ECC...
It would definitely be interesting to see papers at ISSCC on these issues, but sadly NV and ATI have never gone down that route. Perhaps Intel will with Larrabee, which could encourage NV and ATI to follow. It'd certainly be great for GPU enthusiasts if they did.
Obviously, ECC would be a big step forward, and I'm sure that will happen in the future...perhaps even the next generation of Tesla/Quadro cards.
DK
I strongly suspect that their testing methodology for SERs has a hole in it if they didn't detect any (over a sufficient period of time). I couldn't say how to fix it...but I don't have any reason to believe that GDDRx is less susceptible to SERs than DDRx.
The physical implementation of GDDR and DDR is slightly different (i.e. use of DIMMs for the latter), which might affect electrical performance and, in turn, the behaviour of the refresh circuitry. How many soft errors are caused by faulty refresh?
It's interesting that their testing methodology seems more thorough than MemtestG80.
Another related issue is the problem of SERs on-chip. SRAM arrays are a known problem (hence the use of ECC for most caches). Register files are generally more robust against SERs, but hardly infallible (some high-end CPUs have ECC on their integer reg files).
The trade for using consumer CPUs for supercomputing.
Obviously, ECC would be a big step forward, and I'm sure that will happen in the future...perhaps even the next generation of Tesla/Quadro cards.
http://www.cs.ucf.edu/~zhou/GPGPU_v1.pdf
The R-Scatter result for FFT presented in that paper (which shows relatively little change in performance) would be quite different with R700 because R700's ALUs would process the non-redundant code far more quickly (they're much stronger for integer/bitwise instructions).
Additionally those tests seem to have been done a long time ago - the PCI Express bus appears to be constraining performance much more severely than it would on a contemporary system. Then again, there are plenty of kernels out there whose copy time to/from GPU is vanishingly small.
Overall the paper's lack of breadth means it isn't much of a guide on the usefulness of ECC. But I think it can be argued that the R-Scatter and R-Thread techniques are not worth the hassle.
If NVidia goes with memory hubs in its next major revision there's an opportunity to implement hubs that are specific to ECC.
The other side of the coin is that there's no point implementing ECC when you could simply use redundant hardware. In a year's time the hardware to do the same computation will halve in cost. CPUs' price-performance curve isn't anything like as compelling, so ECC is more interesting. That is, of course, if the GPU is providing something like 20x or more speed-up for the entire application. If the GPU is only making the overall application 50% faster then you're in la-la land using GPUs.
If GPUs are worth doing then arguably the ideal is to build a cluster out of consumer cards, so that the GPUs are easy to change over time - rather than being stuck with manufactured blades whose GPUs cannot be changed. Consumer cards are also cheaper to change when they fail. Consumer cards will tend to have less memory on them, though.
If you build with consumer cards you have to test that they're working when you buy them. But since you have to test them regularly once they're in service, anyway, testing them when you buy them is a non-issue.
Jawed
aaronspink
26-Jul-2009, 13:06
You left out an important detail:
Memory errors are quite common - hence ECC and the like.
"But even with 1.8% of systems confirmed to have memory issues, the memory failure rate is below of what has been reported on non-GPU clusters, e.g., Li et al. [6] discovered hardware memory faults on 9 out of 212 (or 4.5%) Ask.com servers."
You'd kinda expect that a server farm with full error detection to actually report a higher error rate than some part software testing...
aaronspink
26-Jul-2009, 13:17
The physical implementation of GDDR and DDR is slightly different (i.e. use of DIMMs for the latter), which might affect electrical performance and, in turn, the behaviour of the refresh circuitry. How many soft errors are caused by faulty refresh?
The actual cells in DDR and GDDR are basically the same thing. The macro level difference is in the size of the sub-arrays. Basically there should be close to zero difference as far as particle induced SER.
It's interesting that their testing methodology seems more thorough than MemtestG80.
Memtest86 and, by extension, MemtestG80 are trying to detect/induce static and margin-related errors. Neither is a good test for SER.
Aaron, can you describe or link the right way to test for soft errors?
Jawed
aaronspink
27-Jul-2009, 04:10
In any event where you are talking about externally induced soft errors in memory bit cells, you must first determine the polarity of the cells themselves. Are the cells inverted or not? You then create a test pattern that is most likely to show the effects of a bit flip if a strike occurs. If your standard particle would set a cell to 1 and your test pattern is such that the cell would be a 1 to begin with, the probability of detecting the strike goes way down.
Once you have the pattern in memory, you slowly start reading through memory, counting any mismatches against the pattern. You let it sit there for sufficient time to get a relevant statistical sample.
If you are concerned about full characterization, you also do accelerated testing, either via artificial means such as an isotope source (there are actually several government labs that basically ONLY do this) or via high-altitude exposure (the intensity of things such as gamma rays at high altitude is significantly higher than at sea level or underground. Places like Sandia National Labs have issues with this in their compute arrays due to the high altitude there. 5500 feet may not seem like much, but it can change a 1 PPB issue into a < 1 PPM issue, which matters with large compute arrays.)
Doing repeated reading and writing of memory with various patterns really doesn't tell you as much as one would think about SER, except for things around margin. This is why Memtest is great to see if your memory works but not so great at detecting particle-induced SER. For testing SER, you really want to treat the memory as a small particle detector, like a CCD, and then de-rate based on real-world pattern analysis (things like the probability that you are actively using that portion of memory, data patterns, etc.).
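The pattern-choice point above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the "memory" is a Python bytearray, strikes are injected artificially instead of waiting for real ones, every cell is assumed to have the same polarity, and all function names are hypothetical.

```python
import random


def choose_pattern(strike_sets_bit_to_one):
    """Fill byte that leaves every cell in the state a strike would change."""
    return 0x00 if strike_sets_bit_to_one else 0xFF


def simulate_strikes(memory, strikes, strike_sets_bit_to_one, rng):
    """Stand-in for the waiting period: force random bits to the post-strike value."""
    for _ in range(strikes):
        bit = rng.randrange(len(memory) * 8)
        mask = 1 << (bit % 8)
        if strike_sets_bit_to_one:
            memory[bit // 8] |= mask
        else:
            memory[bit // 8] &= ~mask & 0xFF


def count_mismatched_bits(memory, pattern):
    """Read back the buffer and count bits that differ from the written pattern."""
    return sum(bin(byte ^ pattern).count("1") for byte in memory)


def run_trial(pattern, strikes=20, size=4096, seed=1):
    rng = random.Random(seed)
    memory = bytearray([pattern]) * size
    simulate_strikes(memory, strikes, True, rng)  # assume strikes drive cells to 1
    return count_mismatched_bits(memory, pattern)


sensitive = run_trial(choose_pattern(True))  # all zeros: strikes show up as flips
insensitive = run_trial(0xFF)                # all ones: the same strikes are invisible
print(sensitive, insensitive)
```

With the wrong pattern the same strikes produce no mismatches at all, which is one way a test can report zero soft errors.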
Is Ask's server farm running ECC memory?
Even then, the memory ECC happens on the DIMM, so there could still be transport interference on the pins, sockets, etc.
Even when swapping memory intensively we hardly run into ECC errors (way lower percentages than those reported for Ask), and then they always seem to be faulty DIMMs.
http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
Jawed
http://www.hpcwire.com/features/Reliable-Memory-Coming-to-a-GPU-Near-You-56751022.html
Patricia Harrell, AMD's director of Stream Computing, admits that the need for more robust data protection in GPUs already exists. She says error corrected memory is a requirement for a number of customers, especially those looking to deploy GPUs at scale, i.e., high performance computing users with large compute clusters. Although individual memory error rates are low, as you add more GPUs (and thus more graphics memory) to the system, and run applications for longer periods of time, the chances of hitting a flipped memory bit increases proportionally.
[...]
Overall though, AMD seems to be taking a cautious approach to error correcting GPUs. "It's really important to put in the required features intelligently, and make sure you do the research and engineering to protect the data structures that are going to return the most value," notes Harrell. If not, she says, you end up with devices that are too big and too hot, in which case you lose the performance advantages GPGPU was originally intended for.
[...]
Unlike AMD's more wait-and-see attitude, NVIDIA appears to be fully committed to bringing error protection to GPU computing. According to Andy Keane, general manager of the GPU computing business unit at NVIDIA, it is not a matter of if, but when. From his point of view, ECC memory is a hard requirement in datacenters. "We have to respond to that by building that kind of support into our roadmap," Keane said unequivocally. "It will be in a future GPU."
The final sentence, I think, is likely to be the clincher for GT300:
Just like double precision performance and on-board memory capacity, error correction is destined to become an important differentiator in high-end GPU computing.
Which is why I think memory hubs will be part of GT300's architecture, supporting high capacity and ECC DDR2/DDR3. GT200 demonstrated NVidia's zeal in making CUDA work at any cost and I believe GT300 is more, much more, of the same. I expect it to be pretty cool.
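The scaling argument in the quoted article (more GPUs and longer runs mean more chances of hitting a flipped bit) can be put into numbers with a simple Poisson model. This is a sketch only: the per-bit flip rate below is a made-up placeholder, not a measured value, and the function name is hypothetical.

```python
import math


def prob_at_least_one_flip(flips_per_bit_hour, gpus, bytes_per_gpu, hours):
    """Probability of one or more bit flips, assuming independent flips at a constant rate."""
    expected_flips = flips_per_bit_hour * gpus * bytes_per_gpu * 8 * hours
    return -math.expm1(-expected_flips)  # 1 - exp(-expected_flips), computed accurately


HYPOTHETICAL_RATE = 1e-15  # flips per bit per hour; illustrative only, not a measurement
GIB = 1 << 30

for gpus in (1, 10, 100, 1000):
    p = prob_at_least_one_flip(HYPOTHETICAL_RATE, gpus, GIB, hours=24)
    print(gpus, f"{p:.4f}")
```

While the expected number of flips is small the probability grows almost linearly with GPU count and run time, as the article says; it flattens out as it approaches 1.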
Jawed
ChrisRay
03-Sep-2009, 14:05
Good find Jawed.
rpg.314
03-Sep-2009, 14:51
Which is why I think memory hubs will be part of GT300's architecture, supporting high capacity and ECC DDR2/DDR3. GT200 demonstrated NVidia's zeal in making CUDA work at any cost and I believe GT300 is more, much more, of the same. I expect it to be pretty cool.
Oh please, there's almost no way SMB is useful over integrating the ECC into the GPU itself (however hacky it may be). And let's not forget, after the rv770 shock, their first priority is going to be graphics and games, not computing.
Oh please, there's almost no way SMB is useful over integrating the ECC into the GPU itself (however hacky it may be).
Hacking ECC in software? That destroys performance as it is computationally intensive.
Figure 1 compares the throughputs of memory reads with and without ECC. The throughputs decreased to 24%, 40%, and 35% with GTX 285, S1070, and 8800 GTS, respectively. We speculate that this throughput degradation can be explained by the ECC computation cost. Our prototype takes 63 integer 32-bit logical operations to generate an ECC for a 64-bit datum, which means that one byte of data approximately requires eight integer operation. As shown in the blue bars in the graph, the GTX 285 GPU achieves more than 140 GB/s. To keep up the memory throughput, the GPU need to process 1120 giga operations per second (GOPS), which is far beyond of its theoretical limit of 355 GOPS. In other words, the GPU can at most afford 44 GB/s throughputs. In addition, read accesses incur other instructions, including the comparison of codes, so the actual ECC throughputs is lower than the limit. Thus, we believe that the performance of software ECC is computation bottleneck. This analysis is consistent with the fact that the ECC throughput of S1070 is approximately the same as that of GTX 285: While the latter achieves much higher throughput without ECC, they have very similar processing power (355 GOPS and 345 GOPS). The 8800 GTS GPU, whose processing speed is 208 GOPS, shows the similar behavior.
Even though that paragraph unrealistically makes a baseline out of 140GB/s on GTX285, you can see that the computational cost of ECC is monstrous.
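The arithmetic in the quoted paragraph can be reproduced directly from the figures it gives (63 operations per 64-bit datum, 140 GB/s, and the three peak GOPS numbers). The function names are just for illustration.

```python
OPS_PER_DATUM = 63     # integer ops to generate the ECC for one 64-bit datum (from the quote)
BYTES_PER_DATUM = 8

exact_ops_per_byte = OPS_PER_DATUM / BYTES_PER_DATUM  # 7.875
rounded_ops_per_byte = 8                              # the paper rounds up to eight


def required_gops(bandwidth_gb_s, ops_per_byte):
    """Integer throughput needed to compute ECC at the given memory bandwidth."""
    return bandwidth_gb_s * ops_per_byte


def max_bandwidth_gb_s(peak_gops, ops_per_byte):
    """Memory bandwidth the ALUs can keep up with if they do nothing but ECC."""
    return peak_gops / ops_per_byte


print(required_gops(140, rounded_ops_per_byte))  # 1120, matching the quoted figure

for name, peak_gops in [("GTX 285", 355), ("S1070", 345), ("8800 GTS", 208)]:
    print(name, round(max_bandwidth_gb_s(peak_gops, rounded_ops_per_byte), 1))
```

The GTX 285 line comes out just above 44 GB/s, which is the paper's "at most 44 GB/s" bound before counting the extra instructions for comparing codes on reads.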
Obviously, one option would be to make an instruction that does this, so that it would still be a software operation.
Making the DDR I/O on the die include ECC capability has a fairly severe area cost, though if GT300 is the only chip with it, size is arguably not the most pressing parameter.
And let's not forget, after the rv770 shock, their first priority is going to be graphics and games, not computing.
GT300 has spent more time in design than RV770 has been a retail product.
I think ECC is a white elephant, for what it's worth. The evidence points towards soft errors in GPU GDDR being negligible (faulty chips are the problem and such faulty chips are relatively easy to weed out - plus system level error detection/correction requires more byzantine redundancy mechanisms, way beyond ECC). But it seems "customers want it" so that's what NVidia's chasing, and only NVidia will be providing it any time soon.
Jawed
aaronspink
03-Sep-2009, 18:07
I think ECC is a white elephant, for what it's worth. The evidence points towards soft errors in GPU GDDR being negligible (faulty chips are the problem and such faulty chips are relatively easy to weed out - plus system level error detection/correction requires more byzantine redundancy mechanisms, way beyond ECC). But it seems "customers want it" so that's what NVidia's chasing, and only NVidia will be providing it any time soon.
The evidence for your case is sorely lacking, Jawed. There is 1 paper with less than sound methods and mountains of evidence that software "ECC" massively under-reports actual error counts. We haven't even gotten into all the other various factors such as location and elevation.
And I still remain skeptical that anyone will provide a real ECC system for GPUs anytime soon.
The evidence for your case is sorely lacking, Jawed. There is 1 paper with less than sound methods
How many errors does doctrine dictate should have been detected? They detected none.
What location and elevation would be required to achieve a zero-soft-error test, according to the doctrine that you're going by?
and mountains of evidence that software "ECC" massively under reports actual error counts.
And Google's evidence clearly shows that faulty chips and interface physical properties are swamping the cosmological factor you keep citing. The cosmological factor for which there is no publicly documented proof with contemporary large cluster systems based on commodity PC hardware.
Unless someone would like to link it? I can't find anything. The Ziegler et al. paper from 1996 keeps coming up though.
Jawed
dkanter
05-Sep-2009, 07:34
How many errors does doctrine dictate should have been detected? They detected none.
What location and elevation would be required to achieve a zero-soft-error test, according to the doctrine that you're going by?
Lead lined bunker, underground and away from any radiation sources : )
And Google's evidence clearly shows that faulty chips and interface physical properties are swamping the cosmological factor you keep citing. The cosmological factor for which there is no publicly documented proof with contemporary large cluster systems based on commodity PC hardware.
Google's systems aren't quite commodity PC hardware. They use commodity CPUs and other commodity parts, but are utterly unlike anything you can buy. Also, they run very specific workloads, which have different characteristics than what other users may need.
Unless someone would like to link it? I can't find anything. The Ziegler et al. paper from 1996 keeps coming up though.
Jawed
A lot of the internal studies by semiconductor vendors aren't published at all...but when semiconductor companies target a 7 year life, they do a hell of a lot of testing to determine that.
David
Hacking ECC in software?
No, simply putting the ECC hardware in the GPU.
As Rolf was kind enough to point out for us, we were all being stupid silly about trying to use the ECC memory schemes from PCs. No need to use DDR2/3 ... just use standard bus width GDDR5. So for every 64-byte burst to/from memory dedicate 8 bytes to ECC, done. The cheapest and most obvious solution.
With GDDR5 memory the memory hub makes no sense ... and they don't need to use anything else.
PS. well the obvious solution to people smarter than me obviously :)
dkanter
05-Sep-2009, 19:13
No, simply putting the ECC hardware in the GPU.
As Rolf was kind enough to point out for us, we were all being stupid silly about trying to use the ECC memory schemes from PCs. No need to use DDR2/3 ... just use standard bus width GDDR5. So for every 64-byte burst to/from memory dedicate 8 bytes to ECC, done. The cheapest and most obvious solution.
With GDDR5 memory the memory hub makes no sense ... and they don't need to use anything else.
PS. well the obvious solution to people smarter than me obviously :)
It's not that simple at all, and there are other problems as well.
David
aaronspink
06-Sep-2009, 01:19
No, simply putting the ECC hardware in the GPU.
As Rolf was kind enough to point out for us, we were all being stupid silly about trying to use the ECC memory schemes from PCs. No need to use DDR2/3 ... just use standard bus width GDDR5. So for every 64-byte burst to/from memory dedicate 8 bytes to ECC, done. The cheapest and most obvious solution.
sure, only requires COMPLETELY CUSTOM MEMORY!
You misunderstand me ... for every 64 bits of the normal memory 8 bits are set aside for ECC ... but for the memory nothing changes compared to a normal 64 bits wide bus.
Calculating the correct physical memory addresses requires a division, but another cycle of latency isn't that disastrous. Physical memory won't be 8 byte aligned anymore, so you have to throw away a couple of bytes on each access or design for partially filled cache lines ... even if you throw the bytes away you still have something like 6/8th of the original speed, which is enough IMO.
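A sketch of the address calculation being described, assuming ECC bytes are interleaved with data in ordinary memory. The function names and the default 7+1 byte layout (56 data bits plus 8 ECC bits per 64-bit word) are illustrative; the same function also covers a 64+8 bit layout by passing `data_bytes=8`.

```python
def physical_address(logical, data_bytes=7, ecc_bytes=1):
    """Map a logical byte address to its physical address in the interleaved layout."""
    group, offset = divmod(logical, data_bytes)  # this is the division mentioned above
    return group * (data_bytes + ecc_bytes) + offset


def ecc_address(logical, data_bytes=7, ecc_bytes=1):
    """Physical address of the first ECC byte protecting the given logical byte."""
    group = logical // data_bytes
    return group * (data_bytes + ecc_bytes) + data_bytes


for logical in (0, 6, 7, 13, 14):
    print(logical, physical_address(logical), ecc_address(logical))
```

Because 7 is not a power of two, the divide cannot be replaced by a shift, and logical 8-byte words straddle physical words, which is where the alignment loss comes from.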
aaronspink
06-Sep-2009, 16:49
You misunderstand me ... for every 64 bits of the normal memory 8 bits are set aside for ECC ... but for the memory nothing changes compared to a normal 64 bits wide bus.
Calculating the correct physical memory addresses requires a division, but another cycle of latency isn't that disastrous. Physical memory won't be 8 byte aligned anymore, so you have to throw away a couple of bytes on each access or design for partially filled cache lines ... even if you throw the bytes away you still have something like 6/8th of the original speed, which is enough IMO.
Methinks you need to think this through more while looking at a spec sheet for GDDR!
The specs of the memory are entirely irrelevant ... ECC is just some extra data for memory to store. You could do this in software right now ... hell, it has been done already on GPUs (http://saahpc.ncsa.illinois.edu/sessions/day2/session2/Maruyama_presentation.pdf). They stored data+ECC in 64+8 chunks in separate arrays rather than in 56+8 chunks in an interleaved array, but the former is the better scheme for software.
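For reference, here is what a 64+8 scheme with data and check bytes in separate arrays can look like. This is a textbook extended Hamming (72,64) SECDED code written in plain Python, not necessarily the code used in the linked presentation; `encode` and `decode` are hypothetical names.

```python
# Codeword positions 1..71; powers of two hold check bits, the other 64 hold data bits.
DATA_POSITIONS = [p for p in range(1, 72) if p & (p - 1)]
POSITION_TO_INDEX = {pos: i for i, pos in enumerate(DATA_POSITIONS)}


def _parity(x):
    return bin(x).count("1") & 1


def _hamming_check(data):
    """XOR of the codeword positions of all set data bits (7-bit value)."""
    check = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            check ^= pos
    return check


def encode(data):
    """Return the 8-bit ECC for a 64-bit word: 7 Hamming bits plus overall parity."""
    check = _hamming_check(data)
    overall = _parity(data) ^ _parity(check)
    return check | (overall << 7)


def decode(data, ecc):
    """Return (status, data), correcting any single-bit error in data or ecc."""
    syndrome = _hamming_check(data) ^ (ecc & 0x7F)
    odd = _parity(data) ^ _parity(ecc)  # 1 if an odd number of bits flipped
    if syndrome == 0 and not odd:
        return "ok", data
    if odd:
        if syndrome in POSITION_TO_INDEX:       # single flipped data bit
            return "corrected", data ^ (1 << POSITION_TO_INDEX[syndrome])
        if (syndrome & (syndrome - 1)) == 0:    # flipped bit was in the ECC byte
            return "corrected", data
    return "uncorrectable", data                # double-bit (or worse) error


# Data and ECC kept in separate arrays, as in the 64+8 scheme mentioned above.
data_words = [0x0123456789ABCDEF, 0xFFFFFFFFFFFFFFFF, 0]
ecc_bytes = bytes(encode(w) for w in data_words)

print(decode(data_words[0] ^ (1 << 17), ecc_bytes[0])[0])  # corrected
print(decode(data_words[0] ^ 0b11, ecc_bytes[0])[0])       # uncorrectable
```

Keeping the check bytes in their own array keeps the data array naturally aligned, at the cost of a second memory access per word.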
aaronspink
06-Sep-2009, 17:25
The specs of the memory are entirely irrelevant ... error correction codes are and always have been just some extra data for memory to store. You could do this in software right now ... hell, it has been done already on GPUs. They stored data+ECC in 64+8 chunks in separate arrays rather than in 56+8 chunks in an interleaved array, but the former is the better scheme for software.
yes, you can do it that way at a 2x+ bandwidth cost. You have for every write a Burst RMW and for every read an extra Burst Read.
It's only a RMW on a partial update. This is something you can't get rid of without using custom memory; a memory hub using DDR2/3 won't get rid of the RMW on partial updates ... ECC DIMMs are just 72-bit memory, there is no internal ECC logic.
With interleaved ECC codes a single burst read would be enough for any byte read.
PS. I guess it makes more sense to simply use 64+8 bits interleaved data+ECC ... alignment gets shot either way, so you might as well minimize the number of partial updates.
http://www.techreport.com/discussions.x/19141
Those tests have revealed that the memory interface is the most vulnerable point in the system, and GDDR5's error correction adds a measure of protection at that point.
Can't say I'm surprised.
3dilettante
24-Jun-2010, 20:58
I'm not surprised that AMD would say that when their GPU product does not have it.
It will be interesting to see how AMD reconciles that position with the fact that the products that will make up the bulk of its HPC endeavors sport ECC both for memory and for a large portion of their on-die memory.
It reinforces a kind of quality pyramid, with CPUs at the narrower top, and the compromised compliance and reliability of GPUs at the wider base.
EduardoS
24-Jun-2010, 23:54
Someone must tell AMD that farmers buy GPUs :)
http://devgurus.amd.com/thread/159124
The problem is a not so obvious missing local barrier needed after the line
seed = randomBlock[blockDim -1]; // blockDim=256
Whoops:
http://cs.stanford.edu/people/ihaque/talks/gpuser_lacss_oct_2010.pdf