Memory Errors in GPUs

Discussion in 'GPGPU Technology & Programming' started by Jawed, Jul 24, 2009.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
  2. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,810
    Likes Received:
    478
    Damn ... and those are their bloody Teslas, makes you wonder about the memory integrity on consumer cards.
     
  3. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,751
    Likes Received:
    128
    Location:
    Taiwan
    Since consumer cards are manufactured by third parties, I think this would depend on whether NVIDIA provides a good tool to test for memory defects. I would not hold too much confidence though. I've seen weird effects on consumer cards (from both NVIDIA and ATI) which look like some sort of memory error.
     
  4. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    Is software ECC applicable to CPUs as well? As pointed out by the article, regular PCs don't use ECC memory despite providing a lot of CPU power, memory capacity and memory bandwidth (with DDR3) for really cheap.
    That may be non-trivial to achieve system-wide, though.
     
  5. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    You left out an important detail:
    Memory errors are quite common - hence ECC and the like.

    "But even with 1.8% of systems
    confirmed to have memory issues, the memory failure rate is
    below of what has been reported on non-GPU clusters, e.g.,
    Li et al. [6] discovered hardware memory faults on 9 out of
    212 (or 4.5%) Ask.com servers."
     
  6. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,751
    Likes Received:
    128
    Location:
    Taiwan
    It could be more complicated on a CPU because a CPU's memory hierarchy is more complex. Normally GPUs have no cache, but CPUs generally have multiple levels of caches, and some are protected by ECC.

    I think if you really care about the correctness of your operations, it's probably better to just do some redundant (or verification) calculations.
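
    Something like this, roughly - a minimal CUDA sketch of redundant execution with verification (the kernel and buffer names here are just placeholders for whatever the real workload is):

    // Minimal sketch of redundant execution with verification: run the same
    // (placeholder) kernel twice into separate buffers and compare on the host.
    // A mismatch means one of the two runs picked up an error somewhere.
    #include <cstdio>
    #include <cstring>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void compute(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * 2.0f + 1.0f;       // stand-in for the real workload
    }

    int main()
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *d_in, *d_out_a, *d_out_b;
        cudaMalloc(&d_in, bytes);
        cudaMalloc(&d_out_a, bytes);
        cudaMalloc(&d_out_b, bytes);
        cudaMemset(d_in, 0, bytes);

        compute<<<(n + 255) / 256, 256>>>(d_in, d_out_a, n);
        compute<<<(n + 255) / 256, 256>>>(d_in, d_out_b, n);

        float *h_a = (float *)malloc(bytes);
        float *h_b = (float *)malloc(bytes);
        cudaMemcpy(h_a, d_out_a, bytes, cudaMemcpyDeviceToHost);
        cudaMemcpy(h_b, d_out_b, bytes, cudaMemcpyDeviceToHost);

        // Bitwise comparison; on mismatch you would re-run (or vote with a third copy).
        printf("runs %s\n", memcmp(h_a, h_b, bytes) ? "DIFFER - rerun" : "agree");
        return 0;
    }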
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    NVidia's testing of Tesla cards (since it seems as if NVidia does do some testing) may well be having a dramatic effect on the error rate.

    Were the Ask.com servers ever tested before Li came along? Were they tested by the manufacturer? By the installer? By the people running them? I don't know if Li's report is a useful control. It doesn't sound like it, frankly.

    Products that are faulty when brand new are quite common.

    Jawed
     
  8. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    Very interesting papers.

    I strongly suspect that their testing methodology for SERs has a hole in it if they didn't detect any (over a sufficient period of time). I couldn't say how to fix it... but I don't have any reason to believe that GDDRx is less susceptible to SERs than DDRx.


    Another related issue is the problem of SERs on-chip. SRAM arrays are a known problem (hence the use of ECC for most caches). Register files are generally more robust against SERs, but hardly infallible (some high-end CPUs have ECC on their integer reg files).

    Unfortunately, I don't know how much SRAM and RF* is used in a GPU, whether there is parity, etc. etc. I also don't know what the design of their register files is like. As always, you can improve your resilience to errors by increasing cell size, reducing frequency, etc. etc.

    * We do know that there's at least 88KB of constant, register and shared memories in each SM, and 30 SMs/GT200 so there's at least 2640KB per chip (assuming all three use RFs). In reality, there's going to be substantially more as I'm sure there's all sorts of RFs and SRAMs floating around in the TMUs, ROPs and memory controllers.

    I also have no idea how to measure the SERs in your register files or SRAMs without parity or ECC...



    It would definitely be interesting to see papers at ISSCC on these issues, but sadly NV and ATI have never gone down that route. Perhaps Intel will with Larrabee, which could encourage NV and ATI to follow. It'd certainly be great for GPU enthusiasts if they did.


    Obviously, ECC would be a big step forward, and I'm sure that will happen in the future...perhaps even the next generation of Tesla/Quadro cards.

    DK
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    The physical implementation of GDDR and DDR is slightly different (i.e. use of DIMMs for the latter), which might affect electrical performance, which might affect the behaviour of the refresh circuitry. How many soft errors are caused by faulty refresh?

    It's interesting that their testing methodology seems more thorough than MemtestG80.

    The trade-offs of using consumer GPUs for supercomputing:

    http://www.cs.ucf.edu/~zhou/GPGPU_v1.pdf

    The R-Scatter result for FFT presented in that paper (which shows relatively little change in performance) would be quite different with R700, because R700's ALUs would process the non-redundant code far more quickly (they're much stronger for integer/bitwise instructions).

    Additionally, those tests seem to have been done a long time ago - the PCI Express bus appears to be constraining performance much more severely than it would on a contemporary system. Then again, there are plenty of kernels out there whose copy time to/from the GPU is vanishingly small.

    Overall the paper's lack of breadth means it isn't much of a guide on the usefulness of ECC. But I think it can be argued that the R-Scatter and R-Thread techniques are not worth the hassle.

    If NVidia goes with memory hubs in its next major revision there's an opportunity to implement hubs that are specific to ECC.

    The other side of the coin is that there's no point implementing ECC when you could simply use redundant hardware. In a year's time the hardware to do the same computation will halve in cost. CPUs' price-performance curve isn't anything like as compelling, so ECC is more interesting. That is, of course, if the GPU is providing something like 20x or more speed-up for the entire application. If the GPU is only making the overall application 50% faster then you're in la-la land using GPUs.

    If GPUs are worth doing then arguably the ideal is to build a cluster out of consumer cards, so that the GPUs are easy to change over time - rather than being stuck with manufactured blades whose GPUs cannot be changed. Consumer cards are also cheaper to change when they fail. Consumer cards will tend to have less memory on them, though.

    If you build with consumer cards you have to test that they're working when you buy them. But since you have to test them regularly once they're in service, anyway, testing them when you buy them is a non-issue.

    Jawed
     
  10. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    You'd kinda expect a server farm with full error detection to report a higher error rate than partial software testing does...
     
  11. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    The actual cells in DDR and GDDR are basically the same thing. The macro-level difference is in the size of the sub-arrays. Basically there should be close to zero difference as far as particle-induced SER goes.

    Memtest86, and by extension MemtestG80, are trying to detect/induce static and margin-related errors. Neither is a good test for SER.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Aaron, can you describe or link the right way to test for soft errors?

    Jawed
     
  13. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    In any event where you are talking about externally induced soft errors in memory bit cells, you must first determine the polarity of the cells themselves. Are the cells inverted or not? You then create a test pattern that is most likely to show the effects of a bit flip if a strike occurs. If your standard particle would set a cell to 1 and your test pattern is such that the cell would be a 1 to begin with, the probability of detecting the strike goes way down.

    Once you have the pattern in memory, you slowly start reading through it, counting any mismatches against the pattern. You let it sit there for sufficient time to get a relevant statistical sample.

    If you are concerned about full characterization, you also do accelerated testing, either via artificial means such as an isotope source (there are actually several government labs that basically ONLY do this) or via high-altitude exposure (the intensity of things such as gamma rays at high altitude is significantly higher than at sea level or underground. Places like Sandia National Labs have issues with this in their compute arrays due to the high altitude there. 5500 feet may not seem like much, but it can change a 1 PPB issue into a <1 PPM issue, which with large compute arrays causes problems.)

    Doing repeated reading and writing of memory with various patterns really doesn't tell you as much as one would think about SER, except for things around margin. This is why Memtest is great for seeing whether your memory works but not so great at detecting particle-induced SER. For testing SER, you really want to treat the memory as a small particle detector, like a CCD, and then de-rate based on real-world pattern analysis (things like the probability that you are actively using that portion of memory, data patterns, etc.).
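
    Something like this, roughly - a CUDA sketch of the fill / dwell / read-back idea (the pattern, buffer size and dwell time below are placeholders; choosing the pattern properly needs the polarity analysis described above):

    // Rough sketch of an SER check: fill device memory with a fixed pattern,
    // leave it untouched for a long dwell period, then count words that no
    // longer match. The all-zeros pattern is a placeholder - as described
    // above, it should be chosen against the cells' polarity so that a strike
    // actually shows up as a flip.
    #include <cstdio>
    #include <unistd.h>
    #include <cuda_runtime.h>

    __global__ void fill(unsigned *buf, size_t n, unsigned pattern)
    {
        for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
             i < n; i += (size_t)gridDim.x * blockDim.x)
            buf[i] = pattern;
    }

    __global__ void check(const unsigned *buf, size_t n, unsigned pattern,
                          unsigned long long *errors)
    {
        for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
             i < n; i += (size_t)gridDim.x * blockDim.x)
            if (buf[i] != pattern)
                atomicAdd(errors, 1ULL);
    }

    int main()
    {
        const size_t n = 128ull << 20;          // 512 MiB of 32-bit words
        const unsigned pattern = 0x00000000u;   // placeholder test pattern
        unsigned *buf;
        unsigned long long *errors, flips = 0;
        cudaMalloc(&buf, n * sizeof(unsigned));
        cudaMalloc(&errors, sizeof(*errors));
        cudaMemset(errors, 0, sizeof(*errors));

        fill<<<1024, 256>>>(buf, n, pattern);
        cudaDeviceSynchronize();

        sleep(3600);                            // dwell time - hours or days in practice

        check<<<1024, 256>>>(buf, n, pattern, errors);
        cudaMemcpy(&flips, errors, sizeof(flips), cudaMemcpyDeviceToHost);
        printf("%llu words no longer match the pattern\n", flips);
        return 0;
    }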
     
  14. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    Is Ask's server farm running ECC memory?
    Even then, the memory ECC happens on the DIMM, so there could still be transport interference at the pins, sockets, etc.

    Even when swapping memory intensively we hardly run into ECC errors (way lower percentages than those reported for Ask), and then they always seem to be faulty DIMMs.
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    HPCwire - Reliable Memory: Coming to a GPU Near You

    http://www.hpcwire.com/features/Reliable-Memory-Coming-to-a-GPU-Near-You-56751022.html

    The final sentence, I think, is likely to be the clincher for GT300:

    Which is why I think memory hubs will be part of GT300's architecture, supporting high capacity and ECC DDR2/DDR3. GT200 demonstrated NVidia's zeal in making CUDA work at any cost and I believe GT300 is more, much more, of the same. I expect it to be pretty cool.

    Jawed
     
  17. ChrisRay

    ChrisRay
    Veteran

    Joined:
    Nov 25, 2002
    Messages:
    2,234
    Likes Received:
    26
    Good find Jawed.
     
  18. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Oh please, there's almost no way an SMB is useful over integrating the ECC into the GPU itself (however hacky it may be). And let's not forget, after the RV770 shock, their first priority is going to be graphics and games, not computing.
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Hacking ECC in software? That destroys performance as it is computationally intensive.

    Even though that paragraph unrealistically makes a baseline out of 140GB/s on GTX285, you can see that the computational cost of ECC is monstrous.

    Obviously, one option would be to make an instruction that does this, so that it would still be a software operation.
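
    To put a rough number on that cost: even a plain (72,64) SEC-DED Hamming code is seven masked popcounts plus an overall parity per 64-bit word, and a check/scrub pass pays all of that again plus syndrome handling. A host-side sketch of the encode (illustrative only, not any vendor's actual scheme - a GPU version would use __popc/__popcll per word):

    // Back-of-the-envelope for software SEC-DED over a 64-bit word:
    // 7 Hamming check bits (each a masked popcount) plus one overall
    // parity bit for double-error detection. Illustrative encode only.
    #include <cstdint>
    #include <cstdio>

    static uint64_t mask[7];    // data bits covered by each Hamming check bit

    static void build_masks(void)
    {
        // Data bits occupy code-word positions 3..71, skipping the
        // power-of-two positions (4, 8, 16, 32, 64) reserved for check bits.
        unsigned pos = 3;
        for (int d = 0; d < 64; ++d, ++pos) {
            while ((pos & (pos - 1)) == 0)
                ++pos;
            for (int j = 0; j < 7; ++j)
                if (pos & (1u << j))
                    mask[j] |= 1ull << d;
        }
    }

    static uint8_t secded_encode(uint64_t data)
    {
        unsigned check = 0;
        for (int j = 0; j < 7; ++j)                             // 7 masked popcounts
            check |= (unsigned)(__builtin_popcountll(data & mask[j]) & 1) << j;
        unsigned overall = (__builtin_popcountll(data) + __builtin_popcount(check)) & 1;
        return (uint8_t)((overall << 7) | check);               // 8 check bits per word
    }

    int main()
    {
        build_masks();
        printf("check byte: 0x%02x\n", secded_encode(0x0123456789abcdefull));
        return 0;
    }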

    Making the DDR I/O on the die include ECC capability has a fairly severe area cost, though if GT300 is the only chip with it, size is arguably not the most pressing parameter.

    GT300 has spent more time in design than RV770 has been a retail product.

    I think ECC is a white elephant, for what it's worth. The evidence points towards soft errors in GPU GDDR being negligible (faulty chips are the problem, and such faulty chips are relatively easy to weed out - plus system-level error detection/correction requires more byzantine redundancy mechanisms, way beyond ECC). But it seems "customers want it", so that's what NVidia's chasing, and only NVidia will be providing it any time soon.

    Jawed
     
  20. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    The evidence for your case is sorely lacking, Jawed. There is one paper with less-than-sound methods, and mountains of evidence that software "ECC" massively under-reports actual error counts. We haven't even gotten into all the other various factors such as location and elevation.

    And I still remain skeptical that anyone will provide a real ECC system for GPUs anytime soon.
     