Memory Errors in GPUs

Discussion in 'GPGPU Technology & Programming' started by Jawed, Jul 24, 2009.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    How many errors does doctrine dictate should have been detected? They detected none.

    What location and elevation would be required to achieve a zero-soft-error test, according to the doctrine that you're going by?

    And Google's evidence clearly shows that faulty chips and interface physical properties are swamping the cosmological factor you keep citing. The cosmological factor for which there is no publicly documented proof with contemporary large cluster systems based on commodity PC hardware.

    Unless someone would like to link it? I can't find anything. The Ziegler et al. paper from 1996 keeps coming up though.

    Jawed
     
  2. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    Lead-lined bunker, underground and away from any radiation sources : )

    Google's systems aren't quite commodity PC hardware. They use commodity CPUs and other commodity parts, but are utterly unlike anything you can buy. Also, they run very specific workloads, which have different characteristics than what other users may need.

    A lot of the internal studies by semiconductor vendors aren't published at all... but when semiconductor companies target a 7-year life, they do a hell of a lot of testing to determine that.

    David
     
  3. MfA

    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,710
    Likes Received:
    458
    No, simply putting the ECC hardware in the GPU.

    As Rolf was kind enough to point out for us, we were all being silly trying to use the ECC memory schemes from PCs. No need to use DDR2/3 ... just use standard-bus-width GDDR5. So for every 64-byte burst to/from memory, dedicate 8 bytes to ECC, done. The cheapest and most obvious solution.

    With GDDR5 memory the memory hub makes no sense ... and they don't need to use anything else.

    PS. Well, obvious to people smarter than me, obviously :)
     
    #23 MfA, Sep 5, 2009
    Last edited by a moderator: Sep 5, 2009
  4. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    It's not that simple at all, and there are other problems as well.

    David
     
  5. MfA

    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,710
    Likes Received:
    458
    Shoot.
     
  6. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Sure, it only requires COMPLETELY CUSTOM MEMORY!
     
  7. MfA

    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,710
    Likes Received:
    458
    You misunderstand me ... for every 64 bits of the normal memory, 8 bits are set aside for ECC ... but for the memory nothing changes compared to a normal 64-bit-wide bus.

    Calculating the correct physical memory addresses requires a division, but another cycle of latency isn't that disastrous. Physical memory won't be 8-byte aligned anymore, so you have to throw away a couple of bytes on each access or design for partially filled cache lines ... even if you throw the bytes away you still have something like 6/8ths of the original speed, which is enough IMO.
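
A minimal sketch of the address calculation described above, assuming the 56+8 layout mentioned later in the thread: each 64-byte burst carries 56 data bytes followed by 8 ECC bytes. The function names are made up for illustration; the division by 56 is the extra step the post refers to.

```python
BURST_BYTES = 64                       # one memory burst, as in the thread
ECC_BYTES = 8                          # bytes per burst set aside for ECC
DATA_BYTES = BURST_BYTES - ECC_BYTES   # 56 usable data bytes per burst (assumed layout)

def logical_to_physical(addr):
    """Map a logical byte address to its physical byte address."""
    burst, offset = divmod(addr, DATA_BYTES)   # the division the post mentions
    return burst * BURST_BYTES + offset

def ecc_location(addr):
    """Physical address of the first ECC byte in the burst holding `addr`."""
    burst = addr // DATA_BYTES
    return burst * BURST_BYTES + DATA_BYTES

def usable_fraction():
    """Share of raw capacity (and of each burst) left for data."""
    return DATA_BYTES / BURST_BYTES
```

Under this assumed layout, logical byte 56 lands at physical byte 64, and 7/8 of each burst is data, which is an upper bound on the usable bandwidth before any bytes are thrown away for alignment.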
     
  8. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Methinks you need to think this through more while looking at a spec sheet for GDDR!
     
  9. MfA

    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,710
    Likes Received:
    458
    The specs of the memory are entirely irrelevant ... ECC is just some extra data for memory to store. You could do this in software right now ... hell, it has been done already on GPUs. They stored data+ECC in 64+8 chunks in separate arrays rather than in 56+8 chunks in an interleaved array, but the former is the better scheme for software.
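
For illustration only, here is a rough sketch of the "64+8 in separate arrays" idea in software: one check byte per 64-bit word, kept in its own array. The code is a textbook extended Hamming (72,64) SECDED code, which is an assumption on my part; the thread does not say which code the GPU implementations used, and all names here are hypothetical.

```python
# Positions 1..71 of an extended Hamming code: powers of two hold check bits,
# the remaining 64 positions hold the data bits.
DATA_POSITIONS = [p for p in range(1, 72) if p & (p - 1)]
POSITION_TO_BIT = {pos: bit for bit, pos in enumerate(DATA_POSITIONS)}

def parity(x):
    return bin(x).count("1") & 1

def check_byte(word):
    """Return 8 check bits (7 Hamming + 1 overall parity) for a 64-bit word."""
    hamming = 0
    for bit, pos in enumerate(DATA_POSITIONS):
        if (word >> bit) & 1:
            hamming ^= pos
    overall = parity(word) ^ parity(hamming)
    return hamming | (overall << 7)

def correct(word, check):
    """Return (word, status); fixes single-bit errors, flags double-bit errors."""
    diff = check_byte(word) ^ check
    if diff == 0:
        return word, "ok"
    if parity(diff) == 0:
        return word, "uncorrectable"          # even number of flipped bits
    syndrome = diff & 0x7F
    if syndrome in POSITION_TO_BIT:           # a data bit flipped
        return word ^ (1 << POSITION_TO_BIT[syndrome]), "corrected"
    if syndrome & (syndrome - 1) == 0:        # a check bit flipped, data intact
        return word, "corrected"
    return word, "uncorrectable"              # syndrome outside the code

def store(data, ecc, index, word):
    """Write a word to the data array and its check byte to the ECC array."""
    data[index] = word
    ecc[index] = check_byte(word)

def load(data, ecc, index):
    """Read a word and verify or correct it against the separate ECC array."""
    return correct(data[index], ecc[index])
```

A possible usage: keep `data` as a list of 64-bit integers and `ecc` as a `bytearray` of the same length, call `store` on every write and `load` on every read. Every access then touches two arrays.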
     
  10. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Yes, you can do it that way at a 2x+ bandwidth cost. For every write you have a burst read-modify-write (RMW), and for every read an extra burst read.
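
A back-of-the-envelope sketch of that cost, counting memory bursts per access. It assumes no ECC burst is ever cached or shared between accesses, and that a data write covers a full data burst but only part of an ECC burst; both are simplifying assumptions, and the function names are made up.

```python
def bursts_unprotected(reads, writes):
    """No ECC: one burst per access."""
    return reads + writes

def bursts_separate_arrays(reads, writes):
    """ECC kept in a separate array.

    Read:  data burst + ECC burst.
    Write: data burst + ECC burst read + ECC burst write (the RMW).
    """
    return 2 * reads + 3 * writes

def bursts_interleaved(reads, full_writes, partial_writes):
    """ECC stored inside each data burst.

    Reads and full-burst writes cost one burst; a partial write must
    read the burst, update data and ECC, then write it back.
    """
    return reads + full_writes + 2 * partial_writes
```

With an even mix of 100 reads and 100 writes, this model gives 500 bursts for separate arrays against 200 unprotected, i.e. 2.5x, in line with the "2x+" figure.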
     
  11. MfA

    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,710
    Likes Received:
    458
    It's only an RMW on a partial update. This is something you can't get rid of without using custom memory; a memory hub using DDR2/3 won't get rid of the RMW on partial updates ... ECC DIMMs are just 72-bit-wide memory, there is no internal ECC logic.

    With interleaved ECC codes a single burst read would be enough for any byte read.

    PS. I guess it makes more sense to simply use 64+8 bits interleaved data+ECC ... alignment gets shot either way, so you might as well minimize the number of partial updates.
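
A small sketch of what "alignment gets shot" looks like for the 64+8 bit interleaving, assuming 9-byte units (8 data bytes plus 1 ECC byte) packed back to back and 64-byte bursts. The tight packing is my assumption; the post does not spell out the layout, and the names are hypothetical.

```python
UNIT_BYTES = 9     # 8 data bytes + 1 ECC byte ("64+8 bits"), packed back to back
BURST_BYTES = 64   # burst size used earlier in the thread

def unit_bursts(k):
    """Indices of the bursts that the k-th 9-byte unit touches."""
    start = k * UNIT_BYTES
    end = start + UNIT_BYTES - 1
    return list(range(start // BURST_BYTES, end // BURST_BYTES + 1))

def straddle_fraction(units):
    """Fraction of the first `units` units that cross a burst boundary."""
    crossing = sum(len(unit_bursts(k)) > 1 for k in range(units))
    return crossing / units
```

Under these assumptions, `straddle_fraction(64)` covers one full period of the pattern (576 bytes, 9 bursts) and reports what share of words would need two bursts unless the layout pads each burst instead.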
     
    #31 MfA, Sep 6, 2009
    Last edited by a moderator: Sep 6, 2009
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Neutron Beam Testing

    http://www.techreport.com/discussions.x/19141

    Can't say I'm surprised.
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,119
    Likes Received:
    2,864
    Location:
    Well within 3d
    I'm not surprised that AMD would say that when their GPU product does not have it.

    It will be interesting to see how AMD reconciles that position with the fact that the products that will make up the bulk of its HPC endeavors sport ECC both for memory and for a large portion of their on-die memory.
    It reinforces a kind of quality pyramid, with CPUs at the narrower top, and the compromised compliance and reliability of GPUs at the wider base.
     
  14. EduardoS

    Newcomer

    Joined:
    Nov 8, 2008
    Messages:
    131
    Likes Received:
    0
    Someone must tell AMD that farmers buy GPUs :)
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
  16. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    980
    Likes Received:
    268
    GPU Errors on HPC Systems: Characterization, Quantification and Implications for Architects and Operations

    http://on-demand.gputechconf.com/gtc/2015/video/S5566.html

    Interesting talk on this subject, with details on Kepler's data protection features.

    Reposted here. Thanks, Jawed.
     