Nice follow-up to this thread:
https://forum.beyond3d.com/threads/memory-errors-in-gpus.46616/
I asserted back then that simple testing during commissioning would eliminate pretty much all the errors. And, vindicated.
The assertion about rigorous culling at the outset was one claim.
What of the white elephant claim?
The links to the papers are unfortunately broken, but going by the description of a failure to detect any soft errors, the paper was off by approximately 120000 errors in ORNL's experience.
There are 18688 nodes with 6GB of GDDR5 each, so 112128 GB in total. That 112TB saw 120K SBEs (excluding the 10 that were presumed L2 validation test escapes), plus a smaller number of DBEs and a conglomeration of the two in the page-retirement error category.
For SBEs alone, that is 1 SBE per GB over 22 months.
This is about an order of magnitude lower than the rule of thumb that has been used over the years for CPU systems of 1 SBE per GB per month.
22 months being 16K hours, Titan had a rate of 7-8 bit flips per hour, possibly a little under 7 in DRAM if the rough ratio of on-die L2 to GDDR5 errors held.
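For anyone who wants to check the back-of-envelope numbers above, here is a rough sketch of the arithmetic. It assumes 18688 nodes (the count that matches the 112128 GB total), the rounded 120K SBE figure, and an average month of 730 hours; the variable names are just for illustration.

```python
# Rough check of the SBE arithmetic in the post.
# Inputs are the rounded figures quoted above; HOURS_PER_MONTH is an assumption.
NODES = 18_688            # node count consistent with 112128 GB total
GB_PER_NODE = 6           # GDDR5 per node
SBE_COUNT = 120_000       # rounded single-bit error count
MONTHS = 22               # observation window
HOURS_PER_MONTH = 730     # assumed average month (8760 hours / 12)

total_gb = NODES * GB_PER_NODE                 # total GDDR5 capacity
sbe_per_gb = SBE_COUNT / total_gb              # over the whole window
sbe_per_gb_month = sbe_per_gb / MONTHS         # comparable to the rule of thumb
total_hours = MONTHS * HOURS_PER_MONTH         # roughly 16K hours
sbe_per_hour = SBE_COUNT / total_hours         # system-wide bit flips per hour

# Rule of thumb quoted in the post: 1 SBE per GB per month.
ratio_to_rule_of_thumb = 1.0 / sbe_per_gb_month

print(f"total capacity:      {total_gb} GB")
print(f"SBE per GB (window): {sbe_per_gb:.2f}")
print(f"SBE per GB-month:    {sbe_per_gb_month:.3f}")
print(f"SBE per hour:        {sbe_per_hour:.1f}")
print(f"rule of thumb / Titan: {ratio_to_rule_of_thumb:.0f}x")
```

With these rounded inputs the gap to the rule of thumb comes out somewhat larger than a flat 10x, so "about an order of magnitude" is, if anything, on the conservative side.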
It's not mentioned in the presentation, but I think it likely they also have figures for the ECC-protected RAM on the CPU nodes. Comparing those numbers could have shed light on whether there is some additional factor impacting the reliability of the two memory types.
Unknown in this is the overall effect of Nvidia's page retirement functionality. By cutting off pages that start showing signs of degradation, it removes a source of chronic memory errors as systems age. Titan has apparently made use of it to stave off node loss due to degradation of memory cells.
The system came down from an absolutely egregious error rate due to the test escape scenario to a figure that would be considered at or below the old ballpark figure used to justify ECC for workloads or clients sensitive to data corruption. A factor of 10 is not likely enough to settle the question, as there are factors that can push the number either way. ORNL's aggressive pre-screening and special effort by Cray and Nvidia may not extend over the full range of systems that the rule of thumb did, or necessarily take factors like geography into account. ORNL's takeaway is that its early-life debugging and the long-term survival of nodes are significantly helped by keeping the white elephant around.
My gaze goes to the items not mentioned, like the microcontrollers and command processors used by the GPUs, and whether they are ECC protected, since on-chip register files showed up as a major component of DBEs.
Of all the characters Jawed typed, the acronym SXM accounted for 100% of the errors, and of all the 384 characters in his post (not including the hyperlink or spaces), 98.4375% were error-free. Of those errors, only two of the three characters in the three-letter acronym SXM were incorrect. The remaining 378 characters in the post were error-free. If you swap the last two characters of that faulty acronym (which appeared only 3 times) you'd have eliminated all the errors in the post.
That is a multi-character error in a sentence. Multiple single character errors in a sentence or a double-character error will require sentence retirement if the feature is activated in the Beyond3d driver.