Nvidia BigK GK110 Kepler Speculation Thread

Good read, thanks.

One question though. Slide 11 says "leakage goes up with powered transistor count [and] doesn't matter what the frequency is," so wouldn't the 2x part on slide 24 have more leakage than the 1x part and so perform lower?

"Lower" .. in performance ? no, just will need more voltage for function at his rated performance..
 
There might be a clue in this presentation (link to parent page). Pages 11-25.

Thanks for that link..some very good information there.

GK180 was renamed to GK110B and replaced the original GK110 in all of Nvidia's product line. It has lower power consumption and a few bug fixes.

It should have been named GK110B from the beginning to avoid all this confusion.

So this GK180 or GK110B is basically the same GK110 with no architectural/cache changes correct? Any idea on the die size in relation to GK110? Would be very interesting to see what the die size of GK210 is as well.
 
So this GK180 or GK110B is basically the same GK110 with no architectural/cache changes correct? Any idea on the die size in relation to GK110? Would be very interesting to see what the die size of GK210 is as well.

The L1 cache is fixed on GK110B so that it can actually be used as designed. The L1 cache on GK110 can't be used in CUDA applications, for example. GK110B has some power savings as well over GK110, but I believe the die size stays the same.

I don't know the die size of GK210, but I'd believe it's quite a bit larger: they added an additional 3.75 MB of register file and about 1 MB of L1 cache/scratchpad. GK210 might be the biggest GPU Nvidia has ever produced, since GK110 was up there, already pretty close to GT200 (the prior record holder).
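For what it's worth, the 3.75 MB figure drops straight out of the per-SMX doubling, assuming 15 SMXs per die and the published per-SMX sizes (register file 256 KB → 512 KB, L1/shared 64 KB → 128 KB):

```python
# Back-of-envelope for the extra on-chip storage, assuming 15 SMXs per die
# and the published per-SMX sizes for GK110 vs. GK210:
smx_count = 15
extra_regfile_kb = smx_count * (512 - 256)   # register file doubled per SMX
extra_l1_kb      = smx_count * (128 - 64)    # L1/shared memory doubled per SMX
print(extra_regfile_kb / 1024, "MB extra register file")    # 3.75 MB
print(extra_l1_kb, "KB extra L1/scratchpad (~1 MB)")        # 960 KB
```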
 
GPU Errors on HPC Systems: Characterization, Quantification and Implications for Architects and Operations

http://on-demand.gputechconf.com/gtc/2015/video/S5566.html
The fastest US supercomputer, Titan, installed at Oak Ridge National Laboratory, has more than 18,000 GPUs that are used for a broad range of scientific workloads. In this talk, Rogers points out that while the performance efficiency of GPUs is well understood, their resilience characteristics in a large-scale computing system have not been fully evaluated. He goes on to describe a study, drawn from 300,000,000 Titan node hours, that was undertaken to boost understanding of GPU errors on large-scale heterogeneous machines. The work has implications for future GPU architects and HPC centers that use graphics processors.

Interesting talk on this subject, with details on Kepler's data protection features.
 
Nice follow-up to this thread:

https://forum.beyond3d.com/threads/memory-errors-in-gpus.46616/

I asserted back then that simple testing during commissioning would eliminate pretty much all the errors. And, vindicated.

10 GPU SXMs produced 98% of all the errors. The remaining 18678 GPUs each suffered 6.5 single-bit-errors on average in "2 years" (I'm not sure of the precise duration, close to 91 weeks it seems).

But only 899 SXMs had any single-bit-errors. That's 5%. So, erm just throwing away these faulty 5% of the SXMs would have eliminated all SBEs :p
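A quick sanity check on those numbers (the ~91-week study window is only my guess):

```python
# Sanity check on the figures above (study window of ~91 weeks is a guess):
total_gpus    = 18688                 # Titan's GPU count
outlier_gpus  = 10                    # produced ~98% of all SBEs
gpus_with_sbe = 899
avg_sbe_rest  = 6.5                   # mean SBEs per GPU outside the outliers

print(f"{gpus_with_sbe / total_gpus:.1%} of GPUs ever logged an SBE")                   # ~4.8%
print(f"~{(total_gpus - outlier_gpus) * avg_sbe_rest:,.0f} SBEs outside the outliers")  # ~121k
```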
 
Thanks for the correction, Bob. I got so frustrated trying to scroll backwards and forwards through that video to view individual slides, I missed that error.
 
This goes to one of my old assertions that a simple proof-read can eliminate pretty much all errors in any given post.

Of all the characters Jawed typed, the acronym SXM accounted for 100% of the errors, and of all the 384 characters in his post (not including the hyperlink or spaces), 98.4375% of them were error-free. Of those errors, only two of the three characters in the three-letter acronym SXM were incorrect. The remaining 378 characters in the post were error-free. If you swap the last two characters of that faulty acronym (which appeared only 3 times) you'd have eliminated all the errors in the post.
 
Nice follow-up to this thread:

https://forum.beyond3d.com/threads/memory-errors-in-gpus.46616/

I asserted back then that simple testing during commissioning would eliminate pretty much all the errors. And, vindicated.
The assertion that rigorous culling at the outset would catch the bad hardware was one of them.
What of the white elephant claim?

The links to the papers are unfortunately broken, but going by the description of a failure to detect any soft errors, the paper was off by approximately 120000 errors in ORNL's experience.

There are 18688 nodes with 6 GB of GDDR5 each, so 112128 GB in total. That 112 TB had 120K SBEs (excluding the 10 GPUs that were presumed L2 validation test escapes), plus a smaller number of DBEs and a conglomeration of the two in the page-retirement error category.
For SBEs alone, that is 1 SBE per GB over 22 months.
This is about an order of magnitude lower than the rule of thumb that has been used over the years for CPU systems of 1 SBE per GB per month.
22 months being 16K hours, Titan had a rate of 7-8 bit flips per hour, possibly a little under 7 in DRAM if the rough ratio of on-die L2 to GDDR5 errors held.
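Spelling the same arithmetic out (counts and capacity as given in the presentation; 730 hours per month is my rounding):

```python
# Same back-of-envelope, spelled out:
nodes       = 18688
gb_per_node = 6
total_gb    = nodes * gb_per_node     # 112128 GB, ~112 TB
sbes        = 120_000                 # excluding the 10 test-escape outliers
months      = 22
hours       = months * 730            # ~16K hours

print(f"{sbes / total_gb:.2f} SBE per GB over the period")          # ~1.07
print(f"{sbes / total_gb / months:.3f} SBE per GB per month")       # vs. the old ~1/GB/month rule
print(f"{sbes / hours:.1f} bit flips per hour across the machine")  # ~7.5
```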
It's not mentioned in the presentation, but I think it likely they also have figures for the ECC-protected RAM on the CPU nodes. Comparing those numbers could have shed light on whether there is some additional factor affecting the reliability of the memory types.

Unknown in this is the overall effect of Nvidia's page retirement functionality. By cutting off pages that start showing signs of degradation, it removes a source of chronic memory errors as systems age. Titan has apparently made use of it to stave off node loss due to degradation of memory cells.
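For anyone unfamiliar with the feature, here's roughly how I understand the retirement policy to behave; the thresholds (two SBEs at the same address, or a single DBE) are from NVIDIA's dynamic page retirement documentation as I remember it, so treat them as assumptions:

```python
# Sketch of a page-retirement policy along the lines NVIDIA documents for
# dynamic page retirement; thresholds are my recollection, not gospel.
from collections import defaultdict

SBE_THRESHOLD = 2            # retire after repeated single-bit errors at one address
sbe_counts    = defaultdict(int)
retired_pages = set()

def on_ecc_error(page_addr, double_bit):
    """Blacklist a framebuffer page once it looks chronically bad."""
    if page_addr in retired_pages:
        return
    if double_bit or sbe_counts[page_addr] + 1 >= SBE_THRESHOLD:
        retired_pages.add(page_addr)     # page is excluded at the next re-init
    else:
        sbe_counts[page_addr] += 1       # corrected error; remember the address
```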

The system came down from an absolutely egregious error rate, caused by the test-escape scenario, to a figure that would be considered at or below the old ballpark used to justify ECC for workloads or clients sensitive to data corruption. A factor of 10 is likely not enough to sway that decision, as there are factors that can push things either way. ORNL's aggressive pre-screening and the special effort by Cray and Nvidia may not extend across the full range of systems the rule of thumb did, or necessarily take factors like geography into account. ORNL's takeaway is that its early-life debugging and the long-term survival of nodes are significantly helped by keeping the white elephant around.

My gaze goes to the items not mentioned, like the microcontrollers and command processors used by the GPUs, and whether they are ECC protected, since on-chip register files showed up as a major component of DBEs.

Of all the characters Jawed typed, the acronym SXM accounted for 100% of the errors, and of all the 384 characters in his post (not including the hyperlink or spaces), 98.4375% of them were error-free. Of those errors, only two of the three characters in the three-letter acronym SXM were incorrect. The remaining 378 characters in the post were error-free. If you swap the last two characters of that faulty acronym (which appeared only 3 times) you'd have eliminated all the errors in the post.

That is a multi-character error in a sentence. Multiple single-character errors in a sentence, or a double-character error, will require sentence retirement if the feature is activated in the Beyond3D driver.
 
You need to start again with all 899 SXMs removed.

As for the white elephant, well, that's effectively been proven. Doing ECC with 7/8 data and 1/8 ECC (suffering bandwidth/capacity loss) in GDDR5 was an intermediate technique that hadn't arisen at that point in the discussion. It's not a full hardware parity solution and it's not the full software technique that had been discussed. The software techniques were very clumsy, and there's a substantial implementation problem in getting ECC on GDDR5 with GPUs, which NVidia neatly avoided.
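Taking that 7/8 : 1/8 split at face value, the cost on a Titan-class 6 GB board works out something like this (assuming the reservation is a straight eighth of capacity):

```python
# Cost of carving ECC check bits out of GDDR5 capacity, assuming a straight 1/8 reservation:
raw_gb    = 6.0                     # K20X-class board
usable_gb = raw_gb * 7 / 8          # 5.25 GB visible with ECC enabled
overhead  = 1 / 8                   # 12.5% of capacity, plus extra traffic for the check bits
print(usable_gb, f"{overhead:.1%}")
```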

Obviously, as an experiment in ECC topics, Titan succeeded in generating useful data which would have been much harder or impossible without hardware level ECC.

The presentation doesn't help us understand if the DBEs correlated with the SBEs - were known-bad cards the cause of DBEs? Once the 5% of bad SXMs have been weeded out, how many DBEs were experienced?

I seem to have all the papers I linked in that thread. I imagine they're googlable, but if not I suppose I can re-distribute them.
 
You need to start again with all 899 SXMs removed.
Why?
Only the 10 outliers were attributable to a hardware issue out of the factory; the other nodes were most likely allowed to continue operating, as these were soft errors correctable by ECC that did not occur again. Even if they did occur again in the same region, the driver would retire that page.

The SBE error case is what the earlier paper failed to detect, which prompted the most skepticism.
In the case of Titan, it would also be asking for the removal of 899 blades, since per the discussion of the PCIe connector failure problem, these blades are a package deal.

The presentation doesn't help us understand if the DBEs correlated with the SBEs - were known-bad cards the cause of DBEs? Once the 5% of bad SXMs have been weeded out, how many DBEs were experienced?
The presentation gave a geographic distribution of DBEs. The scatter was much more random than the highly concentrated SBE case.
I am a little unclear on whether DBE failures were considered a failure for the running kernel, or were considered a reason to scratch a node. The followup on Nvidia's page retirement seemed to indicate that nodes could remain in service after a DBE.
 
GTX Titan OC vs Radeon 6400


Thank you, I find this very amusing. It now takes 10 years for a lowest-end card to match a former champion, which is another measure of slowing performance improvements.
 