Nvidia BigK GK110 Kepler Speculation Thread

Discussion in 'Architecture and Products' started by A1xLLcqAgt0qc2RyMz0y, Apr 21, 2012.

  1. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    "Lower" .. in performance ? no, just will need more voltage for function at his rated performance..
     
  2. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    647
    Likes Received:
    92
    Thanks for that link; some very good information there.

    So this GK180 or GK110B is basically the same GK110 with no architectural/cache changes, correct? Any idea of the die size relative to GK110? It would be very interesting to see the die size of GK210 as well.
     
  3. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    The L1 cache is fixed on GK110B so that it can actually be used as designed; the L1 cache on GK110 can't be used in CUDA applications, for example. GK110B also has some power savings over GK110, but I believe the die size stays the same.

    I don't know the die size of GK210, but I'd expect it to be quite a bit larger: they added an additional 3.75 MB of register file and about 1 MB of L1 cache/scratchpad. GK210 might be the biggest GPU Nvidia has ever produced; GK110 was already up there, pretty close to GT200 (the prior record holder).
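    For anyone who wants to poke at this: on Kepler, L1 and shared memory are carved out of the same per-SMX SRAM array (64 KB, doubled to 128 KB on GK210), and the split is requested per kernel through the standard runtime call. A minimal sketch against the CUDA runtime; the kernel is just a placeholder. (On the original GK110, if I remember the tuning guide right, global loads bypass L1 regardless; GK110B/GK210 allow opting in with -Xptxas -dlcm=ca.)

    Code:
    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel standing in for any real workload.
    __global__ void scale(float* data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main()
    {
        // Ask the driver to give this kernel the 48 KB L1 / 16 KB shared
        // split (cudaFuncCachePreferShared would flip it, and
        // cudaFuncCachePreferEqual gives 32/32 on Kepler). This is a
        // hint, not a guarantee.
        cudaFuncSetCacheConfig(scale, cudaFuncCachePreferL1);

        const int n = 1 << 20;
        float* d = nullptr;
        cudaMalloc(&d, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
        printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(d);
        return 0;
    }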
     
  4. spworley

    Newcomer

    Joined:
    Apr 19, 2013
    Messages:
    146
    Likes Received:
    190
    GK110B also added GPU Boost.
     
  5. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    988
    Likes Received:
    280
    GPU Errors on HPC Systems: Characterization, Quantification and Implications for Architects and Operations

    http://on-demand.gputechconf.com/gtc/2015/video/S5566.html
    Interesting talk on this subject, with details on Kepler's data protection features.
     
    Lightman likes this.
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Nice follow-up to this thread:

    https://forum.beyond3d.com/threads/memory-errors-in-gpus.46616/

    I asserted back then that simple testing during commissioning would eliminate pretty much all the errors. And, vindicated.

    10 GPU SXMs produced 98% of all the errors. The remaining 18678 GPUs each suffered 6.5 single-bit-errors on average in "2 years" (I'm not sure of the precise duration, close to 91 weeks it seems).

    But only 899 SXMs had any single-bit-errors. That's 5%. So, erm just throwing away these faulty 5% of the SXMs would have eliminated all SBEs :razz:
     
    #1866 Jawed, Mar 29, 2015
    Last edited: Mar 29, 2015
    Grall, entity279 and Lightman like this.
  7. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    SXM != SMX
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Thanks for the correction, Bob. I got so frustrated trying to scroll back and forth through that video to view individual slides that I missed that error.
     
  9. homerdog

    homerdog donator of the year
    Legend Veteran Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,153
    Likes Received:
    928
    Location:
    still camping with a mauler
    This goes to one of my old assertions that a simple proof-read can eliminate pretty much all errors in any given post.

    Of all the characters Jawed typed, the acronym SXM accounted for 100% of the errors, and of all the 384 characters in his post (not including the hyperlink or spaces), 98.4375% of them were error-free. Of those errors, only two of the three characters in the three-letter acronym SXM were incorrect. The remaining 378 characters in the post were error-free. If you swap the last two characters of that faulty acronym (which appeared only 3 times) you'd have eliminated all the errors in the post.
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The assertion about rigorous culling at the outset was one of them.
    What of the white elephant claim?

    The links to the papers are unfortunately broken, but going by the description of a failure to detect any soft errors, the paper was off by approximately 120000 errors in ORNL's experience.

    There are 18,688 nodes with 6GB of GDDR5 each, so 112,128 GB in total. 112TB had 120K SBEs (excluding the 10 GPUs presumed to be L2 validation test escapes), plus a smaller number of DBEs and a conglomeration of the two in the page-retirement error category.
    For SBEs alone, that is 1 SBE per GB over 22 months.
    This is about an order of magnitude lower than the rule of thumb that has been used over the years for CPU systems of 1 SBE per GB per month.
    22 months being 16K hours, Titan had a rate of 7-8 bit flips per hour, possibly a little under 7 in DRAM if the rough ratio of on-die L2 to GDDR5 errors held.
    It's not mentioned in the presentation, but I think it likely they also have figures for the ECC-protected RAM on the CPU nodes. Comparing those numbers could have shed light on whether some additional factor impacts the reliability of either memory type.
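    Spelling that arithmetic out (all figures approximate, as read off the presentation):

    Code:
    #include <cstdio>

    int main()
    {
        const double nodes   = 18688.0;   // Titan compute nodes, one K20X each
        const double gb_each = 6.0;       // GDDR5 per GPU
        const double sbes    = 120000.0;  // SBEs, 10 outlier boards excluded
        const double months  = 22.0;
        const double hours   = months * 30.4 * 24.0;      // ~16K hours

        const double total_gb = nodes * gb_each;          // 112128 GB
        printf("SBEs per GB over the period: %.2f\n", sbes / total_gb);          // ~1.07
        printf("SBEs per GB per month:       %.3f\n", sbes / total_gb / months); // ~0.049
        printf("fleet-wide SBEs per hour:    %.1f\n", sbes / hours);             // ~7.5
        return 0;
    }

    As a cross-check, 120000 SBEs over the 18678 non-outlier GPUs is about 6.4 each, in line with the 6.5 average quoted above.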

    Unknown in this is the overall effect of Nvidia's page retirement functionality. By cutting off pages that start showing signs of degradation, it removes a source of chronic memory errors as systems age. Titan has apparently made use of it to stave off node loss due to degradation of memory cells.
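    The retirement state is at least queryable per board, for what it's worth. A sketch against the NVML C API (device 0, error handling mostly elided); passing a zero-sized buffer makes NVML report just the counts:

    Code:
    #include <cstdio>
    #include <nvml.h>

    // Count retired pages on a device for a given retirement cause.
    static unsigned int retired(nvmlDevice_t dev, nvmlPageRetirementCause_t cause)
    {
        unsigned int n = 0;
        // With a zero-sized buffer, NVML returns NVML_ERROR_INSUFFICIENT_SIZE
        // and writes the number of retired pages back into n.
        nvmlReturn_t r = nvmlDeviceGetRetiredPages(dev, cause, &n, NULL);
        return (r == NVML_SUCCESS || r == NVML_ERROR_INSUFFICIENT_SIZE) ? n : 0;
    }

    int main()
    {
        if (nvmlInit() != NVML_SUCCESS) return 1;
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
            // Pages are retired either for repeated correctable (single-bit)
            // errors in the same page, or for one uncorrectable (double-bit) error.
            printf("pages retired for chronic SBEs: %u\n",
                   retired(dev, NVML_PAGE_RETIREMENT_CAUSE_MULTIPLE_SINGLE_BIT_ECC_ERRORS));
            printf("pages retired for DBEs:         %u\n",
                   retired(dev, NVML_PAGE_RETIREMENT_CAUSE_DOUBLE_BIT_ECC_ERROR));
        }
        nvmlShutdown();
        return 0;
    }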

    The system came down from an absolutely egregious error rate due to the test-escape scenario to a figure at or below the old ballpark figure used to justify ECC for workloads or clients sensitive to data corruption. A factor of 10 is likely not enough to sway the case on its own, as there are factors that could push things either way. ORNL's aggressive pre-screening and the special effort by Cray and Nvidia may not extend to the full range of systems that the rule of thumb covered, or necessarily take factors like geography into account. ORNL's takeaway is that its early-life debugging and the long-term survival of nodes are significantly helped by keeping the white elephant around.

    My gaze goes to the items not mentioned, like the microcontrollers and command processors used by the GPUs, and whether they are ECC protected, since on-chip register files showed up as a major component of DBEs.

    That is a multi-character error in a sentence. Multiple single-character errors in a sentence, or a double-character error, will require sentence retirement if the feature is activated in the Beyond3D driver.
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    You need to start again with all 899 SXMs removed.

    As for the white elephant, well, that's effectively been proven. Doing ECC with 7/8 data and 1/8 ECC (suffering a bandwidth/capacity loss) in GDDR5 was an intermediate technique that hadn't arisen at that point in the discussion. It's not a full hardware parity solution and it's not the full software technique that had been discussed. The software techniques were very clumsy, and there's a substantial implementation problem in getting ECC onto GDDR5 with GPUs, which Nvidia neatly avoided.

    Obviously, as an experiment in ECC topics, Titan succeeded in generating useful data that would have been much harder or impossible to obtain without hardware-level ECC.

    The presentation doesn't help us understand if the DBEs correlated with the SBEs - were known-bad cards the cause of DBEs? Once the 5% of bad SXMs have been weeded out, how many DBEs were experienced?

    I seem to have all the papers I linked in that thread. I imagine they're googlable, but if not I suppose I can re-distribute them.
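    On the SBE/DBE correlation, the raw material is at least cheap to gather: each board keeps lifetime ("aggregate") ECC counters that NVML exposes, so the per-node cross-tabulation is a couple of calls. A sketch (device 0; corrected = single-bit, uncorrected = double-bit):

    Code:
    #include <cstdio>
    #include <nvml.h>

    int main()
    {
        if (nvmlInit() != NVML_SUCCESS) return 1;
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
            unsigned long long sbe = 0, dbe = 0;
            // Aggregate counters persist across reboots, so they reflect
            // the board's whole service life, not just the current run.
            nvmlDeviceGetTotalEccErrors(dev, NVML_MEMORY_ERROR_TYPE_CORRECTED,
                                        NVML_AGGREGATE_ECC, &sbe);
            nvmlDeviceGetTotalEccErrors(dev, NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                                        NVML_AGGREGATE_ECC, &dbe);
            printf("lifetime SBEs: %llu  DBEs: %llu\n", sbe, dbe);
        }
        nvmlShutdown();
        return 0;
    }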
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Why?
    Only the 10 outliers were attributable to a hardware issue out of the factory; the other nodes were most likely allowed to continue operating, since these were soft errors correctable by ECC that did not occur again. Even if they did recur in the same region, the driver would retire that page.

    The SBE error case is what the earlier paper failed to detect, which prompted the most skepticism.
    In the case of Titan, it would also be asking for the removal of 899 blades, since per the discussion of the PCIe connector failure problem, these blades are a package deal.

    The presentation gave a geographic distribution of DBEs. The scatter was much more random than the highly concentrated SBE case.
    I am a little unclear on whether DBE failures were considered a failure for the running kernel, or were considered a reason to scratch a node. The followup on Nvidia's page retirement seemed to indicate that nodes could remain in service after a DBE.
     