Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    They said some cards had numerical errors, which is a broad descriptor.
    It's not clear if it's a modest number of irregularities in otherwise fine test output, or errors that vary wildly in magnitude.
    The latter could be errors anywhere from memory to the bus, cache, or registers.
    The former could be something more subtle in the execution hardware.

    The lack of ECC can make this harder to diagnose. There was a paper years ago about a GPU supercomputer where ECC turned up a set of Tesla cards that apparently had a gap in QA for memory, and they were logging corrected ECC errors. Flaky memory, marginal mounting, or a flaw in the controller could be in play even if the clocks are in-spec.
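    A minimal sketch of how that distinction could be checked, assuming you have a known-good reference output and the output from a suspect card as plain lists of floats. The function name, the tolerance, and the 1e-2 cutoff between "subtle" and "wild" are all hypothetical choices for illustration, not anything taken from the original report.

```python
import math


def summarize_errors(reference, observed, rel_tol=1e-6):
    """Compare two equal-length float sequences and describe the mismatches."""
    if len(reference) != len(observed):
        raise ValueError("sequences must have the same length")

    rel_errors = []
    for ref, obs in zip(reference, observed):
        # NaN or infinity in the output counts as an unbounded error.
        if math.isnan(obs) or math.isinf(obs):
            rel_errors.append(math.inf)
            continue
        denom = max(abs(ref), 1e-30)  # avoid dividing by zero
        err = abs(obs - ref) / denom
        if err > rel_tol:
            rel_errors.append(err)

    if not rel_errors:
        return {"mismatches": 0, "max_rel_error": 0.0, "pattern": "clean"}

    max_err = max(rel_errors)
    # Hypothetical cutoff: small relative errors hint at something subtle,
    # huge ones look more like corrupted data (flipped high-order bits).
    pattern = "wild" if max_err > 1e-2 else "subtle"
    return {
        "mismatches": len(rel_errors),
        "max_rel_error": max_err,
        "pattern": pattern,
    }
```

    A few mismatches with tiny relative error would point one way, while values that are off by orders of magnitude would point toward corrupted data somewhere in the memory path.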
     
  2. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    The P100 and V100 Tesla cards do have ECC, though (not saying it cannot somehow be related to this, but the earlier point was about the memory being pushed too hard). I expanded a little on another aspect in my previous response to Kaotik.
    And I agree there are not enough details.
    That is why I think it would have been interesting if those running the Amber bench could be contacted to run some further FP32 solvent tests.
     
    #1082 CSI PC, Mar 23, 2018
    Last edited: Mar 23, 2018
  3. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    How ironic I mentioned Amber twice earlier :)

    Update to original article at TheRegister:
     
    nnunn and pharma like this.
  4. nnunn

    Newcomer

    Joined:
    Nov 27, 2014
    Messages:
    28
    Likes Received:
    23
    Since Titan V has one of its four memory partitions disabled, and the (apparent) glitch only appeared on 2 of 4 cards tested, I wonder if this issue can be traced to which partition gets disabled...?
     
    CSI PC and pharma like this.
  5. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Sorry, I'm with you now: you mean specific to the Titan V context rather than Tesla.
     
    #1085 CSI PC, Mar 23, 2018
    Last edited: Mar 23, 2018
  6. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Will be interesting to see, although what is the likelihood that the cause and resolution will never be made public by Nvidia?
    It is not really clear how they set up their system and environment.
     
    #1086 CSI PC, Mar 23, 2018
    Last edited: Mar 23, 2018
  7. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,910
    Likes Received:
    1,607
    It is possible it's a driver issue, since Titan V bug fixes keep appearing lately in driver releases. I wonder whether they are using GeForce drivers or Tesla drivers?
     
    nnunn likes this.
  8. keldor314

    Newcomer

    Joined:
    Feb 23, 2010
    Messages:
    132
    Likes Received:
    13
    Do we even know that this is a hardware bug? Given that Volta significantly changed thread scheduling in a warp, perhaps this is a rare race condition in the code that cannot occur at all with the older scheduling? The CUDA Programming Guide makes it abundantly clear multiple times that warp divergence behavior is different, and indeed it's possible to have divergence in unexpected places. In fact, should Nvidia make this scheduler a professional-card-only architecture feature, I foresee an abundance of Volta-only or Ampere-only (or Turing? It's not clear which one is the professional market one) bugs showing up in software.
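    To make the kind of race concrete, here is a toy model in Python (not CUDA, and not a claim about what Amber's kernels actually do). Each "lane" first writes its own slot in a shared array and then reads its neighbour's slot. The schedule lists, `run_warp`, and `with_barrier` are hypothetical names for illustration; the barrier helper stands in for what an explicit warp-level sync between the write and the read would guarantee.

```python
def run_warp(schedule, num_lanes):
    """Replay a schedule of lane ids and count reads of unwritten slots.

    The first time a lane appears it writes its own slot; the second time
    it reads the slot of its right-hand neighbour.
    """
    shared = [None] * num_lanes
    has_written = [False] * num_lanes
    stale_reads = 0
    for lane in schedule:
        if not has_written[lane]:
            shared[lane] = lane * 10  # step 1: publish a value
            has_written[lane] = True
        else:
            neighbour = (lane + 1) % num_lanes
            if shared[neighbour] is None:  # step 2: neighbour not ready yet
                stale_reads += 1
    return stale_reads


def with_barrier(schedule):
    """Reorder a schedule so every write completes before any read starts."""
    seen = set()
    writes, reads = [], []
    for lane in schedule:
        if lane in seen:
            reads.append(lane)
        else:
            seen.add(lane)
            writes.append(lane)
    return writes + reads


# Lockstep: all lanes write, then all lanes read (the implicit assumption).
lockstep = [0, 1, 2, 3, 0, 1, 2, 3]
# Independent progress: each lane runs both of its steps before the next lane.
independent = [0, 0, 1, 1, 2, 2, 3, 3]
```

    With the lockstep schedule every read sees a written slot; with the independent schedule three of the four reads arrive too early, and forcing a barrier between the two steps removes the problem again. Code that silently relied on the first ordering would only misbehave on hardware that allows the second.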
     
  9. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,172
    Location:
    La-la land
    If it's a race condition or other software bug, why is this issue showing up on some boards only, and not on others? Should it not be more consistent if the issue is due to architectural changes that affect all boards?
     
  10. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    It depends on how the scientist had it set up with regard to those 4 Titans (he may or may not have tested them individually). Also, as part of being told of the issue, they were also told (not by Nvidia):
    And it does not apply to all scientific applications that are sensitive to high-accuracy calculations; Amber is one that had been identified, but not by themselves.

    Amber has been doing testing and cannot reproduce the issue:
    So there is quite a bit of conflicting information if you consider the whole of the original article and, indirectly, the issues with Amber. One aspect still to be determined is how that scientific user had the 4 Titan Vs implemented, and with what environment.
    It could be driver related. It could be HW, but that seems unlikely for now if the original report is accurate, and the HBM is also more mature with Volta and still not pushed to full spec. Or it could be memory going wrong in some other way rather than failing due to HW limits: configuration, environment, CUDA implementation, etc.


    Have there been other reports of this beyond the original article's user and 2 of their 4 Titan Vs? Not sure myself.
     
    #1090 CSI PC, Mar 26, 2018
    Last edited: Mar 26, 2018
  11. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    None of the V100 cards out there are fully enabled. So you could have one with 4 GPCs each missing one SM, or one with 1 GPC missing 4 SMs. So cards behaving a bit differently in some cases is not out of the realm of possibility, in my opinion.
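    As a rough sketch of how many such layouts exist, the snippet below enumerates the ways a given number of disabled SMs can be spread over GPCs. It assumes the publicly stated GV100 figures (6 GPCs of 14 SMs, 84 in total, with 80 enabled on V100) and assumes SMs can be fused off individually; Nvidia has not said how it actually chooses which units to disable, and the function name is hypothetical.

```python
def disabled_sm_patterns(num_disabled, num_gpcs, max_per_gpc):
    """List the distinct ways to spread disabled SMs across GPCs.

    Each pattern is a tuple of per-GPC losses in descending order, listing
    only the GPCs that lose at least one SM. Which physical GPC is affected
    is ignored, so (3, 1) covers every "one GPC loses 3, another loses 1".
    """
    def partitions(remaining, largest, slots):
        if remaining == 0:
            yield ()
            return
        if slots == 0:
            return
        for part in range(min(remaining, largest), 0, -1):
            for rest in partitions(remaining - part, part, slots - 1):
                yield (part,) + rest

    return list(partitions(num_disabled, max_per_gpc, num_gpcs))


# Assumed V100 case: 84 - 80 = 4 SMs disabled across 6 GPCs of 14 SMs each.
patterns = disabled_sm_patterns(4, 6, 14)
for pattern in patterns:
    print(pattern)
```

    Under those assumptions there are five shapes, with the two cases mentioned above, (1, 1, 1, 1) and (4,), at the extremes.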
     
    Grall likes this.
  12. LiXiangyang

    Newcomer

    Joined:
    Mar 4, 2013
    Messages:
    81
    Likes Received:
    47
    Don't have time to test it on Amber, but for the computing software (mostly in-house) I have tested so far, the Titan V works just as well as my other cards and always produces reproducible results, unless the software is designed not to. I have only tested it on a 3-GPU workstation with 2 of them being Titan Vs, though. And no, the computing software tested is not lightweight: many push the GPU to its limit (for instance 110% of TDP on a Titan Xp, running for days, etc.), and it includes both compute-bound and I/O-bound cases.

    Not sure if the version of Amber on Titan V uses the tensor-core feature; according to Nvidia, GEMM with tensor cores will not produce reproducible results.
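    One generic reason a parallel sum or GEMM can fail to be bitwise reproducible is that floating-point addition is not associative, so a different accumulation order can give a different result. The sketch below shows that effect with ordinary Python doubles; it is only an illustration of the general mechanism, not a description of how tensor cores accumulate, and both helper names are hypothetical.

```python
import struct


def sum_in_order(values, order):
    """Accumulate values left to right in the given index order."""
    total = 0.0
    for index in order:
        total += values[index]
    return total


def bitwise_reproducible(run_a, run_b):
    """Return True if two result lists match bit for bit, not just approximately."""
    if len(run_a) != len(run_b):
        return False
    pack = lambda xs: struct.pack(f"<{len(xs)}d", *xs)
    return pack(run_a) == pack(run_b)


values = [1e16, 1.0, -1e16]
forward = sum_in_order(values, [0, 1, 2])    # the 1.0 is absorbed by 1e16
reordered = sum_in_order(values, [0, 2, 1])  # the large terms cancel first
print(forward, reordered, bitwise_reproducible([forward], [reordered]))
```

    The same three numbers sum to 0.0 in one order and 1.0 in the other, which is why a run-to-run comparison needs to distinguish "different accumulation order" from "hardware fault".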
     
    Lightman and pharma like this.
  13. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Not sure it could be pushed that much, though Nvidia has never been clear about how they structure it when disabling aspects of the architecture (including, as nnunn raised, 1 of the 4 memory partitions being disabled for Titan V).
    Remember that P100, Titan X, Tesla V100, etc. are all cut models missing SMs.
     
  14. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Ryan Smith likes this.
  15. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    985
    Likes Received:
    277
  16. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Kinda funny he is calling it a GPU instead of a node, but "world's largest GPU" has a zing to it that will be picked up by various short news briefs.
     
    #1096 CSI PC, Mar 27, 2018
    Last edited: Mar 27, 2018
  17. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Just to clarify.
    I think he mentioned it was over 300 lbs and tried to lift it, but it is difficult to know how much of what he says is light-hearted, balanced against the news headline brief he needs or wants to create.
    Sort of like the 10,000 watts section on this node: some of it light-hearted, but with a brief around efficiency.
     
    #1097 CSI PC, Mar 27, 2018
    Last edited: Mar 27, 2018
  18. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,166
    Likes Received:
    1,836
    Location:
    Finland
    So GTC in short:
    1 GPU now means 2 GPUs on 2 separate cards
    1 GPU now also means 16 GPUs and a full dual socket machine to accompany them
    World's first GV100 and NVLink2 were launched, even though they were launched already a year ago
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The big 18-port NVLink switch chip is new, as would be using 12 of them.
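    A quick back-of-the-envelope port count, as a sketch only. The 18 ports, 12 switches, and 16 GPUs come from this thread; the 6 NVLink links per V100 is the publicly stated figure and is an assumption here, and the snippet says nothing about how the remaining ports are actually wired.

```python
def nvlink_port_budget(num_switches, ports_per_switch, num_gpus, links_per_gpu):
    """Compare total switch ports against the links the GPUs can attach."""
    total_switch_ports = num_switches * ports_per_switch
    gpu_links = num_gpus * links_per_gpu
    return {
        "total_switch_ports": total_switch_ports,
        "gpu_links": gpu_links,
        # Ports not consumed by GPU links (switch-to-switch or unused).
        "remaining_ports": total_switch_ports - gpu_links,
    }


# 12 switches x 18 ports, 16 GPUs x 6 links (links_per_gpu is an assumption).
budget = nvlink_port_budget(12, 18, 16, 6)
print(budget)
```

    On those numbers the GPUs account for 96 of the 216 switch ports, leaving 120 for links between switches or spare.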
     
    A1xLLcqAgt0qc2RyMz0y and nnunn like this.
  20. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    You think that may be why he is calling it the "world's largest GPU", due to the new fabric-switch connectivity being non-blocking?
    That is the only reason I can think of, and it also takes NVLink 2 to another level. I also wonder if there will be a comparable Power9 solution down the line.

    Separately, V100 now has 32GB HBM2 for certain models (probably all but Titan V).
     
    #1100 CSI PC, Mar 27, 2018
    Last edited: Mar 27, 2018

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.