Nvidia Volta Speculation Thread

They made it sound like it was very reproducible on all the cards.
The HBM2 memory is clocked lower than AMD's and well within spec from what we can tell.
Like I said, the P100 was running closer to the edge with HBM2 (with what was available at the time) but does not seem to have these problems.

They said some cards had numerical errors, which is a broad descriptor.
It's not clear if it's a modest number of irregularities in otherwise fine test output, or errors that vary wildly in magnitude.
The latter could be errors somewhere from memory, to the bus, to cache, to registers.
The former could be something more subtle in the execution hardware.

The lack of ECC can make this harder to diagnose. There was a paper years ago about a GPU supercomputer where ECC turned up a set of Tesla cards that apparently had a gap in QA for memory, and they were logging corrected ECC errors. Flaky memory, marginal mounting, or a flaw in the controller could be in play even if the clocks are in-spec.
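Not that anyone in the article did this, but on a consumer card without ECC about the only thing an end user can do is hammer the memory with known patterns and look for bit flips. A minimal sketch of that kind of soak test (the kernel names, buffer size, pattern, and pass count are my own arbitrary choices, not anything Nvidia or the Amber folks use):

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Write a deterministic pattern into device memory.
__global__ void fill_pattern(uint32_t *buf, size_t n, uint32_t seed)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] = seed ^ (uint32_t)i;
}

// Count words that no longer match the pattern when read back.
__global__ void check_pattern(const uint32_t *buf, size_t n, uint32_t seed,
                              unsigned long long *errors)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n && buf[i] != (seed ^ (uint32_t)i))
        atomicAdd(errors, 1ULL);
}

int main()
{
    const size_t n = (size_t)1 << 28;          // 1 GiB of 32-bit words (arbitrary)
    uint32_t *buf;
    unsigned long long *errors, host_errors = 0;
    cudaMalloc(&buf, n * sizeof(uint32_t));
    cudaMalloc(&errors, sizeof(unsigned long long));

    const int threads = 256;
    const int blocks  = (int)((n + threads - 1) / threads);

    for (int pass = 0; pass < 100; ++pass) {   // repeat to catch intermittent flips
        uint32_t seed = 0xA5A5A5A5u + pass;
        cudaMemset(errors, 0, sizeof(unsigned long long));
        fill_pattern<<<blocks, threads>>>(buf, n, seed);
        check_pattern<<<blocks, threads>>>(buf, n, seed, errors);
        cudaMemcpy(&host_errors, errors, sizeof(host_errors), cudaMemcpyDeviceToHost);
        if (host_errors)
            printf("pass %d: %llu corrupted words\n", pass, host_errors);
    }
    cudaFree(buf);
    cudaFree(errors);
    return 0;
}
```

It would only catch gross memory corruption, not subtle execution-unit issues, which is part of why the "numerical errors" description is so hard to pin down.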
 
They said some cards had numerical errors, which is a broad descriptor.
It's not clear if it's a modest number of irregularities in otherwise fine test output, or errors that vary wildly in magnitude.
The latter could be errors somewhere from memory, to the bus, to cache, to registers.
The former could be something more subtle in the execution hardware.

The lack of ECC can make this harder to diagnose. There was a paper years ago about a GPU supercomputer where ECC turned up a set of Tesla cards that apparently had a gap in QA for memory, and they were logging corrected ECC errors. Flaky memory, marginal mounting, or a flaw in the controller could be in play even if the clocks are in-spec.
The P100 and V100 Teslas do have ECC though (not saying it cannot somehow be related to this, but the earlier point was about the memory being pushed too hard); I expanded a little upon another aspect raised in my previous response to Kaotik.
And I agree, there are not enough details.
Which is why I think it would have been interesting if those running the Amber benchmark could be contacted to run some further FP32 solvent tests.
 
How ironic I mentioned Amber twice earlier :)

Update to the original article at The Register:
Updated to add
A spokesperson for Nvidia has been in touch to say people should drop the chip designer a note if they have any problems. The biz acknowledged it is aware of at least one scientific application – a molecular dynamics package called Amber – that reportedly is affected by the Titan V weirdness.

"All of our GPUs add correctly," the rep told us. "Our Tesla line, which has ECC [error-correcting code memory], is designed for these types of large scale, high performance simulations. Anyone who does experience issues should contact support@nvidia.com."
 
Since Titan V has one of its four memory partitions disabled, and the (apparent) glitch only appeared on 2 of the 4 cards tested, I wonder if this issue can be traced to which partition gets disabled...?
 
They said some cards had numerical errors, which is a broad descriptor.
It's not clear if it's a modest number of irregularities in otherwise fine test output, or errors that vary wildly in magnitude.
The latter could be errors somewhere from memory, to the bus, to cache, to registers.
The former could be something more subtle in the execution hardware.

The lack of ECC can make this harder to diagnose. There was a paper years ago about a GPU supercomputer where ECC turned up a set of Tesla cards that apparently had a gap in QA for memory, and they were logging corrected ECC errors. Flaky memory, marginal mounting, or a flaw in the controller could be in play even if the clocks are in-spec.
Sorry, with you now; specific to the Titan V context rather than Tesla.
 
Since Titan V has one of its four memory partitions disabled, and the (apparent) glitch only appeared on 2 of the 4 cards tested, I wonder if this issue can be traced to which partition gets disabled...?
Will be interesting to see, although the likelihood is that the cause and resolution will never be made public by Nvidia.
It is not really clear how they set up their system and environment.
 
It is possible it's a driver issue, since Titan V bug fixes keep appearing in recent driver releases. Wonder whether they are using GeForce drivers or Tesla drivers?
 
Will be interesting to see, although the likelihood is that the cause and resolution will never be made public by Nvidia.
It is not really clear how they set up their system and environment.

Do we even know that this is a hardware bug? Given that Volta significantly changed thread scheduling within a warp, perhaps this is a rare race condition in the code that cannot occur at all with the older scheduling? The CUDA programming guide makes it abundantly clear, multiple times, that warp divergence behavior is different, and indeed it's possible to have divergence in unexpected places. In fact, should Nvidia make this scheduler a professional-card-only architecture feature, I foresee an abundance of Volta-only or Ampere-only (or Turing? It's not clear which one is the professional market one) bugs showing up in software.
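To make that concrete, here is a purely hypothetical sketch (nothing to do with Amber's actual kernels) of the classic implicit warp-synchronous reduction that relied on pre-Volta lockstep execution within a warp, rewritten with the CUDA 9 *_sync warp primitives that the programming guide recommends for Volta's independent thread scheduling. The old, now-risky pattern is shown only in the comment:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Block-level sum reduction.  Pre-Volta, the last 32 elements were often
// reduced with implicit "warp-synchronous" code such as:
//     if (tid < 32) { s[tid] += s[tid + 32]; s[tid] += s[tid + 16]; ... }
// relying on the warp executing in lockstep.  Volta's independent thread
// scheduling breaks that assumption: reads and writes can interleave
// differently from run to run, giving occasional wrong sums.  The *_sync
// shuffle below makes the intra-warp exchange explicit and safe everywhere.
__global__ void block_sum(const float *in, float *out)
{
    __shared__ float s[256];                 // assumes blockDim.x == 256
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Tree reduction in shared memory down to one warp's worth of partials.
    for (int stride = blockDim.x / 2; stride >= 32; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    // Final warp: explicit warp-synchronous shuffle reduction.
    if (tid < 32) {
        float v = s[tid];
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        if (tid == 0)
            out[blockIdx.x] = v;
    }
}

int main()
{
    const int threads = 256, blocks = 1024, n = threads * blocks;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;   // known answer: 256 per block

    block_sum<<<blocks, threads>>>(in, out);
    cudaDeviceSynchronize();

    int bad = 0;
    for (int b = 0; b < blocks; ++b)
        if (out[b] != 256.0f) ++bad;
    printf("%d of %d blocks produced the wrong sum\n", bad, blocks);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Code that keeps the old pattern can still appear to work most of the time on Volta, which would at least fit the "fails on some runs, on some cards" flavour of the reports.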
 
I foresee an abundance of Volta-only or Ampere-only (or Turing? It's not clear which one is the professional market one) bugs showing up in software.
If it's a race condition or other software bug, why is this issue showing up on some boards only, and not on others? Should it not be more consistent if the issue is due to architectural changes that affect all boards?
 
If it's a race condition or other software bug, why is this issue showing up on some boards only, and not on others? Should it not be more consistent if the issue is due to architectural changes that affect all boards?
Depends how the scientist had it set up with regard to those 4 Titans (he may or may not have tested them individually); also, as part of being told of the issue, they were additionally told (not by Nvidia):
original article said:
It is not down to random defects in the chipsets nor a bad batch of products, since Nvidia has encountered this type of cockup in the past, we are told.
And it does not apply to all scientific applications that are sensitive to high-accuracy calculations; Amber is one that had been identified, though not by them directly.

Amber has been doing testing and cannot reproduce the issue:
Amber said:
Mar 2018: Titan-V reliability concerns. We have received conflicting reports about Titan-V cards failing the validation tests. Early reports suggested problems, but many subsequent tests have failed to reproduce this.

So there is quite a bit of conflicting information when considering the whole of the original article and, indirectly, the issues with Amber; one aspect still to be determined is how that scientific user had the 4 Titan Vs set up and in what environment.
Could be driver related. HW seems unlikely for now if the original report is accurate, and the HBM is also more mature with Volta and still not pushed to full spec, so if memory is involved it would be in some other way rather than failing due to HW limits; or it could be configuration, environment, the CUDA implementation, etc.


Have there been other reports of this beyond the original article's user and 2 of their 4 Titan Vs? Not sure myself.
 
If it's a race condition or other software bug, why is this issue showing up on some boards only, and not on others? Should it not be more consistent if the issue is due to architectural changes that affect all boards?
None of the V100 cards out there are fully enabled. So you could have one with 4 GPCs each missing one SM, or one with a single GPC missing 4 SMs. So cards behaving a bit differently in some cases is not out of the realm of possibility, in my opinion.
 
Don't have time to test it on Amber, but for the computing software (mostly in-house) I have tested so far, the Titan V works just as well as my other cards and always produces reproducible results unless the software is designed not to be. I have only tested it on a 3-GPU workstation with 2 of them being Titan Vs, though. And no, the computing software tested is not lightweight; much of it pushes the GPU to its limit (for instance 110% of TDP on a Titan Xp, running for days, etc.) and includes both compute-bound and I/O-bound cases.

Not sure if the version of Amber running on the Titan V uses its tensor-core feature; according to Nvidia, GEMM with tensor cores will not produce reproducible results.
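As a rough way to check that, assuming CUDA 9's cuBLAS, one could run the same SGEMM once with tensor-op math allowed and once without, and compare the outputs. Whether plain SGEMM actually gets routed to the tensor cores depends on the cuBLAS version and problem shape, so treat this as a sketch of the comparison rather than a guaranteed tensor-core trigger; the matrix size and data are arbitrary:

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Compare C = A * B computed with plain FP32 math vs. with tensor-op math
// allowed (which may let cuBLAS use reduced precision on the tensor cores).
int main()
{
    const int n = 1024;                       // arbitrary size
    std::vector<float> hA(n * n), hB(n * n), hC_fp32(n * n), hC_tensor(n * n);
    for (int i = 0; i < n * n; ++i) {
        hA[i] = (float)std::sin(0.001 * i);
        hB[i] = (float)std::cos(0.001 * i);
    }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Pass 1: classic FP32 GEMM.
    cublasSetMathMode(handle, CUBLAS_DEFAULT_MATH);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaMemcpy(hC_fp32.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);

    // Pass 2: allow tensor-core math.
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaMemcpy(hC_tensor.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);

    // Report the largest elementwise difference between the two passes.
    double max_diff = 0.0;
    for (int i = 0; i < n * n; ++i) {
        double diff = std::fabs((double)hC_fp32[i] - (double)hC_tensor[i]);
        if (diff > max_diff) max_diff = diff;
    }
    printf("max |FP32 - tensor-op| difference: %g\n", max_diff);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

A nonzero difference here would only show reduced precision, not a hardware fault; run-to-run variation within the same math mode would be the more worrying sign.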
 
None of the V100 cards out there are fully enabled. So you could have one with 4 GPCs each missing one SM, or one with a single GPC missing 4 SMs. So cards behaving a bit differently in some cases is not out of the realm of possibility, in my opinion.
Not sure it could be pushed that much though, although Nvidia has never been clear about how they structure it when disabling aspects of the architecture (including, as Nunn raised, 1 of the 4 memory partitions being disabled for Titan V).
Remember the P100 / Titan X / Tesla V100 / etc. are all cut models missing SMs.
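For anyone curious how much two "identical" cards can differ in what is enabled, CUDA does report the per-device totals (SM count, memory size, bus width, whether ECC is on), although not which SMs or which memory partition were fused off. A quick sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dump the per-device properties relevant to the "no two cut-down chips are
// identical" point.  The runtime only reports totals, not which units were
// disabled.
int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, dev);
        printf("GPU %d: %s\n", dev, p.name);
        printf("  SMs:              %d\n", p.multiProcessorCount);
        printf("  Global memory:    %.1f GiB\n",
               p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("  Memory bus width: %d-bit\n", p.memoryBusWidth);
        printf("  Memory clock:     %d kHz\n", p.memoryClockRate);
        printf("  ECC enabled:      %s\n", p.ECCEnabled ? "yes" : "no");
    }
    return 0;
}
```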
 
Maybe it has wheels.
Just to clarify.
I think he mentioned it was over 300 lbs and tried to lift it, but it is difficult to know how much of what he says is light-hearted, balanced against the news-headline brief he needs or wants to create.
Sort of like the 10,000 Watts section on this node, some of it light-hearted but with a brief around efficiency.
 
The big 18-port NVLink switch chip is new, as would be using 12 of them.
You think that may be why he is calling it the "world's largest GPU", due to the new fabric-switch connectivity and it being non-blocking?
It is the only reason I can think of, and it also takes NVLink 2 to another level; I also wonder if there will be a comparable Power9 solution down the line.

Separately, the V100 now has 32 GB of HBM2 for certain models (probably all but the Titan V).
 