They made it sound like it was very reproducible on all the cards.
The HBM2 memory core is running lower than AMD's and well within spec from what we can tell.
Like I said, the P100 was running closer to the edge with HBM2 (given what was available at the time), but it doesn't seem to have these problems.
They said some cards had numerical errors, which is a broad descriptor.
It's not clear if it's a modest number of irregularities in otherwise fine test output, or errors that vary wildly in magnitude.
The latter could be errors introduced anywhere from memory to the bus, cache, or registers.
The former could be something more subtle in the execution hardware.
The lack of ECC can make this harder to diagnose. There was a paper years ago about a GPU supercomputer where ECC turned up a set of Tesla cards that had apparently slipped through a gap in memory QA and were logging corrected ECC errors. Flaky memory, marginal mounting, or a flaw in the controller could be in play even if the clocks are in-spec.
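To make the "modest irregularities vs. wildly varying errors" distinction concrete, here is a rough sketch of how you could sort mismatches if you had a known-good reference output to compare against. Everything in it is my own assumption for illustration: the function names, the 1e-3 relative-error cutoff, and treating values as float32. Single-bit differences are only a hint toward storage or transport (memory, bus, cache, registers), not proof of anything.

```python
import struct

def f32_bits(x):
    """Return the IEEE-754 single-precision bit pattern of x as an int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def flip_bit(x, bit):
    """Return x (as float32) with one bit flipped, to simulate a corrupted value."""
    return struct.unpack("<f", struct.pack("<I", f32_bits(x) ^ (1 << bit)))[0]

def classify_errors(reference, observed, small_rel=1e-3):
    """Count mismatches between two float32 sequences.

    'small' / 'large' split on relative error (small_rel is an arbitrary,
    hypothetical cutoff); 'single_bit' counts values that differ from the
    reference by exactly one bit.
    """
    report = {"mismatches": 0, "small": 0, "large": 0, "single_bit": 0}
    for ref, obs in zip(reference, observed):
        diff_bits = f32_bits(ref) ^ f32_bits(obs)
        if diff_bits == 0:
            continue  # bit-identical, nothing to report
        report["mismatches"] += 1
        if bin(diff_bits).count("1") == 1:
            report["single_bit"] += 1
        # Guard against division by zero; NaN/inf results fall into 'large'.
        rel = abs(obs - ref) / max(abs(ref), 1e-30)
        if rel <= small_rel:
            report["small"] += 1
        else:
            report["large"] += 1
    return report

# Simulated data: one low mantissa bit flipped, one exponent bit flipped.
reference = [1.0, 2.0, 3.0, 4.0]
observed = [1.0, flip_bit(2.0, 0), 3.0, flip_bit(4.0, 30)]
print(classify_errors(reference, observed))
```

A flipped low mantissa bit barely moves the value, while a flipped exponent bit changes it by orders of magnitude, so the same underlying fault can land in either bucket. That is part of why "numerical errors" alone doesn't narrow things down much.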