AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

computerbase.de seems to think that it is enabled for FE, though it's not clear where they're getting their information from. The Frontier Edition page doesn't mention ECC anywhere, which is neither a yes nor a no.

AMD isn't shy about mentioning it when it's offered. The MI25 page mentions it. The description seems to be consistent with native ECC, although it doesn't flat-out state it so much as leave the usual capacity and bandwidth penalties unmentioned.

Somewhere on AMD's forums or Reddit it was stated that the Pro did not have it, and it was questioned why the Pro class would even need it.

As an aside, the footnotes for MI25 indicate no ECC protection for on-die structures. Some GPUs have had it in the past, but AMD has made arguments for removing it going forward.
 
AMD isn't shy about mentioning it when it's offered. The MI25 page mentions it. The description seems to be consistent with native ECC, although it doesn't flat-out state it so much as leave the usual capacity and bandwidth penalties unmentioned.
The MI25 page is where I saw it. Assumption was they'd use it on FE and pro products as they're essentially the same part. Specifics are difficult to come by.

Even native ECC should still be a hair slower. The process will cost at least some latency, even if bandwidth stays the same. Considering their product lines, using ECC on all the Vegas might make sense, so long as it's not hideously more expensive. Converging low-volume supplies onto one part should help.
 
Thanks @3dilettante, that is my understanding of the FE as well regarding ECC.

Either way, no one should expect gains of more than 1-2% from disabling ECC anyway.
 
The MI25 page is where I saw it. Assumption was they'd use it on FE and pro products as they're essentially the same part. Specifics are difficult to come by.
Turning it on means enabling and supporting the ability to handle and report the errors (correctable or not) from the controller all the way through the driver. If the user and software haven't the need and wouldn't know what to do about ECC events, it's effort and support not worth committing to.

Even native ECC should still be a hair slower. The process will cost at least some latency, even if bandwidth stays the same.
CPUs, which are more latency-sensitive, rarely notice a difference distinguishable from measurement noise. Whatever a GPU would notice, given that it tolerates far worse latencies, seems like it would be negligible in this instance.
 
Turning it on means enabling and supporting the ability to handle and report the errors (correctable or not) from the controller all the way through the driver. If the user and software haven't the need and wouldn't know what to do about ECC events, it's effort and support not worth committing to.
Assuming that FE was meant as a dev kit, ECC would be useful there for the very reason you stated. Use FE as opposed to a single Instinct or perhaps Pro to actually develop and test that support. For some limited cases the driver could just rerun the dispatch or forget about the event.

Regardless of software support, ECC should be able to scrub relatively simple errors, which is why I suggested it could be alleviating some overheating issues. Crashes would only come as the errors get more complex, and it's those complex errors where the software support is really needed. A software-driven approach, or a combination of both, would also work at the cost of significant bandwidth.
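For what it's worth, here's a toy Python sketch of the SECDED behavior being described: a single flipped bit gets corrected transparently, while a double flip can only be flagged for software to deal with. Purely illustrative; the real ECC in a GPU memory controller is a wider code in hardware, and AMD hasn't disclosed how Vega implements it.

```python
# Toy SECDED (single-error-correct, double-error-detect) Hamming code.
# Illustrative only: real HBM2/GPU ECC uses wider codewords (e.g. 8 check
# bits per 64 data bits) implemented in the memory controller.

def secded_encode(data_bits):
    """Encode a list of 0/1 data bits into Hamming code bits + overall parity."""
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:      # smallest r with 2^r >= m + r + 1
        r += 1
    n = m + r
    code = [0] * (n + 1)             # 1-indexed positions 1..n
    j = 0
    for i in range(1, n + 1):
        if i & (i - 1):              # not a power of two -> data position
            code[i] = data_bits[j]
            j += 1
    for p in range(r):               # check bits live at power-of-two positions
        pos = 1 << p
        parity = 0
        for i in range(1, n + 1):
            if (i & pos) and i != pos:
                parity ^= code[i]
        code[pos] = parity
    overall = 0
    for bit in code[1:]:
        overall ^= bit
    return code[1:] + [overall]

def secded_decode(codeword):
    """Return 'ok', 'corrected', or 'uncorrectable' for a received codeword."""
    *code_bits, overall = codeword
    n = len(code_bits)
    code = [0] + code_bits
    syndrome = 0
    parity = 0
    for i in range(1, n + 1):
        parity ^= code[i]
        if code[i]:
            syndrome ^= i            # XOR of set positions = error position
    if syndrome == 0 and parity == overall:
        return 'ok'
    if parity != overall:            # odd number of flips: fix the single bit
        if 1 <= syndrome <= n:
            code[syndrome] ^= 1      # transparently "scrubbed"
        return 'corrected'
    return 'uncorrectable'           # two flips: detected, but software must decide

word = [1, 0, 1, 1, 0, 0, 1, 0]
cw = secded_encode(word)
cw[5] ^= 1                           # one flipped bit -> silently corrected
print(secded_decode(cw))             # corrected
cw[2] ^= 1                           # a second flip -> only flagged
print(secded_decode(cw))             # uncorrectable
```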

More latency-sensitive CPUs rarely notice a difference distinguishable from measurement noise. What a GPU would notice when it tolerates far worse latencies seems like it would be negligible in this instance.
Agreed, which is why I wouldn't mind seeing it on GPUs just to scrub some basic errors. My only concern is that Vega could have some latency-sensitive operations.

I've still been trying to find that register configuration I mentioned the other day. I'm thinking it was in the LLVM compiler, but that was something I came across months ago while skimming a lot of code commits for details. That 1k value should work out to 4 VGPRs, which could be an RF cache. Drivers readily mention all registers being backed by memory. Conceivably AMD could have dumped the entire register file into memory with that 40-70% access reduction quoted in the paper I linked above, re-purposing the freed registers as a giant scratch pad for tiling, etc. Of course I could be completely wrong on that.
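A quick sanity check on the "1k works out to 4 VGPRs" arithmetic, assuming GCN-style 64-lane wavefronts and 32-bit VGPRs (what that value actually means in the compiler is still unconfirmed):

```python
# Assumes a GCN/Vega-style 64-wide wavefront with 32-bit VGPRs; the meaning of
# the 1k value in the compiler source is unconfirmed, this is just the arithmetic.
wavefront_lanes = 64
bytes_per_lane_register = 4                                   # one VGPR = 32 bits per lane
bytes_per_vgpr = wavefront_lanes * bytes_per_lane_register    # 256 bytes per wave
print(1024 // bytes_per_vgpr)                                 # 4 VGPRs in 1 KiB
```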
 
Slightly off-topic, but maybe not since it could apply to Vega as well. From here:
http://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf

HBM2 supports native or sideband ECC where a small memory region, separate from main memory, is used for ECC bits. This compares to inline ECC where a portion of main memory is carved out for ECC bits, as in the GDDR5 memory subsystem of the Tesla K40 GPU where 6.25% of the overall GDDR5 is reserved for ECC bits. With V100 and P100, ECC can be active without a bandwidth or capacity penalty. For memory writes, the ECC bits are calculated across 32 bytes of data in a write request. Eight ECC bits are created for each eight bytes of data. For memory reads, the 32 ECC bits are read in parallel with each 32-byte read of data. ECC bits are used to correct single bit errors or flag double bit errors.
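Just to put numbers on the two schemes described there (the 8-bits-per-8-bytes and 6.25% figures come straight from the whitepaper; the rest is plain arithmetic, not anything AMD has stated about Vega):

```python
# Sideband ECC (HBM2): check bits live in a separate region, so user-visible
# capacity and bandwidth are untouched even though extra bits are stored.
data_bits = 32 * 8                     # 32-byte access
ecc_bits = 32                          # 8 ECC bits per 8 bytes of data
print(f"sideband check-bit overhead: {ecc_bits / data_bits:.1%}")    # 12.5%, hidden

# Inline ECC (GDDR5, e.g. Tesla K40): check bits are carved out of main memory,
# so capacity and bandwidth both take the hit.
carveout = 0.0625                      # 6.25% of GDDR5 reserved for ECC
print(f"inline ECC usable capacity: {1 - carveout:.2%}")             # 93.75%
```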
 
Well, despite all the noise, the $1,000 one is out of stock on Newegg, which is the only place selling it. Looks like they really didn't need reviews after all.

I'll be interested to see if the $1,500 one goes out as well. Probably. What weirds me out is the sheer amount of binning going on here. The RX Vega is, presumably, a different bin from the Frontier Edition, which is a different bin from the MI25 (12.3 teraflops...). Hard to keep track of all the "editions", especially as I assume a Nano(2?) and some memory-mapped SSD version will both be coming out eventually. Yay!
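As an aside on that 12.3 teraflops figure, it falls out of the usual 64-CU / 4096-stream-processor Vega 10 configuration at roughly 1.5 GHz; the implied clock here is an inference, not an announced spec:

```python
# Assumes the commonly reported 64 CU / 4096 SP Vega 10 layout, with FMA
# counting as 2 FP32 ops per SP per clock; the implied clock is an inference.
stream_processors = 64 * 64
flops_per_clock = stream_processors * 2
implied_clock_ghz = 12.3e12 / (flops_per_clock * 1e9)
print(f"~{implied_clock_ghz:.2f} GHz peak clock")   # ~1.50 GHz
```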
 
Assuming that FE was meant as a dev kit, ECC would be useful there for the very reason you stated. Use FE as opposed to a single Instinct or perhaps Pro to actually develop and test that support. For some limited cases the driver could just rerun the dispatch or forget about the event.
The option to enable, disable, or parameterize behavior for ECC has not been shown, and all that extra work sounds like more driver quality than has been in evidence. Perhaps at some point it would be available, but right now the tenor is that it isn't considered necessary.

Regardless of software support, ECC should be able to scrub relatively simple errors. Which is why I suggested it could be alleviating some overheating issues.
That's not what ECC is for. If there are enough thermal bit flips that ECC is somehow salvaging stability, 1) there's a high likelihood that things are on the verge of damaging something, and 2) the thermal issue would only be "resolving" itself because the channel stalls at various times to process the errors. If the error is uncorrectable, there would still be a problem, since this scenario has no infrastructure in place for the GPU to know what to do, so nothing is saved.

The unexpectedly lower performance of excessively overclocked GDDR5 wasn't about salvaging thermally dangerous conditions, but a consequence of transmission retries allowing a misleadingly high bus speed.
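A toy model of that GDDR5 effect, with made-up retry rates just to show how a nominally faster bus can end up slower once CRC retransmissions pile up:

```python
# Made-up retry rates for illustration; each failed transfer is resent once,
# so effective throughput scales roughly by 1 / (1 + retry_rate).
def effective_gbps(raw_gbps, retry_rate):
    return raw_gbps / (1 + retry_rate)

for raw, retries in [(8.0, 0.0), (9.0, 0.02), (9.5, 0.25)]:
    print(f"{raw} Gbps at {retries:.0%} retries -> "
          f"{effective_gbps(raw, retries):.2f} Gbps effective")
```

In the last case the nominally fastest setting actually delivers the least effective bandwidth, which matches the symptom described.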

Agreed, which is why I wouldn't mind seeing it on GPUs just to scrub some basic errors. My only concern is that Vega could have some latency-sensitive operations.
DRAM device latencies are already measured in tens of ns, so I don't know what would be latency-sensitive enough to notice an ECC check.
Actual scrubbing, as in having the system systematically go through its memory at various intervals, would be something more explicitly enabled. It seems excessive to have that level of RAS when the product flat-out leaves everything else unprotected.

I've still been trying to find that register configuration I mentioned the other day. I'm thinking it was in the LLVM compiler, but that was something I came across months ago while skimming a lot of code commits for details. That 1k value should work out to 4 VGPRs, which could be an RF cache. Drivers readily mention all registers being backed by memory. Conceivably AMD could have dumped the entire register file into memory with that 40-70% access reduction quoted in the paper I linked above, re-purposing the freed registers as a giant scratch pad for tiling, etc. Of course I could be completely wrong on that.
I would like to see references to these.
 
There's an AMD slide from the Fury X launch claiming 35GB/s-per-watt.
512/35= 14.6W for the 4 HBM1 stacks in Fiji.
There are only 2 stacks in Vega, but HBM2 may consume more per stack due to increased frequencies, and 8-Hi stacks are also a new variable.
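Extending that arithmetic to Vega (the ~484 GB/s over two HBM2 stacks is an assumed spec, and HBM2 per-stack efficiency may not match the HBM1 figure):

```python
# 35 GB/s-per-watt is from the Fury X launch slide; the Vega bandwidth figure
# is an assumption, and HBM2 efficiency per stack may differ from HBM1.
hbm1_gbps_per_watt = 35.0

fiji_bw = 512.0                                   # GB/s, 4 HBM1 stacks
print(f"Fiji HBM1: ~{fiji_bw / hbm1_gbps_per_watt:.1f} W")                       # ~14.6 W

vega_bw = 484.0                                   # GB/s, 2 HBM2 stacks (assumed)
print(f"Vega HBM2, if efficiency held: ~{vega_bw / hbm1_gbps_per_watt:.1f} W")   # ~13.8 W
```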
 
Thanks @3dilettante, that is my understanding of the FE as well regarding ECC.

Either way, no one should expect gains of more than 1-2% from disabling ECC anyway.

Gains from disabling ECC may be tangible, but the higher clocks (= higher bandwidth) attainable by not having that extra step may be more substantial.
 
There's an AMD slide from the Fury X launch claiming 35GB/s-per-watt.
512/35= 14.6W for the 4 HBM1 stacks in Fiji.
There are only 2 stacks in Vega, but HBM2 may consume more per stack due to increased frequencies, and 8-Hi stacks are also a new variable.
Wow, if that's true maybe I won't have to change my 450W power supply when Vega comes out.
 
There's an AMD slide from the Fury X launch claiming 35GB/s-per-watt.
512/35= 14.6W for the 4 HBM1 stacks in Fiji.
There are only 2 stacks in Vega, but HBM2 may consume more per stack due to increased frequencies, and 8-Hi stacks are also a new variable.
I read a few days ago that Vega Frontier power consumption is around 300W.
 
That's not what ECC is for. If there are enough thermal bit flips that ECC is somehow salvaging stability, 1) there's a high likelihood that things are on the verge of damaging something, and 2) the thermal issue would only be "resolving" itself because the channel stalls at various times to process the errors. If the error is uncorrectable, there would still be a problem, since this scenario has no infrastructure in place for the GPU to know what to do, so nothing is saved.
Not what it's for, but it can help to a limited degree. HBM was having issues with voltage fluctuations from the wide IO, so it could be playing a larger part, especially if they did away with non-ECC parts.

Actual scrubbing, as in having the system systematically go through its memory at various intervals, would be something more explicitly enabled. It seems excessive to have that level of RAS when the product flat-out leaves everything else unprotected.
Didn't mean to imply that level of scrubbing, just fixing single-bit errors transparently as they're detected, similar to most parity checks.

I would like to see references to these.
http://llvm.org/docs/AMDGPUUsage.html
Some interesting tidbits there on GFX9, but I still haven't found that line I recall. The closest thing was 4x256 dwords for VGPR allocations. What I recall was a series of resource allocations. It may have been in one of their independent testing branches, so I'm having a difficult time finding it. Maybe that "Register Mapping" WIP section.
 