> There's a specialized data flow and ALU arrangement to get a 4x4 matrix multiply, and accumulate that result with another 4x4, in one clock, and it seems to need roughly the same number of register file accesses as a regular instruction, barring the higher precision of the accumulator.

How so? The only difference seems to be that Tensor throws a lot more silicon and ALUs at the problem. Strip all instructions not related to FMA from a SIMD and you have Tensor.
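For reference, a minimal scalar sketch of what that arrangement computes, assuming the semantics Nvidia describes for the op (4x4 tiles, FP16 multiplies, FP32 accumulate); plain float stands in for the FP16 storage here:

```cpp
// Scalar reference for the tensor op under discussion: D = A*B + C on 4x4
// tiles. On the hardware, A and B would be FP16 and the accumulator FP32;
// plain float stands in for both in this sketch.
void mma_4x4x4(const float A[4][4], const float B[4][4],
               const float C[4][4], float D[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];           // higher-precision accumulator
            for (int k = 0; k < 4; ++k)
                acc += A[i][k] * B[k][j];  // 4*4*4 = 64 FMAs per tile
            D[i][j] = acc;
        }
}
```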
> In one clock, with GCN's existing broadcast/permute capability and regular register accesses?

That's a distinct possibility with a flexible scalar unit, if AMD went that route. Put a pair of 32-bit FMA units capable of packed math in each SIMD lane, along with an L0 cache, and suddenly AMD has 4 Tensor'ish cores per CU, with the ability to bond dice together over Infinity Fabric.
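A back-of-the-envelope check of that "4 Tensor'ish cores per CU" arithmetic, under the stated assumptions (16-lane GCN SIMDs, a pair of FMA units per lane, packed math giving two FP16 FMAs per 32-bit unit):

```cpp
// Throughput arithmetic for the hypothetical setup above; the unit counts
// are assumptions taken from the post, not a shipping design.
constexpr int lanes_per_simd     = 16;  // GCN SIMD width
constexpr int fma_units_per_lane = 2;   // the proposed pair of 32-bit FMA units
constexpr int fp16_per_packed_op = 2;   // packed math: two FP16 ops per unit

constexpr int fp16_fma_per_simd_clock =
    lanes_per_simd * fma_units_per_lane * fp16_per_packed_op;  // 64

// A 4x4x4 D = A*B + C needs exactly 4*4*4 = 64 FMAs, so each such SIMD could
// retire one tile per clock; with 4 SIMDs per CU, that is the "4 per CU".
static_assert(fp16_fma_per_simd_clock == 4 * 4 * 4,
              "one 4x4x4 tile per SIMD per clock");
```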
Is this still with a 64-wide wave?
I've tried to mentally picture how this works, without getting into the relative widths of the units.
Going by Nvidia's description of its 4x4x4 operation D=AxB+C, let's give each operand one register access.
Each element in A would be broadcast in a strided manner to 4 elements in B. That's 64 horizontal broadcasts. While it should be possible to optimize away a broadcast when the source and destination lanes happen to align, exploiting that would require a separate operation, since the network is performing a broadcast either way.
A SIMD can broadcast one value horizontally per clock, and that capability comes from a cross-lane network that is both more complex than the specialized broadcast would need and still insufficient, since this strided pattern is one it cannot produce.
After the initial full-precision multiply, there would need to be a set of additions whose operands sit at a stride of 4 in the other dimension, combined with the elements from register C; that is not a standard one-cycle operation, nor something the regular write-back path of a SIMD handles.
Is there a way to synthesize this behavior with existing SIMD data paths and ALUs without a lot of serialization and separate instructions?
That's a lot of register file and bypass/swizzle network traffic, and any number of instruction fetches and cycles.
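To make that concrete, here's a scalar simulation of the data movement a 64-wide wave would need, under an assumed lane layout (lane = i*16 + j*4 + k); the step structure, not the arithmetic, is the point:

```cpp
#include <array>
#include <cstdio>

int main() {
    // 4x4 operands in row-major order; values are placeholders.
    std::array<float, 16> A{}, B{}, C{};
    for (int n = 0; n < 16; ++n) { A[n] = n; B[n] = 16 - n; C[n] = 1.0f; }

    // Step 1: broadcast. Each A[i][k] must reach the 4 lanes that share i
    // and k but differ in j; those are the 64 strided horizontal broadcasts
    // described above. B needs its own replication under this layout.
    std::array<float, 64> a_lanes{}, b_lanes{};
    int a_broadcast_writes = 0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k) {
                int lane = i * 16 + j * 4 + k;
                a_lanes[lane] = A[i * 4 + k];
                b_lanes[lane] = B[k * 4 + j];
                ++a_broadcast_writes;
            }

    // Step 2: one full-width multiply, the only step that maps cleanly onto
    // a single existing SIMD instruction.
    std::array<float, 64> prod{};
    for (int lane = 0; lane < 64; ++lane)
        prod[lane] = a_lanes[lane] * b_lanes[lane];

    // Step 3: cross-lane reduction plus C. The 4 partial products for each
    // D[i][j] sit in adjacent lanes here (at a stride of 4 under the
    // transposed layout); a plain SIMD has no single-cycle path for this.
    std::array<float, 16> D{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i * 4 + j];
            for (int k = 0; k < 4; ++k)
                acc += prod[i * 16 + j * 4 + k];
            D[i * 4 + j] = acc;
        }

    std::printf("A broadcast lane-writes: %d, D[0][0] = %g\n",
                a_broadcast_writes, D[0]);
}
```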
> Officially, there is currently no special relationship between Vega and Zeppelin. It's only described as using PCIe for the compute cards. xGMI was on a slide for Vega 20.

Infinity Fabric, with MCM Threadripper/Naples as a backbone, doesn't exist? Even better, it doesn't require IBM's Power line of CPUs, so x86 works. That's 8 GPUs per server with direct access to 8 memory channels, and potentially better density and perf/watt.
> If CUDA is not that relevant, AMD's efforts are not that relevant relative to it, either.

Guess I didn't realize CUDA was that relevant to HPC, with less than 20% of supercomputers even using a GPU. CUDA for CPU acceleration, over the C/C++, Fortran, and other languages that constitute the vast majority of applications, then?
It's sufficiently relevant that AMD's revamped compute platform specifically makes provision for translating from CUDA.
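Presumably that refers to HIP on ROCm, where porting is mostly a mechanical renaming of the runtime API (the hipify tools automate much of it). A minimal sketch, assuming a standard saxpy kernel and omitting error checks:

```cpp
#include <hip/hip_runtime.h>
#include <vector>
#include <cstdio>

// The kernel body is unchanged from CUDA; __global__, blockIdx, etc. carry over.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);
    float *dx, *dy;
    hipMalloc((void**)&dx, n * sizeof(float));   // cudaMalloc -> hipMalloc
    hipMalloc((void**)&dy, n * sizeof(float));
    hipMemcpy(dx, hx.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dy, hy.data(), n * sizeof(float), hipMemcpyHostToDevice);
    // hipLaunchKernelGGL replaces CUDA's <<<grid, block>>> launch syntax.
    hipLaunchKernelGGL(saxpy, dim3((n + 255) / 256), dim3(256), 0, 0,
                       n, 2.0f, dx, dy);
    hipMemcpy(hy.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("y[0] = %g\n", hy[0]);           // expect 4
    hipFree(dx);
    hipFree(dy);
}
```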