Tensors! *spawn*

3dilettante

How so? The only difference seems to be that Tensor throws a lot more silicon and ALUs at the problem. Strip all instructions not related to FMA from a SIMD and you have Tensor.
There's a specialized data flow and ALU arrangement to perform a 4x4 matrix multiply and accumulate that result with another 4x4 matrix in one clock, and it appears to need roughly the same number of register file accesses as a regular instruction, barring the higher precision of the accumulator.

That's a distinct possibility with a flexible scalar if AMD went that route. Put a pair of 32-bit FMA units capable of packed math in each SIMD lane, along with an L0 cache, and suddenly AMD has 4 Tensor'ish cores per CU with the ability to bond dice together over Infinity Fabric.
In one clock, with GCN's existing broadcast/permute capability and regular register accesses?
Is this still with a 64-wide wave?
I've tried to mentally picture how this works, without getting into the relative widths of the units.

Going by Nvidia's description of its 4x4x4 operation D = A×B + C, let's give each operand one register access.
Each element in A would be broadcast in a strided manner to 4 elements in B. That's 64 horizontal broadcasts. While it should be possible to optimize away a broadcast when the source and destination lanes already align, that would require a separate operation, since the network is performing a broadcast.
A SIMD can broadcast one value horizontally per clock, and this goes through a cross-lane network that is both more complex than the specialized broadcast and still insufficient, since this strided pattern is one it cannot produce.
After the initial full-precision multiply, there would need to be a set of additions whose operands sit at a stride of 4 in the other dimension, plus the elements from register C--not a standard 1-cycle operation, nor something handled by the regular write-back path of a SIMD.
Is there a way to synthesize this behavior with existing SIMD data paths and ALUs without a lot of serialization and separate instructions?
That's a lot of register file and bypass/swizzle networks, and any number of instruction fetches and cycles.
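
For what it's worth, here is a minimal scalar sketch of the 4x4x4 operation under discussion, written as plain C++ with float standing in for the FP16 inputs and the wider accumulator. It only makes the operation count concrete (4 multiplies and 4 adds per output element, 16 outputs), and says nothing about the actual hardware data path:

```cpp
// Minimal scalar reference for the 4x4x4 tensor op D = A*B + C described above.
// Plain C++ stand-in: float replaces the FP16 inputs and the wider accumulator,
// so this shows the arithmetic only, not the hardware data path.
#include <cstdio>

constexpr int N = 4;

void tensor_op(const float A[N][N], const float B[N][N],
               const float C[N][N], float D[N][N]) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = C[i][j];              // seed the accumulator with C
            for (int k = 0; k < N; ++k) {
                acc += A[i][k] * B[k][j];     // 4 multiplies + 4 adds per output
            }
            D[i][j] = acc;                    // 16 outputs -> 64 FMAs per op
        }
    }
}

int main() {
    float A[N][N], B[N][N], C[N][N], D[N][N];
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) { A[i][j] = i + 1; B[i][j] = j + 1; C[i][j] = 1.0f; }
    tensor_op(A, B, C, D);
    printf("D[0][0] = %g\n", D[0][0]);        // 4*(1*1) + 1 = 5 for this fill
    return 0;
}
```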

Infinity Fabric and MCM Threadripper/Naples as a backbone doesn't exist? Even better, it doesn't require IBM's POWER line of CPUs, so x86 works. That's 8 GPUs per server with direct access to 8 memory channels and potentially better density and perf/watt.
Officially, there is currently no special relationship between Vega and Zeppelin. Vega is only described as using PCIe for the compute cards. xGMI was on a slide for Vega 20.


Guess I didn't realize CUDA was that relevant to HPC, with less than 20% of supercomputers even using a GPU. So CUDA is used for CPU acceleration over the C/C++, Fortran, and other languages that constitute the vast majority of applications, then?
If CUDA is not that relevant, then AMD's efforts are not that relevant relative to it.
It's sufficiently relevant that AMD's revamped compute platform specifically provisions for translating from CUDA.
 
The point you appear to be missing is that Tensor cores are distinct from the rest of the ALU/FPU units and can run independently.
You quoted me saying they threw silicon at the problem, through larger dice and more ALUs. Having separate units would be a bad thing, as a chunk of the chip would only work with Tensors. All that register space, those FMA units, etc. would be wasted when they could otherwise be used for standard FP-heavy workloads. Tensor looks like the standard SIMT with all logic NOT related to FMA removed. What I suggested for GCN was roughly the same thing, with performance determined by the instruction distribution of the code. An adjustment there wouldn't be unreasonable, as AMD and Microsoft did just that when profiling code from lots of console titles for Scorpio.
 
You quoted me saying they threw silicon at the problem, through larger dice and more ALUs. Having separate units would be a bad thing, as a chunk of the chip would only work with Tensors. All that register space, those FMA units, etc. would be wasted when they could otherwise be used for standard FP-heavy workloads. Tensor looks like the standard SIMT with all logic NOT related to FMA removed. What I suggested for GCN was roughly the same thing, with performance determined by the instruction distribution of the code. An adjustment there wouldn't be unreasonable, as AMD and Microsoft did just that when profiling code from lots of console titles for Scorpio.

What you suggest is adding two 32b scalar units to each lane of the SIMD; I don't see how this is even remotely comparable to performing a matrix FMA in one clock. We are talking about 1024 operations per clock per SM.

You talk of '4 Tensor'ish cores' per CU. Assuming a Tensor'ish core retains the throughput of a Tensor core, that's 512 operations per clock per CU. As 3dilettante eloquently stated a few posts up, that requires a significant number of register accesses to load and store all of the operands, and so far we don't know the details of the implementation.

Needless to say, I think it's far from trivial to implement, and "simply adding two 'flexible scalars' to each lane" (which is simply throwing more ALU and die space at the problem, ironically) doesn't seem like a very convincing alternative.
 
What you suggest is adding two 32b scalar units to each lane of the SIMD; I don't see how this is even remotely comparable to performing a matrix FMA in one clock. We are talking about 1024 operations per clock per SM.
Not exactly; what I'm suggesting is streamlining the SIMD and bulking up on execution units related to Tensor operations, which also account for most graphics math. At FP16 that would be 8 operations per lane, half of those not counted as they are accumulators.

The operations per clock are somewhat irrelevant, as it's an issue of area and dimensions. I'm unsure what the ideal matrix size for deep learning would be, as they seem to vary a lot.

we don't know the details of the implementation.
Hence "Tensor'ish". More a question of throughput.

Needless to say, I think it's far from trivial to implement, and "simply adding two 'flexible scalars' to each lane" (which is simply throwing more ALU and die space at the problem, ironically) doesn't seem like a very convincing alternative.
It's not a flexible scalar per lane, but per SIMD, to allow for streamlining the SIMD. It's a method to get more execution units out of denser logic and avoid replicating parts that aren't used frequently.

Just to be clear, I'm speculating on this, but it could be a method to reorganize the CU. It's those details we haven't seen yet.
 
Going by Nvidia's description of its 4x4x4 operation D = A×B + C, let's give each operand one register access.
I think I figured out our problem: "Warp-Level Matrix Operations". Nvidia just stated part of the math being performed, not the tensor. The result of two 4x4 matrices is a 16x16 array. A 16-wide SIMD would just broadcast the (0,0) element, or two for packed math, with the assumption of tiled results being accumulated. The accumulator would be sitting on the equivalent of 4 VGPRs (4x64 = 256), aligned to quads, which would fit with the existing permute capabilities if they were even needed. Permutations on four read operands should be able to handle most alignment concerns.

What the tensor cores provided was the extra FP32 adds, as the data alignment looks relatively simple now. Volta's increases come down to extra execution units and clock-speed gains from the lower precision. Time the speed to the FP16 multiply with alternating adders to keep pace. A single tensor op shouldn't write the same location twice; they just use accumulation over multiple sequential operations to free operands.
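
As an aside, one way to make the "broadcast one element, multiply, and accumulate over sequential steps" picture concrete is the rank-1-update form of a matrix product, sketched below in plain C++. This is only an illustration of that general data flow, not a claim about what the poster has in mind or what the hardware actually does:

```cpp
// Sketch of a broadcast-and-accumulate-over-time data flow: each of the 4 steps
// broadcasts one column of A against one row of B (a rank-1 update) and all 16
// accumulators take one FMA. After the 4 steps, C holds C_initial + A*B.
// Illustrative only; not a description of the actual tensor-core hardware.
#include <cstdio>

constexpr int N = 4;

void broadcast_accumulate(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (int k = 0; k < N; ++k)              // 4 sequential broadcast steps
        for (int i = 0; i < N; ++i)          // the 16 (i,j) pairs would be 16 lanes
            for (int j = 0; j < N; ++j)
                C[i][j] += A[i][k] * B[k][j];
}

int main() {
    float A[N][N] = {}, B[N][N] = {}, C[N][N] = {};
    for (int i = 0; i < N; ++i) { A[i][i] = 2.0f; B[i][i] = 3.0f; }   // 2*I and 3*I
    broadcast_accumulate(A, B, C);
    printf("C[2][2] = %g\n", C[2][2]);       // 6 on the diagonal, 0 elsewhere
    return 0;
}
```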
 
I think I figured out our problem: "Warp-Level Matrix Operations". Nvidia just stated part of the math being performed, not the tensor. The result of two 4x4 matrices is a 16x16 array. A 16-wide SIMD would just broadcast the (0,0) element, or two for packed math, with the assumption of tiled results being accumulated. The accumulator would be sitting on the equivalent of 4 VGPRs (4x64 = 256), aligned to quads, which would fit with the existing permute capabilities if they were even needed. Permutations on four read operands should be able to handle most alignment concerns.

What the tensor cores provided was the extra FP32 adds, as the data alignment looks relatively simple now. Volta's increases come down to extra execution units and clock-speed gains from the lower precision. Time the speed to the FP16 multiply with alternating adders to keep pace. A single tensor op shouldn't write the same location twice; they just use accumulation over multiple sequential operations to free operands.

What do you mean by tiled results being accumulated?

The result of multiplying two 4x4 matrices is still a 4x4 matrix, not 16x16; perhaps I misunderstood, but I'm not following.
Each element of the resultant matrix is the scalar product of the respective column and row vectors.

That's 4 products and 3 accumulates per element.
 
What do you mean by tiled results being accumulated?

The result of multiplying two 4x4 matrices is still a 4x4 matrix, not 16x16; perhaps I misunderstood, but I'm not following.
Each element of the resultant matrix is the scalar product of the respective column and row vectors.

That's 4 products and 3 accumulates per element.
They aren't matrices though, they're 2nd-order tensors that look like matrices. The product of two 2nd-order tensors is 4x4x4 (recall 64 FLOPs per cycle on tensor cores). It's a different math operation. The accumulation is a data-flow thing, not the traditional consolidation we're used to.
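
For the terminology being argued here, a quick count may help: the full tensor (outer) product of two 4x4 arrays has 4x4x4x4 = 256 components and involves no additions, while the ordinary matrix product reduces over the shared index and leaves 16 outputs. The sketch below just prints those counts; which of the two the hardware actually performs is the open question in this exchange:

```cpp
// Operation counts for the two interpretations being debated: a full tensor
// (outer) product of two 4x4 arrays versus an ordinary 4x4 matrix product.
// This is just arithmetic bookkeeping, not a statement about the hardware.
#include <cstdio>

constexpr int N = 4;

int main() {
    int outer_components  = N * N * N * N;     // every A[i][j] paired with every B[k][l]
    int matmul_multiplies = N * N * N;         // 4 multiplies per output element
    int matmul_adds       = N * N * (N - 1);   // 3 adds per output element
    printf("tensor (outer) product: %d multiplies, 0 adds, %d components\n",
           outer_components, outer_components);
    printf("matrix product:         %d multiplies, %d adds, %d outputs\n",
           matmul_multiplies, matmul_adds, N * N);
    return 0;
}
```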
 
They aren't matrices though, they're 2nd-order tensors that look like matrices. The product of two 2nd-order tensors is 4x4x4 (recall 64 FLOPs per cycle on tensor cores). It's a different math operation. The accumulation is a data-flow thing, not the traditional consolidation we're used to.

A vector is a tensor of rank (or order) 1. A rank-2 tensor is represented by a matrix.
 
Are you talking about tensor products? This is not performing tensor products; it's performing matrix FMA on 4x4 matrices.
Yes, tensor products, with the accumulation being part of the deep learning algorithm. That's why we keep having difficulty visualizing all the adds in a consolidation step. They aren't accumulating rows/cols, but results over time, which maps to a multiply+accumulate loosely following an SGEMM routine.
 
Yes, tensor products, with the accumulation being part of the deep learning algorithm. That's why we keep having difficulty visualizing all the adds in a consolidation step. They aren't accumulating rows/cols, but results over time, which maps to a multiply+accumulate loosely following an SGEMM routine.

There's no accumulation involved in a tensor product, and that's beside the point: it's extremely clear from the limited information Nvidia has released that they are performing a matrix-matrix multiply.

I think you've been misled by the diagram they have on their blog.
 
There's no accumulation involved in a tensor product, and that's beside the point: it's extremely clear from the limited information Nvidia has released that they are performing a matrix-matrix multiply.

I think you've been misled by the diagram they have on their blog.
Tesla V100 delivers industry-leading floating-point and integer performance. Peak computation rates (based on GPU Boost clock rate) are:
  • 7.5 TFLOP/s of double precision floating-point (FP64) performance;
  • 15 TFLOP/s of single precision (FP32) performance;
  • 120 Tensor TFLOP/s of mixed-precision matrix-multiply-and-accumulate.
https://devblogs.nvidia.com/parallelforall/inside-volta/
Maybe, but it literally states "matrix multiply and accumulate" for Tensor ops. The only difference is a storage register and not transposing a matrix in the process. Plus it makes more sense that a "Tensor Core" operates on "tensor" math, i.e. multilinear algebra. The real performance gain looks to be from additional execution units and some more efficient pipelining. On paper, and FP64 aside, that's only 20% higher throughput than Vega at FP32 despite the 60% size advantage. Conceivably Vega could be close on "Tensor" ops if they piled on the adders and clocked around FP16.
 
Maybe, but it literally states "matrix multiply and accumulate" for Tensor ops. The only difference is a storage register and not transposing a matrix in the process. Plus it makes more sense that a "Tensor Core" operates on "tensor" math, i.e. multilinear algebra. The real performance gain looks to be from additional execution units and some more efficient pipelining. On paper, and FP64 aside, that's only 20% higher throughput than Vega at FP32 despite the 60% size advantage. Conceivably Vega could be close on "Tensor" ops if they piled on the adders and clocked around FP16.

Vega is rated for 100 TFLOPS?
 
Architectural enhancements in Volta are abundant and wide-ranging. The most important is the addition of the aforementioned Tensor Cores, which provide 120 Tensor teraflops for either training and inferencing neural networks. That’s 12 times faster than the P100 for FP32 operations used for training and 6 times faster than the P100 for FP16 used for inferencing.

In a nutshell, the Tensor Cores provide matrix processing operations that align well with both deep learning training and inferencing, which involves multiplying large matrices of data and weights associated with neural networks. More specifically, each of the 640 Tensor Cores does mixed precision floating point operations on a 4x4x4 array. In a single clock cycle, each core can do 64 FMA (fused multiply-add) operations. Each FMA multiplies two FP16 matrices and adds a FP16 or FP32 matrix, with the result stored in a FP16 or FP32 matrix.

https://www.top500.org/news/nvidia-raises-performance-bar-with-volta-gpu/
 

I don't know where this 4x4x4 came from; I can't figure out if it's meant to signify that it operates on 3 matrices in one clock, or the total number of scalar FMAs (4 per element of the matrix), but it is most definitely operating on 2-dimensional arrays, not higher.

It's extremely clear in the blog: it is performing FMA matrix-matrix operations.
[attached screenshot from Nvidia's Volta blog]

This is essentially what each unit in the tensor core is doing every clock; there are 64 of these in each tensor core.
[attached screenshot from Nvidia's Volta blog]
 
I don't know where this 4x4x4 came from; I can't figure out if it's meant to signify that it operates on 3 matrices in one clock, or the total number of scalar FMAs (4 per element of the matrix), but it is most definitely operating on 2-dimensional arrays, not higher.

It's extremely clear in the blog: it is performing FMA matrix-matrix operations.
[attached screenshot from Nvidia's Volta blog]

This is essentially what each unit in the tensor core is doing every clock; there are 64 of these in each tensor core.
[attached screenshot from Nvidia's Volta blog]

As to the 4x4x4 dimensions, every 2D matrix multiplication has 3 dimensions, usually called m, n, and k.

The first matrix has dimensions (m, k), the second (k, n), and multiplying them produces an output matrix of dimensions (m, n). The inner dimension, k, has to be the same for both matrices being multiplied, or else the multiplication isn't defined. So 4x4x4 just means that you multiply a (4x4) and a (4x4) matrix to yield another (4x4) matrix that you then accumulate into a (4x4) matrix.
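
To make the m, n, k naming above concrete, here is a small C++ sketch with deliberately different values for the three dimensions, so it's visible which one is which; only the inner dimension k has to match, and the tensor-core case is simply m = n = k = 4. The names and values are illustrative:

```cpp
// The m-n-k convention described above with deliberately different values,
// to show which dimension is which: an (M x K) matrix times a (K x N) matrix
// gives an (M x N) result, and only K is shared. The tensor-core case is
// simply M = N = K = 4.
#include <vector>
#include <cstdio>

constexpr int M = 2, N = 3, K = 4;

int main() {
    std::vector<float> A(M * K, 1.0f);    // M x K
    std::vector<float> B(K * N, 2.0f);    // K x N
    std::vector<float> C(M * N, 0.0f);    // M x N accumulator
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            for (int k = 0; k < K; ++k)   // shared inner dimension
                C[m * N + n] += A[m * K + k] * B[k * N + n];
    printf("C is %dx%d, C[0] = %g\n", M, N, C[0]);   // K*(1*2) = 8
    return 0;
}
```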
 
Nvidia just stated part of the math being performed, not the tensor. The result of two 4x4 matrices is a 16x16 array.
I am not exactly sure where this now fits since the discussion moved on, but Nvidia's description of the tensor operation D = A×B + C would have a multiplication phase between A and B that would produce 64 outputs. From there, the groups of 4 results coming from a given row*column would be summed together. This would return things to a 4x4 matrix to be added to C. In operation, I think it's possible that the hardware is arranged to broadcast/duplicate values to allow for the 64 multiply operations to be in-flight and to start the accumulations in parallel.

A sticking point to using a standard SIMD is that the operands A, B, C, and D are sourced from or written to a register file whose configuration gives an initial arrangement where the values are delivered as 1x16 matrices, with the constraint introduced by the SIMD that only matching positions in the operands can interact. Cross-lane operations would relax the constraint, but I am drawing a blank on an instruction for this specific pattern.

Also, if this is using a SIMD fundamentally similar to GCN, there is a serious amount of underutilization.

A 16-wide SIMD would just broadcast the (0,0) element, or two for packed math, with the assumption of tiled results being accumulated.
I'm not sure which value is (0,0) in this case, or why it is useful to more than 4 lanes.

The accumulator would be sitting on the equivalent of 4 VGPRs (4x64 = 256), aligned to quads, which would fit with the existing permute capabilities if they were even needed.
Is there a specific operation or combination thereof you have in mind from those listed in the following?
http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

What the tensor cores provided was the extra FP32 adds, as the data alignment looks relatively simple now.
Nvidia seems to be insistent that it's doing the 4x4 D = A×B + C operation in one clock.
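
To illustrate the lane-alignment problem described above, the sketch below assumes A, B, and D are each held as one 16-lane register with lane = 4*row + col (an assumed mapping, purely for illustration) and prints which source lanes each output lane would need. The point is that the required lanes are generally not the output's own lane, which is the cross-lane traffic a plain SIMD doesn't provide in one operation:

```cpp
// Illustration of the operand-layout problem: if A, B and D are each held as one
// 16-lane register (lane = 4*row + col, an assumed mapping), then output lane l
// needs values from four A lanes and four B lanes that are generally not lane l.
#include <cstdio>

int main() {
    for (int l = 0; l < 16; ++l) {
        int row = l / 4, col = l % 4;
        printf("D lane %2d (row %d, col %d) <- A lanes", l, row, col);
        for (int k = 0; k < 4; ++k) printf(" %2d", row * 4 + k);   // row 'row' of A
        printf(" ; B lanes");
        for (int k = 0; k < 4; ++k) printf(" %2d", k * 4 + col);   // column 'col' of B
        printf("\n");
    }
    return 0;
}
```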
 
I think what people are missing here is that in a standard BLAS library, matrix multiply effectively comes in two variants: with and without a "constant" matrix that is added. The addition of C is "free" when performing a matrix multiply on hardware where the base operation is FMA, which is why the tensor operation is counted as 64 FMAs. Indeed, in BLAS, C is always present in the definition, but the library will be optimised for the two cases: whether matrix C is present or not.

In a conventional matrix multiply, C is zero, and that is what people normally call matrix multiply. Indeed, when using FMA hardware to compute a conventional matrix multiply, the FLOP count is overstated. Taking the resultant for D[0, 0], which consists of:

A[0, 0] * B[0, 0]
+ A[1, 0] * B[0, 1]
+ A[2, 0] * B[0, 2]
+ A[3, 0] * B[0, 3]

the very first operation in FMA hardware is A[0, 0] * B[0, 0] + 0. Notice that 0: it's completely pointless. But in hardware that does FMA, the loop normally performs four FMAs, not one multiply followed by a loop of three FMAs. So the FLOP count is overstated because it's counted as four FMAs = 8 FLOP.

When C is non-zero, instead of zeroing the starting accumulator for each "cell" in the result matrix, one puts the corresponding cell from C there. Every computation that follows is then the same as a "normal" matrix multiply. So D[0, 0] is now:

C[0, 0] + A[0, 0] * B[0, 0]
+ A[1, 0] * B[0, 1]
+ A[2, 0] * B[0, 2]
+ A[3, 0] * B[0, 3]

This is now a true sequence of four FMAs and it's justifiable to call it 8 FLOP.
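
A tiny C++ sketch of that FMA chain for a single output element, following the expansion written above (same index order), seeded with the corresponding cell of C and using the standard library's fused multiply-add. Four FMAs per element, 16 elements, hence the 64 FMAs per 4x4x4 op:

```cpp
// One output element computed as a chain of four FMAs seeded with C's cell,
// mirroring the expansion written above. Values are arbitrary fill data.
#include <cmath>
#include <cstdio>

int main() {
    float A[4][4], B[4][4], C[4][4];
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) { A[i][j] = 1.0f; B[i][j] = 2.0f; C[i][j] = 0.25f; }

    float acc = C[0][0];                              // start from C, not 0
    for (int k = 0; k < 4; ++k)
        acc = std::fmaf(A[k][0], B[0][k], acc);       // one fused multiply-add per term
    printf("D[0][0] = %g\n", acc);                    // 4*(1*2) + 0.25 = 8.25
    return 0;
}
```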
 
From there, the groups of 4 results coming from a given row*column would be summed together.
That's what I originally thought, but I'm not sure that's the case. The tensor product, I believe, is creating 256 "weights" that are then added to the 256 from a previous operation, and so on. The FMA is fully utilized, just not how we thought.

In operation, I think it's possible that the hardware is arranged to broadcast/duplicate values to allow for the 64 multiply operations to be in-flight and to start the accumulations in parallel.
As I stated above, all accumulations for a single instruction would be parallel. The 4x4 product takes a single element of the first matrix and broadcasts it to all 16 lanes, which multiply and accumulate; two elements are processed for packed math.

Then it's just a question of how many 4x4 tiles you can process with the given hardware, as the result should be 256 parallel FMA operations.

I'm not sure which value is (0,0) in this case, or why it is useful to more than 4 lanes.
One element of the matrix at a time. The pattern is simplified, as the adds aren't dependent like in a typical matrix multiply.

Nvidia seems to be insistent that it's doing the 4x4 D = A×B + C operation in one clock.
They would be; what's changing is what is being accumulated, along with a much wider execution unit. In theory a 256-wide traditional SIMD with FMA would do this in one cycle.

I was also proposing pipelining that operation to increase throughput while keeping it compact. FP16 logic should settle more quickly than FP32, so use more and adjust the pattern.

Is there a specific operation or combination thereof you have in mind from those listed in the following?
http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
Quad permute (QDMode), but it would only be needed to flush the accumulators or perhaps coalesce inputs under some conditions.
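
To put the "accumulate over multiple sequential operations" idea in concrete terms, here is a plain C++ sketch that builds an 8x8 matrix product out of repeated 4x4 D = A×B + C tile operations, with each step feeding the previous tile result back in as C. This is only a scalar reference for the tiling pattern, not a claim about how the hardware or its register file would schedule it:

```cpp
// An 8x8 matrix product built from repeated 4x4 D = A*B + C tile ops, each op
// taking the previous tile result back in as C. Scalar reference for the
// tiling pattern only; not a statement about hardware scheduling.
#include <cstdio>

constexpr int T = 4;        // tile size of one tensor-style op
constexpr int S = 8;        // full matrix size (2x2 grid of tiles)

// one 4x4x4 op: C_tile = A_tile * B_tile + C_tile, addressed inside S x S matrices
void tile_op(const float A[S][S], const float B[S][S], float C[S][S],
             int ti, int tj, int tk) {
    for (int i = 0; i < T; ++i)
        for (int j = 0; j < T; ++j) {
            float acc = C[ti + i][tj + j];
            for (int k = 0; k < T; ++k)
                acc += A[ti + i][tk + k] * B[tk + k][tj + j];
            C[ti + i][tj + j] = acc;
        }
}

int main() {
    float A[S][S], B[S][S], C[S][S];
    for (int i = 0; i < S; ++i)
        for (int j = 0; j < S; ++j) { A[i][j] = 1.0f; B[i][j] = 1.0f; C[i][j] = 0.0f; }

    // each output tile accumulates over the k tiles in sequence
    for (int ti = 0; ti < S; ti += T)
        for (int tj = 0; tj < S; tj += T)
            for (int tk = 0; tk < S; tk += T)
                tile_op(A, B, C, ti, tj, tk);

    printf("C[0][0] = %g (expected %d)\n", C[0][0], S);  // all-ones inputs -> 8
    return 0;
}
```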
 