Tensors! *spawn*

Discussion in 'Architecture and Products' started by 3dilettante, Jun 2, 2017.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,361
    Likes Received:
    3,940
    Location:
    Well within 3d
    There's a specialized data flow and ALU arrangement to get a 4x4 matrix multiply and accumulate that result with another 4x4 in one clock, and it seems mostly the same number of register file accesses as a regular instruction barring the higher precision of the accumulator.

    In one clock, with GCN's existing broadcast/permute capability and regular register accesses?
    Is this still with a 64-wide wave?
    I've tried to mentally picture how this works, without getting into the relative widths of the units.

    Going by Nvidia's description of its 4x4x4 operation D=AxB+C, let's give each operand one register access.
    Each element in A would be broadcast in a strided manner to 4 elements in B. That's 64 horizontal broadcasts. While it should be possible to optimize away a broadcast when the lanes align, that would require a separate operation since the network is broadcasting.
    A SIMD can broadcast 1 value horizontally per clock, and this is part of a cross-lane network that is both more complex than the specialized broadcast and also insufficient since that's one pattern it cannot do.
    After the initial full-precision multiply, there would need to be a set of additions whose operands are at a stride of 4 in the other dimension and the elements from register C--not a standard 1 cycle operation or handled by the regular write-back path of a SIMD.
    Is there a way to synthesize this behavior with existing SIMD data paths and ALUs without a lot of serialization and separate instructions?
    That's a lot of register file and bypass/swizzle networks, and any number of instruction fetches and cycles.
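
    To make those counts concrete, here is a rough Python model of the data flow being described (purely an index-level sketch of a 4x4x4 D = AxB + C, not any real ISA or hardware path): each of the 16 A elements has to reach 4 lanes, for 64 broadcast deliveries; the 64 products are then reduced at a stride of 4, and C is folded in last.

    Code:
# Index-level model of D = A x B + C on 4x4 matrices (illustrative only).
import numpy as np

A = np.random.rand(4, 4).astype(np.float32)
B = np.random.rand(4, 4).astype(np.float32)
C = np.random.rand(4, 4).astype(np.float32)

# Multiply phase: A[i][k] must reach the 4 lanes holding B[k][0..3],
# i.e. 16 elements x 4 destinations = 64 broadcast deliveries.
deliveries = 0
partial = np.zeros((4, 4, 4), dtype=np.float32)
for i in range(4):
    for k in range(4):
        for j in range(4):
            partial[i, k, j] = A[i, k] * B[k, j]
            deliveries += 1

# Reduction phase: with the products flattened as i*16 + k*4 + j, summing
# over k means adding operands at a stride of 4, then adding C.
D = C + partial.sum(axis=1)

print(deliveries, "broadcast deliveries, 64 multiplies, 48 stride-4 adds, 16 adds of C")
assert np.allclose(D, A @ B + C)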

    Officially, there is currently no special relationship between Vega and Zeppelin. It's only described as using PCIe for the compute cards. xGMI was on a slide for Vega 20.


    If CUDA is not that relevant, then AMD's efforts are not that relevant relative to it either.
    It's sufficiently relevant that AMD's revamped compute platform specifically provisions for translating from CUDA.
     
    Razor1, sonen and pharma like this.
  2. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    You quoted me saying they threw silicon at the problem, through larger dice and ALUs. Having separate units would be a bad thing, as a chunk of the chip would only work with tensors: all that register space, FMA units, etc. wasted when it could otherwise be used for standard FP-heavy workloads. Tensor looks like the standard SIMT with all logic NOT related to FMA removed. What I suggested for GCN was roughly the same thing, with performance determined by the instruction distribution of the code. An adjustment there wouldn't be unreasonable, as AMD and Microsoft did just that when profiling code from lots of console titles for Scorpio.
     
  3. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    What you suggest is adding two 32b scalar units to each lane of the SIMD; I don't know how this is even remotely comparable to performing a matrix FMA in one clock. We are talking about 1024 operations per clock per SM.

    You talk of '4 tensor'ish cores' per CU. Assuming a tensor'ish core retains the throughput of a Tensor core, that's 512 operations per clock per CU. As 3dilettante eloquently stated a few posts up, that requires a significant number of register accesses to load and store all of the operands, and so far we don't know the details of the implementation.

    Needless to say, I think it's far from trivial to implement, and "simply adding two 'flexible scalars' to each lane" (which is simply throwing more ALU and die space at the problem, ironically) doesn't seem like a very convincing alternative.
     
  4. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Not exactly; what I'm suggesting is streamlining the SIMD and bulking up on execution units related to tensors, which also account for most graphics math. At FP16 that would be 8 operations per lane, half of those not counted as accumulators.

    The operations per clock are somewhat irrelevant, as it's an issue of area and dimensions. I'm unsure what the ideal matrix size for deep learning would be, as they seem to vary a lot.

    Hence "Tensor'ish". More a question of throughput.

    It's not flexible per lane, but per SIMD, to allow for streamlining the SIMD. It's a method to get more execution units with denser logic and avoid replicating parts that aren't used frequently.

    Just to be clear, I'm speculating on this, but it could be a method to reorganize the CU. It's those details we haven't seen yet.
     
  5. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    I think I figured out our problem: "Warp-Level Matrix Operations". Nvidia just stated part of the math being performed, not the tensor. The result of two 4x4 matrices is a 16x16 array. A 16-wide SIMD would just broadcast the (0,0) element, two for packed math, with the assumption of tiled results being accumulated. The accumulator would be sitting on the equivalent of 4 VGPRs (4x64 = 256) aligned to quads, which would fit with the existing permute capabilities if they were even needed. Permutations on four read operands should be able to handle most alignment concerns.

    What the tensor cores provide is the extra FP32 adds, as the data alignment looks relatively simple now. Volta's increases come down to execution units and clockspeed increases from the lower precision. Time the speed to the FP16 mul with alternating adders to keep pace. A single tensor op shouldn't write the same location twice; they just use accumulation over multiple sequential operations to free operands.
     
    BRiT likes this.
  6. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    What do you mean by tiled results being accumulated?

    The result of multiplying two 4x4 matrices is still a 4x4 matrix, not 16x16. Perhaps I misunderstood, but I'm not following.
    Each element of the resultant matrix is the scalar product of the respective column and row vectors.

    That's 4 products and 3 accumulates per element.
     
    pharma and Razor1 like this.
  7. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    They aren't matrices though, they're 2nd-order tensors that look like matrices. The product of two 2nd-order tensors is 4x4x4 (recall 64 flops per cycle on tensor cores). Different math operation. The accumulation is a data flow thing, not the traditional consolidation we're used to.
     
  8. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    A vector is a tensor of rank (or order) 1. A rank-2 tensor is represented by a matrix.
     
  9. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    Are you talking about tensor products? This is not performing tensor products; it's performing matrix FMA on 4x4 matrices.
     
    pharma and Razor1 like this.
  10. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Yes, tensor products, with the accumulation being part of the deep learning algorithm. That's why we keep having difficulty visualizing all the adds in a consolidation step. They aren't accumulating rows/cols, but results over time, which maps to a multiply+accumulate loosely following an SGEMM routine.
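
    For what that could look like: a blocked SGEMM builds a large multiply out of many small tile products, accumulating into the same output tile over successive steps along k. A Python sketch with a 4x4x4 multiply-accumulate as the hypothetical building block (illustrative only, not a claim about the actual hardware or algorithm):

    Code:
import numpy as np

def tensor_op(A4, B4, C4):
    # Stand-in for one 4x4x4 multiply-accumulate: D = A4 x B4 + C4.
    return A4 @ B4 + C4

def blocked_gemm(A, B, tile=4):
    # The output accumulates tile products over time: one tensor op per
    # (i, j, p) tile step, with the p loop reusing the same output tile.
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                C[i:i+tile, j:j+tile] = tensor_op(A[i:i+tile, p:p+tile],
                                                  B[p:p+tile, j:j+tile],
                                                  C[i:i+tile, j:j+tile])
    return C

A = np.random.rand(16, 16).astype(np.float32)
B = np.random.rand(16, 16).astype(np.float32)
assert np.allclose(blocked_gemm(A, B), A @ B, atol=1e-4)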
     
  11. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    There's no accumulation involved in a tensor product, and that's beside the point: it's extremely clear from the limited information Nvidia has released that they are performing a matrix-matrix multiply.

    I think you've been misled by the diagram they have on their blog.
     
    CSI PC and Razor1 like this.
  12. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Maybe, but it literally states "matrix multiply and accumulate" for tensor ops. The only difference is a storage register and not transposing a matrix in the process. Plus it makes more sense that a "Tensor Core" operates on "tensor" math with multi-linear algebra. The real performance gain looks to be from additional execution units and some more efficient pipelining. On paper, and FP64 aside, that's only 20% higher throughput than Vega with FP32 despite the 60% size advantage. Conceivably Vega could be close on "tensor" ops if they piled on the adders and clocked around FP16.
     
  13. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    Vega is rated for 100 TFLOPS?
     
  14. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,051
    Likes Received:
    2,925
    Location:
    Finland
    You missed the "piled on the adders" part
     
  15. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,526
    Likes Received:
    2,213
    https://www.top500.org/news/nvidia-raises-performance-bar-with-volta-gpu/
     
    Jawed and manux like this.
  16. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    I don't know where this 4x4x4 came from; I can't figure out if it's to signify that it operates on 3 matrices in one clock, or to signify the total number of scalar FMAs (4 per element of the matrix), but it is most definitely operating on 2-dimensional arrays, not higher.

    It's extremely clear in the blog: it is performing FMA matrix-matrix operations.
    [attached: screenshot from Nvidia's blog post]

    This is essentially what each unit in the tensor core is doing every clock; there are 64 of these in each tensor core.
    [attached: second screenshot from Nvidia's blog post]
     
    Gubbi, Jawed and pharma like this.
  17. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    As to the 4x4x4 dimensions, every 2D matrix multiplication has 3 dimensions, usually called m, n, and k.

    The first matrix is (m, k) dimensions, the second is (k, n), and multiplying them produces an output matrix of (m, n) dimensions. The inner dimension, k, has to be the same for both matrices being multiplied or else the multiplication isn't defined. So 4x4x4 just means that you multiply a (4x4) and a (4x4) matrix to yield another (4x4) matrix that you then accumulate into a (4x4) matrix.
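
    In code form (a plain-Python sketch of that naming convention, nothing implementation-specific): the inner dimension k must match, and the 4x4x4 op is just the m = n = k = 4 case.

    Code:
def matmul_accumulate(A, B, C):
    # A is (m, k), B is (k, n), C and the result D are (m, n).
    m, k = len(A), len(A[0])
    k2, n = len(B), len(B[0])
    assert k == k2, "inner dimension k must match or the multiply isn't defined"
    return [[C[i][j] + sum(A[i][p] * B[p][j] for p in range(k))
             for j in range(n)] for i in range(m)]

# The tensor op is the (m, n, k) = (4, 4, 4) case: three 4x4 inputs, one 4x4 output.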
     
    Razor1 and pharma like this.
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,361
    Likes Received:
    3,940
    Location:
    Well within 3d
    I am not exactly sure where this now fits since the discussion moved on, but Nvidia's description of the tensor operation D=A x B + C would have a multiplication phase between A and B that would produce 64 outputs. From there, the groups of 4 results coming from a given row*column would be summed together. This would return things to a 4x4 matrix to be added to C. In operation, I think it's possible that the hardware is arranged to broadcast/duplicate values to allow for the 64 multiply operations to be in-flight and to start the accumulations in parallel.

    A sticking point to using a standard SIMD is that the operands A,B,C,D are sourced from or written to a register file whose configuration gives an initial arrangement where the values are delivered as 1x16 matrices, with the constraint introduced by the SIMD that only matching positions in the operands could interact. Cross-lane operations would relax the constraint, but I am drawing a blank on an instruction for this specific pattern.

    Also, if this is using a SIMD fundamentally similar to GCN, there is a serious amount of underutilization.

    I'm not sure which value is (0,0) in this case, or why it is useful to more than 4 lanes.

    Is there a specific operation or combination thereof you have in mind from those listed in the following?
    http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

    Nvidia seems to be insistent that it's doing the 4x4 D=A x B + C operation in one clock.
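
    One way to picture the sticking point above: if A, B, C, and D each sit as 16 flat lanes (row-major), then lane l of D needs A from lanes (l/4)*4 + k and B from lanes k*4 + (l%4) for k = 0..3; only the C read is a matching-position access. A small Python sketch of that index mapping (illustrative only, not a claim about the actual data path):

    Code:
def tensor_op_flat(a, b, c):
    # a, b, c are 16-entry lists: 4x4 matrices flattened row-major across
    # the lanes of a notional 16-wide register.
    d = [0.0] * 16
    for lane in range(16):
        row, col = lane // 4, lane % 4
        acc = c[lane]                          # same-lane access
        for k in range(4):
            # cross-lane accesses: neither operand lives in 'lane' itself
            acc += a[row * 4 + k] * b[k * 4 + col]
        d[lane] = acc
    return d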
     
    #18 3dilettante, Jun 4, 2017
    Last edited: Jun 4, 2017
    BRiT and pharma like this.
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    I think what people are missing here is that in a standard BLAS library, matrix multiply effectively comes in two variants: with and without a "constant" that is added. The addition of C is "free" when performing a matrix multiply on hardware where the base operation is FMA, which is why the tensor operation is 64 FMAs. Indeed, in BLAS, C is always present in the definition but the library will be optimised for the two cases: whether matrix C is present or not present.

    In conventional matrix multiply C is zero and that is what people normally call matrix multiply. Indeed, when using FMA hardware to compute a conventional matrix multiply, the FLOP count is over-stated. Taking the resultant for D[0, 0], which consists of:

    A[0, 0] * B[0, 0]
    + A[1, 0] * B[0, 1]
    + A[2, 0] * B[0, 2]
    + A[3, 0] * B[0, 3]

    the very first operation in FMA hardware is A[0, 0] * B[0, 0] + 0. Notice that 0. It's completely pointless. But in hardware that does FMA, the loop normally performs four FMAs, not one multiply followed by a loop of three FMAs. So the FLOP count is over-stated because it's counted as four FMAs = 8 FLOP.

    When C is non-zero, instead of zeroing the starting accumulator for each "cell" in the result matrix, one puts the corresponding cell from C there. Every computation that follows is then the same as a "normal" matrix multiply. So D[0, 0] is now:

    C[0, 0] + A[0, 0] * B[0, 0]
    + A[1, 0] * B[0, 1]
    + A[2, 0] * B[0, 2]
    + A[3, 0] * B[0, 3]

    This is now a true sequence of four FMAs and it's justifiable to call it 8 FLOP.
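
    A small Python illustration of that counting argument (a sketch, not any particular BLAS implementation): seeding the accumulator with 0 versus with C[i][j] costs the same four FMAs per output element, so only the non-zero-C case genuinely earns the 8 FLOP per element count.

    Code:
def gemm_4x4(A, B, C=None):
    # Per output element: 4 FMAs whether or not C is present. When C is
    # absent the first FMA just adds 0, which is why the FLOP count is
    # over-stated in that case.
    D = [[0.0] * 4 for _ in range(4)]
    fmas = 0
    for i in range(4):
        for j in range(4):
            acc = C[i][j] if C is not None else 0.0
            for k in range(4):
                acc = acc + A[i][k] * B[k][j]   # one FMA
                fmas += 1
            D[i][j] = acc
    return D, fmas                              # fmas == 64 either way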
     
  20. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    That's what I originally thought, but I'm not sure that's the case. The tensor product I believe is creating 256 "weights" that are then added to the 256 from a previous operation and so on. The FMA is fully utilized, just not how we thought.

    As I stated above, all accumulations for a single instruction would be parallel. The 4x4 product takes a single element of the first matrix and broadcasts it to all 16 lanes, which multiply and accumulate. Two elements are processed for packed math.

    Then it's just a question of how many 4x4 tiles you can process with given hardware as the result should be 256 parallel FMA operations.

    One element of the matrix at a time. The pattern is simplified as the adds aren't dependent like a typical matrix.

    They would be; what's changing is what is being accumulated, along with a much wider execution unit. In theory a 256-wide traditional SIMD with FMA would do this in one cycle.

    I was also proposing pipelining that operation to increase throughput while keeping it compact. FP16 logic should settle more quickly than FP32, so use more and adjust the pattern.

    Quad permute (QDMode), but it would only be needed to flush the accumulators or perhaps coalesce inputs under some conditions.
     