Tensors! *spawn*

The second important point about C is that it enables you to construct the multiplication of larger matrices from the building blocks of smaller multiplies. So if we're working with 8x8 matrices, we can break each matrix down into 4 sub-matrices: top-left, top-right, bottom-left and bottom-right.

So the top-left sub-matrix of D is computed as follows:

D-tl =

A-tl * B-tl
+ A-tr * B-bl

then:

D-bl =

A-bl * B-tl
+ A-br * B-bl

and then similarly for D-tr and D-br.

So the tensor operation becomes the fundamental building block of arbitrary-sized matrix multiplication. In prior GPUs, FMA was that building block. This tensor operation is essentially an FMA on matrix-blocks.
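
To make that concrete, here's a rough sketch in plain C++ (hypothetical types and helper names, nothing to do with NVIDIA's actual API) of an 8x8 multiply built from 4x4 block FMAs:

Code:
// Hypothetical host-side sketch: an 8x8 multiply built from 4x4 block FMAs,
// i.e. the D = A*B + C primitive used as a building block.
struct blk4 { float m[4][4]; };

// One "tensor op" on 4x4 blocks: returns A * B + C.
static blk4 block_fma(const blk4 &A, const blk4 &B, blk4 C) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                C.m[i][j] += A.m[i][k] * B.m[k][j];
    return C;
}

// 8x8 matrices as 2x2 grids of 4x4 blocks: each quadrant of D is two chained
// block FMAs, e.g. D_tl = A_tl*B_tl + A_tr*B_bl, D_bl = A_bl*B_tl + A_br*B_bl.
static void mul8x8(const blk4 A[2][2], const blk4 B[2][2], blk4 D[2][2]) {
    for (int r = 0; r < 2; ++r)
        for (int c = 0; c < 2; ++c) {
            blk4 acc = {};                              // zero accumulator
            acc     = block_fma(A[r][0], B[0][c], acc); // first block product
            D[r][c] = block_fma(A[r][1], B[1][c], acc); // plus the second
        }
}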

So 4 tensor cores share the same register file as 32 FP32 cores (or 32 INT cores, or 16 FP64 cores). An FP32 core can fetch two FP16 operands per clock for each of A, B and C in a conventional FMA (D = A * B + C; note that V100 has no ordinary, general-purpose FP16 cores), and each tensor core takes the place of eight FP32 cores, so a tensor core has access to 8 x 6 = 48 FP16 operands per clock. That is precisely the operand count the tensor core needs in FP16 mode: 16 each for the 4x4 A, B and C matrices.

In mixed FP16/FP32 mode, D is computed as FP32 and then fed forward into successive tensor operations as C. In this scenario, register forwarding takes on the strain of providing the operand bandwidth for the 32-bit C, so the GPU isn't starved of bandwidth trying to fetch 2x 16-bit operands for A and B plus 32-bit for C from the register file.

So in the computation of D-tl, the second tensor operation is A-tr * B-bl (A * B, both FP16 4x4 matrices) added to C, which is an FP32 4x4 matrix. C was computed in the previous tensor operation from A-tl * B-tl as an FP32 4x4 resultant. So forwarding C to the second tensor operation, instead of writing it to the register file, means there is no issue with the operand bandwidth that the FP32 C would otherwise consume.
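
A similarly hand-wavy sketch of that mixed-precision chain (hypothetical names again, block maths written out scalar-wise): the FP32 accumulator produced by the first block op is consumed directly by the second instead of being written back and refetched.

Code:
#include <cuda_fp16.h>

// Sketch only: hmma_4x4 is a hypothetical stand-in for the hardware's
// mixed-precision block op, D(fp32) = A(fp16) * B(fp16) + C(fp32).
struct blk4_h { __half m[4][4]; };   // FP16 operand block
struct blk4_f { float  m[4][4]; };   // FP32 accumulator block

__device__ blk4_f hmma_4x4(const blk4_h &A, const blk4_h &B, blk4_f C) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                C.m[i][j] += __half2float(A.m[i][k]) * __half2float(B.m[k][j]);
    return C;
}

// D-tl as described above: the FP32 accumulator from the first op is fed
// straight into the second op as C instead of being written back and refetched.
__device__ blk4_f d_topleft(const blk4_h &A_tl, const blk4_h &A_tr,
                            const blk4_h &B_tl, const blk4_h &B_bl) {
    blk4_f acc = {};                          // zero accumulator
    acc = hmma_4x4(A_tl, B_tl, acc);          // produces FP32 C
    return hmma_4x4(A_tr, B_bl, acc);         // consumes C directly
}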
 
The wording of the devblog suggests they are performing this Matrix FMA in one cycle, which is what most of the discussions have been about.

They explicitly state each tensor core is capable of performing 64 ops per cycle, but that only covers the multiplication of two matrices; it's unclear how all the elements of every row x column product are accumulated in one cycle, as each parallel FMA op would be dependent on the "previous", or rather adjacent, one.
 
Deep Learning 101: Demystifying Tensors
At its core, linear algebra consists of fairly simple and very regular operations involving repeated multiplication and addition operations on one- and two-dimensional arrays of numbers (often called vectors and matrices in this context) and it is tremendously general in the sense that many problems can be solved or approximated by linear methods. These range from rendering the images in computer games to nuclear weapon design, as well as for a huge range of other applications between these extremes.
...
We are at just such a threshold right now. The difference is that there isn’t anywhere to go beyond matrix-matrix operations. That is, there isn’t anywhere else to go using linear algebra.

But we don’t have to restrict ourselves to linear algebra. As it turns out, we can move up the mathematical food chain a bit. It has long been known that there are bigger fish in the mathematical abstraction sea than just matrices. One such candidate is called a tensor. Tensors figure prominently in the mathematical underpinnings of general relativity and are fundamental to other branches of physics as well. But just as the mathematical concepts of matrices and vectors can be simplified down to the arrays we use in computers, tensors can also be simplified and represented as multidimensional arrays and some related operations. Unfortunately, things aren’t quite as simple as they were with matrices and vectors, largely because there isn’t an obvious and simple set of operations to be performed on tensors like there were with matrices and vectors.

There is some really good news, though. Even though we can’t write just a few operations on tensors, we can write down a set of patterns of operations on tensors. That isn’t quite enough, however, because programs written in terms of these patterns can’t be executed at all efficiently as they were written. The rest of the good news is that our inefficient but easy to write programs can be transformed (almost) automatically into programs that do execute pretty efficiently.
http://www.kdnuggets.com/2017/06/de...rs.html?utm_source=dlvr.it&utm_medium=twitter
 
The wording of the devblog suggests they are performing this Matrix FMA in one cycle, which is what most of the discussions have been about.
They are performing one per cycle, not in one cycle. This is very likely a pipelined operation taking (at least) 4 cycles.
That is not too difficult to do. AMD's old VLIW architectures supported a "horizontal" FP32 DP4 (i.e. dependent operations) in a single VLIW instruction (using 4 slots), meaning the old VLIW-SIMDs (consisting of 16 VLIW units) could do almost the same as a tensor core (they had the capacity for 16 DP4 per clock, which is what you need for a 4x4 matrix multiplication, and that even in FP32). The trick is to cleverly route the operands to the f16-FMAs and share them between the individual FMA units to enable a more power-efficient operation. But that appears to be fairly easy to do.
They explicitly state each tensor core is capable of performing 64 ops per cycle, but that only covers the multiplication of two matrices; it's unclear how all the elements of every row x column product are accumulated in one cycle, as each parallel FMA op would be dependent on the "previous", or rather adjacent, one.
They do 64 FMAs (f16xf16+f32). NV claims 120 TOPS with 640 tensor cores.
640 tensor cores * 128 ops/cycle * 1.465 GHz = 120 TOPS
Appears to be consistent. And the dependencies can be dealt with as in previous GPUs.
Has NV said anything about a guaranteed order of the operations? I guess not (it is often undefined for matrix operations anyway). That means the actual implementation doesn't need to consist of 16 blocks of four dependent f16-FMAs. The multiplication part can be done in parallel (all 64 of them; other possibilities exist), but the accumulation part could be 16 shared 5-input adders (each adding the results of 4 multiplications to the accumulator). One alternative would be four 2-input adders for each block of 4 multipliers, which one can arrange over 3 stages after the parallel multiplications (2 adds in parallel followed by two serial ones, resulting in 4 pipeline stages in total for the actual execution, or 8 assuming each multiplication or addition needs two stages).
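
Purely as an illustration of that last arrangement (not a claim about the real datapath, and with the element types simplified to float), a scalar sketch with all 64 multiplies done first and then a short adder tree per output element:

Code:
// Illustration only: a 4x4x4 tensor op as 64 independent multiplies followed
// by a per-output reduction of four products plus the accumulator, arranged
// as the "four 2-input adders over 3 stages" alternative described above.
void tensor_op_4x4x4(const float A[4][4], const float B[4][4],
                     const float C[4][4], float D[4][4]) {
    float prod[4][4][4];                    // stage 1: all 64 multiplies in parallel
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                prod[i][j][k] = A[i][k] * B[k][j];

    for (int i = 0; i < 4; ++i)             // per output element: 4 adds over 3 stages
        for (int j = 0; j < 4; ++j) {
            float s0 = prod[i][j][0] + prod[i][j][1];  // stage 2: two adds in parallel
            float s1 = prod[i][j][2] + prod[i][j][3];
            D[i][j] = (s0 + s1) + C[i][j];             // stages 3-4: two serial adds
        }
}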
 
I asked this in the main Volta thread, but maybe it will have more traction here. Can the normal SMs operate at full capacity together with the tensor cores or are they throttled when using them?

If you can use them together, it would be strange that there isn't any mention of visualization uses for the tensor cores, even if that isn't as hip or high-margin as artificial stupidity.
 
Can the normal SMs operate at full capacity together with the tensor cores or are they throttled when using them?

Don't think so. As Jawed pointed out, the bandwidth to/from the register files for the tensor units matches that of 8 FP32 "cores". Each SM has 8 tensor units and 64 FP32 cores, which suggests the register file, and the associated buses, are shared between the execution units.

Cheers
 
Nvidia mentions it is partitioned as:
However, the GV100 SM uses a new partitioning method to improve SM utilization and overall performance. Recall the GP100 SM is partitioned into two processing blocks, each with 32 FP32 Cores, 16 FP64 Cores, an instruction buffer, one warp scheduler, two dispatch units, and a 128 KB Register File.
The GV100 SM is partitioned into four processing blocks, each with 16 FP32 Cores, 8 FP64 Cores, 16 INT32 Cores, two of the new mixed-precision Tensor Cores for deep learning matrix arithmetic, a new L0 instruction cache, one warp scheduler, one dispatch unit, and a 64 KB Register File. Note that the new L0 instruction cache is now used in each partition to provide higher efficiency than the instruction buffers used in prior NVIDIA GPUs.

Also worth noting this statement from Bulat Ziganshin, which a senior Nvidia engineer confirmed as correct:
It seems that each FP32/INT32 instruction scheduled occupies a sub-SM for two cycles, so on the next cycle another type of instruction should be issued - pretty similar to LD/ST instructions on all NVidia GPUs, as well as scheduling on SM 1.x GPUs

So, the new architecture allows FP32 instructions to run at full rate, and uses the remaining 50% of issue slots to execute all other types of instructions - INT32 for index/cycle calculations, load/store, branches, SFU, FP64 and so on. And unlike Maxwell/Pascal, full GPU utilization doesn't need to pack pairs of co-issued instructions into the same thread - each cycle can execute instructions from a different thread, so one thread performing a series of FP32 instructions and another thread performing a series of INT32 instructions will load both blocks to 100%

is my understanding correct?
 
Generally speaking they appear throttled. There probably exists a corner case where they all run at full speed, however. That would involve instructions that can operate primarily out of a register cache while running concurrently, with the actual register file being shared among units. That's where that 50-75% reduction in RF accesses would come into play. In practice that would be a rather strange workload.
 
So...

I believe a lot of the analysis in this thread is based on a misunderstanding: that tensor cores operate on 4x4 matrices. The calculations Jawed did in terms of register file bandwidth prove that there is enough register file bandwidth for them to work on 4x4 in *theory* - but in practice, I believe the wiring, scheduling, and instruction set do not allow this: the minimum matrix size is 16x16 (as per the CUDA documentation and other indications).

There is a very fundamental reason why this makes sense: power consumption. Reading/writing the register file costs a *lot* of power - depending on the architecture, potentially more power than the tensor calculations themselves! By working on 16x16 matrices across a warp (rather than 4x4 across 1 lane), you are reading 256 inputs for 4096 multiplies, instead of 64 inputs for 256 multiplies with 4x4; i.e. 4x less register file bandwidth per flop, so 4x less register file power consumption! (note I'm ignoring the issue of forwarding the results between operations, which I suspect NVIDIA is not doing, but it's hard to say).

In theory, you could support smaller or larger matrix sizes; for 4x4, it would "simply" require a larger bus and extra wiring to the tensor cores. But why pay that area cost if deep learning workloads aren't going to benefit from it? And the more "requesters" need high bandwidth access to the register file, the more the wiring distance from the register file to the main FP32 ALUs may have to be increased in practice, resulting in slightly higher power consumption for non-tensor workloads...

On the other end, larger matrices would reduce register file accesses further, but may require more temporary storage inside the tensor cores themselves, and/or more complex logic inside them if the matrix size is variable from instruction to instruction. My assumption is that NVIDIA decided 16x16 matrices were a sweetspot where register file bandwidth was a sufficiently low percentage of total power, so there's no need to support larger matrices in the current architecture. One downside of not supporting 4x4 matrices is that the tensor cores become even more useless for graphics...

Finally I suspect what NVIDIA did on Volta is pairing a 16-wide ALU with a 32-wide register file. There are several ways they could have done this and several possible benefits depending on their approach; one possibility is that by "overspeccing" the register file bandwidth, they could simplify the logic for bank clash mitigation (possibly saving power). Certainly it should allow much better co-issue than in the past, and FP32+[SFU/TMU/...] co-issue is a clear benefit for traditional graphics workloads.

---

P.S.: I was tempted to buy a Titan V at some point to run custom microbenchmarks and do an in-depth analysis, but the shipping date is only Dec 30 on the UK website, so it might have to wait...
 
I'm going to guess that 16x16 is the macro they compile to, it's just the unit size for the macro.

The sheer quantity of data that needs to be held in flight for a 16x16 matrix-matrix multiply is stupendous: 32 fp16 operands and then there's the in-flight resultant, which is now an even bigger monster, 256 fp16s!

This is the same old gotcha that applies to any MMM algorithm on GPU: the register allocation for the in-flight operands/partial-resultants when performing FMAs on large, square, portions of a matrix in conventional code are the pain point for optimisation.

If you allocate lots of registers, then you lose threads for latency-hiding. Similarly, if you try to build a "single instruction" pipeline that operates on 32 operands concurrently with a 256 resultant space, you're going to be in pain.
 
The sheer quantity of data that needs to be held in flight for a 16x16 matrix-matrix multiply is stupendous: 32 fp16 operands and then there's the in-flight resultant, which is now an even bigger monster, 256 fp16s!

Don’t forget that the tensor cores produce FP32 outputs.
 
The sheer quantity of data that needs to be held in flight for a 16x16 matrix-matrix multiply is stupendous: 32 fp16 operands and then there's the in-flight resultant, which is now an even bigger monster, 256 fp16s!

This is the same old gotcha that applies to any MMM algorithm on GPU: the register allocation for the in-flight operands/partial-resultants when performing FMAs on large, square, portions of a matrix in conventional code are the pain point for optimisation.
Agreed but remember NVIDIA's trick is to execute the tensor operation across an entire 32-wide warp rather than 1 lane, so the 256 inputs/outputs would really be 32 inputs/outputs-per-lane. Inside a program using the tensor cores, the 32-wide warp is arguably being used more like a single thread rather than 32 threads of the "SIMT" model, so the constraints are a lot more relaxed than for normal GPU programs.

Taking the very naive case of 256 FP16 inputs + 256 FP16 inputs + 256 FP32 inputs + 256 FP32 outputs with 1 warp executing at a time per tensor core, that's 256x(2+2+4+4) = 3KiB of temporary storage per tensor core, or 2016KiB (~2MiB) per GV100. For comparison, the register file is 21MiB and the L1 caches are 10.5MiB. So an extra 2MiB isn't all that crazy, and there's surely room to make it a lot less than that if you're clever in the order of operations.
 
I'm going to guess that 16x16 is the macro they compile to, it's just the unit size for the macro.

The sheer quantity of data that needs to be held in flight for a 16x16 matrix-matrix multiply is stupendous: 32 fp16 operands and then there's the in-flight resultant, which is now an even bigger monster, 256 fp16s!

This is the same old gotcha that applies to any MMM algorithm on GPU: the register allocation for the in-flight operands/partial-resultants when performing FMAs on large, square, portions of a matrix in conventional code are the pain point for optimisation.

If you allocate lots of registers, then you lose threads for latency-hiding. Similarly, if you try to build a "single instruction" pipeline that operates on 32 operands concurrently with a 256 resultant space, you're going to be in pain.
The WMMA API is what sets this to 16x16, due to being warp-level oriented; separately, each core, as noted in the link below, can do 4x4 tensor maths, which comes back to the partition information I posted a bit earlier in the thread.
Separately some additional info here: https://devtalk.nvidia.com/default/topic/1018180/volta-tensor-computation-question/

This (access to the Tensor Cores, including via the WMMA API) was under the Preview Feature classification in CUDA 9, so it will probably change/evolve going forward.
 
Agreed but remember NVIDIA's trick is to execute the tensor operation across an entire 32-wide warp rather than 1 lane,
Yes this is what I expect the macro is doing: ganging the lanes together. This is precisely how any GPU MMM is written, typically using shared memory as a "programmable operand collector" (though not necessarily for all operands). This is normally done to keep within the constraints of register file bandwidth as well as to gain maximum arithmetic intensity - the more operands in flight across a gang of work items, the greater the arithmetic intensity (since registers and shared memory can be assumed to have "no latency").
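
For reference, the usual shape of that ganging in conventional code is the classic tiled kernel below (a bare-bones sketch, arbitrary tile size, assuming N is a multiple of the tile); shared memory plays the "programmable operand collector" role, so every operand staged there is reused by the whole tile of lanes:

Code:
#define TILE 16

// Plain tiled SGEMM sketch: shared memory acts as the "programmable operand
// collector", so each fetched operand is reused across the whole tile.
__global__ void sgemm_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {        // assumes N % TILE == 0
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)          // each shared element reused TILE times
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}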

Back to the original question: "can you do other math at the same time as the tensor cores are running at full speed?" Despite my calculations regarding RF bandwidth, it seems reasonable to presume that it is actually possible to do some math at the same time, if macros are used for ganging lanes, since operand-fetch bandwidth is now amortised across multiple lanes (a sort of broadcast, but it wouldn't actually be a full broadcast).

Judging from the comments about the latest versions of these macros, it would appear that 4x4 is genuinely the building block and would be subject to the RF bandwidth calculations I did earlier. Whether use of macros is enforced in order to achieve the declared performance (and/or allowing other math in parallel) is harder to say.

There may be other things going on with the macros, e.g. they produce code that behaves properly when multiple hardware threads try to share the tensor cores (is that a thing?). In this case I'm guessing the macros make for code that neatly interleaves the work whilst fully respecting register file, cache-hierarchy and off-die bandwidths. MMM efficiency is solely about making the entire memory hierarchy sing your tune, so an analysis of RF bandwidth, alone, doesn't provide a full picture.

NVidia may have chosen to deploy macros in order to prevent people from wasting time trying to get the right performance. Apart from anything else it would hurt their marketing if a core with "just a single instruction" was hard to get working at the declared performance.

Jawed
 
The closest answer you will get is the statement by Ziganshin that a senior engineer confirmed as being correct.
It seems that each FP32/INT32 instruction scheduled occupies a sub-SM for two cycles, so on the next cycle another type of instruction should be issued - pretty similar to LD/ST instructions on all NVidia GPUs, as well as scheduling on SM 1.x GPUs

So, the new architecture allows FP32 instructions to run at full rate, and uses the remaining 50% of issue slots to execute all other types of instructions - INT32 for index/cycle calculations, load/store, branches, SFU, FP64 and so on. And unlike Maxwell/Pascal, full GPU utilization doesn't need to pack pairs of co-issued instructions into the same thread - each cycle can execute instructions from a different thread, so one thread performing a series of FP32 instructions and another thread performing a series of INT32 instructions will load both blocks to 100%
 
The closest answer you will get is the statement by Ziganshin that a senior engineer confirmed as being correct.
That's not very helpful though, as it mentions FP32 and INT32 concurrently, which is known to be the case with Volta. It's akin to vector and scalar issue on AMD hardware, with scalar/INT32 performing addressing/control and not floating-point math. Then throw in an LDS access to the register file as a tensor. In the case of tensors, neither may be available, as FP logic could be driving the MULs and slightly modified INT logic the adders, or the registers could be shared where contention exists.

Even if they can all issue at once, bandwidth will be very limited. It just seems like it would create a lot of ported registers that aren't all that useful if not sustained. Just consider the RF bandwidth to sustain all of that compared to memory bandwidth. It would require extremely ALU heavy code to maintain utilization. Tensors alone can burn all that bandwidth.
 
That's not very helpful though, as it mentions FP32 and INT32 concurrently, which is known to be the case with Volta. It's akin to vector and scalar issue on AMD hardware, with scalar/INT32 performing addressing/control and not floating-point math. Then throw in an LDS access to the register file as a tensor. In the case of tensors, neither may be available, as FP logic could be driving the MULs and slightly modified INT logic the adders, or the registers could be shared where contention exists.

Even if they can all issue at once, bandwidth will be very limited. It just seems like it would create a lot of ported registers that aren't all that useful if not sustained. Just consider the RF bandwidth to sustain all of that compared to memory bandwidth. It would require extremely ALU heavy code to maintain utilization. Tensors alone can burn all that bandwidth.
It also mentions other functions/instructions, including FP64; I think you are looking at it in the wrong context.
load/store, branches, SFU, FP64 and so on.
 
I should have looked at the CUTLASS documentation/articles more closely, as they have a nice section on the WMMA API used by Tensor Cores; CUTLASS utilises WMMA when it comes to Tensor Core related maths.
One caveat is that it covers CUDA 9.0, where WMMA was restricted to 16x16 (the Tensor Cores can do 4x4, but outside of this access) and was, like I mentioned, under the Preview classification, meaning it is subject to change; as pointed out by xmas, 9.1 has added additional size options and it will probably continue to evolve.
Scroll down to the WMMA GEMM section, about a third to half way down: https://devblogs.nvidia.com/parallelforall/cutlass-linear-algebra-cuda/
The 1st section deals more with CUTLASS generally outside of Tensor Cores.

Edit:
Just to help, the relevant section in the link says:
"In effect, the WMMA API is an alternative to the thread tile structure described in the previous section for warp-wide matrix multiply-accumulate operations. Rather than decomposing the warp tile structure into scalar and vector elements owned by individual threads, the WMMA API provides an abstraction to the programmer for warp-cooperative matrix fragment load / store and multiply-accumulate math operations.

Figure 7 shows the warp tile structure that targets the CUDA WMMA API. Calls to wmma::load_matrix_sync load fragments of A and B into instances of the nvcuda::wmma::fragment<> template, and the accumulator elements for the warp tile are structured as an array of nvcuda::wmma::fragment<accumulator> objects. These fragments store a 2D matrix distributed among the threads of the warp. Finally, calls to nvcuda::wmma::mma_sync() for each accumulator fragment (and corresponding fragments from A and B) compute the warp-wide matrix multiply-accumulate operation using Tensor Cores."
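
For concreteness, a minimal warp-level tile in that CUDA 9 WMMA style looks roughly like this (packed 16x16 row-major tiles assumed, so all leading dimensions are 16):

Code:
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A*B + C for a single 16x16 tile on the Tensor Cores.
// FP16 inputs, FP32 accumulation.
__global__ void wmma_tile(const half *A, const half *B, const float *C, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                        // warp-cooperative loads
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);

    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);           // D = A*B + C

    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}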
 
I'm going away for a week, don't have the time to finish the analysis and write a proper blog post about this now, and who knows if anyone will care post-GTC where NVIDIA might or might not announce a new architecture...

When using cuBLAS, there are a small number of handwritten assembly kernels that are used - they can be extracted with NVIDIA's visual profiler.

FP32 (basic SGEMM): https://www.dropbox.com/s/4ih9z461z97l2xg/volta_sgemm_128x64_nn?dl=0
Tensor Cores FP32 accumulation: https://www.dropbox.com/s/ggt67malq...a_fp16_h884gemm_fp16_128x128_ldg8_f2f_nn?dl=0
Tensor Cores FP16 accumulation: https://www.dropbox.com/s/v8ma78c747kxdw7/Tensor-FP16-volta_h884gemm_128x128_ldg8_nn?dl=0

If you look at this bit of the FP32 tensor cores kernel:
Code:
       HMMA.884.F32.F32.STEP3 R118, R188.COL, R206.reuse.COL, R118;
       HMMA.884.F32.F32.STEP0 R104, R186.reuse.COL, R206.reuse.COL, R104;
       HMMA.884.F32.F32.STEP1 R106, R186.reuse.COL, R206.reuse.COL, R106;
       HMMA.884.F32.F32.STEP2 R108, R186.reuse.COL, R206.reuse.COL, R108;
       HMMA.884.F32.F32.STEP3 R110, R186.COL, R206.reuse.COL, R110;
       HMMA.884.F32.F32.STEP0 R96, R184.reuse.COL, R206.reuse.COL, R96;
       HMMA.884.F32.F32.STEP1 R98, R184.reuse.COL, R206.reuse.COL, R98;
       HMMA.884.F32.F32.STEP2 R100, R184.reuse.COL, R206.reuse.COL, R100;
       HMMA.884.F32.F32.STEP3 R102, R184.COL, R206.COL, R102;
       HMMA.884.F32.F32.STEP0 R0, R176.reuse.COL, R192.reuse.COL, R0;

You can see that each "HMMA" instruction uses pairs of registers (e.g. R104 and R105 when it refers to R104) for both inputs and outputs. In FP32 mode, there are 4 instructions for a given set of inputs (only difference is the "STEPx" number and the accumulation register); while in FP16 mode, there are only 2 steps/instructions. So with a 32-wide warp and 32-bit registers, that's 32x2 = 128xFP16 or 64xFP32. Over the 2 or 4 steps, that's 256 outputs (i.e. enough for a 16x16 matrix).

So it's a little bit more complicated than each instruction just doing a 4x4x4 matrix multiply-accumulate, because the "swizzling" of the operands (i.e. what part of the input matrix is in which thread) is "interesting". I was going to write a python script inside cuda-gdb to automate the process of figuring it out, but haven't had the time, so this is what I tentatively (maybe incorrectly) determined so far (for a 16x16x16 matrix multiply using the WMMA API):

https://www.dropbox.com/s/upj4jz618jkadbo/summary.xlsx?dl=0

Sorry it's not very clear and I haven't had the time to clean it up better... each of the grid colours is one of the "STEPs" i.e. one of the accumulation register pairs. So the light grey is for STEP0. And I've written in the inputs present for Step0. I've used "X" for Matrix A and "Y" for Matrix B a bit confusingly; while "A" and "B" refer to the 1st or 2nd channel of the register, sorry that doesn't make as much sense now that I look at it again...! For example, "X1B 2 | X1B 10" means "this element of the 1st matrix is present in *both* the 2nd channel of the 2nd register of Thread 2, and in the 2nd channel of the 2nd register of Thread 10".

So... yeah... you've read that right: the same input data is present in *TWO* separate locations in the register file, i.e. it is written TWICE to the register file (and shared memory) by the load/store instructions, and read TWICE by the HMMA instructions. I have verified this to be true for both the WMMA API *and* the cuBLAS kernels via cuda gdb.

I've got a number of theories as to why (e.g. the execution units inside a "sub-core" might be split as 4 x 4-wide ALUs with their own register file in order to improve locality / reduce the wiring distance, and they wanted to keep the data inside each subset of the register file rather than swizzling across the entire SM every instruction... just one of several possibilities), but either way, it makes everything a bit complicated...

It's not 100% clear to me whether the instruction set supports "simple" 4x4x4 matrix multiplies without those "steps"; if you look just at the instruction set, it looks like there's enough register file bandwidth to support it... but actually, there's a catch: NVIDIA's Visual Profiler clearly shows stalls for HMMA instructions that don't reuse either of the input registers of the previous HMMA instruction; e.g. above, there will be a stall for the final instruction as neither R176 nor R192 is being reused from the previous instruction (it's the previous instruction that sets whether it needs to be kept in the register cache). This implies there's only enough register bandwidth to read 2 register pairs per instruction (i.e. 4x32-bit per thread, or 128x32-bit per warp) and one is already used by the accumulation input...
e.g.: https://www.dropbox.com/s/ljpgqr909u25ket/Inst-Issued-FP16-vs-FP32.png?dl=0

Perhaps it would be possible to do a 4x4x4 matrix multiply at full speed if there was no accumulation step... I'm considering whether that could be useful for raytracing, where FP16 might be enough precision for parts of the Bounding Volume Hierarchy?

---

There's a bunch of other data I have which is somewhat interesting, including power consumption, but I think the biggest question is... what's NVIDIA going to do next? (in terms of tensor cores - not overall architecture or memory bandwidth, which is a huge question mark!)

My guess: change tensor cores completely to be per-SM (rather than per sub-core) and handle larger 16x16x16 matrices per instruction. That per-SM unit would read directly from shared memory instead of the register file (although it would probably still be able to read/write the latter - hard to say). That would massively reduce the amount of register file and shared memory read/writes and move the amount of data movement closer to a traditional fixed-function architecture. The downside is that they couldn't be used for smaller 4x4 matrices anymore, but the fact it seems like the HW *might* maybe support 4x4 but CUDA doesn't expose it also hints that this is the direction they're going in...

---

BTW, I got some interesting results in terms of power consumption as well. NVIDIA definitely optimises their ALUs to take very little power when e.g. multiplying by 0.0 or 1.0, which makes it possible to separate the power consumption of the ALUs vs the power consumption of everything else. It varied a lot; in the case of the tensor cores, the ALUs are typically a higher % of the total than for normal SGEMM, but it was still below 50% in my simple test... There's probably not much NVIDIA can do to significantly improve the power efficiency of the ALUs themselves, so there's a limit to what perf/watt improvement they could get on the same process, but enough of the power consumption is outside the ALUs that it's theoretically possible to greatly improve perf/watt for tensors!
 