Spinoff - NVIDIA dispatch concurrency

Ext3h

The way I read this it seems that while the workloads are executed concurrently, they are still not dispatched concurrently (unlike on some other architectures).
They are. I was initially under the same assumption, but then an Nvidia employee told me that in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities the capability "Maximum number of resident grids per device (Concurrent Kernel Execution)" also applies to the number of different kernels that can be dispatched to a single SM (within the limits of "Maximum number of resident blocks per SM"). It's just that any program that depends on concurrent kernel execution for correctness is by definition malformed, as the dispatch order between different kernel launches is completely undefined and the overlap may simply not happen for any undisclosed reason (mostly: I don't believe kernels spanning different command buffers ever end up overlapping...).

How likely is that? That's another can of worms, and the most probable answer would be: once in a blue moon. Because you'd need to get lucky and end up with blocks from two different (but not conflicting) kernels on the same SM partition in the first place. Which is neither likely nor preferred by the coarse scheduler, because it causes all sorts of issues with instruction caches, L1 hit rate, etc.
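
To make that concrete, here is a rough CUDA sketch (kernel names, sizes and the work inside them are invented for illustration) of the only pattern where such an overlap is even allowed: two independent kernels launched into separate streams. Nothing in it guarantees that they actually overlap, let alone that their blocks ever share an SM partition:

// Sketch only: two independent kernels in separate streams. The runtime is
// allowed to run them concurrently, but correctness must never depend on it.
#include <cuda_runtime.h>

__global__ void kernelA(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;   // arbitrary FP32 work
}

__global__ void kernelB(float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = sqrtf(fabsf(y[i]));   // arbitrary, independent work
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));      // initialization omitted
    cudaMalloc(&y, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Launches in different streams have no ordering between them, so the
    // coarse scheduler *may* interleave their blocks -- or may not, for any
    // undisclosed reason.
    kernelA<<<n / 256, 256, 0, s0>>>(x, n);
    kernelB<<<n / 256, 256, 0, s1>>>(y, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(x);
    cudaFree(y);
    return 0;
}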

NVIDIA definitely runs tensor and fp32 ops concurrently, especially now with their tensor cores busy almost 100% of the time (doing upscaling, frame generation, denoising, HDR post processing, and in the future neural rendering).
You are mixing up workloads here. Neural textures, for example, that are embedded in the fragment shader (or, far more likely, in an uber shader for deferred shading) have a good chance of dispatching instructions to multiple core types concurrently. Especially so if those instructions happen to be interleaved in the same execution flow without barriers in between (see the sketch below). If the instructions within a single kernel are batched too strictly by type, it only happens by chance, when a stall pushes two warps far enough out of sync for the overlap to occur naturally. And it pretty much never happens if the different types of workload end up belonging to different kernels...

So none of upscaling, frame generation or HDR post-processing is likely to run concurrently. And for the whole neural-network-related stuff, the only thing that's realistically running in parallel is the activation functions alongside the tensor ops.
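
For illustration, here is a rough sketch (CUDA, using the public wmma API, compiled for sm_70 or newer; the kernel name, tile layout and the FP32 side-work are invented) of that "interleaved in the same execution flow" case, i.e. the only pattern where a warp scheduler even gets the opportunity to feed the tensor and FP32 pipes from the same instruction stream. Whether the pipelines actually end up overlapping is still entirely up to the hardware:

// Sketch only: Tensor Core MMAs (wmma) interleaved with independent FP32 FMAs
// in one kernel, launched with a single warp per block.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void mixedTensorFp32(const half* A, const half* B, half* C,
                                const float* fIn, float* fOut, int nTiles)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> acc;
    wmma::fill_fragment(acc, __float2half(0.0f));

    float f = fIn[threadIdx.x];              // per-lane FP32 state

    for (int t = 0; t < nTiles; ++t) {
        // Tensor pipe: one 16x16x16 MMA per tile (tiles stored contiguously).
        wmma::load_matrix_sync(aFrag, A + t * 16 * 16, 16);
        wmma::load_matrix_sync(bFrag, B + t * 16 * 16, 16);
        wmma::mma_sync(acc, aFrag, bFrag, acc);

        // FP32 pipe: independent scalar work in the same execution flow, with
        // no barrier in between. This only makes overlap possible; it does
        // not force it.
        f = fmaf(f, 1.0009765625f, 0.5f);
        f = fmaf(f, 0.9990234375f, -0.5f);
    }

    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
    fOut[threadIdx.x] = f;
}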
 
They are. I was initially under the same assumption, but then an Nvidia employee told me that in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities the capability "Maximum number of resident grids per device (Concurrent Kernel Execution)" also applies to the number of different kernels that can be dispatched to a single SM (within the limits of "Maximum number of resident blocks per SM"). It's just that any program that depends on concurrent kernel execution for correctness is by definition malformed, as the dispatch order between different kernel launches is completely undefined and the overlap may simply not happen for any undisclosed reason (mostly: I don't believe kernels spanning different command buffers ever end up overlapping...).

How likely is that? That's another can of worms, and the most probable answer would be: once in a blue moon. Because you'd need to get lucky and end up with blocks from two different (but not conflicting) kernels on the same SM partition in the first place. Which is neither likely nor preferred by the coarse scheduler, because it causes all sorts of issues with instruction caches, L1 hit rate, etc.

Isn’t that a different topic? I was referring to the ability of a single scheduler (SM partition) to dispatch multiple instructions per clock. As far as I understand, Blackwell, just like previous architectures, only does a single one. This appears to be corroborated by what Nvidia employees say. We have some other architectures (Apple, possibly Intel) that can issue two instructions of different types per partition, which means that in a mixed workload instructions can often be issued as quickly as the backend can process them. I am not aware of any claims that Nvidia can do the same.

Regarding different kernels running concurrently on the same SM, which your post seems to be about — that certainly happens ALL the time. Interleaving instructions from as many programs as possible is the primary way to hide latency and achieve good utilization.
 
The way I read this it seems that while the workloads are executed concurrently, they are still not dispatched concurrently (unlike on some other architectures). So some pipes will be underutilized.
Yes, Nvidia's Warp Schedulers (SMSP) dispatch one instruction per cycle. And given the throughput of all the available pipelines it means they can never be fully utilised at the same time. There isn't the register bandwidth to achieve that, either. But (depending on the generation) it's usually only the float pipeline which can execute 1 instruction/cycle. If a HMMA.16816.F16 (f16 tensor core MMA with f16 accumulator) takes 16 cycles, you might be able to issue 13 or 14 FMAs in that time. Not full utilisation, but quite close.
 
Yes, Nvidia's Warp Schedulers (SMSP) dispatch one instruction per cycle. And given the throughput of all the available pipelines it means they can never be fully utilised at the same time. There isn't the register bandwidth to achieve that, either. But (depending on the generation) it's usually only the float pipeline which can execute 1 instruction/cycle. If a HMMA.16816.F16 (f16 tensor core MMA with f16 accumulator) takes 16 cycles, you might be able to issue 13 or 14 FMAs in that time. Not full utilisation, but quite close.

Are you saying that the tensor cores are not pipelined? It seems to me that you’d need to launch much more than one HMMA instruction per 16 cycles to achieve the declared throughput.
 
Why do you think that? Each TC instruction does a lot of work. HMMA.16816.F16 performs 16*8*16=2048 multiply-accumulate operations, or 4096 flops.

Take GB202: it has a boost clock of 2407 MHz, 680 Tensor Cores (4 per SM, one per Warp Scheduler/SMSP), and 419 peak dense FP16 Tensor TFLOPS with FP16 accumulate.
419T / 2407M / 680 ≈ 256, so each TC can do 256 FP16 flops/cycle. And since one HMMA.16816.F16 is 4096 flops, 4096 / 256 = 16, i.e. HMMA.16816.F16 has a throughput of 1 instruction per 16 cycles.
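
For reference, the same arithmetic as a trivial host-side C++ check (the clock, TC count and TFLOPS figures are the ones assumed above, not measurements):

// Back-of-the-envelope check of the GB202 numbers quoted above.
#include <cstdio>

int main() {
    const double peakFp16Flops = 419e12;   // dense FP16 Tensor flops/s, FP16 accumulate
    const double boostClockHz  = 2407e6;   // 2407 MHz
    const int    tensorCores   = 680;      // 4 per SM, one per SMSP

    const double flopsPerTcPerCycle =
        peakFp16Flops / boostClockHz / tensorCores;               // ~256

    // HMMA.16816.F16: 16*8*16 = 2048 MACs = 4096 flops per instruction.
    const double hmmaFlops     = 16.0 * 8.0 * 16.0 * 2.0;
    const double cyclesPerHmma = hmmaFlops / flopsPerTcPerCycle;  // ~16

    printf("flops per TC per cycle: %.1f\n", flopsPerTcPerCycle);
    printf("cycles per HMMA:        %.1f\n", cyclesPerHmma);
    return 0;
}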

I'm pretty sure it's pipelined, in the sense that fetching the input registers, doing the matmul, and writing the accumulator registers each happen in separate phases that can overlap. But you still can't (and don't need to) issue more than one per 16 cycles (on average).
 
Thank you! I am trying to think how this works in practice and what the potential consequences are for executing other instructions. From what I understand, the register file is not wide enough to fetch the entire matrix at once; one likely needs 4 reads. So those are additional slots where other instructions cannot be issued, right? By pipelined, what I meant specifically is the ability to dispatch multiple such instructions back to back. We know that this is possible for FP instructions, for example: you can launch the next FMA while the previous one is still executing. From what you are saying, this doesn’t appear to be the case with the HMMA instructions. I wonder whether it is implemented as multiple smaller, partially concurrent instructions in hardware, or if it is something else.
 
HMMA.16816.F16 needs to read 4 registers for A, 2 registers for B, 2 registers for C, and writes 2 registers for D. So it reads a total of 8 registers and writes 2.

Registers are distributed over banks (Maxwell had 4 banks; not sure whether that still holds on current architectures), with consecutive registers in different banks. If each bank allows one read and one write per cycle, then collecting the source operands might take two or three cycles (A is guaranteed to be spread over all banks, but B and C could conflict). Writing the outputs might be one cycle, which can be overlapped with reads for a different instruction.

HMMA is almost certainly pipelined in the sense that you can have one instruction start to fetch operands while another is doing the matmul while another is writing its outputs. But it still takes 16 cycles in the matmul part, so there is no point in issuing more instructions.
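
Here is a toy model of that operand-collection argument (host-side C++), assuming the Maxwell-style layout of 4 banks with consecutive registers striped across them and one read port per bank per cycle; the register numbers are hypothetical, and the real Blackwell banking is not publicly documented:

// Toy model: how many cycles the 8 source-register reads of one HMMA take,
// given 4 banks (bank = register index % 4) with one read port each.
#include <cstdio>
#include <algorithm>

int readCycles(const int* regs, int count) {
    const int kBanks = 4;
    int perBank[kBanks] = {0};
    for (int i = 0; i < count; ++i)
        perBank[regs[i] % kBanks]++;          // one read port per bank ...
    return *std::max_element(perBank, perBank + kBanks);  // ... so the busiest bank sets the pace
}

int main() {
    // Hypothetical operand registers: A = R8..R11 (spans all 4 banks),
    // B = R12..R13, with C = R14..R15 (spread out) vs. C = R16..R17
    // (colliding with the banks already used by A and B).
    const int spread[8]    = { 8, 9, 10, 11, 12, 13, 14, 15 };
    const int colliding[8] = { 8, 9, 10, 11, 12, 13, 16, 17 };

    printf("well-spread operands: %d cycles of reads\n", readCycles(spread, 8));
    printf("colliding operands:   %d cycles of reads\n", readCycles(colliding, 8));
    return 0;
}

With those assumptions, the well-spread case comes out at 2 cycles of reads and the colliding one at 3, matching the "two or three cycles" estimate above.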
 