Ext3h
They are. I was initially under the same assumption, but then an Nvidia employee told me that within https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities the capability "Maximum number of resident grids per device (Concurrent Kernel Execution)" also applies to the number of different kernels that can be dispatched to a single SM (within the limits of "Maximum number of resident blocks per SM").

It's just that any program that depends on concurrent kernel execution for correctness is by definition malformed, as the dispatch order between different kernel launches is completely undefined, and overlap may simply not happen for any undisclosed reason (mostly: I don't believe kernels spanning different command buffers ever end up overlapping...). The way I read this, it seems that while the workloads are executed concurrently, they are still not dispatched concurrently (unlike on some other architectures).
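To make the "must not depend on concurrent execution for correctness" point concrete, here is a minimal sketch (hypothetical kernels, my own naming) of two independent kernels launched into separate CUDA streams. Separate streams only *permit* overlap; whether the two grids ever run concurrently, let alone co-resident on the same SM, is entirely up to the hardware scheduler:

```cuda
#include <cuda_runtime.h>

// Two trivial, mutually independent kernels (illustrative only).
__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

__global__ void kernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Different streams remove the ordering *requirement* between the
    // launches. They do NOT guarantee overlap: the scheduler may still
    // serialize them completely, for undisclosed reasons.
    kernelA<<<(n + 255) / 256, 256, 0, s0>>>(x, n);
    kernelB<<<(n + 255) / 256, 256, 0, s1>>>(y, n);

    // Correctness may only rely on these synchronization points,
    // never on the two kernels having run concurrently.
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```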
How likely is that? That's another can of worms, and the most probable answer is: once in a blue moon. You'd need to get lucky and end up with blocks from two different (but not conflicting) kernels on the same SM partition in the first place. Which is neither likely, nor preferred by the coarse scheduler, because it causes all sorts of issues with instruction caches, L1 hit rate etc.
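The relevant per-device limits can be queried at runtime. A short sketch (standard `cudaDeviceProp` fields): note that `concurrentKernels` only states whether the *device* may overlap kernels at all; it makes no promise about blocks of different kernels ever sharing an SM.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Device-level capability for concurrent kernel execution.
    printf("concurrentKernels:           %d\n", prop.concurrentKernels);
    // How many SMs the coarse scheduler distributes blocks across.
    printf("multiProcessorCount:         %d\n", prop.multiProcessorCount);
    // The "Maximum number of resident blocks per SM" limit from the
    // compute-capability table.
    printf("maxBlocksPerMultiProcessor:  %d\n", prop.maxBlocksPerMultiProcessor);
    printf("maxThreadsPerMultiProcessor: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```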
You are mixing up workloads here. Neural textures, e.g., that are embedded in the fragment shader (or, far more likely, in an uber shader for deferred shading) have a good chance of dispatching instructions to multiple core types concurrently. Especially so if those happen to be interleaved in the same execution flow without barriers in between. It happens only by chance if a stall is required to get two warps far enough out of sync, in case instructions are batched too heavily by type within a single kernel. And pretty much never if the different types of workload end up belonging to different kernels...

As for the claim that "NVIDIA definitely runs tensor and fp32 ops concurrently, especially now with their tensor cores busy almost 100% of the time (doing upscaling, frame generation, denoising, HDR post processing, and in the future neural rendering)":
So none of upscaling, frame generation, or HDR post-processing are likely to run concurrently. And for the whole neural-network-related stuff, the only thing that's realistically running in parallel are the activation functions together with the tensor ops.
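That kind of overlap looks roughly like the following sketch (hypothetical fused layer, fixed 16x16x16 tile, using the standard WMMA API): the tensor-core MMA and the fp32 activation sit back-to-back in the same kernel's instruction stream, so while one warp waits on its `mma_sync`, another warp's ReLU can keep the fp32 ALUs busy.

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile of C = ReLU(A * B).
// Tensor-core work (mma_sync) and fp32 work (fmaxf) are interleaved
// within the same kernel -- this is the overlap that can realistically
// happen, as opposed to overlap across separate kernels.
__global__ void fused_layer(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);
    wmma::load_matrix_sync(b, B, 16);

    // Issued to the tensor cores...
    wmma::mma_sync(acc, a, b, acc);

    // ...immediately followed by the activation on the fp32 ALUs.
    for (int i = 0; i < acc.num_elements; ++i)
        acc.x[i] = fmaxf(acc.x[i], 0.0f);  // ReLU

    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```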