NVIDIA definitely runs tensor and fp32 ops concurrently, especially now that the tensor cores are busy almost 100% of the time (doing upscaling, frame generation, denoising, HDR post-processing, and in the future neural rendering).
The latest NVIDIA generations have become considerably better at mixing all three workloads (tensor + ray + fp32) concurrently. I read somewhere (I can't find the source now) that ray tracing + tensor is the most common concurrent pairing, followed by ray tracing + fp32, then tensor + fp32.
**Concurrent execution of CUDA and Tensor cores** (forums.developer.nvidia.com):
> Yes, that is what it means. I don't know where you got that. If the compiler did not schedule tensor core instructions along with other instructions, what else would it be doing? NOP? Empty space? Maybe you are mixing up what the compiler does and what the warp scheduler does. The warp...
**I need help understanding how concurrency of CUDA Cores and Tensor Cores works between Turing and Ampere/Ada?** (forums.developer.nvidia.com):
> There isn't much difference between Turing, Ampere and Ada in this area. This question in various forms comes up from time to time, here is a recent thread. It's also necessary to have a basic understanding of how instructions are issued and how work is scheduled in CUDA GPUs, unit 3 of this...
The way I read this, it seems that while the workloads are executed concurrently, they are still not dispatched concurrently (unlike on some other architectures): each SM warp scheduler issues at most one instruction per cycle, so in any given cycle a scheduler feeds either the tensor pipe or the fp32 pipe, never both. So some pipes will be underutilized.
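To make that concrete, here is a minimal CUDA sketch; this is my own illustration, not code from either thread, and the kernel name, tile size, and loop count are arbitrary. One warp per block issues tensor-core MMA instructions via the WMMA API while the remaining warps issue plain fp32 FMAs. A given warp scheduler never dispatches an HMMA and an FFMA in the same cycle, but once issued, the two instruction streams execute concurrently in their separate pipelines.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Sketch: warp 0 of each block feeds the tensor cores (HMMA instructions),
// the other warps feed the fp32 pipe (FFMA instructions). Requires sm_70+.
__global__ void mixed_pipes(const half* a, const half* b, float* c,
                            const float* x, float* y, int n)
{
    int warp = threadIdx.x / 32;
    if (warp == 0) {
        // The whole warp cooperates on one 16x16x16 MMA tile -> tensor pipe.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
        wmma::fill_fragment(fc, 0.0f);
        wmma::load_matrix_sync(fa, a, 16);
        wmma::load_matrix_sync(fb, b, 16);
        wmma::mma_sync(fc, fa, fb, fc);
        wmma::store_matrix_sync(c + blockIdx.x * 16 * 16, fc, 16,
                                wmma::mem_row_major);
    } else {
        // Remaining warps keep the fp32 pipe busy with a chain of FMAs
        // (elements covered by warp 0's lanes are simply skipped here).
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i];
            #pragma unroll
            for (int k = 0; k < 64; ++k)
                v = fmaf(v, 1.0009765625f, 0.5f);  // FFMA
            y[i] = v;
        }
    }
    // The SM's warp schedulers interleave issue of the two instruction
    // streams cycle by cycle; the HMMA and FFMA pipelines then run in
    // parallel.
}
```

Profiling a kernel like this should show both the tensor and FMA pipes active within the same launch, which matches the "executed concurrently, but dispatched one instruction per cycle per scheduler" reading above.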