When writing compute shaders, it’s often necessary to communicate values between threads. This is typically done via shared memory. Kepler GPUs introduced “shuffle” intrinsics, which allow threads of a warp to directly read each other's registers avoiding memory access and synchronization. Shared memory is relatively fast but instructions that operate without using memory of any kind are significantly faster still.
This article discusses those “warp shuffle” and “warp vote” intrinsics and how you can take advantage of them in your DirectX, OpenGL, and Vulkan applications in addition to CUDA. We also provide the ShuffleIntrinsicsVk sample which illustrates basic use cases of those intrinsics.