Warp/wavefront SIMT in DirectCompute?
Since I got my DX11 card, I started to do experiments with DirectCompute. One thing I still haven't figured out properly is the mapping of CUDA warp (and AMD wavefront) to compute shader semantics.
In a single CUDA warp it's guaranteed that all instructions are executed in SIMT manner (single instruction multiple threads). This means that all instructions are implicitly data/code barriers inside the warp. All threads inside the same warp are executing the same instruction at the same time. No thread can advance to the next instruction until all other threads inside warp have executed the current instruction. This property helps a lot with thread synchronization inside a warp.
I haven't noticed DirectCompute documentation mentioning anything about SIMT or any instructions being executed in lockstep. Is adding GroupMemoryBarrierWithGroupSync() after every instruction the only way to guarantee SIMT-style execution of threads in DirectCompute? This seems really inefficient as I would only want to synchronize the execution inside a single warp/wavefront, not inside the whole group/block.
rpg.314
02-Dec-2009, 13:31
In a single CUDA warp it's guaranteed that all instructions are executed in SIMT manner (single instruction multiple threads). This means that all instructions are implicitly data/code barriers inside the warp. All threads inside the same warp are executing the same instruction at the same time. No thread can advance to the next instruction until all other threads inside warp have executed the current instruction. This property helps a lot with thread synchronization inside a warp.
This is incorrect. The GroupMemoryBarrierWithGroupSync() you speak of is just the dxcs version of __syncthreads(). Everything else is just like it is in cuda.
This is incorrect. The GroupMemoryBarrierWithGroupSync() you speak of is just the dxcs version of __syncthreads(). Everything else is just like it is in cuda.
Yes, I know that both of those synchronization primitives are the same (a barrier). However, in CUDA I don't need to add a manual barrier in my code if all the threads are in the same warp, as it's guaranteed that all threads inside a single warp are executing the same instruction (they proceed in lockstep). Is there any way to instruct a compute shader to do the same thing?
rpg.314
02-Dec-2009, 14:32
Ok. long answer then.
In a single CUDA warp it's guaranteed that all instructions are executed in SIMT manner (single instruction multiple threads).
This is kinda vague/wrong. First of all, warp is a concept which you worry about for efficiency and not for correctness. And secondly, when a warp diverges, this "property" you speak of does not hold at all.
This means that all instructions are implicitly data/code barriers inside the warp. All threads inside the same warp are executing the same instruction at the same time.
Again, this does not hold for divergent warps, so it doesn't hold in general, ergo, forget this line of logic.
No thread can advance to the next instruction until all other threads inside warp have executed the current instruction. This property helps a lot with thread synchronization inside a warp.
Again, think of divergent warps. As for intra-warp synchronization, I am really not sure it helps with anything.
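To make the divergence point concrete, here is a minimal CPU-side sketch in Python of how one warp might handle an if/else. It is only a model, not real GPU code: the function name execute_branch is hypothetical, and the order in which the two sides run is an assumption (real hardware makes its own choice). What it shows is that lanes taking different sides are run in separate passes, so they are not executing the same instruction at the same time.

```python
def execute_branch(lane_inputs, predicate):
    """Model one warp reaching an if/else.

    Returns the serialized passes as (side, active_lanes) tuples.
    Assumption: the "then" side runs first; real hardware may differ.
    """
    taken = [bool(predicate(x)) for x in lane_inputs]
    passes = []
    for label, side in (("then", True), ("else", False)):
        # Only lanes whose predicate matches this side are active in this pass.
        active = [lane for lane, t in enumerate(taken) if t == side]
        if active:
            passes.append((label, active))
    return passes
```

A uniform branch gives a single pass with every lane active. A divergent one gives two passes with disjoint lane sets, which is the case where the lockstep assumption stops holding across the whole warp.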
I haven't noticed DirectCompute documentation mentioning anything about SIMT or any instructions being executed in lockstep.
It is not mentioned to avoid hw/ihv specific hacks at the API level, otherwise the execution is still just like it is in cuda for nv gpu's and like brook+ for amd gpu's
Is adding GroupMemoryBarrierWithGroupSync() after every instruction the only way to guarantee SIMT-style execution of threads in DirectCompute? This seems really inefficient as I would only want to synchronize the execution inside a single warp/wavefront, not inside the whole group/block.
Since there is no concept of warp/wavefront in dxcs, what you are asking for cannot be done, in the standard API at least.
You could try running the group size as the warp size to do what you are looking for. :roll:
NVIDIA OpenCL programming guide states (chapter 3.4: Warp-Level Synchronization):
"Because a warp executes one common instruction at a time, threads within a warp are implicitly synchronized and this can be used to omit calls to the barrier() function for better performance."
Also if you read the following paper of efficient CUDA implementation of segmented scan, you'll notice that they use warp-level synchronization a lot:
http://mgarland.org/files/papers/nvr-2008-003.pdf
From the article - Chapter 3.1: Intra-Warp Scan Algorithm:
"We begin by defining a routine to perform a scan over a warp of 32 threads, shown in Figure 3. It uses precisely the same algorithm as shown in Figure 2, but with a few basic optimizations. First, we take advantage of the synchronous execution of threads in a warp to eliminate the need for barriers."
This algorithm they describe is a part of CUDA Data Parallel Primitives (CUDPP) library. That library also has many other highly optimized algorithms using barrierless warp-level synchronization.
I just wanted to know if there is a way to do barrierless warp level synchronization in compute shaders as well. CUDA and OpenCL both support it (on NVIDIA hardware at least). Naturally you have to know the warp size of the hardware to exploit the warp level synchronization. Maybe I just need to test this with a simple DX11 shader. If I remember correctly HD5850 warp/wavefront size was 64 threads...
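For anyone who wants to see why the barrier can be dropped, here is a rough CPU-side model in Python of the intra-warp scan idea. It is not the code from the paper or from CUDPP. Both function names are hypothetical, and "lockstep" is modeled simply as "every lane reads before any lane writes" at each step. The second function is a deliberately bad schedule (each lane runs all of its steps before the next lane starts), to show what the lockstep guarantee is buying.

```python
def warp_scan_lockstep(values):
    """Inclusive prefix sum over one warp, modeled as lockstep execution."""
    data = list(values)
    n = len(data)
    offset = 1
    while offset < n:
        # Lockstep: all lanes perform their read before any lane writes.
        reads = [data[lane - offset] if lane >= offset else 0 for lane in range(n)]
        for lane in range(n):
            if lane >= offset:
                data[lane] += reads[lane]
        offset *= 2
    return data


def warp_scan_out_of_step(values):
    """Same per-lane instructions, but lanes are NOT in lockstep.

    Assumed schedule: lane 0 runs to completion, then lane 1, and so on.
    Later lanes then read values that earlier lanes have already updated.
    """
    data = list(values)
    n = len(data)
    for lane in range(n):
        offset = 1
        while offset < n:
            if lane >= offset:
                data[lane] += data[lane - offset]
            offset *= 2
    return data
```

With eight ones as input, the lockstep version gives 1 through 8, while in the out-of-step version lane 2 already ends up with 4 instead of 3. That is the hazard a barrier (or guaranteed lockstep) protects against.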
rpg.314
02-Dec-2009, 14:46
NVIDIA OpenCL programming guide states (chapter 3.4: Warp-Level Synchronization):
"Because a warp executes one common instruction at a time, threads within a warp are implicitly synchronized and this can be used to omit calls to the barrier() function for better performance."
For divergent warps, there are no common instructions.
Intra-warp sync is performed in hw for you, you don't have to do anything funny to get it done.
As long as the threadID/workgroupID match up in the cuda/ocl/dxcs world, the result would be the same. The underlying hw is the same thing after all.
If I remember correctly HD5850 warp/wavefront size was 64 threads...
Yes.
Also if you read the following paper of efficient CUDA implementation of segmented scan, you'll notice that they use warp-level synchronization a lot:
http://mgarland.org/files/papers/nvr-2008-003.pdf
Ok. it may be useful, but do they need any fancy sw tricks to do it? They don't.
Ok. it may be useful, but do they need any fancy sw tricks to do it? They don't.
It's really useful when you are squeezing the last bits of performance out of your CUDA application. And yes they do need to manually make sure to split the work correctly to the warps (32 threads to do the same intra scan), so that the (barrierless) warp-level synchronization holds.
It's an optimization technique officially supported by NVIDIA and used in their CUDA and OpenCL libraries. All I wanted to know is whether this technique is available in DirectCompute compute shaders as well. I know the basics (warp/wavefront is completely invisible to the programmer when performance is not concerned). No need to get so offensive there :)
rpg.314
02-Dec-2009, 15:51
It's an optimization technique officially supported by NVIDIA and used in their CUDA and OpenCL libraries. All I wanted to know is whether this technique is available in DirectCompute compute shaders as well.
Warp is a micro-architectural feature of nv gpu's, and hence exposed only in cuda and only in nv's ocl drivers. You won't find warp mentioned in amd's gpu or larrabee documentation. Though there will likely be vendor-specific extensions to ocl for that. MS has taken the stand of forbidding extensions to the dx spec, so you won't see this concept exposed in dx, unless intel/amd/nv can standardize on "warp size" or its equivalents.
I know the basics (warp/wavefront is completely invisible to the programmer when performance is not concerned). No need to get so offensive there :)
I am sorry if I got/appeared offensive there.
Warp is a micro-architectural feature of nv gpu's, and hence exposed only in cuda and only in nv's ocl drivers. You won't find warp mentioned in amd's gpu or larrabee documentation. Though there will likely be vendor-specific extensions to ocl for that. MS has taken the stand of forbidding extensions to the dx spec, so you won't see this concept exposed in dx, unless intel/amd/nv can standardize on "warp size" or its equivalents.
Yes, AMD uses the term wavefront instead of warp. According to AMD Stream Computing documentation (http://developer.amd.com/gpu_assets/Stream_Computing_Overview.pdf):
"Wavefronts are hardware threads that execute N number of threads in parallel, where N is specific to the hardware chip (for example, on the ATI Radeon HD4870 it is 64). A wavefront processes a single instruction over all of the threads at the same time."
Wavefront synchronization works the same way as warp synchronization. But different ATI hardware has different wavefront sizes (16, 32 and 64). So writing code that works correctly on all the ATI chips is pretty difficult (unless you only assume synchronization inside chunks of 16 threads -- luckily this configuration works on NVIDIA chips also -- but sadly it will most likely break when Intel gets its compute shader capable chips ready). So using wavefront/warp synchronization inside OpenCL/DirectCompute is not a good programming choice if the application needs to work on various hardware configurations.
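The portability problem can be sketched with a small CPU-side model in Python. Everything here is an assumption made for illustration: the function names are hypothetical, lockstep is modeled as "read everything, then write everything" inside one hardware wavefront, and wavefronts are assumed to run one after another to completion (just one possible schedule out of many). The barrier-free scan is written for segments of assumed_width lanes, while the modeled hardware only keeps hw_width lanes in lockstep.

```python
from itertools import accumulate


def expected_segment_scan(values, width):
    """Reference result: inclusive prefix sum restarted every `width` lanes."""
    out = []
    for start in range(0, len(values), width):
        out.extend(accumulate(values[start:start + width]))
    return out


def chunked_scan(values, assumed_width, hw_width):
    """Barrier-free segment scan on a modeled GPU.

    assumed_width: lockstep width the code was written for.
    hw_width: lockstep width of the modeled hardware.
    """
    data = list(values)
    n = len(data)
    # Assumed schedule: each hardware wavefront runs all its steps in turn.
    for chunk_start in range(0, n, hw_width):
        lanes = range(chunk_start, min(chunk_start + hw_width, n))
        offset = 1
        while offset < assumed_width:
            # Lockstep inside the wavefront: all reads, then all writes.
            reads = {}
            for lane in lanes:
                if lane % assumed_width >= offset:
                    reads[lane] = data[lane - offset]
            for lane, value in reads.items():
                data[lane] += value
            offset *= 2
    return data
```

In this model, code written for 32 lanes gives the right answer when hw_width is 32 or 64 but not when it is 16, while code written for 16 lanes is correct on all three. That matches the "only assume chunks of 16" compromise, and it also shows why that compromise breaks as soon as some chip uses a smaller width.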