Does Volta no longer need masking for threads in a warp?

https://www.anandtech.com/show/12170/nvidia-titan-v-preview-titanomachy/2


Hi all, I am curious how this works. I posted this same question on the AnandTech forum, but there were no replies.
Does the text below mean that, on Volta, while all 32 threads run in lockstep and some have different IF/ELSE results to execute, masking is no longer needed?

Finally, and admittedly getting into the even more esoteric aspects of GPU design, NVIDIA has reworked how SIMT works for Volta. The individual CUDA cores within a 32-thread warp now have a limited degree of autonomy; threads can now be synchronized at a fine-grain level, and while the SIMT paradigm is still alive and well, it means greater overall efficiency. Importantly, individual threads can now yield, and then be rescheduled together. This also means that a limited amount of scheduling hardware is back in NV’s GPUs.

[Image: Volta SIMT scheduling diagram (voltasimt_575px.png)]


https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
A downside of SIMT execution is the fact that thread-specific control-flow is performed using "masking", leading to poor utilization where a processor's threads follow different control-flow paths. For instance, to handle an IF-ELSE block where various threads of a processor execute different paths, all threads must actually process both paths (as all threads of a processor always execute in lock-step), but masking is used to disable and enable the various threads as appropriate. Masking is avoided when control flow is coherent for the threads of a processor, i.e. they all follow the same path of execution. The masking strategy is what distinguishes SIMT from ordinary SIMD, and has the benefit of inexpensive synchronization between the threads of a processor.
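The masking described above can be sketched in software like this (a hypothetical illustration, not real GPU code; the 4-lane "warp", data values, and function name are made up for the example):

```python
# Software sketch of SIMT masking for an IF/ELSE block: a 4-lane "warp"
# steps through BOTH paths in lockstep, and a per-lane mask decides which
# lanes are allowed to write results on each path.

LANES = 4

def run_if_else(x):
    y = [0] * LANES
    cond = [xi > 0 for xi in x]           # per-lane branch condition
    if_mask = [c for c in cond]           # lanes taking the IF path
    else_mask = [not c for c in cond]     # lanes taking the ELSE path

    # IF path: issued for the whole warp, masked-off lanes do nothing
    for i in range(LANES):
        if if_mask[i]:
            y[i] = x[i] * 2

    # ELSE path: same lockstep issue, complementary mask
    for i in range(LANES):
        if else_mask[i]:
            y[i] = -x[i]
    return y

print(run_if_else([1, -2, 3, -4]))   # -> [2, 2, 6, 4]
```

Every lane "visits" both paths, which is exactly the utilization loss the quote describes when control flow diverges.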
 
The hardware is still SIMD. The difference with Volta is that the scheduler may decide to execute instructions from the IF or ELSE side in a mixed fashion, rather than having to execute all the way down one path first before going back to the other. They would be masked as appropriate when executing.
 
Thank you for replying. :)
I am a bit confused now: how does the scheduler do that?
It is all SIMD/SIMT, so I get the impression that the scheduler can only do this when the same instruction is present in both the IF block and the ELSE block.
Then it would be possible to mix instructions that are the same, and as many threads as possible in the warp could be grouped to run together.
Or am I seeing this all wrong?
 
I am a bit confused now: how does the scheduler do that?
A portion of this work is already done in standard SIMT branching. When lanes diverge at a conditional check, the hardware determines the valid mask for the path it chooses to execute immediately, and it stores the mask and instruction pointer for the other path so it can get back to it.
Prior to Volta, that stored information for the other path sat unused until the first path finished, but the way it can be used is the same as for the active path: fetch the instruction and apply the mask.
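That bookkeeping can be sketched roughly as follows (a hypothetical illustration of the idea above, not actual hardware behavior; the function name, lane count, and program-counter values are invented for the example):

```python
# At a divergent branch, the hardware keeps running one path immediately
# and saves the other path's {mask, program counter} to come back to.
# Pre-Volta, the saved entry waits until the active path reaches the
# reconvergence point; Volta may pick it up earlier and interleave.

def diverge(active_mask, cond_per_lane, if_pc, else_pc):
    """Split the active mask at a branch; return the path to run now
    and the saved {mask, pc} entry for the deferred path."""
    taken = [a and c for a, c in zip(active_mask, cond_per_lane)]
    not_taken = [a and not c for a, c in zip(active_mask, cond_per_lane)]
    run_now = (taken, if_pc)
    deferred = (not_taken, else_pc)   # stored; used later, or on Volta
                                      # possibly interleaved with run_now
    return run_now, deferred

active = [True] * 4
cond = [True, False, True, False]
(now_mask, now_pc), (later_mask, later_pc) = diverge(active, cond, 100, 200)
print(now_mask, now_pc)      # [True, False, True, False] 100
print(later_mask, later_pc)  # [False, True, False, True] 200
```

The two masks are complementary within the originally active lanes, which is why no lane can appear on both paths.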

It's simpler to only track one IP and mask set at a time, but in some ways interleaving is an easier task for the scheduler: in a divergent case, it is known that there is no interaction or dependence between the paths, since from each path's perspective the other path is masked off.

It is all SIMD/SIMT, so I get the impression that the scheduler can only do this when the same instruction is present in both the IF block and the ELSE block.
The instruction that reaches the execution stage is different every clock, even without branching. If the hardware already handles a different instruction every clock, there's not much difference if that different instruction happens to come from the IF or ELSE path.

Then it would be possible to mix instructions that are the same, and as many threads as possible in the warp could be grouped to run together.
Or am I seeing this all wrong?
The scheduler issues one instruction at a time. Per Nvidia's additional diagrams and information, the paths are treated as separate until they rejoin or execution hits explicit sync points in the code.
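The issue pattern being described can be sketched like this (a made-up illustration; the instruction names and the strict alternation policy are assumptions for the example, not Nvidia's actual scheduling policy):

```python
# The scheduler still issues ONE instruction per "clock", but on Volta it
# may alternate between the two divergent paths, each carrying its own
# mask, instead of draining one path completely before the other.

def issue_interleaved(if_path, else_path):
    """Interleave two instruction streams, one instruction per clock."""
    schedule = []
    ia = ib = 0
    while ia < len(if_path) or ib < len(else_path):
        if ia < len(if_path):
            schedule.append(("IF-mask", if_path[ia]))
            ia += 1
        if ib < len(else_path):
            schedule.append(("ELSE-mask", else_path[ib]))
            ib += 1
    return schedule

for mask, instr in issue_interleaved(["ADD", "MUL"], ["LD", "ST", "ADD"]):
    print(mask, instr)
```

Note that at every step only one instruction is in flight, under exactly one of the two masks; the "mixing" is purely in time, not in merging lanes from both paths into one issue.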
 