I don't understand what you're saying but I'm pretty sure that divergence will be an issue no matter the workload. So smaller wavefronts = better.
1024 work items running on hardware with a hardware thread size of 32 means 32 hardware threads are required to support that workgroup.
In the limit: if a single ray follows a very long path, e.g. traversing more nodes of the BVH than all the other rays in the workgroup, then that ray "drags along" 31 others, assuming there's 1 ray per work item. So the entire hardware thread runs at the speed of the slowest ray. The 31 other hardware threads can run other code (e.g. the hit shader), provided there's an uber shader wrapping traversal and hit/miss.
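To make the drag-along cost concrete, here's a toy utilisation model. The cycle counts are invented for illustration, not measured on any real GPU:

```python
# Toy model of SIMD lane divergence: a hardware thread retires when its
# slowest lane finishes, i.e. after max(lane_cycles) cycles, so
# utilisation = useful cycles / (lane count * worst-case cycles).
def hw_thread_utilisation(lane_cycles):
    width = len(lane_cycles)
    worst = max(lane_cycles)
    return sum(lane_cycles) / (width * worst)

# 31 rays finish BVH traversal in 100 cycles; one pathological ray takes 1000.
lanes = [100] * 31 + [1000]
print(round(hw_thread_utilisation(lanes), 3))  # ~0.128
```

With one ray ten times slower than its 31 siblings, ALU utilisation for that hardware thread collapses to roughly 13%.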
In reality, of course, there'll be averaging. Some hardware threads might see a very narrow range of ALU execution cycles for BVH traversal, e.g. highly coherent rays, rays that travel a short distance, or rays that terminate on their first bounce. Other hardware threads will see a large range of ALU execution cycles, meaning many ALU cycles wasted to "finish off" the last one or two rays.
It appears AMD took the decision to use uber ray-trace shaders to combine traversal with miss/hit shaders, and then combined rays into workgroups that are larger than the hardware thread size.
Yes, smaller hardware threads are always better in terms of ALU utilisation. Intel, with 16-work-item hardware threads, should be rewarded with less wastage. The trade-off is that there'll be more hardware for:
- scheduling for each SIMD
- register file mechanics (banking, porting control)
- data paths (to and from the rest of the GPU)
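The utilisation side of that trade-off can be sketched numerically. The per-ray costs below are random stand-ins, but the bookkeeping shows why halving the hardware thread width can only reduce idle lanes, never increase them:

```python
import random

def wasted_cycles(lane_cycles, width):
    # Split rays into hardware threads of the given width; each thread
    # runs for max(group) cycles, and lanes that finished early sit idle.
    waste = 0
    for i in range(0, len(lane_cycles), width):
        group = lane_cycles[i:i + width]
        waste += max(group) * len(group) - sum(group)
    return waste

random.seed(0)
rays = [random.randint(50, 500) for _ in range(1024)]  # invented per-ray cost

# Splitting a 32-wide group into two 16-wide groups can't increase waste:
# 16*(m1 + m2) - S  <=  32*max(m1, m2) - S  for sub-group maxima m1, m2.
print(wasted_cycles(rays, 32), wasted_cycles(rays, 16))
```

The same rays always waste at least as many cycles at width 32 as at width 16; how big the gap is depends entirely on how divergent the workload is.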
Moving (possibly a lot of) compute state around to extract coherency can be very expensive. It doesn't automatically give you a win, and it can only help with execution divergence, not data divergence.
Cache is for data divergence.
Execution divergence can be mitigated by keeping work on the same SIMD, but moving it to a different hardware thread. If you do that right, then pretty much all the work is in the register file and operand collector.
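A minimal sketch of that regrouping, using hypothetical per-work-item states after one traversal step. This is just the bookkeeping; on real hardware the shuffle would happen through the register file and operand collector, not through memory:

```python
from collections import defaultdict

def rebin_by_state(items):
    # Group work-item indices by their current execution state so each
    # hardware thread can be refilled with lanes on the same code path.
    bins = defaultdict(list)
    for state, idx in items:
        bins[state].append(idx)
    return dict(bins)

# Hypothetical states for 8 work items after one traversal step.
items = [("traverse", 0), ("hit", 1), ("traverse", 2), ("miss", 3),
         ("hit", 4), ("traverse", 5), ("miss", 6), ("hit", 7)]
print(rebin_by_state(items))
```

After re-binning, the "hit" lanes can be packed into one hardware thread and the "traverse" lanes into another, so neither path drags idle lanes through the other's instructions.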
But yes, I'm not going to pretend that this is easy. Nested control flow wrecks performance pretty much regardless (though BVH traversal is a kind of uniform nested loop). I'm still sceptical about it in the end.
Intel looks like it's mitigating the problem with a hardware thread size of 16.