OpenGL guy
Veteran
"I believe GCN won't accept more than a 256 work group size. I suspect larger is illusory."

Demonstrably false. DirectCompute requires support for work group sizes of at least 1024 threads, for example.
This has been recommended on GPUs for ages. See the huge progress made with LuxRenderer for another example.

Jawed said:
A few months back I wrote about apocryphal VGPR allocations in code I'm working on. Since then I've cleaved my kernel in two, with the rationale that I can't avoid storing data off-die for possible run times of 1+ seconds, since there's no way I can construct a pure on-die pipeline (which would also use multiple kernels).
Running two kernels compartmentalises VGPR and shared memory allocation. This enables me to re-code the two halves without fighting their joint VGPR+LDS allocation, which ultimately leads to more performance.
The two halves are strongly asymmetric in their use of VGPRs and LDS. The first half has a giant cache in LDS and a moderate VGPR allocation; the second uses a small amount of LDS for 8-way work-item sharing with a huge VGPR allocation, including a large array in VGPRs. Luckily the second has very little latency to hide (and huge arithmetic intensity), so 3 hardware threads isn't really a problem. The first kernel is LDS-limited to 5 hardware threads, an improvement from the 3 hardware threads of the uber kernel, which is where the performance gain came from as far as I can tell.
And I now have the freedom to re-work the first kernel, since it's no longer bound 1:1 to the logic and in-register capacity constraints of the second kernel; e.g. I can run half the instances of kernel 1 for each instance of kernel 2, reaping more performance from the huge cache.
It helps that the global memory access latency incurred by writing and reading data across the split is hidden, with only about 14 GB/s of bandwidth used. I now have the opportunity to re-code each half, which means I'll get substantially more performance than the original "uber kernel" (rather than the 3% gain I got from doing the split).
Overall, it seems to me there's a lot of mileage in atomising kernels.