Each wave (64 threads) has its own set of scalar registers (and vector registers). Each wave loads its scalars (including resource descriptors) from memory separately (usually at the beginning of the shader code).
Are these registers not wiped/unreliable between shader invocations? I.e., would that only be useful for sampling the same texture multiple times in one shader?
The registers of each wave are wiped once the wave has finished executing. The GPU can be executing multiple waves from different shaders at the same time. The resource descriptors (just like any other data) are cached by the GPU's L1 and L2 caches. If a resource descriptor is in the CU's scalar L1 cache when a wave tries to load it into a scalar register (no matter whether the wave belongs to the same draw call as the previous one or to a different one), it will be loaded into the register very quickly. If that resource descriptor wasn't recently loaded by any wave on the same CU, it will likely be found in the GPU L2 cache (or, failing that, fetched from memory).
So basically it doesn't matter how many draw calls you have (or whether the draw calls are big or small). If the next draw call uses the same resource descriptors, the descriptors will likely still be in the L1 caches (of each CU), and the waves of the next draw call will load them from there without any stalls (just like the later waves of the current draw call do).
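To make that concrete, here is a rough toy model in Python. It is only a sketch of the lookup order described above (per-CU scalar L1, then shared L2, then memory), not how any real hardware is implemented. The class names, the address `0x1000`, and the cost numbers are all made up for illustration, and the model assumes nothing is ever evicted from the caches.

```python
# Hypothetical costs in arbitrary units; only the ordering L1 < L2 < memory matters.
L1_COST, L2_COST, MEM_COST = 1, 10, 100


class ComputeUnit:
    def __init__(self):
        # Addresses currently resident in this CU's scalar L1 (no eviction modeled).
        self.scalar_l1 = set()


class ToyGpu:
    def __init__(self, num_cus):
        self.cus = [ComputeUnit() for _ in range(num_cus)]
        self.l2 = set()  # shared L2, also without eviction

    def scalar_load(self, cu_index, address):
        """Cost of one wave loading a descriptor into its (initially empty) registers."""
        cu = self.cus[cu_index]
        if address in cu.scalar_l1:
            return L1_COST
        cost = L2_COST if address in self.l2 else MEM_COST
        # A miss fills both cache levels on the way back.
        self.l2.add(address)
        cu.scalar_l1.add(address)
        return cost

    def draw(self, descriptor_address, waves_per_cu):
        """One draw call: every wave reloads the descriptor, since registers do not persist."""
        costs = []
        for cu_index in range(len(self.cus)):
            for _ in range(waves_per_cu):
                costs.append(self.scalar_load(cu_index, descriptor_address))
        return costs


gpu = ToyGpu(num_cus=2)
first = gpu.draw(descriptor_address=0x1000, waves_per_cu=2)
second = gpu.draw(descriptor_address=0x1000, waves_per_cu=1)
print(first)   # [100, 1, 10, 1]
print(second)  # [1, 1]
```

In the first draw only the very first wave pays the memory cost, and the first wave on the other CU pays the L2 cost. Every wave of the second (tiny) draw call hits L1, even though no register contents survived between the draws.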
Constants also use scalar loads and the scalar L1 cache in the same way. If you use the same constants in the next draw call, the scalar loads (of each wave) will hit L1 and get the constants into registers quickly.
This kind of resource handling is very efficient even if the average batch size is small. The GPU doesn't need to set up lots of state before it can start executing a draw call.