Originally Posted by Mintmaster
The same thing goes for shaders with texture heavy parts and math heavy parts. The overall ratio is all that matters because there are enough batches in flight to statistically even this out.
No. If you have a shader that uses 10 registers then you only have 25 batches in flight, which is 100 clocks of latency hiding, about half of what's required to hide memory latency. So if there's a section of the shader with 2-level dependent texturing coupled with a low ALU:TEX ratio, then that part of the shader is going to bottleneck in a way that's not represented by the shader as a whole: the cluster simply runs out of threads.
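To put rough numbers on that, here's a back-of-envelope sketch in Python. None of the constants are official hardware specs: the register budget (250) and clocks per batch (4) are just back-derived from the "10 registers, 25 batches, 100 clocks" figures above, and the ~200 clock memory latency is inferred from "about half". The function names and the simple "ALU work per fetch" model are my own assumptions for illustration.

```python
# Back-of-envelope model of latency hiding in one shader cluster.
# ASSUMPTIONS (inferred from the figures in the post, not from any spec):
REGISTER_BUDGET = 250         # 10 registers * 25 batches
CLOCKS_PER_BATCH = 4          # 100 clocks / 25 batches
MEMORY_LATENCY_CLOCKS = 200   # 100 clocks is "about half" of what is needed


def batches_in_flight(registers_per_thread):
    """Batches that fit in the register budget at this register count."""
    return REGISTER_BUDGET // registers_per_thread


def latency_hidden_clocks(registers_per_thread, alu_per_tex=1):
    """Clocks of ALU work available to cover one texture fetch.

    alu_per_tex is the LOCAL ALU:TEX ratio of the shader section being
    executed (hypothetical parameter), not the whole-shader average.
    """
    batches = batches_in_flight(registers_per_thread)
    return batches * CLOCKS_PER_BATCH * alu_per_tex


def stall_clocks(registers_per_thread, alu_per_tex=1):
    """Clocks per fetch the cluster sits idle with no runnable batch."""
    hidden = latency_hidden_clocks(registers_per_thread, alu_per_tex)
    return max(0, MEMORY_LATENCY_CLOCKS - hidden)


if __name__ == "__main__":
    for regs, ratio in [(10, 1), (10, 2), (5, 1)]:
        print(regs, ratio, batches_in_flight(regs),
              latency_hidden_clocks(regs, ratio), stall_clocks(regs, ratio))
```

Under these assumptions, a 10-register section at 1:1 ALU:TEX leaves the cluster short on every fetch, while the same section at 2:1 (or the same ratio at 5 registers) covers the latency. That's why the local ratio matters and not just the whole-shader average.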