If an access misses in L2 and has to go out to memory, it first passes through the DRAM controller, which is very likely optimized for high bandwidth at the expense of some latency. That adds a heavy cost before the request even reaches the DRAM itself. The penalty is likely to be lower for SRAM, since the controller doesn't need to care nearly as much about the optimal sequence of opening and closing rows across banks to get high utilization. It won't remove the latency of the GPU's own memory hierarchy, but it will definitely be faster (though I can't put a number on how much).
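To make the argument a bit more concrete, here is a toy model of a single DRAM bank behind a controller, compared with a fixed-latency SRAM. Everything in it is an assumption for illustration: the names (`DramTimings`, `dram_latency`, `sram_latency`, `average_dram_latency`) are made up, and the nanosecond values are placeholders, not measurements of any real GPU or memory part. It only shows the shape of the effect: a DRAM access pays a controller delay plus a row-dependent cost, while SRAM has no row state to manage.

```python
from dataclasses import dataclass


@dataclass
class DramTimings:
    # All values are illustrative placeholders in nanoseconds, not real figures.
    t_rp: float = 15.0              # precharge: close the currently open row
    t_rcd: float = 15.0             # activate: open the requested row
    t_cas: float = 15.0             # column read from the open row
    controller_queue: float = 50.0  # assumed delay in a bandwidth-oriented controller


def dram_latency(row, open_row, t):
    """Return (latency_ns, new_open_row) for one access to a single bank."""
    latency = t.controller_queue + t.t_cas
    if open_row is None:
        latency += t.t_rcd              # bank idle: only need to open the row
    elif open_row != row:
        latency += t.t_rp + t.t_rcd     # row conflict: close the old row, open the new one
    return latency, row


def sram_latency(fixed_ns=10.0):
    """SRAM has no rows to open or close, so model it as a fixed (assumed) latency."""
    return fixed_ns


def average_dram_latency(rows, t=None):
    """Average latency over a sequence of row addresses hitting one bank."""
    t = t or DramTimings()
    open_row = None
    total = 0.0
    for row in rows:
        latency, open_row = dram_latency(row, open_row, t)
        total += latency
    return total / len(rows)


if __name__ == "__main__":
    print("same row:        ", average_dram_latency([0, 0, 0, 0]))
    print("alternating rows:", average_dram_latency([0, 1, 0, 1]))
    print("sram:            ", sram_latency())
```

With these placeholder numbers the alternating-row pattern comes out slower than the same-row pattern, and both are slower than the SRAM case. The actual gap on real hardware depends on the controller and the part, which this sketch does not try to capture.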