When you only have ~1M pixels to texture (a 1280x720 framebuffer, with no overdraw) and your data source provides a precise list of how to texture every single pixel in a single pass (though texture filtering itself implies dynamic loops), it seems to me that texturing is fairly easy to do very efficiently.
You don't have to build a software cache in LS; you simply need to create a texel-fetch queue with enough lead time that the entire final phase of rendering is pipelined behind a start-up lag of, say, ~5000 cycles (at 3.2GHz). Obviously I'm just guessing, and glossing over the fact that texel coordinates need to be calculated before texels can be fetched.
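To make the idea concrete, here's a minimal sketch of such a fetch queue in C. Everything here is hypothetical: the queue depth, the 2x2 footprint, and the plain memcpy standing in for what would really be an mfc_get DMA into LS on an SPE.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: a ring of in-flight texel fetches, issued far enough
 * ahead that by the time the shader loop consumes slot (p % QUEUE_DEPTH),
 * pixel p's texels have already "arrived". On Cell the fetch would be an
 * mfc_get DMA into local store; here memcpy stands in so it runs anywhere. */

#define QUEUE_DEPTH      16  /* lead time: fetches in flight ahead of use */
#define TEXELS_PER_FETCH 4   /* e.g. a 2x2 bilinear footprint */

typedef struct {
    uint32_t texels[TEXELS_PER_FETCH]; /* landed texel data */
} fetch_slot;

static fetch_slot queue[QUEUE_DEPTH];

/* Start the fetch for pixel p into slot (p % QUEUE_DEPTH). */
static void issue_fetch(const uint32_t *texture, size_t p)
{
    memcpy(queue[p % QUEUE_DEPTH].texels,
           &texture[p * TEXELS_PER_FETCH],
           sizeof(queue[0].texels));
}

/* Prime the queue (this is the start-up lag), then each iteration consumes
 * one landed slot and issues the fetch needed QUEUE_DEPTH pixels later, so
 * the "shader" never waits on a fetch it just asked for. */
uint64_t shade_all(const uint32_t *texture, size_t npixels)
{
    uint64_t checksum = 0;
    for (size_t p = 0; p < QUEUE_DEPTH && p < npixels; ++p)
        issue_fetch(texture, p);
    for (size_t p = 0; p < npixels; ++p) {
        const fetch_slot *s = &queue[p % QUEUE_DEPTH];
        for (int i = 0; i < TEXELS_PER_FETCH; ++i)
            checksum += s->texels[i];          /* filtering stand-in */
        if (p + QUEUE_DEPTH < npixels)
            issue_fetch(texture, p + QUEUE_DEPTH);
    }
    return checksum;
}
```

The point of the structure is that the only stall is the initial priming pass; after that, every fetch was requested QUEUE_DEPTH pixels ago.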
I'm sure Mintmaster could run the numbers for us, but 16x anisotropy for 2 blended textures per pixel (1M total) at 60fps would require xGB/s of bandwidth. Erm... I dunno, I'm useless at working that stuff out. Fingers crossed someone will work it out.
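For what it's worth, here's a worst-case back-of-envelope version of that sum, under assumptions I'm inventing (16 bilinear taps per 16x aniso sample, 4 texels per tap, 32-bit RGBA texels, and zero texel reuse between footprints):

```c
/* Worst-case (zero-reuse) texturing bandwidth for the scenario in the
 * post: 1280x720 pixels, 60fps, 2 blended textures, 16x anisotropy.
 * The per-tap numbers are assumptions, not from the original post. */
double worst_case_bytes_per_sec(void)
{
    const double pixels          = 1280.0 * 720.0; /* ~0.92M, the "1M" */
    const double fps             = 60.0;
    const double textures        = 2.0;   /* 2 blended textures */
    const double taps            = 16.0;  /* 16x anisotropy */
    const double texels_per_tap  = 4.0;   /* bilinear 2x2 footprint */
    const double bytes_per_texel = 4.0;   /* RGBA8 */

    return pixels * fps * textures * taps * texels_per_tap * bytes_per_texel;
}
```

That multiplies out to roughly 28 GB/s with no reuse at all; in practice neighbouring footprints share most of their texels, so the real figure should be a fraction of that.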
Anyway, the lack of overdraw is one of the key things that always helps with deferred lighting, and it'll definitely cut texturing bandwidth. The latency problem can be attacked simply by constructing a long shader execution pipeline: if the shader is arithmetic-intensive, there's a reasonable chance you've got some free latency hiding right there. After that, the predictable texel fetching should mean the shader execution pipeline can proceed without stalling.
Put simply: it's a streaming problem. Well, that's my guess, anyway.
Jawed