Actually, there's both instruction- and thread-level parallelism going on. Consider that the latency of TMU instructions can be hidden by executing ALU instructions in the meantime, on both R5xx and G8x; if that's not ILP, I don't know what is! But it's quite a simplistic (-> cheap) approach to ILP, and GPUs are much more dependent on TLP than CPUs are.
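To make that concrete, here's a minimal CUDA-style sketch (CUDA maps naturally onto G8x; the kernel and names are purely illustrative, nothing vendor-specific). The ALU work on 'acc' doesn't depend on the texture result, so the hardware is free to execute it while the fetch is still in flight:

```
// Illustrative only: independent ALU work overlapping a TMU fetch.
__global__ void overlap_demo(cudaTextureObject_t texA,
                             const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 t = tex1Dfetch<float4>(texA, i);  // TMU: long-latency fetch issued here

    // Independent ALU instructions: they don't use 't', so they can run
    // while the fetch is still outstanding (cheap ILP within one thread).
    float acc = in[i];
    acc = acc * 1.5f + 2.0f;
    acc = acc * acc;

    // First use of 't': only here does the thread have to wait on the TMU.
    out[i] = acc + t.x + t.y + t.z + t.w;
}
```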
It is true, however, that ILP extraction mechanisms have been improving, and AFAIK this is already the case in G80 - I have yet to test this properly, but based on Lindholm's patent I believe that multiple ALU instructions from the same thread can execute simultaneously in the same ALU, and that there actually is a small instruction window in addition to the scoreboard. The main goal of ILP in GPUs is to reduce the number of simultaneous threads required, thus saving transistors by reducing the size of the register file.
If you are more familiar with CPU architectures, think of how Sun's Niagara architecture hides memory latency. But instead of having 4 threads per core, you have 16 to 64 threads, and each of them works on a Vec8 instruction. The programming model allows graphics programmers to think of it as a scalar architecture with a branching coherence requirement of 32 'threads' (8 lanes x 4 cycles).
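As a trivial CUDA illustration of that branching granularity (my own sketch, nothing vendor-specific): as long as the condition is uniform across a 32-wide batch, only one side of the branch executes; if it differs within the batch, both sides run with lanes masked off.

```
__global__ void branching_demo(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Coherent: (i / 32) is identical for all 32 threads of a batch,
    // so each batch executes only one of the two paths.
    if ((i / 32) % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;

    // Divergent: (i % 2) differs within a batch, so the hardware executes
    // both paths for every batch, masking half the lanes each time.
    if (i % 2 == 0)
        out[i] += 0.5f;
    else
        out[i] -= 0.5f;
}
```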
As for the future of caching on GPUs and whether there's any place for larger or more advanced caches... First, let's consider how it works right now. You've got two dedicated L1 caches per multiprocessor: one for textures and one for constants, each tuned for a different access pattern. The texture cache at the L1 level works on uncompressed data.
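In CUDA terms (just an illustrative sketch, not a description of the hardware itself), the difference in access patterns looks like this: a constant is read at the same address by every thread in a batch, which a constant cache can service as a broadcast, while texture fetches from neighbouring pixels hit neighbouring texels, which is what a tile-oriented texture cache is built for.

```
// Illustrative shading kernel: one uniform read, one 2D-local read.
__constant__ float4 material;   // per-draw constant, same address for all threads

__global__ void shade(cudaTextureObject_t tex, float* out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Neighbouring threads fetch neighbouring texels -> texture L1 territory.
    float4 texel = tex2D<float4>(tex, x + 0.5f, y + 0.5f);

    // Every thread reads the exact same constant -> constant cache broadcast.
    out[y * width + x] = texel.x * material.x
                       + texel.y * material.y
                       + texel.z * material.z;
}
```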
At the L2 level, public information is a bit more scarce... I would tend to believe it's a unified cache for most kinds of data. It is known, however, that at this level texture data is kept compressed in the cache, that is, in 64-bit or 128-bit blocks of 4x4 texels (S3TC/DXTC)... The logical reason data isn't kept compressed in L1 is that too much decoding logic would be required.
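For reference, here's roughly what decoding one of those 64-bit blocks involves - the textbook DXT1 decode (4-colour/opaque mode assumed), written as a plain C/CUDA sketch; this says nothing about how the hardware actually implements it, it just shows the kind of logic you'd need between L2 and L1:

```
// One 64-bit DXT1 block: two RGB565 endpoints + 16 x 2-bit selectors.
struct Dxt1Block {
    unsigned short c0, c1;   // RGB565 endpoints
    unsigned int   indices;  // texel 0 in the lowest two bits
};

__host__ __device__ void unpack565(unsigned short c, int rgb[3])
{
    rgb[0] = ((c >> 11) & 31) * 255 / 31;  // R, 5 bits
    rgb[1] = ((c >>  5) & 63) * 255 / 63;  // G, 6 bits
    rgb[2] = ( c        & 31) * 255 / 31;  // B, 5 bits
}

// Decode texel 0..15 of a block to 8-bit RGB (assumes c0 > c1, i.e. 4-colour mode).
__host__ __device__ void dxt1_texel(const Dxt1Block& b, int texel, int rgb[3])
{
    int e0[3], e1[3];
    unpack565(b.c0, e0);
    unpack565(b.c1, e1);

    int sel = (b.indices >> (2 * texel)) & 3;
    for (int i = 0; i < 3; ++i) {
        switch (sel) {
            case 0: rgb[i] = e0[i]; break;                    // endpoint 0
            case 1: rgb[i] = e1[i]; break;                    // endpoint 1
            case 2: rgb[i] = (2 * e0[i] + e1[i]) / 3; break;  // 2/3*e0 + 1/3*e1
            case 3: rgb[i] = (e0[i] + 2 * e1[i]) / 3; break;  // 1/3*e0 + 2/3*e1
        }
    }
}
```

Cheap as that looks in software, doing it at full texturing rate right next to every filtering unit is presumably the kind of duplicated logic you'd rather avoid - hence decompressing somewhere on the L2-to-L1 path instead.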
Anyway, right now, as I said, the cache is mostly there to save bandwidth AFAIK, not to hide latency - I'm not aware of any modern GPU architecture that relies on caching to hide latency, and even the original 3dfx Voodoo didn't work like that IIRC, despite being the first to bring a texture cache to the mainstream market. That texture cache was hugely important all the same, because it's what allowed them to have such good bilinear filtering performance. And that, from my point of view, is really what made it such a "killer product".
So, let's consider G80 again from that perspective. It has 64 filtering units, which is an incredible increase compared to NVIDIA's previous architectures, and an even bigger one compared to ATI's current architecture. There was a substantial bandwidth increase for G80, but nothing quite that dramatic; so, why did NVIDIA increase the number of TMUs so much, if you'd expect them to be bandwidth limited? Because their texture cache is saving the day.
Anisotropic filtering is what G80's TMU architecture is engineered around, and it tends to be fairly cache-coherent. The G80's texture caches are also significantly larger than those of any GPU that preceded it. These two factors combined largely explain why NVIDIA felt they could put so many TMUs/filtering units on the chip and not be massively bandwidth limited, as far as I can see.
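As a rough illustration of why aniso is cache-friendly, here's what a naive software version of an anisotropic filter loop looks like in CUDA (certainly not how G80's fixed-function TMUs implement it, and the tap placement here is simplified): all the taps walk along the axis of anisotropy, so consecutive taps land on texels right next to each other and mostly hit in the L1.

```
// Naive anisotropic sampling sketch: 'taps' bilinear samples spread along
// a (hypothetical, precomputed) axis of anisotropy around 'uv'.
__device__ float4 aniso_sample(cudaTextureObject_t tex,
                               float2 uv, float2 axis, int taps)
{
    float4 sum = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    for (int i = 0; i < taps; ++i) {
        // Offsets run from -0.5 to +0.5 along the anisotropy axis, so each
        // tap reads texels adjacent to the previous one -> high L1 hit rate.
        float t = (i + 0.5f) / taps - 0.5f;
        float4 s = tex2D<float4>(tex, uv.x + t * axis.x, uv.y + t * axis.y);
        sum.x += s.x; sum.y += s.y; sum.z += s.z; sum.w += s.w;
    }
    float inv = 1.0f / taps;
    sum.x *= inv; sum.y *= inv; sum.z *= inv; sum.w *= inv;
    return sum;
}
```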
Obviously, your focus for texture caches has been more on the latency side of things. They're interesting from that angle on GPUs too, but much less so than on CPUs. As I explained in my previous post, every thread (=pixel) in a warp (=batch) needs to hit in the cache for a given memory access, or you're not saving anything on latency: a single miss means that batch waits for the full memory latency anyway. Texturing over a batch of pixels tends to be fairly coherent, so it can happen. But you won't see it happening anywhere near 90%+ of the time, like you'd expect on a CPU.
In the future, I would expect the coherence requirement to stabilize at around 16 threads/batch. R520 was already there, but G80 and R580 are at 32 and 48 respectively for pixel shaders. Because pixels are processed in quads, a 16-pixel batch amounts to only four distinct texture operations. So I would expect the caches to increasingly affect effective latency tolerance, but it's hard to say for sure without hard data - which you won't have unless you work in NVIDIA's or AMD's GPU divisions.
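To put a rough number on the "everyone must hit" condition as batch size grows, here's a trivial back-of-the-envelope model (it assumes independent per-thread misses, which real texturing certainly isn't, so treat the numbers as illustrative only):

```
#include <math.h>
#include <stdio.h>

// Probability that an entire batch avoids the miss latency, assuming each
// thread hits independently with the same probability. Purely illustrative.
int main(void)
{
    double per_thread_hit = 0.99;
    int batch_sizes[] = { 4, 16, 32, 48 };

    for (int i = 0; i < 4; ++i) {
        int n = batch_sizes[i];
        printf("batch of %2d: P(no thread misses) = %.2f\n",
               n, pow(per_thread_hit, n));
    }
    return 0;
}
// With a 99% per-thread hit rate: ~0.96 for a batch of 4, ~0.85 for 16,
// ~0.72 for 32, ~0.62 for 48 - the bigger the batch, the rarer it is that
// the whole thing gets away without waiting on memory.
```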
So, given your position, I would assume you're mostly interested in what kind of research would be useful for GPU texture caches, and GPU caches in general (constants, instructions, z, colours, etc.) - but I fear I can't give a definitive answer to that question. Here are some of the things that have likely been implemented in the last 5 years at NVIDIA and ATI:
- Tuning the texture cache's behaviour for anisotropic filtering (and trilinear filtering, obviously)
- Implementing cheap and efficient decompression of S3TC data located in the L2 cache.
- Implementing different behaviours for different kinds of caches, including constant caches.
- Achieving a good balance of locality and associativity in all major kinds of cache systems.
- And, obviously, adding caches where they didn't need any in the past (instructions, colour, etc.)
Achieving even just part of these things would already be a fairly substantial research project, I'd imagine. There obviously are other areas, though. Even excluding related areas (such as researching new algorithms to replace or complement S3TC - just ask Simon F on these very forums if you think it's easy!), I'm sure some things could be figured out.
It is important to understand, however, that anything likely to be figured out at this point is evolutionary, not revolutionary. The cache system in a modern GPU is hardly naive; it already delivers excellent results given the chip's overall architecture, and at fairly minimal cost. It certainly could be improved upon, but I fail to imagine anything that would change things drastically. Of course, that's always what people say just before something revolutionary comes along - but for now, I'll have to remain skeptical!