I wasn't talking about the latency caused by the number of threads, but the sheer number of cycles you're waiting for memory accesses. Even if a thread has a GPU shader unit all to itself, it's going to take tens of thousands of clock cycles to complete execution. With a cache and a reasonable hit rate it might take only a few hundred cycles. If you need to do multiple 'passes' of data processing, the total latency until the results are returned to the host CPU can be unacceptable.
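To make the "tens of thousands vs. a few hundred cycles" claim concrete, here's a minimal expected-latency sketch; the hit rate and cycle counts are illustrative assumptions, not measurements of any particular GPU:

```python
# Illustrative average memory-access latency given a cache hit rate.
# All numbers here are assumptions chosen for the sake of the example.
def avg_latency(hit_rate, hit_cycles, miss_cycles):
    """Expected latency per access = weighted mix of hit and miss cost."""
    return hit_rate * hit_cycles + (1 - hit_rate) * miss_cycles

no_cache = avg_latency(0.0, 0, 500)   # every access pays the full DRAM trip: 500 cycles
cached = avg_latency(0.9, 20, 500)    # 90% hit rate: ~68 cycles per access
```

Multiply the per-access figure by the number of dependent accesses in a shader and the gap between "tens of thousands" and "a few hundred" cycles falls out directly.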
While that is correct in theory, I am very skeptical it's the problem in practice. Let us assume that it takes 50K cycles, and that I need 6 passes. That's 300K cycles, which corresponds to 0.5ms on a G92. If that was the only latency between the CPU and the GPU, you could do dozens of roundtrips per frame! And 300K cycles is just an insanely pessimistic number for *any* workload.
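The arithmetic above is worth spelling out; the G92 core clock is taken as 600 MHz and the frame budget as 60 fps, both assumptions for illustration:

```python
# Back-of-the-envelope check of the worst-case figures above.
CYCLES_PER_PASS = 50_000
PASSES = 6
CORE_CLOCK_HZ = 600e6           # assumed G92 core clock
FRAME_TIME_MS = 1000 / 60       # ~16.7 ms per frame at 60 fps

total_cycles = CYCLES_PER_PASS * PASSES              # 300K cycles
latency_ms = total_cycles / CORE_CLOCK_HZ * 1000     # 0.5 ms end-to-end
roundtrips_per_frame = FRAME_TIME_MS / latency_ms    # ~33 roundtrips
```

Even with these deliberately pessimistic inputs, you get over thirty full CPU-GPU roundtrips per frame.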
Not necessarily. I agree that prefetching a line that is never used is a waste, but first of all prefetches only happen on idle cycles.
But that's precisely my point: what if there aren't enough idle cycles because your application is very bandwidth-intensive? I have a very hard time believing your system-wide efficiency will remain good in that kind of circumstance.
I'd refer you to some of Bob's points in his posts on memory controllers in this thread. They also imply something else if you can read between the lines: by not being as sensitive to latency, you can achieve greater bandwidth efficiency for the overall system. This is another key advantage of having that many registers and threads, AFAICT.
So as TEX:ALU ratios keep decreasing and caches get bigger, current GPU architectures get less interesting.
I don't disagree with that overall rule of thumb, but I suspect you're significantly overestimating that effect's magnitude. YMMV, however.
R600 has 256 kB of shared L2 texture cache, touted by some as revolutionary...
I honestly wouldn't call that revolutionary, but once again YMMV. I'd be very curious as to whether AMD kept that texture cache size in RV670; I wouldn't be surprised at all if that was one of the many ways they lowered the transistor count.
In the CPU world we see 1.5 GHz server CPUs beat 3 GHz Celerons for the sole reason that the working set fits in the cache.
I know that. The working set in GPUs is obviously huge compared to all that though.
The tipping point for graphics is likely around a few dozen MB. But you're going to start to see some effect from a few MB too.
I'm a big fan of embedded memory (but not SRAM due to the low density) myself, but I don't think it'll ever help as much for texturing as for the framebuffer.
Even if you can store just a repetitive detail texture in the cache it's going to save a lot of bandwidth. And with some smart task decomposition and scheduling you can maximize reuse.
Indeed, that can't hurt. I'd point out that textures are only a major bandwidth problem when they're uncompressed though, and I would suspect that detail textures would tend to be DXTC or 3DC. I'd be much more interested in caching large LUTs myself (there aren't many of those in current games, although STALKER has one for lighting).
That's indeed impressive. I wonder how much data is in flight during a round-trip (i.e. the register space you need for hiding the latency of one access, assuming full bandwidth utilization). I just think there's a point where you want to start keeping things on-chip to lower average latency and save register space.
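The in-flight data question is just a latency-bandwidth product; the bandwidth and latency figures below are assumptions picked purely for illustration:

```python
# Rough latency-bandwidth product: how much data must be in flight
# (and hence roughly how much register space is needed) to keep the
# memory bus saturated. Both figures are assumptions, not measurements.
BANDWIDTH_GBPS = 64   # assumed memory bandwidth, GB/s
LATENCY_NS = 500      # assumed round-trip memory latency, ns

# (GB/s) x (ns) cancels to bytes, so the product is exact integer math.
in_flight_bytes = BANDWIDTH_GBPS * LATENCY_NS   # 32,000 bytes in flight
in_flight_kib = in_flight_bytes / 1024          # ~31 KiB
```

Scale the bandwidth or latency up and the register space needed to cover it grows linearly, which is exactly why on-chip storage starts to look attractive past some point.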
See my above points (including memory controller efficiency) as to why trying to 'save register space' might not be the best strategy.
Anyway, as I said above, I'm a big fan of both eDRAM and TBDR architectures (and even combinations of both). Yes, it's more expensive, but it still makes sense economically: your chip costs more, but your RAM costs less. So you increase your ASPs for a given segment of the market to the detriment of Samsung's.
However, I am not convinced even using a huge eDRAM-based L3 cache for texturing could deliver a significant performance boost, or would allow current architectures to get away with fewer registers. That's just my informed opinion though, I don't have raw data at my disposal obviously.