Is the other reason thread synchronization?
Quote:
You certainly need a FIFO for some things, but the G80 for example (and I would assume recent AMD architectures, but I don't know for sure) uses its register files creatively to reduce its necessary size. Search for "Register based queuing for texture requests" here: http://appft1.uspto.gov/netahtml/PTO/search-bool.html
And here's the abstract, since nearly everything you need to know is there:

I only read the abstract, but I don't see how this patent states anything contradicting what I've said. Register files are large, while caches and FIFOs can use custom memories that are smaller. It makes sense to use a combination of techniques to hide latency.
Quote:
Is the other reason thread synchronization?

No. Pascal had the correct answer.
Quote:
No. Pascal had the correct answer.

I thought there was some obscure reason you were thinking of, because bandwidth had already been mentioned as a reason.
Quote:
I only read the abstract, but I don't see how this patent states anything contradicting what I've said. Register files are large while cache and FIFOs can use custom memories that are smaller. It makes sense to use a combination of techniques to hide latency.

Sorry if I was unclear - I didn't mean to imply it was contradicting what you said! I simply wanted to add some extra colour with a real-world implementation example that highlights this combination of techniques.
Quote:
I thought there was some obscure reason you were thinking of because bandwidth had already been mentioned as a reason.

My apologies, I missed Arun's post. Although I was thinking that there are two ways it helps.
Caching in GPUs tends to be a trade secret. See Victor Moya's investigations:
http://personals.ac.upc.edu/vmoya/log.html
starting at 23/9/04 and older (actually, scroll to the very bottom!). He posts here as RoOoBo.
Here are some patent applications that I've been looking at recently:
Two level cache memory architecture
3-D rendering texture caching scheme
It's pretty hard to be sure how advanced the caching schemes in current GPUs are. One thing definitely worth remembering is that GPUs use very small amounts of cache memory - they dedicate far more memory to per-thread state (i.e. the registers and, in older GPUs, the latency-hiding FIFOs within the shading pipeline). L2 is typically measured in the low KBs, and it seems the newest GPUs distribute the L2 rather than using a monolithic blob.
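To put very rough numbers on that (just a back-of-envelope sketch: the register count is what I recall from the public CUDA documentation for G80-class parts, while the cache sizes are nothing more than the guesses mentioned later in this thread - treat all of them as assumptions):

```python
# Back-of-envelope: per-thread state vs. total cache on a G80-class GPU.
# Assumed figures: 16 multiprocessors with 8192 32-bit registers each
# (public CUDA docs for G80-class parts), plus the rough cache sizes
# guessed at later in this thread (~128 KiB L2, 8 x 8 KiB L1).
multiprocessors = 16
registers_per_mp = 8192                  # 32-bit registers
register_file = multiprocessors * registers_per_mp * 4   # bytes

l2 = 128 * 1024                          # rough guess, bytes
l1 = 8 * 8 * 1024                        # rough guess, bytes

print(f"register file: {register_file // 1024} KiB")      # 512 KiB
print(f"all caches:    {(l2 + l1) // 1024} KiB")           # 192 KiB
```

Even with generous cache estimates, the per-thread state dwarfs the caches.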
As for the future of caching on GPUs and whether there's any place for larger or more advanced caches... First, let's consider how it works right now. You've got two dedicated L1 caches per multiprocessor: one for textures and one for constants. Both are tuned for different access patterns. The texture cache at the L1 level works on uncompressed data.
At the L2 level, public information is a bit more scarce... I would tend to believe it's a unified cache for most kinds of data. It is known, however, that at this level texture data is kept compressed in the cache, i.e. in 4x4 blocks of 64 or 128 bits (S3TC/DXTC)... The logical reason why data isn't kept compressed in L1 is that too much decoding logic would be required.
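Just to illustrate why keeping it compressed at that level pays off, here's a quick sketch using the standard S3TC/DXTC block sizes (32-bit RGBA as the uncompressed reference is my assumption for comparison, nothing more):

```python
# Cache footprint of a 4x4 texel block, compressed vs. uncompressed.
# S3TC/DXTC stores a 4x4 block in 64 bits (DXT1) or 128 bits (DXT3/5);
# 32-bit RGBA is assumed as the uncompressed reference format.
texels_per_block = 4 * 4
uncompressed = texels_per_block * 4          # 64 bytes per block

for name, block_bits in (("DXT1", 64), ("DXT3/DXT5", 128)):
    compressed = block_bits // 8
    print(f"{name}: {compressed} B per block -> "
          f"{uncompressed // compressed}x more texels per KiB of L2")
```

So the same few KiB of L2 effectively holds 4-8x more texels when the data stays compressed.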
Anyway, right now, as I said, the cache is mostly there to save bandwidth AFAIK, not to hide latency - I'm not aware of any modern GPU architecture that relies on caching to hide latency. For example, the original 3dfx Voodoo didn't work like that IIRC, even though it was the first to introduce a texture cache to the mainstream market. But that texture cache was incredibly important anyway, because it gave them excellent bilinear filtering performance. And that was really what made it such a "killer product", from my point of view.
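To see why even a tiny texture cache makes bilinear filtering so much cheaper, here's a toy sketch of my own (not how any real TMU works) that just counts texel reads over a small pixel tile at roughly 1:1 magnification:

```python
# Toy model of bilinear filtering over a 16x16 pixel tile at ~1:1
# texel-to-pixel ratio. Each sample needs its 2x2 texel neighbourhood;
# without any cache that's 4 texel reads per pixel, while an idealised
# cache only ever reads each unique texel once.
def bilinear_footprint(u, v):
    u0, v0 = int(u), int(v)
    return {(u0, v0), (u0 + 1, v0), (u0, v0 + 1), (u0 + 1, v0 + 1)}

tile = 16
no_cache = 0
unique_texels = set()

for y in range(tile):
    for x in range(tile):
        footprint = bilinear_footprint(x + 0.5, y + 0.5)
        no_cache += len(footprint)
        unique_texels |= footprint

print(f"texel reads with no cache: {no_cache}")               # 1024
print(f"unique texels actually needed: {len(unique_texels)}")  # 289
```

In other words, a small cache turns roughly 4 texel fetches per bilinear sample into a little over 1, which is exactly the kind of bandwidth saving that made the Voodoo's filtering so cheap.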
So, let's consider G80 again from that perspective. It has 64 filtering units, which is an incredible increase compared to NVIDIA's previous architectures, and an even bigger one compared to ATI's current architecture. There was a substantial bandwidth increase for G80, but nothing quite that dramatic; so, why did NVIDIA increase the number of TMUs so much, if you'd expect them to be bandwidth limited? Because their texture cache is saving the day.
Anisotropic filtering is what G80's TMU architecture is engineered around, and it tends to be fairly cache-coherent. The G80's texture caches are also significantly larger than those of any GPU that preceded it. These two factors combined largely explain why NVIDIA felt they could put so many TMUs/filtering units on the chip and not be massively bandwidth limited, as far as I can see.
Obviously, your focus for texture caches has been more on the latency side of things. They're also interesting there on GPUs, but much less so than on CPUs. As I explained in my previous post, all the threads (=pixels) in a warp (=batch) need to have no cache misses for a given memory access, or you're not saving anything on latency. Texturing over a batch of pixels tends to be fairly coherent, so it can happen. But you won't see it happening anywhere near 90%+ of the time, like you'd expect from a CPU.
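A crude way to picture it: if you (unrealistically) treat each distinct fetch in the batch as an independent event with some hit rate p, the chance that the whole batch avoids the memory round-trip shrinks quickly with batch size. The numbers below are made up purely for illustration:

```python
# Chance that *every* distinct fetch in a batch hits the cache, under
# the (unrealistic) assumption of independent fetches with hit rate p.
# Real fetches within a batch are highly correlated, which is the only
# reason this works at all - but it still won't be CPU-like 90%+.
p_hit = 0.90                            # made-up per-fetch hit rate
for distinct_fetches in (1, 4, 8, 12):  # e.g. one fetch per quad
    p_whole_batch = p_hit ** distinct_fetches
    print(f"{distinct_fetches:2d} distinct fetches -> whole batch "
          f"skips the round-trip {p_whole_batch:.0%} of the time")
```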
In the future, I would expect batch sizes to stabilize at around 16 threads/batch. R520 was already there, but G80 and R580 are at 32 and 48 respectively for pixel shaders. Because pixels work in quads, 16 threads is only four distinct texture operations. So I would expect the caches to increasingly affect effective latency tolerance, but it's hard to say for sure without hard data - which you won't have unless you work in NVIDIA's or AMD's GPU divisions.
So, given your position, I would assume you're mostly interested in what kind of research would be useful for GPU texture caches, and GPU caches in general (constants, instructions, z, colours, etc.) - but I fear I can't give definitive answers to that question. Here are some of the things that have likely been implemented in the last 5 years at NVIDIA and ATI:
- Tuning the texture cache's behaviour for anisotropic filtering (and trilinear filtering, obviously).
- Implementing cheap and efficient decompression of S3TC data located in the L2 cache.
- Implementing different behaviours for different kinds of caches, including constant caches.
- Achieving a good balance of locality and associativity in all major kinds of cache systems.
- And, obviously, adding caches where they didn't need any in the past (instructions, colour, etc.)
Achieving even just part of these things would already be a fairly substantial research project, I'd imagine. There obviously are other areas, though. Even excluding related areas (such as researching new algorithms to replace or complement S3TC - just ask Simon F on these very forums if you think it's easy!), I'm sure some things could be figured out.
It is important to understand, however, that anything that could likely be figured out at this point is probably evolutionary, not revolutionary. The cache system in a modern GPU is hardly naive; it delivers excellent results given the chip's overall architecture already, and at fairly minimal costs. It certainly could be improved upon, but I fail to imagine anything that'd change things drastically. Of course, that's always what people say before something revolutionary comes up - but for now, I'll have to remain skeptical!
I don't really see why you'd need a cache in your example to hide latency. When the fetch latency increases, you just have to increase the FIFO depth accordingly; the presence of a cache isn't a requirement. Unless your external bandwidth is much lower than what the cache provides, but that's a different story.
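If it helps to see the sizing argument spelled out (placeholder numbers, not anything from a real chip):

```python
# Sizing a latency-hiding FIFO with Little's law: it has to buffer
# everything issued during one memory round-trip, so its depth scales
# linearly with latency. Numbers are placeholders, not a real chip.
issue_rate = 1          # requests pushed into the pipeline per clock
for latency in (100, 200, 400, 800):
    depth = issue_rate * latency
    print(f"{latency} cycles of fetch latency -> FIFO depth >= {depth}")
```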
Quote:
Are you sure the L2 cache size is in KBs? Both of the following papers suggest it's a good idea for the L1 to be in KBs, and the second architecture uses an L2.

I'm not sure I'm remembering this right, but I think the G80 has <=128KiB of L2 and <=8x8KiB of L1. As I said, I can't remember the exact numbers, and this might just be for textures.
Quote:
1. Is latency really of no (or little) importance? I mean, a design that helps reduce its effective value must certainly help...

Think of it this way: in order to hide more effective latency on a GPU, you need more threads. That means a larger register file. So, the question is, how big is the register file on the latest GPUs? It's probably not a very large percentage of the chip. Thus, what needs to be achieved is an ideal balance between minimizing effective latency and hiding it. I would assume current implementations are already very good at that, but I don't have the insider knowledge necessary to be sure.
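To make the "more latency to hide means more threads means a bigger register file" chain concrete, here's a rough sketch with entirely hypothetical numbers (none of them are G80 figures):

```python
# Rough model of "more latency to hide -> more threads -> bigger
# register file". Assume the scheduler issues one instruction per clock
# and each thread only has `indep_instrs` instructions of independent
# work per texture fetch. All numbers are hypothetical placeholders.
regs_per_thread = 16      # 32-bit registers kept live per thread
indep_instrs = 10         # ALU instructions available per fetch

for latency in (200, 400, 800):
    threads_needed = latency // indep_instrs
    register_file = threads_needed * regs_per_thread * 4   # bytes
    print(f"{latency} cycles -> {threads_needed} threads in flight, "
          f"{register_file / 1024:.1f} KiB of registers pinned")
```

Double the latency you need to hide and you double the register file; anything that trims effective latency (like a cache hit) saves that storage instead.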
Quote:
2. Can it be assumed that texture data in L2 is in most cases S3TC compressed?

Yes, it can safely be assumed that L1 is uncompressed and L2 is compressed. I'm pretty sure NVIDIA has publicly confirmed this, and I would assume it to also be the case for AMD...
Quote:
I find it hard to understand... I'm sorry if my questions are persistent and naive, but I really want to understand the difference between "hiding latency" and "reducing latency". Agreed that the former is more important and efficient parallelism is the way to go for it, but does the latter hold any significance in design?

Reducing effective or average latency helps, and as I said above, having a balanced architecture is the goal. Adding a multi-MiB cache wouldn't make sense, though, considering the diminishing returns it would deliver.
Quote:
Are you sure the L2 cache size is in KBs?

For DX9 GPUs and older, yes. L1 is in the range of hundreds of bytes, at most.
Quote:
The second is that the external bus read granularity is so large these days that you have to read dozens of texels just to access the few you might need for a particular pixel.

Yep. An unfortunate side effect of striving for higher and higher amounts of bandwidth. It's a good thing the typical GPU workload doesn't fetch random pixels from all over the screen.
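Just to put an illustrative number on that (the burst size and texel size below are assumptions, not any particular GPU/memory combination):

```python
# Overfetch from coarse bus read granularity: burst size and texel size
# are assumed purely for illustration, not any specific GPU or memory.
burst_bytes = 64                      # minimum useful DRAM read
texel_bytes = 4                       # uncompressed 32-bit texel
texels_per_burst = burst_bytes // texel_bytes      # 16

bursts_touched = 2                    # a 2x2 footprint can straddle bursts
texels_read = bursts_touched * texels_per_burst    # 32
texels_needed = 4                     # what one bilinear sample uses
print(f"{texels_read} texels read for {texels_needed} needed "
      f"({texels_read // texels_needed}x overfetch)")
```

With coherent access patterns the neighbouring pixels end up using most of that burst anyway, which is the point of the remark above.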
Quote:
My guess is that the number from NVidia must be some average latency with average or heavy traffic.

There's also the "average" when G80 is scaled down to make a G84 or G86, and when low-bandwidth DDR2 is used.