GPU Memory Latency

You certainly need a FIFO for some things, but the G80 for example (and I would assume recent AMD architectures too, but I don't know for sure) uses its register files creatively to reduce the necessary FIFO size. Search for "Register based queuing for texture requests" here: http://appft1.uspto.gov/netahtml/PTO/search-bool.html

And here's the abstract, since nearly everything you need to know is there:
I only read the abstract, but I don't see how this patent states anything contradicting what I've said. Register files are large while cache and FIFOs can use custom memories that are smaller. It makes sense to use a combination of techniques to hide latency.
 
I only read the abstract, but I don't see how this patent states anything contradicting what I've said. Register files are large while cache and FIFOs can use custom memories that are smaller. It makes sense to use a combination of techniques to hide latency.
Sorry if I was unclear - I didn't mean to imply it was contradicting what you said! :) I simply wanted to add some extra colour to it with an extra real-world implementation example that highlights this combination of techniques.
 
I thought there was some obscure reason you were thinking of, since bandwidth had already been mentioned as a reason.
My apologies, I missed Arun's post. That said, I was thinking that there are two ways it helps.

The first is because texels (especially when performing filtering on neighbouring pixels) are frequently re-used (unless you turned MIP mapping off :devilish: ). The second is that the external bus read granularity is so large these days that you have to read dozens of texels just to access the few you might need for a particular pixel.
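To put a rough number on that second point, here's a quick back-of-envelope sketch in Python; the burst size and texel format are assumed values for illustration, not figures for any particular GPU:

```python
# Rough sketch: how much of one memory burst is actually useful for a
# single bilinear sample. Burst size and texel format are assumptions
# for illustration, not figures for any particular GPU.

BURST_BYTES = 64          # assumed DRAM read granularity per access
TEXEL_BYTES = 4           # assumed uncompressed RGBA8 texel
TEXELS_PER_BILINEAR = 4   # 2x2 footprint of one bilinear sample

texels_per_burst = BURST_BYTES // TEXEL_BYTES
wasted = 1 - TEXELS_PER_BILINEAR / texels_per_burst

print(f"{texels_per_burst} texels arrive per burst;")
print(f"{wasted:.0%} of them are wasted unless neighbouring pixels reuse them")
```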
 
Caching in GPUs tends to be trade secret. See Victor Moya's investigations:

http://personals.ac.upc.edu/vmoya/log.html

starting at 23/9/04 and older (actually, scroll to the very bottom!). He posts here as RoOoBo.

Here are some patent applications that I've been looking at recently:

Two level cache memory architecture

3-D rendering texture caching scheme

It's pretty hard to be sure how advanced the caching schemes in current GPUs are. One thing it's definitely worth remembering is that GPUs use very small amounts of cache memory - they dedicate far more memory to per-thread state (i.e. the registers and, in older GPUs, latency-hiding FIFOs within the shading pipeline). L2 is typically measured in low KB and it seems the newest GPUs distribute L2, rather than using a monolithic blob.

Thanks for the links! I've already read most of the papers Victor mentioned. I'll take a look at the patents too..

Are you sure the L2 cache size is in KBs? Both the following papers suggest it's a good idea for the L1 to be in KBs, and the second architecture uses an L2.

http://graphics.stanford.edu/papers/texture_cache/

http://ieeexplore.ieee.org/xpl/abs_free.jsp?arNumber=997855

As for the future of caching on GPUs and if there's any place for larger or more advanced caches... First, let's consider how it works right now. You've got two dedicated L1 caches per multiprocessor: one for textures and one for constants. Both are tuned for different access patterns. The texture cache at the L1 level works on uncompressed data.

At the L2 level, public information is a bit more scarce... I would tend to believe it's a unified cache for most kinds of data. It is known, however, that at this level the texture data is kept in cache compressed, i.e. in 4x4 blocks of 64 or 128 bits (S3TC/DXTC)... The logical reason why data isn't kept compressed in L1 is that too much decoding logic would be required.
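For reference, here's the storage arithmetic behind keeping S3TC/DXTC blocks in L2 - the block sizes are the standard S3TC ones, and the uncompressed baseline assumes 32-bit RGBA texels:

```python
# Storage arithmetic for S3TC/DXTC blocks: 4x4 texels per block,
# 64 bits (DXT1) or 128 bits (DXT3/DXT5), versus 32-bit RGBA texels.

TEXELS_PER_BLOCK = 4 * 4
UNCOMPRESSED_BITS = TEXELS_PER_BLOCK * 32   # 512 bits for RGBA8

for name, block_bits in (("DXT1", 64), ("DXT5", 128)):
    bpp = block_bits / TEXELS_PER_BLOCK
    ratio = UNCOMPRESSED_BITS / block_bits
    print(f"{name}: {block_bits}-bit block = {bpp:.0f} bits/texel, "
          f"{ratio:.0f}:1 vs uncompressed RGBA8")
```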

Anyway, right now as I said the cache is mostly there to save bandwidth AFAIK, not hide latency - I'm not aware of any modern GPU architecture that relies on caching to hide latency. For example, the original 3DFX Voodoo didn't work like that iirc, even though it was the first to introduce a texture cache to the mainstream market. But that texture cache was incredibly important anyway, because it allowed them to have incredibly good bilinear filtering performance. And that was really what made it such a "killer product", from my point of view.

So, let's consider G80 again from that perspective. It has 64 filtering units, which is an incredible increase compared to NVIDIA's previous architectures, and an even bigger one compared to ATI's current architecture. There was a substantial bandwidth increase for G80, but nothing quite that dramatic; so, why did NVIDIA increase the number of TMUs so much, if you'd expect them to be bandwidth limited? Because their texture cache is saving the day.
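A rough sketch of that argument, assuming approximate G80-class figures (64 bilinear units at ~575 MHz, ~86 GB/s of memory bandwidth) and uncompressed RGBA8 texels - treat the numbers as illustrative only:

```python
# Back-of-envelope: raw texel demand of the filtering units versus what
# the memory bus can supply. Clock and bandwidth are approximate
# G80-class values, used purely for illustration.

FILTER_UNITS = 64
TEXELS_PER_BILINEAR = 4
TEXEL_BYTES = 4                 # uncompressed RGBA8
CORE_CLOCK_HZ = 575e6           # approximate TMU clock
MEM_BANDWIDTH_BPS = 86.4e9      # approximate memory bandwidth

demand_bps = FILTER_UNITS * TEXELS_PER_BILINEAR * TEXEL_BYTES * CORE_CLOCK_HZ
covered_by_memory = MEM_BANDWIDTH_BPS / demand_bps

print(f"peak texel demand: ~{demand_bps / 1e9:.0f} GB/s")
print(f"memory can supply: ~{MEM_BANDWIDTH_BPS / 1e9:.0f} GB/s")
print(f"=> roughly {1 - covered_by_memory:.0%} of texel traffic has to be "
      f"served by cache reuse and/or compression at peak rate")
```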

Anisotropic filtering is what G80's TMU architecture is engineered around, and it tends to be fairly cache-coherent. The G80's texture caches are also significantly larger than those of any GPU that preceded it. These two factors combined largely explain why NVIDIA felt they could put so many TMUs/filtering units on the chip and not be massively bandwidth limited, as far as I can see.

Obviously, your focus for texture caches has been more on the latency side of things. They're also interesting there on GPUs, but much less so than on CPUs. As I explained in my previous post, all the threads (=pixels) in a warp (=batch) need to have no cache misses for a given memory access, or you're not saving anything on latency. Texturing over a batch of pixels tends to be fairly coherent, so it can happen. But you won't see it happening anywhere near 90%+ of the time, like you'd expect from a CPU.

In the future, I would expect coherence to stabilize at around 16 threads/batch. R520 was already there, but G80 and R580 are at 32 and 48 respectively for pixel shaders. Because pixels work in quads, that's only four distinct texture operations. So, I would expect the caches to increasingly affect effective latency tolerance, but it's hard to say for sure without having hard data - which you won't have unless you work in NVIDIA or AMD's GPU divisions.
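As a toy illustration of how quickly "the whole batch must hit" gets hard, here's a sketch using an assumed per-quad hit rate (not measured data) for the batch sizes mentioned above:

```python
# Toy illustration: probability that an entire batch hits the cache for a
# given texture access, assuming one lookup per quad and an assumed
# (not measured) per-quad hit rate.

hit_rate = 0.90   # assumed per-quad hit rate

for batch_size in (16, 32, 48):   # R520 / G80 / R580 pixel batch sizes
    quads = batch_size // 4
    p_no_miss = hit_rate ** quads
    print(f"batch of {batch_size} pixels = {quads} quads: "
          f"P(no miss at all) = {p_no_miss:.2f}")
```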

So, given your position, I would assume you're mostly interested in what kind of research would be useful for GPU texture caches, and GPU caches in general (constants, instructions, z, colours, etc.) - but I fear I can't give definitive answers to that question. Here are some of the things that have likely been implemented in the last 5 years at NVIDIA and ATI:
- Tuning the texture cache's behaviour for anisotropic filtering (and trilinear filtering, obviously).
- Implementing cheap and efficient decompression of S3TC data located in the L2 cache.
- Implementing different behaviours for different kinds of caches, including constant caches.
- Achieving a good balance of locality and associativity in all major kinds of cache systems.
- And, obviously, adding caches where they didn't need any in the past (instructions, colour, etc.).

Achieving even just part of these things would already be a fairly substantial research project, I'd imagine. There obviously are other areas, though. Even excluding related areas (such as researching new algorithms to replace or complement S3TC - just ask Simon F on these very forums if you think it's easy!), I'm sure some things could be figured out.

It is important to understand, however, that anything that could likely be figured out at this point is probably evolutionary, not revolutionary. The cache system in a modern GPU is hardly naive; it delivers excellent results given the chip's overall architecture already, and at fairly minimal costs. It certainly could be improved upon, but I fail to imagine anything that'd change things drastically. Of course, that's always what people say before something revolutionary comes up - but for now, I'll have to remain skeptical! :)

Thanks, your observations and ideas are really nice. I have some questions though:

1. Is latency really of no (little) importance? I mean, design that helps reduce its effective value must certainly help..

2. Can it be assumed that texture data in L2 is in most cases S3TC compressed?

I don't really see why you'd need a cache in your example to hide latency. When the fetch latency increases, you just have to increase the FIFO accordingly, but the presence of a cache isn't a requirement. Unless your external bandwidth is much lower than the one provided by the cache, but that's a different story.

I find it hard to understand.. I'm sorry if my questions are persistent and naive, but I really want to understand the difference between "hiding latency" and "reducing latency". Agreed that the former is more important and efficient parallelism is the way to go for it, but does the latter hold any significance in design?

Thanks again

Anjul
 
Are you sure the L2 cache size is in KBs? Both the following papers suggest it's a good idea for the L1 to be in KBs, and the second architecture uses an L2.
I'm not sure I'm remembering this right, but I think the G80 has <=128KiB L2 and <= 8x8KiB L1. As I said, I can't remember the exact numbers, and this might just be for textures.
1. Is latency really of no (little) importance? I mean, design that helps reduce its effective value must certainly help..
Think of it this way: in order to hide more effective latency on a GPU, you need more threads. That means a larger register file. So, the question is, how big is the register file on the latest GPUs? It's probably not a very large percentage of the chip. Thus, what needs to be achieved is an ideal balance between minimizing effective latency and hiding it. I would assume current implementations to already be very good at that, but I don't have the insider knowledge necessary to make sure of that.
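Here's a minimal sketch of that trade-off with invented numbers (latency, instruction mix and register counts are all assumptions, not die measurements):

```python
# Minimal sketch: more latency to hide => more threads in flight => more
# register file. Every figure here is an assumption for illustration.

latency_cycles = 200        # assumed effective memory latency to cover
ops_between_fetches = 10    # assumed ALU instructions per texture fetch
regs_per_thread = 16        # assumed live registers per thread
bytes_per_reg = 4           # 32-bit registers

# Threads needed per issue slot so that someone always has ALU work ready
# while the others wait on memory.
threads_needed = latency_cycles / ops_between_fetches
rf_bytes = threads_needed * regs_per_thread * bytes_per_reg

print(f"~{threads_needed:.0f} threads in flight per issue slot")
print(f"~{rf_bytes / 1024:.2f} KiB of registers per issue slot")
print("halve the effective latency (e.g. via cache hits) and you halve "
      "this register cost - hence the balancing act")
```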
2. Can it be assumed that texture data in L2 is in most cases S3TC compressed?
Yes, it can safely be assumed that L1 is uncompressed and L2 is compressed. I'm pretty sure NVIDIA has publicly confirmed this, and I would assume it to also be the case for AMD...
I find it hard to understand.. I'm sorry if my questions are persistent and naive, but I really want to understand the difference between "hiding latency" and "reducing latency". Agreed that the former is more important and efficient parallelism is the way to go for it, but does the latter hold any significance in design?
Reducing effective or average latency helps, and as I said above, having a balanced architecture is the goal. Adding a multi-MiB cache wouldn't make sense though, considering the diminishing returns it would deliver.

Clearly, prefetching with a texture cache of a reasonable size helps. But you need to make sure your prefetching isn't too aggressive, otherwise you'll waste bandwidth. Once again, that's all about balance. Googling around, I found a good presentation that has some information on this subject, and many others: http://www-csl.csres.utexas.edu/use...ics_Arch_Tutorial_Micro2004_BillMarkParts.pdf

Pages 64 to 71 are likely what will interest you most. It is noted that the cache miss rate is ">10%", which is actually a bit lower than I would have assumed it to be in practice. I would expect that this could definitely help for effective latency, even with a batch size of 16-32 pixels (4-8 quads, from a miss rate perspective...) since nearby pixels/quads should be fairly coherent.

So, yes, the caches can help for latency on a modern GPU - but you shouldn't overoptimize them for that, either. You should ideally optimize the sizes of your register file and of your texture cache together - and keep in mind that performance even with massive miss rates needs to be acceptable, because some workloads may have much poorer memory coherence.
 
Are you sure the L2 cache size is in KBs?
For DX9 GPUs and older, yes. L1 is in the range of hundreds of bytes, at most.

D3D10 and newer GPUs place much more emphasis on vertex and constant data (as well as upping the limits on texture data). The access patterns end up being more complex because of the high limits (e.g. 128 textures bound to a single shader, or support for thousands of constants bound to a single shader program) and it seems that caching in these newest GPUs has to be significantly more complex in order to cope.

Basically the shader pipeline needs to be able to fetch arbitrary constant, vertex and texture data with the latency for each fetch being entirely hidden.

In general, the amount of data goes: constant < vertex < texture - there's perhaps an order of magnitude difference from one to the next. The access patterns for constants are the most complex, since there's little reason for spatial locality in memory fetches. Vertex data will tend to consist of serially fetched streams, with up to 8 in parallel (I think). Texture data, meanwhile, will tend to be fetched in localised tiles from each texture in memory.

So the cache treatment for each of these types of data needs to be different.
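To make the contrast concrete, here's a toy sketch of the three address streams described above; the actual address layouts are invented for illustration:

```python
# Toy address streams for the three kinds of data, to show why one cache
# policy can't fit all of them. The layouts are invented for illustration.

import random

def constant_addresses(n, table_size=4096):
    # constants: data-dependent indexing, little spatial locality
    return [random.randrange(table_size) * 4 for _ in range(n)]

def vertex_addresses(n, stride=32):
    # vertices: serially fetched stream, perfectly sequential
    return [i * stride for i in range(n)]

def texture_addresses(n, width=1024, texel_bytes=4, x0=100, y0=200):
    # textures: a localised 2D tile around the sample point
    addrs = [((y0 + dy) * width + (x0 + dx)) * texel_bytes
             for dy in range(4) for dx in range(4)]
    return addrs[:n]

for name, gen in (("constant", constant_addresses),
                  ("vertex  ", vertex_addresses),
                  ("texture ", texture_addresses)):
    print(name, gen(8))
```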

Looking at G80, it seems that it uses much larger amounts of cache than previous generations of GPU, because:
  1. TLP has the relatively unpleasant side-effect of increased cache-thrashing, so increased size and set-associativity will soften the blow
  2. cache has to cope with the much harsher demands of D3D10
This is how I think G80's caches are configured:
  • 6 memory channels (of 64-bits each), each of which has 16KB of L2 cache (vague on quantity - perhaps used by the ROPs, as well - hard to tell)
  • 16 processors, each has:
    • 16KB of parallel data cache
    • 8KB of constant cache
    • 8KB of 1D (vertex) cache
    • an unknown amount of TMU cache
G80 seems to have a 64KB block of constant memory (cached by the 8KB per processor constant cache described above).
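A quick tally of those estimates (they're guesses, so treat the totals the same way):

```python
# Tally of the estimated figures above. These are guesses about G80, so
# the totals are only as reliable as the individual numbers.

KB = 1024
l2_total = 6 * 16 * KB                     # 6 channels x 16KB L2
per_processor = (16 + 8 + 8) * KB          # parallel data + constant + 1D
processors = 16

print(f"L2 total: {l2_total // KB} KB")
print(f"per-processor caches: {per_processor // KB} KB x {processors} = "
      f"{processors * per_processor // KB} KB")
print(f"rough on-chip total (TMU caches unknown, excluded): "
      f"{(l2_total + processors * per_processor) // KB} KB")
```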

Jawed
 
The second is that the external bus read granularity is so large these days that you have to read dozens of texels just to access the few you might need for a particular pixel.
Yep. An unfortunate side effect of striving for higher and higher amounts of bandwidth. It's a good thing the typical GPU workload doesn't fetch random pixels from all over the screen.
 
I find it hard to understand.. I'm sorry if my questions are persistent and naive, but I really want to understand the difference between "hiding latency" and "reducing latency". Agreed that the former is more important and efficient parallelism is the way to go for it, but does the latter hold any significance in design?

You hide latency by having multiple active threads and multiple outstanding reads. If you can schedule enough threads, you should be able to avoid idle cycles on your execution engine. Your latency doesn't reduce at all.

You can reduce latency by storing data in the cache, so when you have a hit, the data returns quicker. The latency is really shorter. But as noted before, this won't work very well if you have hits mixed with misses and you need your data in order.

Both techniques require memory and, as Arun wrote, for each type of application, there must be some kind of optimum, but latency hiding is probably more effective since the memory required scales linearly with the latency that you want to cover.
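A minimal sketch of the two effects side by side, with assumed hit rates and latencies:

```python
# Side-by-side sketch of latency *reduction* (cache) and latency *hiding*
# (outstanding work), with assumed hit rates and latencies.

hit_rate = 0.8          # assumed
hit_latency = 20        # assumed cycles on a hit
miss_latency = 200      # assumed cycles on a miss
requests_per_cycle = 1  # assumed issue rate

# Reduction: the cache lowers the *average* latency on hits (though, as
# noted above, in-order consumption blunts this when hits and misses mix).
avg_latency = hit_rate * hit_latency + (1 - hit_rate) * miss_latency
print(f"average latency with cache: {avg_latency:.0f} cycles")

# Hiding: the storage cost (FIFO entries or threads) scales linearly with
# whatever latency is left to cover.
print(f"entries to hide {miss_latency} cycles: "
      f"{miss_latency * requests_per_cycle}")
print(f"entries to hide {avg_latency:.0f} cycles: "
      f"{avg_latency * requests_per_cycle:.0f}")
```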

As an aside, in CPUs there are almost no independent fetching threads, so they have to rely on latency reduction with a cache. And since the data in a CPU is also needed in-order, a miss results in huge performance drops, so that's why the caches have to be so big.
 
The main component of memory latency is the cycles wasted waiting in queues before the memory request is actually serviced by the GDDR and returned to the requesting unit - at least when 'interesting things' are happening (i.e. when there is a significant amount of memory traffic). So the way to reduce latency is to work around those queues, and those queues are the memory controller. And when you take into account the related penalties that aren't directly counted as 'latency' but that increase the number of cycles with no data traffic to or from the GDDR chip (opening pages, scheduling write and read commands, etc.), a bad implementation of the memory controller can really increase the service wait time quite a lot.

Point-to-point latency (GDDR to requesting unit) with no memory traffic (and no penalties) in a GPU is likely to be in the two digits (just getting data from the GDDR into the chip must already be ~10 cycles) rather than three digits. My guess is that the number from NVIDIA must be some average latency with average or heavy traffic.
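To illustrate the difference, here's a rough decomposition with invented cycle counts - the point is just that the queueing and penalty terms dominate once there's real traffic:

```python
# Rough decomposition of request latency with invented cycle counts; the
# point is only that queueing and controller penalties dominate under load.

idle_path = {
    "unit -> memory controller": 5,
    "GDDR access (CAS, burst)": 15,
    "data return to the unit": 10,
}
loaded_path = dict(idle_path)
loaded_path.update({
    "waiting in controller queues": 150,
    "page opens / read-write turnaround": 30,
})

for label, parts in (("no traffic", idle_path), ("heavy traffic", loaded_path)):
    print(f"{label}: ~{sum(parts.values())} cycles total")
```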
 
My guess is that the number from NVIDIA must be some average latency with average or heavy traffic.
There's also the "average" when G80 is scaled down to make a G84 or G86 and when low-bandwidth DDR2 is used.

As for wasted cycles, this patent application is interesting:

METHOD AND APPARATUS FOR DATA TRANSFER

Basically, it sends an "excess" of write data to the memory chip, and splits the write into two time periods, with a read in the middle - arbitrated by the command (addressing) data sent on the address/control bus. The excess written data gets "buffered" at the memory chip, and can hang around until the relevant commands are received, after the read has been initiated. This way the data bus incurs lowered turn-around latency.

But, as I wrote in the R600 thread, I'm doubtful this will work with GDDR4, unless this buffering is already inside the memory chips. Erm...

Jawed
 