GPU Memory Latency

anjulpa

Newcomer
This is for my senior year project.

Does anyone have any idea of the order of latency of a memory operation on a GPU? I mean, on a cache miss, and in terms of clock cycles? Let's say it's for one texture fetch...

And if someone can point me to a place where I can find more measurements like this, that would be great.

PS. If I'm in the wrong forum, sorry. Kindly move.

Thanks
 
The NVIDIA CUDA docs indicate that there are "200 to 300 clock cycles of memory latency" for memory operations when no caching is involved. If I'm understanding the doc properly, that is the latency as seen from the shader core scheduler, at a clock rate of 600 to 675MHz. It does seem rather high to me, so I'm not sure I'm basing that on the right clocks - I think I am, though... :)
 
Okay, that means NVIDIA must be getting a huge hit rate on their cache. Well maybe L2 covers it all, but isn't this too high?
 
Well, GPUs are all about hiding memory latency, not just reducing the effective latency through caching. The texture caches certainly help for latency, but it could be argued that their primary advantage is to reduce bandwidth requirements.

It's pretty much the size of the register file that helps to hide latency, along with Instruction Level Parallelism - at least on modern GPUs; I won't pretend to know the tricks that might have worked in the pre-DX9 era, though I'd suspect some of them are still used today.

Anyway, let's take the GeForce 8800GTX's case. One 'warp' (32 threads) needs 2 clock cycles (at 675MHz) to complete one instruction. The chip is divided into 16 'multiprocessors', and one possible configuration is 512 threads (16 warps) in flight at any given time on one multiprocessor. That means each independent instruction that can be scheduled after the texture instruction hides 32 clock cycles (16 warps x 2 clocks).

If you don't want to run into bandwidth bottlenecks, you'd likely want a fairly high ratio of arithmetic instructions to texture instructions. Keep in mind the ALUs are scalar on G80. That means two Vec4 instructions that are independent of the texture operation (which also returns up to a Vec4!) correspond to 8 native instructions, and those alone would already hide up to 256 cycles of memory latency.
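
To make that concrete, here's a rough CUDA-flavoured sketch (not real shader code - the names and constants are made up) of what "independent instructions after the fetch" looks like from the compiler's and scheduler's point of view:

```cuda
// Toy CUDA kernel (hypothetical; names and constants are invented).
// The global load stands in for the texture fetch. The multiply-adds that
// follow don't depend on 'texel', so they can be issued while the fetch is
// still in flight; only the final line has to wait for the data.
__global__ void hide_latency(const float4* tex_data, float4* out, float4 c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float4 texel = tex_data[i];              // long-latency fetch

    // Two independent Vec4-style MADs = 8 native scalar instructions on G80.
    float4 a = c, b = c;
    a.x = a.x * 2.0f + 1.0f;  a.y = a.y * 2.0f + 1.0f;
    a.z = a.z * 2.0f + 1.0f;  a.w = a.w * 2.0f + 1.0f;
    b.x = b.x * 3.0f + 1.0f;  b.y = b.y * 3.0f + 1.0f;
    b.z = b.z * 3.0f + 1.0f;  b.w = b.w * 3.0f + 1.0f;

    // First use of 'texel': this is where the warp actually waits.
    out[i] = make_float4(texel.x + a.x + b.x, texel.y + a.y + b.y,
                         texel.z + a.z + b.z, texel.w + a.w + b.w);
}
```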

Now, let us consider caching. Let us assume that out of the 16 warps, 2 of them (2x32 consecutive threads, or 2x8 quads!) manage to have no cache misses whatsoever. At the same time, the 14 other warps haven't gotten their texturing results back yet and have no more independent instructions to run. That means these 2 warps are going to be helping to hide the memory latency for the other warps.

But there are only 1/8th as many warps ready to issue new instructions then, in this specific example. That means that 8 independent scalar instructions from these warps would only hide as much memory latency as 1 independent scalar instruction from the entire set of warps.

There is one extra and very important thing to point out: it's very unlikely that all the warps are running the same instruction at nearly exactly the same time, as assumed above. An intelligent scheduler will try to have different parts of the program running (or ready to run) at the same time, in order to balance a multitude of factors.

For example, if the ALU-to-TMU ratio required at one point in the program is 4:1 while it is 1:1 in another part of the program, this would reduce bursts and you might have an effective ratio of 3:1. The same reasoning holds for latency tolerance and the number of independent instructions available at any given time.

Okay, so I hope this helps you somehow... This explanation is written mostly with the G80 in mind, but much of it also applies to the ATI R5xx series. The G7x and R4xx are a bit different, I believe. Let me know if there's anything I wasn't very clear on - and if there's anything I'm mistaken about, hopefully somebody will chime in to correct me.
 
All R5xx GPUs have 128 batches in flight per shader unit. The shader unit takes 4 clocks executing each instruction, so the total latency hiding available is 512 clocks.

In R580, for example, there are 48 threads in a batch. That's 6144 threads for one shader unit. R580 has four of these units, so in total that's 24576 threads in flight.
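
In case anyone wants to follow the arithmetic, here's a trivial host-side sketch that just multiplies out the figures quoted above:

```cuda
#include <cstdio>

// Back-of-envelope arithmetic using only the figures quoted in this post.
int main()
{
    const int batches_in_flight = 128;  // per shader unit (nominal)
    const int clocks_per_instr  = 4;    // each instruction occupies the unit for 4 clocks
    const int threads_per_batch = 48;   // R580
    const int shader_units      = 4;    // R580

    printf("latency hiding per unit : %d clocks\n", batches_in_flight * clocks_per_instr);
    printf("threads per shader unit : %d\n", batches_in_flight * threads_per_batch);
    printf("threads in flight total : %d\n",
           batches_in_flight * threads_per_batch * shader_units);
    return 0;   // prints 512, 6144 and 24576 respectively
}
```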

Threads are locked to one shader unit. This means if one shader unit runs out of threads to hide latency, the other shader units are still able to execute.

In R5xx GPUs thread-switching is also used to hide the latency associated with branching (i.e. there's no pipeline flush nor branch prediction, the scheduler simply swaps threads).

This means there's a "battle" between texturing complexity and branching complexity for the available clock cycles of latency hiding - the latency in question being either texture operations (the fetch from memory plus filtering, which takes time too) or branching.

---

The "128 batches in flight" is nominal, though. Since the register file has limited capacity, complex threads with, say, 32 registers assigned per thread, will cause a huge reduction in the number of batches in flight - the number of threads assigned to a batch is effectively constant. The size of the register file isn't publically known.

If the register file is nominally organised to support 2 registers per thread (i.e. 12288 registers for a shader unit in R580), then 32 registers per thread would mean only 8 batches can fit their registers into the register file, which leaves only 32 clocks of latency hiding per texture instruction :cry:
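
Spelling that register-pressure arithmetic out (keeping in mind the 2-registers-per-thread nominal sizing is an assumption, not a published figure):

```cuda
#include <cstdio>

// Hypothetical register-pressure arithmetic for one R580 shader unit.
int main()
{
    const int threads_per_batch = 48;
    const int nominal_batches   = 128;
    const int nominal_regs      = 2;                                       // assumed, not disclosed
    const int regfile_entries   = nominal_batches * threads_per_batch * nominal_regs; // 12288

    const int regs_per_thread   = 32;                                      // a "complex" shader
    const int batches_that_fit  = regfile_entries / (regs_per_thread * threads_per_batch); // 8
    printf("%d batches in flight -> %d clocks of latency hiding\n",
           batches_that_fit, batches_that_fit * 4);                        // 8 -> 32
    return 0;
}
```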

Luckily, if you have such complex code (requiring such a high number of registers), it's likely there are tracts of it with no texturing instructions, e.g. there might be only 5 texture instructions in the program but 80 ALU instructions. That 16:1 ratio means that, taken as a whole, the shader program retains the 512 clocks of latency hiding per texture instruction we started with (8 batches x 4 clocks x 16 ALU instructions).

---

Some R5xx GPUs complicate things because a shader unit has three times the ALU instruction rate of its texture instruction rate. R580 can process 12 threads' ALU instructions per clock, but only 4 texture instructions per clock. So prefetching means that issuing the first 4 texture instructions (fetching the first quad's texture data) will populate the cache for the remaining 8 threads' texture instructions. The memory latency cost is therefore significantly ameliorated by texture data being localised in memory and fetched in relatively large lumps.

So, in R580 for example, there's an underlying 3:1 ratio between ALU instructions and texture instructions. This discussion of RV530 should help (R580 and RV530 are both "3:1"):

http://www.beyond3d.com/reviews/ati/rv5xx/index.php?p=01#arch

Jawed
 
Thanks for these wonderful responses! I learned a lot just trying to understand them..

I'm not sure I'm clear on everything, but I agree ILP has been very effectively exploited to hide most of the delays. Still, I'm positive some aggressive caching techniques must be used, and better ones should be welcome. A good cache response would definitely relax the limits on the ALU-to-TMU ratio, among other things. Am I right?

My real question is, is there room for an improved cache architecture (or something similar)?
 
I'm not sure I'm clear on everything, but I agree ILP has been very effectively exploited to hide most of the delays.

I wouldn't say it's ILP being exploited. That's parallelism within a single thread, and thus far the processing units involved do not try much in the way of superscalar execution or OoO execution.

It's more thread-level parallelism being exploited.

Still, I'm positive some aggressive caching techniques must be used, and better ones should be welcome. A good cache response would definitely relax the limits on the ALU-to-TMU ratio, among other things. Am I right?

My real question is, is there room for an improved cache architecture (or something similar)?

I'd say there's always room for improvement in one aspect of performance, but it comes at a cost.
The best way to increase cache effectiveness is to make it bigger, which means it expands at the expense of other parts of the chip.

I can't speak to what aggressive caching techniques you are talking about. GPU caches are already highly associative (fully?), and they are heavily multiported. There's not much else that can make them more aggressive, besides size.

Speculative caching, such as prefetching, is risky. In a highly data-parallel environment, it's easier to just grab for more non-speculative work, rather than choke the pipeline with stuff you might throw away.
 
It's really thread level parallelism that's going on here. In G80, a SIMD instruction is issued for 8 threads in parallel (e.g. pixels), while in R580 it's 12. In both architectures an instruction runs for four consecutive ALU-clocks, effectively dealing with 32 pixels in G80 and 48 in R580. After that, it's likely that the next instruction will be for a completely new set of threads. In fact in R5xx it's guaranteed, since instructions are scheduled AAAABBBBCCCCDDDD etc. (each letter being 12 threads/pixels executing in parallel).
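
If the AAAABBBBCCCCDDDD pattern isn't clear, here's a purely illustrative host-side toy that prints it out with R580-style numbers:

```cuda
#include <cstdio>

// Prints the R5xx-style issue pattern described above: each batch (A, B, C, D)
// owns the ALU for 4 consecutive clocks before the scheduler moves on.
int main()
{
    const int batches = 4, clocks_per_instr = 4, simd_width = 12;  // R580-style numbers
    for (int b = 0; b < batches; ++b)
        for (int c = 0; c < clocks_per_instr; ++c)
            printf("%c", 'A' + b);                                 // AAAABBBBCCCCDDDD
    printf("\n%d threads per instruction (%d lanes x %d clocks)\n",
           simd_width * clocks_per_instr, simd_width, clocks_per_instr);
    return 0;
}
```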

---

Caching in GPUs tends to be trade secret. See Victor Moya's investigations:

http://personals.ac.upc.edu/vmoya/log.html

starting at 23/9/04 and older (actually, scroll to the very bottom!). He posts here as RoOoBo.

Here are some patent applications that I've been looking at recently:

Two level cache memory architecture

3-D rendering texture caching scheme

It's pretty hard to be sure how advanced the caching schemes in current GPUs are. One thing definitely worth remembering is that GPUs use very small amounts of cache memory - they dedicate far more memory to per-thread state (i.e. the registers and, in older GPUs, the latency-hiding FIFOs within the shading pipeline). L2 is typically measured in the low KBs, and it seems the newest GPUs distribute L2 rather than using a monolithic blob.

G80's cache architecture is a major departure (the entire architecture is a huge shift from previous NVidia GPUs) and it's still got some mystery to it. You can glean a fair amount from this thread and NVidia's CUDA documentation referenced there:

http://www.beyond3d.com/forum/showthread.php?t=38770

My guess is that L2 in G80 is shared by multiple clients: texture, constant, vertex and colour/z/stencil. The latter may actually have dedicated caches though, since it's read-write. And there's also the question of how streamout is supported. It's write-only but it seems pretty likely it's cached.

Jawed
 
Actually, there's both instruction and thread level parallelism going on. Consider that the latency of TMU instructions can be hidden by executing ALU instructions in the meantime on both R5xx and G8x; if that's not ILP, I don't know what is! But it's quite a simplistic (read: cheap) approach to ILP, and GPUs are much more dependent on TLP than CPUs.

It is true, however, that the trend is towards better ILP extraction mechanisms, and this is already the case in G80 AFAIK - I have yet to properly test this, but I believe based on Lindholm's patent that multiple ALU instructions from the same thread can execute simultaneously in the same ALU, and that there actually is a small instruction window in addition to the scoreboard. The main goal of ILP in GPUs is to reduce the number of simultaneous threads required, thus saving transistors by reducing the size of the register file.

If you are more familiar with CPU architectures, think of how Sun's Niagara architecture helps to hide memory latency. But instead of having 4 threads per core, you have 16 to 64 threads, and each of them works on a Vec8 instruction. The programming model allows graphics programmers to think of it as a scalar architecture with a branching coherence requirement of 32 (8x4) 'threads'.
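
To illustrate that branching coherence requirement, here's a hypothetical CUDA kernel (names made up) contrasting a warp-coherent branch with a divergent one:

```cuda
// When all threads of a warp agree on a branch, only one path is executed;
// when they disagree, the warp pays for both paths with lanes masked off.
__global__ void branch_coherence(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Warp-coherent branch: (i / 32) is identical for all threads of a warp,
    // so each warp takes exactly one side.
    if ((i / 32) % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;

    // Divergent branch: threads within a warp disagree, so both sides are
    // executed (serialised / predicated), roughly halving throughput here.
    if (i % 2 == 0)
        out[i] += 0.5f;
    else
        out[i] -= 0.5f;
}
```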

As for the future of caching on GPUs and if there's any place for larger or more advanced caches... First, let's consider how it works right now. You've got two dedicated L1 caches per multiprocessor: one for textures and one for constants. Both are tuned for different access patterns. The texture cache at the L1 level works on uncompressed data.

At the L2 level, public information is a bit more scarce... I would tend to believe it's a unified cache for most kinds of data. It is known, however, that at this level texture data is kept compressed in the cache, i.e. in 4x4 blocks of 64 or 128 bits (S3TC/DXTC)... The logical reason data isn't kept compressed in L1 is that too much decoding logic would be required.
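
For reference, the block-size arithmetic works out like this (standard S3TC/DXTC figures, nothing vendor-specific):

```cuda
#include <cstdio>

// A 4x4 texel block is 64 bits for DXT1 and 128 bits for DXT3/DXT5, versus
// 512 bits for the same 16 texels stored as uncompressed RGBA8.
int main()
{
    const int texels = 4 * 4;
    const int dxt1_bits = 64, dxt5_bits = 128, rgba8_bits = 32 * texels;
    printf("DXT1: %d bits/texel, %dx smaller than RGBA8\n",
           dxt1_bits / texels, rgba8_bits / dxt1_bits);   // 4 bits/texel, 8x
    printf("DXT5: %d bits/texel, %dx smaller than RGBA8\n",
           dxt5_bits / texels, rgba8_bits / dxt5_bits);   // 8 bits/texel, 4x
    return 0;
}
```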

Anyway, right now as I said the cache is mostly there to save bandwidth AFAIK, not hide latency - I'm not aware of any modern GPU architecture that relies on caching to hide latency. For example, the original 3DFX Voodoo didn't work like that iirc, even though it was the first to introduce a texture cache to the mainstream market. But that texture cache was incredibly important anyway, because it allowed them to have remarkably good bilinear filtering performance. And that was really what made it such a "killer product", from my point of view.

So, let's consider G80 again from that perspective. It has 64 filtering units, which is an incredible increase compared to NVIDIA's previous architectures, and an even bigger one compared to ATI's current architecture. There was a substantial bandwidth increase for G80, but nothing quite that dramatic; so, why did NVIDIA increase the number of TMUs so much, if you'd expect them to be bandwidth limited? Because their texture cache is saving the day.

Anisotropic filtering is what G80's TMU architecture is engineered around, and it tends to be fairly cache-coherent. The G80's texture caches are also significantly larger than those of any GPU that preceded it. These two factors combined largely explain why NVIDIA felt they could put so many TMUs/filtering units on the chip and not be massively bandwidth limited, as far as I can see.

Obviously, your focus for texture caches has been more on the latency side of things. They're also interesting there on GPUs, but much less so than on CPUs. As I explained in my previous post, all the threads (=pixels) in a warp (=batch) need to have no cache misses for a given memory access, or you're not saving anything on latency. Texturing over a batch of pixels tends to be fairly coherent, so it can happen. But you won't see it happening anywhere near 90%+ of the time, like you'd expect from a CPU.

In the future, I would expect batch coherence to stabilize at around 16 threads/batch. R520 was already there, but G80 and R580 are at 32 and 48 respectively for pixel shaders. Because pixels work in quads, 16 threads amounts to only four distinct texture operations. So I would expect the caches to increasingly affect effective latency tolerance, but it's hard to say for sure without hard data - which you won't have unless you work in NVIDIA's or AMD's GPU division.

So, given your position, I would assume you're mostly interested in what kind of research would be useful for GPU texture caches, and GPU caches in general (constants, instructions, z, colours, etc.) - but I fear I can't give definitive answers to that question. Here are some of the things that have likely been implemented in the last 5 years at NVIDIA and ATI:
- Tuning the texture cache's behaviour for anisotropic filtering (and trilinear filtering, obviously).
- Implementing cheap and efficient decompression of S3TC data located in the L2 cache.
- Implementing different behaviours for different kinds of caches, including constant caches.
- Achieving a good balance of locality and associativity in all major kinds of cache systems.
- And, obviously, adding caches where they didn't need any in the past (instructions, colour, etc.)

Achieving even just part of these things would already be a fairly substantial research project, I'd imagine. There obviously are other areas, though. Even excluding related areas (such as researching new algorithms to replace or complement S3TC - just ask Simon F on these very forums if you think it's easy!), I'm sure some things could be figured out.

It is important to understand, however, that anything that could likely be figured out at this point is probably evolutionary, not revolutionary. The cache system in a modern GPU is hardly naive; it delivers excellent results given the chip's overall architecture already, and at fairly minimal costs. It certainly could be improved upon, but I fail to imagine anything that'd change things drastically. Of course, that's always what people say before something revolutionary comes up - but for now, I'll have to remain skeptical! :)
 
Actually, there's both instruction and thread level parallelism going on. Consider that the latency of TMU instructions can be hidden by executing ALU instructions in the meantime on both R5xx and G8x; if that's not ILP, I don't know what is! But it's quite a simplistic (read: cheap) approach to ILP, and GPUs are much more dependent on TLP than CPUs.

Hiding latency of multi-cycle operations in the pipeline by compiling code that pads the following slots with non-dependent instructions is ILP in the sense that pipelining exploits ILP.
A GPU doesn't differentiate itself from any architecture with a pipeline and instructions with latencies>1.

Technically, that is still ILP, which is usually glossed over in favor of more explicit hardware techniques.
Has it been shown that G80 actually runs ahead within a thread? I was under the impression that if the driver can't find a way to pad around a texture op that it results in that single thread stalling.
 
Hiding latency of multi-cycle operations in the pipeline by compiling code that pads the following slots with non-dependent instructions is ILP in the sense that pipelining exploits ILP.
Yeah, it could definitely be said that it's not very different from any pipelined design that doesn't hide latency exclusively through TLP. I would tend to believe some old GPUs work like that, but good luck knowing for sure. I would, at the very least, believe that some (most/all?) vertex shader implementations work like that.

Has it been shown that G80 actually runs ahead within a thread? I was under the impression that if the driver can't find a way to pad around a texture op that it results in that single thread stalling.
Can you clarify exactly what you mean by "run ahead within a thread"? I want to make sure I'm not misunderstanding that. Is your question roughly similar to "How would a single warp reserving 16KiB of shared memory behave in CUDA"? The performance characteristics of that would indeed be quite interesting!

The Lindholm patents actually make me think there is a window of 2 instructions (at least), but just creating a test case for that in CUDA would be a pretty good way to know for sure. I'll have to do that tomorrow or in the coming days, definitely.
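
Something along these lines is what I have in mind - just a sketch, with arbitrary sizes and loop counts, and the actual timing would be done host-side around the launch:

```cuda
// A long-latency load followed by a configurable amount of ALU work that does
// NOT depend on the load. If runtime stays flat as INDEP_OPS grows (with only
// one warp resident), the independent instructions are being overlapped with
// the load from within the same thread.
#define INDEP_OPS 8

__global__ void ilp_probe(const float* src, float* dst, int stride)
{
    int i = threadIdx.x;

    float loaded = src[i * stride];        // long-latency (hopefully uncached) load

    float acc = 1.0f;                      // independent work: no use of 'loaded'
    #pragma unroll
    for (int k = 0; k < INDEP_OPS; ++k)
        acc = acc * 1.000001f + 0.5f;

    dst[i] = loaded + acc;                 // first dependence on the load
}
```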
 
Hiding latency of multi-cycle operations in the pipeline by compiling code that pads the following slots with non-dependent instructions is ILP in the sense that pipelining exploits ILP.
The key point with R5xx and G8x is that the parallel instructions (parallel to the texturing) can come from any batch. Older GPUs were stuck using threads from the current batch.

I suppose you can argue that LOD/BIAS, Address, Fetch, Filter for a single thread constitutes a "macro instruction" that lasts several hundred clocks - but is capable of running in much less if the texture cache obliges.

Jawed
 
Can you clarify exactly what you mean by "run ahead within a thread"? I want to make sure I'm not misunderstanding that. Is your question roughly similar to "How would a single warp reserving 16KiB of shared memory behave in CUDA"? The performance characteristics of that would indeed be quite interesting!
I was trying to clarify the distinction between ILP and TLP. With a heavily threaded GPU, the lines get a little blurry.
What I was trying to clarify was that using the ALU ops to hide texture operation latency would be ILP if G80 were pulling them from the same thread.
If reserving 16KiB per warp is sufficient to isolate what happens for a single SIMD group, so that G80 can't try to overlay that warp's execution with another's, then I think that's a roughly equivalent situation.

In that case, the only way G80 could extract ILP is if it is able to run ahead in the code stream and pluck out ALU ops from further in the instruction stream.
If it pulled from another warp, then it would be TLP at work.


The key point with R5xx and G8x is that the parallel instructions (parallel to the texturing) can come from any batch. Older GPUs were stuck using threads from the current batch.

Then that would be closer to TLP, not ILP in the point I was addressing.
 
I was trying to clarify the distinction between ILP and TLP. With a heavily threaded GPU, the lines get a little blurry.
What I was trying to clarify was that using the ALU ops to hide texture operation latency would be ILP if G80 were pulling them from the same thread.
Ahhh, okay. And yes, my understanding is that G80 can pull it from the same thread. But it warrants getting tested properly to make sure.
If reserving 16KiB per warp is sufficient to isolate what happens for a single SIMD group, so that G80 can't try to overlay that warp's execution with another's, then I think that's a roughly equivalent situation.
Yeah, it hopefully should be. For CUDA, the number of warps running on a single multiprocessor is limited by the number of registers needed per thread *and* the amount of shared memory needed per block. If you issue a grid with many blocks, each block containing only one warp and reserving 16KiB of shared memory, then you should effectively get rid of TLP.
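
As a sketch (the exact shared memory array size is a guess at "as much as the allocator will give us", since a little of the 16KiB is reserved by the hardware):

```cuda
// 32 threads (one warp) per block, with each block hogging nearly all of the
// shared memory so no second block can be resident on the same multiprocessor.
__global__ void lone_warp(const float* in, float* out)
{
    __shared__ float hog[3900];            // ~15.2KiB of shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    hog[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = hog[threadIdx.x] * 2.0f;
}

// Launch with many blocks of a single warp each, e.g.:
// lone_warp<<<4096, 32>>>(d_in, d_out);
```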
 
Anyway, right now as I said the cache is mostly there to save bandwidth AFAIK, not hide latency - I'm not aware of any modern GPU architecture that relies on caching to hide latency.
It's both really. You need a certain size cache/FIFO to hide memory latency. Say your latency to main memory is 200 cycles. A thread executes a texture fetch and is placed in a thread buffer/reservation station to await the data's return. In the meantime you issue many more texture fetches. As many as your internal FIFOs and thread buffer allow. If the cache is not large enough to store 200 cycles worth of texture data then execution will stall.

Of course, additional latency can be hidden by ALU work, etc. as mentioned above, but you'll want to operate at the peak rate of your memory bandwidth so you want a cache to be large enough to hide latency.

A simple conclusion is that at one time FIFOs were used to hide memory latency, but once they got big enough to save bandwidth as well, using a cache made sense.
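
A back-of-envelope way of putting it (numbers purely illustrative):

```cuda
#include <cstdio>

// To keep the memory interface busy you need roughly latency x bandwidth worth
// of data in flight, and somewhere to put it when it returns.
int main()
{
    const double latency_cycles  = 200.0;   // round-trip memory latency, in clocks
    const double bytes_per_cycle = 128.0;   // hypothetical peak bandwidth per clock
    printf("data in flight to cover latency: %.0f bytes (~%.0f KiB)\n",
           latency_cycles * bytes_per_cycle,
           latency_cycles * bytes_per_cycle / 1024.0);   // 25600 bytes, ~25 KiB
    return 0;
}
```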
 
Anyway, right now as I said the cache is mostly there to save bandwidth AFAIK, not hide latency.

Agreed, with some nuance: a big factor in whether or not a cache can help hide latency is the system's ability to handle out-of-order transactions. OoO makes it possible for hits that come after a miss to get in front of the miss and thus still reduce the overall latency. The extent to which this happens is very much dependent on the traffic pattern and hit rate, though.

You need a certain size cache/FIFO to hide memory latency. Say your latency to main memory is 200 cycles. A thread executes a texture fetch and is placed in a thread buffer/reservation station to await the data's return. In the meantime you issue many more texture fetches. As many as your internal FIFOs and thread buffer allow. If the cache is not large enough to store 200 cycles worth of texture data then execution will stall.

I don't really see why you'd need a cache in your example to hide latency. When the fetch latency increases, you just have to increase the FIFO accordingly, but the presence of a cache isn't a requirement. Unless your external bandwidth is much lower than the one provided by the cache, but that's a different story.
 
I don't really see why you'd need a cache in your example to hide latency. When the fetch latency increases, you just have to increase the FIFO accordingly, but the presence of a cache isn't a requirement. Unless your external bandwidth is much lower than the one provided by the cache, but that's a different story.
I agree a cache isn't a requirement, but they are used to hide latency.
 
It's both really. You need a certain size cache/FIFO to hide memory latency. [...] If the cache is not large enough to store 200 cycles worth of texture data then execution will stall.
You certainly need a FIFO for some things, but the G80 for example (and I would assume recent AMD architectures too, but I don't know for sure) uses its register file creatively to reduce the necessary FIFO size. Search for "Register based queuing for texture requests" here: http://appft1.uspto.gov/netahtml/PTO/search-bool.html

And here's the abstract, since nearly everything you need to know is there:
A graphics processing unit can queue a large number of texture requests to balance out the variability of texture requests without the need for a large texture request buffer. A dedicated texture request buffer queues the relatively small texture commands and parameters. Additionally, for each queued texture command, an associated set of texture arguments, which are typically much larger than the texture command, are stored in a general purpose register. The texture unit retrieves texture commands from the texture request buffer and then fetches the associated texture arguments from the appropriate general purpose register. The texture arguments may be stored in the general purpose register designated as the destination of the final texture value computed by the texture unit. Because the destination register must be allocated for the final texture value as texture commands are queued, storing the texture arguments in this register does not consume any additional registers.
 