GPU Memory Latency

Discussion in 'Architecture and Products' started by anjulpa, Mar 4, 2007.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Is the other reason thread synchronization?
     
  2. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    I only read the abstract, but I don't see how this patent states anything contradicting what I've said. Register files are large while cache and FIFOs can use custom memories that are smaller. It makes sense to use a combination of techniques to hide latency.
     
  3. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,563
    Likes Received:
    171
    Location:
    In the Island of Sodor, where the steam trains lie
    No. Pascal had the correct answer.
     
  4. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    I thought there was some obscure reason you were thinking of, since bandwidth had already been mentioned as a reason.
     
  5. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Sorry if I was unclear - I didn't mean to imply it was contradicting what you said! :) I simply wanted to add some extra colour to it with an extra real-world implementation example that highlights this combination of techniques.
     
  6. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,563
    Likes Received:
    171
    Location:
    In the Island of Sodor, where the steam trains lie
    My apologies, I missed Arun's post. Although I was thinking that there are two ways it helps.

    The first is because texels (especially when performing filtering on neighbouring pixels) are frequently re-used (unless you turned MIP mapping off :evil: ). The second is that the external bus read granularity is so large these days that you have to read dozens of texels just to access the few you might need for a particular pixel.
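    To put rough numbers on the re-use point, here is a toy sketch (assuming a 1:1 pixel:texel mapping, which is roughly what MIP mapping maintains, and plain bilinear filtering; the footprint layout is illustrative, not any specific hardware's):

```python
# Toy count of texel fetches for a 2x2 pixel quad with bilinear filtering,
# assuming a 1:1 pixel:texel mapping (roughly what MIP mapping maintains).

def bilinear_footprint(px, py):
    """Texels touched by one pixel's bilinear sample (2x2 footprint)."""
    return {(px + dx, py + dy) for dx in (0, 1) for dy in (0, 1)}

quad = [(0, 0), (1, 0), (0, 1), (1, 1)]  # a 2x2 quad of pixels
naive = sum(len(bilinear_footprint(x, y)) for x, y in quad)
unique = len(set().union(*(bilinear_footprint(x, y) for x, y in quad)))

print(naive, unique)  # 16 naive fetches vs 9 unique texels (a 3x3 union)
```

    The quad naively issues 16 texel fetches but only touches 9 unique texels, so even a tiny cache eliminates nearly half the fetches.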
     
  7. anjulpa

    Newcomer

    Joined:
    Oct 21, 2006
    Messages:
    19
    Likes Received:
    1
    Thanks for the links! I've already read most of the papers Victor mentioned. I'll take a look at the patents too..

    Are you sure the L2 cache size is in KBs? Both of the following papers suggest it's a good idea for the L1 to be in KBs, and the second architecture uses an L2.

    http://graphics.stanford.edu/papers/texture_cache/

    http://ieeexplore.ieee.org/xpl/abs_free.jsp?arNumber=997855

    Thanks, your observations and ideas are really nice. I have some questions though:-

    1. Is latency really of little or no importance? I mean, a design that helps reduce its effective value must surely help..

    2. Can it be assumed that texture data in L2 is in most cases S3TC compressed?

    I find it hard to understand. I'm sorry if my questions are persistent and naive, but I really want to understand the difference between "hiding latency" and "reducing latency". Agreed, the former is more important and efficient parallelism is the way to achieve it, but does the latter hold any significance in design?

    Thanks again

    Anjul
     
  8. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    I'm not sure I'm remembering this right, but I think the G80 has <=128KiB L2 and <= 8x8KiB L1. As I said, I can't remember the exact numbers, and this might just be for textures.
    Think of it this way: in order to hide more effective latency on a GPU, you need more threads. That means a larger register file. So, the question is, how big is the register file on the latest GPUs? It's probably not a very large percentage of the chip. Thus, what needs to be achieved is an ideal balance between minimizing effective latency and hiding it. I would assume current implementations to already be very good at that, but I don't have the insider knowledge necessary to make sure of that.
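    A back-of-the-envelope version of that trade-off (every number here is hypothetical, not a real G80 figure):

```python
# Hypothetical sizing: threads needed to cover a memory miss, and the
# register file that many resident threads would cost. None of these
# numbers are real G80 figures.

miss_latency_cycles = 200      # assumed effective memory latency
alu_cycles_per_fetch = 20      # assumed ALU work each thread has per fetch
regs_per_thread = 16           # assumed live registers per thread
bytes_per_reg = 16             # one 4-component fp32 register

threads_needed = miss_latency_cycles // alu_cycles_per_fetch
regfile_bytes = threads_needed * regs_per_thread * bytes_per_reg

print(threads_needed, regfile_bytes)  # 10 threads, 2560 bytes
# Halve the effective latency (e.g. via cache hits) and you halve the
# threads, and hence register file, needed -- that's the balance.
```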
    Yes, it can safely be assumed that L1 is uncompressed and L2 is compressed. I'm pretty sure NVIDIA has publicly confirmed this, and I would assume it to also be the case for AMD...
    Reducing effective or average latency helps, and as I said above, having a balanced architecture is the goal. Adding a multi-MiB cache wouldn't make sense though, considering the diminishing returns it would deliver.

    Clearly, prefetching with a texture cache of a reasonable size helps. But you need to make sure your prefetching isn't too aggressive, otherwise you'll waste bandwidth. Once again, that's all about balance. Googling around, I found a good presentation that has some information on this subject, and many others: http://www-csl.csres.utexas.edu/use...ics_Arch_Tutorial_Micro2004_BillMarkParts.pdf

    Pages 64 to 71 are likely what will interest you most. It is noted that the cache miss rate is ">10%", which is actually a bit lower than I would have assumed it to be in practice. I would expect that this could definitely help for effective latency, even with a batch size of 16-32 pixels (4-8 quads, from a miss rate perspective...) since nearby pixels/quads should be fairly coherent.

    So, yes, the caches can help for latency on a modern GPU - but you shouldn't overoptimize them for that, either. You should ideally optimize the sizes of your register file and of your texture cache together - and keep in mind that performance even with massive miss rates needs to be acceptable, because some workloads may have much poorer memory coherence.
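    The miss-rate point can be made concrete with a sketch of effective (average) latency; the cycle counts are assumptions for illustration only:

```python
# Effective (average) latency as a function of cache hit rate. The hit and
# miss cycle counts are assumptions for illustration.

def effective_latency(hit_rate, hit_cycles=20, miss_cycles=400):
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles

print(effective_latency(0.90))  # ~58 cycles at a 10% miss rate
print(effective_latency(0.50))  # 210.0 at 50% -- a poor-coherence workload
```

    A 10% miss rate already cuts the average latency to be hidden by a large factor versus all-miss traffic, but a poor-coherence workload at 50% misses still needs acceptable performance, which is the over-optimization warning above.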
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    For DX9 GPUs and older, yes. L1 is in the range of hundreds of bytes, at most.

    D3D10 and newer GPUs place much more emphasis on vertex and constant data (as well as upping the limits on texture data). The access patterns end up being more complex because of the high limits (e.g. 128 textures per pixel or support for thousands of constants bound to a single shader program) and it seems that caching in these newest GPUs has to be significantly more complex in order to cope.

    Basically the shader pipeline needs to be able to fetch arbitrary constant, vertex and texture data with the latency for each fetch being entirely hidden.

    In general, the amount of data goes: constant < vertex < texture - there's perhaps an order of magnitude difference from one to the next. The access patterns for constants are the most complex, since there's little reason for spatial locality in memory fetches. Vertex data will tend to consist of serially fetched streams, with up to 8 in parallel (I think). While texture data will tend to be fetched in localised tiles from each texture in memory.

    So the cache treatment for each of these types of data needs to be different.
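    As an illustration of the tile-local texture pattern, one well-known way to keep 2D-adjacent texels adjacent in memory is a swizzled (Morton / Z-order) layout; the actual layouts used by real GPUs are proprietary, so this is only a sketch:

```python
# Sketch of a Morton (Z-order) swizzle: interleaving the bits of x and y
# keeps 2D-adjacent texels close together in linear memory, which is one
# classic way to make tiled texture fetches cache- and DRAM-friendly.

def morton_index(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one linear index."""
    idx = 0
    for i in range(bits):
        idx |= ((x >> i) & 1) << (2 * i)      # x bits land in even positions
        idx |= ((y >> i) & 1) << (2 * i + 1)  # y bits land in odd positions
    return idx

# A 2x2 block of texels maps to 4 consecutive addresses:
print([morton_index(x, y) for y in (0, 1) for x in (0, 1)])  # [0, 1, 2, 3]
```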

    Looking at G80, it seems to use much larger amounts of cache than previous generations of GPU, because:
    1. TLP has the relatively unpleasant side-effect of increased cache-thrashing, so increased size and set-associativity will soften the blow
    2. cache has to cope with the much harsher demands of D3D10
    This is how I think G80's caches are configured:
    • 6 memory channels (of 64-bits each), each of which has 16KB of L2 cache (vague on quantity - perhaps used by the ROPs, as well - hard to tell)
    • 16 processors, each has:
      • 16KB of parallel data cache
      • 8KB of constant cache
      • 8KB of 1D (vertex) cache
      • an unknown amount of TMU cache
    G80 seems to have a 64KB block of constant memory (cached by the 8KB per processor constant cache described above).

    Jawed
     
  10. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    Yep. An unfortunate side effect of striving for higher and higher amounts of bandwidth. It's a good thing the typical GPU workload doesn't fetch random pixels from all over the screen.
     
  11. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    You hide latency by having multiple active threads and multiple outstanding reads. If you can schedule enough threads, you should be able to avoid idle cycles on your execution engine. Your latency doesn't reduce at all.

    You can reduce latency by storing data in the cache, so when you have a hit, the data returns quicker. The latency is really shorter. But as noted before, this won't work very well if you have hits mixed with misses and you need your data in order.

    Both techniques require memory and, as Arun wrote, for each type of application, there must be some kind of optimum, but latency hiding is probably more effective since the memory required scales linearly with the latency that you want to cover.

    As an aside, CPUs have almost no independent fetching threads, so they have to rely on latency reduction with a cache. And since the data in a CPU is also needed in order, a miss results in huge performance drops; that's why the caches have to be so big.
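    A toy model of the latency-hiding side (all cycle counts are made up):

```python
# Toy utilisation model: each thread does `work` ALU cycles, then waits
# `latency` cycles on a fetch. More resident threads fill the wait with
# other threads' work until the execution unit saturates.

def utilization(threads, work=4, latency=100):
    return min(1.0, threads * work / (work + latency))

print(utilization(1))   # ~0.04 -- one thread leaves the unit mostly idle
print(utilization(26))  # 1.0 -- 26 threads fully cover the 100-cycle latency
# Note the fetch still takes 100 cycles either way: the latency is hidden,
# not reduced.
```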
     
    #31 silent_guy, Mar 8, 2007
    Last edited by a moderator: Mar 8, 2007
  12. anjulpa

    Newcomer

    Joined:
    Oct 21, 2006
    Messages:
    19
    Likes Received:
    1
    Thank you guys.. it was great reading all your opinions. I think I get the point now...


    Thanks

    Anjul
     
  13. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    The main component of memory latency is the cycles wasted waiting in queues before the memory request is actually serviced by the GDDR and returned to the requesting unit, at least when 'interesting things' are happening (when there is a significant amount of memory traffic). So the way to reduce latency is to work around those queues, and those queues live in the memory controller. And when you take into account the related penalties that aren't directly part of 'latency' but that increase the number of cycles with no data traffic to or from the GDDR chip (opening pages, scheduling write and read commands, etc.), a bad implementation of the memory controller can really increase the service wait time quite a lot.
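    A crude way to see how queueing blows up the wait time is an M/D/1-style estimate of queue delay versus memory controller utilisation (service time and loads are illustrative, not measurements of any real controller):

```python
# M/D/1-style estimate of mean queueing delay in front of a single
# deterministic server (e.g. one DRAM channel): wait = rho*s / (2*(1 - rho)).
# Service time and load levels are illustrative, not measured.

def queue_wait(rho, service_cycles=4):
    """Mean wait in queue at utilisation rho (0 <= rho < 1)."""
    return (rho * service_cycles) / (2.0 * (1.0 - rho))

for rho in (0.5, 0.9, 0.98):
    print(rho, queue_wait(rho))
# ~2 extra cycles at 50% load, ~18 at 90%, ~98 at 98%: a raw two-digit
# latency balloons into three digits purely from queueing under load.
```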

    Point-to-point latency (GDDR to requesting unit) with no memory traffic (and no penalties) in a GPU is likely to be in the two digits (getting data from the GDDR into the chip must already be ~10) rather than three digits. My guess is that the number from NVIDIA is some average latency under average or heavy traffic.
     
    Jawed likes this.
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    There's also the "average" when G80 is scaled down to make a G84 or G86 and when low-bandwidth DDR2 is used.

    As for wasted cycles, this patent application is interesting:

    METHOD AND APPARATUS FOR DATA TRANSFER

    basically it sends an "excess" of write data to the memory chip, and splits the write into two time periods, with a read in the middle - arbitrated by the command (addressing) data sent on the address/control bus. The excess written data gets "buffered" at the memory chip, and can hang around until the relevant commands are received, after the read has been initiated. This way the data bus incurs lowered turn-around latency.

    But, as I wrote in the R600 thread, I'm doubtful this will work with GDDR4, unless this buffering is already inside the memory chips. Erm...

    Jawed
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.