Minimum fetch from DRAM on RV770?

Discussion in 'Architecture and Products' started by dkanter, Aug 22, 2009.

  1. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    OK, so I spent a while trying to find out how much data is fetched for a load from memory, but couldn't find jack shit (thanks for the lack of any documentation ATI!).

    The GT200 fetches 64B from DRAM for each memory request (http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=10), which is the same as the cache line size on an x86 processor.

    How much data does the RV770 fetch from DRAM for each memory request? I don't think it's the same, since RV770 has GDDR5 which bursts more data than GDDR3 (I believe).

    David
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    GDDR5 has a burst length of 8 (as stated in the Qimonda whitepaper you have) and there are four controllers that are 64 bits wide, so 64B.

    Jawed
     
  3. bowman

    Newcomer

    Joined:
    Apr 24, 2008
    Messages:
    141
    Likes Received:
    0
    Ooh, RV770 article.

    You played your hand, now I'll be F5ing RWT until my fingers bleed. Thanks a bunch. :lol:
     
  4. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,561
    Likes Received:
    601
    Location:
    New York
    Or you could just follow dk on Twitter :lol:
     
  5. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    I don't think that's how it's calculated. A memory request goes to a single memory controller and should only access 1-4 DRAMs (for locality's sake), not all four controllers.

    At least that was my understanding of how it works for GT200, and I'd assume it works the same way for RV770. Again, I don't know for sure, but that's my belief based upon what I read.

    David
     
  6. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    LOL....twitter : )

    David
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    On RV770 with GDDR5, a single memory controller addresses 2 chips simultaneously, 32 bits each. Or, in clamshell configuration, it addresses 4 chips simultaneously, 16 bits each. The burst length is always 8.

    GDDR3 has burst lengths of either 4 or 8, a completely free choice in the mode of operation of the controller, as far as I can tell.
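
    The arithmetic behind those figures can be sketched as follows. This is only a worked calculation, not anything from ATI documentation: the helper name burst_bytes is made up for illustration, and the GDDR3 rows assume the same 2 chips x 32 bits per controller, which is an assumption rather than something stated above.

```python
def burst_bytes(chips, bits_per_chip, burst_length):
    """Bytes delivered by one burst across all chips driven by one controller."""
    total_bits = chips * bits_per_chip * burst_length
    return total_bits // 8

# GDDR5 on RV770 as described above: burst length is always 8.
gddr5_normal = burst_bytes(chips=2, bits_per_chip=32, burst_length=8)     # 64
gddr5_clamshell = burst_bytes(chips=4, bits_per_chip=16, burst_length=8)  # 64

# GDDR3 with burst length 4 or 8 (ASSUMED: same 64-bit controller width).
gddr3_bl4 = burst_bytes(chips=2, bits_per_chip=32, burst_length=4)        # 32
gddr3_bl8 = burst_bytes(chips=2, bits_per_chip=32, burst_length=8)        # 64

print(gddr5_normal, gddr5_clamshell, gddr3_bl4, gddr3_bl8)
```

    Either GDDR5 configuration comes out at 64B per burst per controller, which is where the 64B figure comes from.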

    Jawed
     
  8. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    That's not what I'm asking...

    If I issue a load instruction on RV770, how many bytes of data are fetched from memory? In CPUs, you have 64B cache lines and each fetch from memory is for 1 cache line (or possibly more).

    DK
     
  9. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    It's not hard to write a test to figure it out.
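
    The idea behind such a test can be sketched without a GPU. What follows is a pure-Python model, not a real benchmark: it reads one 4-byte word per stride and counts how many fetch-sized lines get touched. All names here (lines_touched, efficiency, find_knee, HIDDEN_LINE_BYTES) are hypothetical, and the 64B line size is a stand-in for the unknown being measured. On real hardware you would time a kernel at each stride and look at achieved bandwidth instead.

```python
def lines_touched(num_loads, stride_bytes, line_bytes, word_bytes=4):
    """Count distinct fetch-granularity lines touched by strided word loads."""
    lines = set()
    for i in range(num_loads):
        addr = i * stride_bytes
        lines.add(addr // line_bytes)                     # line of first byte
        lines.add((addr + word_bytes - 1) // line_bytes)  # line of last byte
    return len(lines)

def efficiency(num_loads, stride_bytes, line_bytes, word_bytes=4):
    """Useful bytes divided by bytes actually fetched from memory."""
    useful = num_loads * word_bytes
    fetched = lines_touched(num_loads, stride_bytes, line_bytes, word_bytes) * line_bytes
    return useful / fetched

def find_knee(strides, measure):
    """First stride at which doubling the stride no longer lowers efficiency."""
    for s in strides:
        if measure(2 * s) >= measure(s):
            return s
    return None

HIDDEN_LINE_BYTES = 64  # ASSUMPTION: stands in for the hardware's real fetch size

def measure(stride):
    return efficiency(1024, stride, HIDDEN_LINE_BYTES)

for stride in (4, 8, 16, 32, 64, 128):
    print(stride, measure(stride))

print("estimated fetch size:", find_knee((4, 8, 16, 32, 64, 128), measure))
```

    Efficiency halves each time the stride doubles, then flattens once the stride reaches the fetch size, so the knee gives the answer. Real measurements are noisy, so an actual test would need a tolerance rather than the exact comparison used here.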
     
  10. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    Not being much of a programmer myself, I'm not sure I'd agree.

    Rather, it might be relatively easy for you, but I doubt it would be for me.

    David
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Define a "load instruction".

    On a GPU, a load instruction in a kernel "runs simultaneously" for every strand.

    Within each RV770 cluster, 64 strands are fetching "per instruction", but the texture fetch hardware can only do a fetch for 4 strands per cycle. The minimum fetch per strand is 4 bytes. So 16B per clock per thread is the absolute minimum, and there are 10 threads able to fetch per clock.

    The fetch goes through L1 in this case and the maximum fetch rate is 480GB/s theoretically, although observed to be 444GB/s:

    http://forum.beyond3d.com/showthread.php?t=54842

    480GB/s results from 10 clusters * 4 strands * 16 bytes (vec4 fp32 = 128-bits) * 750MHz.

    L2->L1 bandwidth is only 384GB/s.
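
    Spelled out as a quick calculation, using only the figures quoted in this post. The bytes-per-clock number for L2->L1 is derived by ASSUMING that path runs at the same 750MHz clock, which is not stated anywhere above.

```python
CLUSTERS = 10
STRANDS_PER_CLOCK = 4        # texture fetch handles 4 strands per cycle
BYTES_PER_TEXEL = 16         # vec4 fp32 = 128 bits
CLOCK_HZ = 750_000_000       # 750MHz

# Theoretical peak L1 fetch rate in bytes per second.
peak_l1 = CLUSTERS * STRANDS_PER_CLOCK * BYTES_PER_TEXEL * CLOCK_HZ

# Bytes fetched per cluster per clock at peak.
per_cluster_per_clock = STRANDS_PER_CLOCK * BYTES_PER_TEXEL

# Observed rate from the linked thread, as a fraction of peak.
observed_l1 = 444_000_000_000
observed_fraction = observed_l1 / peak_l1

# L2->L1 bandwidth expressed per clock (ASSUMES the same 750MHz clock).
l2_to_l1 = 384_000_000_000
l2_to_l1_per_clock = l2_to_l1 // CLOCK_HZ

print(peak_l1 // 10**9, "GB/s peak")
print(per_cluster_per_clock, "B per cluster per clock")
print(observed_fraction, "of peak observed")
print(l2_to_l1_per_clock, "B per clock from L2 to L1")
```

    So the observed 444GB/s is 92.5% of the theoretical peak, and under the same-clock assumption 384GB/s works out to 512B per clock in total.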

    As I already explained, the burst size from GDDR5 is 64B, so it makes sense that L2 has 64B lines. I don't know what the line size is in L1, but since a single fetch is, at most, 4 * 128 bits (64B) per clock, a 64B line seems logical.

    prunedtree was able to obtain 880 GFLOPS for matrix multiply by carefully tuning the fetch pattern, so that all of a fetch line is used and is needed only once by a cluster's L1, and so that the entire latency of fetching from memory (using 8 threads per cluster) is hidden. His optimisation accounts for the Z-ordering (space-filling curve) that is used for texture storage in memory, where texels are tiled for best locality both in the 2D of texture space and in the extra dimension of colour channels (1, 2 or 4 channels).

    So logically a 64B cache line holds four 128-bit texels, when a texel is defined as vec4 fp32. When a texel is defined as one of the standard compressed formats, then you get a huge number of texels into one cache line :smile:

    http://msdn.microsoft.com/en-us/library/bb694531(VS.85).aspx
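
    To put rough numbers on "huge", here is a small calculation assuming the 64B line suggested above. The block sizes are the standard ones for the Direct3D block-compressed formats (BC1/DXT1 packs a 4x4 block into 8 bytes, BC3/DXT5 into 16 bytes); the function names are made up for illustration, and whether RV770 actually keeps compressed blocks in cache in this form is an assumption.

```python
LINE_BYTES = 64  # ASSUMED cache line size, per the reasoning above

def texels_per_line_uncompressed(bytes_per_texel):
    """Texels that fit in one line for a plain (uncompressed) format."""
    return LINE_BYTES // bytes_per_texel

def texels_per_line_block_compressed(bytes_per_block, texels_per_block=16):
    """Texels covered by one line for a 4x4 block-compressed format."""
    return (LINE_BYTES // bytes_per_block) * texels_per_block

vec4_fp32 = texels_per_line_uncompressed(16)    # 4 texels
rgba8 = texels_per_line_uncompressed(4)         # 16 texels
bc3 = texels_per_line_block_compressed(16)      # 4 blocks  -> 64 texels
bc1 = texels_per_line_block_compressed(8)       # 8 blocks  -> 128 texels

print(vec4_fp32, rgba8, bc3, bc1)
```

    Under these assumptions one line covers 32 times as many BC1 texels as vec4 fp32 texels.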

    Jawed
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.
Loading...