Minimum fetch from DRAM on RV770?

dkanter · Aug 22, 2009

OK, so I spent a while trying to find out how much data is fetched for a load from memory, but couldn't find jack shit (thanks for the lack of any documentation ATI!).

The GT200 fetches 64B from DRAM for each memory request (http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=10), which is the same as an x86 processor.

How much data does the RV770 fetch from DRAM for each memory request? I don't think it's the same, since RV770 has GDDR5 which bursts more data than GDDR3 (I believe).

David

Jawed · Aug 22, 2009

GDDR5 has a burst length of 8 (as stated in the Qimonda whitepaper you have) and there are four controllers that are 64 bits wide, so 64B.

Jawed

bowman · Aug 22, 2009

Ooh, RV770 article.

You played your hand, now I'll be F5ing RWT until my fingers bleed. Thanks a bunch.

trinibwoy · Aug 22, 2009

Or you could just follow dk on Twitter

dkanter · Aug 22, 2009

Jawed said:
GDDR5 has a burst length of 8 (as stated in the Qimonda whitepaper you have) and there are four controllers that are 64 bits wide, so 64B.

Jawed

I don't think that's how it's calculated. A memory request goes to a single memory controller and should only access 1-4 DRAMs (for locality's sake), not all four controllers.

At least that was my understanding of how it works for GT200, and I'd assume it works the same way for RV770. Again, I don't know for sure, but that's my belief based upon what I read.

David

dkanter · Aug 22, 2009

bowman said:
Ooh, RV770 article.

You played your hand, now I'll be F5ing RWT until my fingers bleed. Thanks a bunch.

LOL....twitter : )

David

Jawed · Aug 22, 2009

dkanter said:
I don't think that's how it's calculated. A memory request goes to a single memory controller and should only access 1-4 DRAMs (for locality's sake), not all four controllers.

At least that was my understanding of how it works for GT200, and I'd assume it works the same way for RV770. Again, I don't know for sure, but that's my belief based upon what I read.

On RV770 with GDDR5, a single memory controller addresses 2 chips simultaneously, 32 bits each. Or, in clamshell configuration, it addresses 4 chips simultaneously, 16 bits each. The burst length is always 8.

GDDR3 has burst lengths of either 4 or 8, a completely free choice in the mode of operation of the controller, as far as I can tell.
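A minimal sketch of the arithmetic in this post, assuming one burst is delivered by every chip on a controller at once. The function name is made up for illustration, and the GDDR3 line assumes the same 2 x 32-bit arrangement per controller, which the thread does not confirm.

```python
def burst_bytes(chips, bits_per_chip, burst_length):
    """Bytes delivered by one burst across all chips on one controller."""
    return chips * bits_per_chip * burst_length // 8

# GDDR5 on RV770, as described above: burst length is always 8.
normal = burst_bytes(chips=2, bits_per_chip=32, burst_length=8)
clamshell = burst_bytes(chips=4, bits_per_chip=16, burst_length=8)

# GDDR3 with burst length 4 (assumption: 2 chips x 32 bits per controller).
gddr3_bl4 = burst_bytes(chips=2, bits_per_chip=32, burst_length=4)

print(normal, clamshell, gddr3_bl4)
```

Both GDDR5 configurations come out at 64B per controller per burst, which is the figure used in the rest of the thread.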

Jawed

dkanter · Aug 23, 2009

Jawed said:
On RV770 with GDDR5, a single memory controller addresses 2 chips simultaneously, 32 bits each. Or, in clamshell configuration, it addresses 4 chips simultaneously, 16 bits each. The burst length is always 8.

GDDR3 has burst lengths of either 4 or 8, a completely free choice in the mode of operation of the controller, as far as I can tell.

Jawed

That's not what I'm asking...

If I issue a load instruction on RV770, how many bytes of data are fetched from memory? In CPUs, you have 64B cache lines and each fetch from memory is for 1 cache line (or possibly more).

DK

Bob · Aug 23, 2009

It's not hard to write a test to figure it out.

dkanter · Aug 23, 2009

Bob said:
It's not hard to write a test to figure it out.

Not being much of a programmer myself, I'm not sure I'd agree.

Rather, it might be relatively easy for you, but I doubt it would be for me.

David

Jawed · Aug 23, 2009

dkanter said:
If I issue a load instruction on RV770, how many bytes of data are fetched from memory? In CPUs, you have 64B cache lines and each fetch from memory is for 1 cache line (or possibly more).

Define a "load instruction".

On a GPU a load instruction in a kernel "runs simultaneously" for every strand.

Within each RV770 cluster 64 strands are fetching "per instruction", but the texture fetch hardware can only do a fetch for 4 strands per cycle. The minimum fetch per strand is 4 bytes, so 16B per clock per thread is the absolute minimum, and there are 10 threads (one per cluster) able to fetch per clock.

The fetch goes through L1 in this case and the maximum fetch rate is 480GB/s theoretically, although observed to be 444GB/s:

http://forum.beyond3d.com/showthread.php?t=54842

480GB/s results from 10 clusters * 4 strands * 16 bytes (vec4 fp32 = 128 bits) * 750MHz.

L2->L1 bandwidth is only 384GB/s.
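The figures above can be checked with a few lines of Python. This is only the arithmetic from this post; the constant and function names are illustrative, and the 384GB/s L2->L1 figure is taken as quoted rather than derived.

```python
CLUSTERS = 10            # RV770 clusters, each with its own L1
STRANDS_PER_CLOCK = 4    # strands the texture hardware serves per clock
BYTES_PER_FETCH = 16     # vec4 fp32 = 128 bits
CLOCK_HZ = 750e6         # 750MHz core clock

def peak_l1_fetch_gbs():
    """Theoretical peak L1 fetch rate in GB/s (1 GB = 1e9 bytes)."""
    return CLUSTERS * STRANDS_PER_CLOCK * BYTES_PER_FETCH * CLOCK_HZ / 1e9

peak = peak_l1_fetch_gbs()
observed_fraction = 444 / peak   # measured rate vs. theoretical peak
l2_to_l1_fraction = 384 / peak   # quoted L2->L1 rate vs. L1 peak

print(peak, round(observed_fraction, 3), round(l2_to_l1_fraction, 3))
```

So the observed 444GB/s is about 92.5% of the theoretical peak, and the quoted L2->L1 bandwidth is 80% of it.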

As I already explained, the burst size from GDDR5 is 64B, so it makes sense that L2 has 64B lines. I don't know what the line size is in L1, but since a single fetch is, at most, 4 * 128 bits (512 bits = 64B) per clock, a 64B line seems logical.

prunedtree was able to obtain 880 GFLOPS for matrix multiply by carefully tuning the fetch pattern, so that all of a fetch line is used and is needed only once by a cluster's L1, and so that the entire latency of fetching from memory (using 8 threads per cluster) is hidden. His optimisation accounts for the Z-ordering (space-filling curve) that is used for texture storage in memory, where texels are tiled for best locality both in the 2D of texture space and in the extra dimension of colour channels (1, 2 or 4 channels).
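To illustrate why Z-ordering helps locality, here is a sketch of a generic 2D Morton (Z-order) index in Python. This is the textbook bit-interleave, not RV770's actual tiling, which is not documented in this thread; the 16B texel and 64B line sizes are the ones assumed in this post, and all function names are made up for the example.

```python
def part1by1(n):
    """Spread the low 16 bits of n so there is a zero bit between each."""
    n &= 0xFFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def morton2d(x, y):
    """Interleave x and y bits: x in even positions, y in odd positions."""
    return part1by1(x) | (part1by1(y) << 1)

def cache_line_of(x, y, texel_bytes=16, line_bytes=64):
    """Line index of texel (x, y), assuming texels are stored in Morton order."""
    return morton2d(x, y) * texel_bytes // line_bytes

# A 2x2 block of vec4 fp32 texels shares one 64B line under this layout.
block = [cache_line_of(x, y) for y in range(2) for x in range(2)]
print(block, cache_line_of(2, 0))
```

With a plain row-major layout the same 2x2 block would touch two different lines, which is the kind of waste a tuned fetch pattern avoids.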

So logically a 64B cache line holds 4 128-bit texels, when a texel is defined as vec4 fp32. When a texel is defined as one of the standard compressed formats, then you get a huge number of texels into one cache line :smile:

http://msdn.microsoft.com/en-us/library/bb694531(VS.85).aspx
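A quick sketch of the texels-per-line count, assuming the 64B line size guessed at above. The block sizes are the standard ones for the block-compressed formats described at the link (DXT1/BC1 is 8 bytes per 4x4 block, DXT5/BC3 is 16 bytes per 4x4 block); the helper names are illustrative.

```python
LINE_BYTES = 64  # assumed cache line size from this post

def texels_per_line(bytes_per_texel):
    """Texels of an uncompressed format that fit in one line."""
    return LINE_BYTES // bytes_per_texel

def texels_per_line_compressed(bytes_per_block, texels_per_block=16):
    """Texels of a 4x4 block-compressed format that fit in one line."""
    return LINE_BYTES // bytes_per_block * texels_per_block

vec4_fp32 = texels_per_line(16)          # 128-bit texel
dxt1 = texels_per_line_compressed(8)     # BC1: 8 bytes per 4x4 block
dxt5 = texels_per_line_compressed(16)    # BC3: 16 bytes per 4x4 block

print(vec4_fp32, dxt1, dxt5)
```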

Jawed
