If I issue a load instruction on RV770, how many bytes of data are fetched from memory? In CPUs, you have 64B cache lines and each fetch from memory is for 1 cache line (or possibly more).
Define a "load instruction".
On a GPU, a load instruction in a kernel "runs simultaneously" for every strand.
Within each RV770 cluster, 64 strands are fetching "per instruction", but the texture fetch hardware can only do a fetch for 4 strands per cycle. The minimum fetch per strand is 4 bytes, so 16B per clock per thread is the absolute minimum, and 10 threads (one per cluster) are able to fetch per clock.
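To make the per-clock arithmetic concrete, here is a rough sketch in Python. The constant and function names are mine, and I'm reading "thread" as a hardware thread of 64 strands, so treat it as an illustration of the numbers above rather than a description of the hardware's scheduling.

```python
# Figures taken from the post; the names are illustrative only.
STRANDS_PER_THREAD = 64         # strands covered by one fetch instruction
STRANDS_FETCHED_PER_CLOCK = 4   # texture hardware limit per cluster
MIN_BYTES_PER_STRAND = 4        # smallest fetch a strand can make
CLUSTERS = 10                   # one thread fetching per cluster per clock


def clocks_per_fetch_instruction() -> int:
    """Clocks needed to service all 64 strands of one thread."""
    return STRANDS_PER_THREAD // STRANDS_FETCHED_PER_CLOCK


def min_bytes_per_clock_per_thread() -> int:
    """Smallest amount of data one thread pulls in per clock."""
    return STRANDS_FETCHED_PER_CLOCK * MIN_BYTES_PER_STRAND


def min_bytes_per_clock_whole_chip() -> int:
    """Same minimum, summed over all clusters."""
    return CLUSTERS * min_bytes_per_clock_per_thread()


if __name__ == "__main__":
    print("clocks per fetch instruction:", clocks_per_fetch_instruction())
    print("min bytes/clock per thread:  ", min_bytes_per_clock_per_thread())
    print("min bytes/clock, whole chip: ", min_bytes_per_clock_whole_chip())
```

So a single fetch instruction occupies a cluster's texture hardware for 16 clocks, even at the minimum 4 bytes per strand.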
The fetch goes through L1 in this case, and the theoretical maximum fetch rate is 480GB/s, although 444GB/s is what has been observed:
http://forum.beyond3d.com/showthread.php?t=54842
480GB/s results from 10 clusters * 4 strands * 16 bytes (vec4 fp32 = 128-bits) * 750MHz.
L2->L1 bandwidth is only 384GB/s.
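The bandwidth figures above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the names are mine, GB is taken as 10^9 bytes, and the bytes-per-clock and percentage values are derived from the quoted numbers, not measured.

```python
# Inputs quoted in the post; variable names are illustrative only.
CLUSTERS = 10
STRANDS_PER_CLOCK = 4            # strands fetched per cluster per clock
BYTES_PER_TEXEL = 16             # vec4 fp32 = 128 bits
CORE_CLOCK_HZ = 750_000_000      # 750MHz

OBSERVED_L1_BYTES_PER_S = 444_000_000_000   # 444GB/s, measured
L2_TO_L1_BYTES_PER_S = 384_000_000_000      # 384GB/s, quoted


def peak_l1_fetch_bytes_per_second() -> int:
    """Theoretical peak fetch rate out of L1 across the whole chip."""
    return CLUSTERS * STRANDS_PER_CLOCK * BYTES_PER_TEXEL * CORE_CLOCK_HZ


def bytes_per_clock(bytes_per_second: int) -> int:
    """Convert a bandwidth into bytes moved per core clock."""
    return bytes_per_second // CORE_CLOCK_HZ


if __name__ == "__main__":
    peak = peak_l1_fetch_bytes_per_second()
    print("peak L1 fetch GB/s:      ", peak / 1e9)
    print("observed / peak:         ", OBSERVED_L1_BYTES_PER_S / peak)
    print("L1 fetch bytes per clock:", bytes_per_clock(peak))
    print("L2->L1 bytes per clock:  ", bytes_per_clock(L2_TO_L1_BYTES_PER_S))
    print("L2->L1 / peak L1 fetch:  ", L2_TO_L1_BYTES_PER_S / peak)
```

In other words, L1 can hand out 640 bytes per clock in total, while L2 can only refill the L1s at 512 bytes per clock, which is 80% of the peak fetch rate.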
As I already explained, the burst size from GDDR5 is 64B, so it makes sense that L2 has 64B lines. I don't know what the line size is in L1, but since a single fetch is, at most, 4 * 128 bits (64B) per clock, a 64B line seems logical.
prunedtree was able to obtain 880 GFLOPS for matrix multiply by carefully tuning the fetch pattern so that all of each fetched line is used and is needed only once by a cluster's L1, and so that the entire latency of fetching from memory is hidden (using 8 threads per cluster). His optimisation accounts for the Z-ordering (a space-filling curve) that is used for texture storage in memory, where texels are tiled for best locality both in the two dimensions of texture space and in the extra dimension of colour channels (1, 2 or 4 channels).
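For anyone unfamiliar with Z-ordering, here is a minimal sketch of a plain 2D Morton (Z-order) index in Python. This is the generic textbook bit-interleave, not RV770's actual tiling, which I don't know in detail; `morton2d` and `cache_line_of` are hypothetical helpers, and the 16-byte texel and 64B line are the assumptions discussed above.

```python
BYTES_PER_TEXEL = 16   # vec4 fp32, as in the post
LINE_BYTES = 64        # assumed cache line size


def morton2d(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y: x in even positions, y in odd."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def cache_line_of(x: int, y: int) -> int:
    """Cache line holding texel (x, y) under a pure Morton layout (assumed)."""
    return (morton2d(x, y) * BYTES_PER_TEXEL) // LINE_BYTES


if __name__ == "__main__":
    # A 2x2 block of texels shares one line; a 4x1 row spans two lines.
    print([cache_line_of(x, y) for y in range(2) for x in range(2)])
    print([cache_line_of(x, 0) for x in range(4)])
```

Under that assumed layout, a 2x2 quad of vec4 fp32 texels sits in a single 64B line, whereas four texels in a row straddle two lines. That is the kind of effect a fetch pattern has to respect if every byte of a fetched line is to be used.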
So logically a 64B cache line holds 4 128-bit texels when a texel is defined as vec4 fp32. When a texel is defined as one of the standard compressed formats, you fit a far larger number of texels into one cache line :smile:
http://msdn.microsoft.com/en-us/library/bb694531(VS.85).aspx
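To put rough numbers on the texels-per-line point, here is a small Python sketch. The block sizes are the standard ones for the Direct3D block-compressed formats (4x4 texel blocks, 8 bytes for BC1/DXT1, 16 bytes for BC3/DXT5); the 64B line is the assumption from above, and the function names are mine.

```python
LINE_BYTES = 64  # assumed cache line size


def texels_per_line_uncompressed(bytes_per_texel: int) -> int:
    """Texels of a plain format that fit in one cache line."""
    return LINE_BYTES // bytes_per_texel


def texels_per_line_block_compressed(bytes_per_block: int) -> int:
    """Texels per line for a format that packs 4x4 texels into one block."""
    texels_per_block = 4 * 4
    return (LINE_BYTES // bytes_per_block) * texels_per_block


if __name__ == "__main__":
    print("vec4 fp32 (16B/texel):", texels_per_line_uncompressed(16))
    print("RGBA8 (4B/texel):     ", texels_per_line_uncompressed(4))
    print("BC1/DXT1 (8B/block):  ", texels_per_line_block_compressed(8))
    print("BC3/DXT5 (16B/block): ", texels_per_line_block_compressed(16))
```

So against 4 texels per line for vec4 fp32, a BC3 texture fits 64 and a BC1 texture fits 128.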
Jawed