On prefetching in x86 CPUs

rpg.314

Veteran
I just came across the _mm_prefetch intrinsic in Intel's SSE reference. I have a small doubt which I would like clarified.

It says that this intrinsic loads a "cacheline of data" at the given address into the caches, and that there are no alignment requirements on the address. My understanding was (and is, until the matter is cleared up) that the CPU fetches data in multiples of 64 bytes (my CPU's cacheline size, checked with CPU-Z), aligned at 64 bytes. In view of the statement in the reference, given an address a, either

1) this intrinsic would load the 64 bytes starting at a into the cache, or

2) this intrinsic would throw away the lower 6 bits of a (i.e. align it down to 64 bytes) and then fetch that cache line.

I think the former is the case, but I want the latter. So when I am giving prefetch hints, should I manually align the address to 64 bytes by throwing away the last 6 bits? I'll probably have to, but I thought it better to confirm first.
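To make the two readings concrete, here is a minimal sketch of what I mean; the function name is made up, and the hard-coded 64 is just my CPU's line size as reported by CPU-Z:

#include <stdint.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

void prefetch_both_readings(const char *a)
{
    /* Reading 1: the hint covers the 64 bytes starting exactly at a. */
    _mm_prefetch(a, _MM_HINT_T0);

    /* Reading 2: mask off the low 6 bits so the hint names the
       64-byte-aligned cache line that contains a. */
    const char *line = (const char *)((uintptr_t)a & ~(uintptr_t)63);
    _mm_prefetch(line, _MM_HINT_T0);
}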

Thanks in advance.
 
2)

Cachelines always represent aligned memory areas, and prefetch doesn't access them directly; it just says "try to get that address cached".
And data is typically fetched in 64-128 bit chunks from main memory.
 
And data is typically fetched in 64-128 bit chunks from main memory.

That's odd. If data is accessed in 64-128 bit chunks, then what does the cacheline size mean? I thought the cacheline size was the minimum amount of data that gets loaded into the cache in one go. Or are the data fetch size and the cacheline size independent (i.e. data is fetched in 8-16 byte chunks, and the CPU makes 4-8 fetches per cacheline)?
 
That's odd. If data is accessed in 64-128 bit chunks, then what does the cacheline size mean? I thought the cacheline size was the minimum amount of data that gets loaded into the cache in one go. Or are the data fetch size and the cacheline size independent (i.e. data is fetched in 8-16 byte chunks, and the CPU makes 4-8 fetches per cacheline)?
It might be that the paths between the (L1, L2, L3) caches are wider, but from main memory you need multiple fetches per cacheline (the bus is certainly not wider than 128 bit). Some CPUs can use cachelines even if they aren't "full" yet, i.e. you only need to wait for the relevant data to be loaded and the CPU will load the rest afterwards.

CPUs use relatively big cachelines as this seems to be more performance/cost efficient. It's my naive understanding that adding cachelines is more complex and might hurt latency, while increasing the size of the lines is rather easy.
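As a back-of-the-envelope illustration of the difference between the fetch width and the line size (the 64-byte line and 8-byte bus are assumed figures, not measurements of any particular chip):

#include <stdio.h>

int main(void)
{
    /* Assumed figures for illustration: a 64-byte cache line filled
       over a 64-bit (8-byte wide) memory bus. */
    const unsigned line_bytes = 64;
    const unsigned bus_bytes  = 8;

    /* Filling one line takes a burst of line_bytes / bus_bytes transfers. */
    printf("%u bus transfers per line fill\n", line_bytes / bus_bytes);
    return 0;
}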
 
Or are the data fetch size and the cacheline size independent (i.e. data is fetched in 8-16 byte chunks, and the CPU makes 4-8 fetches per cacheline)?

Yes. Filling a line typically takes multiple fetches, but modern memory sub-systems are tuned around this sort of bursty behaviour.

The sizes of the cache lines in L1, L2 and L3 can differ too.

As for your first question, as Npl said, the range of addresses fetched will be aligned, but the address you give to prefetch doesn't need to be. It just means "get the cache line that contains this address". A smart memory controller might fetch the address you want first and then fill in the rest of the cache line.

[Well, this is how it works on the big boys' chips anyway]
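A typical way to use the hint, and a reminder that the address passed can sit anywhere inside a line, is to prefetch some distance ahead of the current access. This is only a sketch; the 512-byte (8-line) distance is a placeholder, and the useful value depends on the CPU and on how much work is done per element:

#include <stddef.h>
#include <xmmintrin.h>

/* Sum an array while prefetching ahead of the current position.
   128 floats = 512 bytes = 8 cache lines of 64 bytes (assumed size). */
float sum_with_prefetch(const float *data, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (i + 128 < n)
            _mm_prefetch((const char *)&data[i + 128], _MM_HINT_T0);
        total += data[i];
    }
    return total;
}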
 
The sizes of the cache lines in L1, L2 and L3 can differ too.

That's ok. CPU-Z is helpful with that. :)

It was important to me because, in my app, I have organized my data layout so that most of the memory requirements can be served from the same cacheline. For this reason, I would much rather have the entire cacheline available when I need it. But I doubt that sort of behaviour can be requested by the programmer.
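What can at least be arranged from the programmer's side is the layout itself: group the hot fields together and align the object to the line size, so a single hint covers all of them. A sketch assuming the 64-byte line from above; the struct and field names are hypothetical:

#include <stdalign.h>
#include <xmmintrin.h>

/* Hypothetical hot data, grouped so that everything the inner loop
   touches fits comfortably inside one 64-byte line (24 bytes here). */
struct node_hot {
    float pos[3];
    float radius;
    int   id;
    int   flags;
};

/* Aligning the object to 64 bytes guarantees it does not straddle two
   lines, so a single prefetch hint covers all of it. */
static alignas(64) struct node_hot hot;

void warm_up(void)
{
    _mm_prefetch((const char *)&hot, _MM_HINT_T0);
    /* ... later accesses to hot.pos, hot.radius, etc. stay in that line ... */
}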
 
On x86, prefetch will fetch the cache line specified by the operand. On Intel, the lower 7 bits of the operand will be ignored, as an L2 cache line is 128 bytes wide there (the L1 cache line is 64 bytes).
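As an aside, the line sizes can also be queried programmatically on Linux; this sketch relies on glibc-specific sysconf names, which are an extension rather than standard POSIX and may report 0 or -1 when a value is unknown:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* glibc extension constants; not available on every platform. */
    long l1 = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    long l2 = sysconf(_SC_LEVEL2_CACHE_LINESIZE);
    long l3 = sysconf(_SC_LEVEL3_CACHE_LINESIZE);

    printf("L1D line: %ld, L2 line: %ld, L3 line: %ld bytes\n", l1, l2, l3);
    return 0;
}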
 