On prefetching in x86 CPUs

rpg.314

Veteran
I just came across the _mm_prefetch intrinsic in Intel's SSE reference. I have a small doubt which I would like clarified.

It says that this intrinsic loads a "cacheline of data" at the given address into the caches, and that there are no alignment requirements on the address. My understanding was (and is, until the matter is cleared up) that the CPU fetches data in multiples of 64 bytes (my CPU's cacheline size, checked with CPU-Z), aligned at 64 bytes. In view of the statement in the reference, given an address a, either

1) this intrinsic would load the 64 bytes starting at a into the cache, or

2) this intrinsic would throw away the lower 6 bits of a (i.e. align it down to 64 bytes) and then fetch that cache line.

I think the former is the case, but I want the latter. So when I am giving prefetch hints, should I manually align the address to 64 bytes by throwing away the last 6 bits? I'll probably have to, but I thought it better to confirm first.
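To make the two readings concrete, here is a minimal sketch of what I mean; the function name is made up, and the hard-coded 64 is just my CPU's line size as reported by CPU-Z:

#include <stdint.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

void prefetch_both_readings(const char *a)
{
    /* Reading 1: the hint covers the 64 bytes starting exactly at a. */
    _mm_prefetch(a, _MM_HINT_T0);

    /* Reading 2: mask off the low 6 bits so the hint names the
       64-byte-aligned cache line that contains a. */
    const char *line = (const char *)((uintptr_t)a & ~(uintptr_t)63);
    _mm_prefetch(line, _MM_HINT_T0);
}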

Thanks in advance.
 
2)

Cachelines always represent aligned memory areas, and prefetch doesn't access them directly; it just says "try to get that address cached".
And data is typically fetched in 64-128 bit chunks from main memory.
 
And data is typically fetched in 64-128 bit chunks from main memory.

That's odd. If data is accessed in 64-128 bit chunks, then what does the cacheline size mean? I thought the cacheline size was the minimum amount of data that gets loaded into the cache in one go. Or are the data fetch size and the cacheline size independent (i.e. data is fetched in 8-16 byte chunks, and the CPU makes 4-8 fetches per cacheline)?
 
That's odd. If data is accessed in 64-128 bit chunks, then what does the cacheline size mean? I thought the cacheline size was the minimum amount of data that gets loaded into the cache in one go. Or are the data fetch size and the cacheline size independent (i.e. data is fetched in 8-16 byte chunks, and the CPU makes 4-8 fetches per cacheline)?
It might be that the paths between the (L1, L2, L3) caches are wider, but from main memory you need multiple fetches per cacheline (the bus is certainly not wider than 128 bit). Some CPUs can use cachelines even if they aren't "full" yet, i.e. you only need to wait for the relevant data to be loaded and the CPU will load the rest afterwards.

CPUs use relatively big cachelines as this seems to be more performance/cost efficient. It's my naive understanding that adding cachelines is more complex and might hurt latency, while increasing the size of the lines is rather easy.
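As a back-of-the-envelope illustration of the difference between the fetch width and the line size (the 64-byte line and 8-byte bus are assumed figures, not measurements of any particular chip):

#include <stdio.h>

int main(void)
{
    /* Assumed figures for illustration: a 64-byte cache line filled
       over a 64-bit (8-byte wide) memory bus. */
    const unsigned line_bytes = 64;
    const unsigned bus_bytes  = 8;

    /* Filling one line takes a burst of line_bytes / bus_bytes transfers. */
    printf("%u bus transfers per line fill\n", line_bytes / bus_bytes);
    return 0;
}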
 
Or are the data fetch size and the cacheline size independent (i.e. data is fetched in 8-16 byte chunks, and the CPU makes 4-8 fetches per cacheline)?

Yes. Filling a line typically takes multiple fetches, but modern memory sub-systems are tuned around this sort of bursty behaviour.

The sizes of the cache lines in L1, L2 and L3 can differ too.

As for your first question, as Npl said, the range of addresses fetched will be aligned, but the address you give to prefetch doesn't need to be. It just means "get the cache line that contains this address". A smart memory controller might fetch the address you want first and then fill in the rest of the cache line.

[Well, this is how it works on the big boys' chips anyway]
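A typical way to use the hint, and a reminder that the address passed can sit anywhere inside a line, is to prefetch some distance ahead of the current access. This is only a sketch; the 512-byte (8-line) distance is a placeholder, and the useful value depends on the CPU and on how much work is done per element:

#include <stddef.h>
#include <xmmintrin.h>

/* Sum an array while prefetching ahead of the current position.
   128 floats = 512 bytes = 8 cache lines of 64 bytes (assumed size). */
float sum_with_prefetch(const float *data, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (i + 128 < n)
            _mm_prefetch((const char *)&data[i + 128], _MM_HINT_T0);
        total += data[i];
    }
    return total;
}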
 
The sizes of the cache lines in L1, L2 and L3 can differ too.

That's ok. CPU-Z is helpful with that. :)

It was important to me because, in my app, I have organized my data layout so that most of the memory requirements can be served from the same cacheline. For this reason, I would much rather have the entire cacheline available when I need it. But I doubt that sort of behaviour can be requested by the programmer.
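What can at least be arranged from the programmer's side is the layout itself: group the hot fields together and align the object to the line size, so a single hint covers all of them. A sketch assuming the 64-byte line from above; the struct and field names are hypothetical:

#include <stdalign.h>
#include <xmmintrin.h>

/* Hypothetical hot data, grouped so that everything the inner loop
   touches fits comfortably inside one 64-byte line (24 bytes here). */
struct node_hot {
    float pos[3];
    float radius;
    int   id;
    int   flags;
};

/* Aligning the object to 64 bytes guarantees it does not straddle two
   lines, so a single prefetch hint covers all of it. */
static alignas(64) struct node_hot hot;

void warm_up(void)
{
    _mm_prefetch((const char *)&hot, _MM_HINT_T0);
    /* ... later accesses to hot.pos, hot.radius, etc. stay in that line ... */
}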
 
On x86, prefetch will fetch the cache line specified by the operand. On Intel, the lower 7 bits of the operand will be ignored, as an L2 cache line is 128 bytes wide there (the L1 cache line is 64 bytes).
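As an aside, the line sizes can also be queried programmatically on Linux; this sketch relies on glibc-specific sysconf names, which are an extension rather than standard POSIX and may report 0 or -1 when a value is unknown:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* glibc extension constants; not available on every platform. */
    long l1 = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    long l2 = sysconf(_SC_LEVEL2_CACHE_LINESIZE);
    long l3 = sysconf(_SC_LEVEL3_CACHE_LINESIZE);

    printf("L1D line: %ld, L2 line: %ld, L3 line: %ld bytes\n", l1, l2, l3);
    return 0;
}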
 