cpu cache?

zchieply

Newcomer
I have a couple of questions about how the CPU accesses the cache

1) Why do CPUs use large cache lines rather than 8 bytes to represent a 64-bit variable?

2) When a multi-level cache is mapped, does the L1 cache map to the L2, or does it map straight to main memory?

3) I know that smaller caches have lower cycle latency. Is that latency added once per cache line, or once per bus-width transfer? For example, if a request was for a 64-byte cache line on a 64-bit bus and the latency was 4 cycles, would the latency be 32 cycles to transfer the whole line, or just 4 cycles?
 
1. Caches by their nature exploit locality of data. Therefore you want to balance reading what you need against the number of accesses. Even though cache is extremely fast by memory standards, it still takes multiple cycles to read data, so the more operations you can do with that data, the better. Think of it like this: <cache, alu op, cache, compare, cache, alu op, cache, alu op> (yes, I realize there may be stores etc. in there) versus reading the cache once: <cache, alu op, compare, alu op, alu op>... It's all tradeoffs though, in cache size, read size, associativity, etc.

2. I think you are thinking about this backwards. The way a layered cache works is that if the data isn't found in the first level, the next one is checked: L1, then L2, then L3, and finally main memory. I'd have to refresh myself on where data gets placed after a miss, but I think that depends on the algorithms as well.

(Feel free to correct me if I'm misremembering; it's been 2-3 years since my architecture class covered this and I haven't been using it to keep it fresh.)
 
1) Why do CPUs use large cache lines rather than 8 bytes to represent a 64-bit variable?
I'm not a CPU designer, but it's probably for a bunch of different reasons including probability, in that you're likely to want to access the next 64-bit chunk after you've accessed the first 64-bit chunk, so it'd save you time if it was already in the cache when you need it. Also, efficiency, as modern DRAM is faster if you burst-access big chunks of it rather than single 64-bit snack-sized bites all over the place and across page breaks and so on. In fact, I believe DRAM will always transfer at least 2, and maybe even 4 64-bit words regardless of how much you actually need due to the nature of how the DRAM chips are pipelined internally and the DDR bus and related things. And third, convenience, as it's probably easier to build a cache that handles larger units of data like lines than smaller ones.
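To make the spatial locality point concrete, here is a minimal sketch (assuming the usual 64-byte cache line; the array size is arbitrary). Iterating an array of 64-bit values sequentially means at most every eighth load can miss, because each miss pulls in a whole line:

#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64   /* assumed cache line size in bytes */

int main(void) {
    static uint64_t array[1024];
    for (int i = 0; i < 1024; i++)
        array[i] = i;

    /* Sequential traversal: 8 consecutive uint64_t values share one
       64-byte line, so at most every 8th load misses; the other 7
       hit the line that the first miss already brought in. */
    uint64_t sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += array[i];

    /* Distinct lines touched by the whole traversal. */
    uintptr_t first = (uintptr_t)&array[0]    / LINE_SIZE;
    uintptr_t last  = (uintptr_t)&array[1023] / LINE_SIZE;
    printf("1024 loads touch %lu cache lines (sum=%llu)\n",
           (unsigned long)(last - first + 1), (unsigned long long)sum);
    return 0;
}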

2) When a multi-level cache is mapped, does the L1 cache map to the L2, or does it map straight to main memory?
The caches have tags that are pointers to main memory locations I believe, or else I can imagine it'd be difficult to maintain coherency between what's stored in cache and what's in main memory...

3) I know that smaller caches have lower cycle latency. Is that latency added once per cache line, or once per bus-width transfer? For example, if a request was for a 64-byte cache line on a 64-bit bus and the latency was 4 cycles, would the latency be 32 cycles to transfer the whole line, or just 4 cycles?
Cache/memory controllers are typically able to forward the "critical word" as soon as it comes in, so the CPU isn't held up waiting for the entire cache line to fill; so if the piece of data you need is in the L1 you should only get the 4-cycle (or whatever) penalty for accessing it. This is from the perspective of the CPU core accessing L1. How long it takes for an L1 line to fill from memory (in cycles) depends on a huge number of factors, and can run into the hundreds of cycles these days. I do believe the same 4-cycle access latency would not apply in that case, as the cache controller has some kind of algorithm that decides which lines to evict/fill and should be able to access them straight off without having to search for them first. :)
 
I'm not a CPU designer, but it's probably for a bunch of different reasons including probability, in that you're likely to want to access the next 64-bit chunk after you've accessed the first 64-bit chunk, so it'd save you time if it was already in the cache when you need it. Also, efficiency, as modern DRAM is faster if you burst-access big chunks of it rather than single 64-bit snack-sized bites all over the place and across page breaks and so on. In fact, I believe DRAM will always transfer at least 2, and maybe even 4 64-bit words regardless of how much you actually need due to the nature of how the DRAM chips are pipelined internally and the DDR bus and related things. And third, convenience, as it's probably easier to build a cache that handles larger units of data like lines than smaller ones.


The caches have tags that are pointers to main memory locations I believe, or else I can imagine it'd be difficult to maintain coherency between what's stored in cache and what's in main memory...
On top of that, the tags take space as well, so if you went from 32-byte cache lines to 8-byte then you would need 4 times as many tags for the same cache size. This would likely increase the latency as well as increase the area and power costs.
 
2) When a multi-level cache is mapped, does the L1 cache map to the L2, or does it map straight to main memory?
Both L1 and L2 are mapped directly to memory addresses (by a modulo operation). We must be able to fetch data quickly and directly from the L1 cache based on addresses (pointers) calculated in code. Assuming a power-of-2 cache size, you take bits 6 to 13 (= 8 bits) of the address (assuming a 16 KB L1 cache and 64-byte cache lines), and you have the cache line's location. This is a no-op for hardware (just wire up those bits). This is how a simple direct-mapped cache works, but an associative cache works very similarly (you just have N lines in each set, and you have to choose the correct one).
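If it helps, here is a minimal sketch in C of that index/tag split, using the 16 KB / 64-byte numbers from above (the constant and function names are made up for illustration):

#include <stdio.h>
#include <stdint.h>

/* Assumed parameters: 16 KB direct-mapped L1 with 64-byte lines
   -> 256 sets, so bits 0-5 are the line offset and bits 6-13 pick the set. */
#define LINE_SIZE 64
#define NUM_SETS  256            /* 16 KB / 64 B */

static unsigned set_index(uintptr_t addr) {
    return (addr / LINE_SIZE) % NUM_SETS;     /* bits 6..13 */
}

static uintptr_t tag_of(uintptr_t addr) {
    return addr / (LINE_SIZE * NUM_SETS);     /* bits 14 and up, stored as the tag */
}

int main(void) {
    uintptr_t addr = 0x12345678;
    printf("addr %#lx -> set %u, tag %#lx\n",
           (unsigned long)addr, set_index(addr), (unsigned long)tag_of(addr));
    return 0;
}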

If L1 mapped to L2 (instead of directly to memory), you would need an additional indirection. The memory address would first need to be translated (modulo) to an L2 location (this part is a no-op, since L2 is mapped directly to memory addresses), and then you would need to "ask L2" where the cache line is in L1. Or alternatively you could do an extra hash lookup (but that of course costs extra as well).
1) Why do CPUs use large cache lines rather than 8 bytes to represent a 64-bit variable?
The larger the cache lines you use, the fewer requests you make to the memory subsystem, and the less bookkeeping the cache needs to do (as the cache contains fewer cache lines). For each cache line you need to keep track of its memory address (bits 7 to 63 of a 64-bit address = 57 bits). You also need to store some flags, for example a modified flag (= do I need to write this back to memory on eviction?). If you had just an 8-byte payload per cache line, you would have around 50% overhead (only about half of the cache memory would be usable for real data).
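As a rough back-of-the-envelope sketch of that overhead (the 3 flag bits are an assumption; the stored address bits are recomputed per line size the same way the 57-bit figure above was derived):

#include <stdio.h>

int main(void) {
    int line_bytes[] = { 8, 32, 64, 128 };

    for (int i = 0; i < 4; i++) {
        int offset_bits = 0;
        while ((1 << offset_bits) < line_bytes[i])
            offset_bits++;

        /* per-line metadata: address bits above the offset, plus flags */
        int meta_bits    = (64 - offset_bits) + 3;
        int payload_bits = line_bytes[i] * 8;
        double share = 100.0 * meta_bits / (meta_bits + payload_bits);

        printf("%3d-byte lines: %2d metadata bits per line, %.1f%% of storage is bookkeeping\n",
               line_bytes[i], meta_bits, share);
    }
    return 0;
}

With 8-byte lines roughly half the SRAM goes to bookkeeping; with 64-byte lines it's around 10%, and with 128-byte lines around 5%.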

8 byte cache lines would also be too small for vector registers. AVX vector register contains 32 bytes. CPUs are very slow in accessing data that crosses cache line boundaries (multiple requests and combining is required = big stall). So 32 bytes per cache line is practically the smallest possible for a modern CPU that supports AVX.

Haswell is rumored to double the cache line size to 128 bytes. I am sure this would be done both to reduce cache (processing and memory) overhead, and to allow support for a future 1024-bit AVX (128 bytes per register). And I am pretty sure memory controllers and buses (and memory chips) work more efficiently if you stream large blocks of data instead of single 64-bit values.
 
8 byte cache lines would also be too small for vector registers. AVX vector register contains 32 bytes. CPUs are very slow in accessing data that crosses cache line boundaries (multiple requests and combining is required = big stall). So 32 bytes per cache line is practically the smallest possible for a modern CPU that supports AVX.
Didn't Intel fix the cache-line boundary crossing performance drop way back in Nehalem?
 
Didn't Intel fix the cache-line boundary crossing performance drop way back in Nehalem?
Yes. But even in Sandy Bridge there are still some corner cases (for example, store forwarding stalls when 32-byte vectors cross a cache line boundary). And it wouldn't be very efficient to fetch four (8-byte) cache lines to fill a single (32-byte) vector register. Basically, in the worst case you would have four cache misses for each move instruction (actually five if the vector is unaligned). Haswell is going to have a gather instruction, and soon we will see how fast it is (gather can also result in four cache misses). It's going to be interesting :)
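To illustrate the worst case being described, here is a tiny helper (illustrative only) that counts how many cache lines an access of a given size touches - each touched line is a potential miss:

#include <stdio.h>
#include <stdint.h>

/* How many cache lines does the access [addr, addr+size) touch? */
static unsigned lines_touched(uintptr_t addr, unsigned size, unsigned line) {
    uintptr_t first = addr / line;
    uintptr_t last  = (addr + size - 1) / line;
    return (unsigned)(last - first + 1);
}

int main(void) {
    /* A 32-byte AVX load with hypothetical 8-byte lines:
       4 lines if aligned, 5 if it straddles a line boundary. */
    printf("8B lines, aligned:    %u\n", lines_touched(0x1000, 32, 8));
    printf("8B lines, unaligned:  %u\n", lines_touched(0x1004, 32, 8));
    /* With real 64-byte lines the same load touches 1 or 2 lines. */
    printf("64B lines, unaligned: %u\n", lines_touched(0x1004, 32, 64));
    return 0;
}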
 
On top of that, the tags take space as well, so if you went from 32-byte cache lines to 8-byte then you would need 4 times as many tags for the same cache size. This would likely increase the latency as well as increase the area and power costs.

Not to mention that your tag size would be close to or over half the cache line size. You are looking at around 32 bits just for the address portion of the tag, plus 3-4 bits of tag state, plus parity or ECC, given modern CPUs' physical/virtual address space capabilities.
 
I see various rumors wrt L1/L2 cache sizes and line sizes for Haswell, with absolutely no backing from a real source.
Haven't seen any concrete proof either, but the 128-byte cache line size is mentioned in many places. It might of course originate from the same false information (take it with a grain of salt).

1024-bit (128-byte) AVX was mentioned (as a future plan) in Intel's official AVX documents, but it isn't present in Haswell yet. The Haswell AVX2 documentation doesn't mention anything about 512-bit or 1024-bit operations or registers. So it's pretty certain that we won't see these extra-wide vectors until Skylake.
 
P4's L2 cache was organized in 128-Byte lines (two 64B lines per sector), so it's not unreasonable at least for Intel.
 
Haven't seen any concrete proof either, but the 128-byte cache line size is mentioned in many places. It might of course originate from the same false information (take it with a grain of salt).
That's why I was asking. The only source seems to be some rumor somewhere. If you look at the Wikipedia Haswell entry, the L1/L2 cache size switches from 2x32KB/256KB to 2x64KB/1024KB and back every two days, and neither side has anything to back it up - I believe the cache line size came from the same source initially, which just originated as random speculation.

P4's L2 cache was organized in 128-Byte lines (two 64B lines per sector), so it's not unreasonable at least for Intel.
It isn't unreasonable, but at least as long as the wider vector instructions aren't a reality, I don't really see why it should change either (though I guess with gather being expected to have per-cache-line throughput penalties, there'd be some benefit there at least). Well, maybe for DDR4 if that has a longer burst mode or something (since at least some later Haswell-EP/EX? versions might presumably support DDR4).
 
In fact, I believe DRAM will always transfer at least 2, and maybe even 4 64-bit words regardless of how much you actually need due to the nature of how the DRAM chips are pipelined internally and the DDR bus and related things.

Actually, modern DDR3 DRAM uses an 8n prefetch (burst length of 8), so every transfer from RAM on a 64-bit channel is at least 64 bytes.
 
Why do CPUs use large cache lines rather than 8 bytes to represent a 64-bit variable?
Intel currently has eight 8-byte banks, with each 64 byte cache line evenly split across them. So reading an aligned 64-bit variable accesses only a single bank. Sandy Bridge can sustain two unaligned 128-bit reads per cycle (involving 6 banks) if they don't cross a cache line and there's no bank conflict.
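As an illustration of that banking scheme (the constants here are the ones from the post; the conflict rule is a simplified model, not Intel's exact implementation):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Eight 8-byte banks, with a 64-byte line striped across them,
   so bits 3-5 of the address pick the bank. */
#define BANK_WIDTH 8
#define NUM_BANKS  8

static unsigned bank_of(uintptr_t addr) {
    return (addr / BANK_WIDTH) % NUM_BANKS;
}

/* Simplified model: two same-cycle loads conflict if they hit the
   same bank but are not reading the same 8-byte chunk. */
static bool bank_conflict(uintptr_t a, uintptr_t b) {
    return bank_of(a) == bank_of(b) && (a / BANK_WIDTH) != (b / BANK_WIDTH);
}

int main(void) {
    /* 0x00 and 0x40 land in bank 0 of different lines -> conflict;
       0x00 and 0x08 land in banks 0 and 1 -> no conflict. */
    printf("0x00 vs 0x40: %s\n", bank_conflict(0x00, 0x40) ? "conflict" : "ok");
    printf("0x00 vs 0x08: %s\n", bank_conflict(0x00, 0x08) ? "conflict" : "ok");
    return 0;
}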
 
I don't see why it would.
Haswell is strongly assumed to support two 256-bit loads per cycle. They could use eight 16-byte cache banks, or sixteen 8-byte banks, or stick with eight 8-byte banks. Note that the first two options likely require doubling the cache line length. So I wouldn't be surprised if Haswell did have 128-byte cache lines.

We'd have to evaluate the advantages/disadvantages of each option to see which one is most likely...
 
Haswell is strongly assumed to support two 256-bit loads per cycle.
I disagree with your notion that Haswell is "strongly assumed" to support two 256-bit loads per cycle. Nothing really hints that it will have improved load/store (aside from what's really necessary for gather) - it could, but I wouldn't be surprised if not, either.
 