Old 06-Jul-2012, 06:05   #1
zchieply
Registered
 
Join Date: Dec 2011
Posts: 6
Default cpu cache?

I have a couple of questions about how the CPU accesses the cache

1) Why do CPUs use large cache lines rather than 8 bytes to represent a 64-bit variable?

2) When a multi-level cache is mapped, does the L1 cache map to the L2, or does it map straight to main memory?

3) I know that smaller caches have lower latency in cycles. Is that latency paid once per cache line, or once per bus-width transfer? For example, if a request was for a 64-byte cache line on a 64-bit bus and the latency was 4 cycles, would it take 32 cycles to transfer the whole line, or just 4 cycles?
zchieply is offline   Reply With Quote
Old 06-Jul-2012, 06:26   #2
Xenus
Senior Member
 
Join Date: Nov 2004
Location: Ohio
Posts: 1,208
Default

1. Caches by their nature exploit locality of data. Therefore you want to balance reading only what you need against the number of accesses. Even though cache is extremely fast by memory standards, it still takes multiple cycles to read data, so the more ops you can do with the data from one read, the better. Think of it like this: <cache, alu op, cache, compare, cache, alu op, cache, alu op> (yes, I realize there may be stores etc. in there) versus reading the cache once: <cache, alu op, compare, alu op, alu op>... It's all tradeoffs, though, in cache size, read size, associativity, etc.

2. I think you are thinking about this backwards. The way a layered cache works is that if it doesn't find the data in the first layer, it looks in the next: L1, then L2, then L3, and finally main memory. I'd have to refresh my memory on where the line is placed after a miss, but I think that depends on the algorithms as well.

(Feel free to correct me if I'm misremembering; it's been 2-3 years since my arch class on this and I haven't been using it to keep it fresh.)

Last edited by Xenus; 06-Jul-2012 at 06:39.
Xenus is offline   Reply With Quote
Old 06-Jul-2012, 06:51   #3
Grall
Invisible Member
 
Join Date: Apr 2002
Location: La-la land
Posts: 5,030
Default

Quote:
Originally Posted by zchieply View Post
1) Why do CPUs use large cache lines rather than 8 bytes to represent a 64-bit variable?
I'm not a CPU designer, but it's probably for a bunch of different reasons including probability, in that you're likely to want to access the next 64-bit chunk after you've accessed the first 64-bit chunk, so it'd save you time if it was already in the cache when you need it. Also, efficiency, as modern DRAM is faster if you burst-access big chunks of it rather than single 64-bit snack-sized bites all over the place and across page breaks and so on. In fact, I believe DRAM will always transfer at least 2, and maybe even 4 64-bit words regardless of how much you actually need due to the nature of how the DRAM chips are pipelined internally and the DDR bus and related things. And third, convenience, as it's probably easier to build a cache that handles larger units of data like lines than smaller ones.

Quote:
2) When a multi-level cache is mapped, does the L1 cache map to the L2, or does it map straight to main memory?
The caches have tags that are pointers to main memory locations I believe, or else I can imagine it'd be difficult to maintain coherency between what's stored in cache and what's in main memory...

Quote:
3) I know that smaller caches have lower latency in cycles. Is that latency paid once per cache line, or once per bus-width transfer? For example, if a request was for a 64-byte cache line on a 64-bit bus and the latency was 4 cycles, would it take 32 cycles to transfer the whole line, or just 4 cycles?
Cache/memory controllers are typically able to transfer the "critical word" as soon as it comes in, just to not hold the CPU up for the entire cache line to fill, so if the piece of data you need is in the L1 you should only get the 4-cycle (or whatever) penalty for accessing it. This is from the perspective of the CPU core accessing L1. How long it takes for a L1 line to fill from memory (in cycles) depends on huge numbers of factors, and can have a latency of a thousand, or maybe even thousands of cycles these days. I do believe the same 4-cycle access latency would not apply in that case, as the cache controller has some kind of algorithm that decides what lines to eject/fill and should be able to just access them straight off without having to search for them first.
__________________
"If I were a science teacher and a student said the Universe is 6000 years old, I would mark that answer as wrong (why? Because it is)."
-Phil Plait
Grall is offline   Reply With Quote
Old 06-Jul-2012, 07:47   #4
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,291
Send a message via ICQ to OpenGL guy
Default

Quote:
Originally Posted by Grall View Post
I'm not a CPU designer, but it's probably for a bunch of different reasons including probability, in that you're likely to want to access the next 64-bit chunk after you've accessed the first 64-bit chunk, so it'd save you time if it was already in the cache when you need it. Also, efficiency, as modern DRAM is faster if you burst-access big chunks of it rather than single 64-bit snack-sized bites all over the place and across page breaks and so on. In fact, I believe DRAM will always transfer at least 2, and maybe even 4 64-bit words regardless of how much you actually need due to the nature of how the DRAM chips are pipelined internally and the DDR bus and related things. And third, convenience, as it's probably easier to build a cache that handles larger units of data like lines than smaller ones.


The caches have tags that are pointers to main memory locations I believe, or else I can imagine it'd be difficult to maintain coherency between what's stored in cache and what's in main memory...
On top of that, the tags take space as well, so if you went from 32-byte cache lines to 8-byte then you would need 4 times as many tags for the same cache size. This would likely increase the latency as well as increase the area and power costs.
__________________
I speak only for myself.
OpenGL guy is offline   Reply With Quote
Old 06-Jul-2012, 09:37   #5
sebbbi
Member
 
Join Date: Nov 2007
Posts: 945
Default

Quote:
Originally Posted by zchieply View Post
2) When a multi-level cache is mapped, does the L1 cache map to the L2, or does it map straight to main memory?
Both L1 and L2 are mapped directly to memory addresses (by a modulo operation). We must be able to fetch data quickly, directly from the L1 cache, based on addresses (pointers) calculated in code. Assuming a power-of-2 cache size, you can take bits 6 to 13 (= 8 bits) from the address (assuming a 16 KB L1 cache and 64-byte cache lines), and you have the cache line index. This is a no-op for hardware (just wire up these bits). This is how a simple direct-mapped cache works, but an associative cache works very similarly (you just have N lines under each bucket, and you have to choose the correct one).
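
As a rough illustration of the bit slicing described above, here is a minimal Python sketch for a direct-mapped cache. The 16 KB size and 64-byte line come from the post; the function name `split_address` and the tag/index/offset naming are only illustrative assumptions, not any particular CPU's layout.

```python
# Sketch of direct-mapped address slicing (sizes taken from the post above).
LINE_SIZE = 64                       # bytes per cache line
CACHE_SIZE = 16 * 1024               # 16 KB L1
NUM_LINES = CACHE_SIZE // LINE_SIZE  # 256 lines

OFFSET_BITS = LINE_SIZE.bit_length() - 1  # 6 bits select the byte within a line
INDEX_BITS = NUM_LINES.bit_length() - 1   # 8 bits select the line (bits 6..13)


def split_address(addr):
    """Return (tag, index, offset) for a hypothetical direct-mapped cache."""
    offset = addr & (LINE_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


tag, index, offset = split_address(0x12345)
# The index is the same value a modulo on the line number would give.
same_as_modulo = index == (0x12345 // LINE_SIZE) % NUM_LINES
```

Because both sizes are powers of two, the modulo reduces to masking, which is why the hardware only needs to wire up those address bits.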

If L1 mapped to L2 (instead of directly to memory), you would need an additional indirection. The memory address would first need to be translated (modulo) to an L2 address (this is a no-op, since L2 is mapped directly to memory addresses), and then you would need to "ask L2" where the cache line is in L1. Alternatively, you could do an extra hash lookup (but that of course costs extra as well).
Quote:
Originally Posted by zchieply View Post
1) Why do CPUs use large cache lines rather than 8 bytes to represent a 64-bit variable?
The larger the cache line you use, the fewer requests you make to the memory subsystem, and the less bookkeeping the cache needs to do (as the cache contains fewer lines). For each cache line you need to keep track of its memory address (bits 6 to 63 = 58 bits, for 64-byte lines and full 64-bit addresses). You also need to store some flags, for example a modified flag (= do I need to write this back into memory on eviction?). If you had just an 8-byte payload per cache line, you would have around 50% overhead (only about half of the cache memory would be usable for real data).
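
The overhead argument can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: it stores the full line address of a 64-bit address per line, as the post describes, plus a hypothetical 2 flag bits. Real caches store fewer address bits (the index bits are implicit and physical addresses are narrower), so treat the numbers as an upper-bound illustration.

```python
def tag_overhead(line_bytes, addr_bits=64, flag_bits=2):
    """Fraction of cache storage spent on bookkeeping rather than data.

    Assumes the full line address is stored per line; flag_bits is a
    made-up figure for state such as a modified flag.
    """
    offset_bits = line_bytes.bit_length() - 1      # bits addressing bytes within a line
    tag_bits = addr_bits - offset_bits + flag_bits
    data_bits = line_bytes * 8
    return tag_bits / (tag_bits + data_bits)


overhead_8 = tag_overhead(8)    # 8-byte lines: about half the storage is bookkeeping
overhead_64 = tag_overhead(64)  # 64-byte lines: roughly a tenth
```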

8 byte cache lines would also be too small for vector registers. AVX vector register contains 32 bytes. CPUs are very slow in accessing data that crosses cache line boundaries (multiple requests and combining is required = big stall). So 32 bytes per cache line is practically the smallest possible for a modern CPU that supports AVX.

Haswell is going to double the cache line size to 128 bytes. I am sure this is done both to reduce cache (processing and memory) overhead, and to allow support for future 1024 bit AVX (128 byte per register). And I am pretty sure memory controllers and bus (and memory chips) work more efficiently if you stream large blocks of data instead of single 64 bit values.
sebbbi is offline   Reply With Quote
Old 06-Jul-2012, 09:59   #6
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,819
Send a message via Skype™ to fellix
Default

Quote:
Originally Posted by sebbbi View Post
8 byte cache lines would also be too small for vector registers. AVX vector register contains 32 bytes. CPUs are very slow in accessing data that crosses cache line boundaries (multiple requests and combining is required = big stall). So 32 bytes per cache line is practically the smallest possible for a modern CPU that supports AVX.
Didn't Intel fix the performance drop for accesses that cross a cache line boundary way back in Nehalem?
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
fellix is offline   Reply With Quote
Old 06-Jul-2012, 10:33   #7
sebbbi
Member
 
Join Date: Nov 2007
Posts: 945
Default

Quote:
Originally Posted by fellix View Post
Didn't Intel fix the performance drop for accesses that cross a cache line boundary way back in Nehalem?
Yes. But even in Sandy Bridge there are still some corner cases (for example, store forwarding stalls when 32-byte vectors cross a cache line boundary). And it wouldn't be very effective to fetch four (8-byte) cache lines to fill a single (32-byte) vector register. Basically, in the worst case you would have four cache misses for each move instruction (actually five if the vector is unaligned). Haswell is going to have a gather instruction, and soon we will see how fast it is (gather can also result in four cache misses). It's going to be interesting.
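
The "four, or five if unaligned" count follows from simple address arithmetic. A minimal sketch, with `lines_touched` being an illustrative helper name rather than anything from the thread:

```python
def lines_touched(addr, size, line_size):
    """Number of cache lines covered by an access of `size` bytes at `addr`."""
    first = addr // line_size
    last = (addr + size - 1) // line_size
    return last - first + 1


aligned_small_lines = lines_touched(0, 32, 8)    # 32-byte vector, 8-byte lines, aligned
unaligned_small_lines = lines_touched(4, 32, 8)  # same vector, misaligned by 4 bytes
aligned_big_lines = lines_touched(32, 32, 64)    # 64-byte lines, vector fits in one line
split_big_lines = lines_touched(48, 32, 64)      # 64-byte lines, vector straddles a boundary
```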
sebbbi is offline   Reply With Quote
Old 06-Jul-2012, 12:16   #8
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,570
Default

Quote:
Originally Posted by OpenGL guy View Post
On top of that, the tags take space as well, so if you went from 32-byte cache lines to 8-byte then you would need 4 times as many tags for the same cache size. This would likely increase the latency as well as increase the area and power costs.
Not to mention that your tag size would be close to or over half the cacheline size. You are looking at around 32b just for the addr portion of the tag, plus 3-4 bits for tag state plus parity or ecc given modern cpus physical/virtual address space capabilities.
__________________
Aaron Spink
speaking for myself inc.
aaronspink is offline   Reply With Quote
Old 06-Jul-2012, 13:16   #9
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,437
Default

Quote:
Originally Posted by sebbbi View Post
Haswell is going to double the cache line size to 128 bytes.
Source? I see various rumors wrt L1/L2 cache sizes and line sizes for Haswell, with absolutely zero backing of a real source.
mczak is offline   Reply With Quote
Old 06-Jul-2012, 14:35   #10
sebbbi
Member
 
Join Date: Nov 2007
Posts: 945
Default

Quote:
Originally Posted by mczak View Post
I see various rumors wrt L1/L2 cache sizes and line sizes for Haswell, with absolutely zero backing of a real source.
Haven't seen any concrete proof either, but the 128-byte cache line size is mentioned in many places. It might of course all originate from the same false information (take it with a grain of salt).

1024 bit (128 byte) AVX was mentioned (as a future plan) in Intel official AVX documents, but isn't present in Haswell yet. Haswell AVX2 documentation doesn't mention anything about 512 bit or 1024 bit operations or registers. So it's pretty certain that we don't see these extra wide vectors until Skylake.
sebbbi is offline   Reply With Quote
Old 06-Jul-2012, 14:53   #11
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,819
Send a message via Skype™ to fellix
Default

P4's L2 cache was organized in 128-Byte lines (two 64B lines per sector), so it's not unreasonable at least for Intel.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
fellix is offline   Reply With Quote
Old 06-Jul-2012, 20:59   #12
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,437
Default

Quote:
Originally Posted by sebbbi View Post
Haven't seen any concrete proof either, but the 128-byte cache line size is mentioned in many places. It might of course all originate from the same false information (take it with a grain of salt).
That's why I was asking. The only source seems to be some rumor somewhere. If you look at the Wikipedia Haswell entry, the L1/L2 cache size switches from 2x32KB/256KB to 2x64KB/1024KB and back every two days, and neither side has anything to back it up. I believe the cache line size initially came from the same source, which just originated as random speculation.

Quote:
Originally Posted by fellix View Post
P4's L2 cache was organized in 128-Byte lines (two 64B lines per sector), so it's not unreasonable at least for Intel.
It isn't unreasonable, but at least as long as the wider vector instructions aren't a reality I don't really see why it should change either (though I guess with gather being expected to have per-cacheline throughput penalties there'd be some benefits there at least). Well, maybe for DDR4, if that has a longer burst mode or something (since at least some later Haswell-EP/EX versions might presumably support DDR4).
mczak is offline   Reply With Quote
Old 06-Jul-2012, 23:45   #13
tunafish
Member
 
Join Date: Aug 2011
Posts: 370
Default

Quote:
Originally Posted by Grall View Post
In fact, I believe DRAM will always transfer at least 2, and maybe even 4 64-bit words regardless of how much you actually need due to the nature of how the DRAM chips are pipelined internally and the DDR bus and related things.
Actually, modern DDR3 DRAM has an 8n prefetch buffer (burst length of 8), so on a 64-bit channel every transfer from RAM is at least 64 bytes.
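
The 64-byte figure is just bus width times burst length. A tiny sketch of that arithmetic, assuming a standard 64-bit memory channel (the helper name `min_burst_bytes` is illustrative only):

```python
def min_burst_bytes(bus_width_bits=64, burst_length=8):
    """Smallest transfer a DRAM channel delivers: one full burst."""
    return (bus_width_bits // 8) * burst_length


ddr3_burst = min_burst_bytes()                # DDR3: 8n prefetch on a 64-bit channel
ddr2_burst = min_burst_bytes(burst_length=4)  # DDR2: 4n prefetch, for comparison
```

With a burst of 8 on a 64-bit channel, one burst matches a 64-byte cache line exactly.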
tunafish is offline   Reply With Quote
Old 07-Jul-2012, 04:27   #14
3dcgi
Senior Member
 
Join Date: Feb 2002
Posts: 2,019
Default

Quote:
Originally Posted by zchieply View Post
2) When a multi-level cache is mapped, does the L1 cache map to the L2, or does it map straight to main memory?
Others answered this question, but I interpreted it differently. I think you're asking if the L1 is inclusive or exclusive of the L2 and the answer is it depends. Here's Wikipedia's explanation of the two.
http://en.wikipedia.org/wiki/CPU_cac...rsus_inclusive
3dcgi is offline   Reply With Quote
Old 07-Jul-2012, 20:21   #15
Triskaine
Junior Member
 
Join Date: Mar 2010
Posts: 28
Default

The CPUID instruction provides information about cache size and associativity. CPU-Z readouts from a Haswell ES already leaked back in April:
http://www.chiphell.com/thread-451483-1-1.html

Haswell continues to use the Cache-Hierarchy of Sandy Bridge. No changes in size or associativity.
Triskaine is offline   Reply With Quote
Old 07-Jul-2012, 21:02   #16
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by zchieply View Post
Why do CPUs use large cache lines rather than 8 bytes to represent a 64-bit variable?
Intel currently has eight 8-byte banks, with each 64 byte cache line evenly split across them. So reading an aligned 64-bit variable accesses only a single bank. Sandy Bridge can sustain two unaligned 128-bit reads per cycle (involving 6 banks) if they don't cross a cache line and there's no bank conflict.
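
To make the bank picture concrete, here is a small sketch that maps an access onto the eight 8-byte banks described above. The bank geometry is taken from the post; the interleaving rule (consecutive 8-byte chunks go to consecutive banks) and the helper names are assumptions for illustration.

```python
BANK_BYTES = 8   # width of one bank (from the post)
NUM_BANKS = 8    # banks per 64-byte line (from the post)
LINE_SIZE = BANK_BYTES * NUM_BANKS


def banks_touched(addr, size):
    """Sorted bank numbers an access hits, assuming simple chunk interleaving."""
    first = addr // BANK_BYTES
    last = (addr + size - 1) // BANK_BYTES
    return sorted({chunk % NUM_BANKS for chunk in range(first, last + 1)})


def crosses_line(addr, size):
    """True if the access spans two cache lines."""
    return addr // LINE_SIZE != (addr + size - 1) // LINE_SIZE


aligned_64bit = banks_touched(16, 8)     # aligned 64-bit variable: one bank
unaligned_128bit = banks_touched(4, 16)  # unaligned 128-bit read: three banks
```

Two such unaligned 128-bit reads would occupy up to six banks, which matches the figure in the post; they conflict if any bank number appears in both lists.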
Nick is offline   Reply With Quote
Old 09-Jul-2012, 00:22   #17
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,437
Default

Quote:
Originally Posted by Triskaine View Post
The CPUID instruction provides information about cache size and associativity. CPU-Z readouts from a Haswell ES already leaked back in April:
http://www.chiphell.com/thread-451483-1-1.html

Haswell continues to use the Cache-Hierarchy of Sandy Bridge. No changes in size or associativity.
Hmm yes looks like it. If it's legit...
mczak is offline   Reply With Quote
Old 09-Jul-2012, 01:37   #18
Homeles
Member
 
Join Date: May 2012
Posts: 142
Default

Quote:
Originally Posted by Triskaine View Post
The CPUID instruction provides information about cache size and associativity. CPU-Z readouts from a Haswell ES already leaked back in April:
http://www.chiphell.com/thread-451483-1-1.html

Haswell continues to use the Cache-Hierarchy of Sandy Bridge. No changes in size or associativity.
I don't see why it would.
Homeles is offline   Reply With Quote
Old 09-Jul-2012, 14:38   #19
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by Homeles View Post
I don't see why it would.
Haswell is strongly assumed to support two 256-bit loads per cycle. They could use eight 16-byte cache banks, or sixteen 8-byte banks, or stick with eight 8-byte banks. Note that the first two options likely require doubling the cache line length. So I wouldn't be surprised if Haswell did have 128-byte cache lines.

We'd have to evaluate the advantages/disadvantages of each option to see which one is most likely...
Nick is offline   Reply With Quote
Old 09-Jul-2012, 18:04   #20
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,437
Default

Quote:
Originally Posted by Nick View Post
Haswell is strongly assumed to support two 256-bit loads per cycle.
I disagree with your notion that Haswell is "strongly assumed" to support two 256-bit loads per cycle. Nothing really hints that it will have improved load/store (aside from what's really necessary for gather). It could be, but I wouldn't be surprised if not, either.
mczak is offline   Reply With Quote
Old 10-Jul-2012, 04:55   #21
Homeles
Member
 
Join Date: May 2012
Posts: 142
Default

Quote:
Originally Posted by Nick View Post
Haswell is strongly assumed to support two 256-bit loads per cycle. They could use eight 16-byte cache banks, or sixteen 8-byte banks, or stick with eight 8-byte banks. Note that the first two options likely require doubling the cache line length. So I wouldn't be surprised if Haswell did have 128-byte cache lines.

We'd have to evaluate the advantages/disadvantages of each option to see which one is most likely...
Well, I hadn't really been paying attention to the rest of the thread. What I meant was that Nehalem and Sandy Bridge seem to have nailed the cache configuration: 32 KB L1 data and instruction caches, 256 KB L2, 2 MB/core L3. This seems to be the optimal setup. If they do change it though, I have no doubt it will be for the best.
Homeles is offline   Reply With Quote
Old 10-Jul-2012, 14:25   #22
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by mczak View Post
I disagree with your notion that Haswell is "strongly assumed" to support two 256-bit loads per cycle. Nothing really hints that it will have improved load/store (aside from what's really necessary for gather). It could be, but I wouldn't be surprised if not, either.
Haswell doubles both the integer and floating-point SIMD throughput, and further increases the processing rate with gather. Sandy/Ivy Bridge only supports one 256-bit read and 128-bit write per cycle. So it would be nothing short of insane to provide that much more processing power and still leave it starved for data.
Nick is offline   Reply With Quote
Old 10-Jul-2012, 15:24   #23
CRoland
Member
 
Join Date: Jan 2010
Posts: 114
Default

Quote:
Originally Posted by Nick View Post
Haswell doubles both the integer and floating-point SIMD throughput, and further increases the processing rate with gather. Sandy/Ivy Bridge only supports one 256-bit read and 128-bit write per cycle. So it would be nothing short of insane to provide that much more processing power and still leave it starved for data.
Doubled throughput is due to FMA3, right? That doesn't necessarily require more data bandwidth for the most common cases where it helps (like matrix multiplies or dot products).
CRoland is offline   Reply With Quote
Old 10-Jul-2012, 16:01   #24
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,123
Default

It would still be an improvement without using FMA. The gains from using AVX-256 on Sandy Bridge were reduced in kernels that bottlenecked on the load/store units.
There were benchmarks showing an improvement of 30% or so, when it really could have been higher if the memory ports could have supported it.

Wider accesses would hit more banks and increase the chance of a conflict if facing unaligned or irregular access.

Doubling the line length would have some effect down the pipeline on things that work at line granularity, like the prefetchers, coherency, the stages and latencies for cache fills, or Intel's TSX functionality. Research and profiling would determine where the costs and benefits cross over.
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is online now   Reply With Quote
Old 10-Jul-2012, 16:51   #25
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by CRoland View Post
Doubled throughput is due to FMA3, right? That doesn't necessarily require more data bandwidth for the most common cases where it helps (like matrix multiplies or dot products).
Sure it does. Why would processing code with many multiplications and additions almost twice as fast not require twice the bandwidth? Note again that Sandy/Ivy Bridge are already running into severe bandwidth bottlenecks, before FMA or 256-bit integer operations. And even twice the bandwidth isn't excessive. Haswell should have three 256-bit vector ALUs with three input operands. That's a peak of nine input and three output operands per cycle, so two 256-bit memory read ports and one 256-bit write port is a very safe assumption.
Nick is offline   Reply With Quote
