#1 |
|
Registered
Join Date: Dec 2011
Posts: 6
|
I have a couple of questions about how the CPU accesses the cache.
1) Why do CPUs use large cache lines rather than 8 bytes to hold a 64-bit variable?
2) When a multi-level cache is mapped, does the L1 cache map to the L2, or does it map straight to main memory?
3) I know that smaller caches have lower cycle latency. Is that latency paid once per cache line, or once per bus-width transfer? For example, if a request was for a 64-byte cache line on a 64-bit bus and the latency was 4 cycles, would it take 32 cycles to transfer the whole line or just 4 cycles? |
|
|
|
|
|
#2 |
|
Senior Member
Join Date: Nov 2004
Location: Ohio
Posts: 1,208
|
1. Caches by their nature exploit locality of data. Therefore you want to balance reading what you need against the number of accesses. Even though cache is extremely fast by memory standards, it still takes multiple cycles to read data, so the more ops you can do with the data from one read, the better. Think of it like this: <cache, alu op, cache, compare, cache, alu op, cache, alu op> (yes, I realize there may be stores etc. in there) versus reading the cache once: <cache, alu op, compare, alu op, alu op>... It's all tradeoffs, though, in cache size, read size, associativity, etc.
2. I think you are thinking about this backwards. The way a layered cache works is that if it doesn't find the data in the first layer, it looks to the next: L1, then L2, then L3, and finally main memory. I'd have to refresh myself on where a line is placed after a miss, but I think that depends on the algorithms as well. (Feel free to correct me if I'm misremembering; it's been 2-3 years since my arch class on this and I haven't been using it to keep it fresh.) Last edited by Xenus; 06-Jul-2012 at 06:39. |
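To make the "L1, then L2, then L3, finally main memory" order concrete, here is a minimal toy sketch in Python. It is not a model of any real CPU: the dictionaries standing in for caches, the 64-byte line size, and the policy of filling every level on the way back are all assumptions for illustration (as the post says, where a line lands after a miss depends on the design).

```python
LINE_SIZE = 64  # bytes per cache line (assumed for this sketch)

def lookup(address, levels, memory):
    """Probe each cache level in order and return (data, where_it_hit).

    `levels` is a list of (name, dict) pairs ordered fastest first, and
    `memory` is a dict keyed by line number. Both are hypothetical stand-ins.
    """
    line = address // LINE_SIZE
    for i, (name, cache) in enumerate(levels):
        if line in cache:
            data = cache[line]
            # Assumed policy: copy the line into every faster level that missed.
            for _, upper in levels[:i]:
                upper[line] = data
            return data, name
    # Missed everywhere: fetch from memory and fill all levels (assumed policy).
    data = memory[line]
    for _, cache in levels:
        cache[line] = data
    return data, "memory"
```

The first access to a line is served by memory; a second access to any address in the same 64-byte line then hits in L1, which is the locality effect described in point 1.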
|
|
|
|
|
#3 | |||
|
Invisible Member
Join Date: Apr 2002
Location: La-la land
Posts: 5,030
|
Quote:
Quote:
Quote:
__________________
"If I were a science teacher and a student said the Universe is 6000 years old, I would mark that answer as wrong (why? Because it is)." -Phil Plait |
|||
|
|
|
|
|
#4 | |
|
Senior Member
|
Quote:
__________________
I speak only for myself. |
|
|
|
|
|
|
#5 | ||
|
Member
Join Date: Nov 2007
Posts: 945
|
Quote:
If L1 mapped to L2 (instead of directly to memory), you would need an additional indirection. The memory address would first need to be translated (modulo) to an L2 address (this is a no-op, since L2 is mapped directly to memory addresses), and then you would need to "ask L2" where the cache line is in L1. Alternatively, you could do an extra hash lookup (but that of course costs extra as well). Quote:
8-byte cache lines would also be too small for vector registers. An AVX vector register contains 32 bytes. CPUs are very slow at accessing data that crosses cache line boundaries (multiple requests and combining are required = big stall). So 32 bytes per cache line is practically the smallest possible for a modern CPU that supports AVX. Haswell is going to double the cache line size to 128 bytes. I am sure this is done both to reduce cache (processing and memory) overhead, and to allow support for future 1024-bit AVX (128 bytes per register). And I am pretty sure memory controllers and the bus (and memory chips) work more efficiently if you stream large blocks of data instead of single 64-bit values. |
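The point about accesses that cross cache line boundaries can be checked with a little arithmetic. A rough sketch in Python follows; the function names are made up for illustration, and the only inputs are the access address, its size, and the line size.

```python
def lines_touched(address, size, line_size=64):
    """Number of cache lines covered by `size` bytes starting at `address`."""
    first = address // line_size
    last = (address + size - 1) // line_size
    return last - first + 1

def crosses_line(address, size, line_size=64):
    """True if the access spans more than one cache line."""
    return lines_touched(address, size, line_size) > 1
```

With 64-byte lines, a 32-byte AVX load at offset 0 stays inside one line, while the same load at offset 48 spans two. With hypothetical 8-byte lines, every 32-byte load would span at least four lines, which is why such small lines would be a poor fit for vector registers.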
||
|
|
|
|
|
#6 | |
|
Senior Member
|
Quote:
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
|
#7 | |
|
Member
Join Date: Nov 2007
Posts: 945
|
Quote:
|
|
|
|
|
|
|
#8 |
|
Senior Member
Join Date: Jun 2003
Posts: 2,570
|
Not to mention that your tag size would be close to or over half the cache line size. You are looking at around 32b just for the address portion of the tag, plus 3-4 bits for tag state, plus parity or ECC, given modern CPUs' physical/virtual address space capabilities.
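As a rough check on the figures in this post, the tag overhead can be worked out directly. The sketch below takes the post's numbers (about 32 address bits plus 4 state bits) as given and leaves parity/ECC out, so it slightly understates the real cost; the function name is illustrative only.

```python
def tag_overhead(line_bytes, addr_tag_bits=32, state_bits=4):
    """Tag bits as a fraction of the data bits in one cache line.

    Defaults are the rough figures quoted in the post; parity/ECC bits
    are ignored in this sketch.
    """
    data_bits = line_bytes * 8
    return (addr_tag_bits + state_bits) / data_bits

overhead_8 = tag_overhead(8)    # 36 tag bits per 64 data bits
overhead_64 = tag_overhead(64)  # 36 tag bits per 512 data bits
```

For an 8-byte line this comes to about 56% overhead, versus about 7% for a 64-byte line.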
__________________
Aaron Spink speaking for myself inc. |
|
|
|
|
|
#9 |
|
Senior Member
Join Date: Oct 2002
Posts: 2,437
|
|
|
|
|
|
|
#10 | |
|
Member
Join Date: Nov 2007
Posts: 945
|
Quote:
1024-bit (128-byte) AVX was mentioned (as a future plan) in Intel's official AVX documents, but isn't present in Haswell yet. The Haswell AVX2 documentation doesn't mention anything about 512-bit or 1024-bit operations or registers. So it's pretty certain that we won't see these extra-wide vectors until Skylake. |
|
|
|
|
|
|
#11 |
|
Senior Member
|
P4's L2 cache was organized in 128-Byte lines (two 64B lines per sector), so it's not unreasonable at least for Intel.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
#12 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,437
|
Quote:
It isn't unreasonable, but at least as long as the wider vector instructions aren't a reality I don't really see why it should change either (though I guess with gather being expected to have per-cacheline throughput penalties there'd be some benefits there at least). Well, maybe for DDR4 if that has a longer burst mode or something (since at least some later Haswell-EP/EX(?) versions might presumably support DDR4). |
|
|
|
|
|
|
#13 |
|
Member
Join Date: Aug 2011
Posts: 370
|
Actually, modern DDR3 DRAM has an 8n prefetch buffer (a burst of 8 transfers), so on a 64-bit channel every transfer from RAM is at least 64 bytes.
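The 64-byte figure is just the burst length multiplied by the channel width. A small sketch of that arithmetic, assuming a standard 64-bit memory channel (the function names are illustrative):

```python
def min_burst_bytes(bus_width_bits=64, burst_length=8):
    """Smallest amount of data one DRAM burst delivers."""
    return (bus_width_bits // 8) * burst_length

def beats_per_line(line_bytes=64, bus_width_bits=64):
    """Bus transfers needed to move one cache line (ceiling division)."""
    bytes_per_beat = bus_width_bits // 8
    return -(-line_bytes // bytes_per_beat)
```

So one DDR3 burst on a 64-bit channel lines up exactly with one 64-byte cache line, moved in 8 transfers.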
|
|
|
|
|
|
#14 | |
|
Senior Member
Join Date: Feb 2002
Posts: 2,019
|
Quote:
http://en.wikipedia.org/wiki/CPU_cac...rsus_inclusive |
|
|
|
|
|
|
#15 |
|
Junior Member
Join Date: Mar 2010
Posts: 28
|
The CPUID instruction provides information about cache size and associativity. CPU-Z readouts from a Haswell ES already leaked back in April:
http://www.chiphell.com/thread-451483-1-1.html Haswell continues to use the cache hierarchy of Sandy Bridge. No changes in size or associativity. |
|
|
|
|
|
#16 |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Intel currently has eight 8-byte banks, with each 64 byte cache line evenly split across them. So reading an aligned 64-bit variable accesses only a single bank. Sandy Bridge can sustain two unaligned 128-bit reads per cycle (involving 6 banks) if they don't cross a cache line and there's no bank conflict.
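The bank layout described here (eight 8-byte banks per 64-byte line) can be sketched as follows. This is a simplified illustration: the function names are hypothetical, and treating any shared bank between two reads as a conflict is an assumption, since real hardware rules are more subtle.

```python
BANK_BYTES = 8
LINE_BYTES = 64  # eight banks of 8 bytes each

def banks_touched(address, size):
    """List the bank indices an access covers within one cache line."""
    offset = address % LINE_BYTES
    if offset + size > LINE_BYTES:
        raise ValueError("access crosses a cache line")
    first = offset // BANK_BYTES
    last = (offset + size - 1) // BANK_BYTES
    return list(range(first, last + 1))

def has_bank_conflict(read_a, read_b):
    """Simplified assumption: two reads conflict if they share any bank."""
    return bool(set(banks_touched(*read_a)) & set(banks_touched(*read_b)))
```

An aligned 64-bit read such as `banks_touched(16, 8)` covers a single bank, while an unaligned 128-bit read such as `banks_touched(4, 16)` covers three. Two such unaligned reads at offsets 4 and 36 use six distinct banks, matching the "involving 6 banks" case in the post.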
|
|
|
|
|
|
#17 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,437
|
Quote:
|
|
|
|
|
|
|
#18 | |
|
Member
Join Date: May 2012
Posts: 142
|
Quote:
|
|
|
|
|
|
|
#19 |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Haswell is strongly assumed to support two 256-bit loads per cycle. They could use eight 16-byte cache banks, or sixteen 8-byte banks, or stick with eight 8-byte banks. Note that the first two options likely require doubling the cache line length. So I wouldn't be surprised if Haswell did have 128-byte cache lines.
We'd have to evaluate the advantages/disadvantages of each option to see which one is most likely... |
|
|
|
|
|
#20 |
|
Senior Member
Join Date: Oct 2002
Posts: 2,437
|
I disagree with your notion that Haswell is "strongly assumed" to support two 256-bit loads per cycle. Nothing really hints that it will have improved load/store (aside from what's really necessary for gather). It could, but I wouldn't be surprised if it didn't either.
|
|
|
|
|
|
#21 | |
|
Member
Join Date: May 2012
Posts: 142
|
Quote:
|
|
|
|
|
|
|
#22 |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Haswell doubles both the integer and floating-point SIMD throughput, and further increases the processing rate with gather. Sandy/Ivy Bridge only supports one 256-bit read and 128-bit write per cycle. So it would be nothing short of insane to provide that much more processing power and still leave it starved for data.
|
|
|
|
|
|
#23 | |
|
Member
Join Date: Jan 2010
Posts: 114
|
Quote:
|
|
|
|
|
|
|
#24 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,123
|
It would still be an improvement without using FMA. The gains from using AVX-256 on Sandy Bridge were reduced in kernels that bottlenecked on the load/store units.
There were benchmarks showing improvement of 30% or so, when it really could have been higher if the memory ports could have supported it. Wider accesses would hit more banks and increase the chance of a conflict if facing unaligned or irregular access. Doubling line length would have some effect down the pipeline on things that work on line granularity, like the prefetchers, coherency, the stages and latencies for cache fills, or Intel's TSX functionality; research and profiling would determine where the costs and benefits cross over.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#25 |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Sure it does. Why would processing code with many multiplications and additions almost twice as fast not require twice the bandwidth? Note again that Sandy/Ivy Bridge are already running into severe bandwidth bottlenecks, before FMA or 256-bit integer operations. And even twice the bandwidth isn't excessive. Haswell should have three 256-bit vector ALUs with three input operands. That's a peak of nine input and three output operands per cycle, so two 256-bit memory read ports and one 256-bit write port is a very safe assumption.
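The operand count in this post can be written out as a quick calculation. The sketch below takes the post's assumptions as given (three 256-bit ALUs with three inputs each, two read ports and one write port); none of these were confirmed hardware figures at the time, and the function name is illustrative.

```python
VECTOR_BYTES = 32  # one 256-bit operand

def peak_operand_traffic(alus=3, inputs_per_op=3, load_ports=2, store_ports=1):
    """Peak operand demand per cycle versus assumed memory port bandwidth."""
    inputs = alus * inputs_per_op
    outputs = alus
    return {
        "inputs_per_cycle": inputs,
        "outputs_per_cycle": outputs,
        "input_bytes_per_cycle": inputs * VECTOR_BYTES,
        "load_bytes_per_cycle": load_ports * VECTOR_BYTES,
        "store_bytes_per_cycle": store_ports * VECTOR_BYTES,
    }
```

Under those assumptions the ALUs could consume nine 32-byte inputs per cycle while two read ports deliver two, so most operands would still have to come from registers even with the doubled bandwidth.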
|
|
|
|
|