Processor cache architectures

Does anybody have a good article or resource on CPU cache architectures? Basically I want to know about buses (a la the P6 backside bus) and data flow (on an L1 miss check L2, on an L2 miss then..., and what the maximum latency for a memory access is). Also, if anyone knows a good article on DDR operation and timing, that would be appreciated as well. Thanks.
 
Is the following of any help? It's a reasonable introduction.
?

I did have a link to a page which was a "superset" in the sense that this talk was referenced, but for the moment I can't find where I've stored it!
 
Thanks sebbbi, I forgot about RWT; it answered some of my questions. That is the level of article I was looking for, but Jaguar doesn't have an L3, which is one of the things I was curious about (namely Intel's L3 slice architecture). I checked their Haswell microarchitecture article, but it glosses over the L3 since it's part of the system architecture; it talks about it as if either you have prior knowledge or they discussed it before. Guess I'll dig some more for that specifically.

Thanks once again... Roderic I thought your paper would be useful so I posted it in a thread on gamedev.net, sorry if I beat you to it.
 
Thanks once again... Roderic I thought your paper would be useful so I posted it in a thread on gamedev.net, sorry if I beat you to it.
Knowledge should be shared freely :) (And I was on holiday so didn't check emails or forums at all ^^)
 
What questions about the L3 do you have? Maybe some forum member has the answers.
Basically I was wondering about the implementation of L3s in AMD and Intel CPUs. In addition, Intel's modern L3s are hooked up as cache slices on a ring bus, and I was wondering what the latency penalty for a non-local slice is and how the tags are handled. Is the tag RAM local to the slice? Are tags replicated across slices? Stuff like that, and any tidbits on layout constraints with regard to buses.
 
Basically I was wondering about the implementation of L3s in AMD and Intel CPUs.

The systems AMD and Intel use are very different.

In addition, Intel's modern L3s are hooked up as cache slices on a ring bus, and I was wondering what the latency penalty for a non-local slice is

One cycle per hop on the ring bus, for both the request and the response. Since the ring bus is unidirectional, with requests flowing in the opposite direction, this often means the traffic has to cross the system agent/memory controller hops too.
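To put numbers on the one-cycle-per-hop rule, here is a minimal sketch. It assumes a ring with `n_stops` stops numbered 0 to n-1 and counts only ring transit, not the slice lookup itself. The function names, the stop numbering and the per-leg `direction` parameter are all my own illustration, not anything taken from Intel documentation.

```python
def ring_hops(src: int, dst: int, n_stops: int, direction: int = 1) -> int:
    """Hops from src to dst when traffic may only travel one way.

    direction=+1 means increasing stop numbers, -1 means decreasing.
    """
    if direction not in (1, -1):
        raise ValueError("direction must be +1 or -1")
    return ((dst - src) * direction) % n_stops


def round_trip_cycles(src: int, dst: int, n_stops: int,
                      req_dir: int = 1, resp_dir: int = 1,
                      cycles_per_hop: int = 1) -> int:
    """Ring transit cycles for a request to dst plus the response back to src."""
    request = ring_hops(src, dst, n_stops, req_dir)
    response = ring_hops(dst, src, n_stops, resp_dir)
    return (request + response) * cycles_per_hop


if __name__ == "__main__":
    # Hypothetical 6-stop ring: 4 core/slice stops plus 2 other agents.
    print(round_trip_cycles(1, 3, 6))               # both legs travel the same way
    print(round_trip_cycles(1, 3, 6, resp_dir=-1))  # response travels back the other way
```

If both legs travel the same way, the round trip always covers the whole ring regardless of which slice is targeted. If the response travels the other way, the cost is twice the one-way distance.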

and how the tags are handled? Is the tag RAM local to the slice? Are tags replicated across slices? Stuff like that, and any tidbits on layout constraints with regard to buses.

Tags are located on the same slice as the cache lines they cover.

Intel uses the L3 as a cache that is fully inclusive of all the levels below it. Physical addresses are striped evenly across all L3 slices; that is, a core cannot restrict itself to its local slice, and all cores access all L3 slices evenly. Each tag entry in L3 has a bit for every possible consumer of the cache.
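A rough sketch of those two ideas, striping and per-consumer tag bits, might look like the following. The modulo mapping and the 64-byte line size are assumptions made for illustration only; the actual address-to-slice function is not described in this thread and may well be a hash. `L3TagEntry` and its methods are hypothetical names.

```python
from dataclasses import dataclass

LINE_BYTES = 64  # assumed cache line size


def slice_for_address(phys_addr: int, n_slices: int) -> int:
    """Pick an L3 slice by striping consecutive cache lines across slices.

    Plain modulo striping is an assumption, not the real hardware mapping.
    """
    return (phys_addr // LINE_BYTES) % n_slices


@dataclass
class L3TagEntry:
    tag: int
    presence: int = 0  # bit i set => core i may hold a copy of the line

    def add_holder(self, core: int) -> None:
        self.presence |= 1 << core

    def drop_holder(self, core: int) -> None:
        self.presence &= ~(1 << core)

    def holders(self, n_cores: int) -> list:
        return [c for c in range(n_cores) if (self.presence >> c) & 1]


if __name__ == "__main__":
    # Consecutive lines land on consecutive slices under this assumed mapping.
    print([slice_for_address(a, 4) for a in (0, 64, 128, 192, 256)])
    entry = L3TagEntry(tag=0x1234)
    entry.add_holder(2)
    entry.add_holder(1)
    print(entry.holders(4))
```

Because the L3 is inclusive, the presence bits tell a slice which cores it has to contact for a line without broadcasting to all of them.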

For example, if core #1 wanted to read a cache line, whose address naturally resides in the L3 slice #3 and which was currently held as modified in L1 of core #2, the following would happen:
  • Core #1 sends a request upstream with the address. It will cross the system agent, then hop back down the other side until it reaches L3 #3.
  • The line is currently held in some cache, so L3 has a valid tag entry for it. That tag entry currently lists it as modified by core #2, so the data currently held locally in L3 is not valid. L3 #3 sends another request upstream, addressed to core #2.
  • Core #2 sets the line as shared and sends the dirty data downstream to L3 #3.
  • L3 #3 receives the data, updates its own copy, sets the tag bits to shared, held by core #2 and held by core #1, and sends the data downstream to core #1.
  • The data hops its way around the ring bus until it reaches core #1, where it is read into L1 and used.
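The state changes in the steps above can be sketched as a toy model. This tracks a single cache line only, ignores ring timing entirely, and omits the fetch from memory when no cache holds the line. `L3Line`, `read_from_l3` and the dictionary layout for the L1 copies are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class L3Line:
    data: int
    state: str = "invalid"  # "invalid", "shared" or "modified"
    holders: set = field(default_factory=set)  # cores flagged in the tag entry


def read_from_l3(requester: int, line: L3Line, l1: dict) -> int:
    """Serve a read of one line; l1 maps core id -> {"state", "data"}."""
    if line.state == "modified":
        (owner,) = line.holders        # a modified line has exactly one holder
        line.data = l1[owner]["data"]  # owner sends the dirty data to the slice
        l1[owner]["state"] = "shared"  # owner downgrades its copy to shared
    line.state = "shared"
    line.holders.add(requester)        # tag entry now also lists the requester
    l1[requester] = {"state": "shared", "data": line.data}
    return line.data


if __name__ == "__main__":
    # Core #2 holds the line modified; the copy in L3 slice #3 is stale.
    l1 = {2: {"state": "modified", "data": 99}}
    line = L3Line(data=7, state="modified", holders={2})
    print(read_from_l3(1, line, l1))   # core #1 gets core #2's dirty data
    print(sorted(line.holders), line.state)
```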
 