Processor cache architectures

Infinisearch · Sep 17, 2015

Does anybody have a good article or resource on CPU cache architectures? Basically I want to know about buses (à la the P6 backside bus) and data flow (if L1 misses, check L2; if L2 misses, then...; and what the maximum latency for a memory access is). Also, if anyone knows a good article on DDR operation and timing, that would be appreciated as well. Thanks.

Simon F · Sep 18, 2015

Is the following of any help? It's a reasonable introduction.

[Embedded video; link not preserved]

I did have a link to a page which was a "superset" in the sense that this talk was referenced, but for the moment I can't find where I've stored it!

Rodéric · Sep 18, 2015

"What Every Programmer Should Know About Memory" : https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

Infinisearch · Sep 20, 2015

Thank you both. I watched the video, but it doesn't have the depth I'd like. I'm in the middle of looking at the PDF.

sebbbi · Sep 21, 2015

Realworldtech has some nice CPU analysis articles.

Example (Jaguar cache hierarchy):
http://www.realworldtech.com/jaguar/6/

Infinisearch · Sep 26, 2015

Thanks sebbbi, I forgot about RWT; it answered some of my questions. That is the level of article I was looking for, but Jaguar doesn't have an L3, which is one of the things I was curious about (namely Intel's L3 slice architecture). I checked their Haswell microarchitecture article, but it glosses over the L3 since it's a part of the system architecture; it talks about it as if either you have prior knowledge or they discussed it before. Guess I'll dig some more for that specifically.

Thanks once again... Roderic I thought your paper would be useful so I posted it in a thread on gamedev.net, sorry if I beat you to it.

sebbbi · Sep 26, 2015

Infinisearch said:
I checked their Haswell microarchitecture article, but it glosses over the L3 since it's a part of the system architecture; it talks about it as if either you have prior knowledge or they discussed it before. Guess I'll dig some more for that specifically.

Read the Sandy Bridge analysis for detailed Intel L3 cache information:
http://www.realworldtech.com/sandy-bridge/8/

rapso · Sep 27, 2015

@Infinisearch
What questions about the L3 do you have? Maybe some forum member has the answers.

Simon F · Sep 28, 2015

Simon F said:
I did have a link to a page which was a "superset" in the sense that this talk was referenced, but for the moment I can't find where I've stored it!

Bingo: Link that I wanted reappeared in twitter feed: https://gist.github.com/ocornut/cb980ea183e848685a36

Rodéric · Sep 29, 2015

Infinisearch said:
Thanks once again... Roderic I thought your paper would be useful so I posted it in a thread on gamedev.net, sorry if I beat you to it.

Knowledge should be shared freely

(And I was on holiday so didn't check emails or forums at all ^^)

Infinisearch · Sep 30, 2015

rapso said:
What questions about the L3 do you have? Maybe some forum member has the answers.

Basically I was wondering about the implementation of L3s in AMD and Intel CPUs. In addition, Intel's modern L3s are hooked up as cache slices on a ring bus, and I was wondering what the latency penalty for a non-local slice is and how the tags are handled. Is the tag RAM local to the slice? Are tags replicated across slices? Stuff like that, and any tidbits on layout constraints in regards to buses.

tunafish · Oct 1, 2015

Infinisearch said:
Basically I was wondering about the implementation of L3s in AMD and Intel CPUs.

The systems AMD and Intel use are very different.

In addition, Intel's modern L3s are hooked up as cache slices on a ring bus, and I was wondering what the latency penalty for a non-local slice is

One cycle per hop in the ring bus, both on the request and on the response. Since the ring bus is unidirectional, with requests flowing in the opposite direction, this often means you need to go over the system manager/memory controller hops too.
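As a rough illustration of the hop arithmetic, here is a minimal Python sketch. It assumes one reading of the description above: requests travel in one fixed direction around the ring, responses return along the same stops in the opposite direction, and each hop costs one cycle. The function name, the stop numbering, and the six-stop ring are hypothetical and do not describe any specific Intel part.

```python
def ring_round_trip_cycles(src: int, dst: int, stops: int) -> int:
    """Ring cycles for one request from stop `src` to stop `dst` plus its response.

    Assumptions (not taken from any Intel documentation):
    - stops are numbered 0..stops-1 in the direction requests travel,
    - requests cannot take the shorter way round,
    - the response retraces the same hops in the opposite direction,
    - every hop costs one cycle.
    """
    request_hops = (dst - src) % stops
    response_hops = request_hops
    return request_hops + response_hops


# Hypothetical 6-stop ring (for example 4 cores/slices plus 2 other agents).
print(ring_round_trip_cycles(1, 3, 6))  # slice is 2 hops "ahead" of the core
print(ring_round_trip_cycles(3, 1, 6))  # slice is "behind", so the request goes the long way
```

Under these assumptions the second call is the case where the request has to pass the other agents on the ring before reaching the slice.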

and how the tags are handled. Is the tag RAM local to the slice? Are tags replicated across slices? Stuff like that, and any tidbits on layout constraints in regards to buses.

Tags are located on the same slice as the cache lines they cover.

Intel uses the L3 as a cache that is fully inclusive of all the levels below it. Physical addresses are striped evenly across all L3 slices; that is, a core cannot restrict itself to its local slice, and all cores access all L3 slices evenly. Each tag entry in L3 has a bit for every possible consumer of the cache.
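To make the striping idea concrete, here is a minimal Python sketch. The real address-to-slice mapping is not described in this thread, so a plain modulo on the cache line index is used purely as a stand-in; `LINE_BYTES` and `l3_slice_for` are hypothetical names, and the 64-byte line size is an assumption.

```python
from collections import Counter

LINE_BYTES = 64  # assumed cache line size


def l3_slice_for(phys_addr: int, num_slices: int) -> int:
    """Pick the L3 slice for a physical address.

    Stand-in mapping: consecutive cache lines go to consecutive slices.
    Real hardware may use a different function of the address bits.
    """
    line_index = phys_addr // LINE_BYTES
    return line_index % num_slices


# Walk 4096 consecutive cache lines and count how many land on each of 4 slices.
counts = Counter(l3_slice_for(line * LINE_BYTES, 4) for line in range(4096))
print(sorted(counts.items()))
```

The point of the sketch is only that a linear walk through memory touches every slice equally often, regardless of which core issues the accesses.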

For example, if core #1 wanted to read a cache line, whose address naturally resides in the L3 slice #3 and which was currently held as modified in L1 of core #2, the following would happen:

Core #1 sends a request upstream with the address. It will cross the system agent, then hop back down the other side until it reaches L3 #3.
The line is currently held in some cache, so L3 has a valid tag entry for it. That tag entry currently lists it as modified by Core #2, so the data currently held locally in L3 is not valid. L3 #3 sends another request upstream, addressed to Core #2.
Core #2 sets the line as shared and sends the dirty data downstream to L3 #3.
L3 #3 receives the data, updates its own copy, sets the tag bits to shared, held by core #2 and held by core #1, and sends the data downstream to core #1.
The data hops its way around the ring bus until it reaches core #1, where it is read into L1 and used.
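The steps above can be sketched as a toy model in Python. This is a simplified illustration and not Intel's actual protocol: `L3Line`, `read_line`, and the `l1` dictionary (core number mapped to the value that core's L1 holds for this one line) are hypothetical names, and the per-consumer tag bits are modelled as a set of core numbers.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class L3Line:
    data: int                                   # copy held in the L3 slice (may be stale)
    holders: set = field(default_factory=set)   # stands in for the per-core tag bits
    modified_by: Optional[int] = None           # core holding a dirty copy, if any


def read_line(core: int, l3: L3Line, l1: dict) -> tuple:
    """Serve a read from `core`; return (value, list of messages sent)."""
    log = [f"core {core} -> L3: read request"]
    owner = l3.modified_by
    if owner is not None:
        # The L3 copy is stale, so fetch the dirty data from the owning core.
        log.append(f"L3 -> core {owner}: request dirty line")
        l3.data = l1[owner]
        l3.modified_by = None  # line is now shared and the L3 copy is valid
        log.append(f"core {owner} -> L3: dirty data, line now shared")
    l3.holders.add(core)
    l1[core] = l3.data
    log.append(f"L3 -> core {core}: data")
    return l3.data, log


# The example from the post: core 2 holds the line modified, core 1 reads it.
l3 = L3Line(data=10, holders={2}, modified_by=2)
l1 = {2: 99}
value, log = read_line(1, l3, l1)
print(value, sorted(l3.holders))
print("\n".join(log))
```

The model ignores ring hops and timing entirely; it only tracks who holds the line and where the valid data lives after each message.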

Infinisearch · Oct 7, 2015

Thanks tunafish, I had already read the article sebbbi had posted when you posted. But thanks all the same.
