AMD Ryzen CPU Architecture for 2017

When looking at HT Assist, the property that looks difficult to me is that the non-presence of a line in the filter means it is considered uncached.
I don't know the particular implementation of the filter in AMD's processors, but a Bloom filter might have false positives while being guaranteed to never have false negatives: you might have to send an extra coherence probe, but you will never miss one.

I have no idea what kind of magic goes into maintaining the filter in a (semi-)correct state.

Cheers
 
I don't know the particular implementation of the filter in AMD's processors, but a Bloom filter might have false positives while being guaranteed to never have false negatives: you might have to send an extra coherence probe, but you will never miss one.

I have no idea what kind of magic goes into maintaining the filter in a (semi-)correct state.
The inability to remove entries conflicts with how AMD's HT Assist conserves storage: it relies on the non-presence of an entry to indicate a line is uncached, and it tracks invalidations and evictions so that only lines in a valid state occupy the filter. A Bloom filter's false-positive rate for residency would instead scale with the invalidation and eviction rate.
The cost is extra coherence probes, and it constrains a structure like HT Assist, which is set-associative and fixed-capacity.

It sounds like a Bloom filter would primarily be a tracker of line residency, which is less than what HT Assist does. False positives become a more complex topic when HT Assist also tracks MOES status, since a wrong status can cause the wrong kind of probes to be generated.
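To make the false-positive/false-negative asymmetry concrete, here is a toy Bloom-filter residency tracker. It is purely illustrative: the class, sizes, and hashing are my own assumptions, not anything from AMD's design.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter for cache-line residency (illustrative only).

    Presence may be a false positive (an extra probe is sent),
    but absence is always correct (no probe is ever missed).
    Note there is no remove(): bits can only be set, never cleared.
    """
    def __init__(self, bits=1024, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.array = [False] * bits

    def _indices(self, line_addr):
        # Derive k independent bit positions from the line address.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{line_addr}".encode()).hexdigest()
            yield int(h, 16) % self.bits

    def insert(self, line_addr):
        # Called when a remote cache fetches the line.
        for idx in self._indices(line_addr):
            self.array[idx] = True

    def may_be_cached(self, line_addr):
        # False here is definitive: no probe needed.
        # True may be a false positive: send the probe anyway.
        return all(self.array[idx] for idx in self._indices(line_addr))
```

The lack of a remove operation is exactly the mismatch with HT Assist's reliance on invalidation/eviction tracking noted above.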

Older reference:
https://www.cs.columbia.edu/~junfeng/12sp-w4118/lectures/amd.pdf

edit:
One item of note is that one of the kernel patches for Zen does mention a probe filter, although it might be for a heterogeneous shared-memory system. It's listed outside of the L3.
https://lkml.org/lkml/2016/2/16/888
 
One item of note is that one of the kernel patches for Zen does mention a probe filter, although it might be for a heterogeneous shared-memory system. It's listed outside of the L3.
https://lkml.org/lkml/2016/2/16/888
The L3 cache itself was listed as having shadow tags though, while the probe filter error is listed under "Coherent Slave", which is under the DF (data fabric).

A possible interpretation is that the DF has a master-slave arrangement: cores, accelerators, and I/O act as masters, while "coherent slaves" (which connect to the memory controllers) serve memory requests and other system functions as slaves.

In other words, Zen could possibly have two-tier coherence tracking: each L3 tracks the lines internal to its CCX, and the coherent slaves are responsible for all lines originating from them, using either probing or a sparse probe filter (still an owner pointer, perhaps?).

That said, it would be fun to know whether the 32-core Naples would be scalable beyond 2 sockets...
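As a sketch of what that two-tier split could look like at the fabric level, here is a hypothetical coherent slave holding a sparse probe filter with an owner pointer. All names and behavior are my own illustration of the directory idea, not AMD's implementation.

```python
# Hypothetical sketch: each CCX L3 handles its own internal lines,
# while a coherent slave in the data fabric keeps a sparse probe
# filter (owner pointer) for lines homed at its memory controller.

class CoherentSlave:
    def __init__(self):
        # Sparse filter: line address -> CCX id of the recorded owner.
        self.probe_filter = {}

    def read(self, line_addr, requester_ccx):
        owner = self.probe_filter.get(line_addr)
        if owner is None or owner == requester_ccx:
            # Line not cached elsewhere: serve from memory, no probe.
            action = "memory_fetch"
        else:
            # Directed probe to the one recorded owner,
            # instead of broadcasting to every CCX.
            action = f"probe_ccx_{owner}"
        # Record the new owner for the next request.
        self.probe_filter[line_addr] = requester_ccx
        return action
```

The win over broadcast probing is that a filter miss (or a hit on the requester itself) avoids probing any CCX at all, which is what would matter for scaling something like a 32-core Naples across sockets.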
 
The L3 cache itself was listed as having shadow tags though, while the probe filter error is listed under "Coherent Slave", which is under the DF (data fabric).
I noted that the probe filter entries were listed outside of the L3 subsection.
It might be a more formal description of a replacement for, or iteration of, the Onion bus.
The use of the term "shadow tags" has varied. Being some form of secondary tag location or mirror is one use case, but there are also uses relevant to a cache system that can dynamically allocate cache storage.

In other words, Zen could possibly have two-tier coherence tracking: each L3 tracks the lines internal to its CCX, and the coherent slaves are responsible for all lines originating from them, using either probing or a sparse probe filter (still an owner pointer, perhaps?).
More recent shared-memory papers discuss some form of region-based coherence (multiple lines/pages), with GPU traffic skipping the actual coherent bus as long as a probe indicates that a given region is not cached.
That actually might be a place where a Bloom filter could do well, particularly if some of the other research on time-based consistency is applied. The probability of specific lines in a region being cached can be low, and using something like timed data validity or invalidation could provide a cap on how long a given Bloom filter instantiation needs to live.
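A toy illustration of that cap, assuming a hypothetical epoch-based scheme in which cached copies expire with the epoch, so the filter can be bulk-cleared instead of supporting per-entry deletion. All names and parameters here are invented for illustration.

```python
# Sketch of "timed validity" applied to a region-residency Bloom
# filter: stale (unremovable) entries can only generate false
# positives until the epoch rolls over and the filter is cleared.

class EpochRegionFilter:
    def __init__(self, bits=256, epoch_len=1000):
        self.bits = bits
        self.epoch_len = epoch_len
        self.array = [False] * bits
        self.epoch_start = 0

    def _tick(self, now):
        # Cached copies are only valid within the current epoch, so
        # a wholesale reset is safe; no deletion support is needed.
        if now - self.epoch_start >= self.epoch_len:
            self.array = [False] * self.bits
            self.epoch_start = now

    def record(self, region, now):
        # A remote client cached something in this region.
        self._tick(now)
        self.array[hash(region) % self.bits] = True

    def may_be_cached(self, region, now):
        # False means GPU traffic can skip the coherent bus entirely.
        self._tick(now)
        return self.array[hash(region) % self.bits]
```

(A single hash is used here for brevity; a real Bloom filter would use several, as in the earlier discussion.)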

The exact responsibilities of the coherent slaves differ from those of true peer coherence clients. The security and transport violations seem to be unique to them, so a coherent slave might be a controller or interface block that serves to abstract away another device like a GPU, rather than depending on the GPU to maintain coherence.
 
More recent shared-memory papers discuss some form of region-based coherence (multiple lines/pages), with GPU traffic skipping the actual coherent bus as long as a probe indicates that a given region is not cached.
There was actually a paper (or two?) that predated that, along with a couple of patent filings, primarily discussing coarse-grained coherence with region privatisation in many-core CPUs. The technique is agnostic to what a coherent agent actually is, but it could be especially beneficial to the possible cache-hierarchy integration of GPUs.

The exact responsibilities of the coherent slaves differ from those of true peer coherence clients. The security and transport violations seem to be unique to them, so a coherent slave might be a controller or interface block that serves to abstract away another device like a GPU, rather than depending on the GPU to maintain coherence.
I meant that the coherent slaves could possibly be a "home agent" in Intel's nomenclature, rather than an interface for any agents that want to participate in or access the coherence domain. The latter would instead be a "peripheral master" or whatever follows from that idea.

ARM's CCI has a similar but reversed nomenclature: all IPs use slave interfaces, while the memory controllers are the masters.
 
IMO the key line is "incoming probe response". Unless AMD switches to source snooping (which hasn't been the case for more than a decade), I don't think anything other than the coherence managers would get probe responses, since they would be the ones issuing the probes.

Moreover, it is a bit weird for an interface to take incoming requests if it also issues "reads" (since it anticipates a "read response") and has the ability to access a probe filter. It doesn't look like a coherent interface for a non-caching IP block to me at all.
 
That makes sense in light of the CS being a data fabric block rather than a core one.
Aside from not being listed under the L3, being categorized under the DF makes it more distinct from the core.

The probe filter access ECC error is the one mention of the filter. Die shots did not appear to give a location for it, and it has a smaller set of ECC error cases than other storage arrays. Possibly the ECC error is a nested error from the L3's tag and data array ECC handling, if the filter is allocated from there?

The address and security errors are a case where I've seen more concern with non-CPU clients. Violations on the CPU side have had documented handling at the page/TLB level, which wouldn't be present outside the L3. Perhaps those are part of platform management or AMD's expanded security/encryption features.
The atomic request entry was something I wasn't sure about for the coherent side.
 
The address and security errors are a case where I've seen more concern with non-CPU clients. Violations on the CPU side have had documented handling at the page/TLB level, which wouldn't be present outside the L3. Perhaps those are part of platform management or AMD's expanded security/encryption features.
It might be Secure Encrypted Virtualization. AMD said SEV would tag all data and code with the respective VM ASID anywhere in the SoC. The data fabric and the coherence manager work in real physical addresses, so this is likely at least for protection against unintended accesses from the hypervisor, in addition to encryption.

http://amd-dev.wpengine.netdna-cdn....MD_Memory_Encryption_Whitepaper_v7-Public.pdf


 
Seems like a very early engineering sample: L3 and SMT misreported or disabled, turbo-boost probably disabled as well.
 
There's always L3. A 32-core-per-chip server CPU without L3 would not function for its intended purpose.

Yes, I know, I was quoting the details of the test.

Anyway, as kalelovil points out, this must be misreported by the test. This is an ES running at 1.44 GHz.
 
MacBookAir6,2, Intel Core i5-4210U 1400 MHz (2 cores), Mac OS X 64-bit
Single-Core Score: 12873, Multi-Core Score: 22762

System manufacturer System Product Name, Intel Core i5-6500 2901 MHz (4 cores), Windows 64-bit
Single-Core Score: 6458, Multi-Core Score: 18426

System manufacturer System Product Name, Intel Core i7-6700K 4001 MHz (4 cores), Windows 64-bit
Single-Core Score: 6447, Multi-Core Score: 21879


wtf?
It's just bad detection; it doesn't detect running clocks or anything at all.
This is my 4.4 GHz 3770K:
https://browser.geekbench.com/v4/cpu/155751

According to this:
Geekbench 4 scores are calibrated against a baseline score of 4,000 (which is the score of an Intel Core i7-6600U @ 2.60 GHz). Higher scores are better, with double the score indicating double the performance.

So Skylake is like 50% better performance per clock, which isn't that realistic.
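For what it's worth, the per-clock comparison can be eyeballed by normalizing the quoted scores to base clock. This ignores turbo entirely, so treat it as a rough check rather than a real IPC measurement.

```python
# Rough score-per-GHz normalization of the Geekbench 4 numbers
# quoted above. Base clocks only; turbo makes this approximate.
def score_per_ghz(score, mhz):
    return score / (mhz / 1000.0)

baseline = score_per_ghz(4000, 2600)   # i7-6600U calibration point
i5_6500  = score_per_ghz(6458, 2901)
i7_6700k = score_per_ghz(6447, 4001)
print(round(baseline), round(i5_6500), round(i7_6700k))
```

The spread between these per-clock numbers is dominated by turbo behavior (the 6600U and 6500 boost far above base), which may be part of why a naive per-clock figure looks unrealistic.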
 
Interesting. The clocks are reasonably high after all, especially the GPU clocks, but I suspect a lot of throttling at 65W.
 
So, do we still expect Zen to be (paper-)launched this year? Things seem pretty quiet. It seems like there should have been a few leaks.
 