The Cat cores introduced a perceptron branch predictor, which the later Bulldozer cores adopted as well. It's a straightforward neural net that learns the behavior of a branch (or a set of them) from a long branch history: the history bits are multiplied by a set of weights and summed to get a prediction. The hash likely decides which of a bank of perceptrons will be used. It has the benefit of allowing very long histories to be tracked without the exponential growth in predictor size, per number of branches evaluated in the history register, that a table-based predictor would suffer.
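As a rough illustration of the mechanism (a minimal sketch in the style of Jiménez and Lin's perceptron predictor, not AMD's actual implementation; the table size, history length, and threshold are made-up values):

```c
#include <stdint.h>

#define HISTORY_LEN 32   /* bits of global branch history (illustrative) */
#define NUM_ENTRIES 1024 /* perceptrons in the table (illustrative)      */
#define THRESHOLD   40   /* training threshold (illustrative)            */

static int8_t weights[NUM_ENTRIES][HISTORY_LEN + 1]; /* [0] is the bias */

static int8_t clamp_w(int v) { return v > 127 ? 127 : (v < -127 ? -127 : (int8_t)v); }

/* history: bit i is the outcome of the i-th most recent branch (1 = taken) */
int predict(uint64_t pc, uint32_t history, int *out_sum)
{
    int8_t *w = weights[(pc >> 2) % NUM_ENTRIES];
    int sum = w[0];                       /* bias weight */
    for (int i = 0; i < HISTORY_LEN; i++)
        sum += ((history >> i) & 1) ? w[i + 1] : -w[i + 1];
    *out_sum = sum;
    return sum >= 0;                      /* predict taken when the dot product is non-negative */
}

/* taken: 1 if the branch was actually taken, else 0 */
void train(uint64_t pc, uint32_t history, int taken, int sum)
{
    int8_t *w = weights[(pc >> 2) % NUM_ENTRIES];
    int mispredicted = (sum >= 0) != taken;
    /* Train on a misprediction, or while confidence is below threshold. */
    if (mispredicted || (sum < THRESHOLD && sum > -THRESHOLD)) {
        w[0] = clamp_w(w[0] + (taken ? 1 : -1));
        for (int i = 0; i < HISTORY_LEN; i++) {
            int agree = ((int)((history >> i) & 1) == taken);
            w[i + 1] = clamp_w(w[i + 1] + (agree ? 1 : -1));
        }
    }
}
```

Note that storage grows linearly with HISTORY_LEN (one weight per history bit), whereas a pattern-history table would need on the order of 2^HISTORY_LEN counters to exploit the same history; that is the scaling advantage described above.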
It has particular strengths and weaknesses, and per Agner Fog's testing it is fairly accurate. Nested loops are not handled especially well, and branch behavior that is highly regular but not linearly separable (example: alternately taken/not taken) cannot be perfectly learned. Possibly AMD has improved on some of these things.
The page table coalescing feature might help make structures like AMD's TLBs (particularly the small L0) more effective, and ameliorate the indexing issues that arise because the Icache still lacks the associativity needed, at 4KB page granularity, for a cache of its size to fully avoid aliasing.
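A rough sketch of the contiguity check such a feature implies (the run length of 8 and the alignment requirement on the physical side are assumptions; AMD hasn't detailed the exact conditions):

```c
#include <stdbool.h>
#include <stdint.h>

#define COALESCE_RUN 8            /* pages merged per TLB entry (illustrative) */

typedef struct {
    uint64_t pfn;                 /* physical frame number */
    uint32_t perms;               /* permission/attribute bits */
    bool     valid;
} pte_t;

/* Returns true if the aligned run of PTEs containing 'vpn' maps an
 * equally aligned, contiguous run of physical frames with identical
 * attributes -- i.e. it is eligible for one coalesced TLB entry. */
bool coalescable(const pte_t *pt, uint64_t vpn)
{
    uint64_t base = vpn & ~(uint64_t)(COALESCE_RUN - 1);
    if (!pt[base].valid || (pt[base].pfn & (COALESCE_RUN - 1)) != 0)
        return false;             /* physical run must be aligned too */
    for (uint64_t i = 1; i < COALESCE_RUN; i++) {
        if (!pt[base + i].valid ||
            pt[base + i].pfn   != pt[base].pfn + i ||
            pt[base + i].perms != pt[base].perms)
            return false;
    }
    return true;
}
```

A run that passes such a check can occupy a single TLB entry covering all eight pages, which is what would make a small L0 TLB stretch further.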
@3dilettante I listened to the Hot Chips presentation today. In terms of the exclusive L3, Michael Clark stated that they track at the CCX level what data is in each core, so I guess some sort of intra-CCX directory.
> Thanks... However I'm still curious on why it's labeled a hash, any idea?

The portion of the predictor that accumulates outcomes or learns weight values would be subdivided into separate sections/sub-arrays. Picking which parts of the overall predictor learn a given pattern involves some kind of hash, such as using parts of the branch address to select an entry, in order to better capture the behavior of nearby branches and to keep the behavior of distant and likely less relevant branches from taking the perceptron or counter bank off-pattern.
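In code terms, the "hash" is just the function that folds the branch address (and, in some designs, recent history) into a table index; a purely illustrative example, extending the perceptron sketch above:

```c
/* Fold the branch address (and optionally some history bits) into a
 * table index, so nearby branches tend to map to different entries
 * while the same branch reliably finds its own weights again.
 * Purely illustrative -- the actual hash is not disclosed. */
unsigned perceptron_index(uint64_t pc, uint32_t history)
{
    uint64_t h = (pc >> 2) ^ (pc >> 12) ^ (history & 0xFFF);
    return (unsigned)(h % NUM_ENTRIES);
}
```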
> Any idea why AMD seems to use lower associativity I-caches?

They're usually bigger at the same time, which could point to prioritizing the reduction of capacity misses, or to weighing the pros and cons of the higher complexity/delay/power/design burden of higher associativity. It was usually a small penalty even for Bulldozer, though it appears CMT, or its implementation, could suffer more in some scenarios--enough to prompt a proposed patch. The penalty for aliasing is generally minor, since it involves checking on a miss whether an alias needs to be invalidated. Zen's higher associativity than before does reduce the amount somewhat. Maybe that can be avoided entirely if it is determined that the page in question is a coalesced one. (edit: not sure if there are concerns if 4K and coalesced pages coexist)
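For concreteness, the arithmetic behind the aliasing issue, assuming Zen's published 64KB, 4-way Icache with 64-byte lines and 4KB base pages:

```c
/* Virtually indexed cache aliasing, for a 64KB, 4-way Icache with
 * 64B lines and 4KB pages (Zen's published Icache geometry). */
enum {
    WAY_SIZE  = 64 * 1024 / 4,          /* 16KB indexed per way        */
    PAGE_SIZE = 4 * 1024,
    /* Index bits above the 4KB page offset come from the virtual
     * address, so one physical line can sit at any of
     * WAY_SIZE / PAGE_SIZE set positions.                             */
    ALIASES   = WAY_SIZE / PAGE_SIZE,   /* = 4 possible aliases        */
};
/* Only at 16 ways (4KB per way) would the index fit entirely within
 * the page offset and the aliasing disappear.                         */
```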
If it's tracking specifically just the lines cached in the CCX, perhaps it's a snoop filter.
If it's CCX-level, that might explain the earlier allusion to Zen not scaling as wonderfully in multi-socket setups, since there are two separate domains that snoops need to broadcast to, or be tracked on, within the same chip.
I figured that, generally speaking, a cache twice as large with half the associativity of another would yield roughly similar performance, but with lower power, at the cost of more silicon.
> I am hardly a hardware designer, but I would be surprised by that. In my world, all the large-scale routing protocols are multi-tier: OSPF, BGP, MPLS, IS-IS, LISP, etc. In most of these the hierarchy is decoupled. Because there is (from what we know) only one exit point from a CCX, this seems ideal for a hierarchical design where the intra- and inter-CCX levels really have no idea about each other's state; the inter level only knows the addresses of what the intra level is tracking.

With a 48-bit physical address space, just knowing what addresses are being tracked makes up the majority of the storage requirement.
> There are up to 8 CCXs on a 32-core chip, so logically there already needs to be something between CCXs. Maybe they don't have a third level (inter-socket), so memory requests have to probe the other socket unnecessarily?

It's a random hypothesis about how an earlier post in this thread could be true that Zen might have poorer multi-socket scaling (assuming it's not just how most things tend to scale worse when going off-chip).
> With a 48-bit physical address space, just knowing what addresses are being tracked makes up the majority of the storage requirement.

I don't think it would be that much. The L2s in total would have 32,768 lines. The overhead is around 46+4 bits per line (assuming physically tagged and a 52-bit address space), not counting the bits for states and ECC. Given that the L3 is a "mostly exclusive" victim cache, in the general case a directory probe would happen after an L3 probe misses (or in parallel, for lower latency). For the "inclusive" lines, though, the L3 control may determine internal sharers by simply querying an intra-CCX directory that tracks only the L2 tags.
To conserve bandwidth when sending updates to the higher-level table, it may be that the specific cache line states are not tracked, but that's only a handful of bits, and one CCX has 148K lines to track (fewer if reserved lines can be exempt).
> I don't think it would be that much. The L2s in total would have 32,768 lines. [...]

My math error was in making the L2 count too small. It would be ~160K lines to track per CCX.
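Putting numbers to the correction (a back-of-the-envelope sketch using the published per-CCX cache sizes):

```c
/* Per-CCX line counts, assuming the published Zen figures of
 * 4 cores x 512KB L2 plus an 8MB shared L3, with 64-byte lines. */
enum {
    LINE      = 64,
    L2_LINES  = 4 * 512 * 1024 / LINE,   /*  32,768 lines ( 32K)  */
    L3_LINES  = 8 * 1024 * 1024 / LINE,  /* 131,072 lines (128K)  */
    CCX_LINES = L2_LINES + L3_LINES,     /* 163,840 lines (~160K) */
};
/* At roughly 50 bits of tag/state per tracked line (the ~46+4
 * figure above), a full per-CCX directory would need on the order
 * of 163,840 * 50 bits ~= 8.2Mbit ~= 1MB of storage. */
```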
> Roughly the same cache hit rate, with less power use, more silicon, and higher delay (so it clocks slightly lower or requires an additional stage).

But those few percent of hit-rate increase (with more associativity) can make a big difference, right? AFAIK higher associativity is, or can be, important for data caches; I never thought about or looked into instruction caches.
> Why not just replicate the L2 cache tags in the L3? It is only 25% more tags than the L3. Or dual-port the L2 tags (or multi-bank the L2 tags and pseudo-multiport them). That's 32K lines for 4x512KB using 64-byte lines, or 16K lines if they used sectored 128-byte lines in the L2/L3.

Replicating L2 tags means only snoop filtering for the internal L2s of a CCX. The CCX itself would still supposedly appear as a single "core" at the system level, together with the allegedly-private victim L3.
Either way, I can't see this as an impediment to performance.
Cheers
He is talking about coherence across multiple CCXs, which may or may not be across sockets. This can be done via snooping or a directory. Specifically for the directory, though, it would have to track all lines in the CCXs (including the L3s), given the stated assumptions.
> Why not just replicate the L2 cache tags in the L3?

AMD has snoop tag arrays on various caches that mirror the contents of the local arrays. The L3 is comparatively far away and is its own not-quite-exclusive coherence agent, so I would be curious to see what complexities would arise from that.
> Either way, I can't see this as an impediment to performance.

Single-chip should manage. Power consumption and the complexity/latency of the arrangement would be an area of trade-offs. Having a CCX-level structure that concerned itself only with the lines within the CCX, rather than a full directory, would be able to filter down probes that would otherwise have to hit 16 ways of L3 and 32 ways of L2 spread across the whole module.
> They could use both (as they did with Opteron): use snooping internally and a bloom filter/directory at the socket level. Snoop bandwidth scales quadratically with the number of coherency agents, but on-die bandwidth is cheap and plentiful.

Opteron used to have a unified L3, request queue, and memory controller setup that handled internal snooping. Some of the ways Zen exposes its CCX arrangement do not act as if this has been maintained.
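To spell out the quadratic-scaling claim in the quote: with $N$ coherency agents, each agent's misses must broadcast probes to the other $N-1$ agents, so aggregate snoop bandwidth grows as $N(N-1) = O(N^2)$, while a directory or probe filter replaces broadcasts with point-to-point probes to recorded sharers, trading bandwidth for storage.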
> This would allow AMD to have a basic building block for producing 1-8 CCX devices with 4 to 32 cores fairly easily, and then add a complex system agent/memory controller for multi-socket implementations.

The 8-CCX case is described as being an MCM, so there's going to be complexity in terms of inter-die connections, akin to separate sockets.
I wouldn't know about a bloom filter being used in this case; AMD's prior filter isn't described as having to deal with the false-positive aspect or the regeneration phase that an accumulative filter would have.
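For reference, both of those properties fall straight out of a Bloom filter's structure; a minimal sketch (the size and hash functions are made up, and this is not a description of any AMD hardware):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FILTER_BITS 4096  /* illustrative size */

static uint8_t filter[FILTER_BITS / 8];

/* Two cheap hashes over a line address (illustrative). */
static unsigned h1(uint64_t a) { return (unsigned)((a ^ (a >> 17)) % FILTER_BITS); }
static unsigned h2(uint64_t a) { return (unsigned)((a * 0x9E3779B97F4A7C15ull >> 40) % FILTER_BITS); }

static void set_bit(unsigned b) { filter[b / 8] |= (uint8_t)(1u << (b % 8)); }
static bool get_bit(unsigned b) { return filter[b / 8] & (1u << (b % 8)); }

void filter_insert(uint64_t addr) { set_bit(h1(addr)); set_bit(h2(addr)); }

/* May return true for an address that was never inserted (a false
 * positive, costing a spurious probe); never a false negative. */
bool filter_may_contain(uint64_t addr)
{
    return get_bit(h1(addr)) && get_bit(h2(addr));
}

/* Bits are shared between addresses, so individual entries cannot be
 * removed on eviction; the filter must periodically be regenerated. */
void filter_regenerate(void) { memset(filter, 0, sizeof filter); }
```

Neither the spurious-probe behavior nor the regeneration phase shows up in how AMD describes its existing probe filter, which is the reason for the doubt above.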