The Cat cores introduced a perceptron branch predictor, which the later Bulldozer cores adopted as well. It's a straightforward neural net that learns the behavior of a branch (or a set of them) from a long branch history: the history bits are multiplied by a set of weights and summed to get a prediction. The hash likely decides which of a bank of perceptrons will be used. It has the benefit of allowing very long histories to be tracked without the exponential growth in predictor size, per number of branches evaluated in the history register, that a table-based predictor would suffer.
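As a rough illustration of the mechanism (a minimal sketch in the style of Jiménez and Lin's perceptron predictor, not AMD's actual implementation; the table size, history length, and threshold are made-up values):

```c
#include <stdint.h>

#define HISTORY_LEN 32   /* bits of global branch history (illustrative) */
#define NUM_ENTRIES 1024 /* perceptrons in the table (illustrative)      */
#define THRESHOLD   40   /* training threshold (illustrative)            */

static int8_t weights[NUM_ENTRIES][HISTORY_LEN + 1]; /* [0] is the bias */

static int8_t clamp_w(int v) { return v > 127 ? 127 : (v < -127 ? -127 : (int8_t)v); }

/* history: bit i is the outcome of the i-th most recent branch (1 = taken) */
int predict(uint64_t pc, uint32_t history, int *out_sum)
{
    int8_t *w = weights[(pc >> 2) % NUM_ENTRIES];
    int sum = w[0];                       /* bias weight */
    for (int i = 0; i < HISTORY_LEN; i++)
        sum += ((history >> i) & 1) ? w[i + 1] : -w[i + 1];
    *out_sum = sum;
    return sum >= 0;                      /* predict taken when the dot product is non-negative */
}

/* taken: 1 if the branch was actually taken, else 0 */
void train(uint64_t pc, uint32_t history, int taken, int sum)
{
    int8_t *w = weights[(pc >> 2) % NUM_ENTRIES];
    int mispredicted = (sum >= 0) != taken;
    /* Train on a misprediction, or while confidence is below threshold. */
    if (mispredicted || (sum < THRESHOLD && sum > -THRESHOLD)) {
        w[0] = clamp_w(w[0] + (taken ? 1 : -1));
        for (int i = 0; i < HISTORY_LEN; i++) {
            int agree = ((int)((history >> i) & 1) == taken);
            w[i + 1] = clamp_w(w[i + 1] + (agree ? 1 : -1));
        }
    }
}
```

Note that storage grows linearly with HISTORY_LEN (one weight per history bit), whereas a pattern-history table would need on the order of 2^HISTORY_LEN counters to exploit the same history; that is the scaling advantage described above.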
It has particular strengths and weaknesses, and per Agner Fog's testing it is fairly accurate. Nested loops are not handled especially well, and branch behavior that is highly regular but not linearly separable (example: alternately taken/not taken) cannot be perfectly learned. Possibly AMD has improved on some of these things.
The page table coalescing feature might help make structures like AMD's TLBs (particularly the small L0) more effective, and ameliorate the indexing issues that arise because the Icache still lacks the associativity needed, at 4KB page granularity, for a cache of its size to fully avoid aliasing.
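A rough sketch of the contiguity check such a feature implies (the run length of 8 and the alignment requirement on the physical side are assumptions; AMD hasn't detailed the exact conditions):

```c
#include <stdbool.h>
#include <stdint.h>

#define COALESCE_RUN 8            /* pages merged per TLB entry (illustrative) */

typedef struct {
    uint64_t pfn;                 /* physical frame number */
    uint32_t perms;               /* permission/attribute bits */
    bool     valid;
} pte_t;

/* Returns true if the aligned run of PTEs containing 'vpn' maps an
 * equally aligned, contiguous run of physical frames with identical
 * attributes -- i.e. it is eligible for one coalesced TLB entry. */
bool coalescable(const pte_t *pt, uint64_t vpn)
{
    uint64_t base = vpn & ~(uint64_t)(COALESCE_RUN - 1);
    if (!pt[base].valid || (pt[base].pfn & (COALESCE_RUN - 1)) != 0)
        return false;             /* physical run must be aligned too */
    for (uint64_t i = 1; i < COALESCE_RUN; i++) {
        if (!pt[base + i].valid ||
            pt[base + i].pfn   != pt[base].pfn + i ||
            pt[base + i].perms != pt[base].perms)
            return false;
    }
    return true;
}
```

A run that passes such a check can occupy a single TLB entry covering all eight pages, which is what would make a small L0 TLB stretch further.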
@3dilettante I listened to the Hot Chips presentation today. In terms of the exclusive L3, Michael Clark stated that they track at the CCX level what data is in each core, so I guess some sort of intra-CCX directory.
> Thanks... However I'm still curious on why it's labeled a hash, any idea?

The portion of the predictor that accumulates outcomes or learns weight values would be subdivided into separate sections/sub-arrays. Picking which parts of the overall predictor learn a given pattern involves some kind of hash, such as using parts of the branch address to select an entry, in order to better capture the behavior of nearby branches and to keep the behavior of distant and likely less relevant branches from taking the perceptron or counter bank off-pattern.
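In code terms, the "hash" is just the function that folds the branch address (and, in some designs, recent history) into a table index; a purely illustrative example, extending the perceptron sketch above:

```c
/* Fold the branch address (and optionally some history bits) into a
 * table index, so nearby branches tend to map to different entries
 * while the same branch reliably finds its own weights again.
 * Purely illustrative -- the actual hash is not disclosed. */
unsigned perceptron_index(uint64_t pc, uint32_t history)
{
    uint64_t h = (pc >> 2) ^ (pc >> 12) ^ (history & 0xFFF);
    return (unsigned)(h % NUM_ENTRIES);
}
```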
> Any idea why AMD seems to use lower associativity I-caches?

They're usually bigger at the same time, which could point to prioritizing the reduction of capacity misses, or to weighing the pros and cons of the higher complexity/delay/power/design burden of higher associativity. It was usually a small penalty even for Bulldozer, though it appears CMT, or its implementation, could suffer more in some scenarios--enough to prompt a proposed patch. The penalty for aliasing is generally minor, since it involves checking on a miss whether an alias needs to be invalidated. Zen's higher associativity than before does reduce the amount somewhat. Maybe that can be avoided entirely if it is determined that the page in question is a coalesced one. (edit: not sure if there are concerns if 4K and coalesced pages coexist)
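For concreteness, the arithmetic behind the aliasing issue, assuming Zen's published 64KB, 4-way Icache with 64-byte lines and 4KB base pages:

```c
/* Virtually indexed cache aliasing, for a 64KB, 4-way Icache with
 * 64B lines and 4KB pages (Zen's published Icache geometry). */
enum {
    WAY_SIZE  = 64 * 1024 / 4,          /* 16KB indexed per way        */
    PAGE_SIZE = 4 * 1024,
    /* Index bits above the 4KB page offset come from the virtual
     * address, so one physical line can sit at any of
     * WAY_SIZE / PAGE_SIZE set positions.                             */
    ALIASES   = WAY_SIZE / PAGE_SIZE,   /* = 4 possible aliases        */
};
/* Only at 16 ways (4KB per way) would the index fit entirely within
 * the page offset and the aliasing disappear.                         */
```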
If it's tracking specifically just the lines cached in the CCX, perhaps it's a snoop filter.
If it's CCX-level, that might explain the earlier allusion to Zen not scaling as wonderfully in multi-socket setups, since there are two separate domains that snoops need to broadcast to, or be tracked on, within the same chip.
I figured that, generally speaking, a cache twice as large with half the associativity of another would yield roughly similar performance, but with lower power, at the cost of more silicon.
> I am hardly a hardware designer, but I would be surprised by that. In my world, all the large-scale routing protocols are multi-tier: OSPF, BGP, MPLS, IS-IS, LISP, etc. In most of these the hierarchy is decoupled. Because there is (from what we know) only one exit point from a CCX, this seems ideal for a hierarchical design where the intra- and inter-CCX levels really have no idea about each other's state; the inter level only knows the addresses of what the intra level is tracking.

With a 48-bit physical address space, just knowing what addresses are being tracked makes up the majority of the storage requirement.
> There are up to 8 CCXs on a 32-core chip, so logically there already needs to be something between CCXs. Maybe they don't have a third level (inter-socket), so memory requests have to probe the other socket unnecessarily?

It's a random hypothesis about how an earlier post in this thread could be true that Zen might have poorer multi-socket scaling (assuming it's not just how most things tend to scale worse when going off-chip).
> With a 48-bit physical address space, just knowing what addresses are being tracked makes up the majority of the storage requirement.

I don't think it would be that much. The L2s in total would have 32,768 lines. The overhead is around 46+4 bits per line (assuming physically tagged and a 52-bit address space), not counting the bits for states and ECC. Given that the L3 is a "mostly exclusive" victim cache, in the general case a directory probe would happen after an L3 probe misses (or in parallel, for lower latency). For the "inclusive" lines, though, the L3 control may determine internal sharers by simply querying an intra-CCX directory that tracks only the L2 tags.
To conserve bandwidth when sending updates to the higher-level table, it may be that the specific cache line states are not tracked, but that's only a handful of bits, and one CCX has 148K lines to track (fewer if reserved lines can be exempt).
> I don't think it would be that much. The L2s in total would have 32,768 lines. [...]

My math error was in making the L2 count too small. It would be ~160K lines to track per CCX.
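Putting numbers to the correction (a back-of-the-envelope sketch using the published per-CCX cache sizes):

```c
/* Per-CCX line counts, assuming the published Zen figures of
 * 4 cores x 512KB L2 plus an 8MB shared L3, with 64-byte lines. */
enum {
    LINE      = 64,
    L2_LINES  = 4 * 512 * 1024 / LINE,   /*  32,768 lines ( 32K)  */
    L3_LINES  = 8 * 1024 * 1024 / LINE,  /* 131,072 lines (128K)  */
    CCX_LINES = L2_LINES + L3_LINES,     /* 163,840 lines (~160K) */
};
/* At roughly 50 bits of tag/state per tracked line (the ~46+4
 * figure above), a full per-CCX directory would need on the order
 * of 163,840 * 50 bits ~= 8.2Mbit ~= 1MB of storage. */
```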
> Roughly the same cache hit rate, with less power use, more silicon, and higher delay (so it clocks slightly lower or requires an additional stage).

But those few percent of hit-rate increase (with more associativity) can make a big difference, right? AFAIK higher associativity is, or can be, important for data caches; I never thought about or looked into instruction caches.
> Why not just replicate the L2 cache tags in the L3? It is only 25% more tags than the L3. Or dual-port the L2 tags (or multi-bank the L2 tags and pseudo-multiport them). That's 32K lines for 4x512KB using 64-byte lines, or 16K lines if they used sectored 128-byte lines in the L2/L3.

Replicating L2 tags means only snoop filtering for the internal L2s of a CCX. The CCX itself would still supposedly appear as a single "core" at the system level, together with the allegedly-private victim L3.
Either way, I can't see this as an impediment to performance.
Cheers
He is talking about coherence across multiple CCXs, which may or may not be across sockets. This can be done via snooping or a directory. Specifically for the directory, though, it would have to track all lines in the CCXs (including the L3s), given the stated assumptions.
> Why not just replicate the L2 cache tags in the L3?

AMD has snoop tag arrays on various caches that mirror the contents of the local arrays. The L3 is comparatively far away and is its own not-quite-exclusive coherence agent, so I would be curious to see what complexities would arise from that.
> Either way, I can't see this as an impediment to performance.

Single-chip should manage. Power consumption and the complexity/latency of the arrangement would be an area of trade-offs. Having a CCX-level structure that concerned itself only with the lines within the CCX, rather than a full directory, would be able to filter down probes that would otherwise have to hit 16 ways of L3 and 32 ways of L2 spread across the whole module.
> They could use both (as they did with Opteron): use snooping internally and a bloom filter/directory at the socket level. Snoop bandwidth scales quadratically with the number of coherency agents, but on-die bandwidth is cheap and plentiful.

Opteron used to have a unified L3, request queue, and memory controller setup that handled internal snooping. Some of the ways Zen exposes its CCX arrangement do not act as if this has been maintained.
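To spell out the quadratic-scaling claim in the quote: with $N$ coherency agents, each agent's misses must broadcast probes to the other $N-1$ agents, so aggregate snoop bandwidth grows as $N(N-1) = O(N^2)$, while a directory or probe filter replaces broadcasts with point-to-point probes to recorded sharers, trading bandwidth for storage.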
> This would allow AMD to have a basic building block for producing 1-8 CCX devices with 4 to 32 cores fairly easily, and then add a complex system agent/memory controller for multi-socket implementations.

The 8-CCX case is described as being an MCM, so there's going to be complexity in terms of inter-die connections, akin to separate sockets.
I wouldn't know about a bloom filter being used in this case; AMD's prior filter isn't described as having to deal with the false-positive aspect or the regeneration phase that an accumulative filter would have.
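For reference, both of those properties fall straight out of a Bloom filter's structure; a minimal sketch (the size and hash functions are made up, and this is not a description of any AMD hardware):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FILTER_BITS 4096  /* illustrative size */

static uint8_t filter[FILTER_BITS / 8];

/* Two cheap hashes over a line address (illustrative). */
static unsigned h1(uint64_t a) { return (unsigned)((a ^ (a >> 17)) % FILTER_BITS); }
static unsigned h2(uint64_t a) { return (unsigned)((a * 0x9E3779B97F4A7C15ull >> 40) % FILTER_BITS); }

static void set_bit(unsigned b) { filter[b / 8] |= (uint8_t)(1u << (b % 8)); }
static bool get_bit(unsigned b) { return filter[b / 8] & (1u << (b % 8)); }

void filter_insert(uint64_t addr) { set_bit(h1(addr)); set_bit(h2(addr)); }

/* May return true for an address that was never inserted (a false
 * positive, costing a spurious probe); never a false negative. */
bool filter_may_contain(uint64_t addr)
{
    return get_bit(h1(addr)) && get_bit(h2(addr));
}

/* Bits are shared between addresses, so individual entries cannot be
 * removed on eviction; the filter must periodically be regenerated. */
void filter_regenerate(void) { memset(filter, 0, sizeof filter); }
```

Neither the spurious-probe behavior nor the regeneration phase shows up in how AMD describes its existing probe filter, which is the reason for the doubt above.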