AMD Ryzen CPU Architecture for 2017

It all looks pretty good to me, at worst AMD should have a sound, solid design that they should be able to refine over time without having to "fight" it, as they did with Bulldozer.

Those discrete scheduling queues seem a bit odd, though.
 
Bulldozer was built on the concept that integer instructions are more common and important than floating point; what changed now?

The ratio of FP to INT uops between Zen and Bulldozer is mostly the same. Zen has 4 INT and 4 FP pipes per core, whereas Bulldozer's narrow integer clusters and shared FPU wound up halving both per core.

The originating idea for BD was more ambitious, with what we know as BD "cores" actually being inner clusters of execution inside of a larger overarching core's scheduler. More complex speculation, advanced multi-threading, and a much tighter critical L1D+EXE loop for very high clocks were the original motivations. Area savings, such as they were, would have been a secondary benefit.
Physical reality gutted the high clocks, and the Bulldozer line did nothing interesting, or at least nothing good, with the other motivations.

The second-order upsides of shared resources turned out to be about as compelling as BD itself was found wanting, which is to say not very.

All that aside, AMD was outplayed pretty seriously on the FP front, given the SSE5/FMA4/FMA3/AVX+noFMA mess.
Some items BD had were nice, but in general Intel had more influence and eventually what proved to be a cleaner and more extensible vector solution.

It all looks pretty good to me, at worst AMD should have a sound, solid design that they should be able to refine over time without having to "fight" it, as they did with Bulldozer.

Those discrete scheduling queues seem a bit odd, though.
At least initially, much of this seems to be on the order of a Haswell core, with SMT potentially closer in complexity to Sandy Bridge. AVX, LS width, and memory speculation are the most notable areas where AMD has not shown features that bring it into parity with Haswell/Broadwell. That level of software equivalence is what I've seen mooted as a rough design goal, in order to piggy-back on software generally targeting Intel.

The split integer schedulers look like a possible power optimization. The IEUs are pretty generic in their support, with only certain instructions requiring specific lanes to be active. The znver1 patch indicated IE1 is the sole integer multiplier, and IE2 is the sole integer divider. Call instructions can choose between lanes 0 and 3. There's almost nothing referenced where the integer units need to work in concert like in the FPU.
Integer instructions that interact with the FP unit reside on 0 and 2.
A highly serial integer-only workload with little division could potentially ignore half the lanes, or more (see the sketch below).
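
To make the lane restrictions concrete, here's a toy model in C of how a uop class could map onto permitted lanes. The uop class names and the reading of "choose between lanes 0 and 3" as the pair IE0/IE3 are my assumptions from the znver1 patch description above, not AMD's actual encoding:

```c
#include <stdint.h>

enum uop_class { UOP_ALU, UOP_IMUL, UOP_IDIV, UOP_CALL, UOP_FP_XFER };

/* bit n set = lane IEn may execute this class of uop */
static const uint8_t lane_mask[] = {
    [UOP_ALU]     = 0xF,                  /* generic integer ops: any of IE0-IE3 */
    [UOP_IMUL]    = 1 << 1,               /* integer multiply: IE1 only */
    [UOP_IDIV]    = 1 << 2,               /* integer divide: IE2 only */
    [UOP_CALL]    = (1 << 0) | (1 << 3),  /* calls: IE0 or IE3 (my reading) */
    [UOP_FP_XFER] = (1 << 0) | (1 << 2),  /* int ops touching the FP unit: IE0/IE2 */
};

/* pick the lowest-numbered free lane allowed for this uop, or -1 if none */
int pick_lane(enum uop_class c, uint8_t busy) {
    uint8_t ok = lane_mask[c] & (uint8_t)~busy;
    for (int lane = 0; lane < 4; lane++)
        if (ok & (1u << lane))
            return lane;
    return -1;    /* all permitted lanes are busy this cycle */
}
```

On a model like this, a serial ALU-only stream with no multiplies or divides never needs IE1 or IE2 active, which is exactly the kind of stretch the split schedulers could detect and gate off.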

Splitting the schedulers may enable the more aggressive clock gating the IEUs have, and the core may be able to determine that specific units are less necessary for a stretch of cycles and gate them off. Perhaps other optimizations are possible if the integer cluster has banked resources for forwarding and the register file, where portions could be turned off if the schedulers detect that nothing their units need is valid, whether due to a mispredict or because the renamed registers they are linked to have been rendered obsolete.

The FPU is not really as capable of doing this, due to its heavier specialization and because multiple operations use multiple pipes at the same time.
 
https://www.computerbase.de/2016-08/amd-zen-architektur/

Slides inside; nothing about TSX, but there isn't a lot of specific detail either.
Surprisingly, unlike what was being predicted, the L3 cache is mostly exclusive of the L2 (as it was in Bulldozer). How the L3 is going to work among multiple complexes and chips is also uncertain, though.

Despite the L3 being depicted as local to a core complex, it is still possible that each core complex might assume responsibility as a home agent for a memory channel and partition its own L3 for a directory.
 
Surprisingly, unlike what was being predicted, the L3 cache is mostly exclusive of the L2 (as it was in Bulldozer). How the L3 is going to work among multiple complexes and chips is also uncertain, though.

Despite the L3 being depicted as local to a core complex, it is still possible that each core complex might assume responsibility as a home agent for a memory channel and partition its own L3 for a directory.

The non-inclusive nature of the L3 means it cannot serve as a snoop filter as naturally as the LLC does for Intel. At least the L2 is inclusive of the L1, lest that number rise even higher.
It would seem to have power-efficiency implications if the L2 needs to be concerned with as much broadcast traffic as the L3 does. The L3's interleaving basically means each slice needs to worry about 1/4 of the snoop traffic, whereas the L2s need to worry about all of it, all the time (effectively making snoops 4-5X worse, depending on whether they are in-cluster or remote), unless something is maintained outside of them to cut the traffic down.
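
A quick back-of-envelope of that 4X figure, with an entirely made-up probe rate (the 4-slice interleave is from the slides; the rest is assumption):

```c
#include <stdio.h>

int main(void) {
    double probes    = 1e6;         /* hypothetical probes/sec on the fabric */
    double per_slice = probes / 4;  /* address-interleaved across 4 L3 slices */
    double per_l2    = probes;      /* an unfiltered L2 must observe them all */
    printf("per L3 slice: %.0f, per L2: %.0f (%.0fx a slice)\n",
           per_slice, per_l2, per_l2 / per_slice);
    return 0;
}
```

Remote probes from another complex or socket would sit on top of this, which is presumably where the 5X end of the range comes from.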
Also, if some variation of TSX were in place, it would not have an inclusive L3 to serve as a backing store for rollback or as a means of giving breathing room for transaction management when the L2 needs to be globally visible.

Comparisons that likened Zen's complex to a Jaguar module would be off-base in this case. Jaguar had a true LLC in the form of an L2 whose interface handled coherence, which Zen does not have with private L2s and an exclusive L3. Also unclear is what the "mostly exclusive" arrangement is. If it's like AMD's previous CPUs, there's a heuristic for keeping lines that have a history of being shared between cores even if it breaks exclusivity.

Some other possibly interesting items in the slides:
The ucode ROM is shown to be after the uop queue, which is different from Haswell, if the Realworldtech article on that architecture still applies. That would prevent pollution of the caches and queues by a uop stream, and might mean the decoders are not totally blocked by a vector-path instruction (the znver1 patch indicated all decoders were blocked, but with a note to revisit in the future). One other possible optimization there is to feed the ucode ROM output to a subset of integer schedulers and/or set a checkpoint for that instruction's stream of uops. There might be more predictability in what a vector-path instruction needs in terms of scheduling, and this might reduce the impact of more complex string operations, or maybe a TSX/gather/scatter instruction.
One minor tidbit not mentioned in the slides but mentioned elsewhere is an instruction that zeroes out a whole cache line. What decode path that goes into, and whether this instruction could point to an optimization in the LS or cache controllers, is unknown.

Another feature is that the decode block maintains a load/store memory file, which might just be for the stack engine (unclear if there is more to it going by the slide).
The Icache fetch bandwidth itself is one area where AMD remains wider than Haswell.
 
Yes, the slides are very core- and flow-oriented. To me it's all about showing that they have made a high-performance core without giving away much more. Nothing about the southbridge, the northbridge, or even how the CCXs connect on-chip or across an interposer; nothing on on-board 10GbE, any accelerators (if there are any), or how inter-socket comms work. It will be interesting to see how snooping is handled, but at 32 cores per P and up to 2P you would think they have to have something. Maybe something along the lines of the snoop cache/directory AMD has on current HT links, maybe a distributed directory, or a hierarchical directory (a directory intra-CCX and a directory inter-CCX)?
 
So many boxes, arrows & names :runaway:
I used to try to understand what they mean but I'm kinda beyond that nowadays :???:
Need someone to make a proper comparison for me
 
The non-inclusive nature of the L3 means it cannot serve as a snoop filter as naturally as the LLC does for Intel. At least the L2 is inclusive of the L1, lest that number rise even higher.
AMD already dedicated part of the L3 as a "probe filter" starting with their 6-core K10 processors (Thuban), when used in SMP configurations. I think it was called HT Assist or something like it.
 
Woohoo! Cache line zero (CLZERO) instruction. That was really useful on PPC. Perfect for initializing temp memory (no need to read old crap from memory into cache that previously existed in those locations). Hopefully Intel adopts it as well.

Might this be useful in an OS kernel or a hypervisor?
So that you get some benefits even if your software doesn't support it at large.
 
Regarding the Blender benchmark.
I cannot imagine that was running with AVX.
The whole Zen SIMD is designed for 128 bit (load/store/compute). Compare this to 256 bit for Intel.
Would AMD really have been running with SSEx (or even without SIMD)?
The rendering of the simple scene (processor package) certainly didn't seem fast to me.
 
Might this be useful in an OS kernel or a hypervisor?
So that you get some benefits even if your software doesn't support it at large.
The OS could use this instruction in the virtual memory allocator. It needs to clear memory pages before it gives them to another process (for security purposes). But I am unsure whether it would be a win compared to using SSE streaming writes to zero the same area; streaming writes do not pollute caches.

The purpose of this instruction is to init a cache line to zero in order to access scratch memory quickly, with no cache-miss latency. Maybe compilers could use this to init a newly allocated object's memory to zero before running the constructor, to ensure the region is in cache before being accessed (no cache miss on read). But the cache-line alignment requirement makes it hard to use this instruction automatically.
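
For illustration, a hedged sketch in C of both approaches discussed above: a CLZERO loop (the instruction takes its address implicitly in rAX and zeroes the containing 64-byte line) versus SSE2 non-temporal stores. The inline-asm form is my assumption of how it would be emitted before compiler intrinsics land:

```c
#include <stddef.h>
#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */

#define LINE 64          /* CLZERO operates on a whole 64-byte line */

/* buf must be 64-byte aligned and len a multiple of 64 */
static void zero_with_clzero(void *buf, size_t len) {
    for (size_t off = 0; off < len; off += LINE) {
        void *line = (char *)buf + off;
        /* CLZERO takes its target address implicitly in rAX */
        __asm__ volatile("clzero" : : "a"(line) : "memory");
    }
}

/* the streaming-store alternative mentioned above: zeroes a region
 * with non-temporal writes that do not pollute the caches */
static void zero_with_nt_stores(void *buf, size_t len) {
    __m128i z = _mm_setzero_si128();
    for (size_t off = 0; off < len; off += sizeof(__m128i))
        _mm_stream_si128((__m128i *)((char *)buf + off), z);
    _mm_sfence();        /* order the non-temporal stores */
}
```

Whether CLZERO actually allocates the zeroed line in cache (like PPC's dcbz) or behaves more like a non-temporal store is exactly the open question in the posts above; that choice decides which of the two routines it ends up resembling.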
 
Regarding the Blender benchmark.
I cannot imagine that was running with AVX.
The whole Zen SIMD is designed for 128 bit (load/store/compute). Compare this to 256 bit for Intel.
Would AMD really have been running with SSEx (or even without SIMD)?
The rendering of the simple scene (processor package) certainly didn't seem fast to me.

The workload was made for a typical 50-second render; you just increase the number of samples to get there.

The test was to render a mockup of a Zen based desktop CPU, with an effective workload of 50 seconds for these chips

You can have a simple cube with a simple shader and one light, and just put 10K samples on it...

The question could be why use Blender. In reality, AMD at this time is well involved in the development of and contributions to it (some projects, like the release of version 2.8, are directly developed in collaboration with AMD or financed by them).

It's one "benchmark" (if we can call it a benchmark; I'd say it is more an example), so I would not read too much into it.
 
Woohoo! Cache line zero (CLZERO) instruction. That was really useful on PPC. Perfect for initializing temp memory (no need to read old crap from memory into cache that previously existed in those locations). Hopefully Intel adopts it as well.

This came up in some GCC patches a while back. One possibility is that this is a comparatively low-cost way to get some of the benefits that very wide AVX has for memory allocation, without an FPU and memory pipeline that could actually natively support wide AVX. One possible optimization is that this sets a small number of flags rather than feeding 64 bytes of zeroes through the FPU and LS. I have not seen details on whether it is microcoded, or whether it uses the write-combining path to the cache.

late edit: Another possibility is that this might enable getting the benefit of wide AVX for specific initialization routines without turning most of the FPU on at all.


On slide eight there seems to be something called a Hash Perceptron; any idea what this does?
The Cat cores introduced a perceptron branch predictor, which later BD cores took in as well. It's a straightforward neural net that learns the behavior of a branch (or a set of them) from a long branch history: the history and a set of weights are summed together to get a prediction. The hash likely decides which of a bank of perceptrons will be used. It has the benefit of allowing very long histories to be tracked without the exponential growth in predictor size per number of branches evaluated in the history register.

It has certain weaknesses and strengths, and per Agner Fog's testing it is pretty accurate. Nested loops have not been handled too well, and branch behavior that is highly regular but not linearly separable (for example, a branch whose outcome is the XOR of two earlier outcomes) cannot be perfectly learned. Possibly, AMD has improved on some of these things. A minimal sketch of the scheme follows.
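
For reference, a minimal hashed-perceptron sketch in C along the lines described above; the table size, history length, threshold, and hash are illustrative assumptions, not AMD's parameters:

```c
#include <stdint.h>
#include <stdbool.h>

#define HIST_LEN   24    /* global-history bits consulted (illustrative) */
#define TABLE_SIZE 1024  /* number of weight vectors (illustrative) */
#define THETA      ((int)(1.93 * HIST_LEN + 14))  /* classic training threshold */

static int8_t   weights[TABLE_SIZE][HIST_LEN + 1];  /* [0] is the bias weight */
static uint64_t ghist;   /* bit i = outcome of the branch i steps back */

static unsigned hash_pc(uint64_t pc) {
    /* toy hash folding PC and history to pick a perceptron */
    return (unsigned)(((pc >> 2) ^ ghist) % TABLE_SIZE);
}

static int8_t sat_add(int8_t w, int d) {
    int v = w + d;       /* saturate to the int8_t weight range */
    return (int8_t)(v > 127 ? 127 : (v < -128 ? -128 : v));
}

static int dot(unsigned idx) {
    int sum = weights[idx][0];   /* bias term */
    for (int i = 0; i < HIST_LEN; i++)
        sum += ((ghist >> i) & 1) ? weights[idx][i + 1] : -weights[idx][i + 1];
    return sum;
}

bool predict(uint64_t pc) {
    return dot(hash_pc(pc)) >= 0;    /* taken if the weighted sum is >= 0 */
}

void train(uint64_t pc, bool taken) {
    unsigned idx = hash_pc(pc);
    int sum = dot(idx);
    /* update on a mispredict, or whenever the sum is below the threshold */
    if ((sum >= 0) != taken || (sum < THETA && sum > -THETA)) {
        weights[idx][0] = sat_add(weights[idx][0], taken ? 1 : -1);
        for (int i = 0; i < HIST_LEN; i++) {
            bool hist_bit = (ghist >> i) & 1;
            /* strengthen weights that agree with the outcome, weaken the rest */
            weights[idx][i + 1] =
                sat_add(weights[idx][i + 1], hist_bit == taken ? 1 : -1);
        }
    }
    ghist = (ghist << 1) | (taken ? 1u : 0u);   /* shift the outcome in */
}
```

The key property mentioned above falls out of the structure: storage grows linearly with HIST_LEN (one weight per history bit) instead of exponentially, as it would for a table indexed by the full history. And because the prediction is a single linear threshold over the history bits, an XOR-style pattern has no weight assignment that always gets it right.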
 
The page table coalescing feature might be useful for making items like AMD's TLBs (particularly its small L0) more effective, and for ameliorating the indexing issues that result from its Icache still not having the associativity needed, for a cache of its size with 4KB page granularity, to fully avoid aliasing.

I thought I had seen some discussion a long time ago about combining sequential pages into a larger entry, possibly in the context of GPUs, though.
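
As a sketch of what such coalescing might check, assuming a scheme like the one discussed for GPUs (the 4-page group size is a guess, not a disclosed Zen parameter):

```c
#include <stdbool.h>
#include <stdint.h>

#define GROUP 4    /* 4KB pages coalesced per TLB entry (assumed) */

/* pfns[] holds the physical frame numbers of GROUP virtually
 * consecutive, group-aligned 4KB pages; returns true if one
 * coalesced TLB entry could cover them all */
bool can_coalesce(const uint64_t pfns[GROUP]) {
    if (pfns[0] % GROUP != 0)    /* base frame must be group-aligned */
        return false;
    for (int i = 1; i < GROUP; i++)
        if (pfns[i] != pfns[0] + (uint64_t)i)   /* frames must be contiguous */
            return false;
    return true;
}
```

If the check passes, a single entry plus a small size hint can translate all four pages, which is how a handful of entries (like the small L0) could stretch further.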
 
@3dilettante I listened to the Hot Chips presentation today. In terms of the exclusive L3, Michael Clark stated that they track at the CCX level what data is in each core, so I guess some sort of intra-CCX directory.
 