AMD RyZen CPU Architecture for 2017

That's a pretty nice price. But remember that 14 cores at 2.0 GHz have exactly the same maximum aggregate throughput as 8 cores at 3.5 GHz (same architecture), and that only holds in perfectly multithreaded scenarios (all 28 threads 100% utilized). So don't expect it to match a modern 3.5 GHz 8-core CPU in many applications.
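(Quick arithmetic check: 14 cores x 2.0 GHz = 28 core-GHz of aggregate clock, and 8 cores x 3.5 GHz = 28 core-GHz as well, assuming identical per-clock throughput.)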

2 GHz is the base clock. As you'll see in the CPU-Z screenshot on the page, the CPU's maximum boost clock is actually 3 GHz.
Boost clocks depend on temperature and TDP, and on an X99 board you can configure the BIOS to allow extra TDP, making sure the cores can always run at the maximum boost clocks (given an adequate cooler, of course).


Prices for the models that actually clock that low can get ridiculous, though.
 
AMD seems pretty keen to hit 7nm with Zen 2; their roadmap has Zen 2 somewhere around 2018 on 7nm. I wonder how that is going to shape up. I have very little faith that GlobalFoundries can get a 7nm node up and running for high-end CPUs by 2018. I wonder if there will be a Zen redesign on 14nm if the 7nm is delayed.
 
I expect a Zen refresh on a 14nm+ process at the beginning of 2018, Zen 2 at the beginning of 2019 and Zen 3 at the beginning of 2020. I don't believe that the cycles will be shorter than ~12 months.
 
Ryzen has twice as much I$, which might make a difference when you have two contexts stomping around. It also has twice the L2 cache and, just as important, twice the associativity; Skylake/Kaby Lake's L2 is only four-way associative. Ideally you want at least three ways per context for when the code, stack and heap segments alias, or you may end up evicting hot cache lines because you run out of associativity ways.
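To make the running-out-of-ways point concrete, here's a minimal C sketch against Skylake's 256 KB, 4-way L2 (64 B lines, hence 1024 sets and a 64 KB stride between addresses sharing a set); the stride and line counts here are my assumptions for illustration:

    #include <stddef.h>

    #define STRIDE (64 * 1024)  /* 1024 sets * 64 B lines           */
    #define NLINES 6            /* > 4 ways -> conflict misses      */

    /* Touch more lines in one set than there are ways: with 6 hot
     * lines in a 4-way set, something is evicted on every pass.
     * Caller supplies a buffer of at least NLINES * STRIDE bytes. */
    long sum_conflicting(const volatile char *buf, long iters)
    {
        long sum = 0;
        for (long i = 0; i < iters; i++)
            for (int j = 0; j < NLINES; j++)
                sum += buf[(size_t)j * STRIDE]; /* all map to one set */
        return sum;
    }

Run it with NLINES = 4 versus 6 and the conflict misses show up directly; two SMT contexts effectively halve the ways each one gets.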
I'm curious what the implications may be for Skylake-X, with some of the early reporting/rumors indicating that its L2 leapfrogs Zen's with 1 MB capacity. That would make sense with its expanded vector capability. Usually Intel would increase the number of ways as it increases capacity, although one complication is that the capacity and associativity of the per-core caches raise the burden on an inclusive L3, which seems unusually small per core.

The aggregate scheduling resources are also bigger in Ryzen than in Skylake/Kaby Lake, something that might make a difference once you're limited by load/store throughput and have to schedule around extended latencies. Also, according to Agner Fog, execution throughput, AVX instructions excepted, is higher for Ryzen than for any Intel core.
Ryzen's integer schedulers are segmented, and are individually relatively shallow compared to how Intel describes its unified scheduler. That's not to say that Intel's is necessarily without some kind of internal subdivision, however.
While we do not know the exact limitations of Zen's schedulers or its mapping stage, it would seem more likely to hit transient stalls on an unlucky confluence of dependences/hazards on a 14-entry queue within a window of 192 versus Skylake's 97-entry scheduler out of a window of 224.

On the other hand, a balanced SMT load mostly doubles the number of ops available before in-thread dependences can exhaust a scheduler, and might be more readily spread out between schedulers with the exception of the few operations like MUL and DIV that are not replicated across multiple ports.
More schedulers might also give more flexibility in forwarding and operand scheduling versus the higher costs of doing so in a unified manner, at least in the SMT case where half the operations' independence is trivially proven.
Agner didn't mention anything about AMD's segmented scheduling, which could mean it doesn't matter that much or his testing did not try to push it.

For single-threaded code, Skylake's deeper scheduler could go longer, and I am not sure I'd bet against its branch predictor still being better than Zen's at keeping more valid uops in flight for a single thread.
If Zen's predictor is provisioned well-enough, it's possible that cutting the distance of speculation in half per-thread might hide more of the diminishing returns even with marginally weaker predictors.
 
I'm curious what the implications may be for Skylake-X, with some of the early reporting/rumors indicating that its L2 leapfrogs Zen's with 1 MB capacity. That would make sense with its expanded vector capability. Usually Intel would increase the number of ways as it increases capacity, although one complication is that the capacity and associativity of the per-core caches raise the burden on an inclusive L3, which seems unusually small per core.

My bet is Intel is moving to a non-inclusive L3. Inclusive caches don't scale well with high core counts: power goes up with N^2 and latency with N, since the shared inclusive cache has to grow with the core count while every core is hitting it. Maybe a memory-side cache (like the L4 in Iris Pro) rather than a strict victim cache; that way the integrated graphics might better utilize it.

Ryzen's integer schedulers are segmented, and are individually relatively shallow compared to how Intel describes its unified scheduler. That's not to say that Intel's is necessarily without some kind of internal subdivision, however.
While we do not know the exact limitations of Zen's schedulers or its mapping stage, it would seem more likely to hit transient stalls on an unlucky confluence of dependences/hazards on a 14-entry queue within a window of 192 versus Skylake's 97-entry scheduler out of a window of 224.

Ryzen's six 14-entry scheduling queues are just for integer instructions, though. The FPU has its own scheduling apparatus, probably a lot shallower, since it doesn't have to deal with the highly variable latencies of memory accesses and just has to handle execution latencies. You could probably fill the FP scheduling queues with div instructions and stall it, but then performance would tank anyway.
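As a sketch of that stall scenario (illustrative, not a tuned microbenchmark): a burst of independent divides can be issued much faster than the not-fully-pipelined divider retires them, so they pile up in the FP scheduling queue:

    /* Independent divss uops all compete for one divider; issue
     * outruns execution and the FP queue backs up, but throughput
     * has already tanked from the divide latency itself. */
    void div_burst(float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = a[i] / b[i];
    }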

I'd be surprised if Intel's unified scheduler actually is one big queue, given how the dispatch ports are partitioned. Ports 2, 3, 4 and 7 (in Skylake) can only dispatch ops that talk to the cache hierarchy; ports 0, 1, 5 and 6 handle all the rest.

Cheers
 
My bet is Intel is moving to a non-inclusive L3. Inclusive caches don't scale well with high core counts: power goes up with N^2 and latency with N, since the shared inclusive cache has to grow with the core count while every core is hitting it. Maybe a memory-side cache (like the L4 in Iris Pro) rather than a strict victim cache; that way the integrated graphics might better utilize it.
I'd tend to agree about the non-inclusive L3, but in Ryzen's case, with Infinity/mesh, it might make sense for some workloads. I'd think they would partition cache usage somewhat dynamically for the task at hand. Large-scale HPC, or any program with lots of interaction between threads, would benefit from inclusive; VMs with independent threads not so much, as long as the cache is used effectively.

More specifically, a large write cache makes sense with nonvolatile memory, where significant energy is expended on writes as opposed to reads. That won't be every system, though. I'd think that for a nonvolatile-based system an L4 would be required for any graphics, or even for acceptable performance. Nonvolatile memory presents some interesting characteristics (slower, denser, lower power) to be designed around.
 
My bet is Intel is moving to a non-inclusive L3. Inclusive caches don't scale well with high core counts: power goes up with N^2 and latency with N, since the shared inclusive cache has to grow with the core count while every core is hitting it. Maybe a memory-side cache (like the L4 in Iris Pro) rather than a strict victim cache; that way the integrated graphics might better utilize it.
It seems like there is a good possibility of a change in the hierarchy, although that would change some of the parameters for the comparison between the AMD and Intel subsystems. The L3 would no longer serve as a probe filter, and the latencies could worsen.
Making the L3 memory-side would strip it out of much of the coherency domain, since it serves as a stand-in for the bus for a physical memory access.

Ryzen's six 14-entry scheduling queues are just for integer instructions, though. The FPU has its own scheduling apparatus, probably a lot shallower, since it doesn't have to deal with the highly variable latencies of memory accesses and just has to handle execution latencies.
It's difficult to compare the FP scheduling hierarchy. There's an in-order queue prior to the actual scheduler, which apparently serves as a way to prevent backpressure from the scheduler from stalling uop issue.
The scheduler itself is not shown as being subdivided, and AMD gives the total queue count for FP as 96. Segmenting would be more complicated since there are quite a few operations that span multiple ports.

You could probably fill the FP scheduling queues with div instructions and stall it, but then performance would tank anyway.

I'd be surprised if Intel's unified scheduler actually is one big queue, given how the dispatch ports are partitioned. Ports 2, 3, 4 and 7 (in Skylake) can only dispatch ops that talk to the cache hierarchy; ports 0, 1, 5 and 6 handle all the rest.
Zen's integer schedulers are per-port, where something like a stream of DIV or MUL operations 15 deep could potentially choke off uop issue completely. A dependent chain of address calculations would also fill up the AGU schedulers to the point of a global stall long before the queues are exhausted.

Skylake likely has its own stall points and subdivisions, but would it stall completely after encountering a 15th MUL?
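For illustration, the dependent-address-calculation case mentioned above is as simple as a linked-list walk (a sketch, nothing vendor-specific): every load's address comes from the previous load's result, so the AGU-side queue fills with waiting uops long before a 192-entry window is exhausted:

    struct node { struct node *next; };

    const struct node *chase(const struct node *p, long n)
    {
        while (n--)
            p = p->next;  /* each address depends on the prior load */
        return p;
    }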
 
Ryzen has twice as much I$, which might make a difference when you have two contexts stomping around. It also has twice the L2 cache and, just as important, twice the associativity; Skylake/Kaby Lake's L2 is only four-way associative. Ideally you want at least three ways per context for when the code, stack and heap segments alias, or you may end up evicting hot cache lines because you run out of associativity ways.

The aggregate scheduling resources are also bigger in Ryzen than in Skylake/Kaby Lake, something that might make a difference once you're limited by load/store throughput and have to schedule around extended latencies. Also, according to Agner Fog, execution throughput, AVX instructions excepted, is higher for Ryzen than for any Intel core.

Currently Ryzen does a pretty good job at running code optimized for Intel microarchitectures. It'll be interesting to see what performance increases (if any) are achieved once compilers start to accommodate Ryzen.

Cheers
Something I noticed in several comparison videos is that Ryzen usually shows much lower CPU utilization while achieving the same performance as comparable Intel CPUs. That might be due either to very efficient use of the CPU or to lackluster use of its resources by current software, but either way the potential for even greater performance numbers is there.

 
Something I noticed in several comparison videos is that Ryzen usually shows much lower CPU utilization while achieving the same performance as comparable Intel CPUs. That might be due either to very efficient use of the CPU or to lackluster use of its resources by current software, but either way the potential for even greater performance numbers is there.


I think a lot of it has to do with memory latency; your core is sitting idle while waiting for data from memory. Hopefully new firmware can help. I don't see a lot of these review guys focusing on driving down memory latency: on some motherboards with DDR4-3200 CL14 you can get into the low 70s of ns in AIDA64, and if you raise the BCLK you can push memory to 3600 and get to the mid 60s.
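For rough scale (standard timing arithmetic, not a measured number): DDR4-3200 runs the memory clock at 1600 MHz, so CL14 alone is 14 / 1.6 GHz ≈ 8.75 ns; the ~70 ns you see in AIDA is mostly the rest of the path (fabric, controller, row activation).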

I guess going forward they really need to raise the fabric clock rate and decouple its speed from memory, or allow more multipliers to be set for it (0.66, 0.75, etc.).
 
I'm reading this thread on a 486SX. Can't watch the video. Thanks!

It's just your typical hype video with no information whatsoever. Hopefully we'll see a 6-core plus something with perf similar to a 1060/580 in a small form factor, although the first few APUs will be limited to 4-core designs AFAIK.
 
A technical explanation of why Ryzen's SMT performs better than Intel's HT (same thing, different name, I know):
The throughput of each core in the Ryzen is higher than on any previous AMD or Intel x86 processor, except for 256-bit vector instructions. Loops that fit into the µop cache can have a throughput of five instructions or six µops per clock cycle. Code that does not fit into the µop cache can have a throughput of four instructions or six µops or approximately 16 bytes of code per clock cycle, whichever is smaller. The 16 bytes fetch rate is a likely bottleneck for CPU intensive code with large loops.
Most instructions are supported by two, three, or four execution units so that it is possible to execute multiple instructions of the same kind simultaneously. Instruction latencies are generally low.

The 256-bit vector instructions are split into two µops each. A piece of code that contains many 256-bit vector instructions may therefore be limited by the number of execution units. The maximum throughput here is four vector-µops per clock cycle, or six µops per clock cycle in total if at least a third of the µops use general purpose registers.

The throughput per thread is half of the above when two threads are running in the same core. But the capacity of each core is higher than what a single-threaded application is likely to need. Therefore, the Ryzen gets more advantage out of simultaneous multithreading than similar Intel processors do. Inter-thread communication should be kept within the same 4-core CPU complex if possible.

The very high throughput of the Ryzen core places an extra burden on the programmer and the compiler if you want optimal performance. Obviously, you cannot execute two instructions simultaneously if the second instruction depends on the output of the first one. It is important to avoid long dependency chains if you want to even get close to the maximum throughput of five instructions per clock cycle.

The caches are fairly big. This is a significant advantage because cache and memory access is the most likely bottleneck in most cases. The cache bandwidth is 32 bytes per clock which is less than competing Intel processors have.

Source: http://www.agner.org/optimize/microarchitecture.pdf, pages 216-217
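To make the dependency-chain point concrete, a toy C example (nothing Ryzen-specific): the first loop is one serial chain through sum, so it runs at roughly one add per cycle no matter how many ALUs exist; the second splits the work into four independent chains the schedulers can actually run in parallel:

    /* Serial chain: each add waits on the previous one. */
    long sum_serial(const long *a, long n)
    {
        long sum = 0;
        for (long i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

    /* Four independent accumulators expose instruction-level
     * parallelism to the multiple ALUs. */
    long sum_parallel(const long *a, long n)
    {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0, i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)
            s0 += a[i];
        return s0 + s1 + s2 + s3;
    }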
 
The L3 would no longer serve as a probe filter, and the latencies could worsen.

They could, but don't really have to.

Replicate L2 tags in a coherency agent and have a fast inter-cache connect (something Intel has been good at in the past). We're talking ~100 KB of SRAM per core, assuming a 1 MB L2.
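Rough arithmetic behind that figure (my assumptions: 64 B lines, 48-bit physical addresses, a 16-way 1 MB cache): 1 MB / 64 B = 16,384 lines, each needing a ~32-bit tag plus a few state bits, call it 5-6 bytes with ECC, so 16,384 x 5-6 B ≈ 80-100 KB of SRAM.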

How they'd organize the many-core case I don't know (but I do have ideas), as long as they don't copy AMD, which has really good intra-CCX probe latencies and awful inter-CCX latencies.

Cheers
 
AMD Enjoying High Yields For Zen Based Processors – More Than 80% Of Dies Have All 8 Cores Fully Operational
Yields are one of the more important variables in the life-cycle of any silicon product, and at first glance it appears that AMD has run into quite a bit of good luck as far as they are concerned. According to a report from Bitsandchips.it, they are currently enjoying yields upwards of 80%, which basically means that more than 80% of the Zen dies fabricated have all 8 cores fully functional. This is of course something that will have quite an impact on the financials of the company, as they will be able to increase their profit margins.

While at first glance it appears that AMD is having unexpectedly high yields, this is not quite the case. The yields are excellent, yes, but entirely expected. Firstly, they are using a 14nm process that has had more than a year to mature, and secondly, the die itself is quite small, so yields will naturally be higher (conventional wisdom dictates that the smaller the die, the higher the yield). Since die size is one of the more significant factors in determining yield, I am sure the company was well aware of this fact when it decided on the exact size. That said, these high yields can only be a positive for the company.
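The conventional wisdom follows from the textbook Poisson defect model, yield ≈ e^(-D*A) for defect density D and die area A. With purely illustrative numbers (not Zen's actual figures): a die with D*A = 1 yields e^(-1) ≈ 37% defect-free parts, while halving the area gives e^(-0.5) ≈ 61%.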
 
The number of functional cores is a major component, although it doesn't rule out something like an 8-core chip that had a fault in an uncore block or excessive defects in an L3.
Similarly, parametric yields can mean 8 cores that technically run, just not in the power range needed or at the clocks desired. That could still leave a decent number of chips in the lower tiers, or discarded despite the number of functional cores. The area of core logic that isn't cache (which has at least some fault tolerance) is maybe 30 mm^2 or lower.
 
The number of functional cores is a major component, although it doesn't rule out something like an 8-core chip that had a fault in an uncore block or excessive defects in an L3.
Similarly, parametric yields can mean 8 cores that technically run, just not in the power range needed or at the clocks desired. That could still leave a decent number of chips in the lower tiers, or discarded despite the number of functional cores. The area of core logic that isn't cache (which has at least some fault tolerance) is maybe 30 mm^2 or lower.

Production yields are production yields; the rest is technically a separate consideration, at a different level than wafer production. We don't count the NAND chips that fail QA at a rated voltage/speed as part of production yields, do we?
 
Production yields are production yields; the rest is technically a separate consideration, at a different level than wafer production. We don't count the NAND chips that fail QA at a rated voltage/speed as part of production yields, do we?

The source article just states that 80% of the chips have 8 functional cores, which does not provide a complete picture. We also don't have a reference for how good that is relative to other chips with similar ratios of CPU logic to total die area.

Put the other way around, roughly 20% of Ryzen chips have irrecoverable faults in 10-15% of their die area. What does that mean for the chips that have faults in the other 85-90%?
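As a toy calculation only (assuming randomly scattered defects and taking the vulnerable core area as ~12% of the die): an 80% clean rate over 12% of the area implies a whole-die defect expectation of -ln(0.8) / 0.12 ≈ 1.9, so only about e^(-1.9) ≈ 15% of dies would be completely defect-free; everything else ships on the back of cache redundancy and salvage binning.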
 