AMD RyZen CPU Architecture for 2017

fellix · Oct 20, 2014

“From where we are today, our [server] product portfolio in x86 is several years old, and we are certainly looking at updating that over the next couple of years,” said Lisa Su during a conference call with investors and financial analysts.

Dr. Su seems to fully admit that AMD’s current-generation Opteron microprocessors based on the Piledriver micro-architecture (first introduced in 2012) are completely outdated and cannot really compete against offerings from Intel Corp. Still, Lisa Su believes that even before 2016 AMD will be able to address dense-servers/micro-servers with its low-power x86/ARM-based offerings.

This is not the first time AMD has implied that its next-generation high-performance x86 micro-architecture, code-named Zen, will materialize in 2016. However, the reiteration of the plan by the new CEO seems to be an important event, since it means that, at least for now, the new head of the company believes in AMD’s ability to create a competitive high-performance x86 micro-architecture.

The only thing currently known about AMD’s Zen is that it will drop the clustered multi-thread (CMT) design in favour of a more traditional simultaneous multi-threading (SMT) design. This may reduce the number of cores inside AMD’s processors, but will increase their efficiency. Jim Keller, who led the development of AMD’s ultra-successful Athlon 64 and Opteron (K8) processors in the early 2000s, heads the team that designs AMD’s next-generation Zen micro-architecture.

AMD’s Lisa Su: high-end ‘Zen’ x86 cores set to be available in 2016

Grall · Oct 20, 2014

About time AMD embraces SMT. I for one welcome this development.

Alexko · Oct 20, 2014

Yeah but I don't remember hearing anything from AMD itself about using SMT, although they were more or less clear about dropping CMT.

fehu · Oct 20, 2014

I'll wait for the next conference, in which they will present the all-new, powerful Excavator while explaining that it was only a wrong turn and that it will be put in the trash can as soon as possible.

3dilettante · Oct 20, 2014

It is one thing to say that Piledriver-based Opterons are outdated and another to trash Excavator.
AMD's non-APU Opteron line is quite outdated, and the focus on the low-power line indicates they will probably do the updating with Seattle and repurposed Excavator chips. Absent more data, it still sounds like AMD's ducking the Opteron line proper.

Is there a quote from AMD that corresponds with what the article asserts as far as CMT goes?

fehu · Oct 20, 2014

CMT is an interesting idea. Is there any technical limitation to implementing it with per-core SMT?

3dilettante · Oct 20, 2014

It would be possible, but scaling in threads introduces different scaling costs between the per-core and shared resources.

Assuming a BD-type module, the front end sees the demands placed on maintaining adequate branch prediction and residency in the instruction cache rise at a 2xthread rate instead of per-core, and the FPU sees the necessary amount of scheduling and arbitration rise at 2xthread. The issue mechanism from each integer core to the FPU would likely see additional complication above just doubling, because currently there is just one thread at a core level that can issue FP instructions without worrying about another thread contending for the same link.

The tiny data caches and the write-combining cache are more likely to have problems. There could be issues with maintaining latencies for inter-core communication. The BD line is already prone to weird throughput problems when it comes to sharing or potentially sharing between modules or between cores in a module, and in my admittedly jaundiced opinion I would bet on it making things weirder and worse.
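The scaling argument above can be sketched as a toy count of who contends for what. This is only an illustration: the `Module` class and its method names are hypothetical, and it counts contenders rather than modelling any real Bulldozer hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """Toy model of a BD-style module (hypothetical, for illustration only)."""
    cores: int = 2             # integer cores per module
    threads_per_core: int = 1  # 1 = plain CMT, 2 = hypothetical CMT + per-core SMT

    @property
    def threads(self) -> int:
        return self.cores * self.threads_per_core

    def shared_unit_clients(self) -> int:
        # The front end and the FPU serve every hardware thread in the module.
        return self.threads

    def core_link_contenders(self) -> int:
        # Threads competing for one integer core's issue path to the FPU.
        return self.threads_per_core

    def needs_per_core_arbitration(self) -> bool:
        # With one thread per core there is nothing to arbitrate on that link.
        return self.threads_per_core > 1


cmt = Module()
cmt_smt = Module(threads_per_core=2)
print(cmt.shared_unit_clients(), cmt.needs_per_core_arbitration())
print(cmt_smt.shared_unit_clients(), cmt_smt.needs_per_core_arbitration())
```

The point of the sketch is that the shared units see their client count double, while each core-to-FPU link gains a kind of contention it never had to handle before.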

Alexko · Oct 20, 2014

3dilettante said:
It would be possible, but scaling in threads introduces different scaling costs between the per-core and shared resources.

Assuming a BD-type module, the front end sees the demands placed on maintaining adequate branch prediction and residency in the instruction cache rise at a 2xthread rate instead of per-core, and the FPU sees the necessary amount of scheduling and arbitration rise at 2xthread. The issue mechanism from each integer core to the FPU would likely see additional complication above just doubling, because currently there is just one thread at a core level that can issue FP instructions without worrying about another thread contending for the same link.

The tiny data caches and the write-combining cache are more likely to have problems. There could be issues with maintaining latencies for inter-core communication. The BD line is already prone to weird throughput problems when it comes to sharing or potentially sharing between modules or between cores in a module, and in my admittedly jaundiced opinion I would bet on it making things weirder and worse.

I believe both threads within a module can issue an instruction to either FP pipe in BD/PD/SR.

3dilettante · Oct 20, 2014

Since there are two cores in a module, there is no need for arbitration for instructions within a single core when sending to the FPU.
Making the cores SMT would change that, requiring logic for a situation that didn't exist prior to that.

Alexko · Oct 21, 2014

Yes, sorry, I read your previous post too quickly.

entity279 · Oct 21, 2014

To simplify (maybe not for eloquence but rather because I'm not proficient enough to grasp the details): you would want SMT in order to achieve efficiency in the case of a wide-ish execution stage. AMD's CMT cores are far from being wide enough for SMT to ever make a difference.

fellix · Oct 21, 2014

Well, NetBurst also wasn't wide enough, but the SMT implementation there was for different reasons (long pipeline prone to stalls?). Granted, HT worked better when it was re-introduced in the much wider Nehalem later.

lanek · Oct 21, 2014

entity279 said:
To simplify (maybe not for eloquence but rather because I'm not proficient enough to grasp the details): you would want SMT in order to achieve efficiency in the case of a wide-ish execution stage. AMD's CMT cores are far from being wide enough for SMT to ever make a difference.

Well, to be honest, I still think that if the OS understood what to do with SMT, it might be a bit better than what we have seen of it. Who knows, maybe at some stage in the future we could see it come back.

3dilettante · Oct 21, 2014

Did you mean CMT? SMT is commonplace and not going anywhere.
At least the first Windows scheduler hotfix was to treat a BD chip as if it were an SMT processor with each module treated as a core.
It's less than ideal, but at the same time AMD's preferred option was that the OS sift through the thread memory access history, guess the future, and track whether a thread should allocate a timeslice on a core it left earlier, and do other things at runtime to schedule threads so that they matched the module layout.

The CMT model has a general propensity to halve a lot of the upsides that would otherwise be available to a thread at times where they were most needed, whilst simultaneously doubling the downsides by requiring the other half of the module to be active and forcing strange stalls and throughput losses.
Per the designer that conceived it, the architecture does not make sense unless you intend to do something interesting with it. AMD in the end did nothing interesting with it, and until it proves it is capable of doing something interesting in the future (leveraging aging IP and not progressing on the very difficult interesting future directions isn't it) I don't see why CMT should come back.
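The "treat each module as an SMT core" policy mentioned above can be sketched as a simple placement rule: spread threads across modules first, and only double up within a module once every module already has one thread. This is a minimal sketch of that idea, not the actual Windows scheduler; `place_threads` and its parameters are hypothetical names.

```python
def place_threads(n_threads, n_modules, cores_per_module=2):
    """Return a list of (module, core) slots, filling one core per module first."""
    if n_threads > n_modules * cores_per_module:
        raise ValueError("more threads than cores")
    placement = []
    for i in range(n_threads):
        module = i % n_modules   # round-robin across modules
        core = i // n_modules    # move to the second core only after a full pass
        placement.append((module, core))
    return placement


# Four threads on a four-module chip each get a module to themselves.
print(place_threads(4, 4))
# A fifth thread has to share module 0 with the first one.
print(place_threads(5, 4))
```

A rule like this avoids sharing the front end and FPU for as long as possible, but it ignores which threads share data, which is the kind of runtime knowledge the more ambitious approach would have required.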

Deleted member 13524 · Oct 21, 2014

3dilettante said:
It is one thing to say that Piledriver-based Opterons are outdated and another to trash Excavator.

Then again, by AMD's own statements, they're also saying that the yet-to-be-released Excavator is, at best, just 32% (1.15 × 1.15) better per watt than a "completely outdated" and non-competitive Piledriver.

None of this is new, though. We all knew it would come to this as soon as AMD released that "10-15% performance each year" slide.

The great news is that AMD hasn't given up on the high-end x86 cores yet. We consumers need competition on those, badly.
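The 32% figure above is just two compounded yearly steps. A quick sketch of the arithmetic, assuming two generational steps between Piledriver and Excavator and the 10-15% per-step range from the slide (the function name is hypothetical):

```python
def compound_gain(per_step_gain, steps):
    """Total fractional gain after `steps` multiplicative improvements."""
    return (1 + per_step_gain) ** steps - 1


best_case = compound_gain(0.15, 2)   # two steps at 15% each
worst_case = compound_gain(0.10, 2)  # two steps at 10% each

print(f"best case:  {best_case:.4f}")
print(f"worst case: {worst_case:.4f}")
```

Under those assumptions the range works out to roughly 21% to 32% over Piledriver.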

3dilettante · Oct 21, 2014

ToTTenTranz said:
Then again, by AMD's own statements, they're also saying that the yet-to-be-released Excavator is, at best, just 32% (1.15 × 1.15) better per watt than a "completely outdated" and non-competitive Piledriver.

General platform features surrounding the Opteron CPUs are also rather dated, so at a non-performance level there is stagnation in IO and system features as well.
The official AMD position, as worded, is a recognition that chips set down years ago are outdated.
It omits what can be inferred about promised levels of performance improvement, but omissions and the assumption that the audience cannot connect two dots are not out of character.

ToTTenTranz said:
The great news is that AMD hasn't given up on the high-end x86 cores yet. We consumers need competition on those, badly.

High end for whom is the question. If they refuse to commit to taking on Intel in the markets where real high-end x86 cores exist, then they are not high end by the standard outside observers are using.

HMBR · Oct 22, 2014

Looking at performance, the module with 2 CMT cores seems to deliver the promised performance of around 80% of 2 independent cores, I think.

What is killing "Bulldozer" is the low single-thread performance (and poor power efficiency). But maybe CMT is a barrier to improving single-thread performance and power efficiency, considering they are not using CMT (or SMT) for their most power-efficient CPUs. If "Jaguar" is the basis for the new architecture, it would be natural to drop CMT!?

DavidC · Oct 22, 2014

fellix said:
Well, NetBurst also wasn't wide enough, but the SMT implementation there was for different reasons (long pipeline prone to stalls?). Granted, HT worked better when it was re-introduced in the much wider Nehalem later.

Well, it was replay that messed up Hyperthreading on Netburst chips. I think an engineer (on RWT? Not sure) said that getting Hyperthreading to be as effective as it was on Netburst is much harder on Nehalem because the perf/clock is so much higher.

Gubbi · Oct 22, 2014

In an SMT processor, many resources (ROB, store buffers, etc.) are split between contexts; AMD's rationale for CMT was that duplicating integer execution units for each context was a small incremental cost. This rationale was based on the K7/8 microarchitectures, where the integer units were a tiny fraction of a core.

This of course doubles the cost when you want to improve a single execution unit (like a fully out-of-order load/store unit or a fast divider) or make the core wider internally with wider instruction issue.

The consequence is you end up with a core that is narrower with slower and less sophisticated execution units. AMD tried to make up for this by boosting operating frequency, which is ... bizarre considering their competitor was shunning speed racers and embracing power efficiency in CPUs.

On top of that you have the store-through datacaches to a slow L2 cache.

Cheers
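The trade-off described above can be put into a toy model: SMT divides a shared structure between contexts, while CMT pays for every per-context unit twice, including every later upgrade to it. This is a rough sketch under simplifying assumptions; the function names and all numbers are hypothetical, static partitioning is assumed for simplicity, and real cores mix partitioned and dynamically shared structures.

```python
def smt_entries_per_thread(total_entries, active_threads):
    """Entries each context gets when a structure (e.g. a ROB) is split evenly."""
    return total_entries // active_threads


def upgrade_cost(unit_cost_delta, copies):
    """Area cost of improving one kind of execution unit that exists `copies` times."""
    return unit_cost_delta * copies


# SMT: a single thread keeps the whole structure, two threads get half each.
print(smt_entries_per_thread(128, 1), smt_entries_per_thread(128, 2))

# CMT: the same improvement (say, a faster divider) is paid once per integer core.
print(upgrade_cost(10, copies=1), upgrade_cost(10, copies=2))
```

In this picture the duplicated units only stay cheap while they stay simple, which is the argument for why the design ends up narrower.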

fellix · Oct 22, 2014

DavidC said:
Well, it was replay that messed up Hyperthreading on Netburst chips. I think an engineer (on RWT? Not sure) said that getting Hyperthreading to be as effective as it was on Netburst is much harder on Nehalem because the perf/clock is so much higher.

Yep. Probably an alternative VMT implementation was more suitable for P4's pipeline than SMT?

Gubbi said:
In an SMT processor, many resources (ROB, store buffers, etc.) are split between contexts; AMD's rationale for CMT was that duplicating integer execution units for each context was a small incremental cost. This rationale was based on the K7/8 micro-architectures, where the integer units were a tiny fraction of a core.

It wasn't only the INT logic being duplicated, but also the L/S pipelines and data caches, so it wasn't that small, though. AMD banked on the future of IGP, as an integral part of a common architecture, where both intensive and more casual FP code would be "naturally" offloaded. The FPU was left shared, inefficient and mostly underpowered, as a consequence.
