AMD Ryzen CPU Architecture for 2017

Bulldozer wasn't anything new from a timing perspective. There were patent filings for a similar multi-threading architecture in 2003 or 2004, AFAIK. My guess is Bulldozer was in fact the "B-plan" for AMD, kicked into gear in a hurry some time after the K10 fiasco.
There seems to have been an A and B version of Bulldozer, going by a demarcation of sorts in the patents.
I think that was distinct from an even more speed-oriented design described on comp.arch (by Mitch Alsup, if my fuzzy memory is a guide) that removed result forwarding in order to save the gate delays.
Less certain were rumors of an aborted wide architecture that was another failed attempt at replacing K8.

It still boggles me why AMD decided to invest in a NetBurst redux after it was quite clear that the future was definitely not in power-sucking high-speed monsters. Not that they had much of a choice back then.

One interesting direction for CMT, put forward by Glew, was splitting things up in order to slim down the critical execution loop for a fireball-type design that could really crank up the clock, which Bulldozer does not quite reach for given its modest reduction in per-stage complexity.
That goal, and/or the implementation of enhanced speculation, should have been in mind if CMT was selected at the start of the design process. Perhaps the weirdness comes from one or both of those hitting dead ends, leaving the design we have now as an awkward fattening of a too-skinny pipeline and/or an awkward grafting of whatever could be salvaged from interesting directions that AMD found wouldn't work, or couldn't be made to work, with the resources available.
 
This is where AMD's highlighting of the minor area cost of its integer ALUs turned out to be an incomplete picture of the situation the design faced.

Before Bulldozer came out, AMD said a lot of misleading or outright incorrect things about the uarch, mostly via John Fruehe on his official blog.

They said that three ALUs were rarely useful based on their testing, but they didn't just take away an ALU. The two EX pipes were shared by all ops that weren't pure loads or FP/SIMD, so branches, stores, and multiplies contended for the same resources as simple ALU ops. Allegedly the AG pipes could handle some simple ALU ops, but that looks like it was barely, if ever, true in the first release. It does raise the question of exactly how they saved on register ports; maybe they saved on forwarding paths by limiting how much those units could quickly forward between clusters.

They said that the third AGU on K10 was useless due to there only being two load/store ports, and that it was only there to make the implementation more symmetric. That's not true, because the AGUs are decoupled from the LD/ST path. They said that K10 could only do three ops total split between ALU and AGU, while Bulldozer could do 2 AGU + 2 ALU, which is just blatantly wrong. And they said repeatedly that IPC would actually increase.

As someone interested in CPU uarch details, I found it very disappointing to see a company let itself misrepresent its design so badly. Engineers there should have been approving the blog posts, or at least speaking up when they noticed how wrong they were.
 
I do not have much insight into how much technical expertise Fruehe had at that time, but a lot of the factoids he ran with sounded like they were sourced from technical staff. Odds are there were people from engineering vetting the posts, although what proportion were engineers versus their bosses is up for debate. Let's note that everyone's boss in the CPU division during most of the troubled gestation of Bulldozer eventually became the boss of the whole company around the point of release. If there were objections from engineers, they would have been handled in-house, and the engineers would have been mindful that they were getting negative about their own work and that of everyone's boss.

I do know there were grumbles. Comp.arch and Glew's reflections on CMT indicate that there was a grapevine through which more honest assessments of what Bulldozer was going to be circulated prior to release. Less verifiable were certain rumors about the attitudes of management--particularly that of everyone's boss--when it came to raising objections.

Things were also evasive and misleading with Shanghai and R600, and we've entered an era of marketspeak and abused codenames of Orwellian proportions when it comes to describing IP levels or what goes into a chip called Carrizo, so whatever problems there were stem from longer-lived trends than that one example.
 
Do I remember right: didn't Fruehe write under the misapprehension that IPC went up because what was being counted was both "cores" in a "module", and he just called the sum of them the IPC? I suppose in a way it's the right thing to do, since no one really thinks a module has two cores; it's just a dual-threaded core with large amounts of sharing. But everyone else thought it was single-thread IPC he was talking about.
 

I'm not sure if there was initial confusion on that count, but later statements were clear that a Bulldozer module had two integer cores.
Whether there was abuse of the term IPC beyond the long-standing consensus definition, I do not recall. Summing instructions across two cores and calling it instructions per clock, while technically correct, would get called out in an industry that does not use the term that way.
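To make the terminology point concrete, here is a hedged little sketch with invented numbers (none of these are measurements): per-thread IPC is instructions retired over cycles for one thread, and summing both integer cores of a module over the same window yields a larger "module IPC" figure even if each thread individually got slower.

```c
/* Hypothetical illustration only: all instruction/cycle counts below are
 * made-up placeholders, not measured data. */
#include <stdio.h>

double ipc(double instructions, double cycles)
{
    return instructions / cycles;
}

int main(void)
{
    double cycles     = 1e9;    /* same clock window for every figure   */
    double k10_thread = 1.5e9;  /* hypothetical K10 single-thread work  */
    double bd_thread  = 1.1e9;  /* hypothetical Bulldozer thread work   */

    printf("K10 single-thread IPC:       %.2f\n", ipc(k10_thread, cycles));
    printf("Bulldozer single-thread IPC: %.2f\n", ipc(bd_thread, cycles));
    /* Summing both integer cores of a module over the same cycles gives
     * "instructions per clock" for the module, not single-thread IPC.  */
    printf("Bulldozer module \"IPC\":      %.2f\n", ipc(2 * bd_thread, cycles));
    return 0;
}
```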

As far as labeling the integer sections in a module as cores, I am among the set of people who think it actually is two cores. AMD is as well, at least now. There were old, allegedly leaked slides that, if genuine, hinted AMD might have tried to call a module a core, but that did not last, and I feel other companies would have balked--if such objections were an actual rather than hypothetical reason that AMD changed its nomenclature to standard usage.
The processor control structures for issue/retirement and signal paths that govern the execution of instructions and the critical path of deciding the direction taken by a thread are physically separate. One of the forms proposed in Glew's clustered architecture created a hierarchical global scheduler at what we now call the module level that fed into per-cluster fast schedulers. If that level of control and decision making had persisted in Bulldozer, the core argument might have gone the other way.

The lack of it, and other factors that could have made the integer cores of Bulldozer different from other multicore designs with heavy sharing like Niagara, makes me wonder if there was something AMD dropped from the design.
 
AMD cuts ‘Bulldozer’ instructions from ‘Zen’ processors

Advanced Micro Devices has been talking about development of its next-generation high-performance “Zen” architecture for months now, but so far it has not officially revealed any details about the chips. Nonetheless, thanks to a recent patch for Linux, we have learnt one significant detail about “Zen”: it will not support many instructions found in the current-generation processors.

AMD recently started to enable support for its forthcoming “Zen” microprocessors in Linux operating systems. While patches to Linux distributions typically do not reveal a lot of micro-architectural peculiarities of central processing units, this time is a clear exception. AMD explicitly revealed in the description of the patch to the GNU Binutils package that “Zen”, its third-generation x86-64 architecture in its first iteration (znver1 – Zen, version 1), will not support the TBM, FMA4, XOP and LWP instructions developed specifically for the “Bulldozer” family of micro-architectures.

Elimination of such instructions clearly points to the fact that AMD’s new micro-architecture is a far cry from “Bulldozer”. The company even decided to remove support for the “Bulldozer”-specific instructions to save transistors and die space for something more useful. It seems that AMD now considers “Bulldozer” a dead end and does not want to support even the promising instructions introduced in recent iterations of the company’s micro-architectures.
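For what it's worth, the practical upshot for software is simply to test the CPUID feature bits rather than assume these extensions from the CPU family. A minimal sketch of my own (not from the article), using the documented AMD feature bits in leaf 0x80000001 ECX:

```c
/* Sketch only: report whether the extensions Zen reportedly drops are
 * present on the running CPU.  Bit positions are the documented AMD
 * CPUID leaf 0x80000001 ECX assignments. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 1;                      /* extended leaf not available */

    printf("XOP:  %s\n", (ecx & (1u << 11)) ? "yes" : "no");
    printf("LWP:  %s\n", (ecx & (1u << 15)) ? "yes" : "no");
    printf("FMA4: %s\n", (ecx & (1u << 16)) ? "yes" : "no");
    printf("TBM:  %s\n", (ecx & (1u << 21)) ? "yes" : "no");
    return 0;
}
```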
 
I suppose everyone will have varying levels of attachment to the instructions lost.
If Zen is meant to be a sibling to a custom ARM, I would think the hardware capability for FMA4 could be maintained internally, although I suppose it could be dropped as well.
Is there something mirroring the LWP instructions in the Intel x86 realm? Some people seemed to like that.
 
I guess losing FMA4 isn't really much of a problem, because FMA3 just has three variants instead, which let you choose which operand register to overwrite (and which operand can take the memory operand). I suspect it should be possible to pick a suitable version nearly always without having to resort to MOVs.
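A minimal sketch of that point (my example, not anything AMD has published): with the standard FMA intrinsic the compiler picks among the vfmadd132/213/231 forms, i.e. it decides which of the three inputs gets overwritten, so the separate destination of FMA4 only buys you something when all three inputs have to survive the operation.

```c
/* Sketch only: compile with -mfma.  The intrinsic maps to one of
 * vfmadd132ps/vfmadd213ps/vfmadd231ps; the compiler chooses the form
 * whose destination it is allowed to clobber. */
#include <immintrin.h>

__m256 axpy(__m256 a, __m256 x, __m256 y)
{
    /* d = a*x + y; one of a, x, y doubles as the destination register. */
    return _mm256_fmadd_ps(a, x, y);
}
```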
XOP had, in theory, some nice stuff in it which isn't in AVX2 (luckily, the big SSE omission, the true vector shift, is in both): powerful vector permutes, integer multiply-add, rotates, ... Most of it (in contrast to that previously missing true vector shift, which is terrible to emulate) should be possible to emulate with just a few instructions. And probably almost no one used them anyway.
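As an example of the "few instructions" part, here is a hedged sketch (mine, not an official mapping) of emulating XOP's per-lane 32-bit rotate with AVX2 variable shifts; it relies on AVX2's variable shifts returning zero for counts of 32 or more, so the zero-rotate case still comes out right.

```c
/* Sketch only: compile with -mavx2.  Rotate-left each 32-bit lane of x
 * by the per-lane counts in r (assumed to be in the range 0..31). */
#include <immintrin.h>

__m256i rotl32(__m256i x, __m256i r)
{
    __m256i left  = _mm256_sllv_epi32(x, r);
    __m256i right = _mm256_srlv_epi32(x,
                        _mm256_sub_epi32(_mm256_set1_epi32(32), r));
    return _mm256_or_si256(left, right);   /* (x << r) | (x >> (32 - r)) */
}
```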
Not sure about the LWP stuff...
 
How can transistors saved on exotic instructions be relevant when the count of ALU transistors is about, oh, I dunno, 5% of the entire CPU?
 
I guess losing FMA4 isn't really much of a problem, because FMA3 just has three variants instead, which let you choose which operand register to overwrite (and which operand can take the memory operand).
It's more a question of whether AMD would revamp an FMA4-capable pipeline for Zen if there's a sibling ARM core that would still use it. The later Bulldozer derivatives had no problem supporting FMA3 on the same hardware.

How can transistors saved on exotic instructions be relevant when the count of ALU transistors is about, oh, I dunno, 5% of the entire CPU?

It frees up encoding space, particularly if there's some kind of conflict with one of Intel's current or upcoming encodings. For AMD, it's more important to match Intel. Even if there's no conflict, the decoders are complex beasts to get right, and this is a raft of architectural behavior AMD doesn't need to invest in validating when it already has to spend its efforts validating against two other companies' architectures.
There would be knock-on effects down the pipeline past the decoder for the instructions that are snipped. The XBAR unit and the plumbing for LWP wouldn't be needed if the instructions that use them aren't there, perhaps more so if the ARM architecture AMD is juggling doesn't have instructions for them, either. ARM is why I wondered about FMA4 and whether the hardware would actually be going away.

I think the next question is what the full range of new instructions will be for Zen. Even though it's a small part of the implementation, it will indicate the direction and ambitions of the core--and of AMD, perhaps.
 
Was reading an article the other day talking about how they brought back a bunch of the big old AMD names for this.
It's a pretty old article, but I hadn't really been paying any attention for a long time.

Old names who had previously jumped from the sinking ship coming back to AMD seems like a really good thing.

AMD feels like they're on something of a roll at the moment putting out a fairly steady stream of revisions with various new techs coming online in relatively good time.

How is this architecture shaping up?
A new AMD64/Core 2-type step-change into strong competition with Intel again, or just likely to be a 'good enough' alternative in the mid-range bulk stuff?

A whole heap of new techs seem to be converging to bring the possibility of a significant step up?

I think they don't need to compete at the top-end stupid-price stuff, but being valid competition (price, performance & power) to top i5 models/low-end i7s seems like a good thing to target.
At least, that'd get Intel rolling for a bit anyway.
 
How is this architecture shaping up?
A new AMD64/Core 2-type step-change into strong competition with Intel again, or just likely to be a 'good enough' alternative in the mid-range bulk stuff?

IMHO, the IPC and performance/watt differences between Bulldozer revisions and Haswell/Broadwell are so great that a big step-change would be needed just to provide a "good enough" alternative.
As it is, they're just selling big chips with high power consumption for cheaper than Intel's chips that take half the die area, with cores/hyperthreading/turbo disabled for redundancy.


I think they don't need to compete at the top-end stupid-price stuff, but being valid competition (price, performance & power) to top i5 models/low-end i7s seems like a good thing to target.
At least, that'd get Intel rolling for a bit anyway.
They don't need to, but if they can, they should. The Halo Effect is a very real thing IMO.
 
AMD needs to focus on performance/power.

Everything is limited by power these days, not only in the mobile market, but in server and desktop markets as well.

Hope they spend some more FO4 delays on individual pipeline stages, and make it *much* wider.

Cheers
 
They don't need to, but if they can, they should. The Halo Effect is a very real thing IMO.
So apparently they are going for server first, which kinda does imply an attempt at the top end; that's a promising prospect.

As it is, they're just selling big chips with high power consumption for cheaper than Intel's chips that take half the die area, with cores/hyperthreading/turbo disabled for redundancy.
Yeah, it's not good right now & lets Intel pretty much idle along.
 
If it's true that quad-cores will (finally) be the baseline for the Skylake generation, then AMD is "lucky" that Intel delayed Skylake to Q3 or later, as "more cores" is the only argument left for AMD on the desktop.
 
So apparently they are going for server first, which kinda does imply an attempt at the top end; that's a promising prospect.

If they go for high-end servers, they are crazy. They are road-kill in that space. Does anyone seriously think they have the resources to turn that around? Especially whilst at the same time tackling all the other markets? Smacks of wishful thinking to me.
 
That's really interesting if true, do you have a source?
Sorry, I was speaking from memory, but IIRC it was just Skylake's Wikipedia entry (I didn't bother to check on the usual rumor sites). The English wikipage now says "Up to 4 cores as default" while the Portuguese wikipage says "Quad-core everywhere".
So I guess it's my mistake or just a rumor, which is unfortunate since Intel released desktop quad-cores on 65nm but is still trying to position them as sort of high-end on 22nm and 14nm, while even a Raspberry Pi 2 is already quad-core. It's ridiculous.
 
I agree, it's long past time that 8 cores became the high end and quad the baseline, IMO. Unfortunately, everything I've seen points to Skylake being the same as previous gens with regard to core count.
 