AMD RyZen CPU Architecture for 2017

To me it looks like they are gearing up toward a sale of the company; they've cut as much fat as possible, and actually a lot more than that.
There is no roadmap any more and they just killed "Excavator" before it got released; Intel's PR team could not have done better.

IMO, they've just acknowledged that they can no longer compete with Intel: their low-power offering is not competitive, nor is anything they have for the mid-range or high end.

In the meantime they want to show that they know a couple of things about integration (the Jaguar SoC, the upcoming ARM server), that they still have great IP (CPU and GPU), and that they are going to focus R&D on something that is more of an industry standard and has proved to work: SMT.

If that is not an attempt to dress the company up for a buyer, I don't know what it is.
 
Looking at performance, a module with 2 CMT cores seems to deliver the promised performance of around 80% of 2 independent cores, I think.

What is killing "Bulldozer" is the low single-thread performance (and poor power efficiency), but maybe CMT is a barrier to improving single-thread performance and power efficiency, considering they are not using CMT (or SMT) for their most power-efficient CPUs. If "Jaguar" is the basis for the new architecture, it would be natural to drop CMT!?

The catch to the promise of around 80% of 2 independent cores is the implicit assumption that these are 2 independent cores with the same resources outside of the shared ones that are now split off.
The reality is that an independent core doesn't need to be designed under the constraint of serving as the baseline for a CMT comparison.

As a rough generalization, there is a footprint of active transistors that a context can utilize. CMT's static split ensures a significant fraction of that footprint cannot readily go where it may be needed at any given moment, and forcing the other half to be active when it is not needed, along with other costs, shrinks the per-context activity budget further. This is on top of the other physical and engineering constraints that likely cut the budget further still.

As far as using Jaguar as an example, I have doubts. Jaguar's other design parameters besides (somewhat) low power were that it be cheap and dense. AMD's sights need to be a bit higher for their next design.
Jaguar's line doesn't seem capable of hitting the real low power range covered by Intel and ARM, and it's getting squeezed by the lower range of the higher-performance lines.
In terms of cheap and low power, it's hard to beat an ARM core AMD doesn't have to design, but that's not good enough to stand against the designs that are compressing Jaguar from above.

In terms of engineering, there's probably an amount of R&D that AMD needs to keep focused, without diverting even a Bobcat- or Jaguar-sized team away from the band its next in-house cores target.

In a SMT processor, many resources (ROB, store buffers etc) are split between contexts;
True, but generally a significantly smaller set of them is split statically.
Some resources, like the data caches and load buffers that Bulldozer wound up duplicating, actually raise the cost of the design over a unified one, because there are obligations that coherent cores have to each other and the memory system as a whole. The costs that come from burdening inter-core communication get missed in the CMT narrative. The overly complicated memory subsystem, which frequently underperformed even on its own merits, indicates this cost was higher than they could handle.

AMD's rationale for CMT was that duplicating integer execution units for each context was a small incremental cost. The premise for this rationale was based on the K7/K8 microarchitectures, where the integer units were a tiny fraction of a core.
I'm not sure if AMD focused much on the integer units as part of their rationale. At least publicly, they all but lied about the nature of the four integer pipelines in the core until they finally divulged some architectural details.
If they did focus on it afterwards, though I'm not sure they did above other things like FPU and front-end savings, it might be putting the cart before the horse.
If integer execution resources were that low cost, why did they cut Bulldozer's per-core integer unit count by 33% from its predecessor, a chip that had no problem hosting that many units?
Unless other architectural decisions, like hardware savings and a measurable decrease in per-stage complexity due to higher clock targets, made that incremental cost unpalatable.


It wasn't only the INT logic being duplicated, but also the L/S pipelines and data caches, so the cost wasn't that small. AMD banked on the future of the IGP as an integral part of a common architecture, where both intensive and more casual FP code would be "naturally" offloaded. As a consequence, the FPU was left shared, inefficient and mostly underpowered.

I personally have trouble accepting this premise. The architectural differences and workloads for the IGP and a core like Bulldozer are so vast that I don't see how their engineers could have thought this could have been made to work, not in the time frame of Bulldozer or any CPU in its line prior to replacement. The granularity, latency, features, infrastructure, implementation needs, and software ecosystem differences are so vast that they had to have known it would take years--possibly in the line that replaced Bulldozer or beyond--to even get things past the point of "this solution is incredibly unacceptable", as they haven't really gotten beyond it now.
Why make the CPU of that day match a hypothetical design reality it and its descendants could not live to see?
 
True, but generally a significantly smaller set of them is split statically.
Some resources, like the data caches and load buffers that Bulldozer wound up duplicating, actually raise the cost of the design over a unified one, because there are obligations that coherent cores have to each other and the memory system as a whole.

Smaller structures (halved ROB, store buffers) clock higher, which seems to be Bulldozer's modus operandi.

Doubling the caches also gives you twice the bandwidth, unless you make them write-through. I simply cannot understand why they made the D$ write-through. I get that it simplifies coherency, but in store-heavy workloads you run each core at *half* the L2 bandwidth.
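For what it's worth, the effect is easy to probe from software. Below is a minimal store-bandwidth sketch (plain C on Linux, not from any real benchmark suite; the 16 KB figure just matches Bulldozer's L1D size). Run one copy pinned to each core of a module, e.g. with taskset, and compare against two copies pinned to cores in different modules: with a write-back L1D the stores stay local, while with a write-through L1D every store also consumes the shared L2/write-coalescing-cache bandwidth.

Code:
/* store_bw.c - crude store-bandwidth probe (illustrative sketch only) */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define BUF_BYTES (16 * 1024)        /* small enough to fit a 16 KB L1D */
#define WORDS     (BUF_BYTES / 8)
#define ITERS     200000

static volatile uint64_t buf[WORDS]; /* volatile: stores can't be optimized away */

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* pure store traffic: stays in L1 with a write-back D$, but goes
       through to the shared L2/WCC with a write-through D$ */
    for (int i = 0; i < ITERS; i++)
        for (int j = 0; j < WORDS; j++)
            buf[j] = (uint64_t)i;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("store bandwidth: %.2f GB/s\n",
           (double)BUF_BYTES * ITERS / sec / 1e9);
    return 0;
}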

I'm not sure if AMD focused much on the integer units as part of their rationale. At least publicly, they all but lied about the nature of the four integer pipelines in the core until they finally divulged some architectural details.
If they did focus on it afterwards, though I'm not sure they did above other things like FPU and front-end savings, it might be putting the cart before the horse.

I think it went down like this:
1. Amortize the cost of a big front end + FPU over two execution cores at modest incremental cost.
2. Optimize for the dual-thread case. No point in making a superwide execution core when each core only sees two instructions per cycle on average, right?
3. Fuck up the cache subsystem.
4. Unforeseen consequence 1: cores limited by the shared front end -> double down on decoders.
5. Unforeseen consequence 2: with the wider front end, cores now limited by the narrow execution core.

I think of Bulldozer as AMD's NetBurst. Intel made wrong assumptions about the cost of x86 decoding and implemented a trace cache, which became a bottleneck in a lot of cases. They also implemented cannonball scheduling + replay, which was awesome when it worked (no D$ misses) and awful when it didn't, interacting with the cache hierarchy in unexpected ways.

Core, which became Core 2, saved Intel's bacon. We can hope AMD has learned some lessons and can develop a high-end core using the best from Jaguar and BD. They're running out of time (cash), though.

Cheers
 
Why make the CPU of that day match a hypothetical design reality it and its descendants could not live to see?
Bulldozer was initially a server-oriented design, with priority given to multi-threaded INT throughput and simplistic load/store patterns, where the CPU mostly waits for data. It was later, in Piledriver, that AMD tweaked some parts of the pipeline to match desktop/client workloads more closely. I think they really didn't have a plan B like Intel did, and had to go on with what they had in the works, generously sugared with marketing hyperbole and overestimation to sweeten the bitter product. The whole Fusion/HSA/APU hype was more like wishful thinking that a magical solution would appear by simply throwing some GPU tech into the mix and luring software developers to do the rest of the heavy lifting. Pushing half-baked solutions is still a tradition in this company? :???:
 
Their approximate timeline for readiness seems too far out to be in the next AMD core. Their PR indicates they haven't worked out the kinks yet, and a 2016-ish complex design should be too far down the pipeline to revamp its execution paradigm.

While it is true that AMD gave them some money, how much skin AMD has in that game is unknown. For AMD and Samsung, it might be enough to have dibs on the IP, while a good chunk of the money they would otherwise have spent developing something that far out on a limb was paid by Mubadala and a range of non-CPU contributors.
 
It is one thing to say that Piledriver-based Opterons are outdated and another to trash Excavator.
AMD's non-APU Opteron line is quite outdated, and the focus on the low-power line indicates they will probably do the updating with Seattle and repurposed Excavator chips. Absent more data, it still sounds like AMD's ducking the Opteron line proper.

Is there a quote from AMD that corresponds with what the article asserts as far as CMT goes?


Steamroller isn't better than Piledriver and Excavator is the next incremental improvement on it. With AVX2 at least, and "this time it does HSA for real, we promise".

Mind you, I think it will be pretty decent - for updating a PC from 2009 or so - if it has a real and consistent +15% improvement in single-thread perf (NOT "up to 15%").
 
Yep. Probably an alternative VMT implementation would have been more suitable for P4's pipeline than SMT?

It wasn't only the INT logic being duplicated, but also the L/S pipelines and data caches, so the cost wasn't that small. AMD banked on the future of the IGP as an integral part of a common architecture, where both intensive and more casual FP code would be "naturally" offloaded. As a consequence, the FPU was left shared, inefficient and mostly underpowered.

But IMHO sharing the FPU was one of the best design decisions in Bulldozer. They just left it too narrow to be competitive with Intel.

An FPU always has long latencies, so it needs multiple operations in flight to hide them. Getting all of them from one thread can be hard, especially if there are lots of data dependencies in the code.

Executing multiple threads per FPU makes the FPU much easier to utilize well. Intel does this with SMT.
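To illustrate the latency-hiding point with a trivial C sketch (nothing Bulldozer-specific, just the general principle): a single dependent chain is limited by FP add latency, while independent accumulators keep several adds in flight. That unused latency slack is exactly what SMT fills with a second thread when one thread can't supply enough independent work. Compile without -ffast-math so the compiler keeps the dependency chain intact.

Code:
/* fp_latency.c - dependent chain vs. independent accumulators */
#include <stdio.h>

/* One dependent chain: every add waits for the previous result,
   so throughput is bounded by FP add latency. */
static double sum_serial(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: up to four adds in flight hide the
   latency. SMT gets the same effect by pulling the independent work
   from a second thread instead of from unrolling one thread. */
static double sum_unrolled(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}

int main(void)
{
    static double a[1 << 20];
    for (int i = 0; i < (1 << 20); i++)
        a[i] = 1.0 / (i + 1);
    printf("%f %f\n", sum_serial(a, 1 << 20), sum_unrolled(a, 1 << 20));
    return 0;
}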


You claim that Bulldozer's FPU was inefficient? What do you mean by that?
 
But IMHO sharing the FPU was one of the best design decisions in Bulldozer. They just left it too narrow to be competitive with Intel.

An FPU always has long latencies, so it needs multiple operations in flight to hide them. Getting all of them from one thread can be hard, especially if there are lots of data dependencies in the code.

Executing multiple threads per FPU makes the FPU much easier to utilize well. Intel does this with SMT.


You claim that Bulldozer's FPU was inefficient? What do you mean by that?


As a dirty peasant I always saw Bulldozer's long, narrow, higher-clocking pipeline vs. a shorter, wider, lower-clocking pipeline as the biggest issue. To me CMT itself seems fine; they lost on raw INT performance. I use an 8350 for realtime 1080p H.264 transcoding with x264, which uses XOP/FMA/AVX, and its FPU is more than capable; it's now just too narrow to compete with Haswell.
 
But IMHO sharing the FPU was one of the best design decisions in Bulldozer. They just left it too narrow to be competitive with Intel.
The FPU was designed with the 128-bit "SSE5" ISA in mind. It is actually not that narrow per se, but simply placed within a miscast assumption by AMD that FP workloads would mostly shift to HSA. In my opinion, AMD aimed for a rather elegant superscalar 128-bit pipeline design, with more emphasis on SIMD/SSE performance evolution, at the cost of several legacy (x87) setbacks, by relaxing some key instruction latencies. Intel, for instance, chose to keep the FADD and FDIV pipes short for a good reason, I guess.

The overall result is an unbalanced integration and a half-baked AVX implementation (Intel is partly to blame for that), but that's not the worst aspect of Bulldozer by far.
 
But IMHO sharing the FPU was one of the best design decisions in Bulldozer. They just left it too narrow to be competitive with Intel.
A wider FPU would have been bottlenecked by the limited throughput of the INT-to-FP link between the cores and the FPU, so the narrowness of the integer core and memory pipeline would have cut into the gains. This capping of the FPU's upside seems to be a consequence of AMD's conception of CMT, to the point that Steamroller narrowed the FPU slightly at no serious cost, according to AMD.

The FPU was designed with the 128-bit "SSE5" ISA in mind. It is actually not that narrow per se, but simply placed within a miscast assumption by AMD that FP workloads would mostly shift to HSA.
I think AMD created a conservatively encoded extension because they couldn't drive the x86 platform as a minority player. AVX had a lot of forward-looking changes that AMD's marginal position couldn't carry. I still do not think AMD's engineers could have drunk the kool-aid to such an extent that they would define SSE5 so that it was hamstrung by a tech they wouldn't see implemented until SSE6 or whatever.
Aside from a few instructions like FMA4, AVX introduced better forward-looking context saving and a path to widening and increasing the number of registers, as well as a less messy encoding situation.
 
As a dirty peasant I always saw Bulldozer's long, narrow, higher-clocking pipeline vs. a shorter, wider, lower-clocking pipeline as the biggest issue. To me CMT itself seems fine; they lost on raw INT performance. I use an 8350 for realtime 1080p H.264 transcoding with x264, which uses XOP/FMA/AVX, and its FPU is more than capable; it's now just too narrow to compete with Haswell.
x264 is primarily an integer / fixed point affair AFAIK.
 
x264 is primarily an integer / fixed point affair AFAIK.
Yes it is, except for where it isn't. If you look back at the dev threads when the x264 guys first started playing with Bulldozer, they found lots of XOP-based performance gains; then they figured out how to do most of those INT workloads with AVX, so now XOP only adds a small benefit over AVX. I have no idea if AVX2 has again improved performance over AVX/XOP.
 
Yes it is, except for where it isn't. If you look back at the dev threads when the x264 guys first started playing with Bulldozer, they found lots of XOP-based performance gains; then they figured out how to do most of those INT workloads with AVX, so now XOP only adds a small benefit over AVX. I have no idea if AVX2 has again improved performance over AVX/XOP.

I just looked through x264's source and I didn't see any floating point code. Just a lot of integer SSE and some optional AVX2. If there was XOP I didn't recognize it.

Much (probably most) of the SSE code works on packed 8-bit or 16-bit data types and involves a lot of shifts and logical ops. Not a great fit for floating point.
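For illustration only (a generic sketch, not x264's actual code): the classic motion-estimation kernel, a 16x16 sum of absolute differences, is exactly the kind of packed 8-bit integer work described above, and plain SSE2 already handles 16 pixels per instruction with no floating point anywhere.

Code:
/* sad16x16.c - generic 16x16 SAD with SSE2 (illustrative, not x264 code) */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

static unsigned sad_16x16(const uint8_t *cur, int cur_stride,
                          const uint8_t *ref, int ref_stride)
{
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < 16; y++) {
        __m128i c = _mm_loadu_si128((const __m128i *)(cur + y * cur_stride));
        __m128i r = _mm_loadu_si128((const __m128i *)(ref + y * ref_stride));
        /* psadbw: abs-diff of 16 unsigned bytes, summed into two partial
           results (one per 64-bit half) - pure integer SIMD */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(c, r));
    }
    /* fold the two partial sums together */
    acc = _mm_add_epi64(acc, _mm_unpackhi_epi64(acc, acc));
    return (unsigned)_mm_cvtsi128_si32(acc);
}

int main(void)
{
    uint8_t cur[16 * 16], ref[16 * 16];
    for (int i = 0; i < 16 * 16; i++) {
        cur[i] = (uint8_t)(i & 63);
        ref[i] = (uint8_t)((i & 63) + 2);
    }
    /* every pixel differs by 2, so the expected SAD is 256 * 2 = 512 */
    printf("SAD = %u\n", sad_16x16(cur, 16, ref, 16));
    return 0;
}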
 
XOP only has frac() instruction(s) for FP; the remainder are all integer instructions.
If I get time I will try and find the thread; it was on Doom9 or 10 when Bulldozer first came out, so a long time ago. Who knows what's happened to the code base since then (I'm sure some people do... lol).

edit:

I had a quick look and found this:
http://forum.doom9.org/showthread.php?p=1474908#post1474908
I'm sure there was more, and that's pre-Bulldozer release.
 
If integer execution resources were that low cost, why did they cut Bulldozer's per-core integer unit count by 33% from its predecessor, a chip that had no problem hosting that many units?

Because Bulldozer was supposed to clock some 25% higher than K7/K8/K10 did, and even though the integer units are small and cheap, they require ports in the register file and connections on the bypass network. And these ports and connections make hitting high clock speeds harder.

The plan was to trade away some 10% of ILP but gain some 25% of clock speed. But then they had some bad critical paths elsewhere in the core and could barely clock it 10% faster.
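Taking those numbers at face value: 0.90 × 1.25 ≈ 1.13, so roughly a 13% net per-core throughput gain if both targets had been hit; at only ~10% more clock it works out to 0.90 × 1.10 ≈ 0.99, essentially a wash, before even counting the latency regressions discussed below.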
 
Because Bulldozer was supposed to clock some 25% higher than K7/K8/K10 did, and even though the integer units are small and cheap, they require ports in the register file and connections on the bypass network. And these ports and connections make hitting high clock speeds harder.
This is where AMD's highlighting the minor area cost of its integer ALUs was shown to be an incomplete picture of the situation the design faced.

The plan was to trade away some 10% of ILP but gain some 25% of clock speed. But then they had some bad critical paths elsewhere in the core and could barely clock it 10% faster.

They traded away more than 10% of ILP, however; it took one or two core revisions, each with 10-15% per-clock performance improvements, to get consistently equivalent or better IPC than Phenom.
Even the theoretical clock speed increase, given the modest decrease in per-stage complexity, put a 25% upclock at the top of the optimistic curve. The misstep was clear in that even on paper it probably needed 10-15% on top of that to compensate for some of the major latency regressions for elements like the L2.

It actually would have been understandable at the time if the initial SKUs didn't fully reach the peak of the design's clock envelope relative to what a highly iterated previous generation product on a more mature process could reach. It happened before with 90nm versus 65nm. It's difficult to say where things went the most wrong since Bulldozer's line barely budged past its initial high-water mark.

I feel as if Bulldozer was AMD falling back to a design direction it had originally considered less compelling than the directions taken by at least one or two other cancelled designs it replaced. At the same time, I wonder whether Bulldozer as originally envisioned was significantly pared back on its way through the full design process. The designer who conceived of the CMT idea that went into Bulldozer did not find the concept worthwhile if it only went as far as AMD's unadventurous implementation.
 
Bulldozer wasn't anything new, from a timing perspective. There were patent filings for a similar multi-threading architecture in 2003 or 2004, AFAIK. My guess is Bulldozer was in fact the "plan B" for AMD, kicked off in a hurry some time after the K10h fiasco.

It still boggles me why AMD decided to invest in a NetBurst redux after it was quite clear that the future was definitely not in power-sucking high-speed monsters. Not that they had a choice back then.
 