AMD Ryzen CPU Architecture for 2017

Discussion in 'PC Industry' started by fellix, Oct 20, 2014.

  1. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
    For me it looks like they are gearing up for a sale of the company; they've cut as much fat as possible, and actually a lot more than that.
    There is no roadmap any more, and they just killed Excavator before it got released. Intel's PR team could not have done better.

    IMO, they've just acknowledged that they can no longer compete with Intel: their low-power offering is not competitive, and neither is anything in the mid-range or high end.

    In the meantime they want to show that they know a couple of things about integration (the Jaguar SoC, the upcoming ARM server), that they still have great IP (CPU and GPU), and that they will focus R&D on something that is more of an industry standard and has proven to work: SMT.

    If that is not an attempt to dress the company up to look attractive to a buyer, I don't know what is.
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,365
    Likes Received:
    3,955
    Location:
    Well within 3d
    The catch to the promise of 80% of the performance of two independent cores is the implicit assumption that these are two independent cores with the same resources, apart from the shared ones that are now split off.
    The reality is that an independent core doesn't need to be bound by the constraint that it be designed to serve as a fair comparison point for a CMT design.

    As a rough generalization, there is a footprint of active transistors that a single context can utilize. CMT's static split ensures that a significant fraction of that footprint cannot readily go where it is needed at any given moment, and by forcing the other half to stay active when it is not needed, CMT (among its other costs) shrinks the per-context activity budget further. That is on top of the other physical and engineering constraints that likely cut the budget further still.

    As far as using Jaguar as an example, I have doubts. Jaguar's other design parameters besides (somewhat) low power were that it be cheap and dense. AMD's sights need to be a bit higher for its next design.
    Jaguar's line doesn't seem capable of hitting the really low-power range covered by Intel and ARM, and it's getting squeezed by the lower end of the higher-performance lines.
    In terms of cheap and low power, it's hard to beat an ARM core AMD doesn't have to design, but that's not good enough to stand against the designs that are compressing Jaguar from above.

    In terms of engineering, there's probably a footprint of R&D that AMD needs to concentrate, which means not diverting even a Bobcat- or Jaguar-sized team away from the band its next in-house cores target.

    True, but generally only a significantly reduced set of them is split statically.
    Some resources, like the data caches and load buffers that Bulldozer wound up duplicating, actually raise the cost of the design over a unified one, because coherent cores have obligations to each other and to the memory system as a whole. The costs of burdening inter-core communication get missed in the CMT narrative. The memory subsystem, overly complicated and frequently underperforming even on its own merits, indicates this cost was higher than they could handle.

    I'm not sure AMD focused much on the integer units as part of its rationale. At least publicly, they all but lied about the nature of the four integer pipelines in the core until they finally divulged some architectural details.
    If they focused on it afterwards (and I'm not sure they did, above other things like FPU and front-end savings), it might be putting the cart before the horse.
    If integer execution resources were that low cost, why did they cut Bulldozer's per-core integer unit count by 33% from its predecessor, a chip that had no problem hosting that many units?
    Unless other architectural decisions, like hardware savings and a measurable decrease in per-stage complexity due to higher clock targets, made that incremental cost unpalatable.


    I personally have trouble accepting this premise. The architectural differences and workloads for the IGP and a core like Bulldozer are so vast that I don't see how their engineers could have thought this could be made to work, not in the time frame of Bulldozer or any CPU in its line prior to replacement. The granularity, latency, feature, infrastructure, implementation, and software ecosystem differences are so vast that they had to have known it would take years (possibly into the line that replaced Bulldozer, or beyond) even to get things past the point of "this solution is incredibly unacceptable", as they haven't really gotten beyond it now.
    Why make the CPU of that day match a hypothetical design reality that it and its descendants could not live to see?
     
  3. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,587
    Likes Received:
    986
    Smaller structures (halved ROB, store buffers) clock higher, which seems to be Bulldozer's modus operandi.

    Doubling the caches also gives you twice the bandwidth, unless you make them store-through. I simply cannot understand why they made the D$ store-through. I get that it simplifies coherency, but in store-heavy workloads you run each core at *half* the L2 bandwidth.
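
    To make the store-heavy case concrete, here is a minimal sketch of a store-dominated loop (my own illustration, not from any AMD documentation). With a write-back L1 the stores below are absorbed by the D$; with Bulldozer's store-through D$, every store has to be forwarded toward the shared L2 (via the small write-coalescing cache, as I understand it), so two cores storing at once each see roughly half of the L2's bandwidth:

    /* Minimal sketch: a pure-store loop of the kind that exposes a
     * store-through D$. There is no data reuse; every iteration is
     * store traffic that a store-through design must push down to
     * the next cache level. */
    #include <stddef.h>

    void fill(float *dst, float v, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = v;   /* store-only traffic, no loads to hide it */
    }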

    I think it went down like this:
    1. Amortize the cost of a big front end + FPU over two execution cores at modest incremental cost.
    2. Optimize for the dual-thread case. No point in making a super-wide execution core when each core only sees two instructions per cycle on average, right?
    3. Fuck up the cache subsystem.
    4. Unforeseen consequence 1: cores limited by the shared front end -> double down on decoders.
    5. Unforeseen consequence 2: with the wider front end, cores are now limited by the narrow execution core.

    I think of Bulldozer as AMD's NetBurst. Intel made wrong assumptions about the cost of x86 decoding and implemented a trace cache, which became a bottleneck in a lot of cases. They also implemented cannonball scheduling + replay, which was awesome when it worked (no D$ misses) and awful when it didn't, interacting with the cache hierarchy in unexpected ways.

    Core, which became Core 2, saved Intel's bacon. We can hope AMD has learned some lessons and can develop a high-end core using the best from Jaguar and BD. They're running out of time (cash), though.

    Cheers
     
    #23 Gubbi, Oct 23, 2014
    Last edited by a moderator: Oct 23, 2014
  4. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,506
    Likes Received:
    424
    Location:
    Varna, Bulgaria
    Bulldozer was initially a server-oriented design, with priority given to multi-threaded INT throughput and simplistic load/store patterns where the CPU mostly waits for data. It was later, in Piledriver, that AMD tweaked some parts of the pipeline to match desktop client workloads more closely. I think they really didn't have a plan B like Intel did, and had to go on with what they had in the works, generously sugared with marketing hyperbole and overestimation to sweeten the bitter product. The whole Fusion/HSA/APU hype was more like a fond wish that a magical solution would appear out of nowhere, by simply throwing some GPU tech into the mix and luring software developers into doing the rest of the heavy lifting. Pushing half-baked solutions is still a tradition at this company? :???:
     
  5. mboeller

    Regular

    Joined:
    Feb 7, 2002
    Messages:
    922
    Likes Received:
    1
    Location:
    Germany
  6. nutball

    Veteran Subscriber

    Joined:
    Jan 10, 2003
    Messages:
    2,249
    Likes Received:
    593
    Location:
    en.gb.uk
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,365
    Likes Received:
    3,955
    Location:
    Well within 3d
    Their approximate timeline for readiness seems too far out to be in the next AMD core. Their PR indicates they haven't worked out the kinks yet, and a 2016-ish complex design should be too far down the pipeline to revamp its execution paradigm.

    While it is true that AMD gave them some money, how much skin AMD has in that game is unknown. For AMD and Samsung, it might be enough to have dibs on the IP, while a good chunk of the money they would otherwise have spent developing something that far out on a limb was paid by Mubadala and a range of non-CPU contributors.
     
  8. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256

    Steamroller isn't better than Piledriver, and Excavator is the next incremental improvement on it. With AVX2 at least, and "this time it does HSA for real, we promise".

    Mind you, I think it will be pretty decent - for updating a PC from 2009 or so - if it has a real and consistent +15% improvement in single-thread perf (NOT "up to 15%").
     
  9. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    284
    Likes Received:
    6
    Location:
    Herwood, Tampere, Finland
    But IMHO sharing the FPU was one of the best design decisions in Bulldozer. They just left it too narrow to be competitive with Intel.

    An FPU always has long latencies, so it needs multiple operations in flight to hide them. Getting them all from one thread can be hard, especially if there are lots of data dependencies in the code.

    Executing multiple threads per FPU makes it much easier to utilize well. Intel does the same with SMT.
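
    As a scalar illustration of the latency-hiding point (my own sketch, nothing Bulldozer-specific): a reduction with a single accumulator serializes on the FP add latency, so with a 4-5 cycle FADD you need roughly that many independent operations in flight per pipe, whether they come from unrolling one thread or from a second thread sharing the FPU.

    /* One dependent chain vs. several independent chains. */
    #include <stddef.h>

    double sum_dependent(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];              /* each add waits on the last */
        return s;
    }

    double sum_independent(const double *a, size_t n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i];             /* four chains in flight, so a */
            s1 += a[i + 1];         /* multi-cycle add latency is  */
            s2 += a[i + 2];         /* overlapped by the pipeline  */
            s3 += a[i + 3];
        }
        for (; i < n; i++)
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }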


    You claim that Bulldozer's FPU was inefficient? What do you mean by that?
     
  10. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,314
    Likes Received:
    414
    Location:
    Australia

    As a dirty peasant, I always saw Bulldozer's long, narrow, higher-clocking pipeline vs. a shorter, wider, lower-clocking pipeline as the biggest issue. To me CMT itself seems fine; they lost on raw INT performance. I use an 8350 for realtime 1080p H.264 transcoding with x264, which uses XOP/FMA/AVX, and its FPU is more than capable; it's just too narrow now to compete with Haswell.
     
  11. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,506
    Likes Received:
    424
    Location:
    Varna, Bulgaria
    The FPU was designed with the 128-bit "SSE5" ISA in mind. It is actually not that narrow per se; it was simply placed in a miscast concept, built on AMD's assumption that FP workloads would mostly shift to HSA. In my opinion, AMD aimed for a rather elegant superscalar 128-bit pipeline design, with more emphasis on SIMD/SSE performance evolution at the cost of several legacy (x87) setbacks, by relaxing some key instruction latencies. Intel, for instance, chose to keep the FADD and FDIV pipes short, for good reason I guess.

    The overall result is an unbalanced integration and a half-baked AVX implementation (Intel is partly to blame for that), but that's not the worst aspect of Bulldozer by far.
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,365
    Likes Received:
    3,955
    Location:
    Well within 3d
    A wider FPU would have been bottlenecked by the limited throughput of the INT-to-FP link between the cores and the FPU, so the narrowness of the integer core and memory pipeline would have cut into the improvement from a wider FPU. This capping of the FPU's upside seems to be a consequence of AMD's conception of CMT, to the point that Steamroller narrowed the FPU slightly at no serious cost, according to AMD.

    I think AMD created a conservatively encoded extension because it couldn't drive the x86 platform as a minority player. AVX had a lot of forward-looking changes that AMD's marginal position couldn't carry. I still do not think AMD's engineers could have drunk the Kool-Aid to such an extent that they would define SSE5 so that it was hamstrung by a tech they wouldn't see implemented until SSE6 or whatever.
    Aside from a few instructions like FMA4, AVX introduced better forward-looking context saving and a path to widening and increasing the number of registers, as well as a less messy encoding situation.
     
  13. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,592
    Likes Received:
    673
    Location:
    WI, USA
    x264 is primarily an integer / fixed point affair AFAIK.
     
  14. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,314
    Likes Received:
    414
    Location:
    Australia
    Yes it is, except where it isn't. If you look back at the dev threads when the x264 guys first started playing with Bulldozer, they found lots of XOP-based performance gains; then they figured out how to do most of those INT workloads with AVX, so now XOP only adds a small benefit over AVX. I have no idea whether AVX2 has again improved performance over AVX/XOP.
     
  15. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    I just looked through x264's source and I didn't see any floating point code. Just a lot of integer SSE and some optional AVX2. If there was XOP I didn't recognize it.

    Much (probably most) of the SSE code works on packed 8-bit or 16-bit data types and involves a lot of shifts and logical ops. Not a great fit for floating point.
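
    As a minimal sketch of the kind of packed 8-bit work involved (the function and its shape are my illustration, not x264's actual code): a 16-pixel sum of absolute differences, the workhorse of motion estimation, done entirely with integer SSE2:

    /* 16-pixel SAD via PSADBW: absolute differences of 16 unsigned
     * bytes, horizontally summed into two 64-bit halves. All integer
     * SIMD; no floating point anywhere. */
    #include <emmintrin.h>
    #include <stdint.h>

    static uint32_t sad_16(const uint8_t *ref, const uint8_t *cur)
    {
        __m128i r = _mm_loadu_si128((const __m128i *)ref);
        __m128i c = _mm_loadu_si128((const __m128i *)cur);
        __m128i sad = _mm_sad_epu8(r, c);
        /* add the low and high 64-bit partial sums */
        return (uint32_t)(_mm_cvtsi128_si32(sad) +
                          _mm_cvtsi128_si32(_mm_srli_si128(sad, 8)));
    }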
     
  16. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    869
    Likes Received:
    277
    XOP's only FP additions are the fraction-extract instructions (VFRCZPS/VFRCZPD and their scalar forms); the rest are all integer instructions.
     
  17. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,314
    Likes Received:
    414
    Location:
    Australia
    If I get time I will try to find the thread; it was on Doom9 (or Doom10) when Bulldozer first came out, so a long time ago. Who knows what's happened to the code base since then (I'm sure some people do... lol).

    edit:

    I had a quick look and found this:
    http://forum.doom9.org/showthread.php?p=1474908#post1474908
    I'm sure there was more, and that's from before Bulldozer's release.
     
    #37 itsmydamnation, Nov 30, 2014
    Last edited: Nov 30, 2014
  18. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    284
    Likes Received:
    6
    Location:
    Herwood, Tampere, Finland
    Because Bulldozer was supposed to clock some 25% higher than K7/K8/K10 did, and even though the integer units themselves are small and cheap, they require ports in the register file and connections on the bypass network. And those ports and connections make hitting high clock speeds harder.

    The plan was to trade away some 10% of ILP and gain some 25% in clock speed. But then they had some bad critical paths elsewhere in the core and could barely clock it 10% faster.
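
    Running those numbers as a back-of-the-envelope check (using per-core perf ≈ IPC × clock, and taking the 10% and 25% figures above at face value): the plan works out to roughly 0.90 × 1.25 ≈ 1.13, a ~13% net gain per core, while what shipped was closer to 0.90 × 1.10 ≈ 0.99, i.e. no net single-thread gain at all before counting the module-level throughput win.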
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,365
    Likes Received:
    3,955
    Location:
    Well within 3d
    This is where AMD's highlighting of the minor area cost of its integer ALUs was shown to be an incomplete picture of the situation the design faced.

    They traded away more than 10% of ILP, however: it took one or two core revisions, each bringing 10-15% per-clock improvements, before the line reached consistently equivalent or better IPC than Phenom.
    Even the theoretical clock speed increase, given the modest decrease in per-stage complexity, put a 25% upclock at the top of the optimistic curve. The misstep is clear in that, even on paper, it probably needed another 10-15% on top of that to compensate for major latency regressions in elements like the L2.

    It actually would have been understandable at the time if the initial SKUs didn't fully reach the peak of the design's clock envelope, relative to what a highly iterated previous-generation product on a more mature process could reach. It happened before with 90nm versus 65nm. It's difficult to say where things went most wrong, since Bulldozer's line barely budged past its initial high-water mark.

    I feel as if Bulldozer was AMD falling back to a design direction it had originally considered less compelling than the directions taken by at least one or two other cancelled designs that it replaced. At the same time, I wonder whether Bulldozer as originally envisioned was significantly pared back over the course of the full design process. The designer who conceived of the CMT idea that went into Bulldozer did not find the concept worthwhile if it only went as far as AMD's unadventurous implementation.
     
  20. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,506
    Likes Received:
    424
    Location:
    Varna, Bulgaria
    Bulldozer wasn't anything new, from a timing perspective. There were patent filings for a similar multi-threading architecture back in 2003 or 2004, AFAIK. My guess is that Bulldozer was in fact AMD's plan B, kicked off in a hurry some time after the K10h fiasco.

    It still boggles my mind why AMD decided to invest in a NetBurst redux after it was quite clear that the future was definitely not in power-hungry high-speed monsters. Not that they had much of a choice back then.
     