AMD RyZen CPU Architecture for 2017

Discussion in 'PC Industry' started by fellix, Oct 20, 2014.

  1. hoom

    Veteran

    Joined:
    Sep 23, 2003
    Messages:
    3,262
    Likes Received:
    813
  2. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    Lightman, Newguy, fehu and 3 others like this.
  3. fehu

    Veteran

    Joined:
    Nov 15, 2006
    Messages:
    2,067
    Likes Received:
    992
    Location:
    Somewhere over the ocean
    I missed Dresdenboy
     
    hoom likes this.
  4. fehu

    Veteran

    Joined:
    Nov 15, 2006
    Messages:
    2,067
    Likes Received:
    992
    Location:
    Somewhere over the ocean
    Oh hey wait! So +40% instructions per clock and probably a lower clock? :yelling:
     
  5. hoom

    Veteran

    Joined:
    Sep 23, 2003
    Messages:
    3,262
    Likes Received:
    813
    Not super-high clocks like the Bulldozer architecture was supposed to reach (but never really achieved in practice, much like the P4).
    As long as it's still in the upper 3.x to low 4.x GHz range, this still sounds great.
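
To put rough numbers on the IPC-versus-clock trade-off discussed above, here is a minimal sketch using the usual first-order model (single-thread throughput is roughly IPC times clock). The +40% figure is the one quoted in this thread; the 4.0 GHz baseline and the candidate clocks are purely hypothetical values chosen for illustration, not known specifications.

```python
def relative_performance(ipc_gain, clock_ratio):
    """Throughput relative to the old core, assuming perf ~ IPC x clock."""
    return (1.0 + ipc_gain) * clock_ratio


def break_even_clock_ratio(ipc_gain):
    """Clock ratio (new/old) at which the new core merely matches the old one."""
    return 1.0 / (1.0 + ipc_gain)


IPC_GAIN = 0.40            # the +40% IPC figure quoted in the thread
BASELINE_CLOCK_GHZ = 4.0   # hypothetical clock of the older core

for new_clock_ghz in (3.0, 3.5, 4.0):  # hypothetical clocks for the new core
    ratio = new_clock_ghz / BASELINE_CLOCK_GHZ
    perf = relative_performance(IPC_GAIN, ratio)
    print(f"{new_clock_ghz:.1f} GHz -> {perf:.2f}x the baseline")

print(f"break-even clock ratio: {break_even_clock_ratio(IPC_GAIN):.3f}")
```

Under this simple model the clock would have to fall to roughly 71% of the old one before the IPC gain is cancelled out, which is why clocks in the upper 3.x GHz range would still be a clear win.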
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I've skimmed some of the FP stuff, and while I think we might need to wait on more significant rewrites, there are a few oddities.

    I am not sure about the way the operations are split in Dresdenboy's speculative diagram.
    That has 4 arrows out of the scheduler (the pipes?) of MUL and ADD, paired into FMAC units.
    However, I don't think that is reflected in how some of the operations are split out.
    For example, the division ops reside in fp3, and division algorithms benefit significantly from having a readily available MAC.
    This might be a typo, but the fma ops connect fp0 or fp1 to fp3. fp2, which also handles store and shifts, is not involved.
    For the sseiadd ops, fp0|fp1|fp3 seem like they could be reminiscent of some of the unexpected FP integer throughput that Bulldozer (the original) had, and also not indicative of an alternating port mapping.
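
As an illustration of why division benefits from a nearby multiply-accumulate unit, here is a minimal sketch of Newton-Raphson reciprocal refinement, a textbook technique in which every refinement step is a pair of multiply-add operations. This is not a description of how Zen's divider actually works, which the patches do not reveal; the function names, the linear seed, and the iteration count are all assumptions made for illustration.

```python
import math


def reciprocal_newton(d, iterations=4):
    """Approximate 1/d with Newton-Raphson; each step is two multiply-adds."""
    if d == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    sign = math.copysign(1.0, d)
    m, e = math.frexp(abs(d))        # abs(d) = m * 2**e with m in [0.5, 1)
    # Linear seed for 1/m on [0.5, 1); hardware would typically use a
    # small lookup table instead (this seed is an illustrative choice).
    x = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(iterations):
        err = 1.0 - m * x            # one multiply-add: error of the estimate
        x = x + x * err              # one multiply-add: refined estimate
    return sign * math.ldexp(x, -e)  # undo the scaling: 1/d = (1/m) * 2**-e


def divide(a, b):
    """Division expressed as a multiply by the refined reciprocal."""
    return a * reciprocal_newton(b)


print(divide(1.0, 3.0))
print(divide(10.0, 4.0))
```

The error roughly squares on every iteration, so a handful of multiply-adds is enough for double precision. That is the sense in which a divider sitting in a pipe without easy access to a MAC looks odd.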

    I'm trying to tally the overall behaviors to see how the integer, fp, scalar, and packed ops in the FPU fall out. I think the actual mapping of units and how they are used is not as straightforward as 2x(ADD+MUL), depending on functionality and domain. Shifts, shuffles, classic FP ops, and special functions complicate the resource sharing.
    As to the exact bit-width of the units, 128-bit seems like it could be costly, particularly in light of the apparent bottlenecking on fp3 for FMA. The given costs for the instructions when going to the (unmodeled) fastpath double for 256-bit may be omitting how well they can run without conflicting.

    One oddity I do see that departs from the earlier bdver4 patches is the latency figures for the stores that come out of the fp2 pipeline. (correction: fp2 for MMX, undefined for others)

    If it's a streaming extension store, the given cost is 1, not the 4 or more given for the store ops in earlier cores.
    That could be an error, but it might also explain why AMD went through the trouble of adding a zero cache line instruction, if one of the big uses for wide SIMD--zeroing out blocks of memory--suddenly doesn't work the same way through the store path as usual. (edit: Granted, it could also be something needed to keep pace with a use case for AVX-512 stores.)

    Maybe after I'm done tallying up where all the ops are going I can see more of the pattern.
     
    #246 3dilettante, Oct 5, 2015
    Last edited: Oct 6, 2015
  7. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    I'm not sure we should read too much into the "pipelines" right now; too much information is missing or unclear.

    Anyway, it may be interesting to read his comment about the FMA bridge: http://citavia.blog.de/2009/11/23/some-additional-bits-of-information-7441398/

    (Personally, I have not analyzed it much, but it seems quite possible that in this sense they have kept some of the Bulldozer integration. If I understand correctly, the GCC patch is not complete and will continue to be updated.)
     
  8. entity279

    Veteran Subscriber

    Joined:
    May 12, 2008
    Messages:
    1,332
    Likes Received:
    500
    Location:
    Romania
    I think that was the meat of Dresdenboy's speculation, though: the number of pipes.

    I'm a bit surprised by the number of decoders, which looks like a big shift from the BD arch. Are the scheduling resources the limiter when deciding whether to make this core 4-way SMT vs. 2-way? I guess so.
     
  9. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    The BD frontend can decode four instructions per cycle. In fact, it is about the only part of BD that is as good or better than the Intel counterpart.

    I hope they've added a post-decode cache with wider issue, similar to Intel's. Internally Zen is quite wide (4 INT, 2 L/S, 4 FP), so it should be able to exploit >4 instructions/cycle in compact loops.

    As for FP, I hope they go wide. BD's shared FP unit was conceived at a point in time when static leakage looked set to dominate power consumption; that's not the case anymore with FinFETs. A large unused FP unit is just dark silicon; silicon is almost free, power isn't.

    Cheers
     
    I.S.T. and Jawed like this.
  10. fehu

    Veteran

    Joined:
    Nov 15, 2006
    Messages:
    2,067
    Likes Received:
    992
    Location:
    Somewhere over the ocean
    WCTech says a depressing late Q4 2016 for the FX CPU and a generic 2017 for the APU, but in 4 years AMD will be able to match Xbox One performance, even if with Carrizo they are not that far off at the moment.
     
  11. hoom

    Veteran

    Joined:
    Sep 23, 2003
    Messages:
    3,262
    Likes Received:
    813
    That is depressing.
    I'd have thought a tape-out earlier this year might have meant availability in early to mid 2016.

    But I don't know much about those kinds of lead times, and with an all-new architecture I guess there will need to be a lot more validation vs. a modification of an existing core.
     
    #251 hoom, Nov 28, 2015
    Last edited: Nov 29, 2015
  12. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    As far as I know, ~18 months between tape-out and release is pretty standard.
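
A quick sanity check of that rule of thumb, as a minimal sketch: the thread only says the tape-out was sometime "earlier this year" (2015), so the specific months below are hypothetical, and `add_months` is a helper written for this example.

```python
from datetime import date


def add_months(start, months):
    """Return the first day of the month that lies `months` after `start`."""
    total = start.year * 12 + (start.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


LEAD_TIME_MONTHS = 18  # the ~18 month tape-out-to-release rule of thumb

for month in (1, 6, 11):  # hypothetical tape-out months within 2015
    tape_out = date(2015, month, 1)
    release = add_months(tape_out, LEAD_TIME_MONTHS)
    print(f"tape-out {tape_out:%b %Y} -> release around {release:%b %Y}")
```

Any tape-out in 2015 lands the release somewhere between mid-2016 and mid-2017 under this assumption.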
     
  13. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    On CPUs, yes. But hasn't AMD stated for quite some time already that Zen will become available in late 2016, with full availability in 2017?
     
  14. hoom

    Veteran

    Joined:
    Sep 23, 2003
    Messages:
    3,262
    Likes Received:
    813
    Hmm, in that case it's on track, I guess.
    I guess I'm just kinda projecting my own desire for a move up from my aging Thuban...
     
    #254 hoom, Nov 29, 2015
    Last edited: Nov 30, 2015
  15. fehu

    Veteran

    Joined:
    Nov 15, 2006
    Messages:
    2,067
    Likes Received:
    992
    Location:
    Somewhere over the ocean
    From the past news I was expecting an APU before late 2016.
     
  16. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    Grah. Was hoping this would be my next. I'm not sure if I can hold out another year on an unlocked Phenom II X2.
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The time frame also generally coincides with the presumed 4-5 year lead time from starting a design and bringing it to market, if we go by when AMD hired Jim Keller.
    Assuming this isn't another case of a design being targeted at a node and delayed to the next like Bulldozer was at 45nm, that seems to allow for a more clean-sheet design than if it had come out sooner.
     
  18. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    A bit off-topic amateur question: is there a reason why ARM chip makers tend to iterate more often, or move to new process nodes before the x86 giants do?
    edit: nvm, I think the trade-off is that performance per watt is higher on their list of priorities.
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    What chips are we comparing, chips in the same product range, or just in general?
    Apple is perhaps the ARM architectural licensee that has the shortest cadence, although it has had one major architectural transition with more iterative changes since.
    There are more companies designing ARM cores than x86, although their individual rates of product introduction are slower than the press-release collective ARM drumbeat.

    Are there other vendors that are able to beat Intel's cadence--at least prior to its apparent tick-tock stumble in the latest generation? Intel actually has tweaked its cores at process transitions, and some of those transitions could have been labelled a core revision or a new core by other vendors.
    Process-wise, 14nm FinFET has been in Intel products for quite a while.

    There has been a gap in product requirements, where server-bound x86 chips can take a year or more longer than client offerings. That can be one reason for putting Zen's server variant after the client one, on top of AMD's products being part of the yield-learning process for GF.
    AMD being beaten in iteration rate is because it is a struggling giant, if it can rate in that category.

    For what it's worth, Intel also has a history of putting a lot more up-front effort into its cores, including the physical design and system integration, whereas ARM has historically left more on the table in order to make a more broadly applicable core, with incremental revisions that gradually work up better sustained performance or power-efficiency. The A72 might be an example of ARM making a more concerted effort to revise and target physical implementation better.
     
    iroboto likes this.
  20. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    I guess I was more focused on the foundry part of things: whether there are some sort of manufacturing issues in place that cause some companies to be ahead of or behind others. But you're right, Intel has been on 14nm for some time now, and they use their own foundries from what I understand. Apple is still sitting at 20nm I believe (edit: correction, sorry, 14nm), but the foundries are Samsung?
     