AMD RyZen CPU Architecture for 2017

hoom · Sep 29, 2015

GloFo is now saying their 14nm is on track: http://wccftech.com/globalfoundries-14nm-finfet-amd-zen-2016/
I'm inclined to think 'the lack of clarity' boils down to something simple, like some APUs being made at TSMC.

Kaotik · Oct 3, 2015

https://patchwork.ozlabs.org/patch/524324/

AMD has patched GCC for Zen optimizations

Dresdenboy has made some analysis based on it
http://dresdenboy.blogspot.fi/2015/10/amds-zen-core-family-17h-to-have-ten.html

fehu · Oct 4, 2015

I missed Dresdenboy

fehu · Oct 4, 2015

Oh hey wait! So +40% instructions per clock and probably a lower clock?

hoom · Oct 4, 2015

Not the super-high clocks that the Bulldozer architecture was supposed to reach (but never really achieved in practice, much like the P4).
As long as it's still in the upper 3.x to low 4.x GHz range, this still sounds great.
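
To put rough numbers on the IPC-versus-clock trade-off, here is a minimal sketch. It assumes performance scales simply as IPC times clock, which ignores memory and other effects, and the 4.0 GHz baseline is a hypothetical figure, not one from this thread. Only the +40% comes from the discussion above.

```python
def relative_performance(ipc_gain, clock_ratio):
    """Throughput relative to the old core, assuming perf = IPC * clock."""
    return (1.0 + ipc_gain) * clock_ratio


def break_even_clock_ratio(ipc_gain):
    """Fraction of the old clock at which the new core only ties the old one."""
    return 1.0 / (1.0 + ipc_gain)


OLD_CLOCK_GHZ = 4.0  # hypothetical baseline clock, an assumption for illustration

ratio = break_even_clock_ratio(0.40)
print(f"break-even clock: {ratio * OLD_CLOCK_GHZ:.2f} GHz")
print(f"speedup at 3.5 GHz: {relative_performance(0.40, 3.5 / OLD_CLOCK_GHZ):.3f}x")
```

Under those assumptions the new core would have to drop to roughly 71% of the old clock before the IPC gain is cancelled out, so anything in the upper 3.x GHz range would still be a net win.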

3dilettante · Oct 5, 2015

I've skimmed some of the FP stuff, and while I think we might need to wait on more significant rewrites, there are a few oddities.

I am not sure about the way the operations are split in Dresdenboy's speculative diagram.
That has 4 arrows out of the scheduler (the pipes?) of MUL and ADD, paired into FMAC units.
However, I don't think that is reflected in how some of the operations are split out.
For example, the division ops reside in fp3, and division algorithms benefit significantly from having a readily available MAC.
This might be a typo, but the fma ops connect fp0 or fp1 to fp3. fp2, which also handles store and shifts, is not involved.
For the sseiadd ops, fp0|fp1|fp3 seem like they could be reminiscent of some of the unexpected FP integer throughput that Bulldozer (the original) had, and also not indicative of an alternating port mapping.
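
As a rough illustration of why divide likes having a MAC nearby, here is a sketch of a Newton-Raphson reciprocal where each refinement step is two multiply-adds. This is a textbook scheme, not a claim about how Zen's divider works. The `fma` helper is a stand-in I made up for illustration: it rounds twice, unlike a real fused multiply-add.

```python
import math


def fma(a, b, c):
    # Stand-in for a hardware fused multiply-add (assumption: plain Python
    # arithmetic here, so this rounds twice instead of once).
    return a * b + c


def reciprocal(d, iterations=5):
    """Approximate 1/d with Newton-Raphson, two multiply-adds per step."""
    if d == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    sign = -1.0 if d < 0 else 1.0
    m, exp = math.frexp(abs(d))          # abs(d) = m * 2**exp with 0.5 <= m < 1
    x = 48.0 / 17.0 - (32.0 / 17.0) * m  # linear first guess for 1/m
    for _ in range(iterations):
        e = fma(-m, x, 1.0)              # residual: 1 - m*x
        x = fma(x, e, x)                 # refine:   x + x*e
    return sign * math.ldexp(x, -exp)    # undo the scaling: 1/d = (1/m) * 2**-exp


def divide(n, d):
    """Division as a multiply by the refined reciprocal."""
    return n * reciprocal(d)


print(divide(1.0, 3.0))
```

The error roughly squares on each pass, so a handful of dependent multiply-adds replaces a long iterative divide, which is why it seems odd to place division on a pipe without easy access to one.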

I'm trying to tally the overall behaviors to see how the integer, fp, scalar, and packed ops in the FPU fall out. I think the actual mapping of units and how they are used is not as straightforward as 2x(ADD+MUL), depending on functionality and domain. Shifts, shuffles, classic FP ops, and special functions complicate the resource sharing.
As to the exact bit-width of the units, 128-bit seems like it could be costly, particularly in light of the apparent bottlenecking on fp3 for FMA. The given costs for the instructions when going to the (unmodeled) fastpath double for 256-bit may be omitting how well they can run without conflicting.

One oddity I do see that departs from the earlier bdver4 patches is the latency figures for the stores that come out of the fp2 pipeline. (correction: fp2 for MMX, undefined for others)

If it's a streaming extension store, the given cost is 1, not the 4 or more given for the store ops in earlier cores.
That could be an error, but it might also explain why AMD went through the trouble of adding a zero cache line instruction, if one of the big uses for wide SIMD--zeroing out blocks of memory--suddenly doesn't work the same way through the store path as usual. (edit: Granted, it could also be something needed to keep pace with a use case for AVX-512 stores.)

Maybe after I'm done tallying up where all the ops are going I can see more of the pattern.

lanek · Oct 6, 2015

I'm not sure we should read too much into the "pipelines" right now; too much info is missing or unclear.

Anyway, it may be interesting to read his comment about the FMA bridge: http://citavia.blog.de/2009/11/23/some-additional-bits-of-information-7441398/

(Personally, I have not analysed it much, but it seems quite possible that in this sense they have kept some Bulldozer integration. If I understand correctly, the GCC patch is not complete and will continue to be updated.)

entity279 · Oct 6, 2015

I think that was the meat of Dresdenboy's speculation, though: the number of pipes.

I'm a bit surprised by the number of decoders, which looks like a big shift from the BD arch. Are the scheduling resources the limiter when deciding to make this core a 4-way SMT vs a 2-way one? I guess so

Gubbi · Oct 6, 2015

entity279 said:
I'm a bit surprised by the number of decoders, which looks like a big shift from the BD arch. Are the scheduling resources the limiter when deciding to make this core a 4-way SMT vs a 2-way one? I guess so

The BD frontend can decode four instructions per cycle. In fact, it is about the only part of BD that is as good or better than the Intel counterpart.

I hope they've added a post-decode cache with wider issue, similar to Intel's. Internally Zen is quite wide (4 INT, 2 L/S, 4 FP), so it should be able to exploit >4 instructions/cycle in compact loops.
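
A toy bound model shows why the front end matters here. It takes the pipe counts above (4 INT, 2 L/S, 4 FP) and treats a loop's sustained rate as limited by whichever resource saturates first. The loop mix and the wider front-end widths are made-up numbers for illustration, and the model ignores dependencies, retire width, and everything else a real core has to deal with.

```python
def cycles_per_iteration(n_int, n_ls, n_fp, front_end_width,
                         int_pipes=4, ls_pipes=2, fp_pipes=4):
    """Lower bound on cycles per loop iteration: the busiest resource wins."""
    total = n_int + n_ls + n_fp
    return max(n_int / int_pipes,        # integer pipes
               n_ls / ls_pipes,          # load/store pipes
               n_fp / fp_pipes,          # FP pipes
               total / front_end_width)  # instructions delivered per cycle


def sustained_ipc(n_int, n_ls, n_fp, front_end_width):
    """Best-case instructions per cycle for the loop under this model."""
    total = n_int + n_ls + n_fp
    return total / cycles_per_iteration(n_int, n_ls, n_fp, front_end_width)


# Hypothetical compact loop: 3 integer, 2 load/store, 3 FP instructions.
for width in (4, 6, 8):  # 4 = decoders only; 6 and 8 are assumed wider op-cache rates
    print(width, round(sustained_ipc(3, 2, 3, width), 2))
```

With a 4-wide front end the loop is capped at 4 per cycle even though the execution pipes could take more, which is the case a wider post-decode cache would address.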

As for FP, I hope they go wide. BD's shared FP unit was conceived at a point in time when static leakage looked set to dominate power consumption; that's not the case anymore with FinFETs. A large unused FP unit is just dark silicon; silicon is almost free, power isn't.

Cheers

fehu · Nov 26, 2015

WCCFTech says a depressing late Q4 2016 for the FX CPU and a generic 2017 for the APU, but in 4 years AMD will be able to match the Xbox One's performance, even if with Carrizo they are not that far off at the moment.

hoom · Nov 28, 2015

That is depressing.
I'd have thought a tape-out earlier this year might have meant availability in early to mid 2016.

But I don't know much about those kinds of lead times, and with an all-new architecture I guess there will need to be a lot more validation vs a modification of an existing core.

Alexko · Nov 29, 2015

As far as I know, ~18 months between tape-out and release is pretty standard.
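
For a quick sanity check on that lead time, here is the month arithmetic. The tape-out month is an assumption on my part: the thread only says "earlier this year", so June 2015 is a placeholder.

```python
from datetime import date


def add_months(start, months):
    """Shift a date by whole months (month granularity, day pinned to the 1st)."""
    total = start.year * 12 + (start.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


tape_out = date(2015, 6, 1)         # hypothetical tape-out month, not a confirmed date
release = add_months(tape_out, 18)  # the ~18 month rule of thumb
print(release.strftime("%B %Y"))
```

A mid-2015 tape-out plus 18 months lands at the end of 2016.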

Kaotik · Nov 29, 2015

Alexko said:
As far as I know, ~18 months between tape-out and release is pretty standard.

On CPUs, yes. But hasn't AMD stated for quite some time already that Zen will become available in late 2016, with full availability in 2017?

hoom · Nov 29, 2015

Hmm, in that case it's on track I guess.
I guess I'm just kinda projecting my own desire for a move up from my aging Thuban...

fehu · Nov 30, 2015

From the past news I was expecting an APU before late 2016

iroboto · Nov 30, 2015

Grah. Was hoping this would be my next. I'm not sure if I can hold out another year on an unlocked Phenom II X2.

3dilettante · Nov 30, 2015

The time frame also generally coincides with the presumed 4-5 year lead time from starting a design and bringing it to market, if we go by when AMD hired Jim Keller.
Assuming this isn't another case of a design being targeted at a node and delayed to the next like Bulldozer was at 45nm, that seems to allow for a more clean-sheet design than if it had come out sooner.

iroboto · Nov 30, 2015

3dilettante said:
The time frame also generally coincides with the presumed 4-5 year lead time from starting a design and bringing it to market, if we go by when AMD hired Jim Keller.
Assuming this isn't another case of a design being targeted at a node and delayed to the next like Bulldozer was at 45nm, that seems to allow for a more clean-sheet design than if it had come out sooner.

A bit of an off-topic amateur question, but is there a reason why ARM chip makers tend to iterate more often, or get onto a new process node before the x86 giants do?
edit: nvm, I think the trade-off is that performance per watt is higher on their list of priorities.

3dilettante · Nov 30, 2015

What chips are we comparing, chips in the same product range, or just in general?
Apple is perhaps the ARM architectural licensee that has the shortest cadence, although it has had one major architectural transition with more iterative changes since.
There are more companies designing ARM cores than x86, although their individual rates of product introduction are slower than the press-release collective ARM drumbeat.

Are there other vendors that are able to beat Intel's cadence--at least prior to its apparent tick-tock stumble in the latest generation? Intel actually has tweaked its cores at process transitions, and some of those transitions could have been labelled a core revision or a new core by other vendors.
Process-wise, 14nm FinFET has been in Intel products for quite a while.

There has been a gap in product requirements, where server-bound x86 chips can take a year or more longer than client offerings. That can be one reason for putting Zen's server variant after the client one, on top of AMD's products being part of the yield-learning process for GF.
AMD being beaten in iteration rate is because it is a struggling giant, if it can rate in that category.

For what it's worth, Intel also has a history of putting a lot more up-front effort into its cores, including the physical design and system integration, whereas ARM has historically left more on the table in order to make a more broadly applicable core, with incremental revisions that gradually work up better sustained performance or power-efficiency. The A72 might be an example of ARM making a more concerted effort to revise and target physical implementation better.

iroboto · Nov 30, 2015

3dilettante said:
What chips are we comparing, chips in the same product range, or just in general?
Apple is perhaps the ARM architectural licensee that has the shortest cadence, although it has had one major architectural transition with more iterative changes since.
There are more companies designing ARM cores than x86, although their individual rates of product introduction are slower than the press-release collective ARM drumbeat.

Are there other vendors that are able to beat Intel's cadence--at least prior to its apparent tick-tock stumble in the latest generation? Intel actually has tweaked its cores at process transitions, and some of those transitions could have been labelled a core revision or a new core by other vendors.
Process-wise, 14nm FinFET has been in Intel products for quite a while.

There has been a gap in product requirements, where server-bound x86 chips can take a year or more longer than client offerings. That can be one reason for putting Zen's server variant after the client one, on top of AMD's products being part of the yield-learning process for GF.
AMD being beaten in iteration rate is because it is a struggling giant, if it can rate in that category.

For what it's worth, Intel also has a history of putting a lot more up-front effort into its cores, including the physical design and system integration, whereas ARM has historically left more on the table in order to make a more broadly applicable core, with incremental revisions that gradually work up better sustained performance or power-efficiency. The A72 might be an example of ARM making a more concerted effort to revise and target physical implementation better.

I guess I was more focused on the foundry side of things: whether some sort of manufacturing issue causes some companies to be ahead of or behind others. But you're right, Intel has been on 14nm for some time now, and they use their own foundries from what I understand. Apple is still sitting at 20nm I believe (edit: correction, sorry, 14nm), but the foundries are Samsung?
