AMD RyZen CPU Architecture for 2017

hoom · Jun 8, 2015

3dilettante said:
The L1 data cache for Bulldozer has been 4-way and 16KB. AMD seems to have doubled the number of arrays so that there is a straight doubling of capacity and associativity.
The L1 instruction cache has been 2-way up until Steamroller replicated one of the arrays and made it 3-way.

Hmm OK they have moved on from 2-way then, but surely just doubling a 4-way cache doesn't = 8-way? (I may well be misunderstanding how associativity works)

3dilettante said:
I am very unclear as to that article's description of better branch prediction making data cache accesses more efficient.
What latencies were improved is unclear.

Yeah the description was sufficiently unclear I didn't make a serious attempt to try to understand WTF he was trying to say.
Point is big work with front-end & cache stuff has been something needed/lacking in Bulldozer and it doesn't seem unreasonable to expect these kind of improvements to be shared with Zen.

silent_guy said:
in the field, those never materialize, and we only ever see low performance models. I think the market is just not there: mediocre CPUs with mediocre GPU performance

Yeah, I had really expected to see AMD target the mid-range PC market with APUs rather than bottom end.
Have seen so many forum posts with people complaining about whatever game performing badly even though they'd bought a new 'Gaming PC', posted specs being basically crap.
I'd really expected an APU targeting that kind of market, bringing in a real decent level of performance without getting too expensive.

Alexko · Jun 8, 2015

silent_guy said:
I know AMD has hinted at something HBM APU in the future, but I don't see HBM happening any time soon for APU in the near future for consumer applications for the same reasons as for lower performance discrete GPU SKUs: I believe we've seen hints in the past the AMD APUs supported GDDR5? And relatively wide DDR3 memory interfaces? But in the field, those never materialize, and we only ever see low performance models. I think the market is just not there: mediocre CPUs with mediocre GPU performance that would be slightly less mediocre with a heavily underused HBM.

Now if AMD hits it really out of the ballpark with Zen (or whatever future CPU) and then links it to a truly outstanding high performance GPU that really needs all the bandwidth, maybe there's a case. But I think we're a technology node or two away from that.

My theory is that AMD is hinting at some HPC CPU/GPU combo where the cost of HBM is no objection...

From what I understand, the main reason that GDDR5 was dropped for APUs is that there was only one source of DIMM-based GDDR5 (SK-Hynix) which was a strategic liability. One would expect a few different manufacturers to produce HBM.

3dilettante · Jun 8, 2015

hoom said:
Hmm OK they have moved on from 2-way then, but surely just doubling a 4-way cache doesn't = 8-way? (I may well be misunderstanding how associativity works)

It's not a requirement, but a cache can be designed so that capacity can be scaled by the number of ways.
Intel's last-level caches have that property as well, frequently seen in the other direction. SKUs that have reduced capacity have an proportionally lower associativity.
An array or set of arrays implements a certain number of ways, so adding or removing them is more straightforward.

I think that's how AMD arrived at an I-cache with an odd number of ways, possibly how ARM did as well.
In the case of AMD's I-cache, it was a cheaper design change. It didn't fix one of the criticisms of that design, where the cache is too big for its associativity and page size, so aliasing is possible.

Yeah the description was sufficiently unclear I didn't make a serious attempt to try to understand WTF he was trying to say.
Point is big work with front-end & cache stuff has been something needed/lacking in Bulldozer and it doesn't seem unreasonable to expect these kind of improvements to be shared with Zen.

Keeping line segments inactive for loads smaller than the port size seems nice.
The improvements are good to have, so perhaps some of the techniques have been used for Zen as well. It's hard to say with how little information we have. These are parts intimately tied to the core's memory pipeline, so it can depend on how much Zen shares from that line.

fehu · Aug 9, 2015

http://www.google.com/patents/US20150155876

3dilettante · Aug 9, 2015

The bulk of the patent is actually older than the current filing date, and it also covers a system that abstracts the claims away from external devices. This seems orthogonal to the CPU architecture.

Voxilla · Aug 11, 2015

What AMD need to do IMHO is produce cores that do neither SMT nor CMT.
That will allow the cores to be simpler and use less energy.
Next they need to produce true 8 and 4 cores and position them against the Intel 4C/8T and 2C/4T processors.

That way + an IPC increase, they will win every multi-threading benchmark and still be competitive in single threading.

3dilettante · Aug 11, 2015

Are these cores half the size of an Intel core, or the same size?
The former case has been tried, and CMT is not entirely to blame for the problems that result.
The latter results in doubled area consumption, and AMD has the financial results of that as well.

SMT has some nice properties, where a single core can have more resources than can be justified in a single-threaded context, while providing some of the benefits of two cores without twice the area.
At 8 cores, and without the shared interface with the rest of the chip, AMD's chip would need to implement a surrounding SOC capable of handling 8 independent cores. That's a wholly different product and pricing tier above quads, so pricing the AMD equivalent of an E-series chip against a relatively inexpensive non-extreme edition desktop chip is as unwinnable as the current situation.

Voxilla · Aug 11, 2015

3dilettante said:
Are these cores half the size of an Intel core, or the same size?

These cores would be big cores, but smaller as Intels as all the SMT overhead and some other excess resouces can be cut out

AMD would compete with a chip tailored for 8 threads the same way Intel does tailor it's chips for 8 threads.
Only AMD would be better at it as it has 8 cores / 8T versus 4 cores / 8T.
For the rest of the CPU it only needs to dimension memory bandwidth and L3 cache size for the same 8 threads. So no need for the 'extreme' resources needed to handle 16 threads.

We all know silicon is a fractional cost of the price a CPU is sold for, in particular for the Intel case.
For an Intel beating (in multi-threading) mainstream CPU, AMD can charge similar prices as Intel does.
and that would benefit them a lot more as currently, even if their die size would be larger as Intel.

Grall · Aug 11, 2015

Voxilla said:
What AMD need to do IMHO is produce cores that do neither SMT nor CMT.

Why would you not want to have SMT? Not having SMT is exactly what AMD has been doing all up until now, and it murders IPC.

pjbliverpool · Aug 11, 2015

Grall said:
Why would you not want to have SMT? Not having SMT is exactly what AMD has been doing all up until now, and it murders IPC.

I wouldn't quite go that far. Don't forget that i5's don't have SMT either and clock for clock they are usually pretty close to i7's in most apps, and that's with less L3 too. Not that I agree with foregoing SMT.

3dilettante · Aug 11, 2015

Voxilla said:
These cores would be big cores, but smaller as Intels as all the SMT overhead and some other excess resouces can be cut out

The overhead is a small percentage increase in area and complexity, and a fair number of execution units and OoO resources that would be difficult to justify if a core never ran more than one thread.
However, once SMT brings those resources into the core, they can be used to increase performance incrementally.

AMD would compete with a chip tailored for 8 threads the same way Intel does tailor it's chips for 8 threads.

Make a large Xeon chip with reduced clocks that is then sold for as much as a 4-core i7?

For the rest of the CPU it only needs to dimension memory bandwidth and L3 cache size for the same 8 threads. So no need for the 'extreme' resources needed to handle 16 threads.

With SMT, the rest of the chip is sized to match the needs of the cores, not the threads.
A quad i7 is going to need infrastructure for 4 cores.
The hypothetical AMD chip will need infrastructure for 8, and that is past the threshold Intel uses for the compact consumer uncore.

pjbliverpool said:
I wouldn't quite go that far. Don't forget that i5's don't have SMT either and clock for clock they are usually pretty close to i7's in most apps, and that's with less L3 too. Not that I agree with foregoing SMT.

That's because they are physically i7s. They get the free ride of being more heavily built because the i7 and the server workloads that the core applies to can justify the additional resources, whereas the single-threaded only case would likely lead to reductions since Intel has a harder rule on performance uplift per power cost, and the icing on the cake resources from the i7's design target would not make sense without the i7's throughput goals.

Kaarlisk · Aug 11, 2015

pjbliverpool said:
Don't forget that i5's don't have SMT either and clock for clock they are usually pretty close to i7's in most apps

In most single-threaded apps. As soon as you have multi-threaded applications, SMT can increase IPC per core (not per thread) substantially.

Voxilla · Aug 12, 2015

3dilettante said:
With SMT, the rest of the chip is sized to match the needs of the cores, not the threads..

That is very wrong!
For one the working set of 2 threads is larger as for 1 thread.
So you need bigger L3 caches to accommodate 4 cores with 2xSMT versus 4 cores with no SMT.
Or similar sized L3 caches for 4 cores with 2xSMT and 8 cores with no SMT.

Similar your L3 bandwidth requirements for a core with SMT are higher compared to no SMT.

Or to make it even more clear. Say you have very efficient SMT that allows your single core with 2xSMT to perform as good as 2 cores, obviously your single core needs the same shared infrastructure as the 2 weaker cores.

Voxilla · Aug 12, 2015

Grall said:
Why would you not want to have SMT? Not having SMT is exactly what AMD has been doing all up until now, and it murders IPC.

For single threading SMT doesn't do anything to increase IPC.
For multi threading the total IPC will always be higher for 2 cores with no SMT compared to 1 core with 2xSMT.

Rurouni · Aug 12, 2015

The thing is, how much space and complexity is added to the core if we add SMT? If it isn't much, then why not add SMT? Why do you even need to add more L3 for SMT? I'd rather have 4 core with SMT vs 4 core with no SMT even if L3 size is the same and the no SMT is slightly more power efficient (by the removal of SMT stuff thus smaller core). I don't think you can't just remove SMT from 2 core CPU and expect to use the empty space to make 3 to 4 core CPU. And if you're not adding cores, the best that could happen by removing SMT is a more power efficient CPU. Making a simpler CPU would probably drop the IPC, thus needs more core to compete and single threaded app would suffer from it (because the lower single core performance).

Kaarlisk · Aug 12, 2015

Voxilla said:
Or to make it even more clear. Say you have very efficient SMT that allows your single core with 2xSMT to perform as good as 2 cores, obviously your single core needs the same shared infrastructure as the 2 weaker cores.

Thing is, SMT is a performance win only if not all of the core's resources are used. The shared infrastructure is sized for the single core at a high resource utilization, and SMT just allows splitting those resources among two threads that have low resource utilization.

Voxilla · Aug 12, 2015

Rurouni said:
The thing is, how much space and complexity is added to the core if we add SMT? If it isn't much, then why not add SMT?.

My point is, what is best performing hardware to execute N threads ?
Is it N cores or N / 2 cores with 2xSMT ?
SMT reduces your IPC per thread.

Some applicatoins are limited to the number of threads they can create to solve a prolbem. A CPU with lower IPC per thread will fail to provide the fastest solution.

Kaarlisk · Aug 12, 2015

Voxilla said:
My point is, what is best performing hardware to execute N threads ?
Is it N cores or N / 2 cores with 2xSMT ?
SMT reduces your IPC per thread.

N cores. But those N cores take up almost twice the die size of N/2 with 2xSMT.
SMT is basically a free performance win in case of threads with low IPC, while costing almost nothing otherwise.

Quad core with SMT is not meant to replace octa cores. Quad core with SMT is a way of getting octa core performance from a quad core whenever there is a low-IPC multi-threaded application.

Voxilla said:
Some applicatoins are limited to the number of threads they can create to solve a prolbem. A CPU with lower IPC per thread will fail to provide the fastest solution.

That is only true if the application is capable of exploiting IPC. If the application has low IPC, SMT will increase performance, as two threads will be running on a single core at almost full speed.

No SMT: 4 x 1 = 4 units of performance
With SMT 8 x 0.8 = 6.4 units of performance

http://www.hardware.fr/focus/101/perfs-avec-2-4-6-8-coeurs-4-jeux-loupe.html
Look at Watch dogs and Crysis

Voxilla · Aug 12, 2015

You are confusing things, IPC is a property of your CPU not of the software running on the CPU.
What you describe is threads that stall. That is an indication of bad software design.
SMT brings only max 30%, 2 cores bring 100% improvement.

Kaarlisk · Aug 12, 2015

Voxilla said:
You are confusing things, IPC is a property of your CPU not of the software running on the CPU.
What you describe is threads that stall.

No matter what you call it, the issue remains that not all code is capable of utilizing a core fully.

Voxilla said:
What you describe is threads that stall. That is an indication of bad software design.

No, it is not. Sometimes, but not always. I thought just as you did, and was corrected here: https://forum.beyond3d.com/posts/1863680/

Voxilla said:
SMT brings only max 30%, 2 cores bring 100% improvement.

In Watch dogs, 2 cores + HT is 50% faster than 2 cores without HT. At higher core counts the advantage is lower, of course.
http://images.anandtech.com/graphs/graph9483/76750.png

AMD RyZen CPU Architecture for 2017

hoom

Alexko

3dilettante

fehu

3dilettante

Voxilla

3dilettante

Voxilla

Grall

Invisible Member

pjbliverpool

B3D Scallywag

3dilettante

Kaarlisk

Voxilla

Voxilla

Rurouni

Kaarlisk

Voxilla

Kaarlisk

Voxilla

Kaarlisk

Similar threads