AMD Ryzen CPU Architecture for 2017

Because power regulation is per core, the 32-core part could be really interesting, particularly for VM farms, with clocks all over the place between cores. It would be interesting to see where AMD sets the max turbo for a 180 W TDP part.
Most VM farms in the enterprise space disable CPU power-efficiency features; too many latency issues. When I was at VMware, support would often tell customers to disable Intel SpeedStep and C-states. And server purchasing decisions have more to do with integration, scaling, and automation than benchmarks.

That said, Ryzen looks like a really good hyperscaler CPU.
 
Follow-up: Facebook and others don't like NUMA and are increasingly bottlenecked by memory bandwidth. They build their own nodes, fabric, and automation stack, so the enterprise features of Cisco or HP offer little value. 64 threads and 8 DDR4 channels per node could be very compelling.
 
L3 apparently matches the clock of the fastest core
This seems like a good thing; previously the L3 was at Northbridge clocks, wasn't it?
I don't get why they have four clock gates if the whole thing runs at the fastest core's clock.
Which reminds me of being perplexed at how Intel could claim no latency penalty for accessing downclocked L3 slices.

Also, something that has bugged me: AMD talks about the 'same average latency' to all L3 slices.
Is that 10 ns, 20 ns, 20 ns, 30 ns = 20 ns average (made-up numbers),
or 20 ns, 20 ns, 20 ns, 20 ns?
If the former, then it's a totally pointless claim :unsure:
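To put numbers on why that distinction matters, here's a toy sketch (reusing the made-up latencies above) showing that both interpretations report the same mean:

```python
# Toy numbers from the post above: two latency profiles with the same mean.
non_uniform = [10, 20, 20, 30]  # ns; per-slice latency varies
uniform = [20, 20, 20, 20]      # ns; every slice costs the same

for name, lat in (("non-uniform", non_uniform), ("uniform", uniform)):
    avg = sum(lat) / len(lat)
    spread = max(lat) - min(lat)
    print(f"{name}: average {avg:.0f} ns, spread {spread} ns")

# Both report a 20 ns average; only the spread tells them apart, which is
# why 'same average latency' alone says nothing about uniformity.
```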
 
About the Ryzen 7 1700 being de-lidded (sorry for the repost; I didn't see it on the previous page).

Professional extreme overclocker Roman "der8auer" Hartung from Germany recently managed to successfully de-lid his AMD Ryzen 7 1700 processor and confirmed that AMD is, in fact, using solder as its thermal interface material of choice between the Ryzen die and IHS (integrated heat spreader). The confirmation that AMD is using solder is promising news for enthusiasts eager to overclock the new processors and see just how far they are able to push them on air and water cooling.

It seems that AMD is using two small pads of indium solder, along with some gold plating on the inside of the IHS, to facilitate heat transfer and allow the solder to mate with the IHS. Because AMD is using what seems to be a high-quality solder TIM, delidding and replacing the TIM does not seem to be necessary at all; however, Roman "der8auer" Hartung speculates that direct-die cooling could work out very well for those enthusiasts brave enough to try it, since, unlike with an LGA socket, the cooler does not need to put a high amount of pressure onto the CPU to hold it in place.

https://www.pcper.com/news/Processo...onfirms-AMD-Using-Solder-IHS-Ryzen-Processors
 
Most VM farms in the enterprise space disable CPU power-efficiency features; too many latency issues. When I was at VMware, support would often tell customers to disable Intel SpeedStep and C-states. And server purchasing decisions have more to do with integration, scaling, and automation than benchmarks.
Yes, even I disable low-power C-states in my VM boxes at home, but that's not what I'm talking about; the current Nutanix AHV (KVM) clusters I'm working with have all P-states enabled, for example. Zen is supposed to do dynamic/shadow P-states, something that could be very interesting in farms where you have very asymmetric scheduling between guests. The Nutanix CVM is the perfect example of something that loves clock speed, requires a very large amount of CPU resource, and is needed on every node. So, assuming you don't get killed by the latency of the P-state switching, being able to boost high without wasting power/thermal headroom on dragging up other cores' voltage could be interesting to see.

Given that a 32-core server is going to be TDP-bound, if they give the option to go aggressive with that kind of feature, I can see places where it will make for an interesting test.
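For anyone wanting to eyeball that asymmetric boosting, here's a minimal sketch (assuming a Linux host exposing the standard cpufreq sysfs files; availability and accuracy vary by kernel and scaling driver) that samples per-core clocks:

```python
import glob
import time

def read_freqs():
    """Read each core's current clock from the Linux cpufreq sysfs interface."""
    freqs = {}
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq"):
        cpu = path.split("/")[5]  # e.g. 'cpu3'
        with open(path) as fh:
            freqs[cpu] = int(fh.read()) // 1000  # kHz -> MHz
    return freqs

# Sample for a few seconds; a large spread means some cores are boosting
# high while others idle low, i.e. per-core P-states doing their thing.
for _ in range(5):
    f = read_freqs()
    print(f"spread {max(f.values()) - min(f.values())} MHz: {f}")
    time.sleep(1)
```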

That said, Ryzen looks like a really good hyperscaler CPU.
Yes, I'm guessing up to 32 NVMe drives per proc would do that :)
The rumors are that Skylake-EP is supposed to be similar (and with a 3600-pin socket, I can believe it).
 
Still reading it, but did that reviewer mess up the 3DMark numbers, or is there something very fucky going on with the Combined scores? :confused:

SATA/M.2 performance is mostly right up there with Intel's, but there are some points where it bogs down.
 
Just a suggestion that doesn't need to be followed, but it would be useful for organizational purposes to create a separate thread for actual reviews.

Thank you.
 
Yes, even I disable low-power C-states in my VM boxes at home, but that's not what I'm talking about; the current Nutanix AHV (KVM) clusters I'm working with have all P-states enabled, for example. Zen is supposed to do dynamic/shadow P-states, something that could be very interesting in farms where you have very asymmetric scheduling between guests. The Nutanix CVM is the perfect example of something that loves clock speed, requires a very large amount of CPU resource, and is needed on every node. So, assuming you don't get killed by the latency of the P-state switching, being able to boost high without wasting power/thermal headroom on dragging up other cores' voltage could be interesting to see.

Given that a 32-core server is going to be TDP-bound, if they give the option to go aggressive with that kind of feature, I can see places where it will make for an interesting test.


Yes, I'm guessing up to 32 NVMe drives per proc would do that :)
The rumors are that Skylake-EP is supposed to be similar (and with a 3600-pin socket, I can believe it).
Small world, PM sent!
 
If the SMT problem can be resolved with code, then that's really good, because right now it's completely useless: it either does nothing or hurts performance.

How did AMD not find out about this before?
 
Intel had 15+ years to perfect their SMT implementation. It's quite laborious to QA a complicated OoO architecture for SMT.

New architecture, new process, new memory, new chipset: with all of that colliding at once, it was bound to have some imperfections.
 
Intel had 15+ years to perfect their SMT implementation. It's quite laborious to QA a complicated OoO architecture for SMT.

New architecture, new process, new memory, new chipset: with all of that colliding at once, it was bound to have some imperfections.

This is why I was waiting for actual reviews.

People should also note that Intel's SMT (Hyper-Threading) caused some serious performance issues in games for many, many years. It took multiple generations of CPUs before it became relatively safe to leave HT on. That's one of the main reasons I never bothered with i7 CPUs and just stuck with i5s; the additional cost of SMT just wasn't worth it for the workloads I typically run, gaming being one of them.

That shouldn't be too surprising: depending on the game, it could already be maxing out a CPU core. Add in the overhead of SMT, and you'll end up losing performance. While SMT is great for making use of unused CPU cycles, if there are no spare cycles to be had on a core, SMT is going to end up hurting performance.
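You can see that effect directly with a minimal sketch like the one below (assuming a Linux box where logical CPUs 0 and 1 are SMT siblings and 0 and 2 sit on separate physical cores; that layout is an assumption, so check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list first):

```python
import os
import time
from multiprocessing import Process

N = 20_000_000  # busy-loop iterations per worker

def spin(cpu):
    os.sched_setaffinity(0, {cpu})  # pin this worker to one logical CPU
    x = 0
    for i in range(N):
        x += i

def run_pair(cpus):
    workers = [Process(target=spin, args=(c,)) for c in cpus]
    start = time.perf_counter()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    # Assumed layout: CPUs 0/1 share a physical core, CPUs 0/2 do not.
    print(f"SMT siblings (0,1):   {run_pair([0, 1]):.2f} s")
    print(f"separate cores (0,2): {run_pair([0, 2]):.2f} s")
```

With already-saturated workers, the sibling pair should come out measurably slower, which is exactly the 'no spare cycles' case described above.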

Regards,
SB
 
This seems like a good thing; previously the L3 was at Northbridge clocks, wasn't it?
It could reduce the synchronization penalties from crossing disparate clock domains, although the slower cores would be out of step. It might be that this keeps the L2's shadow tags from bottlenecking the cores, as they might if they were kept at a slower uncore clock.
Power-wise, or perhaps for clock scaling/turbo, it's something extra that needs to be driven harder when single-core performance is being pushed.

Also, something that has bugged me: AMD talks about the 'same average latency' to all L3 slices.
On average, accesses should fall into an even distribution among the 4 L3 slices and their portion of the shadow tags.
It might take a specific pattern to really hit a local slice, and there's probably some pipelining that constrains how frequently a core can draw from a given quadrant. A 32-byte connection, for one thing, is going to eat up an extra cycle of access time, so that's an extra cycle of margin that can allow the pipeline to overlap some of the remote-versus-local latency. Other elements, like the serial nature of the shadow-tag checks, the array access, and contention, may also have fixed costs that dilute the variable portion's influence.
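As a toy illustration of that even distribution, here's a sketch using a hypothetical slice-selection hash (real L3s use undisclosed hash functions; this one is only for intuition):

```python
from collections import Counter

def slice_for(addr, num_slices=4):
    # Hypothetical slice selection: fold a few cache-line index bits.
    line = addr >> 6  # 64-byte cache lines
    return (line ^ (line >> 2) ^ (line >> 4)) % num_slices

# A linear walk over 1 MiB lands roughly evenly on all four slices, so the
# average latency a core sees is about the same no matter which slice
# happens to be physically closest to it.
hits = Counter(slice_for(a) for a in range(0, 1 << 20, 64))
print(hits)  # roughly 4096 accesses per slice
```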

Shouldn't that be done/started before the launch?
It's kind of a chicken-and-egg problem for devs to find issues on a chip and platform that has only relatively recently become available, and some of the timelines may be shorter than usual because of early samples.
The platform seems pretty raw as it is, but the alternative would be to delay it even further. A lot of issues wouldn't be found until after that anyway.

I was hoping that there would be more disclosure about the neural net prediction for core execution, but that slide didn't show up in the reviews I've seen so far.
The chip seems a bit raw, and some things look kind of inflexible. It's still decently close in many areas to a very mature platform and chip, and there's only so much that can compensate for that.
The silicon does look like it's being pushed pretty heavily already, thanks to the performance features in place.

I think part of the reason there's no 140 W SKU is that the cores and CCXs don't seem to be scoped to scale speeds and voltages much beyond what they already do. It's sort of a "balance" thing, where less is lost to bottlenecks because the cores don't give up area or power efficiency carrying margin for out-of-spec scenarios. Perhaps that's a question of physical maturity, or perhaps, if you want 30% more power draw, it would take another CCX to use it.
I'm curious now what this implies for the APUs, if the CPU-only physical implementation only reaches these speeds. The GPU might not make things easier.
Ryzen seems decent enough, usually landing in the same neighborhood as some good cores. It's a bit of a lateral move for many gamers, unfortunately.
It's early days, and perhaps people have forgotten how important the CPU optimization game is, since the BD line was so terrible for so long that optimizing for it was pointless.

I wouldn't have minded some kind of value-add or "wait, there's more" for the architecture to keep things interesting. It's a pretty standard and mostly respectable product with a decent value proposition if you don't already have something that performs similarly. There seems to be more waiting in the wings for the non-consumer market.
 
"No, this is strictly CPU scheduling within the game." Robert Hallock, talking if a gpu driver can fix SMT issues on games.
 
Considering that it performs pretty well in non-gaming scenarios, I wonder if it's possible that AMD uses very aggressive power-saving state switching (courtesy of a bazillion sensors in the chip). Games tend to have very variable workloads depending on which part of the frame is being processed. Did any of the reviewers mention whether they tried testing with power saving off?
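For anyone who wants to run that A/B test on a Linux box, here's a minimal sketch (assuming the standard cpufreq sysfs interface and root privileges; available governor names vary by scaling driver):

```python
import glob

# Force every core's cpufreq governor to 'performance' so the chip stops
# downclocking between bursts. Requires root; restore the previous governor
# (often 'ondemand' or 'powersave') the same way afterwards.
for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"):
    with open(path, "w") as f:
        f.write("performance")

# Then re-run the game benchmark and compare against the default governor.
```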
 
Some suggest that a big part of the performance issues (where they exist) is due to threads being thrown (too easily) from CCX to CCX, and having to snoop the other CCX's L3 (while also starting the memory access in case that's a miss too) doesn't help. I wonder if they could evolve the design so that the CCXs share one big L3 cache instead of each having a dedicated L3? That could solve at least part of the problem.
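Until something like that happens, one software-side workaround is pinning a workload to a single CCX so its threads never migrate across that L3 boundary. A minimal sketch (assuming a Linux host, and assuming logical CPUs 0-3 map to one CCX, which is a hypothetical layout; verify with lstopo or the cache topology under /sys first):

```python
import os
import subprocess

# Hypothetical mapping: logical CPUs 0-3 sit on one CCX.
CCX0 = {0, 1, 2, 3}

# Pin the current process (pid 0 = self) to CCX0 so the scheduler cannot
# bounce its threads across the cross-CCX L3 boundary.
os.sched_setaffinity(0, CCX0)
print("running on CPUs:", os.sched_getaffinity(0))

# Children launched from here inherit the affinity mask.
subprocess.run(["echo", "child inherits the CCX0 mask"])
```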
 