AMD Ryzen CPU Architecture for 2017

AMD said pretty much exactly that when asked about inter-CCX communication: if the data is found in the other CCX's L3, then the memory request is cancelled. That only makes sense if you have a buffer deep enough to accommodate the latency of getting this information (not necessarily the data itself), doesn't it?

The L3 isn't inclusive, so checking it is insufficient. Perhaps the statement covers the shadow tag check as well?
There are a number of questions I have about the sequence of events, since this isn't from a directly written AMD statement.

The buffer for absorbing unneeded RAM requests shouldn't need to be the size of those external arrays. I'm not sure where exactly the L2 miss handling is done, and the CCX interface may play some role in it with the shadow tag functions. Wherever the handling is done, the number of misses that can be outstanding before the pipeline stalls is limited. Some papers on one of AMD's ideas for handling coherence between the CPU and GPU peg it at perhaps several dozen, which is problematic for GPU loads that generate many thousands (http://research.cs.wisc.edu/multifacet/papers/micro13_hsc.pdf).
If we go by the visually estimated 2MBx2 capacity of those external arrays, that is 64K cache-line requests. A whole 2-socket Naples system would stall before noticeably filling one array.
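The arithmetic behind that estimate, as a quick sketch (the 2MBx2 figure is the visual estimate above; 64-byte lines and the ~48-miss budget are my assumptions for illustration):

```python
# Back-of-envelope: treat the two ~2 MB external arrays as buffering
# for outstanding cache-line requests, one 64-byte line per entry.
LINE_BYTES = 64
arrays_bytes = 2 * 2 * 1024**2          # two ~2 MB arrays
entries = arrays_bytes // LINE_BYTES
print(entries)                          # 65536 -> the "64K cache line requests"

# Compare with an outstanding-miss budget of a few dozen per core,
# as in the coherence paper linked above:
print(entries // 48)                    # over a thousand times deeper
```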
Could AMD have massively changed things this time around? Even if we assume the paper's baseline is conservative and predates Zen, I'm not sure Zen's coherent pipeline has shown a revamp so extreme.

A directory or filter like HT Assist seems to fit the storage capacity.
The old version of HT Assist from Istanbul could cover 16MB of cache with 1MB of capacity stolen from the L3. (edit: https://www.cs.columbia.edu/~junfeng/12sp-w4118/lectures/amd.pdf)
One of those possibly 2MB arrays in Zen could cover 32MB--which would be enough to cover the L3s of a 4-chip Naples, although insufficient to cover the L2s. Having two, however, would cover one Naples MCM entirely if there's some amount of balance in cache utilization. The other effect of having the array external to the CCX is that it would allow the CCX to downclock or power-gate without affecting memory traffic or coherence, since the L3 is no longer part of the northbridge.
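Running the HT Assist-style coverage ratio through those numbers (the 16:1 ratio comes from Istanbul's 1 MB covering 16 MB; the 2 MB array size is the visual estimate, so treat this as a sketch):

```python
# Directory coverage at Istanbul's HT Assist ratio:
# 1 MB of directory covers 16 MB of cache.
COVERAGE_RATIO = 16
array_mb = 2                        # one external array (estimated size)
per_array = array_mb * COVERAGE_RATIO
print(per_array)                    # 32 MB of cache covered per array

# One Naples MCM: 4 dies x 2 CCX x 8 MB L3 each.
total_l3 = 4 * 2 * 8
print(total_l3, 2 * per_array)      # 64 MB of L3 vs 64 MB covered by two arrays
```

Which lines up with two arrays being enough to cover a full MCM's L3, with nothing left over for the L2s.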

Another unknown is what AMD's version of MOESI looks like for Zen. The classical MOESI would generally not need as much buffering, because a lot of traffic is going straight to memory no matter what (L3 victim cache, and the Shared state does not forward). Bulldozer's MOESI seems to have shifted that preference, although some of its changes may interact with activating the directory. In the old scheme, the access happens anyway for a lot of misses, meaning no buffering is going to save the controller's time.
If Zen is more like Bulldozer, this might be less frequent. The question becomes how quickly the probe can be resolved before the entry comes up in the memory controller's queue. It would be relatively cheap to buffer the access and overlap the delay if the probe takes a little longer. It wastes bandwidth and power if the access is cancelled, but it shaves off potentially tens of nanoseconds in the single-chip case, given the gulf in memory latency versus the inter-CCX ping latency.
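A toy sketch of the classical read-miss behavior described above (simplified on purpose: only a dirty owner in M or O forwards; E-state behavior and the actual probe machinery are ignored):

```python
# Minimal model of a classical MOESI read miss: a line held by peers
# only in Shared (or absent) is supplied by memory, so the DRAM access
# was going to happen anyway and buffering it saves no controller time.
def read_miss_source(peer_states):
    """peer_states: MOESI states of the line in the other caches."""
    if any(s in ("M", "O") for s in peer_states):
        return "cache"    # dirty owner forwards cache-to-cache
    return "memory"       # S/I peers do not forward: memory responds

print(read_miss_source(["S", "S", "I"]))   # memory
print(read_miss_source(["O", "I", "I"]))   # cache
```

A scheme that prefers cache-to-cache forwarding (as Bulldozer's seems to) is exactly where cancelling the speculative memory access starts paying off.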
 
Hoping it (Ryzen) and Vega come to the iMac soon, not to mention that new 8K LG 32" panel ;)

That would be all I need for the work I do in the foreseeable future.
I gave up on Apple's action-lacking commitment and finally built a Ryzen rig. Still using a Mac for everyday stuff though.
:runaway:
 
Here's a 2+2 vs 4+0 test. There's really no difference. And Far Cry has a kind of FPS limiter on Ryzen :?:

I was thinking... if the inter-CCX latency is not a problem (at least as shown in those tests), how is it that RAM speed improves performance so much?

It appears that this test was done with 3200 MHz RAM, based on a screenshot in the video, so that tarnishes the results a bit. This would've been a perfect opportunity to test several different RAM speeds to tease out what portion of the performance improvement is due to faster RAM and what portion is due to faster CCX interconnect speeds.

Also, I think I saw some mentions of "minimum fps" in the brief part that I watched, so I'm about to hulk out over some flawed benchmark metrics.
 

Yeah, that's true. The faster RAM "covers" the CCX latency, minimizing the differences; I didn't think about it.

I think people don't want to show too much, to keep things for their R5 reviews, but the 1500 being 5% slower (in some games) with 3200 RAM than the 7600K is pretty good for 170 (or 150, if the OC goes as high).

I think AMD really missed a golden shot with the Ryzen launch. They needed to focus on one board vendor and one RAM vendor (G.Skill, not Corsair, since it's getting faster speeds) and work to get 3600 RAM before the launch; that could have changed a lot in the already-good early reviews.
 
Optimizing for AMD Ryzen CPU

Not sure if this was linked already, but the GDC presentation by AMD on Ryzen is up now on GPUOpen.
Nice.
Pg 10 has official cache latencies: L1 4 clocks, L2 17 clocks, L3 40 clocks.
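For a feel of what those figures mean in wall-clock terms, converting at an assumed 3.4 GHz core clock (the clock speed is my assumption, not from the slides):

```python
# Convert the quoted load-to-use latencies from core clocks to ns.
freq_ghz = 3.4                      # assumed core clock, for illustration
for level, clocks in [("L1", 4), ("L2", 17), ("L3", 40)]:
    print(f"{level}: {clocks} clocks ~= {clocks / freq_ghz:.1f} ns")
```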

Pg 13 says 24x PCIe lanes; the link to the IO hub is 32 bytes/clock @ 600 MHz = 19.2 GB/s, which is somewhat less than 24x PCIe's 23.6 GB/s, so a potential bottleneck. Could that be why they don't expose 32x lanes (31.5 GB/s)?
Or would that link run at memclock, 1.2 GHz (38.4 GB/s)?
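A quick check of those numbers (assuming PCIe 3.0's ~984.6 MB/s per lane after 128b/130b encoding):

```python
# Link to the IO hub: 32 bytes per clock, at two candidate clocks.
link_600 = 32 * 600e6 / 1e9         # @ 600 MHz -> GB/s
link_1200 = 32 * 1.2e9 / 1e9        # @ 1.2 GHz memclock -> GB/s
pcie_lane = 0.9846                  # ~GB/s per PCIe 3.0 lane (approx.)
print(link_600, link_1200)          # 19.2 vs 38.4
print(24 * pcie_lane, 32 * pcie_lane)   # ~23.6 vs ~31.5
```

At 600 MHz the link would indeed be under even 24 lanes' worth; at memclock it comfortably covers 32.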
 
This presentation has lots of good general advice, but I fail to see many Ryzen-specific tips & tricks. I would have expected at least some info and test results from optimizing code for Ryzen's unique split L3 cache configuration, and some results about AMD's SMT implementation: when SMT is a good idea and what the common bottlenecks are in real code (especially games).

This presentation mentions that Ryzen's manual cache prefetch instructions are slow. Jaguar's manual cache prefetch is also slow, so games aren't using it anymore. Last-gen consoles required manual cache prefetch, and that was painful. I would be surprised if modern games and applications used manual cache prefetch instructions at all. Automated prefetch is so good on modern Intel, ARM and AMD CPUs. Ryzen is thus well equipped to run modern software in this regard.

SoA (/AoSoA) has been the preferred performance-oriented data layout since SSE got popular (10+ years ago). It is a good data layout for CPUs and GPUs, and Intel has plenty of presentations about it. I don't see anything Ryzen-specific here. Of course, Ryzen having only a dual-channel memory controller on a high-throughput 8-core chip means that you are more often memory bandwidth bound, so SoA is naturally more important, as it saves bandwidth (better cache line utilization). Data-oriented design (DOD) has been a big thing in games for two generations already. It is no surprise that a high-throughput 8-core (16-thread) CPU likes DOD. But so did Cell and Xenon, and so does Jaguar.
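The cache-line-utilization argument can be sketched with a toy model (64-byte lines; the 32-byte record size is an arbitrary example, not from the presentation):

```python
# Count distinct 64-byte cache lines touched when streaming one
# 4-byte field out of n records, AoS vs SoA.
LINE = 64

def lines_touched(n, stride):
    return len({(i * stride) // LINE for i in range(n)})

n = 1024
aos = lines_touched(n, stride=32)   # field embedded in a 32-byte struct
soa = lines_touched(n, stride=4)    # field stored contiguously
print(aos, soa)                     # 512 vs 64: AoS pulls 8x the lines
```

Same useful bytes either way, but the AoS pass drags 8x the data through the dual-channel controller.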

None of the advice in this article explains why the gaming performance of Ryzen is generally lower than its productivity performance (relative to Intel). I would be highly interested in knowing the actual bottlenecks games hit on Ryzen and how to avoid them. An article like that would have much higher value than generic optimization advice that fits all modern CPUs.
 

I would like to have seen some more details on the stack engine. PUSH/POP not only store or load values from memory, but also update the stack pointer.

Without special machinery, PUSH and POP would be limited to one per cycle. Traditionally, a stack engine removes the false dependency on the stack pointer, allowing two or more stack ops per cycle. That is, multiple addresses can be generated per cycle, but the actual PUSH (store) and POP (load) still go through the LS units.

AFAICT, Ryzen takes this further and renames actual stack entries. Not only are the address calculations elided, but the actual loads and stores are as well, freeing the load/store units up for other work. As long as you're not limited by dispatch rate (4-6 ops/cycle), loads and stores to the stack are free, making leaf functions cheaper and moving the optimum for inlining leaf functions.

Cheers
 
Just noticed something about the 12-core Zen spotted in Sandra: it's actually a castrated Naples, that is, 4 Zeppelins in a 3/0 configuration. The giveaway was the 8x8MB L3 cache.
In a 3/0 configuration, shouldn't it be 4x8MB L3? 1 enabled CCX per die = 1 pool of L3 per die.

SiSoft Sandra (leaks) has been inaccurate in its reporting of Naples L3 in the past, so I wouldn't rely on it.
 
Hmm, you're probably right
 
Didn't Ashes specifically have a rather large deficit on Ryzen compared to other titles?
 
Yes, it did. I think it's important to note the amount of work needed: 400 hours, which is not very much.


I'd really like to know how it's possible that R5s are on sale weeks before the official launch....
 
https://www.pcper.com/reviews/Processors/Ashes-Singularity-Gets-Ryzen-Performance-Update

The Ashes of the Singularity update arrives: up to 31% (high preset) and 20% (extreme preset) improved performance, and zero impact on Intel. So it seems they have been specifically solving Ryzen CPU bottlenecks rather than doing generic memory layout optimization, etc. I would be very interested in knowing what they did.
Unleashing Ryzen in Ashes of the Singularity™

https://community.amd.com/community/gaming/blog/2017/03/30/amd-ryzen-community-update-2

https://twitter.com/AMDRyzen/status/847434538380808196
 