> they could help clear up the findings about L3 pressure problems

Where did you read this?
> That's what CoreInfo uses; there are already a few posts about that, and Ryzen seems to give wonky results, with each thread reporting a separate dedicated L1, L2 and L3.

In this case AMD needs to fix their drivers. I'd assume Windows also uses this API internally to guide scheduling. We could see performance improvements as a result even without software patches.
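For reference, a minimal sketch (my own, untested) of dumping cache topology through GetLogicalProcessorInformationEx, which is presumably the API CoreInfo and the scheduler rely on. If the firmware/driver misreports the L3, it shows up in the GroupMask of the level-3 entries:

```c
/* Minimal sketch (my own, untested): cache topology via
 * GetLogicalProcessorInformationEx.  On a sane Ryzen report, every core
 * of a CCX should share one L3 mask. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    /* First call only reports the required buffer size. */
    GetLogicalProcessorInformationEx(RelationCache, NULL, &len);

    char *buf = malloc(len);
    if (!buf || !GetLogicalProcessorInformationEx(
            RelationCache, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf, &len)) {
        fprintf(stderr, "query failed: %lu\n", GetLastError());
        return 1;
    }

    for (DWORD off = 0; off < len; ) {
        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info =
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(buf + off);
        if (info->Relationship == RelationCache) {
            /* GroupMask says which logical processors share this cache. */
            printf("L%u %s, %lu KB, shared by mask 0x%llx (group %u)\n",
                   (unsigned)info->Cache.Level,
                   info->Cache.Type == CacheUnified     ? "unified" :
                   info->Cache.Type == CacheData        ? "data"    :
                   info->Cache.Type == CacheInstruction ? "instruction" : "trace",
                   (unsigned long)(info->Cache.CacheSize / 1024),
                   (unsigned long long)info->Cache.GroupMask.Mask,
                   (unsigned)info->Cache.GroupMask.Group);
        }
        off += info->Size;
    }
    free(buf);
    return 0;
}
```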
That said, AMD has aimed low with just four cores per shared L3. On top of that, Ryzen seems to have a weak inter-CCX fabric, which boggles the mind given how cheap on-die bandwidth is. This makes it much more sensitive to pathological scheduling cases.
> This seems to be a given. Intel changed their fabric topology and properties multiple times when they started to scale beyond four cores. The current one is the result of many iterations. AMD is clearly behind in this area.

AMD's prior interconnect started with the first dual-core A64s and lasted in some form until Zen, at least in theory. Jim Keller's presence at AMD sort of book-ends the lifespan of a really creaky interconnect.
> It should be said that it is Intel's Xeons, which have a dual-ring L3 system, that offer COD (Cluster-on-Die), with up to 12 cores per cluster. They also offer much higher bandwidth on the LLC ring bus.

Keep in mind AMD should be a mesh, not a ring. The inter-CCX fabric is probably adequate, just not for low node-count designs. Smaller systems favor Intel's ring, larger ones AMD's mesh. How many links they support is an interesting question. At scale, I'm guessing they were going for SUMA, not NUMA.
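To make the small-vs-large argument concrete, here is a toy brute-force comparison (my own, idealized; not a model of either vendor's actual fabric) of average hop counts on a bidirectional ring versus a square mesh:

```c
/* Toy comparison: average hop count between node pairs on a bidirectional
 * ring vs. a square mesh.  They tie at small node counts; the mesh pulls
 * ahead as the node count grows. */
#include <stdio.h>
#include <stdlib.h>

static double ring_avg(int n)
{
    long total = 0, pairs = 0;
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++) {
            if (a == b) continue;
            int d = abs(a - b);
            total += d < n - d ? d : n - d;   /* shorter way around the ring */
            pairs++;
        }
    return (double)total / pairs;
}

static double mesh_avg(int side)              /* side x side mesh */
{
    int n = side * side;
    long total = 0, pairs = 0;
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++) {
            if (a == b) continue;
            total += abs(a / side - b / side) + abs(a % side - b % side);
            pairs++;
        }
    return (double)total / pairs;
}

int main(void)
{
    for (int side = 2; side <= 6; side++) {
        int n = side * side;
        printf("%2d nodes: ring avg %.2f hops, mesh avg %.2f hops\n",
               n, ring_avg(n), mesh_avg(side));
    }
    return 0;
}
```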
IMO, this will hurt AMD in workloads with varying numbers of active threads; the OS scheduler will have an impossible task. If so, I'd expect Ryzen 2 to have a much stronger inter-CCX fabric.
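Until the scheduler and drivers sort this out, the usual workaround is to keep a latency-sensitive workload inside one CCX by hand. A hedged sketch, assuming an 8C/16T part where logical processors 0-7 map to the first CCX (that mapping is an assumption; check it against the cache-topology output above):

```c
/* Hedged workaround sketch: pin the current thread to one CCX so the
 * scheduler can't bounce it across the inter-CCX fabric. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD_PTR ccx0_mask = 0x00FF;  /* logical CPUs 0-7 = first CCX (assumed) */
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), ccx0_mask);
    if (previous == 0) {
        fprintf(stderr, "SetThreadAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    printf("Thread restricted to one CCX; previous mask was 0x%llx\n",
           (unsigned long long)previous);
    /* ... run the latency-sensitive work here ... */
    return 0;
}
```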
Regarding L3 and the CCX: if the Ryzen 1600X and 1500X are three cores per CCX, they could help clear up the findings about L3 pressure problems. But if they are 4+2, the waters will be much muddier.
> In this case AMD needs to fix their drivers. I'd assume Windows also uses this API internally to guide scheduling.

So it is likely to have been an AMD booboo? People have been blaming MS, but it seemed like AMD to me.
Do you know what actually gives the response? (hardware, BIOS, drivers, some OS 'known hardware' file?)
My understanding is that both CCXs must have the same number of cores active.
Some sites have simulated the 1600X (and 4-core) performance using a BIOS feature.
http://www.phoronix.com/scan.php?page=article&item=amd-ryzen-cores&num=3
http://www.pcgameshardware.de/Ryzen-7-1800X-CPU-265804/Tests/Test-Review-1222033/#a5
> As I see it, they serve different purposes.

MESIF came about as part of Intel's creation of the ring bus and LLC, and those two features provide a fair amount of bandwidth amplification and absorb snoop traffic.
MOESI, at least as we know it, is older and doesn't derive as much read amplification from on-die storage.
With MESI, a cache holding a line in the Shared state can forward data on a read request, which looks like a scaling barrier once multicore chips became more common.
Without the Forward state, each cache with a line in state Shared responds with data on a read probe. With many caching agents the requesting node can be flooded with the same cacheline data from every cache that has the line in Shared state. With MESIF, only the cache that has the line in state F responds with data.
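A toy illustration of that rule (my own simplification, not Intel's implementation): on a read probe, at most one cache (the M/E/F holder) supplies the data, while plain Shared copies stay silent:

```c
/* Toy MESIF responder selection: only a cache holding the line in M, E
 * or F supplies data on a read probe; S copies stay quiet, so the
 * requester is never flooded with duplicate responses. */
#include <stdio.h>

typedef enum { I, S, E, M, F } mesif_t;

static const char *name(mesif_t s)
{
    static const char *n[] = { "I", "S", "E", "M", "F" };
    return n[s];
}

/* Returns the index of the single cache that responds with data,
 * or -1 if memory has to supply the line. */
static int pick_responder(const mesif_t state[], int ncaches)
{
    for (int i = 0; i < ncaches; i++)
        if (state[i] == M || state[i] == E || state[i] == F)
            return i;
    return -1;                      /* all I/S: memory answers */
}

int main(void)
{
    /* Four caches; the line is shared, but only cache 2 holds it in F. */
    mesif_t state[4] = { S, S, F, I };
    int r = pick_responder(state, 4);
    for (int i = 0; i < 4; i++)
        printf("cache %d: %s%s\n", i, name(state[i]),
               i == r ? "  <- forwards data" : "");
    if (r < 0)
        printf("no M/E/F copy: data comes from memory\n");
    return 0;
}
```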
> The Owned state allows a core to broadcast writes to a cache line to all cores which have the cache line in the Shared state. Without the O state, a write to a cache line invalidates all other copies of that cache line, which then have to be requested explicitly on subsequent reads from other cores.

For MOESI, a write to the cache line is going to invalidate the other copies. The Owned state allows Modified lines to be shared without writing back to memory, reducing writes. The Owned state tracks who is responsible for the final update; a write to a line that has copies elsewhere is going to invalidate any other shared copies, which is true for both Owned and Shared. When exactly a dirty line is forced to write back is unclear; some descriptions, such as http://www.realworldtech.com/common-system-interface/5/, indicate that any transition for an Owned line is going to write something back. This is more restrictive than the wiki description of MOESI, since an O-to-M transition would seemingly be possible without a writeback (it still needs invalidates). Perhaps the RWT description oversimplified the O-to-M case, unless AMD is very conservative with the O state. The wiki description also seems to involve some operations that may not be implemented or common (write broadcasts rather than invalidates, and a writeback that converts O to S without eviction).
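A matching toy sketch of the MOESI point above (again a simplification, not AMD's implementation): a store to a line held in O invalidates the remote copies rather than broadcasting the new data, and leaves the writer in M with no memory writeback at that moment:

```c
/* Toy MOESI store handling: a store to a line the local cache holds in
 * O or S invalidates every remote copy and leaves the writer in M.
 * Memory is not written here; the M/O holder stays responsible for the
 * eventual writeback. */
#include <stdio.h>

typedef enum { INV, SHD, EXC, MOD, OWN } moesi_t;

static const char *name(moesi_t s)
{
    static const char *n[] = { "I", "S", "E", "M", "O" };
    return n[s];
}

static void store(moesi_t state[], int ncaches, int writer)
{
    for (int i = 0; i < ncaches; i++)
        if (i != writer && state[i] != INV)
            state[i] = INV;          /* invalidate, don't push new data */
    state[writer] = MOD;             /* writer now has the only valid copy */
}

int main(void)
{
    /* Cache 0 owns a dirty line (O); caches 1 and 2 share clean copies. */
    moesi_t state[4] = { OWN, SHD, SHD, INV };
    store(state, 4, 0);
    for (int i = 0; i < 4; i++)
        printf("cache %d: %s\n", i, name(state[i]));
    return 0;
}
```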
> The CCX interface acts as the coherency agent for all the cores in the CCX, so a read probe from a core in a different CCX only gets one response, even if all four cores in the CCX have a copy in the S state.

Snoops may provide the cache with information that a line is shared elsewhere, which helps determine whether the line should be S or E, but those other S lines do not provide data.
> Is it non-trivial to double the inter-CCX link on a dual-CCX design? It seems like low-hanging fruit for the Zen 2 consumer version.

Trivial, but it may have been a design decision. The problem goes away as they add nodes, and a GPU/APU is likely to add nodes. Vega, with 32 PCIe 3.0 lanes that likely substitute for IF links, makes it half a Naples, with the 8-core Ryzen being the other half. Memory distribution is a bit unknown at this point. It could have been an oversight, or they didn't feel like fabricating two similar but slightly different parts. It would likely come at the cost of some PCIe lanes as well.
> Yes, you need to have an equal number of cores active in each CCX; this pic shows the possible configurations.

Active, but not necessarily scheduling anything. So long as they don't reserve cache, they could all be enabled, with three idling so as not to consume resources.
> Idk if this image was posted before:

32 bytes/clock everywhere.
> 32 bytes/clock everywhere.

What is the BW of QPI? I heard it was similar.
Confirms the inter-CCX link runs at RAM clock.
The arrows indicate 32 B/clock full duplex, but that is shared with other traffic, so it's certainly a bottleneck if there is a lot of inter-CCX traffic.
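Back-of-the-envelope numbers, assuming the diagram's 32 B/clock figure, a fabric clocked at MEMCLK (half the DDR4 transfer rate), and a 9.6 GT/s, 16-lane QPI link for the comparison asked about above:

```c
/* Rough bandwidth estimates under the assumptions stated above. */
#include <stdio.h>

int main(void)
{
    double ddr4_mts[] = { 2133, 2400, 2666, 3200 };   /* DDR4 transfer rates */
    for (int i = 0; i < 4; i++) {
        double memclk_mhz = ddr4_mts[i] / 2.0;        /* fabric clock = MEMCLK */
        double gbs = 32.0 * memclk_mhz * 1e6 / 1e9;   /* 32 B per fabric clock */
        printf("DDR4-%.0f: inter-CCX ~%.1f GB/s per direction\n",
               ddr4_mts[i], gbs);
    }
    /* QPI at 9.6 GT/s with a 2-byte-wide data path per direction. */
    printf("QPI 9.6 GT/s: ~%.1f GB/s per direction\n", 9.6 * 2.0);
    return 0;
}
```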