Intel i9-7900X CPUs

Games seem to prefer a large shared inclusive L3 cache over a smaller and slower L3 victim cache. Server software mostly uses data-independent threads (or combines data infrequently), while games are frequently (every millisecond) moving data between cores or accessing world state from multiple threads at once (= a big chunk of mostly immutable data that can be easily shared). A big shared L3 cache is great for this purpose. We can see similar performance issues with Ryzen: it also has a smaller and slower L3 victim cache (and it's also split between clusters).

Skylake-X gaming performance gives us more information on why Ryzen's gaming performance lags behind the i7 6900K and i7 7700K. Ryzen also has a 2x larger L2 cache than these Intel consumer chips, but that apparently isn't a big deal for games, since Skylake-X has a 4x larger L2 cache and that doesn't help either. Games seem to really love a big, fast, fully shared inclusive L3. Unfortunately you can't really scale up the core count and keep caches like this around. AMD vs Intel gaming performance (8+ core chips) is now much more comparable than it was with last-gen Intel chips. It looks like Zen 2 doesn't need huge changes after all to compete against modern Intel HEDT in games. But unfortunately this means that Intel's quad-core chips will remain the best chips for gaming. Skylake-X can't beat them, and Ryzen and Threadripper can't beat them. Hopefully next-gen consoles will have Ryzen in them (with 16+ threads), forcing game developers to design their systems in a way that scales properly to these 8+ core PC CPUs.
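To illustrate the access pattern I mean, here's a minimal sketch (all names made up) of a game-style job: a large, read-mostly world snapshot touched by several worker threads every frame. Since every core reads the same data, one hot copy in a big shared L3 serves all of them:

Code:
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Big chunk of mostly immutable data, shared read-only by all workers.
struct WorldSnapshot {
    std::vector<float> positions;
    std::vector<int>   cellIndex;
};

// Each worker scans its slice of the shared snapshot (placeholder test).
void cullWorker(const WorldSnapshot& world, std::size_t begin, std::size_t end,
                std::atomic<int>& visibleCount) {
    int local = 0;
    for (std::size_t i = begin; i < end; ++i)
        if (world.positions[i] > 0.0f)
            ++local;
    visibleCount += local;   // results combined every frame (~once per ms)
}

// One frame: fan the same immutable snapshot out to N worker threads.
void runFrame(const WorldSnapshot& world, unsigned workers) {
    std::atomic<int> visibleCount{0};
    std::vector<std::thread> pool;
    std::size_t chunk = world.positions.size() / workers;
    for (unsigned w = 0; w < workers; ++w) {
        std::size_t begin = w * chunk;
        std::size_t end   = (w + 1 == workers) ? world.positions.size()
                                               : begin + chunk;
        pool.emplace_back(cullWorker, std::cref(world), begin, end,
                          std::ref(visibleCount));
    }
    for (auto& t : pool) t.join();
}

A server-style workload would instead give each thread its own mostly private data set, which is exactly the case where private L2s and a victim L3 hurt less.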

For some reason Ryzen's L3 cache has a significant bandwidth advantage in the AIDA64 cache benchmark:

7700K
[AIDA64 screenshot: intel-core-i7-7700k_am0xqx.jpg]


Ryzen 1700
[AIDA64 screenshot: aida647zxkz.png]


But Ryzen's latency is slightly higher for both L2 and L3.

7800X
[AIDA64 screenshot: 287ca87a_aida7800x480z8b74.png]


It has the highest L3 latency of the three and similar L2 latency to Ryzen, though its L1 performance is the best of the three.
 
AIDA64 very recently changed how the L3 cache is tested. L3 performance for the i9-7900X went down from 900-595-855 in version 5.90.4200 to 125-110-120 in 5.90.4300. Latency, though, remained the same at 20.3 ns. I have not tested with intermediate versions.
 
7800X:
[AIDA64 screenshot: 287ca87a_aida7800x480z8b74.png]


It has the highest L3 latency of the three and similar L2 latency to Ryzen, though its L1 performance is the best of the three.
Wow, that L3 bandwidth is so low: roughly 5x lower than Ryzen's and almost as slow as main memory. But I'd still guess that the L3 size is the biggest difference in games, as both Ryzen and Skylake-X have significantly smaller shared L3 capacity than previous Intel designs (which perform better in games).
 
I also see the 7800X behind in our games testing, but I would not try to scapegoat the L3 alone.
But even when both are overclocked to within only 200 MHz of each other, the 7800X trails the 7700K by a very wide margin (20%+) in several games; doesn't that point to the L3 being the main cause?
 
For an i9-7900X, I get (roughly, since variance is high):
125-110-120 GB/s
For an i7-7820X:
100-95-100 GB/s
For an i7-7800X (not yet run with AIDA64; edit: now tested):
75-73-75 GB/s

From this scaling, one could conclude that each core and its L3 portion can send roughly 12.5 GB/s through the mesh.
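As a rough sanity check on that figure: 10 cores x 12.5 GB/s ≈ 125 GB/s (7900X), 8 x 12.5 ≈ 100 GB/s (7820X) and 6 x 12.5 ≈ 75 GB/s (7800X), which matches the three readings above.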
 
But even when both are overclocked to within only 200 MHz of each other, the 7800X trails the 7700K by a very wide margin (20%+) in several games; doesn't that point to the L3 being the main cause?
The 7900X also trails the 7700K in gaming benchmarks, but not by as much as the 7800X does. The 7900X has a total of 13.75 MB of L3, while the 7800X has only 8.25 MB. It would be nice if somebody could benchmark the 7700K vs 7800X vs 7900X with normalized clock rates (for example 3.3 GHz) and turbo disabled; that would give more insight. 7820X results would also give more insight, because it has identical single/dual-core turbo clocks to the 7900X (but less L3 cache).
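For reference, those totals follow from the per-core slice sizes: Skylake-X carves out 1.375 MB of L3 per core (10 x 1.375 = 13.75 MB for the 7900X, 8 x 1.375 = 11 MB for the 7820X, 6 x 1.375 = 8.25 MB for the 7800X), while the 7700K has 2 MB per core (4 x 2 = 8 MB), all of it shared between only four cores.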
 
Games seem to prefer a large shared inclusive L3 cache over a smaller and slower L3 victim cache. Server software mostly uses data-independent threads (or combines data infrequently), while games are frequently (every millisecond) moving data between cores or accessing world state from multiple threads at once (= a big chunk of mostly immutable data that can be easily shared). A big shared L3 cache is great for this purpose. We can see similar performance issues with Ryzen: it also has a smaller and slower L3 victim cache (and it's also split between clusters).

Skylake-X gaming performance gives us more information on why Ryzen's gaming performance lags behind the i7 6900K and i7 7700K. Ryzen also has a 2x larger L2 cache than these Intel consumer chips, but that apparently isn't a big deal for games, since Skylake-X has a 4x larger L2 cache and that doesn't help either. Games seem to really love a big, fast, fully shared inclusive L3. Unfortunately you can't really scale up the core count and keep caches like this around. AMD vs Intel gaming performance (8+ core chips) is now much more comparable than it was with last-gen Intel chips. It looks like Zen 2 doesn't need huge changes after all to compete against modern Intel HEDT in games. But unfortunately this means that Intel's quad-core chips will remain the best chips for gaming. Skylake-X can't beat them, and Ryzen and Threadripper can't beat them. Hopefully next-gen consoles will have Ryzen in them (with 16+ threads), forcing game developers to design their systems in a way that scales properly to these 8+ core PC CPUs.

If I'm not mistaken, Coffee Lake will retain a big shared L3, in spite of its 6 cores. That should be a great compromise for games.
 
If I'm not mistaken, Coffee Lake will retain a big shared L3, in spite of its 6 cores. That should be a great compromise for games.
The Coffee Lake die only goes up to 6 cores, so a shared L3 is likely. Intel is also claiming 15% performance gains over Kaby Lake, but we don't know whether this is IPC + clock gains (= single-thread improvement), TDP-based (= mostly affects mobile at the same TDP target), or also takes increased core counts into account. However it turns out, this is going to be the best gaming CPU. Ryzen and 8+ core Skylake-X are both slower in games than the 7700K, and this is going to add two cores plus potential IPC and clock improvements over the 7700K.

It seems that all future chips larger than the consumer die (4 cores now, 6 cores soon) are going to feature a slower L3 than the past HEDT cores, bringing down gaming performance. It would be interesting to know why AMD's L3 is so slow, because it only serves four cores. Do the victim-cache design and cross-cluster traffic add significant latency (even when not needed)? If Zen 2 can improve this, we could see AMD IPC increasing beyond Intel's HEDT designs. AMD said that there's lots of low-hanging fruit left regarding Zen optimizations.
 
On the RWT forum I found a link to an updated Intel optimization manual including info about the Skylake-X architecture.
- Apparently the loop stream detector has been disabled (which should contribute to higher heat production)
- More detail about the additional or missing FMA (the port 5 FMA has 6 cycles latency vs 4 cycles for port 0)
- In the AVX-512 section, some code showing how to detect whether a chip has one or two FMA units (apparently there is no easy way to detect this; see the sketch after this list)
- Plenty more information ...
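I haven't reproduced the manual's code, but the general idea behind such a check is presumably a timing test: issue a stream of independent 512-bit FMAs and see whether the core sustains roughly one or two per cycle. A minimal sketch along those lines (entirely my own, assuming AVX-512F support, a known core clock, and something like -O2 -mavx512f):

Code:
#include <immintrin.h>
#include <chrono>
#include <cstdio>

// Time a throughput-bound stream of independent 512-bit FMAs. With two FMA
// units (ports 0 and 5) a Skylake-X core can sustain ~2 FMAs per cycle; with
// a single unit it sustains ~1, so the measured rate roughly doubles.
int main() {
    const long iters = 100000000;
    __m512d a = _mm512_set1_pd(1.0000001);
    __m512d b = _mm512_set1_pd(0.9999999);
    // Independent accumulators hide the 4-6 cycle FMA latency, so the loop
    // is limited by issue throughput rather than the dependency chain.
    __m512d acc0 = a, acc1 = b, acc2 = a, acc3 = b;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) {
        acc0 = _mm512_fmadd_pd(a, b, acc0);
        acc1 = _mm512_fmadd_pd(a, b, acc1);
        acc2 = _mm512_fmadd_pd(a, b, acc2);
        acc3 = _mm512_fmadd_pd(a, b, acc3);
    }
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    // Store the accumulators so the compiler cannot delete the loop.
    double out[8];
    _mm512_storeu_pd(out, _mm512_add_pd(_mm512_add_pd(acc0, acc1),
                                        _mm512_add_pd(acc2, acc3)));
    std::printf("%.2f FMAs per ns (sink=%g)\n", 4.0 * iters / ns, out[0]);
    return 0;
}

At a 4 GHz AVX-512 clock that would come out to roughly 8 FMAs/ns with two units versus roughly 4 with one; AVX-512 frequency offsets blur the absolute numbers, which is presumably part of why there's no clean architectural way to query it.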
 
Do the victim-cache design and cross-cluster traffic add significant latency (even when not needed)?
Yeah, it has always puzzled me why a victim cache would be slower than an inclusive one. I mean, in my understanding, it combines the best of both worlds: it stores evicted data (for later access) and maximizes cache capacity, so what gives?
 
Yeah, it has always puzzled me why a victim cache would be slower than an inclusive one. I mean, in my understanding, it combines the best of both worlds: it stores evicted data (for later access) and maximizes cache capacity, so what gives?
Exactly. That was also my thinking, but IIRC cache coherency protocols become harder. I don't know the hardware details, but this could add latency. Also, victim caches aren't used for data prefetching... but I haven't understood why this is a bad thing, since prefetched data is most often used right away on the core that caused the prefetch -> the private L2 would be the best place to store it.

Could somebody with better HW understanding explain why victim caches tend to be slower and smaller?
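As a back-of-the-envelope illustration of the capacity argument: a Ryzen CCX with its (mostly) exclusive L3 can hold roughly 8 MB of L3 plus 4 x 512 KB of L2 = ~10 MB of unique data for four cores, while a 7700K's inclusive 8 MB L3 already contains copies of the 4 x 256 KB L2 contents, so its unique capacity stays at ~8 MB. The capacity win is real; the question is what it costs in latency and coherence handling.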
 
Did you manually link to his profile? If you want to generate an alert pointing him to the post, you simply need to put @ in front of the username. @3dilettante
I doubt it; it's an issue with this forum. Whenever you copy and paste something here, it often retains the formatting (even though personally, in 99% of cases, I don't want it to; perhaps I'm in the minority, though I suspect DavidGraham had the same idea). There is a Remove Formatting button, but it often doesn't work, e.g. I tried it on the blue text.
 
@zed
It always works perfectly for me after doing a select-all and then hitting the Remove Formatting button. That's all I ever do to fix up all the messed-up posts.

Also, to fix up your copy-pastes, the first thing is to paste the text into Notepad, then select and copy the text from Notepad. Yes, doing an intermediate jump to another program before posting into forums is silly, but it's required for certain website sources and browsers.
 
This is kind of far-ranging, so I hope I can keep my thoughts organized. I will try to be clear when I get out of my depth past higher-level concepts and start speculating about more low-level details.

It would be interesting to know why AMD's L3 is so slow, because it only serves four cores. Do the victim-cache design and cross-cluster traffic add significant latency (even when not needed)?

One correlation of a sort for why AMD's L3 is slow for the number of clients it has is that AMD has seemingly always been conservative with its L3s, or perhaps more broadly not been as capable with caches as Intel has been.
Somewhat separate is the choice to not have a shared last-level cache, which may not entirely overlap with the decision to make the L3 a (mostly) exclusive victim cache.

The shadow tag macros that allow the L3 complex to handle intra-CCX snoops are a serializing step, which can take more than one iteration based on the rather simple 4-way banking and a two-stage early tag comparison. The L3 check appears to be held until after that point.
Those add some amount of latency, and the clock and voltage behavior of the L3 and CPU domains is not as simple. The L3 matches the fastest core, and the individual cores and clients may be running at different per-core voltages and a subset of clock offsets, which would be crossing multiple domains and add synchronization latency.
The L3 and L2 arrays have their own voltage rail, which may be another domain crossing.
If Zen aggressively gates arrays off, there may be wakeup latency in it.
The L3 isn't fully exclusive, as there is some level of monitoring for sharing that might make it maintain lines, so it's not a straightforward exclusive cache.
AMD's cache is optimized for cache-to-cache transfers if there is data in other L2s, which I'm curious about. For certain situations, it may save some trips to the L3 if a line needs to be written out of one L2 and can move to another without arbitrating an L3 allocation.

By comparison, the non-mesh Intel L3 can make a determination of line status, sharing, and potentially return a response from one hit to one specific location. AMD's is less centralized, and it's not clear that these overheads can be made to go away. The CCX doesn't know until it wends its way through all these checks, and since the shadow tag macros are supposed to be up to date copies of the L2 tags, there must be some kind of synchronization window to make sure they aren't stale while evaluating them. The nearest directory or filter beyond that is in or across the data fabric.

Of note is that Intel's L3 did have something of a latency bump after the ring bus and L3 were taken out of sync from the cores, which like AMD meant extra time could be taken in cycles missed for synchronization purposes.

Intel's mesh setup modified the interconnect and separated the snoop functionality from the L3 lines, so the conversion to a victim cache isn't the only possible source of latency.

Exactly. That was also my thinking, but IIRC cache coherency protocols become harder. I don't know the hardware details, but this could add latency. Also, victim caches aren't used for data prefetching... but I haven't understood why this is a bad thing, since prefetched data is most often used right away on the core that caused the prefetch -> the private L2 would be the best place to store it.

Could somebody with better HW understanding explain why victim caches tend to be slower and smaller?

One part of why victim caches can be slower and smaller is that they aren't required to be big enough to contain the upper caches, nor fast enough to avoid holding them back. If an implementation could be big enough and fast enough, would a designer pick a victim cache?

There are two notable dimensions to the recent AMD and Intel L3s. One is how inclusive/exclusive they are, and the other is how broadly each level of cache is shared.
Intel's ring-bus and inclusive LLC effectively embed a snoop filter into the L3. A hit in the L3 answers whether a snoop needs to be generated, and to which cores it should be sent. Varying levels of inclusion in the upper caches allow them to avoid servicing snoops at speed, which can otherwise leave them unable to service tag checks or require duplicate tags so that snoops can be handled in parallel with normal operation.
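As a conceptual illustration only (not Intel's actual tag format), the "embedded snoop filter" amounts to each inclusive L3 line carrying per-core presence bits, so a single tag lookup answers both whether the line is on-chip and which private caches might need a snoop:

Code:
#include <bitset>
#include <cstdint>

constexpr int kCores = 10;

// Conceptual sketch: an inclusive L3 line with per-core "core valid" bits.
struct L3Line {
    std::uint64_t tag;
    bool valid;
    std::bitset<kCores> coreValid;   // which private L1/L2s may hold a copy
};

// One inclusive-L3 lookup answers both questions at once:
// miss in L3 -> the line is in no core's private cache, no snoops needed;
// hit in L3  -> snoop only the cores whose bit is set, not a broadcast.
std::bitset<kCores> coresToSnoop(const L3Line& line, std::uint64_t tag) {
    if (!line.valid || line.tag != tag)
        return {};                   // guaranteed absent everywhere above
    return line.coreValid;
}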

I think the reduction in clients that need to worry about coherence may have some benefits in reducing the complexity of the implementation. At least conceptually, having a more local check with data and line status in one spot instead of querying distant and non-synchronous clients with various time windows of inconsistent state makes it easier to avoid hidden corner cases.
Bulldozer's many clients were such that I think it might have played a role in why Agner Fog's testing showed that L1 bandwidth went down if more than one core was active on-chip, and may have contributed to the TLB bug for Phenom. The TLB bug was an issue where line eviction from the L2 to L3 of a page table entry in the same period as that entry being modified by a TLB update might allow a separate core to snoop the stale copy from the L3.
I don't know if a more inclusive LLC and hierarchy could have prevented it, but it seems like there could be ways the design could reduce the number of transitory and physically separated states that could be missed in testing.


As far as prefetching goes, a victim cache can constrain how aggressive hardware prefetchers can be. Intel's L2 hardware prefetcher for the inclusive hierarchies would always move data into the L3, and usually into the L2. The "usual" modifier is a key point in that the cache pipeline may opt to quash prefetches if heavy demand fetches need line fill buffers.
The data prefetched may find use soon, but a burst of memory traffic could halt a prefetcher if all it has is the L2. The distributed L3 and the separate slice controllers can absorb more prefetched lines without thrashing, and could support more outstanding prefetcher transactions than the more in-demand local L2.
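There's a loose software-side analogue of "where prefetched lines land": the x86 prefetch hints let code ask for a line to be pulled only as far as the outer cache levels instead of all the way in, e.g. (a trivial sketch; the array and stride are made up):

Code:
#include <xmmintrin.h>   // _mm_prefetch

float scan(const float* data, long n) {
    float sum = 0.0f;
    for (long i = 0; i < n; ++i) {
        // Made-up prefetch distance of 64 elements (one line is 16 floats).
        // _MM_HINT_T0 pulls toward L1, _MM_HINT_T1 stops short of L1, and
        // _MM_HINT_T2 roughly targets the outermost level, so speculative
        // lines don't displace the hot working set in the small L1/L2.
        // (Prefetching past the end of the array is harmless; it is a hint.)
        _mm_prefetch(reinterpret_cast<const char*>(data + i + 64), _MM_HINT_T2);
        sum += data[i];   // stand-in for real work on the current element
    }
    return sum;
}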

As a third way, the most recent IBM Power L3s are not exclusive, but are not fully shared as an LLC. The L3's larger size and resources per partition means it more heavily participates in prefetching. By having local partitions, the local 5-10 or so MB per core has low latency for an L3. There is some level of managed copying of lines to other partitions to get some of the benefits of a shared last-level cache. It doesn't come cheap across multiple dimensions, like complexity, actual cost, and power.


Yeah, it has always puzzled me why a victim cache would be slower than an inclusive one. I mean, in my understanding, it combines the best of both worlds: it stores evicted data (for later access) and maximizes cache capacity, so what gives?
Exclusivity allows more cache to be available at a given level, but some of the downsides are that it takes extra work to maintain. Strict exclusivity can cause a higher level of cache to load one line from the lower cache, causing the lower cache to invalidate its line. Inserting the new line in the higher cache would likely evict a line from there, which must now move to the lower cache since it is a victim line.
There is more communication and data movement needed in both directions, and while it would be nice to assume the lines could trade places, I'm not sure if the capacity is automatically available because that end of the transaction may be tens of nanoseconds out and the core needs to keep moving. These operations are occurring over an internal pipeline and may be arbitrating with other traffic and sharing cores before the operation is fully completed. It's possible further eviction activity could propagate down if the victim cache keeps to its own LRU policy and shuffles yet another dirty line out. That makes the cost of otherwise unremarkable movement of cache lines through the hierarchy less predictable and expensive in terms of implementation complexity and data movement.

For a fully inclusive cache, it's already known there's room allocated, since it's inclusive. Also, clean lines in the higher levels can be silently and quickly invalidated without significant bandwidth consumption or synchronization with other parts of the chip. It's generally predictable and skips work considered obligatory for the victim cache. There is one form of complication in the case of an inclusive LLC, if sufficient thrashing causes it to drop a line that is also higher in the hierarchy. Back-invalidation of shared lines is something that a fully exclusive hierarchy would by default not have to worry about, although if implemented like Intel's inclusive cache the in-use bits would indicate what cores need to have invalidations sent to them.
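To make the bookkeeping difference concrete, here's a toy sketch (entirely my own simplification: no coherence states, no dirty bits, no timing) that just counts how many line movements one "L2 miss, L3 hit" fill triggers under each policy:

Code:
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>

// Toy model: a "cache" is just a fixed-capacity FIFO of line addresses.
struct ToyCache {
    std::size_t capacity;
    std::deque<std::uint64_t> lines;

    void erase(std::uint64_t a) {
        lines.erase(std::remove(lines.begin(), lines.end(), a), lines.end());
    }
    // Insert a line; return true and set *victim if one had to be pushed out.
    bool insert(std::uint64_t a, std::uint64_t* victim) {
        erase(a);
        lines.push_front(a);
        if (lines.size() <= capacity) return false;
        *victim = lines.back();
        lines.pop_back();
        return true;
    }
};

// Exclusive (victim) L3: the line leaves the L3, and the L2's victim must be
// accepted by the L3, possibly pushing yet another line further down.
int fillExclusive(ToyCache& l2, ToyCache& l3, std::uint64_t addr) {
    int moves = 1;                        // L3 -> L2
    l3.erase(addr);
    std::uint64_t v1 = 0, v2 = 0;
    if (l2.insert(addr, &v1)) {
        ++moves;                          // L2 victim -> L3
        if (l3.insert(v1, &v2)) ++moves;  // L3 victim -> memory
    }
    return moves;
}

// Inclusive L3: the L3 keeps its copy; a clean L2 victim is dropped silently
// because it is guaranteed to still be present in the L3.
int fillInclusive(ToyCache& l2, ToyCache& l3, std::uint64_t addr) {
    (void)l3;
    std::uint64_t v1 = 0;
    l2.insert(addr, &v1);                 // victim dropped (assumed clean)
    return 1;                             // just the L3 -> L2 fill
}

Even in this toy form, the exclusive path costs up to three movements per fill versus one (plus the occasional dirty write-back) for the inclusive path.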

As process nodes advanced for single-core and low-count systems, the utility of having every line in the hierarchy be unique decreased. At the same time, the need for speed and power efficiency meant that the higher cache levels were constrained in size and timing budget for other reasons, making the bigger lower level of cache grow even more relative to them.
Higher core counts are what could raise the cost of the LLC to where it might be worth evaluating a victim cache.

Did you manually link to his profile? If you want to generate an alert pointing him to the post, you simply need to put @ in front of the username. @3dilettante

Was that supposed to create an alert for me in the usual spot, or somewhere else? I don't recall seeing anything pop up for me.
On the other hand, I did wind up here, somehow...
 
Do prefetches generally feed into the L3 first, with the L2 populated from there, or is it the general rule to prefetch into the L2 directly even when there's an inclusive LLC behind it?
 
Was that supposed to create an alert for me in the usual spot, or somewhere else? I don't recall seeing anything pop up for me.
On the other hand, I did wind up here, somehow...
In the Alerts menu (the flag on the top bar), it will state "[username] mentioned you in [thread title]"
 
The new i9-7900X tested against the Ryzen 7 1700 and the i7-7700. Doesn't look that good.

 