Intel i9 7900x CPUs

Do prefetches generally feed into the L3 first, with the L2 populated from there, or is it the general rule that you prefetch into the L2 directly even when you have an inclusive LLC behind it?

From a high-level view of the caching model, it would be undesirable for the L2 to have a line populated and accessible before the L3 and its coherence/snoop information are updated to a state consistent with the L2's status and core-use information.

The L2 prefetcher's fetches create an L2 miss, which I think a more straightforward implementation would then take to the L3 to see if the data is there.
If it isn't, the L3 slice could generate an L3 miss of its own, which would then send a request to memory, or broadcast a request if it is an SMP setup.
The listed behavior would readily fall out of this chain if the sequence is maintained and the rule is that the prefetcher's miss can be discarded while the L3 slice's miss cannot.

I'm not sure if that means a message or signal is sent out to actively cancel the L2's request, or if the L2 simply makes a note to ignore or reject it. Ignoring it might work, since Intel's level of inclusion is not total and the cores can silently evict lines without telling the L3. This would be like a preemptive eviction of a line. At worst, that leads to redundant snoops or invalidates that yield nothing.

That might not necessarily be what the hardware does. The events are separated by variable amounts of time, and it may be possible to shift events around or bypass stages as long as the arbitrating hardware properly isolates intermediate states or recovers from problems. For Intel's inclusive L3, the slices each have agents that manage that arbitration, so the L3 would seem to be the nearest place to update when the caching agent starts processing the transaction.
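As an aside on the original question of which level a prefetch lands in: the closest software-visible knob is the software prefetch hint level, which nominally names a target cache level. Here is a minimal sketch (function name, prefetch distance, and the loop itself are purely illustrative); how the hardware actually places the line, and what the hardware prefetchers themselves do, remains implementation-dependent.

/* Illustrative only: software prefetch hints are not the hardware prefetchers
 * discussed above, but they are where software names a nominal target level. */
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_* */
#include <stddef.h>

#define PF_DIST 16       /* illustrative: prefetch 16 elements ahead */

double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n) {
            /* _MM_HINT_T0: all cache levels; _MM_HINT_T1: L2 and up;
             * _MM_HINT_T2: L3 and up; _MM_HINT_NTA: non-temporal. */
            _mm_prefetch((const char *)&a[i + PF_DIST], _MM_HINT_T1);
        }
        sum += a[i];
    }
    return sum;
}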

In the Alerts menu (the flag on the top bar), it will state "[username] mentioned you in [thread title]"

I'm not sure if any of the mentions yesterday were meant to have gone through or if mentions don't archive, since my list doesn't seem to have any from this thread.
 
Be careful with the OC...

[attached image: DFOcZtJXsAImmwd.jpg]


That was 1.25v on a 7800X
 
There's something else wrong with that CPU/board then. 1.25v is the default VID for 4.5 GHz 2c-TBM3 on SKX.
 
Meanwhile, I also got a retail 7820X octa-core.
I can confirm it indeed has both AVX-512 FMA units enabled.

Here is an AVX-512 Julia/Mandelbrot real-time zoomer I made, brought up to date.
Computations are done in double precision.
Compared to a Titan XP, it runs twice as fast :)

Warning: when running all 8 cores at 4 GHz, CPU power goes up to 208 W!

Edit
- Replaced with a slightly less optimized version to reduce the heat: 10% less heat and speed.
(my cooler can't cope, with the CPU at ~100 degrees Celsius)
- Added a fallback to AVX2 if no AVX-512 is present (a sketch of such a check is below)
- Added a missing libmmd.dll (had to use an Intel compiler and didn't find a way to get rid of this DLL)
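A minimal sketch of the kind of runtime AVX2 fallback check mentioned in the edit above, assuming GCC/Clang's __builtin_cpu_supports; the actual program uses the Intel compiler, which has its own dispatch machinery, and the kernel names here are made up.

#include <stdio.h>

/* Hypothetical kernels standing in for the real AVX-512 and AVX2 code paths. */
static void render_avx512(void) { puts("using the AVX-512 kernel"); }
static void render_avx2(void)   { puts("using the AVX2 fallback kernel"); }

int main(void)
{
    /* "avx512f" is the foundation subset; a real dispatcher might also check
     * the other AVX-512 subsets its kernel needs. */
    if (__builtin_cpu_supports("avx512f"))
        render_avx512();
    else if (__builtin_cpu_supports("avx2"))
        render_avx2();
    else
        fprintf(stderr, "no AVX2/AVX-512 support detected\n");
    return 0;
}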
 

Very interesting! What happens if you keep the optimized code path, but downclock and undervolt the CPU a bit?
 
4 GHz under AVX-512 load with the power limit left wide open by the UEFI sounds much like the MSI X299 board. Other boards enforce the 140 W TDP, downclocking the 7900X to, for example, 3.1-3.2 GHz under AVX-512 loads.
 
Very interesting! What happens if you keep the optimized code path, but downclock and undervolt the CPU a bit?

I'll be adding the fully optimized version, so it can be tried on CPUs with safer settings.
The additional optimization is 4-way interleaving of computations. The fractal computation is one long dependency chain, and the FMAs have 4 or 6 cycles of latency. Interleaving and SMT mitigate the dependencies.
The less optimized version does only 2-way interleaving.
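A minimal scalar sketch of that 4-way interleaving idea (not the actual AVX-512 kernel; the function name and iteration cap are made up): each point's iteration is a serial dependency chain, so stepping four independent points per loop pass gives the FMA units unrelated work to overlap their 4-6 cycle latency.

#define MAX_ITER 1000

/* Iterate four independent Mandelbrot points in lockstep so their update
 * chains can overlap in the FMA pipelines. */
void mandel4(const double cr[4], const double ci[4], int iters[4])
{
    double zr[4] = {0}, zi[4] = {0};
    int done[4] = {0};

    for (int it = 0; it < MAX_ITER; it++) {
        for (int k = 0; k < 4; k++) {          /* four independent chains */
            if (done[k]) continue;
            double zr2 = zr[k] * zr[k];
            double zi2 = zi[k] * zi[k];
            if (zr2 + zi2 > 4.0) { iters[k] = it; done[k] = 1; continue; }
            zi[k] = 2.0 * zr[k] * zi[k] + ci[k];   /* z = z^2 + c; the */
            zr[k] = zr2 - zi2 + cr[k];             /* multiply-adds map to FMAs */
        }
        if (done[0] && done[1] && done[2] && done[3]) break;
    }
    for (int k = 0; k < 4; k++)
        if (!done[k]) iters[k] = MAX_ITER;
}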

I'd like to keep my CPU at 4 GHz for AVX-512; it is only a problem for this extreme kind of code.
 
4 GHz under AVX-512 load with the power limit left wide open by the UEFI sounds much like the MSI X299 board. Other boards enforce the 140 W TDP, downclocking the 7900X to, for example, 3.1-3.2 GHz under AVX-512 loads.

Indeed, the board is an MSI X299 Tomahawk. I'm running with Enhanced Turbo on, which means all cores normally run at 4.3 GHz. AVX-512 would not run at that frequency, so to fix that I set the 'AVX offset' to -3, which reduces the frequency to 4 GHz when running AVX/AVX-512.
 
I see. 4.0 GHz is still the all-core turbo and not what non-insane UEFIs use. No wonder you're having problems cooling that amount of heat with air. :)
Any chance you could make that more optimized torture version of your Mandelbrot/Julia renderer available again? And does it tax GPUs equally heavily? For now, your Waves3D is hammering GPUs the most, even though it largely depends on bandwidth.
 

I'll be adding the fully optimized version tonight.
The CPU and GPU code are very similar. On GPUs there is no explicit interleaving of computations, but I would think the inherent threading takes care of the FMA dependencies, so it's likely optimal on GPUs too.
 
Thanks! If I find the time, I'll test it against my current worst case tomorrow (but with a tamer UEFI that honors the 140 W TDP; I'm measuring achieved clock rates here instead) :)
 
http://www.anandtech.com/show/11687/coffee-lake-not-supported-by-intels-200series-motherboards

Forthcoming Coffee Lake (6-core / 12-thread non-HEDT consumer chips) needs new motherboards. This makes the HEDT 6/8-core parts and Ryzen much more appealing upgrade options for many consumers, since you can't simply plug the new Coffee Lake 6-core into your existing Skylake 6600K/6700K socket. Someone needs to update the Wikipedia page (https://en.wikipedia.org/wiki/LGA_1151).

I was also considering the highest-clocked 6-core Coffee Lake as a cost-effective upgrade path for our non-programmers (we all have Skylake 6700Ks now). I will get myself a Threadripper in any case, but now it seems Threadripper would be a pretty good upgrade path for all of us (that 12-core / 24-thread model at $799 is very aggressively priced).
 

What do you use the large number of threads for?
 
A UE4 code recompile takes 25 minutes on the 6700K. A shader recompile (console target) takes over an hour (UE4 has so many shader permutations). Data cooking is also slow on a quad-core (I have a fast SSD, obviously). There are many console platforms + PC + debug/release builds, so plenty of these operations are happening. A quad loses you 30+ minutes every day, and more than an hour on bad days.
 