Intel Skylake Platform

So do you think that in the bottom region (page 3, top left core) the two yellowish SRAM banks are L2 and the bluish ones are L3? I had thought all the cells there would be L3, but this would imply more mingling of the banks. (It would make sense, since it would reduce the latency of the wires going back and forth between L2 and L3.)
[die shot: SWZcmA5.png]
 
Another possible consideration for the changed CPU arrangement is packing efficiency. The small dark blue rectangles at the bottom of the die, past the edges of the arrays, could be dead space.
Keeping to the original row of CPUs, given the height of the GPU and System Agent, would add length to the die and leave more silicon below it unused.

If the GPU and System Agent were instead adjusted to match the height of the CPU/cache section and eliminate the dead space, the die would be even more rectangular. That might have implications for how flexibly the GPU could scale if it stretched even further, and for the blank area at the top of the die past the end of the memory interface.

With the L3 arranged like this, the ring bus they could run through both the CPU and GPU could also be literally wider than in the prior layout.
 
Does anyone think there could be any truth to this?

http://wccftech.com/intel-inverse-hyper-threading-skylake/
I vote no. What you probably see is that SPECCPU 2006 470.lbm needs a lot of memory bandwidth and that this gets saturated quickly as you increase the number of threads. Given that L2 BW increased and that good DDR4 BW is larger than DDR3 BW, such a result is not that surprising.

That lack of scaling above one core for 470.lbm is mentioned, for instance, in Improving Cache Utilization Using Acumem VPE by Erik Hagersten, Mats Nilsson, and Magnus Vesterlund. You can read the article on Google Books.
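For anyone who wants to see the effect in isolation, here is a minimal sketch (not 470.lbm itself, just an illustrative OpenMP streaming triad of my own over arrays far larger than any cache) of how a bandwidth-bound kernel stops scaling once DRAM bandwidth is saturated:

```c
/* Illustrative streaming triad: throughput stops scaling with threads once
 * DRAM bandwidth is saturated. Build with e.g. gcc -O2 -fopenmp bw.c */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (32l * 1024 * 1024)   /* 32 M doubles per array, 256 MiB each */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    for (int t = 1; t <= omp_get_max_threads(); t *= 2) {
        omp_set_num_threads(t);
        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];          /* 2 loads + 1 store per element */
        double secs = omp_get_wtime() - t0;
        printf("%2d threads: %6.1f GB/s\n", t, 3.0 * N * sizeof(double) / secs / 1e9);
    }
    free(a); free(b); free(c);
    return 0;
}
```

On a bandwidth-limited part the GB/s figure flattens out after a couple of threads, which is the same picture being attributed to 470.lbm above.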
 
More Skylake:

link

Nothing was mentioned about inverse hyper-threading, so it's probably bogus. The eDRAM sounds much improved, and there are new instructions for better cache usage as well.
 
I vote no. What you probably see is that SPECCPU 2006 470.lbm needs a lot of memory bandwidth and that this gets saturated quickly as you increase the number of threads.
Or they pushed the single-thread turbo harder than last time, so that the downclocking as other threads/cores load up hurts more?
 
Or they pushed the single-thread turbo harder than last time, so that the downclocking as other threads/cores load up hurts more?
There is a slide that clearly indicates this.
Depending on how you look at it, you can see it as good or bad, but if you want to model sustained performance reasonably, you had better be aware of these issues when benchmarking, or your conclusions will be way off.
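If anyone wants to see how much the clocks move on their own box, here is a rough sketch (assuming a Linux system that exposes the cpufreq sysfs interface, and an OpenMP compiler; the 2-second busy period is arbitrary) that spins an increasing number of threads and samples the clock reported for cpu0:

```c
/* Sample cpu0's reported clock while 1..N threads spin, to see how far the
 * turbo drops as more cores load up. Build with e.g. gcc -O2 -fopenmp clk.c */
#include <omp.h>
#include <stdio.h>

static long read_cpu0_khz(void)
{
    long khz = 0;
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
    if (f) {
        if (fscanf(f, "%ld", &khz) != 1) khz = 0;
        fclose(f);
    }
    return khz;
}

int main(void)
{
    for (int t = 1; t <= omp_get_max_threads(); t++) {
        omp_set_num_threads(t);
        double end = omp_get_wtime() + 2.0;    /* 2 s of busy work per step */
        long khz = 0;
        #pragma omp parallel shared(end, khz)
        {
            volatile double x = 1.0;
            while (omp_get_wtime() < end)
                x = x * 1.0000001 + 1e-9;      /* keep this core busy */
            #pragma omp master
            khz = read_cpu0_khz();             /* sample while the load is still up */
        }
        printf("%2d threads busy: cpu0 reports %ld MHz\n", t, khz / 1000);
    }
    return 0;
}
```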
 
An interesting post from RWT about the eDRAM fill policy:

The fill policy is very different:
Gen8:
Some SoC products include embedded DRAM (EDRAM), bundled into the SoC’s chip packaging. For example, the Intel processor graphics gen7.5-based Intel Iris Pro 5200 and the Intel processor graphics gen8 based Intel Iris Pro 6200 products bundle a 128 megabyte EDRAM. The EDRAM operates in its own clock domain and can be clocked up to 1.6GHz. The EDRAM has separate buses for read and write, and each are capable of 32 byte/EDRAM-cycle. EDRAM supports many applications including low latency display surface refresh. For both CPU architecture and the compute architecture of Intel processor graphics gen8, EDRAM further supports the memory hierarchy by serving as a large “victim cache” behind LLC. Compute data first populates LLC. Cacheline victims that are evicted from LLC will spill into the EDRAM. If later reads/writes occur to cachelines stored in EDRAM, they are quickly reloaded into LLC, and read/writing then proceeds as usual.

Gen9:
Some SoC products may include 64-128 megabytes of embedded DRAM (EDRAM), bundled into the SoC’s chip packaging. For example, the Intel processor graphics gen8 based Intel Iris Pro 6200 products bundle a 128 megabyte EDRAM. The EDRAM operates in its own clock domain and can be clocked up to 1.6GHz. The EDRAM has separate buses for read and write, and each are capable of 32 byte/EDRAM-cycle. EDRAM supports many applications including low latency display surface refresh. For the compute architecture of Intel processor graphics gen9, EDRAM further supports the memory hierarchy by serving as a “memory-side” cache between LLC and DRAM. Like LLC, EDRAM caching is shared by both Intel processor graphics and by CPU cores. On an LLC or EDRAM cache miss, data from DRAM will be filled first into EDRAM. (An optional mode also allows bypass to LLC.) Conversely, as cachelines are evicted from LLC, they will be written back into EDRAM. If compute kernels wish to read or write cachelines currently stored in EDRAM, they are quickly re-loaded into LLC, and read/writing then proceeds as usual.
 
Won't that increase the latency for a cache miss? :neutral:

It says DRAM requests can be bypassed to the LLC, so I wouldn't expect that to be the case.

For the CPU cores it will almost certainly result in lower apparent latency of the LLC:

In Gen 8 the L4 was a victim cache: the GPU would write to the LLC, which would spill into the eDRAM L4. There was no way for the GPU to bypass the LLC without also bypassing the L4.

Filling something like a G-buffer would fill the LLC multiple times, spilling to L4.

In Gen 9 the GPU can bypass the LLC and write directly to L4 (or rather directly to DRAM, with the L4 intercepting the traffic). This means data in the LLC isn't flushed, which will result in higher performance *and* lower power consumption.

It also allows for more aggressive prefetching into the L4, since it has a much larger capacity.
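To make the difference concrete, here is a toy simulation (my own sketch, not Intel's actual protocol: direct-mapped tag arrays, made-up sizes and access patterns) that streams a large "G-buffer" through both fill policies and counts how much of a resident CPU working set gets pushed out of the LLC:

```c
/* Toy model: 8 MiB direct-mapped "LLC" plus 128 MiB "eDRAM", comparing a
 * Gen8-style victim-cache fill with a Gen9-style memory-side/bypass fill
 * while a "GPU" streams a 64 MiB G-buffer. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define LINE        64u
#define LLC_LINES   (8u * 1024 * 1024 / LINE)
#define EDRAM_LINES (128u * 1024 * 1024 / LINE)

static uint64_t llc[LLC_LINES];        /* stores line number + 1, 0 = empty */
static uint64_t edram[EDRAM_LINES];
static uint64_t cpu_lines_evicted;

static int is_cpu_line(uint64_t line) { return line * LINE < 256u * 1024 * 1024; }

/* Gen8-style path: every access allocates in the LLC; the victim spills into eDRAM. */
static void access_victim(uint64_t addr)
{
    uint64_t line = addr / LINE, slot = line % LLC_LINES;
    if (llc[slot] == line + 1) return;              /* LLC hit */
    if (llc[slot]) {                                /* evict the victim into eDRAM */
        uint64_t v = llc[slot] - 1;
        if (is_cpu_line(v)) cpu_lines_evicted++;
        edram[v % EDRAM_LINES] = v + 1;
    }
    llc[slot] = line + 1;
}

/* Gen9-style path: the stream bypasses the LLC and lands in the memory-side
 * eDRAM, so it never displaces lines held in the LLC. */
static void access_bypass(uint64_t addr)
{
    uint64_t line = addr / LINE;
    edram[line % EDRAM_LINES] = line + 1;
}

static void run(const char *name, void (*gpu_write)(uint64_t))
{
    memset(llc, 0, sizeof llc);
    memset(edram, 0, sizeof edram);
    cpu_lines_evicted = 0;

    /* CPU working set resident in the LLC (low addresses) */
    for (uint64_t a = 0; a < 4u * 1024 * 1024; a += LINE)
        access_victim(a);

    /* "GPU" streams a 64 MiB G-buffer at high addresses */
    for (uint64_t a = 1ull << 32; a < (1ull << 32) + 64u * 1024 * 1024; a += LINE)
        gpu_write(a);

    printf("%-22s CPU lines pushed out of the LLC: %llu\n",
           name, (unsigned long long)cpu_lines_evicted);
}

int main(void)
{
    run("Gen8 victim cache:", access_victim);
    run("Gen9 memory-side L4:", access_bypass);
    return 0;
}
```

With the victim-cache path every streamed line allocates in the LLC first, so the CPU working set gets wiped out; with the memory-side path the stream never touches the LLC at all, which is where the performance and power win comes from.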

Cheers
 
I wonder how much programmer control is given over LLC bypass policies as they mention "optional modes." This might be what some of the new cache instructions in Skylake are for.
 
There are SSE instructions to bypass some cache levels; I dunno whether they extended those or added some more.
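For reference, the existing mechanism would be the SSE2 non-temporal ("streaming") stores. A minimal sketch of using them to write a buffer without allocating it in the caches (this says nothing about whatever new cache instructions Skylake adds; alignment and size requirements are noted in the comments):

```c
/* Non-temporal copy with SSE2 streaming stores. Build with e.g. gcc -O2 -msse2 nt.c */
#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_loadu_si128, _mm_sfence */
#include <string.h>

/* Copy 'bytes' (a multiple of 16, dst 16-byte aligned) without polluting the
 * caches with the destination lines. */
static void copy_nontemporal(void *dst, const void *src, size_t bytes)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
    _mm_sfence();        /* order the streaming stores before later stores */
}

int main(void)
{
    enum { N = 1 << 20 };                        /* 1 MiB test buffers */
    char *src = _mm_malloc(N, 16), *dst = _mm_malloc(N, 16);
    memset(src, 0x5a, N);
    copy_nontemporal(dst, src, N);
    _mm_free(src);
    _mm_free(dst);
    return 0;
}
```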
 
Filling something like a G-buffer would fill the LLC multiple times, spilling to L4.

In Gen 9 the GPU can bypass the LLC and write directly to L4 (or rather directly to DRAM and the L4 intercepts the traffic).
I think I was looking at that backwards :oops:
Certainly makes sense to avoid thrashing the L3 with all the GPU data too. I was thinking on the assumption that I'd never use a CPU without a discrete GPU, so I wasn't considering that usage.
 
Some reports are coming out that the thinner PCB substrate used in Skylake LGA 1151 CPUs is bending under the pressure exerted by some coolers designed for older socket versions.

Not sure if there have been any outright processor failures as a result of this, but it's definitely somewhat alarming to see visible deformation in the substrate of an affected CPU. I can imagine it would be very problematic to move a deformed processor from one socket to another; that probably would result in failure: either burnt components or (hopefully) just a system that won't boot.
 
I know Intel has provided guidance in the past regarding maximum "socket pressure" for aftermarket cooling assemblies; I wager they've updated this detail for Skylake. For years, especially during the Pentium 4 era, aftermarket coolers have applied absurd clamping pressures, to the point of warping the motherboard.

I'm going to have a hard time "blaming" Intel for taking reasonable steps to reduce the z-height of the processor. Clamping force does play a part in heat transfer, but some of the solutions apply far more force than is reasonable.
 
For years, especially during the Pentium 4 era, aftermarket coolers have applied absurd clamping pressures, to the point of warping the motherboard.
No idea how other coolers do it, but the Noctuas I'm using in my PCs screw into a very sturdy rear plate, which creates a solid backing/reinforcement for the CPU socket. If a cooler only clips into the motherboard, that might well cause warping if excessive clamping pressure is applied...
 
I wonder if this could be improved by some kind of stiffener, or nibs along the margin of the package of socketed chips that could come to rest on standoffs on a modified socket. I suppose the forces would balance out best if these aligned with the bottom of the heatspreader, so that force could be transmitted vertically and symmetrically through the substrate.
Possibly columns interspersed among the lands?

Increasingly complex packages, and maybe a product like an AMD interposer-based system, might have more that could be affected.
The package for Xeon Phi is really oddly shaped, although I don't think there is an enthusiast cooler market for it.
Possibly the decline in socketed consumer products in general would make this moot.
 
Heck, I chipped the edge of an AMD Athlon die when mounting a cooler years ago. It still works, but it no longer has sharp edges, lol. BTW, some aftermarket coolers are very heavy, so the pressure may be fine while the system is stationary, but what about when it's being moved?
 