Intel Gen9 Skylake

Turtle 1 · Aug 7, 2015

Ryan Smith said:
I beg your pardon? All of our platforms used memory at their normal, supported frequencies and timings. That's DDR3-1600 for Haswell, and DDR4-2133 for Skylake (with DDR3L-1866 for the DDR3L vs. DDR4 comparison). Those are the highest officially supported frequencies for both platforms.

I understand what you're trying to do with the highest supported memory. But these are K chips. Enthusiast. People buy these for O/C, and the highest supported memory has really nothing to do with anything. The boards support XMP, right? So they support higher memory speeds. Latency means a lot. My really old DDR1 BH1 timing was 2-2-2-2-2; it matters. I ran those timings while overclocking the memory, and as far as I know I was the only person to run that timing above 20 O/C. I also held the record at HWBot for a while; I had the gold. It's been 10 years: I believe 6,000 points on PC Mark 6, about Sandy stock speed. I'm sure if I went and looked I could find it. But 6,000 back then had everyone shook up. I believe I gave that setup to an AT member.

My thinking is: keep the CPU stock and run the memory as hard as it will go. That's what we all want to see. These are not the stock desktop. Who do you believe makes the better memory controller, Intel or AMD? If AMD runs the default higher, then match it with the Intel. These are performance chips; treat them as such. You got all the other chips to run at default. It's just that your results were the worst out of everyone's. If I were Intel you would never get another K-type chip. Me and my wife's brother-in-law talked about just that earlier. You tried to O/C the CPU without upping the DRAM speed. Keeping the base memory speed, one would think your O/Cing would be better. I will buy that chip from you if it's retail. I'm pretty sure 4.7 would be easy.

What a lot of people need to do is go back and reread Anand's SB review and the original IPC that was stated. This great SB chip was laughed at back then. Read the replies from that review. Everyone had it below 10% IPC; I believe AT had it the lowest. My chip was showing around a 16% IPC increase. Go to your forum; we had a big debate on it. Today, the way people talk, SB was the second coming. I had it right back then; everyone else missed big time.

Same with Conroe: I hit the IPC right on the numbers, and that too is in your forum backlog. Go read the forum topic. I was banned for saying it would cream AMD 64. That's after Intel had already shown us benchmarks 5 months earlier. I just figured Intel was telling the truth, and they were. Everyone else was saying Intel lied. Yet ban hammer. I pretty much stay away from hardware forums nowadays. This release got me active. Let's see if there is an Angel Canyon, as my wife's brother-in-law tells me.

Turtle 1 · Aug 7, 2015

Infinisearch said:
Are you sure about this? I thought the ddr3 was at 1600mhz and the ddr4 was at 2133mhz.
edit - the DDR3 is at 1866 cl9 and the DDR4 is at 2133 cl15.
edit 2 - IIRC DDR3 latency = 4.82ns DDR4 will equal 7.03ns
edit 3 - I was corrected below haswell @ 1600 which makes sense as it's the default mem clock for it.

I would like to see the math on that if you would. I'm a little confused here, as "the DDR3 is at 1866 cl9 and the DDR4 is at 2133 cl15" is exactly what others were saying, and that's how I read it. Did they make a mistake and correct it? Where did you get those numbers from? That's exactly as I had it. My memory is really good and my reading comprehension is better yet. So latency was lower with DDR3 using 1866. But if they used 1600, OK. I believe there's more to it than that. I believe the same topic was going on at AT today and they used a different formula. I stayed out of it as much as I could, as I'm banned there. I deserved it too. I told a mod in PM to go screw himself and I wished him harm, LOL. I was sick of my user name anyway. I still post there, as I have 4 land lines here: 1 in the house, 3 in the shop; 3 fiber optic, 1 DSL. I will go find their math and copy-paste the results here.

Turtle 1 · Aug 7, 2015

mczak said:
Latency typically means squat for the IGP (or at least way less important than bandwidth).
FWIW I've wondered about some GPU scores in the reviews - that is, not just individual benchmarks, but some reviews seemed to show not much of an improvement in general whereas others said it's like a 40% improvement in just about anything...
Also I think Intel only officially supporting up to DDR4-2133 is "slightly" on the low side and disappointing - that's literally the slowest ever built! Sure, the server platform doesn't support more either, but it would be quite expected that this platform lags behind on that aspect. Granted, for the CPU part it doesn't really matter, but the IGP could always benefit from it...

I just took a quick look at AT's setup; here is what I found:
Corsair DDR4-2133 2x8
G.Skill DDR4-2133 2x8
G.Skill DDR3-1866 4x4
*Memory Timings used were the supported frequencies of each architecture,
except DDR3L vs DDR4 testing, which used DDR3-1866 C9.
For Skylake's DDR3L requirement, this was a DDR3 kit running with an undervolt to 1.42V.
At 1.5V, the system failed to boot.

Infinisearch said:
Are you sure about this? I thought the ddr3 was at 1600mhz and the ddr4 was at 2133mhz.
edit - the DDR3 is at 1866 cl9 and the DDR4 is at 2133 cl15.
edit 2 - IIRC DDR3 latency = 4.82ns DDR4 will equal 7.03ns
edit 3 - I was corrected below haswell @ 1600 which makes sense as it's the default mem clock for it.

Originally Posted by Walter E Kurtz

Hardware Canucks uses DDR3 1866 Cas 11 and DDR4 2666 CAS 13(!). That is not even close. Hothardware and PcPer don't even post CAS latency under test system setup so god knows what they are testing. There are plenty of reviews where the memory difference between Haswell / Skylake comparison was reduced as much as possible, including here on anandtech showing the real IPC gain to be nowhere near the "usual" 10-15%
Let's see what Anandtech said about their RAM, since you allege that Anandtech reduced difference in RAM "as much as possible".
http://anandtech.com/show/9483/intel...h-generation/7
How to measure performance, according to AT:
Quote:
Normally in our DRAM reviews I refer to the performance index, which has a similar effect in gauging general performance:
DDR3-1600 C11: 1600/11 = 145.5
DDR4-2133 C15: 2133/15 = 142.2
As you have faster memory, you get a bigger number, and if you reduce the CL, we get a bigger number also. Thus for comparing memory kits, if the difference > 10, then the kit with the biggest performance index tends to win out, though for similar kits the one with the highest frequency is preferred.
Performance index=frequency/CAS, supposedly.

And now the RAM they chose:
Quote:
For these tests, both sets of numbers were run at 3.0 GHz with hyperthreading disabled. Memory speeds were DDR4-2133 C15 and DDR3-1866 C9 respectively.
DDR4: 2133/15=142.2
DDR3L: 1866/9=207.3
A difference of 65 in favor of DDR3L

Compare these to the "not even close" RAM that Hardware Canucks chose:
DDR4: 2666/13=205.1
DDR3L: 1866/11=169.6
A difference of about 35 in favor of DDR4

It looks to me like Hardware Canucks' choice of RAM is actually significantly closer than Anandtech's.
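
For anyone who wants to redo the arithmetic above, here is a minimal Python sketch of the "performance index" as quoted from AT (data rate divided by CAS latency). The function name and the dictionary layout are made up for illustration; only the kit speeds and CAS values come from the posts above. The index is a rough rule of thumb, not a latency or bandwidth measurement.

```python
def performance_index(data_rate_mts: int, cas_latency: int) -> float:
    """AT-style rule of thumb: data rate (MT/s) divided by CAS latency (cycles)."""
    return data_rate_mts / cas_latency


# (data rate, CAS) pairs exactly as quoted in the thread.
setups = {
    "AnandTech": {"DDR4": (2133, 15), "DDR3L": (1866, 9)},
    "Hardware Canucks": {"DDR4": (2666, 13), "DDR3L": (1866, 11)},
}

for site, kits in setups.items():
    ddr4 = performance_index(*kits["DDR4"])
    ddr3l = performance_index(*kits["DDR3L"])
    leader = "DDR4" if ddr4 > ddr3l else "DDR3L"
    print(f"{site}: DDR4={ddr4:.1f} DDR3L={ddr3l:.1f} "
          f"gap={abs(ddr4 - ddr3l):.1f} in favor of {leader}")
```

The smaller the gap, the more closely matched the two kits are by this metric.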

Grall · Aug 7, 2015

Turtle 1 said:
Latency means a lot. My really old DDR1 BH1 timing was 2-2-2-2-2; it matters.

Depends on how you define "a lot"; in almost all workloads, the fastest, most expensive, lowest-possible latency boutique RAM will only buy you a few percent speed increase, because most memory accesses hit the CPU's caches and not main RAM. Most people wouldn't agree that's "a lot", and would think it a terrible waste of money (it is, btw.)

Your old DDR1 latency had only a couple clocks' latency because the memory was very slow-clocked. Don't stare yourself blind at clock cycle numbers, what really matters is actual latency (as measured in ns), not clock cycle counts.

gongo · Aug 7, 2015

Guys ...can we get back to..snooping around with Skylake dynamic clocks/fivr(lack of)...

ram talk is boring...

sebbbi · Aug 7, 2015

Kaarlisk said:
Wow.
The performance increase over Haswell is really impressive. In some applications, not all. I do wonder what they did to accomplish that?
Did Intel solve a bottleneck (memory bandwidth, for example), and if they did, then which? Or is it that the performance increase is only in ALU-limited workloads?

Some reviews mentioned backbuffer color compression for the Gen9 GPU. Color compression is huge for integrated GPUs, as the bandwidth of dual-channel DDR3/DDR4 is limited to 25.6 GB/s - 34.1 GB/s (a tenth of a high-end discrete GPU's). Color compression would be an easy 30%+ improvement for purely bandwidth-bound cases.
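
The 25.6 GB/s and 34.1 GB/s figures can be reproduced with a small Python sketch, assuming they refer to DDR3-1600 and DDR4-2133 (the officially supported speeds mentioned earlier in the thread) on two 64-bit channels. The function name and defaults are illustrative, and this is theoretical peak bandwidth, not what an IGP actually achieves.

```python
def peak_bandwidth_gbs(data_rate_mts: float, channels: int = 2,
                       bus_width_bits: int = 64) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8  # 8 bytes per 64-bit channel
    return data_rate_mts * channels * bytes_per_transfer / 1000


for name, rate in (("DDR3-1600", 1600), ("DDR4-2133", 2133)):
    print(f"dual-channel {name}: {peak_bandwidth_gbs(rate):.1f} GB/s")
```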

Infinisearch · Aug 8, 2015

Sorry gongo.

Turtle 1 said:
I would like to see the math on that if you would.

I was trying, from memory (and most probably failed), to approximate the latency to the critical first word. I didn't take the command rate into account since I didn't remember how. In addition, I just used the CAS latency since I don't remember the timing diagrams anymore. So I quickly did a 1/f to get the period and multiplied by the CAS latency. If someone can point me to a good timing diagram or operational description for DDR3/4 DIMMs it would be appreciated. Thanks.
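
A rough Python sketch of that 1/f times CAS shortcut, for reference. Taking f as the data rate reproduces the 4.82 ns and 7.03 ns figures quoted earlier. Since DDR memory transfers data twice per clock, CAS latency is counted in cycles of a clock running at half the data rate, so the second mode below divides by that instead. That doubles both numbers but leaves their ratio unchanged. Command rate, tRCD and tRP are still ignored, so this is only the CAS portion of first-word latency; the function and parameter names are made up for illustration.

```python
def cas_latency_ns(data_rate_mts: float, cas_cycles: int,
                   use_io_clock: bool = True) -> float:
    """CAS latency in nanoseconds.

    use_io_clock=True  -> period of the real I/O clock (data rate / 2)
    use_io_clock=False -> the shortcut that treats the data rate as f
    """
    clock_mhz = data_rate_mts / 2 if use_io_clock else data_rate_mts
    return cas_cycles / clock_mhz * 1000  # cycles / MHz = microseconds


kits = (("DDR3-1866 CL9", 1866, 9),
        ("DDR4-2133 CL15", 2133, 15),
        ("DDR3-1600 CL11", 1600, 11))

for name, rate, cl in kits:
    shortcut = cas_latency_ns(rate, cl, use_io_clock=False)
    actual = cas_latency_ns(rate, cl)
    print(f"{name}: shortcut {shortcut:.2f} ns, I/O-clock based {actual:.2f} ns")
```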

Skylake performance in some benchmarks is really impressive; I'm impressed with Intel given they're still extracting IPC improvements from x86... I wonder how long that's going to last given the same bog-standard cache hierarchy? Can't wait for the 72 EU eDRAM Skylake benchies; I want a cheap laptop that I can play some games on. Does anyone know offhand how much eDRAM adds to the price of an Intel CPU at the same clock without eDRAM?

Paran · Aug 16, 2015

https://software.intel.com/sites/de...ure-of-Intel-Processor-Graphics-Gen9-v1d0.pdf

pixelio · Aug 17, 2015

Nice find!

The changes from Gen8 that grabbed my attention are:

Preemption is interesting.

But if the EU thread scheduling scheme in pre-Gen9 wasn't round-robin then what was it?

Deleted member 13524 · Aug 17, 2015

Isn't the GPU proportion getting bigger in each iteration?
If that's a chip with 24 EUs, then the 72 EU chip will be almost twice as big, not to mention the eDRAM in a separate chip.

Kaarlisk · Aug 17, 2015

ToTTenTranz said:
Isn't the GPU proportion getting bigger in each iteration?

Yup.
In Haswell, a GT2 GPU is approximately 3.4 times larger than a single CPU core.
In Skylake, a GT2 GPU is approximately 5.4 times larger than a single CPU core.
However, as always, the math might be different. And it may also be impossible to make the CPU core much larger.

sebbbi · Aug 17, 2015

pixelio said:
But if the EU thread scheduling scheme in pre-Gen9 wasn't round-robin then what was it?

I don't remember the old document describing it exactly. Previously one EU had 7 HW threads (waves) ready to execute (each of these was either SIMD8/16/32 or SIMD4x2). There were two SIMD4 execution units. In a single cycle the two execution units could not both take an instruction from the same HW thread. If I understood correctly, there were no other scheduling limitations. I didn't find any instruction latency chart either.

My guess would be that the new hardware has more strict scheduling limitations, allowing more efficient hardware implementation.

Kaarlisk · Aug 17, 2015

Another change: eDRAM is now a "memory-side cache", not a "victim cache". The eDRAM controller is now a part of the system agent (previously it was a separate stop on the ring).

On an LLC or EDRAM cache miss, data from DRAM will be filled first into EDRAM. (An optional mode also allows bypass to LLC.) Conversely, as cachelines are evicted from LLC, they will be written back into EDRAM. If compute kernels wish to read or write cachelines currently stored in EDRAM, they are quickly re-loaded into LLC, and read/writing then proceeds as usual.

Paran · Aug 17, 2015

ToTTenTranz said:
Isn't the GPU proportion getting bigger in each iteration?
If that's a chip with 24 EUs, then the 72 EU chip will be almost twice as big, not to mention the eDRAM in a separate chip.

35,2% for the GPU doesn't look much different to me. Assuming ~120 mm² is correct, GT2 Gen9 is only 42 mm² big.

Kaarlisk · Aug 17, 2015

Paran said:
35,2% for the GPU doesn't look much different to me.

It was 31% of the die for the GPU in Haswell. There is rather a lot of unidentified space in the Skylake die image, unlike Haswell.

Paran said:
Assuming ~120 mm² is correct, GT2 Gen9 is only 42 mm² big.

Yup.
Still, unlike the CPU cores, which shrank, the GT2 GPU has grown. Skylake GT2 is the first one that has become performant enough that it actually makes sense. Before Skylake, it was either don't care about graphics, so a GT1 will do, or go cheap discrete. Maybe except in the case of 4K desktop or something else specific.

3dilettante · Aug 17, 2015

pixelio said:
Nice find!

The changes from Gen8 that grabbed my attention are:

Preemption is interesting.

But if the EU thread scheduling scheme in pre-Gen9 wasn't round-robin then what was it?

Maybe it has to do with a flattening of the prioritization of instruction issue, which may help with quality of service and possibly preemption.
Going further back: http://www.realworldtech.com/sandy-bridge-gpu/5/

The thread scheduling within a core is primarily hardware managed. The highest priority thread with a ready instruction is sent down the pipeline and can execute for several cycles. A thread will stall if an instruction is still waiting for operands and will be switched out.

This description seems to indicate that a thread with sufficient priority is able to take successive issue cycles. This does sound more complex to manage than round-robin, and might allow a thread that didn't stall to dominate execution time, which might raise fairness issues.

sebbbi · Aug 18, 2015

3dilettante said:
This description seems to indicate that a thread with sufficient priority is able to take successive issue cycles. This does sound more complex to manage than round-robin, and might allow a thread that didn't stall to dominate execution time, which might raise fairness issues.

That is my assumption as well. Round robin is simpler and more fair. And it gives the threads more time to hide instruction latency (assuming some complex instructions need this). Downside of course is that on average round robin finishes threads slightly slower, potentially causing slightly more resource and cache contention (depending of course on data access patterns).
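
To make the trade-off concrete, here is a toy Python model of the two arbitration policies. It is only a sketch under loud assumptions: one issue slot per cycle, threads that never stall, and a fixed priority order where the lowest index wins. None of this is taken from Intel's documentation, and real EUs have two issue ports and operand stalls, so the gap here is exaggerated compared to the "slightly slower" case described above.

```python
def simulate(instr_counts, policy):
    """Issue one instruction per cycle; return each thread's finish cycle."""
    remaining = list(instr_counts)
    finish = [0] * len(remaining)
    cycle = 0
    last = -1  # thread that issued most recently (used by round robin)
    while any(remaining):
        ready = [i for i, r in enumerate(remaining) if r > 0]
        if policy == "priority":
            pick = ready[0]  # lowest index = highest priority (assumption)
        elif policy == "round_robin":
            # next ready thread after the last issuer, wrapping around
            pick = next((i for i in ready if i > last), ready[0])
        else:
            raise ValueError(f"unknown policy: {policy}")
        cycle += 1
        remaining[pick] -= 1
        if remaining[pick] == 0:
            finish[pick] = cycle
        last = pick
    return finish


work = [4, 4, 4]  # three hypothetical threads, four instructions each
for policy in ("priority", "round_robin"):
    done = simulate(work, policy)
    print(policy, done, "mean finish:", sum(done) / len(done))
```

In this toy, both policies finish the last thread on the same cycle, but fixed priority retires the first thread much earlier (freeing its resources sooner), while round robin spreads issue slots evenly at the cost of a later average finish.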

Paran · Aug 18, 2015

All slides available here: https://hubb.blob.core.windows.net/...AN3rsvP4v+JmU=&se=2015-08-19T17:25:06Z&sp=rwd

Kaarlisk · Aug 18, 2015

Paran said:

This is weird. I was pretty sure and I checked a couple of places, IronLake was 45nm (the GMCH was 45nm, the CPU 32nm).

Also new in Skylake: EU simplified to “scalar” mode.
Is that about those different SIMD widths or something else?

moozoo · Aug 19, 2015

I'm just sad that, besides not implementing fp64 in OpenCL, they are also capping the DP flops at values below or similar to the CPU cores' flops.
i.e. a 1/4 ratio means that 1152 Gflops fp32 -> 288 Gflops fp64
Note fp64 is available in DirectX compute shaders, C++ AMP and OpenGL compute shaders. It's not a hardware issue.
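
A quick Python sketch of where those numbers plausibly come from. It assumes the 72 EU part, 16 fp32 flops per EU per clock (two SIMD4 FPUs, each doing a multiply-add per lane) and a 1.0 GHz clock. The EU count and clock are assumptions chosen because they reproduce the 1152 figure; they are not stated in the post, and the constant names are made up.

```python
def peak_gflops(eu_count: int, flops_per_eu_per_clock: int,
                clock_ghz: float) -> float:
    """Theoretical peak throughput in GFLOPS."""
    return eu_count * flops_per_eu_per_clock * clock_ghz


FP32_FLOPS_PER_EU_PER_CLOCK = 16  # 2 SIMD4 FPUs x 4 lanes x 2 (multiply-add)
ASSUMED_EU_COUNT = 72             # assumption: the GT4-class part
ASSUMED_CLOCK_GHZ = 1.0           # assumption: picked to match 1152 Gflops
FP64_RATIO = 1 / 4                # the ratio quoted in the post

fp32 = peak_gflops(ASSUMED_EU_COUNT, FP32_FLOPS_PER_EU_PER_CLOCK,
                   ASSUMED_CLOCK_GHZ)
fp64 = fp32 * FP64_RATIO
print(f"fp32: {fp32:.0f} Gflops, fp64 at 1/4 rate: {fp64:.0f} Gflops")
```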
