Intel Gen9 Skylake

3dilettante · Aug 19, 2015

Page 21 has some figures for preemption between the generations, which looks like it could be the switching delay for task preemption between a thread group and mid-thread preemption.
The "before" time periods are reminiscent of some of the early evaluations for PS4 GPU audio compute by Sony's audio engineer, with App 2 being in the same range. I would imagine he'd have been happier if the numbers given back then were closer to Gen9's usecs versus the tens of msecs that Gen8 and the PS4 GPU (loaded latency) provided at the time.

Deleted member 13524 · Aug 19, 2015

So Skylake's GT4e is already in the same ballpark as the 8th generation of consoles, with a fraction of the TDP.

That was fast.

sebbbi · Aug 19, 2015

ToTTenTranz said:
So Skylake's GT4e is already in the same ballpark as the 8th generation of consoles, with a fraction of the TDP.

Yes, the new 72 EU GPU is getting close to consoles in performance. If there are still developers who do not validate and optimize their games for Intel GPUs, now is the time to start doing it. Intel is definitely now a competitor in the laptop gaming market.

GT4e requires an external EDRAM die, so it is more expensive to manufacture than the single chip console solutions. But Intel CPUs have huge (L3 and L4) shared caches and a much faster CPU than the consoles... And TSX. AVX-512 is the best SIMD instruction set I have seen for ages. It would be nice to have all these goodies on consoles.

Deleted member 13524 · Aug 19, 2015

sebbbi said:
GT4e requires an external EDRAM die, so it is more expensive to manufacture than the single chip console solutions.

The SoC+eDRAM chips are probably more expensive (they're a lot smaller, but the cost of using 14nm is probably higher?).

But the whole system may actually be cheaper to make? For starters, its peak performance is achieved using a much narrower system memory bus: 128-bit 2133MHz DDR4 on Skylake vs. 256-bit 2133MHz DDR3 on the XBone. This means a less complex PCB, fewer memory chips and less power consumption.
And then the fact that it consumes less power and generates less heat means they can save on the power regulation and the cooling system.
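
To put the bus-width comparison in numbers, here is a rough sketch of the theoretical peak bandwidth of each system memory bus. It assumes the "2133MHz" figures above mean 2133 MT/s, and it ignores the eDRAM and ESRAM pools entirely. The function and variable names are made up for illustration.

```python
def peak_bandwidth_gbs(bus_width_bits: int, mega_transfers_per_sec: float) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_sec * 1e6 / 1e9


# Assumption: "2133MHz" in the post is read as 2133 MT/s for both systems.
skylake = peak_bandwidth_gbs(128, 2133)  # 128-bit DDR4
xbone = peak_bandwidth_gbs(256, 2133)    # 256-bit DDR3

print(f"Skylake DDR4: {skylake:.1f} GB/s")
print(f"XBone DDR3:   {xbone:.1f} GB/s")
```

Under those assumptions the narrower bus delivers half the raw system memory bandwidth, so the cost argument depends on the eDRAM making up the difference.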

Of course, this is highly speculative and always will be. As they stand right now, Intel would never subject their chips to the low-margin deals that a console requires.

Newguy · Aug 19, 2015

It's probably even higher considering they used a 1GHz clock as the base. Doesn't Intel usually use roughly 1100MHz-1300MHz clocks for their higher-end iGPUs? Very impressive performance, I'm sure.

Kaarlisk · Aug 19, 2015

Newguy said:
It's probably even higher considering they used a 1GHz clock as the base, don't Intel usually use roughly 1100MHz-1300MHz clocks for their higher end igpus?

Depends on whether it will be power limited. If there is a 95W TDP GT4e, it might run at the full 1300.

ToTTenTranz said:
But the whole system may actually be cheaper to make?

Yup. The premium for GT4e (instead of GT2) may be as low as $50-$100. That may actually make it competitive with quad+discrete. Before Broadwell-C, the Haswell-R wasn't socketed and the systems it was in were too expensive.

3dilettante · Aug 19, 2015

sebbbi said:
GT4e requires an external EDRAM die, so it is more expensive to manufacture than the single chip console solutions. But Intel CPUs have huge (L3 and L4) shared caches and much faster CPU than the consoles... And TSX. AVX-512 is the best SIMD instruction set I have seen for ages. It would be nice to have all these goodies on consoles

This came up in the 16-core thread, but it seems like Intel has bifurcated its core line, with one path being client and the other being server. There is a slide in one of the latest IDF presentations to the effect that cores are being specialized between the two markets.
Is there evidence that the client and not the server Xeon line will get AVX-512? So far it has only come up for Xeon and Xeon Phi. Perhaps it's AVX 512/2?

sebbbi · Aug 19, 2015

3dilettante said:
Is there evidence that the client and not the server Xeon line will get AVX-512? So far it has only come up for Xeon and Xeon Phi. Perhaps it's AVX 512/2?

The fastest existing i7 consumer Skylake only supports AVX2 (http://ark.intel.com/products/88195/Intel-Core-i7-6700K-Processor-8M-Cache-up-to-4_20-GHz). So I assume there will not be AVX-512 in any i7 models. I believe that workstation Xeons (4-8 cores) have AVX-512 (in addition to big server chips). But there is no official info yet.
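
For anyone who wants to check a given machine rather than the spec page, a minimal sketch follows. It assumes a Linux system where `/proc/cpuinfo` lists `avx2` and `avx512f` in its `flags` line. The helper name `simd_support` is hypothetical, and the sample text is a shortened illustration, not output captured from a real CPU.

```python
def simd_support(cpuinfo_text: str) -> dict:
    """Report AVX2 / AVX-512 Foundation support from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return {"avx2": "avx2" in flags, "avx512f": "avx512f" in flags}


# Illustrative, shortened flags line (not captured from a real machine).
sample = "model name\t: hypothetical client CPU\nflags\t\t: fpu sse2 avx avx2 fma\n"
print(simd_support(sample))
```

On a real Linux box the same function could be fed `open("/proc/cpuinfo").read()`.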

Andrew Lauritzen · Aug 19, 2015

moozoo said:
I'm just sad that besides not implementing fp64 in opencl they are also capping the DP flops below or similar values to the CPU core flops.

IMHO it still has far too many DP FLOPs, and other modern GPUs tend to agree. I'd easily trade another factor or two in DP FLOPs for more SP/3D throughput. Frankly I see this the other way around: if you really need heavy DP (and I still argue in most cases you don't, or only for a small part of the computation) then there are many suitable targets, including, as you note, CPUs, which are quite competent. Xeon/Xeon Phi if you want to go nuts.

iMacmatician · Aug 19, 2015

3dilettante said:
There is a slide in one of the latest IDF presentations to the effect that cores are being specialized between the two markets.

Which slide is this?

Paran · Aug 19, 2015

Power Optimization in Intel® Graphics Technology, Gen9
https://hubb.blob.core.windows.net/...t1gUR9aDypS1c=&se=2015-08-20T19:10:07Z&sp=rwd

Kaarlisk · Aug 19, 2015

Andrew Lauritzen said:
IMHO it still has far too much DP FLOPs and other modern GPUs tend to agree.

If it is just a GPU, absolutely.
However, one might also position this as an alternative to AVX-512:
On server CPUs, it is more efficient to implement high-throughput units within the cores, as they do not need an iGPU
On client CPUs, as they already have the GPU, it is cheaper to add DP capability to the GPU instead of implementing AVX-512
So the issue of DP not being exposed through OpenCL may be even more important, as it limits developers when experimenting and figuring out whether they have a use for the GPU for pure compute applications.

iMacmatician said:
Which slide is this?

Slide 10

AlexV · Aug 19, 2015

Kaarlisk said:
If it is just a GPU, absolutely.
However, one might also position this as an alternative to AVX-512:
On server CPUs, it is more efficient to implement high-throughput units within the cores, as they do not need an iGPU
On client CPUs, as they already have the GPU, it is cheaper to add DP capability to the GPU instead of implementing AVX-512
So the issue of DP not being exposed through OpenCL may be even more important, as it limits developers when experimenting and figuring out whether they have a use for the GPU for pure compute applications.

Slide 10

GPUs are many things, but drop-in replacements for established CPU SIMD/vector bits they are not. Also I share Andrew's opinion that people frequently misuse doubles i.e. "it makes the bug go away, clearly I need MOAR PRECISION". What would be beneficial is a better understanding and mastery of numerical analysis, as opposed to more double throughput to be thrown at the problem.
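
One small illustration of that point, as a sketch rather than anything tied to a specific GPU: compensated (Kahan) summation can recover most of the accuracy of a long single-precision sum without switching to doubles. Single precision is emulated here by rounding through `struct`, and the helper names are hypothetical.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum_f32(values):
    total = 0.0
    for v in values:
        total = f32(total + v)  # every add is rounded to single precision
    return total


def kahan_sum_f32(values):
    total = 0.0
    comp = 0.0  # running compensation for lost low-order bits
    for v in values:
        y = f32(v - comp)
        t = f32(total + y)
        comp = f32(f32(t - total) - y)
        total = t
    return total


n = 1_000_000
x = f32(0.1)
reference = x * n  # exact in binary64, used as the reference value
naive_err = abs(naive_sum_f32([x] * n) - reference)
kahan_err = abs(kahan_sum_f32([x] * n) - reference)
print(f"naive fp32 error: {naive_err:g}")
print(f"kahan fp32 error: {kahan_err:g}")
```

The naive sum drifts visibly once the running total dwarfs each addend, while the compensated sum stays within a couple of single-precision ulps of the reference, at the cost of a few extra SP operations per element.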

Kaarlisk · Aug 19, 2015

AlexV said:
GPUs are many things, but drop-in replacements for established CPU SIMD/vector bits they are not.

Still, isn't there at least some overlap in possible applications?

AlexV said:
Also I share Andrew's opinion that people frequently misuse doubles i.e. "it makes the bug go away, clearly I need MOAR PRECISION".

I certainly did not mean to dispute this.

entity279 · Aug 20, 2015

1/4 DP rate is a very decent rate for a consumer GPU. AFAIR, 1/4 should mean no investment into DP and no crippling either. 1/2 would have needed some more hardware to be added by Intel.
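
As a back-of-the-envelope sketch of what 1/4 rate means for the 72 EU part: the per-EU figure below (16 SP FLOPs per clock, i.e. 2 SIMD-4 FPUs with multiply-add) is my assumption and is not stated in this thread, and the 1 GHz clock is the base figure mentioned earlier.

```python
def peak_gflops(eus: int, flops_per_clk_per_eu: int, clock_ghz: float) -> float:
    """Theoretical peak throughput in GFLOPS."""
    return eus * flops_per_clk_per_eu * clock_ghz


EUS = 72
SP_FLOPS_PER_CLK_PER_EU = 16  # assumption: 2 SIMD-4 FPUs x 2 (multiply-add)
CLOCK_GHZ = 1.0               # 1 GHz base clock mentioned earlier in the thread

sp = peak_gflops(EUS, SP_FLOPS_PER_CLK_PER_EU, CLOCK_GHZ)
dp = sp / 4  # 1/4 DP rate

print(f"SP peak: {sp:.0f} GFLOPS")
print(f"DP peak: {dp:.0f} GFLOPS")
```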

Intel, I, and I expect the great majority of programmers would rather see some investment in extra hardware on the CPU vector extension side, as those extensions are far more widely applicable.

sebbbi · Aug 20, 2015

entity279 said:
1/4 DP rate is a very decent rate for a consumer GPU. AFAIR, 1/4 should mean no investment into DP and no crippling either.

1/4 DP rate is perfect. I don't understand the need for 1/2 rate DP. A good programmer analyses his data and formulas and knows where the extra precision is important and where it doesn't give you any improvements. Even if you need DP, an optimized program should be a mix of 32 bit float, 64 bit float and 16/32 bit integer. In consumer applications (such as games and image processing), you get very good results by mixing 32 bit floats and 16 bit floats (and some 16/32 bit integers of course).

In games (and image processing applications) we analyze our data carefully and frequently use formats such as 8/16 bit normalized (fixed point) and 10/11/16 bit float to optimize our memory bandwidth. Double rate 16 bit float processing (and halved GPR cost of 16 bit registers) is much more important than DP for us. Intel's fp16 implementation is now top notch. This combined with 2x improved integer processing rate (in Broadwell) and 1/4 rate DP (which is better compared to other consumer GPUs) make their GPUs very good for multiple purposes, both consumer workloads and workloads that need DP. For brute force pure DP workloads you should consider a Tesla (1/3 DP rate) or a Xeon Phi.
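
A minimal sketch of the storage side of that trade-off, using only Python's `struct` module: it compares bytes per value for a few of the formats mentioned and the round-trip error for an example value in [0, 1]. The helper names and the test value 0.7 are illustrative choices, not taken from any engine.

```python
import struct


def encode_unorm8(x: float) -> int:
    """Quantize a value in [0, 1] to an 8 bit normalized integer."""
    x = min(max(x, 0.0), 1.0)
    return int(x * 255.0 + 0.5)


def decode_unorm8(q: int) -> float:
    return q / 255.0


def roundtrip_fp16(x: float) -> float:
    """Store x as a 16 bit float and read it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


sizes = {
    "fp64": struct.calcsize("<d"),
    "fp32": struct.calcsize("<f"),
    "fp16": struct.calcsize("<e"),
    "unorm8": struct.calcsize("<B"),
}

value = 0.7  # illustrative input, e.g. one color channel
unorm8_err = abs(decode_unorm8(encode_unorm8(value)) - value)
fp16_err = abs(roundtrip_fp16(value) - value)

print(sizes)
print(f"unorm8 error: {unorm8_err:.6f}, fp16 error: {fp16_err:.6f}")
```

Whether those error levels are acceptable is exactly the per-data analysis described above; the bandwidth saving is simply the ratio of the byte sizes.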

pjbliverpool · Aug 20, 2015

Does anyone know how many ROPs and texture units are in GT2?

Ryan Smith · Aug 20, 2015

8 pixels per clock per slice (with 1 slice for GT2).

pjbliverpool · Aug 20, 2015

Thanks, that's more than I was expecting, so potentially 24 ROPs for GT4e. Accounting for the higher clock speed that's in PS4 fill rate territory!
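
A quick sketch of the arithmetic behind that comparison. The GT4e side assumes 3 slices at 8 pixels per clock each and the 1.0 to 1.3 GHz clock range discussed earlier in the thread. The PS4 figures (32 ROPs at 800 MHz) are the commonly cited ones and do not come from this thread.

```python
def fill_rate_gpix(rops: int, clock_ghz: float) -> float:
    """Theoretical peak pixel fill rate in Gpixels/s."""
    return rops * clock_ghz


GT4E_ROPS = 8 * 3                  # 8 pixels/clk/slice, assuming 3 slices
PS4_ROPS, PS4_CLOCK_GHZ = 32, 0.8  # commonly cited figures, not from this thread

ps4 = fill_rate_gpix(PS4_ROPS, PS4_CLOCK_GHZ)
print(f"PS4:  {ps4:.1f} Gpix/s")
for clock in (1.0, 1.15, 1.3):
    print(f"GT4e @ {clock:.2f} GHz: {fill_rate_gpix(GT4E_ROPS, clock):.1f} Gpix/s")
```

These are peak rates only; real fill rate also depends on bandwidth and blend format.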

Andrew Lauritzen · Aug 20, 2015

Yes, and alpha blend of most formats is also full rate on SKL (ex. 8ppc/slice for 32bpp). Texturing is 12 bilinear/clk/slice (like Broadwell).

More slides from this morning (I have no idea why this link is so wacky and I hope it works):

https://hubb.blob.core.windows.net/e5888822-986f-45f5-b1d7-08f96e618a7b-published/54f4f27e-62d8-4b7b-8364-fa8f110b1664/GVCS004 - SF15_GVCS004_100f.pdf?sv=2014-02-14&sr=c&sig=Cv7l/gyeEHCeyeBY+26YNU+bhh2HgcazoGBTkobMU10=&se=2015-08-21T18:15:08Z&sp=rwd
