Intel Gen9 Skylake

Page 21 has some figures for preemption across the generations, which look like the switching delay for task preemption at thread-group boundaries versus mid-thread preemption.
The "before" time periods are reminiscent of some of the early evaluations for PS4 GPU audio compute by Sony's audio engineer, with App 2 being in the same range. I would imagine he'd have been happier if the numbers given back then had been closer to Gen9's usecs rather than the tens of msecs that Gen8 and the PS4 GPU (loaded latency) provided at the time.
 
So Skylake's GT4e is already in the same ballpark as the 8th generation of consoles, with a fraction of the TDP.

That was fast.
 
So Skylake's GT4e is already in the same ballpark as the 8th generation of consoles, with a fraction of the TDP.
Yes, the new 72 EU GPU is getting close to consoles in performance. If there still are developers that do not validate and optimize their games for Intel GPUs, now is the time to start doing it. Intel is definitely now a competitor in the laptop gaming market :)

GT4e requires an external EDRAM die, so it is more expensive to manufacture than the single chip console solutions. But Intel CPUs have huge (L3 and L4) shared caches and much faster CPU than the consoles... And TSX. AVX-512 is the best SIMD instruction set I have seen for ages. It would be nice to have all these goodies on consoles :)
 
GT4e requires an external EDRAM die, so it is more expensive to manufacture than the single chip console solutions.

The SoC+eDRAM combination is probably more expensive (the dies are a lot smaller, but the cost of using 14nm is probably higher?).

But the whole system may actually be cheaper to make? For starters, its peak performance is achieved using a much narrower system memory bus: 128bit 2133MHz DDR4 on Skylake vs. 256bit 2133MHz DDR3 on the XBone. This means a less complex PCB, fewer memory chips and less power consumption.
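As a rough sanity check, peak DRAM bandwidth is just bus width times transfer rate (8 bytes per 64-bit channel per transfer):

128-bit DDR4-2133: 16 B/transfer x 2133 MT/s ≈ 34.1 GB/s
256-bit DDR3-2133: 32 B/transfer x 2133 MT/s ≈ 68.3 GB/s

The eDRAM is what has to make up that 2x gap for the GPU's hot working set.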
And then the fact that it consumes less power and generates less heat means they can save on the power regulation and the cooling system.

Of course, this is highly speculative and it always will be. As they stand right now, Intel would never subject their chips to the low-margin deals that a console requires.
 
It's probably even higher considering they used a 1GHz clock as the base. Don't Intel usually use roughly 1100MHz-1300MHz clocks for their higher-end iGPUs? Very impressive performance, I'm sure.
 
It's probably even higher considering they used a 1GHz clock as the base. Don't Intel usually use roughly 1100MHz-1300MHz clocks for their higher-end iGPUs?
Depends on whether it will be power limited. If there is a 95W TDP GT4e, it might run at the full 1300MHz.
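Rough numbers, counting each Gen9 EU as 2x SIMD-4 FP32 FMA units (16 FLOPs/clock) and treating anything above 1GHz as a guess at the boost clock:

72 EU x 16 FLOPs/clock x 1.00 GHz ≈ 1.15 TFLOPS
72 EU x 16 FLOPs/clock x 1.15 GHz ≈ 1.32 TFLOPS

That straddles the Xbox One's ~1.31 TFLOPS and is still well short of the PS4's ~1.84 TFLOPS.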

But the whole system may actually be cheaper to make?
Yup. The premium for GT4e (instead of GT2) may be as low as $50-$100. That may actually make it competitive with quad+discrete. Before Broadwell-C, the Haswell-R wasn't socketed and the systems it was in were too expensive.
 
GT4e requires an external EDRAM die, so it is more expensive to manufacture than the single chip console solutions. But Intel CPUs have huge (L3 and L4) shared caches and much faster CPU than the consoles... And TSX. AVX-512 is the best SIMD instruction set I have seen for ages. It would be nice to have all these goodies on consoles :)
This came up in the 16-core thread, but it seems like Intel has bifurcated its core line, with one path being client and the other being server. There is a slide in one of the latest IDF presentations to the effect that cores are being specialized between the two markets.
Is there evidence that the client line, and not just the server Xeon line, will get AVX-512? So far it has only come up for Xeon and Xeon Phi. Perhaps it's AVX 512/2?
 
Is there evidence that the client line, and not just the server Xeon line, will get AVX-512? So far it has only come up for Xeon and Xeon Phi. Perhaps it's AVX 512/2?
The fastest existing consumer Skylake i7 only supports AVX2 (http://ark.intel.com/products/88195/Intel-Core-i7-6700K-Processor-8M-Cache-up-to-4_20-GHz). So I assume there will not be AVX-512 in any i7 models. I believe that workstation Xeons (4-8 cores) will have AVX-512 (in addition to the big server chips). But there is no official info yet.
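If you'd rather probe the silicon than trust ARK, a quick runtime check is easy; a minimal sketch using the GCC/Clang feature-detection builtins ("avx512f" is the foundation subset everything else builds on):

#include <stdio.h>

int main(void) {
    /* GCC/Clang runtime CPU feature detection (x86 only). */
    __builtin_cpu_init();
    printf("AVX2:     %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
    printf("AVX-512F: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}

On a 6700K this should print yes for AVX2 and no for AVX-512F, matching the ARK listing.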
 
I'm just sad that, besides not exposing fp64 in OpenCL, they are also capping the DP FLOPs at values below or similar to the CPU cores' FLOPs.
IMHO it still has far too many DP FLOPs, and other modern GPUs tend to agree. I'd easily trade another factor or two in DP FLOPs for more SP/3D throughput. Frankly I see this the other way around: if you really need heavy DP (and I still argue in most cases you don't, or only for a small part of the computation) then there are many suitable targets, including, as you note, CPUs, which are quite competent. Xeon/Xeon Phi if you want to go nuts :)
 
IMHO it still has far too many DP FLOPs, and other modern GPUs tend to agree.
If it is just a GPU, absolutely.
However, one might also position this as an alternative to AVX-512:
On server CPUs, it is more efficient to implement high-throughput units within the cores, as they do not need an iGPU
On client CPUs, as they already have the GPU, it is cheaper to add DP capability to the GPU instead of implementing AVX-512
So the issue of DP not being exposed through OpenCL may be even more important, as it limits developers when experimenting and figuring out whether they have a use for the GPU for pure compute applications.
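For what it's worth, checking whether a driver exposes fp64 is a couple of standard queries. A minimal host-side sketch (assuming an OpenCL 1.2 runtime, first GPU on the first platform, error handling omitted):

#include <stdio.h>
#include <string.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    char extensions[4096];
    cl_device_fp_config fp64_cfg = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* cl_khr_fp64 in the extension string means double is usable in kernels. */
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);
    printf("cl_khr_fp64: %s\n", strstr(extensions, "cl_khr_fp64") ? "yes" : "no");

    /* CL_DEVICE_DOUBLE_FP_CONFIG is 0 when doubles are not supported at all. */
    clGetDeviceInfo(device, CL_DEVICE_DOUBLE_FP_CONFIG, sizeof(fp64_cfg), &fp64_cfg, NULL);
    printf("double fp config: %s\n", fp64_cfg ? "non-zero (supported)" : "0 (unsupported)");
    return 0;
}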

Which slide is this?
Slide 10
 
If it is just a GPU, absolutely.
However, one might also position this as an alternative to AVX-512:
On server CPUs, it is more efficient to implement high-throughput units within the cores, as they do not need an iGPU
On client CPUs, as they already have the GPU, it is cheaper to add DP capability to the GPU instead of implementing AVX-512
So the issue of DP not being exposed through OpenCL may be even more important, as it limits developers when experimenting and figuring out whether they have a use for the GPU for pure compute applications.


Slide 10
GPUs are many things, but drop-in replacements for established CPU SIMD/vector bits they are not. Also I share Andrew's opinion that people frequently misuse doubles i.e. "it makes the bug go away, clearly I need MOAR PRECISION". What would be beneficial is a better understanding and mastery of numerical analysis, as opposed to more double throughput to be thrown at the problem.
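To illustrate with the textbook case: the quadratic formula loses its small root to cancellation in single precision, and the cure is to reformulate rather than to reach for doubles. A minimal sketch in plain C (the coefficients are made up):

#include <math.h>
#include <stdio.h>

/* Naive single-precision quadratic: the small root is destroyed by
   subtracting two nearly equal numbers. */
static void roots_naive(float a, float b, float c, float *x1, float *x2) {
    float d = sqrtf(b*b - 4.0f*a*c);
    *x1 = (-b + d) / (2.0f*a);
    *x2 = (-b - d) / (2.0f*a);
}

/* Stable version, still single precision: avoid the cancellation and
   recover the second root from the product of the roots (c/a). */
static void roots_stable(float a, float b, float c, float *x1, float *x2) {
    float d = sqrtf(b*b - 4.0f*a*c);
    float q = -0.5f * (b + copysignf(d, b));
    *x1 = q / a;
    *x2 = c / q;
}

int main(void) {
    float x1, x2;
    /* x^2 - 10000x + 1: exact roots are ~9999.9999 and ~0.0001 */
    roots_naive(1.0f, -10000.0f, 1.0f, &x1, &x2);
    printf("naive : %g %g\n", x1, x2);
    roots_stable(1.0f, -10000.0f, 1.0f, &x1, &x2);
    printf("stable: %g %g\n", x1, x2);
    return 0;
}

Same hardware, same 32-bit floats; the stable version gets the small root right while the naive one gets it badly wrong.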
 
GPUs are many things, but drop-in replacements for established CPU SIMD/vector bits they are not.
Still, isn't there at least some overlap in possible applications?

Also I share Andrew's opinion that people frequently misuse doubles i.e. "it makes the bug go away, clearly I need MOAR PRECISION".
I certainly did not mean to dispute this.
 
1/4 DP rate is a very decent rate for a consumer GPU. AFAIR, 1/4 should mean no investment into DP and no crippling either. 1/2 would have needed some more hardware to be added by Intel.

I, and I expect the great majority of programmers, would rather see Intel invest in extra hardware on the CPU vector-extension side, as those extensions are far more widely applicable.
 
1/4 DP rate is a very decent rate for a consumer GPU. AFAIR, 1/4 should mean no investment into DP and no crippling either.
1/4 DP rate is perfect. I don't understand the need of 1/2 rate DP. A good programmer analyses his data and formulas and knows where the extra precision is important and where it doesn't give you any improvements. Even if you need DP, an optimized program should be a mix of 32 bit float, 64 bit float and 16/32 bit integer. In consumer applications (such as games and image processing), you get very good results by mixing 32 bit floats and 16 bit floats (and some 16/32 bit integers of course).

In games (and image processing applications) we analyze our data carefully and frequently use formats such as 8/16 bit normalized (fixed point) and 10/11/16 bit float to optimize our memory bandwidth. Double rate 16 bit float processing (and halved GPR cost of 16 bit registers) is much more important than DP for us. Intel's fp16 implementation is now top notch. This combined with 2x improved integer processing rate (in Broadwell) and 1/4 rate DP (which is better compared to other consumer GPUs) make their GPUs very good for multiple purposes, both consumer workloads and workloads that need DP. For brute force pure DP workloads you should consider a Tesla (1/3 DP rate) or a Xeon Phi.
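As a small illustration of that kind of trade-off on the CPU side (using the F16C intrinsics as a stand-in for what the GPU does natively; build with -mf16c, the values are arbitrary):

#include <stdio.h>
#include <immintrin.h>   /* F16C scalar conversions */

int main(void) {
    /* Round-trip a value through fp16 to see what precision is given up. */
    float original = 3.14159274f;
    unsigned short half = _cvtss_sh(original, 0);   /* 0 = round to nearest even */
    float back = _cvtsh_ss(half);
    printf("fp32: %.8f  fp16: %.8f  err: %.8f\n", original, back, original - back);

    /* 8-bit unorm, the classic colour/texture format: 256 even steps in [0,1]. */
    float c = 0.7357f;
    unsigned char u8 = (unsigned char)(c * 255.0f + 0.5f);
    printf("unorm8 of %.4f -> %u -> %.4f\n", c, (unsigned)u8, u8 / 255.0f);
    return 0;
}

fp16 keeps roughly 3 decimal digits, which is plenty for colours and normals but not for world-space positions or long accumulations; that is exactly the kind of per-variable decision being described.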
 
Thanks, that's more than I was expecting, so potentially 24 ROPs for GT4e. Accounting for the higher clock speed, that's in PS4 fill-rate territory!
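Back-of-the-envelope, assuming 24 ROPs and something like a 1.15GHz boost clock for GT4e:

GT4e: 24 pixels/clock x 1.15 GHz ≈ 27.6 Gpixels/s
PS4:  32 pixels/clock x 0.80 GHz = 25.6 Gpixels/s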
 