Intel Skylake Platform

You could just knock down the screen resolution. A heavy-duty GPU running even the most advanced games at max settings is unlikely to be the bottleneck at, say, 720p...

In that case games are run at 1080p with no AA, on an absolutely ridiculously powerful GPU (What do you mean, 300 GB/s and 250 watts? Where's my pair of Voodoo2s?)

That is somewhat realistic, and it is interesting in that some people might actually want to run games that way (on a 1080p 120 Hz or 144 Hz display).
 
This thread is about Skylake, not your misperception of the usefulness of a thousand cores.

This (sub)thread is about high performance desktop processors, and our desire to get more performance. CPU frequency has been stuck at a wall of about 4 GHz for nearly a decade, so no improvement is to be expected there. IPC increases have also hit a wall.

So what remains?
Wider SIMD and more cores. Both of these require an effort in software design, but that is a doable exercise if top performance is required.
Wider SIMD should have been AVX-512; it's in Skylake, but conveniently disabled and shifted to the Xeon range.
More cores, yes of course!
If we don't get better SIMD, we expect at least something significant for the top dollar charged.
 
I expect that AVX-512 is so large that it makes economic sense to have a separate design and set of masks without it at all. But that's only a guess :)
 
Really? How big could it possibly be - AMD's half-rate FP execution units aren't very large individually (after all they fit 1500+ on a single chip), and a quad-core CPU would only have four AVX units...

The floorplans I've seen of past CPU generations show that most of the CPU core is either OoOE-related junk, or cache... Not actual execution hardware. But, what the hell do I know! Maybe I'm wrong, and AVX512 is gigantic. :D
 
The impact is not only on the functional units, but all over the place: you have to widen the registers (Haswell has 168 physical vector registers), all the paths to the register file, and so on. So frequency might be impacted too. Again, I might be wrong; it's a guess ;)

A counter-example is that KNL is supposed to have ~60 AVX-512 units on a chip, but it's possible the impact is less due to less aggressive (or none at all?) OoOE for SIMD stuff. And anyway that will most likely be a gigantic chip with reduced yields and outrageous price :D
 
Good point about KNL. It's even more extreme, as it has 72 cores, each with 2 x AVX-512 units, for a grand total of 144 AVX-512 units. If AVX-512 shows up in the low-end 4-core Xeons, then that's a good indication it's also in the consumer parts.
 
What am I missing here? I thought an AVX-512 unit would be capable of 64 SP FLOPs per cycle, so 144 of them would push over 9 TFLOPS even at just 1 GHz, but all the articles are quoting about 6 TFLOPS.
 
How do you get to 64 SP FLOPS/cycle/unit? I get 32: 16 SP / vector x 2 for FMA.

So two units would be 64 SP FLOPS/cycle and 6 TFLOPS would mean a frequency of about 1.3 GHz. Does that make sense?
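A quick sketch of that arithmetic, in case it helps. The 72 cores and 2 units per core are the figures quoted earlier in the thread, the 1.3 GHz clock is only the rumoured value, and the function name `peak_sp_tflops` is made up for illustration:

```python
def peak_sp_tflops(cores, units_per_core, flops_per_unit_per_cycle, ghz):
    """Theoretical peak single-precision TFLOPS (no real workload reaches this)."""
    flops_per_cycle = cores * units_per_core * flops_per_unit_per_cycle
    # FLOPs/cycle times GHz gives GFLOPS; divide by 1000 for TFLOPS.
    return flops_per_cycle * ghz / 1000.0

# 32 SP FLOPs/cycle/unit (16 lanes x 2 for FMA) at the rumoured 1.3 GHz.
knl = peak_sp_tflops(cores=72, units_per_core=2, flops_per_unit_per_cycle=32, ghz=1.3)

# The over-estimate: 64 SP FLOPs/cycle/unit at 1 GHz.
mistaken = peak_sp_tflops(cores=72, units_per_core=2, flops_per_unit_per_cycle=64, ghz=1.0)

print(f"32 FLOPs/cycle/unit @ 1.3 GHz: {knl:.2f} TFLOPS")
print(f"64 FLOPs/cycle/unit @ 1.0 GHz: {mistaken:.2f} TFLOPS")
```

With 32 FLOPs per unit the total lands just under 6 TFLOPS, which matches the articles. With 64 per unit it overshoots to a bit over 9 TFLOPS even at 1 GHz.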
 
As I understand it, AVX2 (256-bit) can handle 32 flops per cycle:

[Image: AVX2-640x152.png]

http://www.extremetech.com/computing/136219-intels-haswell-is-an-unprecedented-threat-to-nvidia-amd

And Intel claimed 2x Haswell flops/cycle from AVX-512:

https://software.intel.com/en-us/blogs/2013/07/10/avx-512-instructions

Intel said:
The evolution to Intel AVX-512 contributes to our goal to grow peak FLOP/sec by 8X over 4 generations: 2X with AVX1.0 with the Sandy Bridge architecture over the prior SSE4.2, extended by Ivy Bridge architecture with 16-bit float and random number support, 2X with AVX2.0 and its fused multiply-add (FMA) in the Haswell architecture and then 2X more with Intel AVX-512.

Obviously the 1.3 GHz clock rate with only 64 FLOPs/cycle that you quote above makes more sense (and must be correct from the rumours I've seen), but I'm not seeing where I'm going wrong :(
 
When in doubt, trust in maths:
  • AVX-512 is 512 bits wide, and SP FLOPs are computed on 32-bit operands;
  • 512/32 = 16 operations/cycle, and by convention, one FMA counts as two FLOPs, since it contains a multiplication and an addition;
  • So you get 32 FLOPs/cycle.
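The same steps as a tiny sketch, for a single vector unit only. The function name and its parameters are hypothetical, chosen just to mirror the bullets above:

```python
def sp_flops_per_cycle_per_unit(vector_bits, operand_bits=32, fma=True):
    """Peak FLOPs per cycle for ONE vector unit of the given width."""
    lanes = vector_bits // operand_bits   # 512 / 32 = 16 SP lanes
    flops_per_lane = 2 if fma else 1      # by convention, one FMA counts as two FLOPs
    return lanes * flops_per_lane

avx512_unit = sp_flops_per_cycle_per_unit(512)   # one AVX-512 FMA unit
avx2_unit = sp_flops_per_cycle_per_unit(256)     # one 256-bit FMA unit

print("AVX-512 unit:", avx512_unit, "SP FLOPs/cycle")
print("256-bit unit:", avx2_unit, "SP FLOPs/cycle")
```

Note this is per unit, not per core: a core with more than one such unit multiplies the figure accordingly.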
 
You need to multiply by two because there are two FMA ports (so four flop/lane starting from Haswell). In Sandy Bridge and Ivy Bridge, Intel had one ADD and one MUL port, so two flop/lane. AMD Jaguar can also co-issue ADD+MUL per cycle (8 flop/cycle total per core).
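Roughly, the per-core numbers fall out as lanes times flop/lane. A sketch under a few assumptions: the vector widths in the table are mine (256-bit for the Intel AVX/AVX2 cores, 128-bit for Jaguar, which is what makes its quoted 8 flop/cycle work out), and the names are made up for illustration:

```python
SP_OPERAND_BITS = 32

# (vector width in bits, SP flop per lane per cycle across all FP ports)
CORES = {
    "Sandy/Ivy Bridge (1 ADD + 1 MUL port)": (256, 2),
    "Haswell (2 FMA ports)": (256, 4),
    "AVX-512 core with 2 FMA units": (512, 4),
    "AMD Jaguar (ADD + MUL co-issue)": (128, 2),
}

def sp_flops_per_cycle_per_core(vector_bits, flops_per_lane):
    """Peak SP FLOPs per cycle for a whole core: lanes x flop/lane."""
    lanes = vector_bits // SP_OPERAND_BITS
    return lanes * flops_per_lane

per_core = {name: sp_flops_per_cycle_per_core(*spec) for name, spec in CORES.items()}

for name, flops in per_core.items():
    print(f"{name}: {flops} SP FLOPs/cycle/core")
```

That gives 16 for Sandy/Ivy Bridge, 32 for Haswell, 64 for a two-unit AVX-512 core, and 8 for Jaguar.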
 
For the whole core, yes, but I was talking about a single AVX-512 unit.
 
Based purely on the table presented by Anand, the $1k Ivy Bridge-E with a 257mm^2 die would end up being more expensive per mm^2 than the $300 Skylake at 122mm^2.

Yes, sure, but the current Skylakes are mainstream 4-core desktop processors.
The 6+ core -E parts and Xeons are in a different category, with their own, indeed soon-to-be-discovered, records.
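For what it's worth, the per-area comparison in the quoted post can be checked with a quick sketch. It uses the rounded figures from the thread ($1k and 257 mm^2, $300 and 122 mm^2) and treats list price as the only input, which is an assumption: it says nothing about actual manufacturing cost or margins.

```python
def price_per_mm2(price_usd, die_area_mm2):
    """List price divided by die area, in dollars per square millimetre."""
    return price_usd / die_area_mm2

ivy_bridge_e = price_per_mm2(1000, 257)   # $1k part, 257 mm^2 die
skylake = price_per_mm2(300, 122)         # $300 part, 122 mm^2 die

print(f"Ivy Bridge-E: {ivy_bridge_e:.2f} $/mm^2")
print(f"Skylake:      {skylake:.2f} $/mm^2")
```

On those numbers the -E part comes out at roughly 3.9 $/mm^2 against roughly 2.5 $/mm^2 for Skylake.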
 