22 nm Larrabee

I think those new Atom cores could be a good building block for Intel. I've just answered Nick in another thread, and I think Intel could perhaps blend Haswell's dual FMA units into those Atom cores.
They would widen the data path on the Atom core to 256 bits to match the SIMD width (8-wide single precision), and have 2 FMA units. If I have it right, that is 16 DP FLOPS per cycle. They would introduce the same support for gather as in the LRBni ISA (so better than what Haswell does).
[…]
I expect Intel to level the playing field between its different products before possibly going wider than 256-bit.
I hadn't thought of 2x 256-bit per core in a future Xeon Phi, and the last sentence seems like a good reason to do so (the same theoretical throughput as the current Xeon Phi, but more similar to Haswell). 2x 512-bit per core for Knights Landing, on second thought, may give too high an FP throughput compared to the 3+ TFLOPS projection, at least if it uses Atom cores.
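For reference, here's the per-core peak arithmetic behind that equivalence as I see it (a quick sketch with my own assumed unit counts, nothing official):

Code:
/* DP flops per core per cycle: (vector_bits/64) lanes x FMA units x 2 ops. */
#include <stdio.h>

static int dp_flops_per_cycle(int vector_bits, int fma_units) {
    return (vector_bits / 64) * fma_units * 2;  /* FMA = mul + add */
}

int main(void) {
    printf("1x512-bit FMA: %d\n", dp_flops_per_cycle(512, 1));  /* 16, current Xeon Phi */
    printf("2x256-bit FMA: %d\n", dp_flops_per_cycle(256, 2));  /* 16, the idea above */
    return 0;
}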
 
The Intel Xeon processor roadmap for HPC lists Haswell Xeons at 500 GFLOPS DP peak.
My i7 4770 gets 90 GFLOPS single precision and 45 GFLOPS double using FlopsCL and Intel OpenCL.

So when they are talking 500 GFLOPS for Xeons, are they talking 10-core, quad-socket machines? A single $500 HD 7970 blows that away.
 
The Intel Xeon processor roadmap for HPC lists Haswell Xeons at 500 GFLOPS DP peak.
My i7 4770 gets 90 GFLOPS single precision and 45 GFLOPS double using FlopsCL and Intel OpenCL.

So when they are talking 500 GFLOPS for Xeons, are they talking 10-core, quad-socket machines? A single $500 HD 7970 blows that away.

Not comparable at all.

The 4770K flops are measured, which will be lower than peak, and based on your numbers it isn't even taking advantage of AVX2.

3.5 GHz x 4 cores x 2 FMA units x 8 DP FLOPS/unit/cycle = 224 GFLOPS DP

There will be a significant number of applications that will probably favor the 4770K in DP.
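For what it's worth, a minimal sketch of that peak calculation (assuming all 4 cores sustain 3.5 GHz, which a stock 4770K won't necessarily do under AVX2 load):

Code:
/* Peak DP GFLOPS = clock (GHz) x cores x DP flops per core per cycle. */
#include <stdio.h>

int main(void) {
    double ghz = 3.5;
    int cores = 4;
    int flops = 2 /* FMA units */ * 4 /* DP lanes per 256-bit unit */ * 2 /* mul+add */;
    printf("peak: %.0f GFLOPS DP\n", ghz * cores * flops);  /* 224 */
    return 0;
}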
 
So PR for the win: you quote the max throughput on one side, disregarding TDP, and perf-per-watt figures on the other that have nothing to do with the power cost of actually reaching 3 TFLOPS. Though it's not really a scheme, as neither number is a lie without further details, and of the two the more valuable data is perf per watt on a real workload (even if you don't get close to the chip's max throughput)... if I'm being clear enough.

Sounds reasonable, except that if they want to reach the "Exaflop system with 40MW in 2018" goal, it really needs to have that kind of efficiency by then. The same can probably be said of Nvidia, which is expecting similar DP FLOPS/watt in that timeframe.

They said that to make such a system they need to achieve 10x the DP FLOPS/watt efficiency of current accelerators, which is roughly 50 DP GFLOPS/watt.

I think so, because with their 14 GFLOPS per watt and 3 TFLOPS I get ~220 W for the chip, and that sounds unrealistic for 80 cores running on "all cylinders" @ 2.4 GHz when 60+ cores at 1 GHz already burn 300 W, even taking into account the impact of GDDR5 if they look at the whole platform.

Actually, there are parts that achieve more than that: the Xeon Phi 5110P with 1.011 TFLOPS @ 225 W, and the Xeon Phi 7120X/7120P with 1.208 TFLOPS @ 300 W. And while the 7100-series parts have a 1.33 GHz Turbo, the rated FLOPS are at the 1.238 GHz base clock.
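Those rated figures fall straight out of cores x base clock x 16 DP flops/cycle per core; a quick check (core counts and base clocks as I recall them from Intel's spec sheets, so verify before quoting):

Code:
/* Knights Corner rated DP GFLOPS: cores x base GHz x 16 flops/cycle. */
#include <stdio.h>

int main(void) {
    double p5110 = 60 * 1.053 * 16;  /* ~1011 GFLOPS @ 225 W */
    double p7120 = 61 * 1.238 * 16;  /* ~1208 GFLOPS @ 300 W */
    printf("5110P: %.0f GFLOPS, %.2f GFLOPS/W\n", p5110, p5110 / 225);  /* ~4.5 */
    printf("7120 : %.0f GFLOPS, %.2f GFLOPS/W\n", p7120, p7120 / 300);  /* ~4.0 */
    return 0;
}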

I expect them to go the "more cores versus frequency" route, so imagine a 120+ core part with frequency more or less similar to what we have today. Remember they'll be at 14nm with Knights Landing, meaning die-space freedom. I think the strategy of the 15W ULT Haswell parts, which spend 40 EUs (i.e. die size) to run at lower power, is one they'll be pursuing here too. :)
 
Not comparable at all.
The 4770K flops are measured, which will be lower than peak, and based on your numbers it isn't even taking advantage of AVX2.

"Intel OpenCL* SDK 2013 introduces performance improvements that include full code generation on the Intel Advanced Vector Extensions (Intel AVX and Intel AVX2). The Implicit CPU Vectorization Module generates the Intel AVX and Intel AVX2 code depending on what the target platform supports."
OpenCL has an fma function, so I see no reason an OpenCL kernel couldn't exploit AVX2 FMA.
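Something along these lines (a made-up illustrative kernel in OpenCL C, not FlopsCL's actual code) should let Intel's implicit vectorizer emit AVX2 FMA on a Haswell CPU device:

Code:
/* Hypothetical OpenCL C kernel: a dependent fma() chain per work-item. */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void fma_loop(__global double *out, double a, double b)
{
    size_t i = get_global_id(0);
    double acc = out[i];
    for (int k = 0; k < 1024; ++k)
        acc = fma(a, acc, b);  /* fused multiply-add: a*acc + b */
    out[i] = acc;
}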

Can you point me at a benchmark that I can run that will come up with this magical 224GFlops DP number?
 
Intel's MKL Linpack benchmark should be able to get ~177 GFLOPS from a Haswell 4770:

http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download

I get 68-83 GFLOPS DP for LINPACK xeon64.
An OpenCL program called MaxFlopsCl gives ~67 GFLOPS for DP multiplication.
It's not working for FMA; I will investigate.
With FMA I guess that figure would be doubled, i.e. ~140 GFLOPS DP.

In any case GPU DP flops >> CPU

I believe Kaveri is supposed to be ~1 TFLOPS single precision. If they built one with a 1/4 DP ratio (they won't), they could get ~256 GFLOPS DP.

---------------------
Intel(R) Optimized LINPACK Benchmark data

Current date/time: Sun Jul 07 12:58:18 2013

CPU frequency: 3.898 GHz
Number of CPUs: 1
Number of cores: 4
Number of threads: 8
.......
Performance Summary (GFlops)

Size LDA Align. Average Maximal
1000 1000 4 66.9419 70.5405
2000 2000 4 73.4502 74.0641
3000 3000 4 76.9867 77.1281
4000 4000 4 78.7903 79.2638
5000 5000 4 79.4511 79.8972
10000 10000 4 82.1355 83.2612
15000 15000 4 78.4566 79.3601
20000 20000 4 78.0148 78.2111
25000 25000 4 77.3551 77.4088
30000 30000 4 79.2122 79.2122
35000 35000 4 78.8542 78.8542
40000 40000 4 78.5863 78.5863

--------------------------
 
I don't have it right now to test, but I think disabling Hyperthreading may improve LINPACK performance.
 
I expect them to go the "more cores versus frequency" route, so imagine a 120+ core part with frequency more or less similar to what we have today.

I expect them to go for stronger cores instead, probably based on the new Atom's Silvermont core.


Developing micro-architecture specification for next generation high performance computing MIC architecture (Xeon Phi(TM)). Specific focus area will be either x86 Atom Core or Network Fabric.
http://jobs.intel.com/job/Hillsboro-MicroarchitectVerification-Engineer-Job-OR-97006/2592907/
 
I don't have it right now to test, but I think disabling Hyperthreading may improve LINPACK performance.

No, it doesn't. The benchmark should use thread affinity properly to avoid co-scheduling two threads on the same core. Even if it doesn't, most OS schedulers do this job quite well.
 
No, it doesn't. The benchmark should use thread affinity properly to avoid co-scheduling two threads on the same core. Even if it doesn't, most OS schedulers do this job quite well.
In a hyperthreaded environment, you have more available threads than cores. Thus, if you schedule one thread per hardware thread, you will maximize hardware thread utilization. However, enabling hyperthreading reduces the amount of resources available per thread, so in some cases you might get better results with hyperthreading disabled and fewer threads scheduled.

Thread affinity does nothing to fix the issue that hyperthreading reduces the amount of resources available per thread.
 
I don't have access to a Haswell system right now (I'm at home), so I tested with my home rig (a Sandy Bridge Core i7 2600).

One problem on my system with the default testing script in the latest version is the affinity setting. It's set to compact, but that's wrong; it should be set to scatter. Setting it to compact restricts all threads to two physical cores instead of all four. I don't know why the script is set up that way, though.

Restricting the run to 4 threads versus allowing 8 makes very little difference (93.6 GFLOPS vs 92.2 GFLOPS, ~86% of peak). Testing again with Task Manager open, I think that's because Intel's LINPACK now automatically restricts the number of threads when Hyperthreading is detected.
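For reference, that ~86% is measured against a peak of 3.4 GHz x 4 cores x 8 DP flops/cycle (256-bit AVX mul + add; Sandy Bridge has no FMA):

Code:
/* Sandy Bridge i7 2600 DP peak and measured LINPACK efficiency. */
#include <stdio.h>

int main(void) {
    double peak = 3.4 * 4 * 8;                          /* 108.8 GFLOPS DP */
    printf("efficiency: %.0f%%\n", 93.6 / peak * 100);  /* ~86% */
    return 0;
}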

For those with a Haswell to test, I suggest modifying the script (runme_xeon64.bat) and changing:

Code:
set KMP_AFFINITY=nowarnings,compact,granularity=fine
to
Code:
set KMP_AFFINITY=nowarnings,scatter,granularity=fine

About the worse efficiency of Haswell: I think it's natural that an FMA+FMA design will have worse efficiency relative to its peak than a MUL+ADD design, since the peak doubles but not every operation can be fused.
 
On top of that, if I compare the old Atom with 2-way SMT to the new one, it seems that OoOE has a greater impact on performance than 2-way SMT. With rendering out of the picture, my bet is that out-of-order execution as implemented in the new Atom is all they need now.
Intel claims that the die overhead for OoO is no greater than that for 2-way SMT, so I would assume it's less than for 4-way SMT.

Silvermont changes more than just adding OoOE, which is btw for the integer pipelines only and not the FP/SIMD or memory pipelines. I expect that it's precisely because it's for the ALUs only that the OoOE die impact is so small, but that's no good at all for something like Xeon Phi. You can't use Saltwell vs Silvermont performance as an evaluation of OoO vs SMT.

Saltwell's SMT is pretty different from what's in Xeon Phi anyway, not to mention that the typical workloads are very different. If you're in a situation where you can require at least two threads resident on each core to get good performance, then hiding latency and filling execution-unit occupancy with simple SMT is going to be a much better choice than doing so with OoOE. Xeon Phi falls under that category, unless Intel really does want to do a unified core like Nick thinks.
 
Silvermont changes more than just adding OoOE, which is btw for the integer pipelines only and not the FP/SIMD or memory pipelines. I expect that it's precisely because it's for the ALUs only that the OoOE die impact is so small, but that's no good at all for something like Xeon Phi. You can't use Saltwell vs Silvermont performance as an evaluation of OoO vs SMT.

Saltwell's SMT is pretty different from what's in Xeon Phi anyway, not to mention that the typical workloads are very different. If you're in a situation where you can require at least two threads resident on each core to get good performance, then hiding latency and filling execution-unit occupancy with simple SMT is going to be a much better choice than doing so with OoOE. Xeon Phi falls under that category, unless Intel really does want to do a unified core like Nick thinks.
OK. I think it is going to be interesting to see how Jaguar and Silvermont compare in that regard.
--------------------------------------
EDIT
By the way, what is your take on Intel's performance goals? It is quite a jump in both throughput and power efficiency.
To some extent it has me wondering about what David C stated: how about a beefy iGPU, the first Intel iGPU to support DP calculation?
---------
EDIT 2: One thing I find bothersome with all this talk about Intel widening its SIMD is that its iGPUs are actually highly threaded machines, yet their execution units/SIMD are pretty narrow (I would think 8 wide).
 
In a hyperthreaded environment, you have more available threads than cores. Thus, if you schedule one thread per hardware thread, you will maximize hardware thread utilization.

A decent LAPACK/LINPACK implementation for hyper-threaded processors only spawns as many threads as the number of cores, and sets thread affinity to make sure that each thread runs on a separate core.
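A minimal sketch of the idea (Linux/pthreads, assuming logical CPUs 0..N-1 map to distinct physical cores, which depends on the machine's topology):

Code:
/* One worker per physical core, each pinned so no two share a core. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NUM_CORES 4

static void *worker(void *arg) {
    long core = (long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);  /* pin this thread to one logical CPU */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    /* ... the actual LINPACK work would run here ... */
    printf("worker pinned to CPU %ld\n", core);
    return NULL;
}

int main(void) {
    pthread_t t[NUM_CORES];
    for (long i = 0; i < NUM_CORES; ++i)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NUM_CORES; ++i)
        pthread_join(t[i], NULL);
    return 0;
}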
 
A decent LAPACK/LINPACK implementation for hyper-threaded processors only spawns as many threads as the number of cores, and sets thread affinity to make sure that each thread runs on a separate core.

I think the point is that for some highly optimized code such as LINPACK, Hyperthreading is not useful, so you have to try not to use it.
 
I also wonder if Intel is planning on meeting the 6 GFLOPS/watt figure for Knights Corner. Maybe a refresh next year with a 225 W, 1.3 TFLOPS part?
From CPU World: "Intel plans to expand Xeon Phi 3100 and 5100 series in 2014."

Gennadiy Shvets (CPU World) said:
New 31xx cards will come with 12 GB of RAM, or twice as much of on-board RAM as current Xeon Phi 3120A and 3120P coprocessors. The rest of the specifications will not change.
[…]
Future Xeon Phi 51xx SKUs will double the size of RAM to 16 GB. Intel plans to keep the rest of the specs unchanged from 5110P and 5120D products.
They can't do the same thing with the 71xx series since they already have 16 GB, but I'm still thinking there might be a clock speed bump for those next year, possibly with all 62 cores enabled, to reach higher performance and perf/W ( ≥ 5 DP GF/W?).

Also, I have a question. Since Knights Landing is going to use DDR4, what sorts of memory bus widths and memory capacities can we expect to see on it?
 