22 nm Larrabee

It appears to be..not
What is calculated there?
The system reaches ~119 TFlop/s in the benchmark and has a theoretical peak of 181 TFlop/s. It consists of 140 nodes, each with 2 Xeon E5-2670 (8 cores, 2.6 GHz) and one MIC card (where 54 cores were active, btw.).
The CPUs have a peak of ~47 TFlop/s, which leaves 134 TFlop/s for 140 MIC cards. That's just short of a single TFlop/s per card. But the final version of the hardware will have 62 cores enabled, probably lifting the theoretical peak above 1 TFlop/s. During the HPL run the MIC cards probably provided 700 to 800 GFlop/s each.

The system's overall power efficiency was 119 TFlop/s / 101 kW = ~1.18 GFlops/W.
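
For anyone who wants to redo the arithmetic, here is a quick sketch. The 8 DP flops per cycle per Sandy Bridge core (one 4-wide AVX add plus one 4-wide AVX multiply) is my assumption, not something taken from the benchmark listing, and the variable names are made up for illustration.

```python
# Back-of-the-envelope check of the cluster numbers quoted above.
# Assumption: a Sandy Bridge core peaks at 8 DP flops/cycle
# (one 4-wide AVX add + one 4-wide AVX multiply per cycle).
NODES = 140
CPUS_PER_NODE = 2
CORES_PER_CPU = 8
CPU_GHZ = 2.6
CPU_DP_FLOPS_PER_CYCLE = 8  # assumed, see above

RPEAK_TFLOPS = 181.0  # theoretical peak of the whole system
RMAX_TFLOPS = 119.0   # measured HPL result
POWER_KW = 101.0

# GFlop/s for all CPUs, converted to TFlop/s.
cpu_peak_tflops = (NODES * CPUS_PER_NODE * CORES_PER_CPU
                   * CPU_GHZ * CPU_DP_FLOPS_PER_CYCLE) / 1000.0

# Whatever the CPUs do not account for must come from the MIC cards.
mic_peak_tflops = RPEAK_TFLOPS - cpu_peak_tflops
mic_peak_per_card_tflops = mic_peak_tflops / NODES

# GFlop/s per watt, using the measured result rather than the peak.
gflops_per_watt = (RMAX_TFLOPS * 1000.0) / (POWER_KW * 1000.0)

print(f"CPU peak:          {cpu_peak_tflops:.1f} TFlop/s")
print(f"MIC peak (total):  {mic_peak_tflops:.1f} TFlop/s")
print(f"MIC peak per card: {mic_peak_per_card_tflops:.2f} TFlop/s")
print(f"Efficiency:        {gflops_per_watt:.2f} GFlop/s per W")
```

Note that the per-card figure comes from subtracting the CPU share from the theoretical peak, not from dividing by the measured result.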
 

Eh... I don't know where he got the reference about the core count in Discovery...

It seems to be based on this calculation. Xeon Phi has a... rather low frequency?
 
Eh... I don't know where he got the reference about the core count in Discovery...
His calculation doesn't make any sense at all. He divided the theoretical peak by Rmax to figure out the peak performance per MIC card! I mean, WTH?!? He doesn't appear to have a clue at all.
It seems to be based on this calculation. Xeon Phi has a... rather low frequency?
What do you mean by "low frequency"? It appears to be ~1.1 GHz in the pre-production versions. What did you expect, 3 GHz?
The final versions will reportedly use 62 cores instead of the current 54 (+15%) and may or may not get a small speed bump too. The final peak performance per card could therefore reach ~1.2 TFlop/s (if Intel is not significantly sandbagging with the cards in that cluster). That's not too bad and shows that the 22 nm manufacturing can overcome some design choices to a certain degree.
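
A rough sketch of how those per-card numbers hang together. I'm assuming 16 DP flops per core per cycle (8 doubles in a 512-bit vector, with an FMA counted as 2 flops) and the ~1.1 GHz clock mentioned above; the function names are only for illustration.

```python
# Rough per-card peak for Knights Corner under stated assumptions.
# Assumed: 16 DP flops per core per cycle
# (8 doubles per 512-bit vector, FMA counted as 2 flops).
DP_FLOPS_PER_CYCLE = 16


def mic_peak_gflops(cores, ghz, flops_per_cycle=DP_FLOPS_PER_CYCLE):
    """Theoretical DP peak in GFlop/s: cores * clock * flops per cycle."""
    return cores * ghz * flops_per_cycle


def clock_needed_ghz(target_gflops, cores, flops_per_cycle=DP_FLOPS_PER_CYCLE):
    """Clock in GHz required to hit a given DP peak with a given core count."""
    return target_gflops / (cores * flops_per_cycle)


current = mic_peak_gflops(54, 1.1)           # pre-production card
final_same_clock = mic_peak_gflops(62, 1.1)  # 62 cores, no speed bump
core_gain = 62 / 54 - 1                      # relative gain from extra cores
clock_for_1200 = clock_needed_ghz(1200, 62)  # clock needed for ~1.2 TFlop/s

print(f"54 cores @ 1.1 GHz: {current:.0f} GFlop/s")
print(f"62 cores @ 1.1 GHz: {final_same_clock:.0f} GFlop/s")
print(f"Gain from cores:    {core_gain:.1%}")
print(f"Clock for 1.2 TF:   {clock_for_1200:.2f} GHz")
```

So under these assumptions the extra cores alone get the card past 1 TFlop/s, and ~1.2 TFlop/s would additionally need a clock bump of roughly 10%.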
 
Could this be a "complete" Larrabee with the fixed-function units for 3D rendering still intact, or a completely new chip?
 
I'm disappointed, but not surprised.

1 TFLOP isn't very high. An 8-core Haswell chip at 3.9 GHz should have the same peak throughput, but should achieve higher performance in practice thanks to out-of-order execution and more cache space per thread (thus higher cache hit rates). And then there's also TSX to synchronize between cores more efficiently and with fewer headaches for developers.

Hence the biggest threat to Intel's MICs is Intel's CPUs. Haswell-EP is even supposed to sport 16 cores.
 
It's possible that without a truly new core design, Larrabee can't implement all the features needed for it to be as power-efficient at a process node so far below what the P54 targeted.

It's also possible, at least in terms of peak FLOPS, that if a design targets FP throughput above all else and probably requires 200-250 W of power, the numbers chosen for the 8-core Haswell to match it are a touch optimistic.
 
It's possible that without a truly new core design, Larrabee can't implement all the features needed for it to be as power-efficient at a process node so far below what the P54 targeted.
Rumor is, it's more an Atom-like core. Nevertheless, I really doubt the scalar portion of each core is contributing significantly to the peak power consumption.
I would see the culprit more in the large amount of high-speed data transfers necessary for the design to work. Things like the shuffle unit between the register file and the vector ALU aren't going to help power efficiency either (the data needs to pass through it each time, and it makes the physical distances larger). Nvidia's shuffle instruction appears to be a nice compromise, fulfilling the same purpose in a much more efficient manner.
That's what I meant above, when I said that the 22 nm process is obviously able to overcome such design decisions to a certain degree. But it's hard for Intel to back off from some choices (as it would probably disrupt the instruction set).
 
I thought the limited instruction set support indicated the scalar instruction capability predated Atom, but the labelling of the tables indicates that this may only apply to 64-bit mode, for some reason.
 
It's also possible, at least in terms of peak FLOPS, that if a design targets FP throughput above all else and probably requires 200-250 W of power, the numbers chosen for the 8-core Haswell to match it are a touch optimistic.
The i7-3770K has a max turbo frequency of 3.9 GHz and is rated at 77 Watt TDP, which includes the GPU. So it seems quite realistic to me that some number of Haswell cores at a certain frequency can keep up with Knights Corner in peak FLOPS, while consuming a comparable amount of Watts.

And as we all know, peak performance / Watt is useless. Only the actual performance / Watt is relevant, and out-of-order execution and TSX could be a big plus.

Yes, out-of-order execution consumes power, but think of all the RAM accesses that can be avoided by having to run fewer threads and thus improving cache locality and hit rate. Also, the power consumption could be reduced with AVX-1024, whereby 1024-bit instructions are executed in four cycles on the 256-bit units, which then lowers the required front-end throughput and can reduce switching activity in the schedulers, approaching the efficiency of in-order execution.

And while that converges the CPU closer to the MIC, the only option for the MIC is to also converge closer to the CPU. Any new feature will lower the computing density, in an attempt to improve the effective performance.
 
The i7-3770K has a max turbo frequency of 3.9 GHz and is rated at 77 Watt TDP, which includes the GPU.
I wouldn't dispute that at least a subset of an 8-core chip could reach that high a turbo, but that's not a guaranteed sustained clock and Haswell is from its description going to have a decent amount of additional hardware.
The max CPU turbo can be reduced if load and thermal conditions require it.
We're comparing a chip with a max sustained clock and the hoped-for-but-no-promises max clock of another.


Yes, out-of-order execution consumes power, but think of all the RAM accesses that can be avoided by having to run fewer threads and thus improving cache locality and hit rate.
Pushing anything to ~4 GHz is going to burn power.
Also, we need better comparisons with Larrabee's large shared L2, which can soak up a lot of RAM accesses.

Also, the power consumption could be reduced with AVX-1024, whereby 1024-bit instructions are executed in four cycles on the 256-bit units, which then lowers the required front-end throughput and can reduce switching activity in the schedulers, approaching the efficiency of in-order execution.
That's great for whatever future chip past Haswell that decides to try this.
 
Rumor is, it's more an Atom-like core.

FWIW, they said that single thread performance is far away from the Core variants, but will close somewhat when they start to use Atom cores. Isn't that enough of saying that it's not an Atom-like core?
 
FWIW, they said that single thread performance is far away from the Core variants, but will close somewhat when they start to use Atom cores. Isn't that enough of saying that it's not an Atom-like core?
Concluding that it's an Atom-like core was probably a jump too far. So maybe it will be true for the next version then.
 
1 TFLOP isn't very high. An 8-core Haswell chip at 3.9 GHz should have the same peak throughput,
8 (cores) * 3.9 GHz * 2 ops/cycle * 4 (DP SIMD lanes) and I end up with ~250 GFlops.
If you assume Intel will double the execution units for FMA, then we'll end up with ~500 GFlops for doubles. Am I missing something?
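
Spelled out as a quick sketch (the function name and parameters are only for illustration, and the "doubling for FMA" case is just the assumption stated above, modelled as 4 ops per lane per cycle instead of 2):

```python
def peak_gflops(cores, ghz, ops_per_cycle, simd_lanes):
    """Theoretical peak in GFlop/s for a simple cores * clock * width model."""
    return cores * ghz * ops_per_cycle * simd_lanes


# 8 cores at 3.9 GHz, 4 doubles per 256-bit vector.
# Without FMA: one add and one multiply per lane per cycle.
dp_no_fma = peak_gflops(cores=8, ghz=3.9, ops_per_cycle=2, simd_lanes=4)

# Assumed doubling for FMA: twice the ops per lane per cycle.
dp_with_fma = peak_gflops(cores=8, ghz=3.9, ops_per_cycle=4, simd_lanes=4)

print(f"DP peak without FMA: {dp_no_fma:.1f} GFlop/s")
print(f"DP peak with FMA:    {dp_with_fma:.1f} GFlop/s")
```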

If I didn't miscount, then I hardly see any reason to be disappointed; it's the fastest processor for double-precision calculations (7970 is ~900 GFlops and GK110 ~500 GFlops?).

I also don't think in-order is a disadvantage on MIC. Forsyth said in his presentation about Larrabee that you can hide all latency with 3 threads running, while you have 4, and if one is waiting for memory, resources are freed for the remaining threads. I think, in combination with more registers, both CPUs will end up with similar utilization
(the smaller cache is probably compensated by faster memory glued on the MIC board).
Haswell will probably have some percentage more, but MIC offers more brute-force power.

I'd also sooner believe in one 16-core Haswell @ 2 GHz than an 8-core @ ~4 GHz. But that Xeon EX version will probably be released way later than the desktop Haswells :(
 
The i7-3770K has a max turbo frequency of 3.9 GHz and is rated at 77 Watt TDP, which includes the GPU. So it seems quite realistic to me that some number of Haswell cores at a certain frequency can keep up with Knights Corner in peak FLOPS, while consuming a comparable amount of Watts.

And as we all know, peak performance / Watt is useless. Only the actual performance / Watt is relevant, and out-of-order execution and TSX could be a big plus.

Yes, out-of-order execution consumes power, but think of all the RAM accesses that can be avoided by having to run fewer threads and thus improving cache locality and hit rate. Also, the power consumption could be reduced with AVX-1024, whereby 1024-bit instructions are executed in four cycles on the 256-bit units, which then lowers the required front-end throughput and can reduce switching activity in the schedulers, approaching the efficiency of in-order execution.

And while that converges the CPU closer to the MIC, the only option for the MIC is to also converge closer to the CPU. Any new feature will lower the computing density, in an attempt to improve the effective performance.
Isn't that the whole point of using VEX on the Intel Core series, to adopt the Larrabee/MIC vector units? It slowly progresses in that direction:
Sandy Bridge: 256-bit VEX float instructions
Ivy Bridge: 16-bit float conversion (F16C)
Haswell: 256-bit int + gather
Skylake/Skymont will probably have 512-bit and by then cover the whole LRB/MIC instruction set (besides maybe "lane masking", but maybe that's the Skymont feature?)


Regarding the 77 W on Ivy Bridge, I have a hunch.
I think it clearly shows that with that heterogeneous CPU/GPU die, you end up wasting a lot of die area and compute power. And there are rumors that the Skylake GPU might use LRB/MIC cores as ALUs. Wouldn't it be most power-efficient to have two front ends (x86 and rasterizer/vertex assembly) sharing the same compute units? Even if those were somewhat less efficient than dedicated GPU units, you could utilize most of the die to do the work, whether it's some 3DMark run that keeps the CPU idle or Cinebench that keeps the GPU front end idle.

Is my idea weird or am I Captain Obvious?
 
8 (cores) * 3.9 GHz * 2 ops/cycle * 4 (DP SIMD lanes) and I end up with ~250 GFlops.
If you assume Intel will double the execution units for FMA, then we'll end up with ~500 GFlops for doubles. Am I missing something?

Yeah, you are right. He's thinking of SP, while Knights Corner's throughput is in DP. And I doubt we'll see anything higher than 3.5 GHz for an 8-core Haswell, never mind the rumored 12/16-core variants.
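
To make the SP vs. DP point concrete, a small sketch. It assumes two 256-bit FMA units per Haswell core, which is an assumption for this comparison rather than something stated in the thread, and the helper name is hypothetical.

```python
# Assumptions (not from the thread): 256-bit vectors, two FMA units per core,
# each FMA counted as 2 flops per lane.
VECTOR_BITS = 256
FMA_UNITS_PER_CORE = 2
FLOPS_PER_FMA = 2


def haswell_peak_gflops(cores, ghz, element_bits):
    """Peak GFlop/s for a given element width (32 = SP, 64 = DP)."""
    lanes = VECTOR_BITS // element_bits
    return cores * ghz * lanes * FMA_UNITS_PER_CORE * FLOPS_PER_FMA


sp_at_39 = haswell_peak_gflops(8, 3.9, 32)  # the "1 TFLOP" figure, in SP
dp_at_39 = haswell_peak_gflops(8, 3.9, 64)  # same chip in DP
dp_at_35 = haswell_peak_gflops(8, 3.5, 64)  # DP at a more modest 3.5 GHz

print(f"8 cores @ 3.9 GHz, SP: {sp_at_39:.0f} GFlop/s")
print(f"8 cores @ 3.9 GHz, DP: {dp_at_39:.0f} GFlop/s")
print(f"8 cores @ 3.5 GHz, DP: {dp_at_35:.0f} GFlop/s")
```

Under these assumptions the 8-core chip only gets near 1 TFlop/s in single precision; in double precision it sits at about half of that.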
 