OpenGL guy
Veteran
Quote: "Nowhere in your linked article does it state 1 TFLOPS DP. Also when ECC is enabled performance drops."

ECC does not affect compute performance on Tahiti.
Quote: "Nowhere in your linked article does it state 1 TFLOPS DP."

The presentation in the background states it's 1 TFLOP DP.
rapso said: ECC should just increase the latency slightly afaik, but that's what GPUs are supposed to hide, so it could be slower, but it shouldn't be visible in normal cases.

Quote: "Nowhere in your linked article does it state 1 TFLOPS DP."

Look at the slide seen on the wall!

Quote: "Also when ECC is enabled performance drops."

Memory performance drops; peak throughput does not.
Quote: "Guys, one question if you don't mind. So, what is the purpose of this thingie? Only supercomputers, right?"

I guess that's it; it would explain why it's called "Xeon" (nothing like a gaming or multimedia device).

Quote: "We don't expect Intel can compete with AMD and NV, regarding drivers, DirectX support, etc. My point being is that they will never offer a gaming card based on this."

No DX for sure. But who knows, maybe there will be some MIC device for consumers, kind of like there was a CELL on (my lovely) WinFast PXVC1000.
rapso said: like I said the page before, I hope skylake will have all the LRB juice in it. 512bit SIMD on 4 consumer cores with ~4GHz might end up with 1TFlop SP.

Putting wide SIMD in a 3 GHz, quad-issue OoO core will defeat the entire purpose of SIMD. You want a simple in-order core to go with the vector units.
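rapso's "1TFlop SP" figure can be sanity-checked with simple peak-throughput arithmetic. This is a sketch; the number of FMA units per core is an assumption (the thread leaves it open), and the figure only works out if you count two FMA units per core:

```python
# Peak single-precision throughput: cores * clock (GHz) * SIMD lanes * flops per lane per cycle.
def peak_gflops(cores, ghz, lanes, flops_per_lane_per_cycle):
    return cores * ghz * lanes * flops_per_lane_per_cycle

lanes_sp = 512 // 32  # 16 floats fit in a 512-bit vector

# One FMA unit per core: each lane retires 2 flops (mul + add) per cycle.
one_fma = peak_gflops(4, 4.0, lanes_sp, 2)   # 512 GFLOPS
# Two FMA units per core would double that, landing at rapso's ~1 TFLOP.
two_fma = peak_gflops(4, 4.0, lanes_sp, 4)   # 1024 GFLOPS
print(one_fma, two_fma)
```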
Quote: "Putting wide SIMD in a 3GHz, quad issue OoO core will defeat the entire purpose of SIMD. You want a simple in order core to go with the vector units."

4x float or 8x float SIMD is OK, but a "wide" 16x float SIMD is defeating its purpose? I cannot really come up with any idea why you might think that; can you elaborate?
Quote: "8 (cores) * 3.9 GigaCycles * 2 ops/cycle * 4 (double SIMD), and I end up with 249 GFlops. If you assume Intel will double the execution units for FMA, then we'll end up with ~500 GFlops for doubles. Am I missing something?"

Indeed, Haswell is believed to have two FMA units per core. Current architectures have two floating-point execution ports: one MUL and one ADD. Any configuration with just one FMA unit would lead to lower performance for legacy code, or port contention for new code, also resulting in lower throughput. So it should approach 500 GFLOPS DP for an 8-core.
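The back-of-the-envelope numbers in that exchange can be reproduced directly (a sketch using only the figures quoted in the thread: 8 cores, 3.9 GHz, 256-bit AVX holding 4 doubles):

```python
# Peak double-precision throughput: cores * clock (GHz) * SIMD lanes * flops per lane per cycle.
def peak_gflops(cores, ghz, lanes, flops_per_lane_per_cycle):
    return cores * ghz * lanes * flops_per_lane_per_cycle

# One FMA unit per core (2 flops per lane per cycle): ~249.6 GFLOPS, the "249 GFlops" above.
one_fma = peak_gflops(8, 3.9, 4, 2)
# Two FMA units per core: ~499.2 GFLOPS, the "~500 GFlops for doubles".
two_fma = peak_gflops(8, 3.9, 4, 4)
print(one_fma, two_fma)
```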
Quote: "Putting wide SIMD in a 3GHz, quad issue OoO core will defeat the entire purpose of SIMD. You want a simple in order core to go with the vector units."

As far as I know, the "purpose of SIMD" and out-of-order execution are completely orthogonal. We have GPUs with single-issue SIMD, dual-issue SIMD, and VLIW SIMD, and we have CPUs with in-order or out-of-order SIMD. And the choices seem unrelated to the width of the vectors. It would actually make sense for wider vectors to be paired with more dynamic scheduling, since it would increase efficiency at a lower relative cost.
Quote: "4x float or 8x float SIMD is ok, but a "wide" 16x float SIMD is defeating it's purpose?"

If you are running a highly data-parallel application, the superscalar hardware will just be idling ... if you are running scalar code, the SIMD will just be idling.
Quote: "If you are running a highly data parallel application the superscalar hardware will just be idling ... if you are running scalar code the SIMD will just be idling."

Why would the superscalar HW be idle with SIMD code? The throughput should be as high as with scalar code, but on the other hand you'll have way more bandwidth demand on the caches and main memory; chances are higher that data is not available, and the order of data reads by the memory controller is not strictly related to the request order.
Quote: "With a heterogenous solution, say Ivy Bridge and MIC, you can have a mix of SIMD code and scalar code without half your hardware idling ..."

It's actually well known that it's the other way around: it's nearly impossible to have a producer and a consumer and keep both equally busy on heterogeneous solutions (usually the producer ends up idle).
Can you imagine how big your hypothetical 25-core Haswell would be? I doubt it could even be manufactured.
Well, you could fit ~10 Ivy Bridge cores in 160mm^2 without the L3 and GPU. So without the L3 and GPU it could probably be manufactured. Whether it should be, on the other hand...
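Taking the thread's numbers at face value (a rough extrapolation, not real die measurements): ~10 Ivy Bridge cores in 160mm^2 implies ~16mm^2 per core, so a 25-core part would need ~400mm^2 for the cores alone, before any L3, GPU, or uncore:

```python
# Rough die-area extrapolation from the figures quoted in the thread.
area_per_core_mm2 = 160 / 10           # ~16 mm^2 per Ivy Bridge core (no L3/GPU)
cores = 25                             # the hypothetical 25-core Haswell
core_area_mm2 = cores * area_per_core_mm2  # area before L3, GPU, and uncore
print(core_area_mm2)
```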
Quote: "That's because the L3 cache is not just a performance optimization but actually plays a central role in the function of the processor, and you can't omit it."

Sure you can. Whether it is a wise thing to do is another matter, but you can certainly make a multicore processor without L3.
Quote: "The processor rapso described would also need much more space for the very wide memory controllers."

I included the controllers in my estimation.