Old 22-Jun-2012, 00:31   #1001
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,291
Default

Quote:
Originally Posted by A1xLLcqAgt0qc2RyMz0y View Post
Nowhere in your linked article does it state 1 TFLOPS DP.

Also when ECC is enabled performance drops.
ECC does not affect compute performance on Tahiti.
__________________
I speak only for myself.
OpenGL guy is offline   Reply With Quote
Old 22-Jun-2012, 02:55   #1002
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,079
Default

Quote:
Originally Posted by A1xLLcqAgt0qc2RyMz0y View Post
Nowhere in your linked article does it state 1 TFLOPS DP.

Also when ECC is enabled performance drops.
Not even the memory performance?
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
rpg.314 is offline   Reply With Quote
Old 22-Jun-2012, 06:24   #1003
rapso
Senior Member
 
Join Date: May 2008
Posts: 80
Default

Quote:
Originally Posted by A1xLLcqAgt0qc2RyMz0y View Post
Nowhere in your linked article does it state 1 TFLOPS DP.
.
the presentation in the background states it's 1TFlop DP.

@rpg.314
ECC should just increase the latency slightly afaik, but that's what GPUs are supposed to hide, so it could be slower, but it shouldn't be visible in normal cases.

but AMD just released some 1TFlop+ GPU also:
http://www.anandtech.com/show/6025/r...-up-to-gtx-680

I still think the Phi is nothing to be disappointed about; I'd love to have some x86 CPUs with that power and a nice instruction set.
rapso is offline   Reply With Quote
Old 22-Jun-2012, 07:38   #1004
RecessionCone
Member
 
Join Date: Feb 2010
Posts: 173
Default

Quote:
Originally Posted by rapso
ECC should just increase the latency slightly afaik, but that's what GPUs are supposed to hide, so it could be slower, but it shouldn't be visible in normal cases.
GPUs hide latency by using bandwidth. ECC, at least on Nvidia GPUs, reduces bandwidth, so it also reduces the ability to hide latency. This causes visible performance impact on GPUs.
RecessionCone is offline   Reply With Quote
Old 22-Jun-2012, 10:34   #1005
Gipsel
Senior Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 1,017
Default

Quote:
Originally Posted by A1xLLcqAgt0qc2RyMz0y View Post
Nowhere in your linked article does it state 1 TFLOPS DP.
Look at the slide seen on the wall!
Quote:
Originally Posted by A1xLLcqAgt0qc2RyMz0y View Post
Also when ECC is enabled performance drops.
Memory performance drops, peak throughput does not.

Edit: There was already a new page.
Gipsel is offline   Reply With Quote
Old 22-Jun-2012, 11:17   #1006
UniversalTruth
Senior Member
 
Join Date: Sep 2010
Posts: 1,055
Default

Guys, one question if you don't mind.

So, what is the purpose of this thingie? Only supercomputers, right?

We don't expect Intel can compete with AMD and NV regarding drivers, DirectX support, etc. My point is that they will never offer a gaming card based on this.
UniversalTruth is offline   Reply With Quote
Old 22-Jun-2012, 12:11   #1007
rapso
Senior Member
 
Join Date: May 2008
Posts: 80
Default

Quote:
Originally Posted by UniversalTruth View Post
Guys, one question if you don't mind.

So, what is the purpose of this thingie? Only supercomputers, right?
I guess that's it; it would explain why it's called "Xeon" (nothing like a gaming or multimedia device).
I wouldn't be surprised if they ripped all the texture units out of the MIC, so that not even OpenCL might work properly (I mean, technically yes, but texture sampling, usually used as an optimization, would slow it down, as it would be emulated).
If history repeats, it will end like the Itanium.

Quote:
We don't expect Intel can compete with AMD and NV regarding drivers, DirectX support, etc. My point is that they will never offer a gaming card based on this.
No DX for sure. But who knows, maybe there will be some MIC device for consumers, kind of like there was a CELL on (my lovely) winfast pxvc1000.
Like I said on the previous page, I hope Skylake will have all the LRB juice in it. 512-bit SIMD on 4 consumer cores at ~4GHz might end up with 1TFlop SP.
rapso is offline   Reply With Quote
Old 25-Jun-2012, 01:08   #1008
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,079
Default

Quote:
Originally Posted by rapso View Post
Like I said on the previous page, I hope Skylake will have all the LRB juice in it. 512-bit SIMD on 4 consumer cores at ~4GHz might end up with 1TFlop SP.
Putting wide SIMD in a 3GHz, quad issue OoO core will defeat the entire purpose of SIMD. You want a simple in order core to go with the vector units.
rpg.314 is offline   Reply With Quote
Old 25-Jun-2012, 09:24   #1009
rapso
Senior Member
 
Join Date: May 2008
Posts: 80
Default

Quote:
Originally Posted by rpg.314 View Post
Putting wide SIMD in a 3GHz, quad issue OoO core will defeat the entire purpose of SIMD. You want a simple in order core to go with the vector units.
4x float or 8x float SIMD is ok, but a "wide" 16x float SIMD is defeating its purpose? I cannot really come up with any idea why you might think that; can you elaborate?
rapso is offline   Reply With Quote
Old 25-Jun-2012, 18:37   #1010
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by rapso View Post
8(core) * 3.9GigaCycles *2op/cycle * 4(Double SIMD) and I end up with 249GFlops.
if you assume intel will double the execution units for FMA, then we'll end up with ~500GFlops for doubles. am I missing something?
Indeed Haswell is believed to have two FMA units per core. Current architectures have two floating-point execution ports; one MUL and one ADD. Any configuration with just one FMA unit would lead to lower performance for legacy code, or port contention for new code, also resulting in lower throughput. So it should approach 500 GFLOPS DP for an 8-core.

But I was blatantly wrong to think that Knights Corner's performance was for SP. Intel previously announced Larrabee to reach 1 TFLOP as well. That was for SP, hence the confusion.
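
The peak-throughput arithmetic in the quote above can be sketched in a few lines. This is only an illustration, not a vendor formula: the function name `peak_gflops` and its parameters are made up here, the 8 cores and 3.9 GHz come from the quoted post, and counting each FMA as two flops is the usual convention assumed for the dual-FMA case.

```python
def peak_gflops(cores, ghz, flops_per_lane_per_cycle, simd_lanes):
    """Theoretical peak in GFLOPS: cores * clock (GHz) * flops per lane per cycle * lanes."""
    return cores * ghz * flops_per_lane_per_cycle * simd_lanes

# One MUL port plus one ADD port, 4 doubles per 256-bit vector (figures from the quote).
mul_add = peak_gflops(cores=8, ghz=3.9, flops_per_lane_per_cycle=2, simd_lanes=4)

# Assumed dual-FMA case: two FMA units, each FMA counted as 2 flops.
dual_fma = peak_gflops(cores=8, ghz=3.9, flops_per_lane_per_cycle=4, simd_lanes=4)

print(f"{mul_add:.1f} GFLOPS DP, {dual_fma:.1f} GFLOPS DP")  # 249.6 GFLOPS DP, 499.2 GFLOPS DP
```

These are peak numbers only; sustained throughput depends on memory bandwidth and on how well the code vectorizes.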
Nick is offline   Reply With Quote
Old 25-Jun-2012, 19:09   #1011
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by rpg.314 View Post
Putting wide SIMD in a 3GHz, quad issue OoO core will defeat the entire purpose of SIMD. You want a simple in order core to go with the vector units.
As far as I know the "purpose of SIMD" and out-of-order execution are completely orthogonal. We have GPUs with single-issue SIMD, dual-issue SIMD, VLIW SIMD, and we have CPUs with in-order or out-of-order SIMD. And the choices seem unrelated to the width of the vectors. It would actually make sense for wider vectors to be paired with more dynamic scheduling since it would increase efficiency at a lower relative cost.
Nick is offline   Reply With Quote
Old 25-Jun-2012, 20:11   #1012
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,328
Default

Quote:
Originally Posted by rapso View Post
4x float or 8x float SIMD is ok, but a "wide" 16x float SIMD is defeating its purpose?
If you are running a highly data parallel application the superscalar hardware will just be idling ... if you are running scalar code the SIMD will just be idling.

With a heterogenous solution, say Ivy Bridge and MIC, you can have a mix of SIMD code and scalar code without half your hardware idling ...
__________________
Cinematic is the new streamlined.
MfA is offline   Reply With Quote
Old 26-Jun-2012, 00:08   #1013
rapso
Senior Member
 
Join Date: May 2008
Posts: 80
Default

Quote:
Originally Posted by MfA View Post
If you are running a highly data parallel application the superscalar hardware will just be idling ... if you are running scalar code the SIMD will just be idling.
Why would the superscalar HW be idle with SIMD code? The throughput should be as high as with scalar code, but on the other hand you'll have far more bandwidth demand on the caches and main memory, chances are higher that data is not available, and the order of data reads by the memory controller is not strictly related to the request order.
It's still an x86/CISC instruction set: your SIMD code does not work on registers only, you address operands straight from memory. I think there is far more for the OoO units to work on, especially with a lot of cores that share/thrash the same memory controller.
rapso is offline   Reply With Quote
Old 26-Jun-2012, 00:09   #1014
rapso
Senior Member
 
Join Date: May 2008
Posts: 80
Default

(I've split my reply into two, for better quoting, and this one could be kind of off-topic?)
Quote:
With a heterogenous solution, say Ivy Bridge and MIC, you can have a mix of SIMD code and scalar code without half your hardware idling ...
It's actually well known that it's the other way around: it's nearly impossible to have a producer and a consumer and keep both equally busy on heterogenous solutions (usually the producer ends up idle).

Sadly I cannot dig out any MIC benchmark, so allow me to project this onto a heterogenous/homogenous OpenCL benchmark (showing Sandra 2012):
http://www.tomshardware.com/reviews/...0k,3181-6.html

you see
1. ("homogenous") Sandy Bridge has 75|165 MPix/s of compute power.
2. (heterogenous) running on the Ivy Bridge GPU gives 10|251 MPix/s of compute power, with the GPU taking about 75% as much die space as the CPU cores (according to: http://www.chip-architect.com/news/2...es_Sandys.html )

If you used the same space to add 3 cores (75%, although poorly balanced), you'd end up with an estimated 131|288 MPix/s.
While I agree that this is not a fair comparison, as the GPU also spends some minor space on fixed-function HW, I'd argue that the CPU code is not nicely vectorized and optimized for CPU SIMD either. It's also a comparison of pure compute power that does not reflect what you might gain with smarter algorithms that would be a win for the CPU but bite the GPU.
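
The 131|288 estimate above is plain linear scaling, which can be written out as a small sketch. The helper name `scale_with_cores` is made up for illustration, the 4-core baseline is inferred from "3 cores (75%)", and perfect scaling with core count is an optimistic assumption rather than a measured result.

```python
def scale_with_cores(score, base_cores, extra_cores):
    """Scale a benchmark score linearly with core count (optimistic: assumes perfect scaling)."""
    return score * (base_cores + extra_cores) / base_cores

sandy_bridge = (75, 165)  # the two Sandra MPix/s figures quoted above

# Spend the GPU's die area on 3 more cores instead: 4 cores become 7.
estimate = tuple(scale_with_cores(s, base_cores=4, extra_cores=3) for s in sandy_bridge)

print(estimate)  # (131.25, 288.75)
```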

Below that you also see LuxMark, which is more of a real-world test. The CPU versions seem to be faster. If OpenCL ran on the GPU and CPU at the same time, it would lead to the best results in that case, but there is no advantage. You could rather build a homogenous system running natively optimized binaries, and it would probably be even more of a win.

Yes, I know about the 7970 (or Kepler), with 10x the memory bandwidth and 250W TDP instead of the ~40W TDP those 4 Ivy Bridge cores take, and Sandra runs 2.5|8 times faster ( http://www.tomshardware.com/reviews/...k,3193-12.html ), but it doesn't look more efficient; it's rather linear scaling with the extended power/bandwidth limits, AND there is always the CPU running and producing, which could do the work itself if it had the same (high) limits.

I also use heterogenous systems rather than homogenous ones, but simply because I can get a 7950, overclock it to 1.2GHz and pay ~300 euro, while I would have just a 6-core 3930K for ~500 euro. Buying a homogenous system with enough power is irrational from a price point of view. If I had the free choice between a 7970 and 250W of Haswell cores (250W/40W*4Cores->25Cores -> ~1.5DP/3.0SP TFlop/s and 350GB/s mem), I would choose the second.
Sadly the Xeon Phi seems to be also out of question because of the probably high price tag.
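
The 250W estimate above can be reproduced with a rough sketch. It is only an illustration: the helper names are made up, the 3.75 GHz clock and the 16 DP / 32 SP flops per cycle per core (two 256-bit FMA units) are assumptions that are not spelled out in this post, and linear scaling of power with core count is exactly the simplification being made.

```python
def cores_in_budget(budget_w, ref_w, ref_cores):
    """Cores that fit a power budget, assuming power scales linearly with core count."""
    return budget_w / ref_w * ref_cores

def peak_tflops(cores, ghz, flops_per_cycle):
    """Theoretical peak in TFLOPS for a given per-core flops-per-cycle figure."""
    return cores * ghz * flops_per_cycle / 1000.0

cores = cores_in_budget(budget_w=250, ref_w=40, ref_cores=4)  # 25.0 cores
dp = peak_tflops(cores, ghz=3.75, flops_per_cycle=16)         # assumed: 2 FMA x 4 doubles x 2 flops
sp = peak_tflops(cores, ghz=3.75, flops_per_cycle=32)         # assumed: 2 FMA x 8 floats x 2 flops

print(cores, dp, sp)  # 25.0 1.5 3.0
```

Changing `ref_w` shows how sensitive the result is to the per-core power assumption: doubling it halves the core count and the peak.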
rapso is offline   Reply With Quote
Old 26-Jun-2012, 15:36   #1015
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,636
Default

Wow, you think a Haswell core will use 10W at 3.75GHz? I really doubt that.. 3.7GHz base frequency Ivy Bridge Xeon is 87W TDP, that is of course with GPU fused off. I expect the number to be closer to 20W per core than 10W.

You might think that because i7-3770K (3.5GHz base speed) is rated for 77W TDP that the cores only use 40W because the IGP can use such a big chunk of that. Probably in practice the IGP can't use anywhere close to its full thermal budget when all four cores are running full-tilt. Sure, the other integrated stuff uses some power, but at least some of that has to scale with core count (L3, not just capacity but complexity, memory channels to get your huge bandwidth, etc). Of course it's kind of moot since I doubt Haswell would scale to 25 cores anyway.

And part of that big price tag is justified because these chips would be huge. Tahiti and Kepler are around 2x larger than IB, can you imagine how big your hypothetical 25 core Haswell would be? I doubt it could even be manufactured..
Exophase is offline   Reply With Quote
Old 26-Jun-2012, 18:22   #1016
whitetiger
Junior Member
 
Join Date: Feb 2012
Posts: 57
Default

Quote:
Originally Posted by UniversalTruth View Post
Guys, one question if you don't mind.

So, what is the purpose of this thingie? Only supercomputers, right?

We don't expect Intel can compete with AMD and NV regarding drivers, DirectX support, etc. My point is that they will never offer a gaming card based on this.
The purpose of Larrabee is to deny profits to Intel's competitors - primarily NV, but also AMD
- there's really not a big enough market for these things for the amount of effort Intel are putting in
- the potential profits are tiny compared to Intel's CPU profits
- whereas NV is piggy-backing it on the GPU designs, and for NV this area could represent a reasonable profit opportunity.
- but for Intel, really, Haswell, or extended Haswell could do just as well

So, the whole purpose is to stop NV getting a foothold at the top-end
- which would allow them more profits, with which to hire more, better engineers...
whitetiger is offline   Reply With Quote
Old 26-Jun-2012, 19:19   #1017
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,636
Default

I don't think it can be denied that Intel originally wanted Larrabee to be a real high end graphics card, and who knows what they had their sights set on with this.. perhaps even consoles.

Designing it specifically for the HPC market may not be a sensible investment, but salvaging what they can from an already established design surely is. Even if there's additional design cost in migrating it to 22nm and scaling it up a bit.
Exophase is offline   Reply With Quote
Old 27-Jun-2012, 00:23   #1018
ninelven
PM
 
Join Date: Dec 2002
Posts: 1,381
Default

Quote:
can you imagine how big your hypothetical 25 core Haswell would be? I doubt it could even be manufactured.
Well, you could fit ~10 Ivy Bridge cores in 160mm^2 without the L3 and GPU. So without the L3 and GPU it could probably be manufactured. Whether it should be on the other hand...
__________________
//
ninelven is offline   Reply With Quote
Old 27-Jun-2012, 00:42   #1019
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,636
Default

Quote:
Originally Posted by ninelven View Post
Well, you could fit ~10 Ivy Bridge cores in 160mm^2 without the L3 and GPU. So without the L3 and GPU it could probably be manufactured. Whether it should be on the other hand...
Have you ever noticed how even the really low end Celerons still have L3 cache? That's because the L3 cache is not just a performance optimization but actually plays a central role in the function of the processor, and you can't omit it. The L3 is what separates the cores from the memory interface, alleviating the difficult problem of having to interface all of those separately (of course, the L3 itself needs to have a very wide ring bus interface made up of several slices). It also plays the central role in coherency (and will play the central role in transactional memory). And since the L3 is inclusive it needs to be at least as large as the L1 + L2, or in other words 256KB + 32KB + 32KB per core. In practice they probably work with larger granularities than this, pushing the minimum to at least 512KB per core. I don't think I know of a processor with slices smaller than 1MB; the minimum imposed by the architecture may be higher.
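
The lower bound described above can be checked with a couple of lines. This is just a sketch of the arithmetic: the function names are invented, and rounding up to a power of two is an assumption standing in for the "larger granularities" mentioned, not a documented Ivy Bridge constraint.

```python
def min_inclusive_l3_kb(l2_kb=256, l1d_kb=32, l1i_kb=32):
    """An inclusive L3 must hold at least a copy of everything in a core's L1 and L2."""
    return l2_kb + l1d_kb + l1i_kb

def round_up_pow2(n):
    """Smallest power of two >= n (hypothetical slice granularity)."""
    return 1 << (n - 1).bit_length()

per_core = min_inclusive_l3_kb()      # 320 KB strict minimum per core
per_slice = round_up_pow2(per_core)   # 512 KB if slices come in power-of-two sizes

print(per_core, per_slice)  # 320 512
```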

The processor rapso described would also need much more space for the very wide memory controllers.
Exophase is offline   Reply With Quote
Old 27-Jun-2012, 03:11   #1020
ninelven
PM
 
Join Date: Dec 2002
Posts: 1,381
Default

Quote:
That's because the L3 cache is not just a performance optimization but actually plays a central role in the function of the processor, and you can't omit it.
Sure you can. Whether it is a wise thing to do is another matter, but you can certainly make a multicore processor without L3.

Quote:
The processor rapso described would also need much more space for the very wide memory controllers.
I included the controllers in my estimation.
__________________
//
ninelven is offline   Reply With Quote
Old 27-Jun-2012, 03:18   #1021
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,636
Default

Quote:
Originally Posted by ninelven View Post
Sure you can. Whether it is a wise thing to do is another matter, but you can certainly make a multicore processor without L3.
Of course you can make a multicore processor without L3. But you can't make an Ivy Bridge without L3. The point is that its L3 performs a vital function, and if you remove it you have to replace it with something else. Who knows how much it'd change the design. That makes it really hard to estimate areas.

Quote:
Originally Posted by ninelven View Post
I included the controllers in my estimation.
Can you break that down for me?
Exophase is offline   Reply With Quote
Old 27-Jun-2012, 05:09   #1022
ninelven
PM
 
Join Date: Dec 2002
Posts: 1,381
Icon Rolleyes

Believe what you want.
__________________
//

Last edited by ninelven; 27-Jun-2012 at 05:31. Reason: Just don't care...
ninelven is offline   Reply With Quote
Old 27-Jun-2012, 07:55   #1023
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,571
Default

Quote:
Originally Posted by ninelven View Post
Well, you could fit ~10 Ivy Bridge cores in 160mm^2 without the L3 and GPU. So without the L3 and GPU it could probably be manufactured. Whether it should be on the other hand...

so what you are saying is that you want 10 cores that have no viable way of talking to the outside world...
__________________
Aaron Spink
speaking for myself inc.
aaronspink is offline   Reply With Quote
Old 27-Jun-2012, 07:58   #1024
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,571
Default

Quote:
Originally Posted by ninelven View Post
Believe what you want.
It isn't a matter of believing what someone wants. You cannot throw out the area for the L3 and leave it out of the estimate, because that area actually covers the interconnect functionality between the cores and from the cores to the rest of the system.
__________________
Aaron Spink
speaking for myself inc.
aaronspink is offline   Reply With Quote
Old 27-Jun-2012, 10:13   #1025
Blazkowicz
Senior Member
 
Join Date: Dec 2004
Location: Toulouse
Posts: 4,223
Default

even the celeron includes L3, 2MB for two cores (tip : the current celeron is incredibly fast, it's about like a core2duo E8500)

an updated Bulldozer will work without L3, as in Trinity.
but it's a fat design anyway and you now end up with big L2, which *bridge and similar don't have
these days an ivy bridge core has half the L2 of an Atom core.
Blazkowicz is offline   Reply With Quote
