Nehalem - 16 Thread x86 Monster!

pjbliverpool

I haven't been hearing much love for Nehalem here recently. Isn't anyone excited about this chip? It sounds like a beast!

The high-end version is currently slated to include 8 dual-threaded cores (a total of 16 threads) and a quad-channel on-board memory controller for over 33GB/s of memory bandwidth using DDR3 at 1033MHz. It's also likely that the chip will come with 24-32MB of on-board cache.
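
Quick sanity check on that bandwidth figure (a rough sketch, assuming standard 64-bit DDR channels at the quoted effective data rate):

```python
# Rough peak-bandwidth estimate for the rumoured quad-channel DDR3 controller.
# Assumes standard 64-bit (8-byte) channels at an effective ~1033 MT/s;
# sustained bandwidth in practice would be lower.
channels = 4
bytes_per_transfer = 8            # 64-bit channel width
transfers_per_second = 1033e6     # effective DDR3 data rate

peak_bw_gb_s = channels * bytes_per_transfer * transfers_per_second / 1e9
print(f"{peak_bw_gb_s:.1f} GB/s")  # ~33.1 GB/s
```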

Not only that, but each core individually sounds like it could be incredibly powerful, with FP units significantly beefed up over Penryn, which itself is supposed to have a 40% advantage over Conroe.

With all that FP power across 8 cores, could we be looking at the first x86 chip to outrun Cell in FP intensive code?
 
I'm more interested in the short term because the Intel "Penryn" family will mean one thing above all:
Cheap "Kentsfield" quad-cores for everyone in the second half of 2007. :D
 
With all that FP power across 8 cores, could we be looking at the first x86 chip to outrun Cell in FP intensive code?

On code Cell's very good at, I'd doubt it.
An 8-core Nehalem would be closer than other x86 chips to the 90nm Cell at vectorized code the IBM chip excels at, but I doubt it could really beat it on home turf.

There are those headaches associated with managing memory access and cache in a very multicore environment, the suboptimal ISA for the workload Cell does well at (the next set of media instructions might remedy part of this), and the tiny ISA-visible register set.

There are a large number of apps where Cell loses a good portion of its peak performance but is still better than current x86 chips; those Nehalem should convincingly win.

Any code that isn't well optimized should be a win for Nehalem.

Let's not forget, though, that the price of victory over today's Cell is a design two process nodes ahead, many times larger in transistor budget, and most probably only possible with one of the 120W+ TDP chips.

One could only imagine what Cell 2 or Cell 2.5 could do if given the same opportunities.

Edit:
An interesting wrinkle might be if future Intel designs start packing more specialized hardware. If Intel's Larrabee project leads to an improved x86 vector engine, it might make Nehalem powerful enough to match today's Cell. Since Nehalem will be more modular, it is possible it can do even better.

Not much info to go on right now, though.
 
On code Cell's very good at, I'd doubt it.
An 8-core Nehalem would be closer than other x86 chips to the 90nm Cell at vectorized code the IBM chip excels at, but I doubt it could really beat it on home turf.

There are those headaches associated with managing memory access and cache in a very multicore environment, the suboptimal ISA for the workload Cell does well at (the next set of media instructions might remedy part of this), and the tiny ISA-visible register set.

...

Let's not forget, though, that the price of victory over today's Cell is a design two process nodes ahead, many times larger in transistor budget, and most probably only possible with one of the 120W+ TDP chips.

It's pretty amazing that they managed to come up with something that's so far ahead of the competition performance-wise while using only a fraction of the resources (transistors, heat, power, cost). All I can say is IBM and Toshiba must have some very talented engineers!
 
Specialization with a clean slate allows for wonderful results.

Backwards compatibility and general performance requirements can be incredibly limiting.

Nehalem isn't just running 2008 programs coded for Nehalem, it has to run apps programmed over a decade ago, and run them well.
That means running code that existed before there was Nehalem to optimize for.

Cell doesn't have to run anything from before Cell existed.
In cases where such code has been made to run, things are not pretty.
 
Are you kidding? For all practical purposes nehalem will be a total fing monster compared to Cell.

It will probably be about 10 times as big, for starters.

This will begin to encroach on Nvidia too.

I believe eventually Intel will be the only company left in both CPU and GPU.
 
It will probably be about 10 times as big, for starters.
How does the die size relate *at all* to the question?
With all that FP power across 8 cores, could we be looking at the first x86 chip to outrun Cell in FP intensive code?
In fact, I fail to see the reasoning for the rest of your post too... Intel killing everybody else *might* happen, but your train of thought is quite bizarre.
 
The 40% boost claim on Tech Report for SSE4 is interesting. Wonder what they're up to there. SSE2 was the last really interesting revision of SSE.
 
Are you kidding? For all practical purposes nehalem will be a total fing monster compared to Cell.
It will be monstrous in ways that won't help it win in the niche of programs that Cell excels at.

It will be a lot closer for most of that set, and it will probably beat the 90nm version of Cell for most code that Cell is not quite as good at.

By that time, however, it won't be the original Cell that Nehalem will be competing with.

It will probably be about 10 times as big, for starters.
Which it are you referring to, Cell or Nehalem?
 
How does the die size relate *at all* to the question?
The "outrun" question?
If four or more Cells can be built in the area, power and cost envelope of one Nehalem, it's going to be a hollow victory at best. Cell was designed to out-scale pure SMP, and certainly x86 SMP, in terms of throughput, power and cost.
 
The "outrun" question?
If four or more Cells can be built in the area, power and cost envelope of one Nehalem, it's going to be a hollow victory at best. Cell was designed to out-scale pure SMP, and certainly x86 SMP, in terms of throughput, power and cost.

There actually is a Cell2 in the works. It was revealed by IBM at a conference, and it is stated it will hit 1 TFLOP all by itself.
 
I sort of wonder what all those 16 threads will do in any PC close to a normal one.

Even in server-type situations, will there be anything able to take advantage of that many threads?

Multiple processors are still very new in any sort of mainstream way after all.

I do remember dual processors were sort of a fad during the Pentium days and then again around the P3 500MHz mark (I guess there were cheap enough dual-CPU chipsets then), but it's not exactly been common, so to speak.

Presumably the 8-core, 16-thread chip will be the monster Xeon variety with the equally monstrous price tag. Nothing any mere mortal could hope to afford.
Peace.
 
Even in server-type situations, will there be anything able to take advantage of that many threads?

Yes. Yes, very much yes.

Multiple processors are still very new in any sort of mainstream way after all.

Mainstream = desktop? Multiple processors have been mainstream in servers and high-performance computing for decades.
 
There actually is a Cell2 in the works. It was revealed by IBM at a conference, and it is stated it will hit 1 TFLOP all by itself.

If we assume this chip doubles Conroe's theoretical peak FP performance - which is a big assumption at present - then at 3GHz it would peak out at around 384 GFLOPS, about 1/3 of that new Cell, although still well above the current Cell.

If it merely maintains Conroe's per-core peak but gets closer to full utilisation, then at 3GHz it would peak out at 192 GFLOPS, which is just under the current Cell. We can expect that kind of theoretical performance out of Barcelona later this year, I would imagine.
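
For reference, the arithmetic behind those two figures looks something like this (a rough sketch, assuming a Conroe-class core can retire 8 single-precision FLOPs per cycle via SSE, and that the hypothetical Nehalem core doubles that):

```python
# Back-of-envelope peak single-precision FLOPS for the hypothetical 8-core,
# 3GHz Nehalem discussed above. Assumes a Conroe-class core retires 8 SP
# FLOPs per cycle (4-wide SSE add + 4-wide SSE mul); real utilisation is lower.
def peak_gflops(cores, flops_per_cycle_per_core, clock_ghz):
    return cores * flops_per_cycle_per_core * clock_ghz

print(peak_gflops(8, 8, 3.0))    # 192 GFLOPS if per-core peak matches Conroe
print(peak_gflops(8, 16, 3.0))   # 384 GFLOPS if per-core peak is doubled
```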

Thing is, I thought x86 was better able to reach its peaks than Cell simply because of its fundamental design (OoO, super branch prediction, etc...). So when its peak is so close, shouldn't we be expecting better results? Or are there other factors at play which I'm not getting? The LS is a huge part of Cell's performance potential, I understand, but what are the reasons why a shared (or local, for that matter) cache can't substitute when it's larger? Is it much slower, or is it simply a matter of control?

EDIT: For the record, I'm not saying an SMP x86 is ever going to outperform a Cell-like model using the same resources in vectorised code. But as 3dilettante says, there are huge advantages to being specialised and starting with a clean slate, so I think it's an achievement to be marked when a more generalised, backwards-compatible CPU can actually beat that specialised CPU in its area of specialisation.

Kinda like the first x86 to out-"GFLOP" the Emotion Engine, or the first GPU to match or exceed PS2's eDRAM. Sure, they are much later, but I find it interesting to see how long it takes more generalised technology to catch up with the specialised power.
 
Mainstream = desktop? Multiple processors have been mainstream in servers and high-performance computing for decades.

True. But Nehalem is targeted at high volume, ergo laptop and desktop use.
Servers run several similar-priority threads by nature.
High-performance computing focuses on solving one problem quickly, but we should pay attention to the fact that the problems that can profitably take advantage of large-scale multiprocessing are a narrow subset of the calculational tasks we have in science and engineering. Amdahl's law is merciless, and for someone like me who has been around since supercomputing was "single core", the narrowing focus is very tangible.
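
A minimal sketch of just how merciless, assuming (purely for illustration) that 80% of a task can be parallelised at all:

```python
# Amdahl's law: speedup from n processors when a fraction p of the work
# is parallelisable and the remaining (1 - p) stays serial.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

print(amdahl_speedup(0.8, 2))   # ~1.7x
print(amdahl_speedup(0.8, 4))   # ~2.5x
print(amdahl_speedup(0.8, 16))  # ~4.0x; the ceiling is 1 / (1 - p) = 5x
```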

For desktop use, going from 1 -> 2 processors helps the user experience quite a bit, mostly due to the dubious quality of how multitasking is handled on the major consumer OS. The actual performance advantage for any single given task is typically negligible. Going from 2 -> 4 makes less sense. We are still dealing with an almost exclusively single-threaded code base, we have the same memory subsystem, now called on to serve 4 cores, and no good strategy to avoid contention issues, so even if the problem can be suitably partitioned, there are other hard limitations. The number of scenarios where a 4-core processor will provide cost-effective benefits over a 2-core one is quite small. Going to 8 cores, I'm not sure Intel will even bother offering it other than as a server processor option (although milking the lunatic PC fringe may be too lucrative to pass by).

Intel emphasized Nehalem's scalability - wisely so, IMHO. They will do all they can to keep ASPs (average selling prices) up, and so will push multiprocessing far beyond where it makes sense, but there are real costs associated with making the dies larger. From a consumer perspective, the most reasonable path is to level off at two, or even just one, general-purpose CPU core on the desktop, and reap the corresponding benefits in size, cost and power efficiency. That may not bring in the maximum profit for Intel or AMD, however, so the crystal ball remains murky.
 
Not only that, but each core individually sounds like it could be incredibly powerful, with FP units significantly beefed up over Penryn, which itself is supposed to have a 40% advantage over Conroe.
Penryn isn't supposed to be 40% faster than Conroe. That figure only covers what you'd gain by using the new SSE4 instructions, and obviously it will depend on the application whether those are actually useful. Otherwise, the SSE units are largely unchanged, apart from the faster shuffle operations ("super shuffle" - again, how much you gain depends on your instruction mix, though that will speed up code using "old" SSE instructions too). Intel quotes a 20% performance increase over Conroe overall - that's with a 10% clock increase, larger cache and faster bus speed. So yes, some tweaks here and there (the only other thing explicitly mentioned is a faster radix-16-based divider), but nothing radically different. Not that this is a bad thing...
 
Presumably the 8-core, 16-thread chip will be the monster Xeon variety with the equally monstrous price tag. Nothing any mere mortal could hope to afford.
Peace.

A month from now, the monster quad-core variety of Core 2 Duo will be at $530, thus affordable for many (though I wouldn't recommend it to most people).
So I don't think the 8-core Nehalem will be that monstrous... but I also definitely think people will mostly get dual-core or quad-core variants.


Anyway... if we say quad or more cores are mostly useless for desktop usage, then why should Cell be any less useless?
 
Penryn isn't supposed to be 40% faster than Conroe. That figure only covers what you'd gain by using the new SSE4 instructions, and obviously it will depend on the application whether those are actually useful. Otherwise, the SSE units are largely unchanged, apart from the faster shuffle operations ("super shuffle" - again, how much you gain depends on your instruction mix, though that will speed up code using "old" SSE instructions too). Intel quotes a 20% performance increase over Conroe overall - that's with a 10% clock increase, larger cache and faster bus speed. So yes, some tweaks here and there (the only other thing explicitly mentioned is a faster radix-16-based divider), but nothing radically different. Not that this is a bad thing...

True, but my example was in specific reference to the peak effectiveness of the FP engine which would presumably include optimisations for the available SSE instruction set.

To give an example, I'm thinking along the lines of: what if they built an F@H client specifically for Penryn (not that they ever would) - would it be 40% faster than one built specifically for Conroe at the given clock speeds?

Or perhaps more interestingly, could the 8-core variant of Nehalem keep pace with today's Cell in F@H? Assuming a fairly optimised client for both.
 
True, but my example was in specific reference to the peak effectiveness of the FP engine which would presumably include optimisations for the available SSE instruction set.

To give an example, I'm thinking along the lines of: what if they built an F@H client specifically for Penryn (not that they ever would) - would it be 40% faster than one built specifically for Conroe at the given clock speeds?

The best one to answer that is someone who's profiled FAH on Conroe.
If the percent utilization of Conroe's math capability is >60%, then the answer is likely no, since Penryn doesn't add any new math resources.

There may be some leeway because of the shuffle unit, if some of the info given is accurate. Penryn allegedly can process a number of shuffle instructions as a single op, whereas Conroe sometimes must do such operations as multiple ops that take up issue bandwidth.

However, if the hard math unit utilization is above 60%, then the answer is a flat-out no.

edit: this is per-clock
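
To put a crude bound on it (a sketch of the reasoning, treating the unchanged math units as the part that cannot be sped up):

```python
# Amdahl-style bound on the per-clock gain if the FP/SSE math units are
# unchanged and only the remaining cycles (shuffles, etc.) can be reclaimed.
def max_per_clock_speedup(math_utilization):
    # Cycles already spent in the math units cannot shrink; the rest can,
    # in the best case, go to (nearly) zero.
    return 1.0 / math_utilization

for u in (0.5, 0.6, 0.7, 0.8):
    print(f"math utilization {u:.0%}: at most {max_per_clock_speedup(u):.2f}x")
# A 40% per-clock gain (1.4x) is only possible if utilization is below ~71%.
```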

Or perhaps more interestingly, could the 8-core variant of Nehalem keep pace with today's Cell in F@H? Assuming a fairly optimised client for both.

A lot can happen with Nehalem. If the FP architecture is not significantly expanded over Penryn, I think peak execution resources will lag somewhat, but by that time Nehalem is likely to have more memory bandwidth to play with.

If all else were equal, I really think the SPEs are better at a lot of the calculations, but as FAH said, there are some calcs that thus far only standard CPUs are able to run. They're working to expand the set of calculations, but there are just some types of work Cell isn't running at all at present.

Those remaining calculations I'd expect to be very well taken care of by Nehalem.
 