Leaked Intel Nehalem performance projections over AMD Shanghai

What could be interesting, if some rumors prove true, is that Nehalem's L2 will only be 256 KiB.
Nehalem will likely have a cache advantage still, but the relative difference wouldn't be anywhere near as bad as the current cache disparity.
The downside for AMD is that Nehalem would have replaced that advantage with a potentially superior memory controller, and I have a sneaking suspicion that Intel's L3 isn't going to have the pathetic latency numbers Barcelona's last level has.
Let's hope Shanghai's latencies improve.

I wouldn't be surprised if Nehalem's L2s are around 256KB-512KB. They are probably optimized for bandwidth/latency, given the beefy cores they have to feed. The large L3 will provide coverage.
 
Is it just me, or are these core counts getting out of hand? I love technology as much as the next guy, but I don't know what the average Joe would do with quad-core or octa-core systems! Software also hasn't caught up with the hardware evolution. It's not that I'm anti-technology... but it would be good to see software performance, parallelization, better programming languages, better compilers, the whole shebang. I know not all tasks can be parallelized. I wonder how much improvement these cores will bring in terms of single-threaded performance.
 
If the new micro architecture in Nehalem is to Core 2 what Core 2 was to Pentium 4, well then... :drool:
 
Only in a multi-thread environment, unfortunately. Single-thread performance is said to be on the order of 10-25% faster than Penryn at the same clock speed.
 
ISVs have to step up to the plate and work hard to get a good multi-thread baseline framework in place.

Man, I am really excited about Nehalem. It will be good to see the FSB finally buried.
 
ISVs have to step up to the plate and work hard to get a good multi-thread baseline framework in place.

True, but not all code is more than trivially parallelizable, and some not at all. Plus it's no easy task.

Man, I am really excited about Nehalem. It will be good to see the FSB finally buried.

I also look forward to Nehalem, but am confident my dual-core Penryn will keep me happy until then.
 
True, but not all code is more than trivially parallelizable, and some not at all. Plus it's no easy task.
Yes, but it doesn't mean people shouldn't work hard at solving the problems. There are people who try solving NP-complete problems too. :D
 
I didn't mean to imply that the problems facing devs shouldn't be tackled, simply that they are not easily solvable and thus we shouldn't expect solutions overnight.
 
There are now shots of wafers with Shanghai cpus on them.

Others have eyeballed the count, and my own glance suggests roughly the following numbers of complete dies along the longest and widest rows of the grid.

Shanghai on a 300 mm wafer: 20x15
Nehalem on a 300 mm wafer: 20x14

I'd like some folks who are more diligent to double-check my skimming, but it seems to indicate that the chips are closer than the 20-30% disparity brought up earlier.
From a candidate die per wafer perspective, the advantage Shanghai has over Nehalem doesn't seem to match that gap.
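
For what it's worth, here's a quick back-of-the-envelope sketch of what those counts would imply. The assumptions are mine (complete dies spanning the full 300 mm diameter, no scribe lines or edge exclusion, the textbook gross-die approximation), so treat it as a ballpark check rather than a real die-per-wafer calculation.

[code]
# Rough die-size and gross-die-per-wafer estimate from the eyeballed grid
# counts above. Assumes the counted dies span the full 300 mm diameter and
# ignores scribe lines and edge exclusion, so the output is only a ballpark.
import math

WAFER_DIAMETER_MM = 300.0

def estimate(cols, rows, diameter=WAFER_DIAMETER_MM):
    """Return (die area in mm^2, approximate gross dies per wafer)."""
    die_area = (diameter / cols) * (diameter / rows)
    # Textbook approximation: wafer area / die area, minus an edge-loss term.
    gross = (math.pi * (diameter / 2) ** 2 / die_area
             - math.pi * diameter / math.sqrt(2 * die_area))
    return die_area, gross

for name, cols, rows in (("Shanghai", 20, 15), ("Nehalem", 20, 14)):
    area, gross = estimate(cols, rows)
    print(f"{name}: ~{area:.0f} mm^2 per die, ~{gross:.0f} gross candidates per wafer")
[/code]

On those assumptions it works out to roughly 300 mm² vs ~320 mm², and something like 197 vs 183 gross candidates per wafer, i.e. a mid-single-digit gap rather than 20-30%.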
 
http://chip-architect.com/news/Shanghai_Nehalem.jpg

I'm not sure about the memory channel count labeled on the Nehalem, though that may just be because I'm not sure which variant this is supposed to be.

I don't think Hans DeVries would be too far off on the numbers I'm interested in:

The die sizes are almost the same, not the 20-30% difference being rumored.

Maybe there is some margin of error over which core variant is involved, but the transistor count on the Nehalem seems high enough to be right.

It's notable just how poor AMD's L3 cache density is, particularly compared to Intel's.

The L3 is actually the same density as the L2 for AMD, kind of negating a significant part of the reason to have it.
 
Not that I'm remotely an expert in such matters, but I'd guess it's a sweet spot for performance/power consumption/leakage(?). If I'm remembering correctly, IBM has demonstrated much higher SRAM cache densities on its POWER series at a given process node relative to AMD, so it's obviously not an innate deficiency of SOI.

Regarding Nehalem's L3, isn't density-optimised SRAM, as a general rule, at the slower and leakier end of the spectrum?
 
It might be the sweet spot for power/yield for AMD.

For an SRAM to operate reliably, whole swaths of cells along the bit lines must all function correctly under a wide range of conditions.

Manufacturing variation becomes an issue, even with redundancy.

Yields for chips with a lot of SRAM can be influenced by the voltage the SRAM runs at and the size of the SRAM cells as well.

Higher voltage means higher tolerance for variation amongst the components, and larger cells are more resistant to variation because they simply have more bulk to tolerate it.

Higher voltage means higher power consumption that AMD doesn't want, and larger cells mean poorer density.

IBM's SRAM is for chips that are meant for high end servers, where the volumes need not be all that high. IBM can charge an arm and a leg per chip and then charge for system services. It can toss a lot more chips and tolerate way higher TDPs than AMD can allow.

That means IBM is in a much better position to tolerate poorer manufacturability and higher power draw than AMD is.

Realworldtech has an article on Barcelona that also indicates that signal integrity forced a design compromise on the cache cells.

http://www.realworldtech.com/page.cfm?ArticleID=RWT051607033728&p=8

Significantly, it is apparently the case that SOI actually did hurt AMD's cache density, at least for Barcelona.

What is telling is that Barcelona also used the same cache cells for the L2 and L3, which means similar compromises may have been made for Shanghai as well, since it too has the same density for both caches.
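
As a rough illustration of why such margins get quoted in so many sigma: if you (hypothetically) treat the margin as an independent per-cell probability of a cell being out of spec, a cache-sized array multiplies even tiny per-cell probabilities into real numbers. The 6 MiB array size and the independence assumption below are mine, purely for illustration (the 4.53 and 5 sigma entries are the margins the linked article discusses); real yield and read-margin analysis with redundancy and correlated variation is far more involved.

[code]
# Toy model only: assume each SRAM cell is independently "out of margin" with
# the one-sided normal tail probability at n sigma, then count how many
# marginal cells a 6 MiB array would be expected to contain.
import math

def one_sided_tail(n_sigma):
    """P(a normally distributed parameter exceeds its n-sigma margin)."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

CELLS = 6 * 2**20 * 8   # data bits in a 6 MiB array, ignoring ECC/redundancy

for n in (3.0, 4.0, 4.53, 5.0, 6.0):
    p = one_sided_tail(n)
    print(f"{n:4.2f} sigma: per-cell tail {p:.2e}, "
          f"expected marginal cells in 6 MiB ~ {CELLS * p:,.1f}")
[/code]

On that very crude reading, going from ~4.5 to 5 sigma is the difference between on the order of a hundred marginal cells per array and roughly a dozen, which is the kind of gap that decides whether redundancy can paper over the problem.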
 
The L2 and L3 cache share many design elements, including the SRAM cells. The L2/3 cells are 0.81 µm² and are also single ended for stability, which is unusual. One of the difficulties that AMD’s SRAM designers faced is that because they use the same die across all product lines, the likelihood of a read disturbance (i.e. reading the wrong data) must be very small. Specifically, a 5 sigma margin across the entire 0.7-1.3V range is required. Unfortunately, the floating body effect of SOI silicon precluded a more efficient small swing read design. According to AMD’s presentation, using a small swing read cell, they were only able to achieve a 4.53 sigma margin. The single ended design which was chosen had larger margins that were sufficient for actual product use.

As a complete mathematical illiterate, I take a 5 sigma variance to mean a tolerance of a fifth of one standard deviation?

Also curious as to why the current northbridge is so clock limited if the cache cells are identical to the L2.
 
It's notable just how poor AMD's L3 cache density is, particularly compared to Intel's.

Hmm, are you sure it's lower? On close inspection you can see larger "roads", like avenues, on Shanghai, dividing blocks of higher density, whereas on Nehalem everything is packed. If this works like road traffic, Shanghai should have fewer traffic jams, fewer bottlenecks.

Sigma means standard deviation:

http://en.wikipedia.org/wiki/Standard_deviation

The confidence intervals are as follows:
1σ 68.26894921371%
2σ 95.44997361036%
3σ 99.73002039367%
4σ 99.99366575163%
5σ 99.99994266969%
6σ 99.99999980268%
7σ 99.99999999974%

That means roughly 1 in every 1.7 million elements, transistors I think, is incorrectly printed, on average.
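
For reference, those figures are just the two-sided coverage of a normal distribution; here's a quick standard-library check that also turns each margin into a "1 in N" rate:

[code]
# Recompute the coverage table above and express each margin as a "1 in N"
# rate of samples falling outside the +/- n sigma band (two-sided).
import math

for n in range(1, 8):
    outside = math.erfc(n / math.sqrt(2.0))   # fraction beyond +/- n sigma
    print(f"{n} sigma: {(1 - outside) * 100:.11f}% inside, "
          f"about 1 in {1 / outside:,.0f} outside")
[/code]

So a 5 sigma margin corresponds to roughly 1 in 1.7 million falling outside the two-sided band (about 1 in 3.5 million if only one tail matters).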
 
Erm, I think it's a pretty fair assumption that Intel not only has better density than AMD, but fewer defects per given metric. That's always been the case at every given process node. Intel defines MPU manufacturing. Everyone else just tries to keep up (and fails).
 
That means roughly 1 in every 1.7 million elements, transistors I think, is incorrectly printed, on average.
I think it refers to the read error rate, not directly physical defects.

The single-ended read instead of a small-swing read probably means they are not using differential bitlines and sense amps to amplify the small signal between them. So basically the cells have to be larger and/or the bitlines have to be shorter, because the cells have to drive these (long) wires all the way down, or close, to a logical 0. That means more I/O overhead, because the cell arrays can't be very large (and that's probably why you see more gaps between the cell blocks in Barcelona/Shanghai). In a differential read design there is a sense amplifier that senses a small voltage difference between the bit and _bit lines for each cell and amplifies that signal; the cell only has to swing the long bitline a small amount, usually around 100-200 mV. So that means smaller cells and/or longer bitlines (larger cell arrays), and probably lower power (for reads).
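
To put a rough number on the "probably lower power (for reads)" point, here's a toy energy comparison. The bitline capacitance, supply voltage and sense swing below are placeholder assumptions of mine, not figures for either chip; only the scaling matters.

[code]
# Toy estimate of bitline read energy: full-swing single-ended vs small-swing
# differential sensing. All values are illustrative assumptions.
BITLINE_C = 150e-15   # F, assumed bitline capacitance
VDD       = 1.1       # V, assumed supply
SWING     = 0.15      # V, assumed differential development before the sense amp fires

# Full-swing single-ended read: the bitline is pulled (nearly) rail to rail and
# then precharged back, so the supply delivers roughly C * Vdd^2 per access.
e_full_swing = BITLINE_C * VDD ** 2

# Small-swing differential read: two precharged bitlines each move only ~SWING
# before the sense amplifier resolves, costing roughly 2 * C * Vdd * SWING.
e_small_swing = 2 * BITLINE_C * VDD * SWING

print(f"full swing : {e_full_swing * 1e15:.1f} fJ per access")
print(f"small swing: {e_small_swing * 1e15:.1f} fJ per access")
print(f"ratio      : ~{e_full_swing / e_small_swing:.1f}x")
[/code]

With those placeholder numbers the full-swing read burns a few times more bitline energy per access, which lines up with the intuition above.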
 
http://chip-architect.com/news/Shanghai_Nehalem.jpg

It's notable just how poor AMD's L3 cache density is, particularly compared to Intel's.

The L3 is actually the same density as the L2 for AMD, kind of negating a significant part of the reason to have it.
Looks like there will probably be 2MB L3 versions of Shanghai, from the L3 layout.
 
It is likely that, just as with Barcelona, the L3 reuses the same cells designed for the L2. This is certainly suboptimal, as L3 cells can usually be optimized for size instead of speed: L3 read latency will usually be dominated by the interconnect latency, not cell performance. However, given the cost and time constraints AMD currently has, that sounds like a good decision. It is also possible that AMD will respin the processor later with a 'better' L3 once the 45nm process matures and its designers are more familiar with it.
 