Intel adapts Performance Rating. A sign of slowing clockspee

Druga Runda · Mar 25, 2004

Rugor said:
Prescott is like Willamette, released at too low a clockspeed to really show its performance.

Having said that, I do think the Netburst was the wrong way to go.

Comparing the improvements they made from Willamette to Prescott, it seems OK. The only problem is that AMD pushed Intel so hard that they used all their ammunition with Northwood and are now stuck with Prescott. But overall, HT and the latest P4s with the wide FSB are good performers across the line.

Saem · Mar 25, 2004

There seem to be a lot of forgetful people. A little while after Willy was released, Intel slowed down their speed grade releases because AMD couldn't keep up. When AMD finally started to turn on the heat again, Northwood showed up. Towards the end of Northwood's life and the latest and greatest K7, AMD was behind by a fairly large margin, and many folks then started really questioning the CPU ratings.

As for Netburst being a poor choice, hardly! I believe research presented by the Alpha team showed that for 0.13u an 18 stage pipeline would be around optimal -- if only I could find a link. Intel seems to be bang on. AMD has had trouble scaling frequencies; they're relying on some very exotic materials and really pushing their process. Furthermore, attaining a high clock speed isn't helped when you have x86 decoders in the critical path. And it doesn't help to be able to execute many instructions per cycle when the ISA doesn't lend itself to more than 3 ops in parallel. You can't just make a high IPC processor when the code doesn't facilitate it. Right now AMD's performance is coming in tests which are memory controller intensive, FPU intensive and/or branch heavy.

And the guy who doesn't think Intel fabs are likely the best in the world, get a clue. They don't need all the exotic materials to achieve their goals, their fabrication seems to do quite well without resorting to high performance/high cost materials. It's IBM (G5), Transmeta (IBM fabs their stuff) and AMD that are being forced to rely on expensive materials to keep up with Intel.

Rugor · Mar 26, 2004

I'll agree that Intel's fabs are top-line, maybe not the best in the world but certainly at the head of the top tier. I don't know enough about all the fabs to definitely pick a single best, and I doubt many other people do either.

As to Netburst, in many ways it's predicated on the idea that increasing clockspeed is in and of itself a good and desirable thing: To which I say "bollocks." As a user I don't care whether my next system has a higher clock speed than the current one or not, and if systems weren't sold by the GHz neither would anyone else. When people talk about wanting a faster system, they don't really mean clockspeed, because the system's clock isn't what they interact with. What they want is a system that runs their applications faster, that responds to their inputs faster.

There are two ways to achieve higher performance within a processor core, one is to increase clockspeed, the other to increase IPC. When dealing with a system, you have to add a third, namely to reduce or remove bottlenecks. Needless to say the best way to increase performance overall is through a balanced approach.
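As a rough illustration of the two in-core levers, throughput can be modeled as clock rate times IPC. This is only a toy model: the function name and the sample figures are hypothetical, not measurements of any real CPU.

```python
def instructions_per_second(clock_ghz: float, ipc: float) -> float:
    """Toy model: throughput = clock rate in Hz * instructions per cycle."""
    return clock_ghz * 1e9 * ipc


# Two hypothetical designs: one clocks higher, the other does more per cycle.
fast_clock = instructions_per_second(clock_ghz=3.0, ipc=1.0)
high_ipc = instructions_per_second(clock_ghz=2.0, ipc=1.5)

# In this model both designs retire the same number of instructions per second.
print(fast_clock, high_ipc)
```

The model ignores system bottlenecks entirely, which is why the third lever has to be considered separately.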

The difference I see between the current AMD approach and Intel's Netburst approach is that AMD is attempting to increase CPU and system performance as their priority, where Intel is increasing clockspeed as a priority. Clockspeed is certainly higher on Intel's priority lists than performance as can be seen by Prescott, which cannot match Northwood clock for clock, but will eventually be able to scale to higher speeds despite its ruinous power budget.

In short I don't like Netburst because it puts marketability, in the way of higher speeds, above useability in the way of higher performance.

zidane1strife · Mar 26, 2004

And the guy who doesn't think Intel fabs are likely the best in the world, get a clue. They don't need all the exotic materials to achieve their goals, their fabrication seems to do quite well without resorting to high performance/high cost materials. It's IBM (G5), Transmeta (IBM fabs their stuff) and AMD that are being forced to rely on expensive materials to keep up with Intel.

Tsk, tsk. Clearly I did not mean to diss Intel. What I meant was that real obstacles can arise in the real world, and that it's not necessarily those at the top that will obtain the best solutions all the time. They might get stuck here and there, and this could in turn be seen in the products that are delivered.

Saem · Mar 26, 2004

Rugor, you're just stating a whole lot of the obvious. First off, the third way to gain performance that you mention is garbage; that's just good design.

Second, your thoughts on Prescott are way off. It has larger caches and an improved cache hierarchy, improved HT, improved execution resources, and you think they're not raising IPC? Hell, look at the Willy, that's your baseline. They raised the clock. Northwood came out, the clock went even higher and the IPC went up too, then the C cores took it a step further. I don't see your point.

arjan de lumens · Mar 26, 2004

Saem said:
Second, your thoughts on Prescott are way off. It has larger caches and an improved cache hierarchy, improved HT, improved execution resources, and you think they're not raising IPC?

Take a look at actual Prescott vs Northwood benchmarks - at 3.2 GHz, the Prescott generally falls behind Northwood by about 1-3 percent most of the time, so it definitely has a worse IPC at that clock speed. There are many changes done in Prescott that are meant to help clock speed, but also do hurt IPC: L1 and L2 cache latencies are doubled, the execution pipeline is 55% longer (from 20 to 31 steps), the integer ALUs are apparently no longer double-pumped, and many common FPU/SSE instructions have got 1 clock extra latency.
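The arithmetic behind those two claims can be sketched as follows. The pipeline depths are the figures quoted above; the benchmark scores are hypothetical placeholders, and the helper names are made up for illustration. The sketch assumes a benchmark score scales with IPC times clock.

```python
def relative_change(old: float, new: float) -> float:
    """Fractional change going from old to new."""
    return (new - old) / old


def per_clock_ratio(score_a: float, clock_a_ghz: float,
                    score_b: float, clock_b_ghz: float) -> float:
    """Per-clock performance of A relative to B (score assumed ~ IPC * clock)."""
    return (score_a / clock_a_ghz) / (score_b / clock_b_ghz)


# Pipeline depth as quoted in the post: 20 -> 31 stages.
pipeline_growth = relative_change(20, 31)

# Hypothetical scores: Prescott 2% behind Northwood, both at 3.2 GHz.
prescott_vs_northwood = per_clock_ratio(98.0, 3.2, 100.0, 3.2)

print(f"pipeline is {pipeline_growth:.0%} longer")
print(f"Prescott per-clock performance: {prescott_vs_northwood:.2f}x Northwood")
```

At equal clocks the clock terms cancel, so any score gap is a direct estimate of the IPC gap on that workload.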

Rugor · Mar 26, 2004

Saem:

Let's follow the evolution of the Netburst core shall we?

Going from Willamette to Northwood, they doubled the cache, added HT, and increased the processor's available bandwidth through doubling the FSB. All of these features increased the IPC regardless of clockspeed. If you were to compare a P4 2.4C to a Celeron 2.4 (which while it isn't exactly a Willamette shares all the same core features except for cache size) you would see that the Northwood's performance is significantly better.

Northwood's overall performance increase over Willamette is due to increasing both clock speed and IPC. Northwood's actually a very good example of how to improve the performance of a CPU since it hits all three of my points. The smaller process allowed for improved scaling, HT improved the IPC, as did the larger onboard cache, while increasing the FSB removed a system bottleneck. I differentiate that from the cache because it only involves changing settings, not the core.

Prescott is a different beast entirely, when you compare it to Northwood at 2.8GHz it's a clear loser, running hotter, drawing more power, and seriously underperforming. It needs much higher speeds to show its best and that will bring along its own set of problems. In fact most of its changes are optimized for increasing speeds.

One issue is the deeper pipes. The longer the pipeline the greater the amount of bandwidth needed to keep those pipes full. At the moment that doesn't seem to be an issue as Prescott doesn't appear to be suffering from bandwidth starvation, in fact seeming to use it more efficiently, but it's going to need a lot more as it scales. Then we may well see problems, because neither memory nor hard-drive speeds are scaling anywhere near as quickly as processors.

I still think a more balanced approach of increasing IPC and clockspeed would be the way to go.

Deadmeat · Apr 3, 2004

...

Intel to consolidate chips for desktops and notebooks in 2007
[digitimes.com] 11:32

Intel is likely to give up its current practice of launching processors for desktops and notebooks respectively. Instead the chipmaker is expected to launch a brand new processor, dubbed Merom, for all PCs starting 2007, according to sources at Taiwanese motherboard makers.
The Merom will be made using a 65nm process and will run under the current architecture used by Intel's Pentium M processors, said the sources. The Merom will be offered with different amounts of cache for varying markets.

Intel's Netburst architecture that supports the Pentium 4 processors is likely to be phased out from the PC market when the Merom comes online, according to the sources. Under Netburst, Intel's Prescott processors have been constantly haunted by problems such as heat dissipation and high power consumption.

Charles Chou, Steve Shen

NetBurst is dead? Back to P6 microarchitecture??? Another testimony of slowing CPU clockspeed...

Saem · Apr 3, 2004

Take a look at actual Prescott vs Northwood benchmarks - at 3.2 GHz, the Prescott generally falls behind Northwood by about 1-3 percent most of the time, so it definitely has a worse IPC at that clock speed. There are many changes done in Prescott that are meant to help clock speed, but also do hurt IPC: L1 and L2 cache latencies are doubled, the execution pipeline is 55% longer (from 20 to 31 steps), the integer ALUs are apparently no longer double-pumped, and many common FPU/SSE instructions have got 1 clock extra latency.

http://www.aceshardware.com/read.jsp?id=60000315

According to the first chart the L1 and L2 cache latencies seem unchanged.

Yes, they made those changes for clockspeed, but the larger caches and improvements to HT do feed the CPU better. They also added more execution units and various other improvements in terms of scheduling and issuing.

One issue is the deeper pipes. The longer the pipeline the greater the amount of bandwidth needed to keep those pipes full. At the moment that doesn't seem to be an issue as Prescott doesn't appear to be suffering from bandwidth starvation, in fact seeming to use it more efficiently, but it's going to need a lot more as it scales. Then we may well see problems, because neither memory nor hard-drive speeds are scaling anywhere near as quickly as processors.

Huh? Couched within your statements seems to be that higher IPC doesn't require more bandwidth as well. Overall, the desired effect you SEEM to be aiming to get with your post is bunk. More computation per unit time strongly correlates to the need of more data movement per unit time. With deeper pipelined architectures you can better hide memory latencies and NOT rely less on faster memory architectures and more speculative execution. Then again, not all software solutions are suited towards this.

I still think a more balanced approach of increasing IPC and clockspeed would be the way to go.

Based on what? A gut feeling, which in turn is based on what? That it sounds reasonable if one disregards pragmatics? x86 code tends to yield few parallelizable instructions (usually 3), which is why Intel is getting on the HT bandwagon to take their execution resources further by getting parallelism across threads.

Northwood's overall performance increase over Willamette is due to increasing both clock speed and IPC. Northwood's actually a very good example of how to improve the performance of a CPU since it hits all three of my points. The smaller process allowed for improved scaling, HT improved the IPC, as did the larger onboard cache, while increasing the FSB removed a system bottleneck. I differentiate that from the cache because it only involves changing settings, not the core.

Yes, but sustaining IPC at high clock rates is the real issue. Yes, at lower clock rates the Northwood is showing up the Prescott, point conceded; my predictions were otherwise. This, however, doesn't change the fact that at higher clock rates Northwood can't sustain its IPC while the Prescott can. There was many a fool saying Intel should have just extended the PIII architecture rather than going with the Willamette core. The issue with that is that an extended PIII architecture with, say, improved scheduling, HT, issue, clock distribution, a longer pipeline and greater execution resources is a new core -- why not make a new core? Oh, wait, they did! A PIII with improvements made only for clock doesn't make sense, since your IPC would just drop like a rock too! The goal is sustaining a level of IPC at higher clock rates. In other words, let the Prescott scale a bit, clock up a Northwood, and you'll see the Prescott's IPC fall off more gracefully by comparison.

NetBurst is dead? Back to P6 microarchitecture??? Another testimony of slowing CPU clockspeed...

That article says nothing about the configuration of this MPU, so one cannot say that Netburst is dead. Who knows, the Banias might see a more gradual shift into a Netburst-esque beast than the PIII to P4 transition. In any case, there will be differing computing needs by that time; likely computing pads will be all the rage, and thus power will be a larger factor in governing MPU design.

Rugor · Apr 4, 2004

The real issue is neither clockspeeds nor IPC nor even bandwidth. The real issue is increased performance for the end user. As a user, I don't care if you get them through increasing clockspeeds or IPC; I merely care that the new system runs my applications faster and hangs less often than the old system.

I'll agree that an increase in IPC requires more bandwidth, however so do increases in both pipeline depth and clockspeed. In fact they can require greater increases in bandwidth than IPC alone. A deeper pipelined architecture is more likely to be able to trade off bandwidth for latency. Also, deeper pipelines require higher clockspeeds for maximum efficiency.

There are other issues involved too, especially power draw and thermals. Current Prescotts have the highest power draw and heat generation of any available processor. LGA 775 and later steppings will probably mitigate these issues, but they won't go away. If nothing else, Intel's push towards the BTX form factor is evidence of this. The Netburst architecture's relentless focus on pure clockspeed is running into problems, and that's why Intel is slowly but surely moving away from it.

I don't doubt that future architectures will owe a debt to Netburst, but they aren't going to focus on clockspeed the same way. The costs outweighed the benefits.

arjan de lumens · Apr 4, 2004

Saem said:
Take a look at actual Prescott vs Northwood benchmarks - at 3.2 GHz, the Prescott generally falls behind Northwood by about 1-3 percent most of the time, so it definitely has a worse IPC at that clock speed. There are many changes done in Prescott that are meant to help clock speed, but also do hurt IPC: L1 and L2 cache latencies are doubled, the execution pipeline is 55% longer (from 20 to 31 steps), the integer ALUs are apparently no longer double-pumped, and many common FPU/SSE instructions have got 1 clock extra latency.

http://www.aceshardware.com/read.jsp?id=60000315

According to the first chart the L1 and L2 cache latencies seem unchanged.

Yes, they made those changes for clockspeed, but the larger caches and improvements to HT do feed the CPU better. They also added more execution units and various other improvements in terms of scheduling and issuing.

According to Intel, latencies are unchanged; according to
http://www.anandtech.com/cpu/showdoc.html?i=1956&p=8
the *measured* latencies of the caches have increased dramatically. So either Anandtech or Intel is feeding us lies.

As for the "added" units, the new barrel shifter is nice, the new multiplier still sucks (IMUL reduced from 14 to 10 cycles; Athlon64 uses 3 cycles), and the new horizontal add instructions suck ass performance-wise (13 cycles latency; replacing them with SHUFPS->ADDPS sequences as you would do in Northwood is actually faster, making the new instructions useless). The only HT improvements I am aware of in Prescott are the new MONITOR and MWAIT instructions, which require at least an OS kernel patch and only really avoid spinlocks.
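For readers unfamiliar with the SHUFPS->ADDPS replacement, here is a lane-level sketch of one way it can reproduce a horizontal add. It only models which lanes end up where (the Python helpers are illustrative stand-ins for the SSE instructions, and the two shuffle immediates are one possible choice); it says nothing about latency or register copies.

```python
def shufps(a, b, imm8):
    """Lane model of SSE SHUFPS: low two result lanes come from a, high two from b."""
    sel = [(imm8 >> (2 * i)) & 3 for i in range(4)]
    return [a[sel[0]], a[sel[1]], b[sel[2]], b[sel[3]]]


def addps(a, b):
    """Lane model of SSE ADDPS: element-wise add of four floats."""
    return [x + y for x, y in zip(a, b)]


def haddps(a, b):
    """Lane model of SSE3 HADDPS: sums of adjacent pairs from a, then from b."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]


def haddps_via_shuffle(a, b):
    """Same result built from two shuffles and one packed add."""
    evens = shufps(a, b, 0x88)  # [a0, a2, b0, b2]
    odds = shufps(a, b, 0xDD)   # [a1, a3, b1, b3]
    return addps(evens, odds)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
print(haddps(a, b), haddps_via_shuffle(a, b))
```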

Saem · Apr 4, 2004

According to Intel, latencies are unchanged; according to
http://www.anandtech.com/cpu/showdoc.html?i=1956&p=8
the *measured* latencies of the caches have increased dramatically. So either Anandtech or Intel is feeding us lies.

I haven't been to Anandtech in a long time. That's interesting. So, could testing methodology be giving misleading results or you think Intel is being shady?

There are other issues involved too, especially power draw and thermals. Current Prescotts have the highest power draw and heat generation of any available processor. LGA 775 and later steppings will probably mitigate these issues, but they won't go away. If nothing else, Intel's push towards the BTX form factor is evidence of this. The Netburst architecture's relentless focus on pure clockspeed is running into problems, and that's why Intel is slowly but surely moving away from it.

Agreed, power draw is a significant issue, but wouldn't a more brainy design with more execution resources suck up lots of power as well? Mind you, execution resources don't just sit idly; AFAIK, to avoid radical power fluctuations they're given "busy-work".

Yes, the BTX is a sign of need for greater cooling.

I think you're overstating your case. I will agree that Netburst is aggressive in trying to get higher clockspeed, but you seem to say that's all it's doing, which doesn't seem quite right. Currently, the A64 seems to be relying a fair bit on more exotic materials and a fair bit more process tweaking. Of course they seem to have gotten over this, but pushing the clock seems to be a rather severe issue for them.

As for the "added" units, the new barrel shifter is nice, the new multiplier still sucks (IMUL reduced from 14 to 10 cycles; athlon64 uses 3 cycles)

I have a feeling that during simulations it was found that reducing the latency + higher clock speed might have been more economical -- transistor-wise, financially or what, I don't know. BTW, does the P4 have a dedicated address calculation engine or does it use the existing integer logic?

and the new horizontal add instructions suck ass performance-wise (13 cycles latency; replacing them with SHUFPS->ADDPS sequences as you would do in Northwood is actually faster, making the new instructions useless). The only HT improvements I am aware of in Prescott are the new MONITOR and MWAIT instructions, which require at least an OS kernel patch and only really avoid spinlocks.

Hrm, I shouldn't have said HT. The various larger buffers and instruction queues cover the improvements that I wished to state. Mea culpa.

arjan de lumens · Apr 4, 2004

Saem said:
According to Intel, latencies are unchanged; according to
http://www.anandtech.com/cpu/showdoc.html?i=1956&p=8
the *measured* latencies of the caches have increased dramatically. So either Anandtech or Intel is feeding us lies.

I haven't been to Anandtech in a long time. That's interesting. So, could testing methodology be giving misleading results or you think Intel is being shady?

Dunno; the given benchmarks may or may not be broken, but it would be too obvious if someone got different results with the same benchmarks.

As for the "added" units, the new barrel shifter is nice, the new multiplier still sucks (IMUL reduced from 14 to 10 cycles; athlon64 uses 3 cycles)

I have a feeling that during simulations it was found that reducing the latency + higher clock speed might have been more economical -- transistor-wise, financially or what, I don't know. BTW, does the P4 have a dedicated address calculation engine or does it use the existing integer logic?

For the cache, it has dedicated logic (all addressing modes are equally fast), but this logic is not accessible to the LEA instruction. So LEA is internally unrolled to whatever instruction sequence is needed to implement its operation, including a shift instruction if any of the scaled index addressing modes were used (ruining its performance on Northwood). As for the multiply, you can generally trade off transistors versus performance over a rather wide performance range; the fastest known integer multiplier designs are about 3-4x slower than the fastest known integer adder designs at the same operand sizes, so I expected Intel to reach a latency of about 4 or 5 cycles when they said that there would be a separate, faster integer multiplier. 10 cycles sounds more like they just split out integer multiplies to ease instruction scheduling a bit (IIRC, Northwood actually used the FPU for integer multiplies, a trick which is bad for performance but which Intel has been fond of in the past - Pentium1 used it, Itanium uses it, etc).
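To make the LEA point concrete, here is a minimal sketch of how a scaled-index LEA could be split into a shift and two adds. The exact micro-op sequence the P4 uses is not given in this thread, so the split below is an assumption for illustration only, and both function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def lea(base: int, index: int, scale: int, disp: int) -> int:
    """Reference result of LEA reg, [base + index*scale + disp] with 32-bit wraparound."""
    return (base + index * scale + disp) & MASK32


def lea_unrolled(base: int, index: int, scale: int, disp: int) -> int:
    """Assumed split into simple ops: one shift for the scale, then two adds."""
    shift = {1: 0, 2: 1, 4: 2, 8: 3}[scale]  # x86 only allows scales 1, 2, 4, 8
    tmp = (index << shift) & MASK32          # shift op (the slow part on Northwood)
    tmp = (tmp + base) & MASK32              # add op
    return (tmp + disp) & MASK32             # add op


print(hex(lea(0x1000, 3, 4, 8)), hex(lea_unrolled(0x1000, 3, 4, 8)))
```

With scale 1 the shift amount is zero, which matches the post's point that only the scaled index modes pay the shift penalty.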

Rugor · Apr 4, 2004

I may not be making myself quite clear. I don't think Netburst is pushing higher clockspeeds to the exclusion of all else, but I do think higher clockspeed scalability, and hence marketability has a higher priority than increasing performance.

As to power usage, yes, increasing IPC does increase power draw: everything does. The real question is which gives a greater performance increase for a given increase in power draw. Right now the performance increase of an Athlon64 over an Athlon XP of the same rating is much greater than its increase in power draw. Prescott on the other hand has a much greater increase in power draw than performance over an equally clocked Northwood. This is a problem, and not one that's likely to go away soon. Additional pipeline stages draw power, and without increased execution resources those stages are not increasing performance. They are increasing potential performance (by improving scalability) but not actual performance.

These are just some of the reasons why I don't think the Netburst architecture was the best way for Intel to go.

AMD definitely did have problems with pushing the clock, and that was a real problem with the later XPs. However the A64 line not only seems to be scaling better, but also increased performance clock for clock over its predecessor. However, I don't think their SOI technology is that much more exotic than Intel's strained silicon.

Saem · Apr 4, 2004

AMD definitely did have problems with pushing the clock, and that was a real problem with the later XPs. However the A64 line not only seems to be scaling better, but also increased performance clock for clock over its predecessor. However, I don't think their SOI technology is that much more exotic than Intel's strained silicon.

Hrm, didn't know Intel was using Strained Silicon, if this is pertaining to the prescott, my haste prohibited me from getting this tidbit -- along with others.

In any case, I'm not entirely convinced the A64 is scaling well. Then again, you can't really tell unless a performance war starts up. We'll see how that goes.

10 cycles sounds more like they just split out integer multiplies to ease instruction scheduling a bit (IIRC, Northwood actually used the FPU for integer multiplies, a trick which is bad for performance but which Intel has been fond of in the past - Pentium1 used it, Itanium uses it, etc).

Well with lots of stages, filling pipes and killing bubbles is their primary focus, not sure if you can really say it's a bad design decision.

Rugor · Apr 4, 2004

Yes, strained silicon is pertaining to the Prescott. It's Intel's first if not the first processor released on a strained silicon process.

As to Athlon64's scaling, that's been demonstrated in two ways: First, tests on both Opteron and Athlon64 based systems have shown that the architecture scales very well with increasing clockspeed. Secondly, with the new FX-53 AMD has finally released a processor clocked above 2.2GHz. 2.4GHz may not be a lot by Intel's standards, but for AMD that's a huge step forwards.

jvd · Apr 4, 2004

The scaling is there for the A64s. There was info that before launch there were air-cooled 2.7 GHz versions.

Deadmeat · Apr 4, 2004

...

The NetBurst's demise can be blamed on the shortcoming of semiconductor technology and not on the architecture itself.

When the NetBurst architecture was being developed in the late 90's, no one foresaw the nasty power leakage problem; as far as the architects were concerned, the transistor density would continue to increase, while the power consumption would continue to go down. So why not put the massive transistor count to good use and develop a flexible and latency-tolerant architecture? Hence the NetBurst was born, a microprogrammable superpipelined architecture that could easily be adapted to add in new capability later (even new instruction sets).

Intel's strategy actually worked well for a while, until the transition to 90 nm began; now the transistors were leaking so much current that it took a great deal of power just to keep the circuit switching. No one foresaw this, and the NetBurst went the way of the dodo in the changing semiconductor environment.

Transistors will continue to shrink no doubt, but you cannot jack up the clockspeed anymore and expect automatic performance boost from it. It will be the power consumption that will actually dictate the transistor count on the chip and not lithography.

Rugor · Apr 5, 2004

Interesting points Deadmeat, but I do consider it a flaw in the architecture, or at least the design strategy, that it would be held hostage by the semiconductor industry's ability to maintain its rate of performance increase through process shrinks.

Even when Willamette was introduced it became obvious that heat and power issues were going to be its Achilles Heels. However the architecture was predicated on the idea that these would always be relatively simple to overcome, and that scaling would continue indefinitely. Instead, the relentless pursuit of increased clockspeeds only brought those problems to a head sooner.

Intel would have been better served with continuing the Northwood strategy of increasing performance per clock as well as clockspeed, rather than pushing for even greater clockspeeds.

jvd · Apr 5, 2004

The problem was the huge clock speed lead the Netburst needed compared to the Athlons.

If it was only a 100-200 MHz increase to stay even, it would have been worth it. Even 500 MHz.

But the Athlon XP 3200+ is clocked at 2.2 GHz, and in about 30% of tests it surpasses a 3.2 GHz P4.

That is just unacceptable.

With the A64, that 2.2 GHz is now faster than the 3.2 GHz P4. Which is just another huge blow to Intel.
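As a quick back-of-the-envelope check on those clock figures, the per-clock advantage a 2.2 GHz part needs just to tie a 3.2 GHz part works out as below. It assumes performance scales linearly with clock times IPC, which real workloads only approximate, and the function name is made up for illustration.

```python
def required_ipc_advantage(slow_clock_ghz: float, fast_clock_ghz: float) -> float:
    """IPC ratio the lower-clocked chip needs to match the higher-clocked one."""
    return fast_clock_ghz / slow_clock_ghz


# Clock speeds quoted in the post: 2.2 GHz Athlon vs 3.2 GHz P4.
ratio = required_ipc_advantage(2.2, 3.2)
print(f"needs about {ratio:.2f}x the work per clock to break even")
```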
