Vince: no chips shipping in 2005 using 90nm? Look at the Sun

Gubbi said:
Panajev2001a said:
Gubbi, Intel and HP will deliver...
Itanium II is no performance joke, and it is only Intel's 2nd-generation IA-64 implementation, while the upcoming Prescott is a 7th-generation IA-32 part ( and more like the third revision of the 7th-generation IA-32 core )...


IPF has the same OS as IA-32 has... Windows XP ( yes, the 64-bit version ), and that means a uniform programming model ( Win32 API, DirectX, etc... ) and tools ( MS VS.NET, the Intel reference compiler, etc... )...


They haven't delivered yet. Merced was supposed to be faster than contemporary RISCs; it delivered about half their performance.

Second-generation IPF was supposed to greatly exceed the competition; it is (very) competitive on floating point and just barely keeping up on integer performance.

Merced was initially "supposed" to deliver that jump, but quite quickly Intel recognized that their lack of experience with the new architecture ( well, compared to x86 ) and the fact that they were designing a CPU whose ISA had never had an implementation running in production environments ( using real software, doing meaningful tasks... that kind of information gives you better data on where a CPU needs to be pushed... before we can make the common case fast we need to know what the common case is ) were not really yielding the planned results... and that's when the Itanic joke started...

Itanium 2 was supposed to be the first IA-64 processor to go out as a full-fledged product rather than as a demo of the IA-64 ISA... and guess what: the laughter in the industry and the funny faces of chip designers all turned serious when the topic shifted to McKinley...


And Microsoft still has to prove that they can produce an enterprise-class OS. Furthermore, what's the point of a 64-bit chip when the API is 32-bit? I am fairly confident that the serious applications for Win64 will use the native NT API and not Win32.

First, you use it because it is fast... 64-bit addressing or not... the bitness is relative in some cases, and the chip is fast; running the Win32 API on Windows XP means familiar tools and a familiar environment ( a reference model for ISVs )...

Panajev2001a said:
IPF in 130 nm will come, and soon after that ( well, relatively soon, 2005 ) we will receive the dual-core revision with 9 MB of cache per core ( 90 nm, 18 MB total L3, 1 billion transistors... it depends whether they save the 65 nm process for the new IPF made by the ex-Alpha guys or use it for the x86 platform... internal rivalry between the IA-64 and IA-32 teams will play a role )... and the 130 nm Itanium 2 should ship at the end of this year with up to 6 MB of L3 cache at 1.5 GHz, and then be upgraded to 9 MB of L3 cache in 2004...

But these are just shrinks of the existing core. All their competitors will shrink their designs as well; POWER4 is already dual-core, and SPARC will be.

Seeing the single-core, 7-stage, 0.18 um Itanium 2 keeping up well enough with the dual-core POWER4 is not a negative sign... and the point is not whether other competitors will be able to shrink one day ( a die shrink is no easy job, btw... ), but who can keep their foot on the gas in terms of ever more aggressive manufacturing processes... and I do not see Sun faring THAT well in that regard...

Panajev2001a said:
Process-wise, if you compare, as some have done, the .18 um Itanium 2 with the .18 um Pentium 4 ( Willamette ), it is quite clear Itanium 2 is not sucking THAT hard...

Apples and oranges. Granted, the I-2 is faster, but it also has a vastly larger die (and produces more heat) and a more aggressive memory subsystem.

Wait... let's compare it with the Pentium 4 Xeon based on the Willamette core in .18u first... that should level off some figures...

Itanium 2 does have more cache, yes, but the die is still far from badly bloated; Itanium 2 also includes an x86 core in hardware, designed to provide backward compatibility with x86, and that takes space and produces quite a lot of heat...

Panajev2001a said:
People stopped laughing after Merced... Itanium 2 showed IA-64 is no joke... come on, these are the same guys who kept working miracle after miracle for the x86 platform, and IA-64 is a more recent and better-developed ISA...

Actually they are not the same. I-2 was mostly developed by a former HP design team (now Intel). Of course the process people are all Intel.

As for IA-64 being a better ISA, time will tell. It was conceived at a time when people thought out-of-order schedulers wouldn't scale. This was proven wrong over time (the Pentium 4 being the prime example).

It was based on the premise that hardware scheduling is BAD. So HP came up with EPIC, which essentially is compressed VLIW. Instructions are bundled together three at a time, and template bits describe where the instructions are supposed to be scheduled (what type of exec unit each instruction goes to). Now, this only works if the processor has a full complement of exec units to match the different bundle types. This is why Merced sucks and McKinley doesn't: Merced lacks execution units, so the instructions in the bundles have to be scheduled (extending the length of the pipeline) onto the available exec units, and stalls are frequent.

McKinley has the exec units to match two whole bundles, hence bundles are fetched, template bits decoded, and the instructions handed straight down to the execution units, which is much faster.

I know the story of EPIC and the problems that McKinley fixed over Merced ( several things ): a faster FSB, lower-latency caches ( L1, L2 and L3 were really well balanced... L1 for latency, L2 for speed and L3 for size [and the latency to the on-chip L3 is nice too... an advantage that came from embedding the L3 on the die, with no need for an expensive off-chip SRAM module that eats PCB space too] ), two more Memory units ( yes, they can do standard integer math too ), a shorter pipeline, etc...
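( To give a feel for why the on-die L3 matters, here is a quick average-memory-access-time sketch in Python; every latency and hit rate in it is invented for illustration, none of them are Itanium 2 or Merced figures: )

[code]
# Toy average-memory-access-time (AMAT) comparison: an on-die L3 versus a
# slower off-chip SRAM L3 module. Every latency (in cycles) and hit rate
# below is made up for illustration only.
def amat(l1_lat, l2_lat, l3_lat, mem_lat,
         l1_hit=0.90, l2_hit=0.90, l3_hit=0.95):
    return (l1_lat
            + (1 - l1_hit) * (l2_lat
            + (1 - l2_hit) * (l3_lat
            + (1 - l3_hit) * mem_lat)))

on_die   = amat(l1_lat=1, l2_lat=6, l3_lat=12, mem_lat=200)
off_chip = amat(l1_lat=1, l2_lat=6, l3_lat=40, mem_lat=200)
print(f"AMAT with on-die L3:   {on_die:.2f} cycles")   # ~1.8
print(f"AMAT with off-chip L3: {off_chip:.2f} cycles") # ~2.1
[/code]

Even with made-up numbers, shaving the L3 round trip shows up directly in the average access time.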

The issue with IA-32 vs IA-64 is which is the better ISA... to me it appears that IA-64 fixes many of the mistakes that we have kept carrying with us since the beginning of x86 ( variable instruction length, low number of logical GPRs, etc... ) and have kept patching and patching and patching...

Scheduling ( out of order ) is not necessarily bad ( in IPF loads and stores are allowed to be reordered in several cases ), but it is transistor-heavy, and the thought was to take away complex HW-based dynamic scheduling with this EPIC paradigm, just as RISC did away with the really complex instruction decoding characteristic of CISC...

They really believe that by creating a nice, fully featured ISA and good enough compilers they can do most of the work at compile time, evolving the CPU accordingly...

The two extra Memory units were added because on too many occasions Merced could not issue two bundles per cycle: you would stall as soon as you hit the first point of so-called "resource over-subscription"...

MFI + MMI... we would issue only 4 instructions out of six here...
In the case of MMI + MII we only issue the first bundle and not the second one... there are lots of cases in which you cannot issue both bundles...
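( A minimal Python sketch of that issue limit, if it helps: the unit counts below ( Merced with 2 M, 2 I, 2 F and 3 B units; McKinley with two extra M units that can also execute plain integer ops ) are rough figures from this discussion and the linked article, and the greedy in-order issue model is a simplification, not how the real front end is documented: )

[code]
# Rough sketch of the "resource over-subscription" issue limit described
# above. Unit counts are approximations taken from this thread and the
# linked RealWorldTech article, not official Intel figures.

# Slot requirements of a few IA-64 bundle templates: each bundle holds three
# instruction slots, and the template says which unit type each slot needs
# (M = memory, I = integer, F = floating point, B = branch).
TEMPLATES = {
    "MII": ("M", "I", "I"),
    "MMI": ("M", "M", "I"),
    "MFI": ("M", "F", "I"),
    "MIB": ("M", "I", "B"),
}

MERCED   = {"M": 2, "I": 2, "F": 2, "B": 3, "M_does_I": False}
MCKINLEY = {"M": 4, "I": 2, "F": 2, "B": 3, "M_does_I": True}

def issued(chip, first, second):
    """Greedily issue two bundles in order; stop at the first slot that
    cannot find a free unit (an in-order front end cannot skip slots)."""
    free = dict(chip)
    count = 0
    for slot in TEMPLATES[first] + TEMPLATES[second]:
        if free[slot] > 0:
            free[slot] -= 1
        elif slot == "I" and chip["M_does_I"] and free["M"] > 0:
            free["M"] -= 1   # a spare M unit absorbs the integer op
        else:
            break            # over-subscribed: stall here
        count += 1
    return count

for pair in [("MFI", "MMI"), ("MMI", "MII")]:
    for name, chip in [("Merced", MERCED), ("McKinley", MCKINLEY)]:
        print(f"{name}: {pair[0]} + {pair[1]} -> {issued(chip, *pair)}/6 issued")
[/code]

Run it and Merced gets 4/6 and 3/6 for the two pairs above, while McKinley issues all six instructions in both cases.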

Here is Merced:

McKinley-fig3.gif


And here is McKinley:

McKinley-fig5.gif


( this article covers it nicely: http://www.realworldtech.com/page.cfm?AID=RWT071901001629 )



I can see two problems in the future for IPF.

1.) Imagine you have profiled different applications and found that your IPF processor lacks integer performance. The problem is that the way to increase performance is to make the processor crack another bundle per cycle, and to do that fast you need a full set of execution units. Hence you get an extra floating-point unit, extra branch units and an extra load/store unit (demanding another port in the cache, impairing cycle time) which you don't really need or want.

This question is a bit too generic... what is the limitation? Can it be covered by a more aggressive compiler and more FDO ( Feedback-Directed Optimization )? Would lowering cache latency or increasing main memory bandwidth help? Would it help to add more cache?
Should we work on increasing the clock frequency?
Should we release a multi-core solution?

There are things we can do before moving to 9 or 12 instructions per cycle ( 3 or 4 bundles )... and even when we do we might find interesting ways of doing it... Intel has amazed me in the past...


2.) Future implementations are likely to be multithreaded. The apparatus needed to schedule and track which instructions belong to which context is very similar to the register renaming and OOO scheduling you find in OOO CPUs, which is why Hyper-Threading in the P4 only took an extra 5-10% of die area (similar numbers from the Alpha people regarding EV8). IPF, however, has none of this. You can of course build an OOO multithreaded implementation of IPF, but then the template bits in the bundles are basically discarded (i.e. baggage). And your compiler is stuck generating code for the bundle instruction format, filling each bundle with NOPs for every slot it can't use, which in turn wastes I-cache and fetch/decode bandwidth.

You are only thinking about Simultaneous Multi-Threading, or SMT... that is not the only available MT solution... there is also Switch-on-Event Multi-Threading ( also known as CMT, or Coarse-grained Multi-Threading )...

The most basic type of TLP exploitation that can be incorporated into processor hardware is coarse grained multithreading (CMT), shown in Figure 1B. The processor incorporates two or more thread contexts (general purpose registers, program counter PC, process status word PSW etc.) in hardware. One process context is active at a time and runs until an exception occurs, or more likely, a high latency operation such as a cache miss during a load instruction. When this occurs, the processor hardware automatically flushes and changes the thread context, and switches execution to a new thread.

For contemporary MPUs, a memory operation initiated in response to a cache miss can take over a hundred clock cycles, which represents the potential execution of hundreds of instructions. A conventional in-order processor will simply stall and forever lose those hundreds of potential instruction slots waiting for memory to respond with needed data. A conventional out-of-order execution processor has the potential to continue to execute other instructions that weren't dependent on the missed load data. However, independent instructions tend to be quickly exhausted in most programs and the processor simply takes longer to stall.

But a coarse grained multithreaded processor has the opportunity to quickly switch to another thread after a cache miss and perform useful work while the first thread awaits its data from memory. Many programs spend considerable time waiting for memory operations and a coarse grained multithreaded processor has the opportunity to increase overall system throughput, compared to a conventional processor performing OS-based multitasking.

EV8-Part2-Fig1.gif
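( To put a rough number on that, here is a toy Python model of switch-on-event multithreading versus a single-context in-order core; every rate, latency and switch cost is made up for illustration, and it optimistically assumes the second thread always has independent work ready: )

[code]
# Toy throughput comparison: single-context in-order core vs. a coarse-grained
# (switch-on-event) multithreaded one. All numbers are invented for the
# example; they are not measurements of any real chip.
INSTRUCTIONS = 1_000_000   # instructions per thread
MISS_RATE    = 0.01        # cache misses per instruction
MISS_LATENCY = 100         # cycles to service a miss ("over a hundred")
SWITCH_COST  = 5           # cycles to flush the pipe and switch contexts

def single_context_cycles():
    # The core stalls for the full miss latency on every miss.
    misses = INSTRUCTIONS * MISS_RATE
    return INSTRUCTIONS + misses * MISS_LATENCY

def cmt_cycles(contexts=2):
    # On a miss the core switches to another ready context, so (in this
    # optimistic model) only the switch-out/switch-back cost is exposed.
    total = INSTRUCTIONS * contexts
    misses = total * MISS_RATE
    exposed = min(MISS_LATENCY, 2 * SWITCH_COST)
    return total + misses * exposed

print(f"single context IPC ~ {INSTRUCTIONS / single_context_cycles():.2f}")  # ~0.50
print(f"2-context CMT IPC  ~ {2 * INSTRUCTIONS / cmt_cycles(2):.2f}")        # ~0.91
[/code]

With these made-up numbers the single-context core loses half its cycles to misses, while the two-context CMT core recovers most of them, which is exactly the throughput argument quoted above.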


Who knows what other tricks people like the Alpha guys might have in store for IA-64... and who knows how much cheaper it will become to manufacture once it hits 90 nm, IA-32 support is moved back to software emulation ( once the processor is fast enough to do that acceptably ), and the transistors wasted on HW x86 compatibility can be used for something else...

Panajev2001a said:
Alpha is dead...

I do not expect even the "would have been wonderful" EV7 to make a real dent in anyone's business...


Sadly, yes.

Even though it appears to be sandbagged (low operating frequency), with a smaller die and lower power consumption it still beats IPF on everything but some floating-point applications (see the recent SAP benchmarks? ;) )

Cheers
Gubbi

Yeah, sadly Michael Capellas and Carly Fiorina ( the "I killed Lucent and then I left for HP" lady :p ) seem to have left the EV7 a ghost of its former self... and still it shows potential :(
 
Panajev2001a said:
Itanium == joke

Itanium 2 == first serious product...

Agreed. Funny, I was using the same Paul DeMone article as a basis.

My main point is that the biggest thing IPF had going for it was hype. People don't buy that anymore. McKinley is competitive but not obliterating the competition as Intel and HP would have you believe.

The only niche I see for IPF in the 2-4 CPU space is high-end workstations with a _heavy_ emphasis on FP. In the 2-4 CPU server market Xeon and Hammer will dominate completely.

Then what remains are the larger server configurations, and here IPF is up against SPARC, Power and Alpha (if you can convince HP to sell you one), all of which are system-centric designs.

Cheers
Gubbi
 
a4164 said:
I'm sorry to bring this up but didn't Itanic, err I mean Itanium have a horrible year in 2002:

"Just 3,500 of the estimated 4.5 million servers shipped last year used Intel's Itanium processor"

That's pretty abysmal. 3,500 out of 4,500,000 in 2002 alone.

Certainly disappointing, but remember that these are Merced systems, serious underperformers. SGI's new Altix, HP's upcoming SuperDome upgrade and others are all based on McKinley and likely to be much more competitive.

Cheers
Gubbi
 
Tagrineth said:
Servers and high-end workstations.

SPARC memory subsystem is quad-channel SDRAM. Throughput is completely insane.
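( For scale, a back-of-the-envelope peak figure; it assumes PC100-class SDRAM on four 64-bit channels, which is an illustration rather than the spec of any particular SPARC box: )

[code]
# Hypothetical peak bandwidth of a quad-channel SDRAM memory subsystem.
# PC100-class parts on 64-bit channels are an assumption for illustration.
channels  = 4
bus_bytes = 64 // 8      # 64-bit channel width, in bytes
clock_mhz = 100          # one transfer per clock for single-data-rate SDRAM
peak_mb_s = channels * bus_bytes * clock_mhz
print(f"peak ~ {peak_mb_s} MB/s (~{peak_mb_s / 1000:.1f} GB/s)")  # ~3200 MB/s
[/code]

A healthy peak for the era, though sustained throughput would of course come in lower.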

Funny, I never felt that when using one. I would have gladly chucked each and every Ultra 10 workstation out of the window and broken them on the street out of frustration with how slow they were. Perhaps it was amplified by the fact that the only reason we were using them was Java programming, which would have been faster and easier on a P266.

edit: I'm also a Solaris hater, but that probably had more to do with the fact that the user interface was about equivalent to linux running twm, and less to do with what the system was capable of.
 
I'm sorry, perhaps I haven't been following the server market as much as I should, but I thought McKinley (Itanium 2) had already been out since Summer 2002, available at 900 MHz (1.5 MB L3) and 1 GHz (1.5-3 MB L3), both still on 0.18 micron, with Madison coming in Summer 2003 at 1.5 GHz (6 MB L3) on 0.13 micron.

I thought those numbers for Itanium servers shipped in 2002 included McKinley-based Itaniums.

http://www.intel.com/pressroom/arch...iid=ipp_srvr_proc_itanium2+info_20020708comp&

http://www.intel.com/products/server/processors/server/itanium2/index.htm?iid=sr+itanium&

http://www.intel.com/products/roadmap/index.htm?iid=ipp_srvr_proc_itanium2+prod_roadmap&
 
Tagrineth said:
And SUN's own 3D workstation cards (Creator3D, Expert3D) are sickeningly powerful. Think 3Dlabs, squared.

Funny, the Expert3D specs don't look that hot...

6 MTri/s, 143 MPix/s, 128 MB onboard... Of course those should be pretty guaranteed rates even when the going gets tough and the models get big. And of course there's the option of throwing several cards into one box. But still, Tagrineth, are you sure you're not over-estimating these? :)

Edit: I couldn't find anything on the Creator3D (odd, I've seen stuff there before), but what I found on Elite3D was "47.2 MTri/s and 3.2 GPix/s" for an octuple-card configuration, i.e. roughly 5.9 MTri/s and 400 MPix/s per card, suggesting that a single card isn't blindingly impressive.

And Panajev, thanks a lot for creating the widescreen effect :p
 
Hokay. Kinda expected you to have something up the sleeve there :)

Any linkage to Sun's HSR method? Always was fascinated by power tools ;-)
 
Gunhead said:
Hokay. Kinda expected you to have something up the sleeve there :)

Any linkage to Sun's HSR method? Always was fascinated by power tools ;-)

Nope. I don't know if anyone knows just what it does... outside of SUN that is. ;P None of the tech docs mention it... but it's there alright.
 