maybe. then again, there may be hw-posed issues for which you may never find a satisfactory solution in sw. as you said, since it's always the combination of hw and sw, a potential design issue discovered late in the hw (due to late sw, etc) could amplify the burden on the final sw by orders of magnitude. but since you brought up Abrash's role in the project - if Mike could not deliver a satisfactory rasterizer running on this hw, then maybe the hw was not meeting its GPU-domain targets?.. just speculating here, of course.
There's definitely a strong interaction between the software and hardware. Michael Abrash, Mike Sartain, Tom Forsyth and other people from the software team were
directly involved in the design of LRBni, and probably many other hardware aspects of Larrabee. It's quite possible the hardware didn't behave as expected, and this had big consequences for the software, causing much delay. Or possibly they misjudged some software aspects and this demanded a hardware redesign. But either way you can't just make hardware adjustments and expect the performance of the software to go through the roof. You still have to balance the use of all available resources and major software changes can be required for minor hardware changes. At least that's my personal experience hopping from one CPU generation to the next, and Larrabee is an entirely different animal.
Anyway, all I'm saying is that the software side of this is more complex than for any comparable computing device. Software schedules slip all the time, and considering the strong interaction with the hardware and the high targets I'm really not surprised Larrabee got delayed, despite some very bright minds working on it. It doesn't mean it's cancelled for good, and it doesn't mean x86 is the culprit for the delay.
i really don't understand your logic about ISA acceptance - ISAs today are mostly judged by how well they play with their compilers (from coders' perspective), and by how well they meet their price/performance/wattage envelopes (from EE/business perspectives) - not by their proximity to their 40th anniversaries. look around you - chances are you'll find more devices hosting 'young' ISAs than ones hosting x86 or older. so what reception, and by whom, are you concerned about?
PC developers. The ability to create libraries that can run on a CPU as well as on Larrabee is a major advantage. And being able to use familiar tools also lowers the barrier to entry. Also, selling CPU-compatible binaries creates an incentive to write not only full-blown applications but also libraries and tools for other developers: a software ecosystem.
dramatically - no. but then again LRB's scalar ISA did not have to copy any existing ISA verbatim - intel could have used that to their best advantage if they hadn't been so concerned with the central socket. heck, they could've used a modernized form of their dear x86 (say, x86-64) and shaved off the legacy, trimmed the pipeline, retooled the protection mechanism - any/all of the above, just to make LRB a better GPGPU core, while staying in familiar waters. but nooo - it had to be the word97-compliant GPU.
Being able to run the same libraries on a CPU and on Larrabee means they need a common base: the Pentium instruction set (with support for x86-64). Everything else is part of an extension so it's no different than developing for CPUs with support for different ISA extensions.
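To make that concrete, here's a minimal C++ sketch (hypothetical names, not from any real Larrabee SDK) of the dispatch pattern such a common baseline enables: one binary whose hot loops pick a vector path where the extension is present and fall back to plain Pentium-class scalar code everywhere else, exactly as is done today for SSE or AVX.

#include <cstddef>

// Scalar baseline: runs on any x86 core, CPU or Larrabee alike.
static void scale_scalar(float* dst, const float* src, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}

// Placeholder for a hand-vectorized path (SSE, AVX or LRBni intrinsics).
// It just forwards to the scalar version here so the sketch compiles.
static void scale_vector(float* dst, const float* src, float k, size_t n)
{
    scale_scalar(dst, src, k, n);
}

// Assumed feature probe; real code would query CPUID. Not an actual API.
static bool has_wide_vector_isa()
{
    return false;
}

void scale(float* dst, const float* src, float k, size_t n)
{
    if (has_wide_vector_isa())
        scale_vector(dst, src, k, n);   // extension path
    else
        scale_scalar(dst, src, k, n);   // common baseline
}

The application code calling scale() is identical either way; only this one dispatch point knows which ISA extension is underneath.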
Besides, I really doubt you can make a significantly better x86 core by shaving off legacy stuff. An i386 supports 99% of all Pentium instructions (minus x87) and weighs in at only 275,000 transistors. And in fact they did avoid the biggest overhead by sticking to in-order execution.
i agree, the task could become prohibitively complex if they were going for that head-shot at the GPU - 'behold, we're the new GPU masters!' - again, i don't think anybody (clinically sane) expected LRB to dethrone any GPU heavyweights - if intel themselves had such expectations then maybe they were not familiar with the problem domain they were getting involved in. again, not saying they had such expectations - just trying to carelessly speculate on the events we've witnessed lately.
It takes a certain level of confidence to enter the GPU market, and that confidence can quickly turn into cockiness. While I still believe they could eventually succeed at creating an interesting GPU, it's not unlikely that their original time schedule was too cramped.
i'd venture to guess ms' ref performance issues are mostly algorithmic, and only secondarily a matter of insufficient clock-counting. IOW, one cannot say that had they 'coded to the metal' of any ISA they'd have achieved much better performance. how does that prove the (non-)fitness-to-a-task of an ISA, though?
My point was that software performance varies by a huge amount, with the ISA choice being of only secondary importance. And Larrabee depends mostly on LRBni anyway, not the legacy x86 portion. So once again it's more likely it has been delayed due to software issues than ISA issues. I can't exclude other hardware issues, but it's very unlikely they'll touch x86, even if that was an option.
maybe i misunderstood you, but i believe you mentioned something about 'incremental performance improvements of present code' in your original post, ergo my performance reference. pardon me if i've erred.
Initially, code targeting the CPU will run slower on Larrabee, and it can then be improved incrementally to run faster.
so let me know if i got you right here: developers would really like to spare themselves a trivial recompile of their existing scalar code to a new scalar ISA, but they would eagerly face the challenges of a brand new VPU which, apropos, is only meant to carry the bulk of the workload? hmm..
Trivial recompile? Porting code is much more complicated than a recompile. For a lot of external libraries you might not even have the source code. Having to rewrite those really discourages a lot of developers. If instead you're able to get a prototype running on day zero it really boosts confidence. You might eventually end up rewriting those binary libraries anyway if it helps performance, but at least you'll be able to do that incrementally one function at a time, which is a massive advantage. Take it from a TransGamer.
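As a hypothetical illustration of that "one function at a time" approach (plain C++, nothing Larrabee-specific), the application can call its hot routines through a small table of function pointers, so each one can be swapped from the original recompiled version to a tuned replacement independently, without touching the rest of the code base:

#include <cstddef>

typedef void (*blend_fn)(unsigned char* dst, const unsigned char* src, size_t n);

// Day-zero version: the existing code, simply recompiled for the new target.
static void blend_reference(unsigned char* dst, const unsigned char* src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (unsigned char)((dst[i] + src[i]) / 2);
}

// Later: a tuned replacement (vector intrinsics would go here). Until it is
// actually written, it simply forwards to the reference version.
static void blend_tuned(unsigned char* dst, const unsigned char* src, size_t n)
{
    blend_reference(dst, src, n);   // stand-in; rewrite when profiling says so
}

// One entry per hot routine; flip entries as each port is finished.
static blend_fn g_blend = blend_reference;

void enable_tuned_paths()
{
    g_blend = blend_tuned;
}

void composite(unsigned char* dst, const unsigned char* src, size_t n)
{
    g_blend(dst, src, n);   // the caller never changes during the port
}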
you can open whatever compiler you like and start using the APIs these parts prudently offer. occasionally, you might have to resort to heresies like OCL, CUDA or native compilers *gasp*. regardless, it would still be a tad better than what you could do with a LRB today.
Sure, but it wouldn't be better than what you'll be able to do with Larrabee tomorrow.
Larrabee is a long-term investment in dominating the computing world. Anything that helps smooth the path is a significant advantage. Despite carrying some overhead, x86 is such an advantage.
what delay - the parts in question are on the market. the APIs are on the market. the native compilers are coming last, but you can ask intel how that generally goes. *wink*
The delay in creating momentum in software development. Sure, you can use the APIs to create certain applications, but you can't create anything these APIs don't support, even though the processor would actually be capable of it. And even for APIs with the same functionality, developers can have a certain preference. The more things you support, the more attractive the platform becomes to a wide range of developers. And instead of writing everything themselves, Intel can offer x86 compatibility and a couple of tools to enable developers to quickly port things over, use existing APIs, and create and trade new components.
I don't really get your 'wink'. As far as I know Intel has always updated its compiler with support for new ISA extensions very quickly. Besides, with x86 compatibility Larrabee doesn't even depend on Intel to deliver all the development tools. There are plenty of compilers and other tools for x86 that only need a minor update to support LRBni.
issues with what - putting their next SIMD into their next CPU? no - they just pushed it out the door - no issues. or do you mean issues for the developers - easy use of the new extension without having to (re)learn a new ISA? let me see - no auto-utilizing/auto-vectorizing compilers for generations upon generations of the ISA, instead some rudimentary in-house libraries, and clear delegation of the onus of making use of the new extension to the app developers. which, combined with the lack of basic capabilities commonly found in other SIMD ISAs, can only mean no issues for the app developers. *grin*
Despite all that, game developers and codec developers feast on new ISA extensions like vultures. Nobody wants to be left behind, and it only takes a few developers to abstract things into libraries. Given that LRBni offers a wide range of SIMD capabilities, it will have no trouble growing a software ecosystem from nothing. It's the legacy x86 portion for which the existing ecosystem really matters.
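A minimal sketch of the kind of abstraction layer a few developers typically build around a new ISA extension (the vec4 type and its operations here are hypothetical, not any shipping library): application code targets a small vector type, and only this one header ever needs an LRBni (or SSE/AVX) specialization.

struct vec4   // portable fallback; an intrinsics-backed version would replace it
{
    float v[4];
};

inline vec4 add(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.v[i] = a.v[i] + b.v[i];   // a single vector add in the tuned build
    return r;
}

inline vec4 mul(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.v[i] = a.v[i] * b.v[i];
    return r;
}

// Code written against vec4 never sees which ISA extension is underneath.
inline vec4 mad(vec4 a, vec4 b, vec4 c)
{
    return add(mul(a, b), c);
}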
let me ask you something: why, in your sincere opinion, did developers embrace the GPU shader architectures, particularly after the advent of the HL shading languages? and why do you think intel was so determined to come up with a GPGPU of their own? i mean, after all they had the, erm.. fine GMA line of GPUs (more than half of the pc market - that's what we call a developers' embrace, right?), and they held the key to the central socket. so why?
Developers embrace shaders... for graphics. It makes a lot of sense to describe the "shading" of millions of pixels as a function in a high level language. It all breaks down though when trying to create a different data flow. After CPUs became multi-core and considerably increased SSE performance, a lot of developers stopped bothering about GPGPU.
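As a toy illustration of where that data-flow limit shows up (plain C++, no real shading language): per-pixel shading maps naturally onto an independent function per element, while even something as simple as a running sum introduces a dependency between elements that the classic shader model has no direct way to express.

#include <cstddef>

// "Shader-like" data flow: each output depends only on its own input,
// so the loop body could be lifted straight into a pixel shader.
void brighten(float* pixels, size_t n, float gain)
{
    for (size_t i = 0; i < n; ++i)
        pixels[i] *= gain;
}

// A different data flow: each output depends on the previous one.
// Expressing this in a classic shader takes awkward multi-pass tricks.
void prefix_sum(float* values, size_t n)
{
    for (size_t i = 1; i < n; ++i)
        values[i] += values[i - 1];
}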
However, multi-core CPUs aren't the best answer to generic throughput computing either. So Intel realized it had to come up with something that combines the best of both worlds...
Of course GPU manufacturers haven't been sitting on their hands either. OpenCL and architectures like Fermi chase the same goals as Larrabee. But it's pretty clear that the classic concept of shaders is getting dated. And it remains to be seen whether OpenCL and Fermi really offer the level of freedom developers are looking for, or whether it's just another half-baked solution for a specific domain. Note that for Intel it doesn't really matter. Larrabee will support OpenCL really well; each 'shader unit' comes with its own fully programmable 'controller'. Either way Intel doesn't have to bet everything on one API. It can support anything developers want, building on an existing x86 ecosystem.
so, another Q from me: what, again in your sincere opinion, failed this project? maybe Abrash & co's inability to pull a half-decent D3D/OGL implementation for it?
Abrash & Co. are very capable of pulling it off. I just believe it's not finished yet. There was no room for errors, but with a project of this complexity that was unavoidable. Nobody can predict the full consequences of every tradeoff, so even minor miscalculations can result in major redesigns.
i mean, according to you, the performance/power ratio should've been ok (unless there was something deeply screwed up in LRBni, since the rest of the chip - namely x86 and caches - was just fine), the adoption of the programming model would've been fine (x86? - woot!). pretty much everything would've been roses. and yet, no LRB on the shelves after 3 years of focused effort (by some pretty smart individuals at that, where we totally agree). so why?
Three years is nothing. G80 took at least four years of development and was not nearly as ambitious. And since many of the things that delay GPU development have been turned into software on Larrabee, which had to be written from scratch, I'm betting the "why" is a number of software issues that simply need more time to get resolved.