OT
Gubbi said:
CISC has *zero* code size advantage over RISC. Compare ARM thumb or MIPS16 to x86 and you'll find the latter losing
Well sure, but condensed ISAs are hardly what one means when one says RISC. Of course Thumb and MIPS16 deserve to be called RISC ISAs, because they are variations on "classic RISC" ISAs (and SuperH too, since it is so similar to Thumb and MIPS16), and they incorporate many of the design insights of the RISC revolution. But they pointedly fail to have many of the features shared by all general-purpose RISC ISAs: fixed-length instructions (specifically, fixed at 32 bits); 32 GPRs; and three operands on all arithmetic instructions.
ARM Thumb (in 16-bit mode), for example, only provides 8 GPRs, only 8 bits of offset on a conditional branch and 11 on a jump, and so on. It's a nice ISA for many embedded applications, but it is completely unusable for general-purpose computing. And if it wasn't clear in my post, I was talking about general-purpose computing, where 8-bit immediate fields don't cut it. If you want to play this game, I can come up with an 8-bit microcontroller that blows Thumb or any other narrow-RISC ISA out of the water when it comes to code density. But it's pretty obviously irrelevant to whether CISC has a code density advantage over RISC in general-purpose use.
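To put those offset fields in perspective, here's a rough back-of-envelope sketch (just illustrative; Thumb branch offsets count 2-byte halfwords, so the byte reach is the signed offset times two):

```python
# Rough reach of Thumb branch offsets; offsets are in 2-byte halfwords.
for name, bits in (("conditional branch", 8), ("unconditional branch", 11)):
    reach = (1 << (bits - 1)) * 2   # signed offset, times 2 bytes per halfword
    print(f"{name}: roughly +/-{reach} bytes")
```

About +/-256 bytes for a conditional branch is fine for microcontroller loops, but general-purpose code blows past that constantly, and every time it does you pay with extra instructions.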
The average instruction size of the new x86-64 is 5 bytes per instruction
If you're referring to this Paul DeMone post, then you follow his postings even more closely than I do. But he later makes it quite clear that this almost-5-bytes/instruction figure is anomalous even for the brand-new x86-64 under GCC, and certainly for normal-case x86 under a real compiler. I'm not going to take the time to come up with a real figure, but it is obviously significantly less than 5 bytes.
Yes, you can have a memory operand in there, but at the same time you only have a 2-address instruction format and fewer registers, so you'll end up with more instructions shuffling data around than in a typical RISC.
The bottom line is that x86 has a significant code size advantage over a traditional RISC in general-purpose code, roughly 20% in the case of the SPEC suite. A quick search found me this paper on dictionary compression of RISC ISAs. (Interestingly, IBM does a similar thing for some embedded RISC MPUs, rather than moving to a hybrid 16/32-bit ISA a la Thumb or MIPS16.) Check out page 9: uncompressed x86 code size averages 18% smaller than ARM and 29% smaller than PowerPC for a large sample of SPEC95 subtests. Admittedly the figures aren't perfect, as AFAICT they represent (unlinked) binary size rather than runtime code path size, which is what really matters. But they give a general idea.
In any case, the fact that x86 has a smaller runtime code size than all general-purpose RISCs is well established.
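To make the 2-address/shuffling point concrete, here's a quick back-of-envelope for something like a = b + c with all three values live in registers (byte counts assume the plain reg-reg x86 encodings; a classic RISC uses one fixed 32-bit instruction):

```python
# Bytes and instruction counts for "a = b + c", all operands in registers.
x86  = [("mov eax, ebx", 2), ("add eax, ecx", 2)]   # 2-address: copy, then add
risc = [("addu $v0, $a0, $a1", 4)]                  # 3-address, fixed 32-bit encoding
for name, seq in (("x86", x86), ("RISC", risc)):
    print(name, sum(size for _, size in seq), "bytes in", len(seq), "instructions")
```

So yes, the 2-address format costs an extra instruction here, but not extra bytes; and in the common case where one source value dies, x86 needs just the single 2-byte add. That's the kind of thing that lets it come out roughly 20% denser overall despite the shuffling.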
Also decoding IA-32 into uOps does not take negligible resources. Decoders are either big and power hungry (Athlon) or less power hungry but even bigger (P4; trace cache).
I didn't say "negligable", I said "increasingly negligable silicon cost". This is just a simple consequence of Moore's Law. Of course, as more resources are available, more will be given to the task of decoding. It is certainly fair to charge the extra footprint of the expanded instructions in the trace cache to the decoding cost, but it's worth noting that a trace cache is a worthwhile feature in and of itself; the idea was developed in academia for reasons having nothing to do with taking a CISC-"RISC" decoder out of the critical path.
Obviously the x86 tax is still too great to pay when backwards compatibility isn't worth anything and power/heat are important issues, as is the case with most embedded systems. But that doesn't mean x86 can't do the high end of low-power reasonably well. Pentium M offers pretty remarkable performance/power-consumption considering how high the performance really is. Yeah, it would be even better if it were a RISC with the same design resources poured into it; but in the meantime it sure wipes the floor with the G3/G4 in both performance and battery life.
A 21264 core is half the die size of the P4 in a similar process and yet has higher performance.
Er, no. Let's compare the last similar process both chips topped out on: .18um bulk Al. EV68 (833 MHz) tops out at 518/643 SPECint/fp base, while Willamette (2 GHz) hit 681/735. So nice try, Alpha, but no cigar. And I'm sure the Alpha's 8 MB L2 had nothing to do with anything, as it's always perfectly fair to compare a $500 chip to a $20,000 one.
Moreover, the 21264B had a die size of 153mm^2, hardly "half the die size" of the 217mm^2 Willamette, and that's disregarding the fact that the 21264B has about half the on-die cache of Willamette.
Ok, so you were probably talking about the 21264C, which is in .18um Cu. (SOI? I forget.) Not quite "a similar process", but we'll let that slide. The 21264C's die size is 125mm^2, so you're a bit closer there, although again, considering the missing 128KB of on-die cache (or, alternatively, the 8MB off-die cache), I'm disinclined to give the benefit of the doubt. The 21264C is shipping at 1250 MHz--50% faster than the 833 MHz 21264B--so obviously it could turn in higher SPEC scores than the 2 GHz Willy, if HPaq would only submit them.
While we're on the subject, the .18um Cu/SOI Power4 (IIRC only <= 1.3 GHz; the faster ones are .13um) scores higher as well, although the fricking 128MB L3 can't hurt.
But that doesn't contradict what I said. This is the most important point, so I'll make it quite clear:
I never said CISC could now beat RISC in process-normalized performance; I said it has become "increasingly competitive". I think our "disagreement" arises mainly from the fact that you don't realize just how much of an advantage RISC provided over CISC around 15 years ago. There was a famous study carried out at DEC where they pitted their own VAX 8700 against the MIPS M2000; the chips were chosen because they were built on an extremely similar process. Just as with your EV6 vs. P4 example, the RISC chip had about half the core complexity of its rival. The difference is, instead of being around .8x as fast, it was 2.66x as fast. Here's a nice PowerPoint presentation discussing the results (download OpenOffice if you don't have PowerPoint), although you can find gazillions of less in-depth mentions of it, as it is featured in Hennessy and Patterson and thus in the curriculum of every college MPU architecture course in the nation.
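Just to spell out the arithmetic behind that ".8x" versus "2.66x" (using the SPEC numbers quoted above, with all the cache and process caveats already noted):

```python
# .18um bulk Al parts, SPEC base numbers from earlier in the post.
ev68 = {"SPECint": 518, "SPECfp": 643}   # 833 MHz 21264B/EV68
p4   = {"SPECint": 681, "SPECfp": 735}   # 2 GHz Willamette
for k in ev68:
    print(f"EV68/P4 {k} ratio: {ev68[k] / p4[k]:.2f}")
print("MIPS M2000 / VAX 8700 (similar-process DEC study): ~2.66")
```

That works out to roughly 0.76 and 0.87, i.e. "around .8x". The same style of comparison gave RISC a 2.66x edge fifteen years ago; that collapse is the whole point.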
Now, let's dwell on this a second. Obviously the reason the P4 is doing so well is not because of x86 but in spite of it. Clearly the Alpha was hampered by fewer development resources; an older design optimized for older chip geometries; process technology that, while pretty decent (IBM's), was not tailored to the MPU as Intel's is; a compiler not quite as good as Intel's; and so on. This is all a function of the huge installed base of x86 and the money their captivity buys. Fine.
Problem is, Alpha was still a ton better off than any of the other RISC architectures. Alpha at least had a design team with the talent (if not the resources or company backing) to challenge Intel's; indeed, the Alpha core has the advantage of being more hand-tweaked than even Intel's designs. None of the RISC vendors owns their own fab (except IBM, but they don't target their fab to their own chips, as the fab is run as a completely separate entity), and many are worse off on this front (Sun uses TI, for example). In compilers, too, Alpha was the only group that could even compete with Intel. And so on. Finally, Alpha was particularly strong in SPEC (and particularly weak in TPC), so the comparison is made on reasonably favorable terms for Alpha. I mean, think about comparing the 2 GHz Willamette to the best .18um process chips from Sun (USII I believe), HP, or SGI in single-threaded SPEC. P4 at .18um will probably beat what PA-RISC or MIPS achieve at .13um, on SPEC at least. (If SGI never bumps the R16000 past 700 MHz, it won't even be remotely close.)
Ok, so obviously even Alpha and Power4 are in many ways victims of an unfair comparison with P4. (OTOH, they do have those huge off-die caches, which SPEC loves, and the benefit of IBM's more advanced process technology, even if that only brings them even with Intel's bulk Al .18um.) Obviously if Intel were to put the same amount of resources they dedicated to P4 into a RISC chip it would be faster. Probably a lot faster. Maybe as much as 30-40% faster at similar cost and process.
Thing is, that doesn't begin to compare to 166% faster, which is what the MIPS M2000 did to the VAX 8700 in the late 80s. And, while I don't have process-normalized information, this sort of dominance, or even greater, continued throughout the early to mid-90s (i.e. RISCs compared to the 486 and then the Pentium). It was only with the PPro that x86 could be mentioned in the same breath as RISC chips in SPECint performance (but not SPECfp); and with the P4 that x86 took a constant place at or near the top of the SPEC standings. (One that it will by all indications lose for good to Itanium when Madison launches in the coming weeks.)
There are lots of reasons for this, among them the fact that serious development of big-iron RISC chips stalled out of fear of Itanium (except at IBM and Sun, with the latter too far behind to matter much). But by far the biggest reason is that there has been a huge secular increase in the competitiveness of CISC architectures compared to RISC in the last decade. And this is due to Moore's Law, first offering the mere possibility of a CISC->"RISC" translating design, and then making it ever cheaper in relative silicon cost.
The success of x86 is solely due to economy of scale, which has allowed the companies behind the MPUs to pour $$$$ into process and uarch developments while still maintaining a price/performance edge.
If by "success" you mean marketplace success, quite true: x86 was, after all, extremely successful back in the days when it wasn't anywhere near performance-competitive with RISC MPUs.
If you mean technological success, you're entirely wrong. All the engineering talent and R&D money in the world couldn't give a CISC the cost/performance of a RISC 10-15 years ago; now it has made the P4 competitive on an absolute performance basis, even before you factor in manufacturing cost.
Finally: The compiler advancements that will benefit EPIC (VLIW) will also benefit every single other architecture out there.
EPIC is infinitely more dependent on good compilers for high performance than CISC or RISC, and particularly out-of-order implementations of CISC or RISC. Moreover, the other general-purpose architectures don't have features like full predication, branch hints (with poison bits to preserve correctness), or memory reference speculation. Plus their smaller visible register set limits how aggressive the compiler can be in terms of software pipelining or trace scheduling.
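For a flavor of what full predication buys the compiler, here's a purely conceptual sketch of if-conversion (not IPF code; on Itanium the second version would become a compare setting a predicate register plus two predicated instructions, with no branch at all):

```python
# If-conversion: replace a conditional branch with execution of both arms,
# using a predicate value to select the result.
def branchy(a, b):
    if a > b:          # compiles to a compare-and-branch on a conventional ISA
        return a
    return b

def if_converted(a, b):
    p = a > b                      # predicate computed as an ordinary value
    return p * a + (not p) * b     # both arms "execute"; the predicate selects
```

The payoff is that hard-to-predict branches simply disappear, instead of being guessed at by a branch predictor and flushed on a miss.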
The only thing EPIC has going for it is the large register file
Totally wrong. For one thing, simply giving a classic RISC 128 GPRs without significantly changing the rest of the ISA would barely improve performance at all. (After all, OoO RISCs get most of the benefit of a large visible register set by having a similarly large renaming register set.) For another, among all the other bits I mentioned above, you're somehow forgetting the little bit about the explicit parallelism...
and with SMT becoming ever more popular even that is looking likely to be a liability (big ass context-> fewer contexts juggled at the same time->lower throughput).
So because SMT is now "popular", IPF is going to have to use it??
Apparently you're not as big a Paul DeMone fan as I thought. There are other forms of multithreading, you know. (Or maybe not.)
Look, the main challenge facing MPU architects is extracting enough parallelism to keep busy the increased number of functional units Moore's Law affords them.
For a while, ILP found via OoO was enough to keep things going. Unfortunately, that method has pretty much played itself out: increasing the reordering window size is one of the most important ways of extracting more ILP, but the silicon required increases quadratically.
So now we have two new approaches. The first is extracting thread-level parallelism via SMT. Unfortunately, we won't get to see what would have been the best early exemplar of this, EV8. It's certainly a viable approach, although it obviously relies on having multiple threads competing for CPU-time.
The second is to extract ILP at compile time. There are obvious disadvantages, but the amount of ILP left unclaimed by current methods is enormous, and while not all is knowable at compile-time, we can do a lot better than what can be practically extracted dynamically.
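For a flavor of the kind of compile-time ILP I'm talking about, here's a toy sketch of a software-pipelined schedule for a load/multiply/store loop (purely illustrative; the stage names and single-group spacing are made up):

```python
# Toy software pipeline: iteration i's load, iteration i-1's multiply, and
# iteration i-2's store all land in the same instruction group.  A bigger
# architectural register file lets the compiler keep more iterations in
# flight this way without running out of names.
def schedule(n, stages=("load", "mul", "store")):
    for group in range(n + len(stages) - 1):
        ops = [f"{op}[{group - s}]" for s, op in enumerate(stages)
               if 0 <= group - s < n]
        print(f"group {group}: " + ", ".join(ops))

schedule(4)
```

An OoO machine discovers some of this overlap dynamically, but only within its reorder window; the compiler can schedule across the whole loop.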
Of course the proof is in the pudding, and after a pretty awful debut (but then again, most 1st-generation processors never make it out of the lab for the world to see how bad they really are), Itanium has become quite impressive performance-wise. With Madison that will turn into "quite dominant performance-wise"; now that EV8 is no more, nothing is going to challenge Madison's SPEC numbers in .13um.
Going back to the discussion earlier: does this mean Intel couldn't have put up similar numbers if they'd poured the same resources into a RISC design? On .13um, probably they could have. (On .18um, definitely.) A couple process generations from now, it's looking more dubious. While it's certainly not a perfect test, IBM's Power roadmap looks ambitious enough that we should get to see how a serious RISC competitor stacks up to EPIC.
(Although ironically, Power4 does a crude form of on-chip RISC->"semi-VLIW" encoding, so that it can reap some of the control benefits of a bundled instruction ISA. Remind you of anything?)
P.S. - As you said, this is quite OT; I'll take further responses to PM if any are necessary.