FP - DX9 vs IEEE-32

Hyp-X said:
arjan de lumens said:
If you are doing DOT3, RCP or other composite operations, I would expect the delay of a fused-multiply-add + a standard add + a little bit more, giving ~8-11 ns.

For DOT3 I'd be surprised if it took that much.

FP add contains the following ops:
1. Compare the exponents and determine the amount of mantissa shift.
2. Shift the mantissa of one of the numbers.
3. Add the two mantissas together.
4. Search for the highest set bit in the result.
5. Shift the mantissa for renormalization.
True enough, but you can do 2-input FP adds faster than that sequence of operations indicates. Note that if the two input numbers are about the same magnitude, then step 2 can be reduced to a 1-bit shift. Otherwise, if the two input numbers are not of similar magnitude, then the renormalization in step 5 can be reduced to a 1-bit shift. Split the FP adder into two paths - one for each of the two cases - and you will end up with a substantially faster 2-input FP adder. Most CPU makers do this for extra speed these days (dunno about GPU makers; this way of designing an FP adder can be expensive in terms of transistor count).
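For anyone who wants to see the steps concretely, here is a rough C sketch of the above (my own toy code, not anybody's actual hardware): it handles only positive, normalized IEEE-754 singles, truncates instead of rounding, and does steps 1-5 serially rather than splitting into the two paths I described:

[code]
#include <stdint.h>
#include <string.h>

/* Toy FP adder following the 5 steps above. Positive, normalized
   single-precision inputs only; no rounding, no special cases. */
static float fp_add_sketch(float fa, float fb)
{
    uint32_t a, b;
    memcpy(&a, &fa, 4);
    memcpy(&b, &fb, 4);
    if (a < b) { uint32_t t = a; a = b; b = t; }   /* make 'a' the larger */

    int ea = (a >> 23) & 0xff, eb = (b >> 23) & 0xff;
    uint32_t ma = (a & 0x7fffff) | 0x800000;       /* restore implicit 1  */
    uint32_t mb = (b & 0x7fffff) | 0x800000;

    int d = ea - eb;                               /* 1: compare exponents */
    mb = (d < 32) ? (mb >> d) : 0;                 /* 2: align mantissas   */
    uint32_t m = ma + mb;                          /* 3: add mantissas     */

    int e = ea;                                    /* 4+5: find the leading */
    while (m >= (1u << 24)) { m >>= 1; e++; }      /*      1, renormalize   */

    uint32_t r = ((uint32_t)e << 23) | (m & 0x7fffff);
    float f;
    memcpy(&f, &r, 4);
    return f;   /* fp_add_sketch(1.5f, 2.5f) == 4.0f */
}
[/code]

With same-sign inputs the renormalize loop runs at most once; it's subtraction with nearly equal inputs that can move the leading 1 a long way left, which is exactly the case the "near path" of a two-path adder is built for.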
DOT3 requires a 3-parameter add, instead of the 2-parameter add used in MAD operations.
Stages 2, 4 and 5 should take exactly the same time; only stages 1 and 3 take more time, but I'm not sure it matters as much as (a standard add + a little bit more).
For DOT3, you can overlap stage 1 with the multiplication part of the operation. But:
with a 3-input add, you can no longer use the adder trick I described above, and you cannot determine the sign of the mantissa until after you have done the addition step, so you get a potentially expensive negation step as well, so the time needed to perform the addition goes up by 60-80% over a 2-input add. A 4-input add (for DOT4) is, however, only slightly more expensive than a 3-input add.

As for the Athlon: it does both FP32 (for 3DNow!) and FP80 (for x87) with the same number of cycles (4), although IIRC the units are separate from each other.
 
It's worth looking up the carry-save adder technique. I found this link:
http://www.geoffknagge.com/fyp/carrysave.shtml
CSA allows you to reduce the addition of 3 values to 2 values, but critically has no carry propagation, and so the propagation time of the adder is just one or two gates.

Chaining these, you can trivially reduce the addition of 4, 5, 6, etc. numbers down to a single final carry-propagate add - so a dot4 operation just needs a couple of extra propagation delays compared to a MAD - certainly nothing like the huge gate propagation of a conventional CPA.
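To make that concrete, here's a little C model of the idea (integer only, names are mine): each 3:2 compressor produces its two outputs with no carry propagation at all, and you only pay for one real carry-propagate add at the very end.

[code]
#include <stdint.h>
#include <stdio.h>

/* 3:2 carry-save compressor: reduces three addends to two. Each output
   bit depends only on the three input bits in its own column, so there
   is no carry chain - just a couple of gate delays. */
static void csa(uint32_t a, uint32_t b, uint32_t c,
                uint32_t *sum, uint32_t *carry)
{
    *sum   = a ^ b ^ c;
    *carry = ((a & b) | (a & c) | (b & c)) << 1;
}

int main(void)
{
    /* Summing four products (as in a dot4): chain two CSAs, then do the
       single carry-propagate add at the end. */
    uint32_t s, c;
    csa(3, 5, 7, &s, &c);          /* 3 values -> 2 */
    csa(s, c, 11, &s, &c);         /* fold in the 4th value */
    printf("%u\n", s + c);         /* prints 26 = 3+5+7+11 */
    return 0;
}
[/code]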
 
I am aware of how a carry-save adder works - the fastest available multiplier designs (Wallace trees, 4-2 trees) are just trees of carry-save adders. For an integer/fixed-point DOT4, you can use a tree of carry-save adders to get an operation latency of roughly 1 multiply + 2 CSAs. For floating-point DOT3/4, the problem is that you need normalization circuits before and after the addition when adding 3 or more numbers (which can be partially avoided if you add just 2 numbers).
 
arjan de lumens said:
Constant speed is O(1), not O(n). With adders, multipliers, and barrel shifters (the basic circuits from which FP units are made), since the MSB of the result potentially depends on all the bits of the inputs, you can't get lower than O(log n) gate delays (and O(n) interconnect delay) no matter how you design the circuit, which is hardly constant.


I didn't claim O(n) was constant, I said addition was fundamentally O(n) if you take your model of computation to be the typical Turing or decision-tree model. It's constant only if you ignore delays. Once you move to a parallel model of computation, you must take communication delays into account.

However, when we are dealing with small, fixed inputs, where the differences between 16, 24, and 32 bits aren't large, I don't find O() analysis that informative. After all, merge sort may be O(n log n), but insertion sort is going to be quicker if you're sorting 8 numbers.
 
Umm, if neither size nor delay is constant, then what is constant about n O(n) adders? It's not like you need as many as n adders to maintain constant throughput - there are several adder designs with O(n) size and O(log n) delays, so you need only log(n) adders. (These adders are only barely larger than ripple-carry adders and are twice as fast already at ~8-12 bits operand size).
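If it helps, here's a toy C model of the parallel-prefix trick such adders use (my own sketch, written in the dense Kogge-Stone form because it's easiest to express with word-wide shifts; the leaner O(n)-size prefix networks use the same recurrence): all 32 carries fall out of log2(32) = 5 combining stages instead of rippling through 32 positions.

[code]
#include <stdint.h>
#include <stdio.h>

/* Parallel-prefix addition. Each loop iteration models one combining
   stage of the carry network, so the depth is log2(32) = 5 stages
   rather than the 32 of a ripple-carry adder. */
static uint32_t prefix_add(uint32_t a, uint32_t b)
{
    uint32_t p0 = a ^ b;          /* per-bit propagate (half-add sum)   */
    uint32_t g  = a & b;          /* per-bit generate                   */
    uint32_t p  = p0;

    for (int shift = 1; shift < 32; shift <<= 1) {   /* 5 stages */
        g = g | (p & (g << shift));                  /* merge group G    */
        p = p & (p << shift);                        /* merge group P    */
    }
    return p0 ^ (g << 1);         /* sum = propagate XOR incoming carry */
}

int main(void)
{
    printf("%u\n", prefix_add(123456789u, 987654321u));   /* 1111111110 */
    return 0;
}
[/code]

The generate/propagate combine step is associative, which is what lets the hardware arrange it as a log-depth tree.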
 
OT

Gubbi said:
CISC has *zero* code size advantage over RISC. Compare ARM Thumb or MIPS16 to x86 and you'll find the latter losing

Well sure, but condensed ISAs are hardly what one means when one says RISC. Of course Thumb and MIPS16 deserve to be called "RISC" ISAs, because they are variations on "classic RISC" ISAs (and SuperH too, because it is so similar to Thumb and MIPS16), and they incorporate many of the design insights of the RISC revolution. But they pointedly fail to have many of the features shared by all general-purpose RISC ISAs: fixed-length instructions (specifically fixed at 32 bits); 32 GPRs; and three operands on all arithmetic instructions.

ARM Thumb (in 16-bit mode), for example, only provides 8 GPRs, only 8 bits of offset on a conditional branch :!: and 11 on a jump, and so on. It's a nice ISA for many embedded applications, but it is completely unusable for general-purpose computing. And if it wasn't clear in my post, I was talking about general-purpose computing, where 8-bit immediate fields don't cut it. If you want to play this game, I can come up with an 8-bit microcontroller that blows Thumb or any other narrow-RISC ISA out of the water when it comes to code density. But it's pretty obviously irrelevant to whether CISC has a code density advantage over RISC in general-purpose use.

The average instruction size of the new x86-64 is 5 bytes per instruction

If you're referring to this Paul DeMone post, then you follow his postings even more closely than I do. :) But he later makes it quite clear that this almost-5-bytes/instruction figure is anomalous even for the brand-new x86-64 under GCC, let alone for normal-case x86 under a mature compiler. I'm not going to take the time to come up with a real figure, but it is obviously significantly less than 5 bytes.

Yes, you can have a memory operand in there, but at the same time you only have a 2-address instruction format, and fewer registers, so you'll end up with more instructions shuffling data around than in a typical RISC.

The bottom line is that x86 has a significant code size advantage over a traditional RISC in general-purpose code, roughly 20% in the case of the SPEC suite. A quick search found me this paper on dictionary compression of RISC ISAs. (Interestingly, IBM does a similar thing for some embedded RISC MPUs, rather than moving to a hybrid 16/32-bit ISA ala Thumb or MIPS16.) Check out page 9: uncompressed x86 code size averages 18% smaller than ARM and 29% :!: smaller than PowerPC for a large sample of SPEC95 subtests. Admittedly the figures aren't perfect, as AFAICT they represent (unlinked) binary size, rather than runtime code path size, which is what really matters. But they give a general idea.

In any case, the fact that x86 has a smaller runtime code size than all general-purpose RISCs is well established.

Also, decoding IA-32 into uOps does not take negligible resources. Decoders are either big and power hungry (Athlon) or less power hungry but even bigger (P4; trace cache).

I didn't say "negligible", I said "increasingly negligible silicon cost". This is just a simple consequence of Moore's Law. Of course, as more resources are available, more will be given to the task of decoding. It is certainly fair to charge the extra footprint of the expanded instructions in the trace cache to the decoding cost, but it's worth noting that a trace cache is a worthwhile feature in and of itself; the idea was developed in academia for reasons having nothing to do with taking a CISC->"RISC" decoder out of the critical path.

Obviously the x86 tax is still too great to pay when backwards compatibility isn't worth anything and power/heat are important issues, as is the case with most embedded systems. But that doesn't mean x86 can't do the high end of low-power reasonably well. Pentium M offers pretty remarkable performance/power-consumption considering how high the performance really is. Yeah, it would be even better if it were a RISC with the same design resources poured into it; but in the meantime it sure wipes the floor with the G3/G4 in both performance and battery life.

A 21264 core is half the die size of the P4 in a similar process and yet has higher performance.

Er, no. Let's compare the last similar process on which both chips topped out: .18um bulk Al. EV68 (833 MHz) topped out at 518/643 SPECint/fp base, while Willamette (2 GHz) hit 681/735. So nice try, Alpha, but no cigar. And I'm sure the Alpha's 8 MB L2 had nothing to do with anything, as it's always perfectly fair to compare a $500 chip to a $20,000 one.

Moreover, the 21264B had a die size of 153mm^2, hardly "half the die size" of the 217mm^2 Willamette, and that's disregarding the fact that 21264B has about half the on-die cache of Willamette.

Ok, so you were probably talking about the 21264C, which is in .18um Cu. (SOI? I forget.) Not quite "a similar process", but we'll let that slide. 21264C's die size is 125mm^2, so you're a bit closer there, although again considering the missing 128KB of on-die cache (or, alternatively, the 8MB off-die cache) I'm disinclined to give the benefit of the doubt. 21264C is shipping at 1250 MHz--50% faster than the 833 MHz 21264B--so obviously it could turn in higher SPEC scores than the 2 GHz Willy, if HPaq would only submit them.

While we're on the subject, the .18um Cu/SOI Power4 (IIRC only <= 1.3 GHz; the faster ones are .13um) scores higher as well, although the fricking 128MB L3 can't hurt.

But that doesn't contradict what I said. This is the most important point, so I'll make it quite clear: I never said CISC could now beat RISC in process-normalized performance; I said it has become "increasingly competitive". I think our "disagreement" arises mainly from the fact that you don't realize just how much of an advantage RISC provided over CISC around 15 years ago. There was a famous study carried out at DEC where they pitted their own VAX 8700 against the MIPS M2000; the chips were chosen because they were built on an extremely similar process. Just as with your EV6 vs. P4 example, the RISC chip had about half the core complexity of its rival. The difference is, instead of being around .8x as fast, it was 2.66x as fast. Here's a nice Powerpoint presentation discussing the results (download OpenOffice if you don't have Powerpoint), although you can find gazillions of less in-depth mentions of it as it is featured in Hennessy and Patterson and thus in the curriculum of every college MPU architecture course in the nation.

Now, let's dwell on this a second. Obviously the reason the P4 is doing so well is not because of x86 but in spite of it. Clearly the Alpha was hampered by fewer development resources; an older design optimized for older chip geometries; process technology that, while pretty decent (IBM), was not tailored to the MPU as Intel's is; not quite as good a compiler as Intel's; and so on. This is all a function of the huge installed base of x86 and the money their captivity buys. Fine.

Problem is, Alpha was still a ton better off than any of the other RISC architectures. Alpha at least had a design team with the talent (if not the resources or company backing) to challenge Intel's; indeed, the Alpha core has the advantage of being more hand-tweaked than even Intel's designs. None of the RISC vendors owns their own fab (except IBM, but they don't target their fab to their own chips, as the fab is run as a completely separate entity), and many are worse off on this front (Sun uses TI, for example). In compilers, too, Alpha was the only group that could even compete with Intel. And so on. Finally, Alpha was particularly strong in SPEC (and particularly weak in TPC), so the comparison is made on reasonably favorable terms for Alpha. I mean, think about comparing the 2 GHz Willamette to the best .18um process chips from Sun (USII I believe), HP, or SGI in single-threaded SPEC. P4 at .18um will probably beat what PA-RISC or MIPS achieve at .13um on SPEC at least. (If SGI never bumps the R16000 past 700 MHz, it won't even be remotely close.)

Ok, so obviously even Alpha and Power4 are in many ways victims of an unfair comparison with P4. (OTOH, they do have those huge off-die caches, which SPEC loves, and the benefit of IBM's more advanced process technology, even if that only brings them even with Intel's bulk Al .18um.) Obviously if Intel were to put the same amount of resources they dedicated to P4 into a RISC chip it would be faster. Probably a lot faster. Maybe as much as 30-40% faster at similar cost and process.

Thing is, that doesn't begin to compare to 166% faster, which is what the MIPS M2000 did to the VAX 8700 in the late 80s. And, while I don't have process-normalized information, this sort of dominance, or even greater, continued throughout the early to mid-90s (i.e. RISCs compared to 486 and then Pentium). It was only with the PPro that x86 could be considered within the same breath as RISC chips in SPECint performance (but not SPECfp); and with P4 that x86 took a constant place at or near the top of the SPEC standings. (One that it will by all indications lose for good to Itanium when Madison launches in the coming weeks.)

There are lots of reasons for this, among them the fact that serious development of big-iron RISC chips stalled out of fear of Itanium (except at IBM and Sun, with the latter being too far behind to matter much). But by far the biggest reason is that there has been a huge secular increase in the competitiveness of CISC architectures compared to RISC in the last decade. And this is due to Moore's Law, first offering the mere possibility of a CISC->"RISC" translating design, and then making it ever cheaper in relative silicon cost.

The success of x86 is solely due to economy of scale, which has allowed the companies behind the MPUs to pour $$$$ into process and uarch developments while still maintaining a price/performance edge.

Quite true, if by "success" you mean marketplace success - x86 was, after all, extremely successful back in the days when it wasn't anywhere near performance-competitive with RISC MPUs.

If you mean technological success, you're entirely wrong. All the engineering talent and R&D money in the world couldn't give a CISC the cost/performance of a RISC 10-15 years ago; now it has made the P4 competitive on an absolute performance basis, to say nothing of manufacturing cost.

Finally: The compiler advancements that will benefit EPIC (VLIW) will also benefit every single other architecture out there.

EPIC is infinitely more dependent on good compilers for high performance than CISC or RISC are - particularly out-of-order implementations of either. Moreover, the other general-purpose architectures don't have features like full predication, branch hints, or control and memory-reference speculation (with poison bits to preserve correctness). Plus their smaller visible register set limits how aggressive the compiler can be in terms of software pipelining or trace scheduling.
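To give a flavor of what full predication buys the compiler: it can if-convert short branchy sequences into straight-line code and then schedule or software-pipeline them freely. A C-level analogue of the transformation (illustrative only - on IA-64 the two arms would issue under complementary predicate registers rather than a data select):

[code]
/* Branchy form: the conditional branch fences off scheduling and eats
   a misprediction penalty when it guesses wrong. */
int clamp_branchy(int x, int hi)
{
    if (x > hi)
        x = hi;
    return x;
}

/* If-converted form: the condition becomes a "predicate" value and both
   outcomes are computed with no branch at all - the shape predication
   gives the compiler, on a much larger scale. */
int clamp_ifconverted(int x, int hi)
{
    int p = (x > hi);
    return p ? hi : x;
}
[/code]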

The only thing EPIC has going for it is the large register file

Totally wrong. For one thing, simply giving a classic RISC 128 GPRs without significantly changing the rest of the ISA would barely improve performance at all. (After all, OoO RISCs get most of the benefit of a large visible register set by having a similarly large renaming register set.) For another, among all the other bits I mentioned above, you're somehow forgetting the little bit about the explicit parallelism...

and with SMT becoming ever more popular, even that is looking likely to be a liability (big-ass context -> fewer contexts juggled at the same time -> lower throughput).

So because SMT is now "popular", IPF is going to have to use it?? :LOL:

Apparently you're not as big a Paul DeMone fan as I thought. There are other forms of multithreading, you know. (Or maybe not.)

Look, the main challenge facing MPU architects is extracting enough parallelism to keep busy the increased number of functional units Moore's Law affords them.

For a while, ILP found via OoO was enough to keep things going. Unfortunately, that method has pretty much played itself out: increasing the reordering window size is one of the most important ways of extracting more ILP, but the silicon required increases quadratically.

So now we have two new approaches. The first is extracting thread-level parallelism via SMT. Unfortunately, we won't get to see what would have been the best early exemplar of this, EV8. It's certainly a viable approach, although it obviously relies on having multiple threads competing for CPU-time.

The second is to extract ILP at compile time. There are obvious disadvantages, but the amount of ILP left unclaimed by current methods is enormous, and while not all is knowable at compile-time, we can do a lot better than what can be practically extracted dynamically.

Of course the proof is in the pudding, and after a pretty awful debut (but then again, most 1st-generation processors never make it out of the lab for the world to see how bad they really are), Itanium has become quite impressive performance-wise. With Madison that will turn into "quite dominant performance-wise"; now that EV8 is no more, nothing is going to challenge Madison's SPEC numbers in .13um.

Going back to the discussion earlier: does this mean Intel couldn't have put up similar numbers if they'd poured the same resources into a RISC design? On .13um, probably they could have. (On .18um, definitely.) A couple process generations from now, it's looking more dubious. While it's certainly not a perfect test, IBM's Power roadmap looks ambitious enough that we should get to see how a serious RISC competitor stacks up to EPIC.

(Although ironically, Power4 does a crude form of on-chip RISC->"semi-VLIW" encoding, so that it can reap some of the control benefits of a bundled instruction ISA. Remind you of anything?)

P.S. - As you said, this is quite OT; I'll take further responses to PM if any are necessary.
 
Simon F said:
Dave H said:
FP32 is not inherently slower than FP24, FP16, or FPwhatever. It just takes more silicon to support it natively.
Are you sure that it's possible to keep the calculation time constant just by throwing even more silicon at the problem?
I'm not convinced - surely it must still be slower.

Others have taken this discussion farther than I will (or could) here, but...

Sure, FP32 arithmetic requires either more levels of logic, or more loops through the logic you've got, than does FP24; simplified a bit, it's a matter of doing a 24-bit integer multiply and several 24-bit adds instead of doing 17-bit multiplies and adds (plus assorted shifts in both cases). I didn't mention it because GPUs are ridiculously pipelined anyways, and their FMADs should be no different. In case we've forgotten, FMADs are generally done on vec4 inputs, so it's not like previously we were getting our results in a single cycle or anything.

(EDIT: Ok, obviously the multiplies are all independent, so the fact that you're doing 3 FMADs and a multiply "in sequence" really only leaves you with an extra FP add if you dedicate the requisite hardware (4 independent multiplies, then add the results in pairs, then add those results). So, maybe it is plausible to do a vec4 FP24 FMAD in one cycle at ~500 MHz. Heck, maybe it's plausible with FP32. I don't know if I'm even in the ballpark. Still, there's no good reason not to have it all pipelined if need be; so much of a GPU is about latency hiding I can't see why it would be a problem here.)
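To spell out the dataflow I mean (just a sketch of the shape, not anyone's actual pipeline):

[code]
/* A dot4 as four independent multiplies feeding a two-level add tree.
   The serial form ((a0*b0 + a1*b1) + a2*b2) + a3*b3 has three dependent
   adds; the tree form below has only two add levels after the multiplies,
   and all four products can be computed at once. */
static float dot4_tree(const float a[4], const float b[4])
{
    float p0 = a[0] * b[0], p1 = a[1] * b[1];   /* level 0: 4 multiplies */
    float p2 = a[2] * b[2], p3 = a[3] * b[3];
    float s0 = p0 + p1,     s1 = p2 + p3;       /* level 1: 2 adds       */
    return s0 + s1;                             /* level 2: 1 add        */
}
[/code]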

So yes, moving to FP32 could increase the latencies of arithmetic operations a bit. But I would be quite surprised if these latencies were not completely hidden from the point of view of a pixel shader program. You may need to increase the number of pixels you have in flight down the shader pipeline, but, as I said, that's a matter of throwing silicon at the problem.

In my understanding at least. If there is some reason why moving to FP32 would necessarily impact cycle time, or why a little extra arithmetic latency would negatively impact shader performance, I'd be interested to hear it.
 
Dave H said:
21264C is shipping at 1250 MHz--50% faster than the 833 MHz 21264B--so obviously it could turn in higher SPEC scores than the 2 GHz Willy, if HPaq would only submit them.

The 21264C scores have been submitted; it scores 845/928 base/peak in SPECint and 1019/1365 in SPECfp.
 
Tim said:
Dave H said:
21264C is shipping at 1250 MHz--50% faster than the 833 MHz 21264B--so obviously it could turn in higher SPEC scores than the 2 GHz Willy, if HPaq would only submit them.

The 21264C scores have been submitted; it scores 845/928 base/peak in SPECint and 1019/1365 in SPECfp.

Whoops: I was only looking through the scores submitted by Compaq! :oops: :oops: :oops:

Thanks for the heads-up.
 