FP - DX9 vs IEEE-32

Discussion in 'General 3D Technology' started by Reverend, Jun 3, 2003.

  1. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
True enough, but you can do 2-input FP adds faster than that sequence of operations indicates. Note that if the two input numbers are of about the same magnitude, then step 2 can be reduced to a 1-bit shift. Otherwise, if the two input numbers are not of similar magnitude, then the renormalization in step 5 can be reduced to a 1-bit shift. Split the FP adder into two paths - one for each of the two cases - and you end up with a substantially faster 2-input FP adder. Most CPU makers do this for extra speed these days (dunno about GPU makers; this way of designing an FP adder can be expensive in terms of transistor count).
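The step numbers refer to the classic FP addition sequence quoted earlier in the thread; a toy behavioural sketch of that sequence (positive inputs only, an arbitrary 8-bit mantissa width, all names mine) may help follow the argument:

```python
# Toy sketch of the classic floating-point addition steps.
# MANT_BITS is an arbitrary illustrative width, not a DX9 format.
MANT_BITS = 8

def fp_add(e_a, m_a, e_b, m_b):
    """Add two positive floats given as (exponent, mantissa) pairs.

    Mantissas are integers in [2**(MANT_BITS-1), 2**MANT_BITS).
    Returns a normalized (exponent, mantissa) pair.
    """
    # Step 1: order the operands so that a has the larger exponent.
    if e_b > e_a:
        e_a, m_a, e_b, m_b = e_b, m_b, e_a, m_a
    # Step 2: align the smaller mantissa with a variable right shift
    # (the shift that collapses to at most 1 bit when the exponents
    # are about equal).
    m_b >>= (e_a - e_b)
    # Step 3: integer add of the aligned mantissas.
    m = m_a + m_b
    e = e_a
    # Step 5: renormalize. For same-sign addition this is at most a
    # 1-bit right shift; an effective subtraction would instead need
    # a full left-shift normalizer to cancel leading zeros.
    if m >= (1 << MANT_BITS):
        m >>= 1
        e += 1
    return e, m
```

The two-path trick splits on the exponent difference: one path takes the large alignment shift but at most a 1-bit normalize, the other takes at most a 1-bit alignment shift but the full normalizer, and the result is selected from whichever path applies.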
    For DOT3, you can overlap stage 1 with the multiplication part of the operation. But:
    with a 3-input add, you can no longer use the adder trick I described above, and you cannot determine the sign of the mantissa until after you have done the addition step, so you get a potentially expensive negation step as well. The time needed to perform the addition therefore goes up by 60-80% over a 2-input add. A 4-input add (for DOT4) is, however, only slightly more expensive than a 3-input add.

    As for the Athlon: it does both FP32 (for 3dnow) and FP80 (for x87) with the same number of cycles (4), although IIRC the units are separate from each other.
     
  2. Dio

    Dio
    Veteran

    Joined:
    Jul 1, 2002
    Messages:
    1,758
    Likes Received:
    8
    Location:
    UK
    It's worth looking up the carry-save adder technique. I found this link:
    http://www.geoffknagge.com/fyp/carrysave.shtml
    A CSA allows you to compress the addition of 3 values into 2 values, but critically has no carry propagation, so the propagation time of the adder is just one or two gates.

    Chaining these, you can trivially convert the addition of 4, 5, 6, etc. numbers into a CSA tree feeding a single carry-propagate adder - so a dot4 operation just needs a couple of extra propagation delays compared to a MAD - certainly nothing like the huge gate propagation of chaining conventional CPAs.
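A minimal sketch of the 3:2 compression being described, with Python integers standing in for bit vectors (function names are mine):

```python
def csa(a, b, c):
    """3:2 carry-save adder: compress three integers into a
    (sum, carry) pair with a + b + c == sum + carry.
    Each output bit depends on only three input bits, so there is
    no carry chain -- just one full-adder delay per level."""
    s = a ^ b ^ c                            # bitwise sum
    k = ((a & b) | (a & c) | (b & c)) << 1   # carries, shifted into place
    return s, k

def add4(a, b, c, d):
    """Add four numbers with two CSA levels and a single final
    carry-propagate add, as in the integer dot4 example."""
    s1, c1 = csa(a, b, c)
    s2, c2 = csa(s1, c1, d)
    return s2 + c2  # the one carry-propagate adder at the end
```

However deep the CSA tree gets, only the final `+` carries a full carry-propagate delay; each CSA level adds just one full-adder delay.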
     
  3. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    I am aware of how a carry-save adder works - the fastest available multiplier designs (Wallace trees, 4-2 trees) are just trees of carry-save adders. For an integer/fixed-point DOT4, you can use a tree of carry-save adders to get an operation latency of roughly 1 multiply + 2 CSAs. For floating-point dot3/4, the problem is that you need normalization circuits before and after the addition when adding 3 or more numbers (which can be partially avoided if you add just 2 numbers).
     
  4. Dio

    Dio
    Veteran

    Joined:
    Jul 1, 2002
    Messages:
    1,758
    Likes Received:
    8
    Location:
    UK
    It wasn't aimed at you :) you sounded far more informed than I was anyway, and confirmed it!
     
  5. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California

    I didn't claim O(n) was constant, I said addition was fundamentally O(n), if you take your model of computation to be the typical Turing or decision-tree model. It's constant only if you ignore delays. Once you move to a parallel model of computation, you must take communication delays into account.

    However, when we are dealing with small, fixed inputs, where the differences between 16, 24, and 32 bits aren't large, I don't find O() analysis that informative. After all, merge sort may be O(n log n), but insertion sort is going to be quicker if you're sorting 8 numbers.
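To make the constant-factor point concrete, here is a minimal (unscientific) sketch of the two sorts; at n = 8, insertion sort's simplicity typically wins on real hardware despite the worse asymptotic bound:

```python
def insertion_sort(a):
    """O(n^2) worst case, but tiny constant factors: no recursion,
    no allocation beyond the output copy."""
    a = list(a)
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a

def merge_sort(a):
    """O(n log n), but the recursion and list splicing dominate
    the cost for very small inputs."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]
```

Both produce identical results; the point is only that asymptotic superiority says nothing about which is faster at a fixed small size.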
     
  6. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    Umm, if neither size nor delay is constant, then what is constant about n O(n) adders? It's not as if you need as many as n adders to maintain constant throughput - there are several adder designs with O(n) size and O(log n) delay, so you need only log(n) adders. (These adders are only barely larger than ripple-carry adders and are already twice as fast at operand sizes of ~8-12 bits.)
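One family of adders with O(n) size and O(log n) delay is the parallel-prefix design (Kogge-Stone being a well-known member); a behavioural Python sketch whose loop structure mirrors the hardware's log2(n) combine levels (the width and all names are illustrative):

```python
def prefix_add(a, b, n=16):
    """Parallel-prefix (Kogge-Stone style) adder sketch.

    Each combine level has O(n) cells, and there are log2(n)
    levels, so carry delay grows as O(log n) rather than the
    O(n) of a ripple-carry chain. Result is modulo 2**n.
    """
    bits = [((a >> i) & 1, (b >> i) & 1) for i in range(n)]
    g = [x & y for x, y in bits]   # per-bit generate
    p = [x ^ y for x, y in bits]   # per-bit propagate (also the sum XOR)
    G, P = g[:], p[:]
    d = 1
    while d < n:                   # log2(n) prefix-combine levels
        G = [G[i] | (P[i] & G[i - d]) if i >= d else G[i] for i in range(n)]
        P = [P[i] & P[i - d] if i >= d else P[i] for i in range(n)]
        d *= 2
    # Carry into bit i is the group-generate of bits [i-1 .. 0].
    c = [0] + G[:n - 1]
    s = 0
    for i in range(n):
        s |= (p[i] ^ c[i]) << i
    return s
```

In hardware, the cost of each level is a single AND-OR cell per bit, which is why such adders end up only marginally larger than a ripple-carry design.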
     
  7. Dave H

    Regular

    Joined:
    Jan 21, 2003
    Messages:
    564
    Likes Received:
    0
    OT

    Well sure, but condensed ISAs are hardly what one means when one says RISC. Of course Thumb and MIPS16 deserve to be called "RISC" ISAs, because they are variations on "classic RISC" ISAs (and SuperH because it is so similar to Thumb and MIPS16), and incorporate many of the design insights of the RISC revolution. But they pointedly fail to have many of the features shared by all general-purpose RISC ISAs: fixed-length instructions (specifically, fixed at 32 bits); 32 GPRs; and three operands on all arithmetic instructions.

    ARM Thumb (in 16-bit mode), for example, provides only 8 GPRs, only 8 bits of offset on a conditional branch :!: and 11 on a jump, and so on. It's a nice ISA for many embedded applications, but it is completely unusable for general-purpose computing. And if it wasn't clear in my post, I was talking about general-purpose computing, where 8-bit immediate fields don't cut it. If you want to play this game, I can come up with an 8-bit microcontroller that blows Thumb or any other narrow-RISC ISA out of the water when it comes to code density. But it's pretty obviously irrelevant to whether CISC has a code density advantage over RISC in general-purpose use.

    If you're referring to this Paul DeMone post, then you follow his postings even more closely than I do. :) But he later makes it quite clear that this almost 5 bytes/instruction figure is anomalous even for the brand-new x86-64 under GCC, and certainly for normal-case x86 under a real compiler. I'm not going to take the time to come up with a real figure, but it is obviously significantly less than 5 bytes.

    The bottom line is that x86 has a significant code size advantage over a traditional RISC in general-purpose code, roughly 20% in the case of the SPEC suite. A quick search found me this paper on dictionary compression of RISC ISAs. (Interestingly, IBM does a similar thing for some embedded RISC MPUs, rather than moving to a hybrid 16/32-bit ISA ala Thumb or MIPS16.) Check out page 9: uncompressed x86 code size averages 18% smaller than ARM and 29% :!: smaller than PowerPC for a large sample of SPEC95 subtests. Admittedly the figures aren't perfect, as AFAICT they represent (unlinked) binary size, rather than runtime code path size, which is what really matters. But they give a general idea.

    In any case, the fact that x86 has a smaller runtime code size than all general-purpose RISCs is well established.

    I didn't say "negligible", I said "increasingly negligible silicon cost". This is just a simple consequence of Moore's Law. Of course, as more resources are available, more will be given to the task of decoding. It is certainly fair to charge the extra footprint of the expanded instructions in the trace cache to the decoding cost, but it's worth noting that a trace cache is a worthwhile feature in and of itself; the idea was developed in academia for reasons having nothing to do with taking a CISC-"RISC" decoder out of the critical path.

    Obviously the x86 tax is still too great to pay when backwards compatibility isn't worth anything, and power/heat are important issues, as is the case with most embedded systems. But that doesn't mean x86 can't do the high end of low-power reasonably well. Pentium M offers pretty remarkable performance/power-consumption considering how high the performance really is. Yeah, it would be even better if it were a RISC with the same design resources poured into it; but in the meantime it sure wipes the floor with the G3/G4 in both performance and battery life.

    Er, no. Let's compare the last similar process that both chips have topped out: .18um bulk Al. EV68 (833 MHz) tops out at 518/643 SPECint/fp base, while Willamette (2 GHz) hit 681/735. So nice try, Alpha, but no cigar. And I'm sure the Alpha's 8 MB L2 had nothing to do with anything, as it's always perfectly fair to compare a $500 chip to a $20,000 one.

    Moreover, the 21264B had a die size of 153mm^2, hardly "half the die size" of the 217mm^2 Willamette, and that's disregarding the fact that 21264B has about half the on-die cache of Willamette.

    Ok, so you were probably talking about the 21264C, which is in .18um Cu. (SOI? I forget.) Not quite "a similar process", but we'll let that slide. 21264C's die size is 125mm^2, so you're a bit closer there, although again considering the missing 128KB of on-die cache (or, alternatively, the 8MB off-die cache) I'm disinclined to give the benefit of the doubt. 21264C is shipping at 1250 MHz--50% faster than the 833 MHz 21264B--so obviously it could turn in higher SPEC scores than the 2 GHz Willy, if HPaq would only submit them.

    While we're on the subject, the .18um Cu/SOI Power4 (IIRC only <= 1.3 GHz; the faster ones are .13um) scores higher as well, although the fricking 128MB L3 can't hurt.

    But that doesn't contradict what I said. This is the most important point, so I'll make it quite clear: I never said CISC could now beat RISC in process-normalized performance; I said it has become "increasingly competitive". I think our "disagreement" arises mainly from the fact that you don't realize just how much of an advantage RISC provided over CISC around 15 years ago. There was a famous study carried out at DEC where they pitted their own VAX 8700 against the MIPS M2000; the chips were chosen because they were built on an extremely similar process. Just as with your EV6 vs. P4 example, the RISC chip had about half the core complexity of its rival. The difference is, instead of being around .8x as fast, it was 2.66x as fast. Here's a nice Powerpoint presentation discussing the results (download OpenOffice if you don't have Powerpoint), although you can find gazillions of less in-depth mentions of it as it is featured in Hennessy and Patterson and thus in the curriculum of every college MPU architecture course in the nation.

    Now, let's dwell on this a second. Obviously the reason the P4 is doing so well is not because of x86 but in spite of it. Clearly the Alpha was hampered by fewer development resources, an older design optimized for older chip geometries; process technology that, while pretty decent (IBM), was not tailored to the MPU as Intel's is; not quite as good a compiler as Intel's, and so on. This is all a function of the huge installed base of x86 and the money their captivity buys. Fine.

    Problem is, Alpha was still a ton better off than any of the other RISC architectures. Alpha at least had a design team with the talent (if not the resources or company backing) to challenge Intel's; indeed, the Alpha core has the advantage of being more hand-tweaked than even Intel's designs. None of the RISC vendors owns their own fab (except IBM, but they don't target their fab to their own chips, as the fab is run as a completely separate entity), and many are worse off on this front (Sun uses TI, for example). In compilers, too, Alpha was the only group that could even compete with Intel. And so on. Finally, Alpha was particularly strong in SPEC (and particularly weak in TPC), so the comparison is made on reasonably favorable terms for Alpha. I mean, think about comparing the 2 GHz Willamette to the best .18um process chips from Sun (USII I believe), HP, or SGI in single-threaded SPEC. P4 at .18um will probably beat what PA-RISC or MIPS achieve at .13um on SPEC at least. (If SGI never bumps the R16000 past 700 MHz, it won't even be remotely close.)

    Ok, so obviously even Alpha and Power4 are in many ways victims of an unfair comparison with P4. (OTOH, they do have those huge off-die caches, which SPEC loves, and the benefit of IBM's more advanced process technology, even if that only brings them even with Intel's bulk Al .18um.) Obviously if Intel were to put the same amount of resources they dedicated to P4 into a RISC chip it would be faster. Probably a lot faster. Maybe as much as 30-40% faster at similar cost and process.

    Thing is, that doesn't begin to compare to 166% faster, which is what the MIPS M2000 did to the VAX 8700 in the late 80s. And, while I don't have process-normalized information, this sort of dominance, or even greater, continued throughout the early to mid-90s (i.e. RISCs compared to 486 and then Pentium). It was only with the PPro that x86 could be considered within the same breath as RISC chips in SPECint performance (but not SPECfp); and with P4 that x86 took a constant place at or near the top of the SPEC standings. (One that it will by all indications lose for good to Itanium when Madison launches in the coming weeks.)

    There are lots of reasons for this, among them the fact that serious development of big-iron RISC chips stalled out of the fear of Itanium (except at IBM and SUN, with the latter being too behind to matter much). But by far the biggest reason is that there has been a huge secular increase in the competitiveness of CISC architectures compared to RISC in the last decade. And this is due to Moore's Law, first offering the mere possibility of a CISC->"RISC" translating design, and then making it ever-cheaper in relative silicon cost.

    Quite true, if by "success" you mean marketplace success: x86 was extremely successful back in the days when it wasn't anywhere near performance-competitive with RISC MPUs.

    If you mean technological success, you're entirely wrong. All the engineering talent and R&D money in the world couldn't give a CISC the cost/performance of a RISC 10-15 years ago; now it's made P4 competitive on an absolute performance basis, much less considering manufacturing cost.

    EPIC is infinitely more dependent on good compilers for high performance than CISC or RISC, and particularly out-of-order implementations of CISC or RISC. Moreover, the other general-purpose architectures don't have features like full predication, branch hints (with poison bits to preserve correctness), or memory reference speculation. Plus their smaller visible register set limits how aggressive the compiler can be in terms of software pipelining or trace scheduling.

    Totally wrong. For one thing, simply giving a classic RISC 128 GPRs without significantly changing the rest of the ISA would barely improve performance at all. (After all, OoO RISCs get most of the benefit of a large visible register set by having a similarly large renaming register set.) For another, among all the other bits I mentioned above, you're somehow forgetting the little bit about the explicit parallelism...

    So because SMT is now "popular", IPF is going to have to use it?? :lol:

    Apparently you're not as big a Paul DeMone fan as I thought. There are other forms of multithreading, you know. (Or maybe not.)

    Look, the main challenge facing MPU architects is extracting enough parallelism to keep busy the increased number of functional units Moore's Law affords them.

    For a while, ILP found via OoO was enough to keep things going. Unfortunately, that method has pretty much played itself out: increasing the reordering window size is one of the most important ways of extracting more ILP, but the silicon required increases quadratically.

    So now we have two new approaches. The first is extracting thread-level parallelism via SMT. Unfortunately, we won't get to see what would have been the best early exemplar of this, EV8. It's certainly a viable approach, although it obviously relies on having multiple threads competing for CPU-time.

    The second is to extract ILP at compile time. There are obvious disadvantages, but the amount of ILP left unclaimed by current methods is enormous, and while not all is knowable at compile-time, we can do a lot better than what can be practically extracted dynamically.

    Of course the proof is in the pudding, and after a pretty awful debut (but then again, most 1st-generation processors never make it out of the lab for the world to see how bad they really are), Itanium has become quite impressive performance-wise. With Madison that will turn to "quite dominant performance wise"; now that EV8 is no longer, nothing is going to challenge Madison's SPEC numbers in .13um.

    Going back to the discussion earlier: does this mean Intel couldn't have put up similar numbers if they'd poured the same resources into a RISC design? On .13um, probably they could have. (On .18um, definitely.) A couple process generations from now, it's looking more dubious. While it's certainly not a perfect test, IBM's Power roadmap looks ambitious enough that we should get to see how a serious RISC competitor stacks up to EPIC.

    (Although ironically, Power4 does a crude form of on-chip RISC->"semi-VLIW" encoding, so that it can reap some of the control benefits of a bundled instruction ISA. Remind you of anything?)

    P.S. - As you said, this is quite OT; I'll take further responses to PM if any are necessary.
     
  8. Dave H

    Regular

    Joined:
    Jan 21, 2003
    Messages:
    564
    Likes Received:
    0
    Others have taken this discussion farther than I will (or could) here, but...

    Sure, FP32 arithmetic requires either more levels of logic, or more loops through the logic you've got, than does FP24; simplified a bit, it's a matter of doing a 24-bit integer multiply and several 24-bit adds instead of doing 17-bit multiplies and adds (plus assorted shifts in both cases). I didn't mention it because GPUs are ridiculously pipelined anyways, and their FMADs should be no different. In case we've forgotten, FMADs are generally done on vec4 inputs, so it's not like previously we were getting our results in a single cycle or anything.

    (EDIT: Ok, obviously the multiplies are all independent, so the fact that you're doing 3 FMADs and a multiply "in sequence" really only leaves you with an extra FP add if you dedicate the requisite hardware (4 independent multiplies, then add the results in pairs, then add those results). So, maybe it is plausible to do a vec4 FP24 FMAD in one cycle at ~500 MHz. Heck, maybe it's plausible with FP32. I don't know if I'm even in the ballpark. Still, there's no good reason not to have it all pipelined if need be; so much of a GPU is about latency hiding I can't see why it would be a problem here.)
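The pairwise scheme in the parenthetical can be sketched as follows (plain Python arithmetic standing in for the FP24/FP32 units; names are mine):

```python
def dot4(a, b):
    """Pairwise-tree dot product of two 4-vectors.

    The 4 multiplies are all independent, then the results are
    added in pairs, then those two results are added -- a depth of
    one multiply plus two adds, rather than three sequential MADs."""
    p = [x * y for x, y in zip(a, b)]      # 4 multiplies, in parallel in HW
    return (p[0] + p[1]) + (p[2] + p[3])   # two levels of 2-input adds
```

The hardware cost of the tree form is the extra adders; the benefit is that the critical path is a multiply followed by only two add delays.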

    So yes, moving to FP32 could increase latencies of arithmetic operations a bit. But I would be quite surprised if these latencies were not completely hidden from the point of view of a pixel shader program. You may need to increase the number of pixels you have in-flight down the shader pipeline, but, as I said, a matter of throwing silicon at the problem.

    In my understanding at least. If there is some reason why moving to FP32 would necessarily impact cycle time, or why a little extra arithmetic latency would negatively impact shader performance, I'd be interested to hear it.
     
  9. Tim

    Tim
    Regular

    Joined:
    Mar 28, 2003
    Messages:
    875
    Likes Received:
    5
    Location:
    Denmark
    21264C scores have been submitted: 845/928 base/peak in SPECint and 1019/1365 in SPECfp.
     
  10. Dave H

    Regular

    Joined:
    Jan 21, 2003
    Messages:
    564
    Likes Received:
    0
    Whoops: I was only looking through the scores submitted by Compaq! :oops: :oops: :oops:

    Thanks for the heads-up.
     
  11. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    Agree, I wrote a response here

    Cheers
    Gubbi
     