Why do you keep harping on FMA? Single-cycle-throughput FMA doesn't matter for peak FLOPS, because it goes no faster than the single-cycle-throughput MAD that GPUs have been using forever ... also, ATI has said R800 supports FMA anyway.
Did you mean that it doesn't help single-precision (32-bit) throughput? I'm doubtful about that, because FMA could improve core utilization, and I also doubt a simple VLIW instruction sequence could get beyond good/very good emulation of FMA if it isn't implemented at the hardware level. So is ATI hiding that capability in the RV870 chip until Fermi is released, or is it just marketing blurb? Or will we never see it on Radeon-class cards, only FireStream-class ones? And ATI still hasn't even announced any FireStream cards based on RV870.
This is a piece from 2000, but it gives numbers and reasons.
http://www.realworldtech.com/page.cfm?ArticleID=RWT021300000000&p=1
This is a great but horrifying history lesson in how Intel's push onto newer process nodes killed an outperforming architecture in favor of their x86 CISC. I just hope that now that the Taiwanese semiconductor manufacturers have reached Intel's node shrink levels, we won't see history repeat itself in favor of some hyped monster like Larrabee. And hopefully it was a good decision for AMD to go Asset Smart and hand off its expensive manufacturing plants to a joint venture with ATIC, so we'll get newer nodes without AMD's lack of cash burdening them.
And I just hope that NVIDIA will finally develop some genuinely new architecture after NV40 from 2004, which was finalized in the G70. It's time to do that after Fermi sees daylight in early 2010; cycles usually last 18 months, so they need to have had some NGGA in mind for a long while now. I hoped G300 would be really new as promised (before the GT200 launch, when they talked about their DX11 inventions), but it's still just transistor pumping and praying to outperform the competition on a transistor-count advantage. And I hope ATI won't sleep on the successful R600 design reiterated into R800; there are simply too many things they could still upgrade in it.
Slides from a while back indicated that Larrabee could perform 2 non-SSE DP FLOPs a cycle.
That would seem to indicate x87, though the slides are pretty old at this point.
SSE wouldn't be an option anyway, as it appears Larrabee does not support it.
So we're in fact once again being cheated by Intel, this time on the performance of what amounts to a 487SX-class math unit. We're all supposed to be glad that these "features" are, once again, easier to implement than a fully capable transcendental math coprocessor ... only 20 years after Cyrix introduced its FasMath 83D87 (and the improved EMC87). It's hilarious, because I want to stay optimistic about this fraud.
So it's definitely not so simple..
Why isn't it simple? You'd simply waste 20% less die space, resulting in cheaper production (not a reality in Intel's case, I know). And it would be better for all of us if we had 20% less leakage, and maybe as little as 40% of the original power requirements, once you ditch all that x86 ISA pre-decode.
And the best thing: Larrabee is an IN-ORDER chip, AFAIK, so all the advantages OOO chips had over RISC (which Intel obliterated with their marketing FUD) are gone there. In-order chips need recompilation of the applications that were compiled for OOO cores, which is what we've had for the last 10 years. And apps also need to be aware of all the masking going on when they're executed on a chip that carries on the illusion of x86 compliance. I'd give more credit to Fermi on that x86-compatibility front when it comes out, even though it isn't an x86 chip at all. Larrabee is a multi-core, per-core-multithreaded chip, and in this proto-Larrabee age they need to figure out what kinds of optimizations they can do to outperform higher-clocked OOO CPUs at the core level. But the real question is what kind of HPC performance Larrabee can provide when its math is based on an old 487SX-style engine.
--