Slower, why?
We don't know how either chip performs yet...
My interpretation is that Fermi is TMU- and ROP-less. NVidia's traded them for lots of INT ALUs (512, an insane number by any standard, let's make no mistake). I don't know how many 32-bit ALU cycles on GF100 a bog standard 8-bit texture result would take through LOD/bias/addressing/decompression/filtering, so it's hard to say whether GF100 has ~2x GT200's texture rate, or 4x, etc.
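To make that cycle-count question concrete, here's a minimal sketch (my own illustration, not anything NVidia has described) of bilinear filtering an 8-bit, single-channel texture purely in 32-bit INT ops - and it still ignores wrap/clamp modes, LOD selection and block decompression, which is exactly why the true per-texel cost is so hard to pin down:

```cuda
// Illustrative only: a hypothetical bilinear filter of an 8-bit, single-channel
// texture done entirely in 32-bit integer ALU ops (no TMU). Addressing is
// simplified (no wrap/clamp modes, no mipmapping/LOD, no block decompression),
// so the real per-sample cost would be higher still.
__device__ unsigned int bilinear8(const unsigned char* tex, int width, int height,
                                  unsigned int u_fx, unsigned int v_fx) // 16.16 fixed point
{
    int x0 = u_fx >> 16;
    int y0 = v_fx >> 16;
    int x1 = min(x0 + 1, width  - 1);
    int y1 = min(y0 + 1, height - 1);

    unsigned int fu = (u_fx >> 8) & 0xff;   // 8-bit fractional weights
    unsigned int fv = (v_fx >> 8) & 0xff;

    unsigned int t00 = tex[y0 * width + x0];
    unsigned int t10 = tex[y0 * width + x1];
    unsigned int t01 = tex[y1 * width + x0];
    unsigned int t11 = tex[y1 * width + x1];

    // Two horizontal lerps plus a vertical lerp: on the order of 15 INT ops
    // for a single channel, before the four address calculations, the loads
    // and any format conversion.
    unsigned int top = (t00 * (256 - fu) + t10 * fu) >> 8;
    unsigned int bot = (t01 * (256 - fu) + t11 * fu) >> 8;
    return (top * (256 - fv) + bot * fv) >> 8;
}
```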
But what I can say is that in dumping TMUs NVidia's effectively spent more die space on texturing (non-dedicated units) and seemingly justified this by double-precision acceleration and compute-type INT operations (neither of which are graphics - well, there'll be increasing amounts of compute in graphics as time goes by). Now you could argue that the 80 TUs in RV870 show the way and that NVidia doesn't need to increase peak texturing rate. Fair enough. Intel, by comparison, dumped a relatively tiny unit (the rasteriser), so the effective overhead on die is small. Rasterisation rate in Larrabee, e.g. 16 colour/64 Z per clock, is hardly taxing.
So in spending so much die space on integer ops, NVidia's at a relative disadvantage in comparison with the capability of Larrabee.
Secondly, each core can only support a single kernel. This makes for much more coarsely-grained task-parallelism than seen in Larrabee. NVidia may have compensated with more efficient branching - but I'm doubtful, as there's no mention so far of anything like DWF (dynamic warp formation). The tweak in predicate handling is only catching up with ATI (which is stall-less) - while Intel has branch prediction too.
Cache. Well clearly there ain't enough of it if there are no ROPs. I'd guess that NVidia's still doing colour/Z compression-type stuff (and the atomics are effectively providing a portion of fixed-function ROP capability, too), so some kind of back-end processing for render targets, just not full ROPs.
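As a rough illustration of what I mean by atomics standing in for a slice of ROP work (purely a sketch, not a claim about GF100's actual back end - the buffer and function names are made up), a depth test can be resolved with atomicMin on a Z buffer:

```cuda
// A minimal sketch: global atomics covering one slice of fixed-function ROP
// work - the depth test. Colour writes would be left to a later pass for the
// surviving fragments. Buffer layout and names are hypothetical.
__global__ void depth_test(unsigned int* zbuffer, const unsigned int* frag_z,
                           const int* frag_pixel, int num_frags)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_frags) return;

    // atomicMin returns the previous value; if our Z is smaller we currently
    // "own" the pixel, though a later fragment may still displace us.
    unsigned int old_z = atomicMin(&zbuffer[frag_pixel[i]], frag_z[i]);
    (void)old_z; // a real renderer would record the winner for the colour pass
}
```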
Finally, there's that scheduler from hell, chewing through die space like there's no tomorrow.
That's my gut feel (and I'm only talking about D3D graphics performance). Of course if it does have TMUs and ROPs, then I retract a good bunch of my comments. ROPs, theoretically, should have disappeared first (compute-light) - but it's pretty much impossible to discern anything about them from the die shot. Separately, I just can't see enough stuff per core for TMUs - and that's what I'm hinging my opinion on.
But from what we do know, Intel aims at GTX285 performance, and so far it looks like Fermi is going to be considerably faster than GTX285, hence probably faster than Larrabee.
With full C++ support, I don't really think there's much that x86 can do that Fermi can't.
Broadly, anything Larrabee can do, GF100 can do too in terms of programmability. What gives me pause is that on Larrabee multiple kernels per core can work in producer/consumer fashion. In GF100 producer/consumer requires either transmitting data amongst cores or using branching within a single kernel to simulate multiple kernels. This latter technique is not efficient. (Hardware DWF would undo this argument.)
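To illustrate that workaround (a minimal sketch assuming a 256-thread block; all the names are hypothetical): producer and consumer get fused into one kernel and selected by a branch, so half the lanes idle during each phase - which is the inefficiency I'm getting at, and exactly the sort of thing hardware DWF could repack:

```cuda
// Sketch of the "branch within a single kernel" workaround: two logical
// kernels fused into one, selected by a branch, handing data over through
// shared memory. Assumes blockDim.x == 256. In each phase half the lanes
// sit idle on the role branch - the inefficiency referred to above.
__global__ void fused_producer_consumer(const float* input, float* output, int n)
{
    __shared__ float staging[128];            // hand-off buffer, half a block's worth

    int half = blockDim.x / 2;                // first half produces, second half consumes
    bool is_producer = (threadIdx.x < half);
    int lane = is_producer ? threadIdx.x : threadIdx.x - half;
    int item = blockIdx.x * half + lane;

    if (is_producer) {
        if (item < n)
            staging[lane] = input[item] * 2.0f;   // stand-in for the "producer kernel"
    }
    __syncthreads();                              // hand the produced data over
    if (!is_producer) {
        if (item < n)
            output[item] = staging[lane] + 1.0f;  // stand-in for the "consumer kernel"
    }
}
// launch e.g.: fused_producer_consumer<<<(n + 127) / 128, 256>>>(in, out, n);
```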
I think that the more finely-grained parallelism in Larrabee (not to mention that the SIMDs are actually narrower in logical terms) will allow Intel to tune data-flows more aggressively. The large cache will also make things more graceful under strain.
Against this I see three key advantages for NVidia: it's graphics, stupid; CUDA has been a brilliant learning curve in applied throughput computing (giving a deep understanding to draw on in constructing a new ISA); certain corners of the architecture (memory controllers, shared memory) are home turf.
Anyway, I'm dead excited. This is more radicalism than I was hoping for from NVidia. Just a bit concerned that the compute density's not so hot.
Jawed