OK, I'll have a quick go. There are an awful lot of threads that already cover this kind of stuff.
Bill said:
1.: ALU's. I have heard in a few places Xenos ALU's described as more powerful. The basic idea is they work on "vec4's+scalar" while RSX works on "vec3+scalar". To me this seems to mean Xenos Alu's are 20% more powerful. However different articles say different things, and it seems very hard to pin down. Is this true?
Xenos's primary shader ALUs have to be Vec4+scalar, because they serve for both vertex programs and pixel programs. Vertex programs normally work with a Vec4+scalar architecture.
It's arguable that the Vec4+scalar architecture will be largely wasted when running pixel shaders, which matters because the traditional GPU typically spends more time/resources running pixel shaders than vertex shaders. The waste comes from pixel shaders only rarely needing the full width, i.e. Vec3+scalar+scalar. Though you could read the extra lane as allowing RGBA (colour channels + alpha channel) + scalar. So, maybe not.
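To put rough numbers on the lane-width point, here's a minimal sketch that counts lanes and nothing else (no clocks, no dual issue, no scheduling). The function name and constants are mine, purely for illustration. By this count the wider ALU has 25% more lanes at peak, or equivalently Vec3+scalar is 20% narrower, which may be where the 20% figure in the question comes from.

```python
def lane_utilisation(lanes_used: int, lanes_available: int) -> float:
    """Fraction of ALU lanes doing useful work for one instruction."""
    if not 0 <= lanes_used <= lanes_available:
        raise ValueError("lanes_used must be between 0 and lanes_available")
    return lanes_used / lanes_available


VEC4_SCALAR_LANES = 5  # Vec4 + scalar
VEC3_SCALAR_LANES = 4  # Vec3 + scalar

# Peak width ratio of the two organisations: 5 lanes against 4.
peak_ratio = VEC4_SCALAR_LANES / VEC3_SCALAR_LANES

# An op that only needs RGB + one scalar fills 4 of the 5 lanes.
rgb_plus_scalar = lane_utilisation(4, VEC4_SCALAR_LANES)

# An RGBA + scalar op fills all 5 lanes.
rgba_plus_scalar = lane_utilisation(5, VEC4_SCALAR_LANES)

print(peak_ratio, rgb_plus_scalar, rgba_plus_scalar)
```

So the extra lane only pays off when the shader actually has a fifth component's worth of work to give it.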
Technically RSX's pipeline is more powerful. It has a double-primary ALU organisation capable of dual-issuing a Vec3+scalar MAD (whereas Xenos can't). RSX also offers special-purpose units for other, fairly rare instructions, and it runs those very fast.
The RSX pipeline's major problem is register bandwidth. It simply isn't always possible to perform a dual-issued MAD, because the number of registers the two instructions need to read exceeds the number of registers the pipeline can actually fetch.
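A toy way to picture the register-bandwidth constraint is below. The fetch limit of 4 is a made-up number for illustration, not a real RSX figure, and the function and register names are hypothetical. The only idea it shows is that two MADs (each reading three sources, a * b + c) can pair up when they share operands but not when every operand is distinct.

```python
def can_dual_issue(mad_a, mad_b, max_register_reads):
    """Return True if the distinct source registers of both MADs fit the fetch budget.

    Each MAD is a tuple of three source register names (a * b + c).
    """
    distinct_sources = set(mad_a) | set(mad_b)
    return len(distinct_sources) <= max_register_reads


HYPOTHETICAL_FETCH_LIMIT = 4  # illustrative only, NOT a documented RSX value

# r0*r1+r2 and r0*r1+r3 share two operands: 4 distinct registers in total.
shared = can_dual_issue(("r0", "r1", "r2"), ("r0", "r1", "r3"),
                        HYPOTHETICAL_FETCH_LIMIT)

# r0*r1+r2 and r3*r4+r5 share nothing: 6 distinct registers in total.
independent = can_dual_issue(("r0", "r1", "r2"), ("r3", "r4", "r5"),
                             HYPOTHETICAL_FETCH_LIMIT)

print(shared, independent)
```

Under that assumed limit the first pair dual-issues and the second doesn't, which is the kind of "exception" the compiler has to schedule around.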
There are other more complicated limitations due to instruction dependency and texture address calculation ALUs (and dependent texturing) that add further blows to the peak capability of the RSX pipeline.
In short, the RSX pipeline can't sustain very high utilisation of all its constituent units (ALUs etc.).
Xenos's design is the diametric opposite: provide the minimum number of units per pipeline that get the job done, so that there's little room for all the "exceptions".
That's where the increase in pipeline count and the use of a complicated scheduler and fully decoupled texture pipes come in. It's a transistor trade between per-pipe functionality and pipeline-ALU-utilisation. (It's also a compiler-trade - a shader program, when compiled by the driver, has to fit the pipeline like a glove - it's actually a seriously difficult computing problem to make that fit. So a simpler pipeline makes the compiler's job much easier and more likely to get the perfect fit.)
2: Mini-alu's. This is a key difference. While 7800 GTX/RSX has 56 ALU's (8 vector, 48 pixel shader) to 48 for Xenos, not a big difference, the RSX/7800 also has 48 mini-alu's. It seems these help, but it's hard to say how much or even what they do. Anybody care to put say, a percentage in broad terms on it? Is a mini alu "worth" roughly say, ten percent of a major?
An NVidia mini-ALU provides some bias/scaling/clamping functionality. That basically means multiply or divide in powers of 2; add; or limit results to a range (e.g. the range 0...1).
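In code terms those three operations look something like the sketch below. The function names and the order in which they're chained (scale, then bias, then clamp) are my assumptions for illustration, not a description of how the hardware sequences them.

```python
def scale_pow2(x: float, exponent: int) -> float:
    """Multiply (exponent > 0) or divide (exponent < 0) by a power of two."""
    return x * (2.0 ** exponent)


def bias(x: float, offset: float) -> float:
    """Add a constant offset."""
    return x + offset


def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Limit x to the range lo...hi (0...1 by default)."""
    return max(lo, min(hi, x))


def mini_alu(x: float, exponent: int = 0, offset: float = 0.0) -> float:
    """Assumed chaining: scale, then bias, then clamp to 0...1."""
    return clamp(bias(scale_pow2(x, exponent), offset))


doubled_and_clamped = mini_alu(0.75, exponent=1, offset=-0.25)  # 1.5 - 0.25, clamped
halved_and_biased = mini_alu(0.5, exponent=-1, offset=0.125)    # 0.25 + 0.125

print(doubled_and_clamped, halved_and_biased)
```

None of these is much work on its own, which is part of why counting a mini-ALU as some fraction of a full ALU is so slippery.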
ATI's mini-ALU provides add. It may provide other things, but I dunno...
I've seen (and been involved in) plenty of attempts to quantify processing power in terms of ALUs (which then leads into shader ops and GFLOPs). It ain't worth it.
3: Tessellation/interpolation. Xenos has these in hardware. Are they good? I thought tessellation was something necessary for GPU operation, but apparently it's more an optional "effect"? Also, does 7800 have dedicated interpolation hardware?
I'm biased here - I think Xenos's tessellation is what DX10 is going to provide within the next year.
If you read:
http://www.beyond3d.com/forum/showpost.php?p=572515&postcount=166
http://www.beyond3d.com/forum/showpost.php?p=572526&postcount=167
and combine those tessellation concepts with the concept of the Xbox Procedural Synthesis you'll start to realise this is an order of magnitude beyond where RSX is.
4: (particularly for Jawed) Does Xenos use an advanced scheduler like R520, to further drive efficiency apart from the unified shaders? How does this work? Will it be a huge advantage?
Conceptually they're the same as far as I can tell (apart from Xenos's ability to schedule vertices and fragments). I do have my suspicions that there are a few cut corners in R520...
The major demerit in Xenos is that it appears to use 64-pixel (or vertex) batches. Compared with R520's 16-pixel batches, Xenos looks to be at a disadvantage.
Batch size is a property, jointly, of the scheduler and the memory system (for registers) that supports it. So in that sense Xenos is not so advanced.
But R580, which is more Xenos-like (due to pipeline count!), appears to be forced into making the same, larger-batch, compromise. So R520 may prove to be a bit of an exception with its small batches. Hard to tell.
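One way to see why bigger batches can hurt: assume a batch that isn't full still ties up a whole batch's worth of slots. That's my simplification, and the real cost depends on how the hardware packs batches. The 20-pixel workload and the function name are made up for illustration.

```python
import math


def batch_cost(pixels: int, batch_size: int) -> tuple[int, int]:
    """Return (batches needed, idle slots) when pixels are packed into fixed-size batches."""
    if pixels < 0 or batch_size <= 0:
        raise ValueError("pixels must be >= 0 and batch_size must be > 0")
    batches = math.ceil(pixels / batch_size)
    idle_slots = batches * batch_size - pixels
    return batches, idle_slots


# A hypothetical 20-pixel chunk of work in 64-wide and in 16-wide batches.
wide = batch_cost(20, 64)
narrow = batch_cost(20, 16)

print(wide, narrow)
```

Under that assumption the 64-wide batch leaves 44 slots idle against 12 for two 16-wide batches, so the smaller batch wastes less on small or irregular work.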
But in basic conceptual terms, the fact that Xenos and R520 are capable of issuing out-of-order batches to either the shader or texture pipelines is a fairly big deal and makes them pretty much equal.
I'm still waiting to get a definitive comparison of the effect on efficiency of out-of-order scheduling. We're seeing numbers anywhere from 10-40%.
Jawed