AFAIK, the tessellator in the Xbox 360 is mostly unused. Would you say that it has disadvantaged non-tessellated geometry rendering?
Yes, of course. If they had chosen not to include the tessellator, those transistors could have been spent on other features (such as one extra shader ALU), or just removed (to improve yields and manufacturing costs, and thus reduce the console price). Simple as that. This is the exact reason why I feel that spending a large amount of transistors on fixed function hardware for features that are not yet established (such as real time raytracing) is not a good decision. It will be a hit or a miss. A big gamble, really.
If a huge amount of transistors is spent on some rendering technique that's not yet proven to be usable, there's always the risk of not giving developers exactly what they want. There will be limitations that nobody thought about, because the rendering technique wasn't yet widely used and thoroughly researched. A fixed function hardware solution is also likely to result in less research into alternative techniques, and that's not a good thing if one of the alternative techniques proves to be better in the long run.
Basically you should only use fixed function hardware for established tasks: tasks that are thoroughly researched, where the "best" solution to the problem exists and is agreed on by most developers, and that are commonly used in the huge majority of games. Good examples: texture samplers, triangle setup, CPU caches, math (sqrt, exp, estimates...), texture decompression, depth buffering (including hierarchical depth and depth compression), general purpose compression/decompression (LZ-style), video decompression (for the parts that benefit most from FF hardware), audio mixing...
---
I purposefully left MSAA out of the above list. Let me explain it a bit.
Efficient 4xMSAA means that you need:
1. 4x fixed function point-inside-triangle hardware
2. 4x fixed function depth testing hardware
3. 4x fixed function stencil testing hardware
4. 4x fixed function blending hardware
5. 4x wider data path from rops to memory (but not any wider from PS to rops)
6. More memory and more memory bandwidth (for backbuffer operations) and/or MSAA color compression and depth compression (again more fixed function hardware; rough footprint math sketched after this list)
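To put point 6 in numbers, here's a quick back-of-envelope sketch (assuming a 1280x720 target with 32-bit color and 32-bit depth; the exact figures obviously depend on the formats used):

```cpp
#include <cstdio>

int main()
{
    const int width = 1280, height = 720;     // assumed 720p render target
    const int samples = 4;                     // 4xMSAA
    const int bytesColor = 4, bytesDepth = 4;  // assumed 32-bit color + 32-bit depth

    double noMsaaMB = double(width) * height * (bytesColor + bytesDepth) / (1024.0 * 1024.0);
    double msaaMB   = noMsaaMB * samples;

    // Every stored byte is also a byte that has to travel between the rops and memory,
    // so the bandwidth demand scales the same way as the footprint.
    printf("1xAA  : %.1f MB color+depth\n", noMsaaMB); // ~7.0 MB
    printf("4xMSAA: %.1f MB color+depth\n", msaaMB);   // ~28.1 MB
    return 0;
}
```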
That's a lot of hardware. And MSAA is only used by a small minority of current generation console games (and not in all PC games either). Even in the games that use MSAA, the MSAA hardware is only used for around 20% of the frame time (rendering objects to g-buffers). It does nothing for the other 80% (it's not needed in shadow map rendering, lighting and post processing).
Current PC GPUs solve the MSAA backbuffer bandwidth demands with depth and color compression hardware (more fixed function hardware), and the rest is solved by making the rops (much) fatter. Xbox 360 solved the problem with EDRAM. It had the required bandwidth and built in fixed function processors for blending/z-buffering/stenciling. EDRAM solved all six requirements.
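As a rough sanity check of why the EDRAM bandwidth matters (a back-of-envelope sketch; I'm assuming 4 bytes each for color read, color write, depth read and depth write per sample, which is obviously a simplification of the real traffic):

```cpp
#include <cstdio>

int main()
{
    const double clockHz        = 500e6; // Xenos GPU clock
    const int    rops           = 8;
    const int    samples        = 4;     // 4xMSAA
    const int    bytesPerSample = 16;    // assumed: color r/w + depth r/w, 4 bytes each

    double gbPerSec = clockHz * rops * samples * bytesPerSample / 1e9;
    printf("peak rop <-> EDRAM traffic: %.0f GB/s\n", gbPerSec); // 256 GB/s
    return 0;
}
```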
Now let me talk about shadow map rendering. It's an operation that is heavily fill rate bound. In games that have lots of dynamic shadow casting lights, shadow map rendering can take up to 30% of the frame time. Shadow map rendering mainly requires fixed function depth buffering hardware (and storing the depth samples to a render target). These are its main bottlenecks. GPUs with 4xMSAA have four times more fixed function depth buffering hardware (and the bus/bandwidth/compression to store the depth values). That must be a very good thing for shadow mapping, right? Will it provide 4x depth fill rate?
Unfortunately no... You can render the shadow map to a 4xMSAA buffer, have proper separate depth testing for each subsample, and get 4x fill rate. However, the fixed function point-inside-triangle testers have fixed offsets (an optimized MSAA pattern), and these offsets do not form a regular grid. So sampling from the rendered shadow map texture is not straightforward. Even the simplest nearest neighbor sampling requires an excessive amount of math (dozens of instructions). Because one of the six fixed function components is not up to the task (1), all the others (2-6) are pretty much unusable for shadow mapping as well. What a shame.
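To make the sampling problem concrete, here's a minimal CPU-side C++ sketch (not real shader code, and the rotated offsets below are made up for illustration, not any GPU's actual pattern):

```cpp
// Stand-in for a texture fetch: depth of subsample s inside texel (x, y).
static float fetchDepth(int x, int y, int s)
{
    (void)x; (void)y; (void)s;
    return 0.0f; // placeholder; a real version would read the MSAA depth surface
}

// Regular 2x2 subsample grid: the nearest subsample index falls straight out of the
// coordinate bits, so the surface behaves like a plain double resolution depth buffer.
float sampleRegularGrid(float u, float v) // u, v in texel units
{
    int sx = int(u * 2.0f) & 1;
    int sy = int(v * 2.0f) & 1;
    return fetchDepth(int(u), int(v), sy * 2 + sx);
}

// Made-up rotated 4x pattern (offsets inside the texel); real hardware patterns differ.
static const float kOffset[4][2] = {
    { 0.375f, 0.125f }, { 0.875f, 0.375f }, { 0.125f, 0.625f }, { 0.625f, 0.875f }
};

// Rotated pattern: there is no closed-form index, so even nearest neighbor needs a
// distance search over the subsamples. A fully correct version would also have to
// consider the neighboring texels' subsamples, which costs even more instructions.
float sampleRotatedPattern(float u, float v)
{
    int tx = int(u), ty = int(v);
    float fx = u - float(tx), fy = v - float(ty);
    int best = 0;
    float bestDist = 1e30f;
    for (int s = 0; s < 4; ++s)
    {
        float dx = fx - kOffset[s][0];
        float dy = fy - kOffset[s][1];
        float d = dx * dx + dy * dy;
        if (d < bestDist) { bestDist = d; best = s; }
    }
    return fetchDepth(tx, ty, best);
}
```

With the regular grid the lookup is basically free index math; with the rotated pattern every fetch turns into a small search, which is exactly the "dozens of instructions" problem.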
The lesson here is: even if some rendering technique is currently very popular (MSAA antialiasing), it might suddenly become a niche because of new research (deferred shading). MSAA depth testing hardware would be very good for shadow map rendering (4x fill rate), but because of a single small mistake in the FF hardware design it cannot be used for shadow map rendering at all (two sets of MSAA sample offsets, and a flag to select the preferred one, would be enough to solve the issue). ---> Fixed function hardware is very fragile. Even tiny limitations in it might make it unusable.
All this extra fixed function hardware for 4xMSAA makes the rops very fat. Current generation consoles have only eight rops, and are thus often fill rate bound. I would trade all that MSAA hardware for 16 rops any day (and even that would only halve the fixed function hardware listed above). EDRAM has the bandwidth, so that's not a problem.
When power is the major constraining factor, why do we care about utilisation? FF will be "king" so long as it allows continued progress while offering ultimately superior performance. If utilisation was the key metric, VLIW never would have happened. Being able to do whatever you want however you want has a price too; Larrabee wasn't able to pay the price of admission.
VLIW happened. Separate vertex and pixel shaders also happened...
Processors with VLIW such as Intel Itanium and Transmeta Crusoe are dead. Itanium was always plagued with heat problems. GPUs with VLIW are going the same route: NVIDIA dropped both VLIW and separate pixel and vertex shaders in the GeForce 8000 series, and we all know how that story ended (the 8800 is one of the most popular graphics cards ever). AMD kept VLIW alive for longer, but dropped it in GCN. So VLIW is as good as dead (in both CPU and GPU markets).
Utilization is important. The Pentium 4 pipeline was made long to allow very high clocks. It's very hard to keep a long pipeline properly utilized. Intel's Core 2 architecture came with a much shorter pipeline that allowed much higher hardware utilization. The P4 had much higher theoretical peak numbers, but the improved utilization was the key to both good performance and good efficiency. Thus Core 2 was the biggest jump in perf/watt we have ever seen in CPUs.
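A toy model makes the point (a sketch with illustrative round numbers, not measured figures: a one-wide pipeline, a branch mispredict every 50 instructions, and a flush that costs roughly the pipeline depth in cycles):

```cpp
#include <cstdio>

// Effective IPC of a one-wide pipeline when every mispredict wastes ~depth cycles.
double effectiveIPC(int pipelineDepth, double instrPerMispredict)
{
    return instrPerMispredict / (instrPerMispredict + pipelineDepth);
}

int main()
{
    double deep    = effectiveIPC(30, 50.0); // long, P4-style pipeline
    double shallow = effectiveIPC(14, 50.0); // shorter, Core 2-style pipeline
    printf("deep pipe   : %.2f IPC (%.0f%% utilization)\n", deep,    deep * 100.0);    // ~0.63 / 63%
    printf("shallow pipe: %.2f IPC (%.0f%% utilization)\n", shallow, shallow * 100.0); // ~0.78 / 78%
    return 0;
}
```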
---
Also you have to think about chip manufacturing costs. Each fixed function unit requires some extra transistors, and these transistors have a very low utilization ratio (they are idling most of the time even in software that uses them). These units also require a separate design (research cost), while you can often just replicate the programmable units many times to scale up performance, and slightly improve the design from one generation to another. Replication is also a boon for programmable units because of yields. You can just add a few extra programmable units to improve yields (and thus lower production costs). A programmable unit can replace any other programmable unit if there is a defect in chip manufacturing. For fixed function units, you need a separate spare for each fixed function unit type (a broken video decoder cannot be replaced by anything else, so you might need to have two of them on the chip just to improve yields). Additionally, broken programmable units often only degrade the performance, while the feature set stays the same. This provides a cost effective method for selling hardware at lower price points.
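A toy yield model shows why replication helps (a sketch with a made-up 95% per-unit survival probability, not real process data):

```cpp
#include <cstdio>
#include <cmath>

// Probability that at least `needed` out of `built` identical units are defect free,
// when each unit independently survives manufacturing with probability p.
double yieldWithSpares(int built, int needed, double p)
{
    double total = 0.0;
    for (int good = needed; good <= built; ++good)
    {
        double comb = 1.0; // binomial coefficient C(built, good)
        for (int i = 0; i < good; ++i)
            comb = comb * double(built - i) / double(i + 1);
        total += comb * pow(p, good) * pow(1.0 - p, built - good);
    }
    return total;
}

int main()
{
    const double p = 0.95; // made-up per-unit survival probability

    // Programmable units: any working unit can stand in for any other one.
    printf("need 16 of 16 (no spares): %.1f%%\n", 100.0 * yieldWithSpares(16, 16, p)); // ~44%
    printf("need 16 of 18 (2 spares) : %.1f%%\n", 100.0 * yieldWithSpares(18, 16, p)); // ~94%

    // A unique fixed function unit has no such option: it either works or the whole
    // feature is gone, unless you pay for a full duplicate of it.
    printf("single FF unit           : %.1f%%\n", 100.0 * p);
    return 0;
}
```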