Buntar: I generated the data for that graph, so I should know...
Because of the lack of TAs *and* an FP32 bug (confirmed by NV) on G80 under the drivers used for that test, it is indeed impossible to see what I'm referring to there, though. These numbers (for G92) should hopefully be a fair bit clearer!
G92 Trilinear
----------------------------------
DXT1: 16783.216533MTexops/s
DXT3: 16783.216533MTexops/s
DXT5: 16666.666418MTexops/s
INT8 Vec1: 16783.216533MTexops/s
INT8 Vec2: 16783.216533MTexops/s
INT8 Vec3: 16783.216533MTexops/s
INT8 Vec4: 16783.216533MTexops/s
FP10: 8391.608267MTexops/s
RGB9E5: 8391.608267MTexops/s
Depth16: 8333.333209MTexops/s
Depth24: 8391.608267MTexops/s
Depth32: 8391.608267MTexops/s
FP16 Vec1: 16783.216533MTexops/s
FP16 Vec2: 16783.216533MTexops/s
FP16 Vec3: 8362.369213MTexops/s
FP16 Vec4: 8391.608267MTexops/s
INT16 Vec1: 16551.723891MTexops/s
INT16 Vec2: 16666.666418MTexops/s
INT16 Vec3: 8391.608267MTexops/s
INT16 Vec4: 8391.608267MTexops/s
FP32 Vec1: 16783.216533MTexops/s
FP32 Vec2: 8421.052506MTexops/s
FP32 Vec3: 4203.152302MTexops/s
FP32 Vec4: 4195.804133MTexops/s
As for blending: I agree it's not a big problem, and if you do the calculations it's only a very minor bottleneck. If you look at a 'classic' particle workload, I think depth often isn't even read (hier-z...) and it doesn't need to be written either. As for texturing, you only read one DXT5 texture; that's 1 or 2 bytes per pixel, depending on whether you use bilinear or trilinear filtering.
So if you estimate blending to take exactly 12 bytes/pixel for your average particle, counting memory subsystem inefficiencies, then the 8800GT would be perfectly balanced: it requires and has exactly 57.6GB/s for that. But I can definitely imagine scenarios where it takes, say, 10 bytes/pixel. Then you've just lost 17% performance for that part of the frame, because the blend rate rather than bandwidth becomes the limit. You could argue that it's no longer the case with 4x MSAA, but not every benchmark and/or game is run with AA, obviously.
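To spell out that arithmetic, here's a rough Python sketch. The only inputs are the 57.6GB/s and 12 bytes/pixel figures above; the function name and the simple model (throughput capped by whichever of blend rate or bandwidth runs out first) are my own assumptions for illustration, not a measured result.

```python
# Back-of-the-envelope blend-rate vs. bandwidth check.
# Assumption: performance is limited by min(ROP blend rate, bandwidth / bytes per pixel).

BANDWIDTH_GB_S = 57.6         # 8800GT memory bandwidth, from the post
BALANCED_BYTES_PER_PIXEL = 12.0  # traffic per blended pixel at which the chip is "balanced"

# Blend rate implied by "perfectly balanced" at 12 bytes/pixel (Gpixels/s).
blend_rate_gpix = BANDWIDTH_GB_S / BALANCED_BYTES_PER_PIXEL


def blend_loss(balanced_bytes_per_pixel, actual_bytes_per_pixel):
    """Fraction of potential performance lost to the blend-rate cap (hypothetical helper)."""
    if actual_bytes_per_pixel >= balanced_bytes_per_pixel:
        return 0.0  # bandwidth-bound: the ROPs are not the limit
    # Bandwidth could feed balanced/actual times more pixels than the ROPs can blend.
    return 1.0 - actual_bytes_per_pixel / balanced_bytes_per_pixel


if __name__ == "__main__":
    print(f"implied blend rate: {blend_rate_gpix:.2f} Gpixels/s")
    print(f"loss at 10 bytes/pixel: {blend_loss(12.0, 10.0):.1%}")
```

At 10 bytes/pixel the bandwidth could feed 5.76 Gpixels/s but the blend rate stays at 4.8, which is where the ~17% comes from.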
And while the 8800GT seems mostly balanced, the fact that the blending rate is so 'borderline' means that's not the case for every SKU; the 8800 Ultra, for example. Anyhow, keeping FP16 at the same speed as INT8 doesn't make a lot of sense to me; but you're right that it doesn't matter much and I should just STFU about this for once!
It really isn't fair to put this bottleneck on the same footing as triangle setup on G80 as I sometimes did, either...
It is going to be interesting to see what happens when NV switches to GDDR5 though (2500MHz+ vs 900MHz for G92's GDDR3). If they 'only' double the number of ROPs per memory partition, they'd need ~40% higher clock rates to achieve identical ROP performance/bit of bandwidth. While ~850MHz isn't really unrealistic by itself, it might be on a larger chip than G92 on 65/55nm. Hmmm...
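For what it's worth, the ~40% figure falls out of a quick calculation like the one below. The memory clocks and the 2x ROP multiplier are from the paragraph above; the 600MHz G92 core clock is my assumption (the 8800GT's stock clock), and the helper name is made up for the example.

```python
# Rough scaling estimate: how much faster must the ROPs be clocked to keep
# the same ROP throughput per bit of bandwidth after a memory upgrade?

def required_clock_scale(new_mem_mhz, old_mem_mhz, rop_multiplier):
    """Core clock scale factor needed to match the bandwidth increase (hypothetical helper)."""
    bandwidth_scale = new_mem_mhz / old_mem_mhz  # per-partition bandwidth grows with memory clock
    return bandwidth_scale / rop_multiplier      # extra ROPs absorb part of that growth


GDDR5_MHZ = 2500.0             # from the post ("2500MHz+")
GDDR3_MHZ = 900.0              # G92's GDDR3, from the post
ROP_MULTIPLIER = 2.0           # "double the number of ROPs per memory partition"
ASSUMED_G92_CORE_MHZ = 600.0   # assumption: 8800GT stock core clock

scale = required_clock_scale(GDDR5_MHZ, GDDR3_MHZ, ROP_MULTIPLIER)
target_core_mhz = ASSUMED_G92_CORE_MHZ * scale

if __name__ == "__main__":
    print(f"required clock scale: {scale:.2f}x")
    print(f"target core clock: {target_core_mhz:.0f} MHz")
```

That lands in the low-to-mid 800s of MHz, so the ~850MHz above is in the right ballpark; a memory clock above 2500MHz pushes it higher still.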