Xenos's 8 ROPs, which have essentially no compression functions, cost ~20M transistors and are less functional than D3D10 ROPs. 24 ROPs in G80 probably cost in the region of 100M transistors. 8 Zs per ROP per clock is a lot.
Okay, that's a good point. They still have to decompress the data stream, but I'll agree that it's less than what an EDRAM-less GPU has to deal with.
It should be noted that Xenos has 4 blend units per ROP, and we don't know how many G80 has. It may only have 1, because if the samples aren't identical (and hence not compressed) it'll take a while to transfer the data anyway. Also, I think G80 has half-speed blend like NV40. The 3DMark numbers are showing precisely 12 pixels per clock (6.9 Gpix/s). That will save a lot of transistors.
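To put rough numbers on that half-speed blend claim, here's a quick sanity check; the 575 MHz core clock of the 8800 GTX is an assumption, not something stated above:

```python
# Half-speed blend sanity check (575 MHz 8800 GTX core clock assumed).
core_clock_hz = 575e6
pixels_per_clock = 12                      # half of the 24 ROPs when blending
print(f"{core_clock_hz * pixels_per_clock / 1e9:.1f} Gpix/s")  # ~6.9, matching the 3DMark figure
```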
Nonetheless, let's assume you're right with 100M. That's still in line with what I was talking about. Suppose 16 DX10 ROPs doing 2 samples per clock cost 40M transistors. I think that's fair. 60M additional transistors is under 10% of G80, and though I can never prove it, I think G80 has benefited more than 10% in AA performance (the only performance that counts for this market segment). I really don't think they've been wasteful here.
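For what it's worth, the budget arithmetic works out; the ~681M transistor count for G80 is the commonly quoted figure, and the ROP costs are just the estimates from the posts above:

```python
# ROP budget back-of-the-envelope (681M G80 transistor count assumed;
# 100M and 40M are the estimates discussed above).
g80_transistors = 681e6
full_rops = 100e6        # 24 D3D10 ROPs, 8 Z per clock each
lean_rops = 40e6         # hypothetical 16 ROPs at 2 samples per clock
extra = full_rops - lean_rops
print(f"{extra / g80_transistors:.1%} of the die")   # ~8.8%, i.e. under 10%
```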
Texture fetch (as opposed to bilinear or AF filter) is rarely a bottleneck.
Even with AF disabled, R580 was very often only marginally faster than R520. (Trilinear has always had a low impact with the mipmap optimizations we've seen, so bilinear rate = fetch rate for all intents and purposes.)
You lamented single-cycle trilinear as wasteful back just before G80 launched.
Filtering is the single most expensive part of texturing:
Originally, yes. I thought it was wasteful because you only need a bit more hardware to make double speed bilinear (it's probably even a greater fraction than what's shown by that chart once you take the sample fetching, cache, and MC into account). However, then I realized that you would need twice the threads and registers to truly double bilinear texturing speed. That was a pretty big oversight in my original analysis. Moreover, I did not consider that it would also double the single mipmap speed of FP16, I16, FP32, other >8bpc HDR formats, and 8bpc volume textures.
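Roughly speaking, the thread cost follows from in-flight work = throughput x latency (Little's law); a sketch with an illustrative, not measured, fetch latency:

```python
# Why doubling bilinear rate roughly doubles the thread/register cost:
# in-flight pixels = throughput * latency. The latency figure below is an
# illustrative assumption, not a measured G80 number.
tex_latency_clocks = 200
pixels_per_clock = 4                                # one filtered quad per clock
print(pixels_per_clock * tex_latency_clocks)        # 800 pixels in flight
print(2 * pixels_per_clock * tex_latency_clocks)    # 1600: twice the threads, so ~twice the registers
```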
I think alpha-blending and overdraw are what cause the worst framerate minima, things like explosions, clouds of smoke, lots of characters running around the screen, large areas of foliage. I think in comparison, "texture load" is relatively constant in regions of heavy texturing - you don't get a 50% reduction in framerate from crouching.
That's true, and I've brought that point up many times myself too. I wasn't saying AF is the biggest cause of dips in framerate, but just suggesting that it's a load that could wildly fluctuate when everything else stays the same, especially if texture resolutions start going up the way we all want them to.
Agreed.
best-case. So, ahem, G84 v RV630 is a solid win with 4xAA/AF at 1280x1024, but 8800GTS-640 against HD2900XT is a narrower victory at 1600x1200 8xAA/AF:
Even if you disable AF to almost negate the filtering advantage, the 8800GTS-640 is not far from the HD2900XT. With less fillrate (well, only without AA, I guess), way less bandwidth, and way less math ability, you have to think the equal texturing rate has something to do with it, right?
I'm thinking of an alternate history where G80 was 12 clusters, with a 1:1 TA:TF ratio, so more ALU capability and less TF.
But that would mean 48 bilinear TMUs! Weren't you just saying fetch rate is less important than filtering rate? By definition a cluster has a TMU quad, so if you wanted more math and less texturing, you'd go for fewer clusters with wider SIMDs (and a larger warp size).
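Putting unit counts on that, assuming the commonly quoted 4 TA / 8 TF per cluster for G80 as shipped:

```python
# Texture address (TA) vs filter (TF) unit totals: shipped G80 versus the
# hypothetical 12-cluster 1:1 design (4 TA / 8 TF per G80 cluster assumed).
def tex_units(clusters, ta_per_cluster, tf_per_cluster):
    return clusters * ta_per_cluster, clusters * tf_per_cluster

print(tex_units(8, 4, 8))     # G80 as shipped: (32, 64) -- 32 address, 64 filter
print(tex_units(12, 4, 4))    # hypothetical 1:1: (48, 48) -- more fetch, less filter
```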
Remember how the original GeForce could do single-cycle trilinear (i.e. 2 filtered quads per clock) in 15M transistors. It cost NVidia 10M transistors to double the addressing ability and double the pixels in flight for the GeForce 2, though I suspect they also increased the texture blending math, and the higher clock speed had some cost as well.
G8x architecture was set too far back for the "prove to the world" thing. Don't forget that a batch in G80 is actually 16 objects in size. A warp is two batches married, because it makes the more complex register operations of pixel shaders kinder on the register file. Pixel shaders will tend towards vec2 or scalar operands, while vertex shaders will tend towards vec4 operands.
There are 2 parameters in warp size: clocks per instruction and SIMD width. The problem with making the SIMD wider, say 16 objects, is that the register file needs to get twice as wide. G80 fetches 16 scalars per clock (and 16 constants per clock and 16 scalars per clock from PDC). All of these would have to be doubled if the SIMD gets wider.
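A rough illustration of that scaling; the three-operand count (for a MAD) is an assumption, and the absolute numbers aren't meant to match G80 exactly:

```python
# Per-clock operand fetch traffic scales linearly with SIMD width
# (3 register operands for a MAD assumed; constant and PDC paths included).
def scalars_per_clock(simd_width, operand_ports=3):
    register_file = simd_width * operand_ports
    constants = simd_width
    pdc = simd_width                     # parallel data cache path
    return register_file + constants + pdc

print(scalars_per_clock(8))    # baseline width
print(scalars_per_clock(16))   # double the width -> every path doubles
```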
Okay, we can scrap my "prove to the world" idea, but given that ATI went nuts for R520's tiny batch size, I can see how NVidia would do the same when planning for G80. They obviously knew how useless G70's DB was.
I'm not following you on the impact of pixel shader vs. vertex shader operands on the register file. In a scalar architecture it doesn't really matter: you store your data in an SOA manner in the RF, and parallel MADDs still need to load the same 8 floats for each of 3 operands as serial MADDs do. The point about clocks between instruction changes makes sense, though. A vec4 op on a batch of 16 would use the same instruction for 8 clocks on 8-wide SIMD arrays.
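Spelling out that clocks-per-instruction arithmetic (batch and warp sizes as described above):

```python
# Clocks an instruction stays resident on the SIMD: scalar ops / lanes.
def clocks_per_instruction(batch_size, components, simd_width):
    return (batch_size * components) // simd_width

print(clocks_per_instruction(16, 4, 8))   # vec4 op, 16-object batch: 8 clocks
print(clocks_per_instruction(32, 1, 8))   # scalar op, 32-object warp: 4 clocks
```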
Also, making the register file wider isn't nearly as costly as making it bigger. The number of operands is still the same. One quarter of R600 fetches data for 80 SPs each clock; the sixteen arrays in G80 only fetch data for 8 SPs. Anyway, regardless of how you approach it, it would definitely be more costly to keep the warp size the same and double the math instead of doubling both.