Rasterization is cheap. Remember that the 15M transistor GeForce could do 4 pixels per clock, and only a fraction of its transistors would be devoted to rasterization. Going from 32 to 96 samples per clock won't triple performance, but even a 10% increase is probably worth the die space.
Xenos's 8 ROPs, which have essentially no compression functions, cost ~20M transistors and are less functional than D3D10 ROPs. 24 ROPs in G80 prolly cost in the region of 100M transistors. 8 Zs per ROP per clock is a lot.
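A quick back-of-envelope on why 8 Zs per ROP per clock is a lot (the 575MHz clock is my assumption, taken from 8800GTX, not something stated above):

```python
# Rough z-only fill rate sketch for G80's ROPs. Assumptions: 24 ROPs,
# 8 Z samples per ROP per clock, 575 MHz core clock (8800GTX).
rops = 24
z_per_rop_per_clock = 8
core_clock_hz = 575e6

z_samples_per_sec = rops * z_per_rop_per_clock * core_clock_hz
print(f"z-only fill: {z_samples_per_sec / 1e9:.1f} G samples/s")  # ~110.4 G/s
```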
Texture fetching is very useful. The 8800GTX has the same fetch rate as R600 and is pretty close in performance, even though it has a heavy deficit in math throughput and bandwidth.
Texture fetch (as opposed to bilinear or AF filter) is rarely a bottleneck.
Texture filtering may seem like overkill,
You lamented single-cycle trilinear as wasteful back just before G80 launched.
but NVidia basically decided that, rather than stopping at free FP16 filtering like ATI, adding double-speed 8-bit doesn't cost a whole lot extra.
Filtering is the single most expensive part of texturing.
Not only that, but the 8-bit "HDR" formats (texture+render target) in D3D10 make fp16 texture filtering (or fp32 filtering) something of an historical blip I suspect, caught between those formats and deferred rendering engines. They'll always have their uses, but devs will use them as a last resort simply because of the space and bandwidth they consume.
Fast filtering could really help with min framerate. Game situations can arise where a different view or simply crouching can drastically increase the AF load, especially when texture resolutions are high (as we hope they will be in upcoming games).
I think alpha-blending and overdraw are what cause the worst framerate minima, things like explosions, clouds of smoke, lots of characters running around the screen, large areas of foliage. I think in comparison, "texture load" is relatively constant in regions of heavy texturing - you don't get a 50% reduction in framerate from crouching.
Here I agree, though I'm not sure what you're saying about Xenos. Was it a mistake, or did it mislead you into what you expected for R600?
Ooh, I don't think it was a mistake, it was a great decision. It also gave the impression that ATI considered the time right to go with 4xAA per loop.
This, to me, is a bit disconcerting. They clearly weren't aiming high enough with what to do with 700M transistors. What did they expect? That NVidia would just sit around?
Precisely. At the same time I think their architecture, specifically the virtualisation and the multiple concurrent contexts, is very costly. Look at the huge size of RV630, 390M transistors, even if a fair chunk of that is texture cache. So the architecture has a high cost of entry.
As these architectural concepts become more important in future D3D versions, NVidia will have to play ball, just like it had to implement small batches, out-of-order threading and decoupled texturing.
As much as they talk about the high end getting too expensive, I'm sure the executives at ATI would love to sell a smaller chip at the same price and/or this chip at a higher price. If NVidia can halve G80 and get 60% (due to a higher clock) of the performance when going to 65nm, then ATI will really feel it.
Yep, though I think 70%+ is more likely, there's a lot of clock headroom...
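Rough numbers on why that would hurt (the ~681M transistor count for G80 is my figure, not something from this exchange):

```python
# Sketch of the perf-per-transistor argument: a halved G80 on 65nm that keeps
# 60-70% of the performance thanks to higher clocks. The ~681M figure for G80
# (excluding NVIO) is my assumption.
g80_transistors = 681e6
for perf_fraction in (0.6, 0.7):
    ratio = perf_fraction / 0.5   # perf per transistor versus the full chip
    print(f"half the transistors ({g80_transistors / 2 / 1e6:.0f}M), "
          f"{perf_fraction:.0%} of the performance -> {ratio:.1f}x perf/transistor")
```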
G84 doesn't seem to be as effective as G80 per transistor per clock, so I'm not so sure about writing off the filtering ability just for that reason.
There's NVIO "on-die" plus the fixed-function overheads: thread scheduling, VP2, etc.
8600GTS at 10.8G texels/s bilinear and 43.2 G zixels/s versus HD2600XT at 6.4G texels/s bilinear and 6.4G zixels/s does look barmy. For all that, 8600GTS comes out about 28% better:
http://www.computerbase.de/artikel/..._xt/24/#abschnitt_performancerating_qualitaet
best-case. So, ahem, G84 v RV630 is a solid win with 4xAA/AF at 1280x1024, but 8800GTS-640 against HD2900XT is a narrower victory at 1600x1200 8xAA/AF:
http://www.computerbase.de/artikel/..._xt/32/#abschnitt_performancerating_qualitaet
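For reference, the headline rates I quoted above fall straight out of units x clock. The unit counts and clocks in this little sketch are my assumptions, but they reproduce those figures:

```python
# Reconstructing the quoted rates from units x clock. Assumed clocks:
# 675 MHz for 8600GTS, 800 MHz for HD2600XT.
def rates(name, clock_hz, bilerps_per_clock, z_per_clock):
    print(f"{name}: {bilerps_per_clock * clock_hz / 1e9:.1f}G texels/s bilinear, "
          f"{z_per_clock * clock_hz / 1e9:.1f}G zixels/s")

rates("8600GTS",  675e6, 16, 8 * 8)   # 16 filter units; 8 ROPs x 8 Z/clock
rates("HD2600XT", 800e6,  8, 4 * 2)   # 8 filter units;  4 ROPs x 2 Z/clock
```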
The good thing about double speed filtering is that you improve texturing performance without increasing the number of registers needed for latency hiding.
That's the only positive I've heard so far. And G80 is not exactly thread-happy, having "under-sized" register files and extremely slow register file spill to memory.
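To put toy numbers on the latency-hiding point (everything here is an illustrative assumption, not a real G80 figure): what costs registers is the fetch issue rate, not the filtering rate.

```python
# Little's-law style sketch: pixels that must be in flight to hide texture
# latency scale with fetches issued per clock, so doubling the filtering rate
# (TF) leaves register demand alone, while doubling the fetch rate (TA) doubles it.
tex_latency_clocks = 200     # assumed texture/memory latency
regs_per_pixel = 10          # assumed live registers per pixel

def regs_needed(fetches_per_clock):
    pixels_in_flight = tex_latency_clocks * fetches_per_clock   # Little's law
    return pixels_in_flight * regs_per_pixel

print(regs_needed(4))   # baseline, e.g. 4 TA per cluster -> 8000 registers
print(regs_needed(4))   # double the TF rate: same issue rate, same registers
print(regs_needed(8))   # double the TA rate instead: twice the registers
```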
Your theory is sort of weird, too. They have to have one TMU quad per cluster with their current design (they actually call it a "texture processing cluster"), so if they didn't have this "cap" then what? If they were planning on fewer TMUs, NVidia would have had fewer clusters, not more.
I'm thinking of an alternate history where G80 was 12 clusters, with a 1:1 TA:TF ratio, so more ALU capability and less TF.
But I suspect this wasn't practical, because thread scheduling hardware is just gobbling up die space, or because a 12:6 crossbar twixt clusters and ROPs/MCs might have been too costly.
1:2 TA:TF just looks like the easiest to construct. But it forced NVIO off-die. I have a theory that a year earlier, G80 was planned to be 1:1, but they revised upwards to get the performance scaling they desired. It was the simplest "fix" to an under-performing part.
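Putting rough unit counts on that alternate history; the per-cluster figures are my assumptions, only the 12 clusters and the 1:1 TA:TF ratio come from what I said above:

```python
# Totals for the real G80 versus the hypothetical 12-cluster, 1:1 TA:TF G80.
def totals(name, clusters, sps, ta, tf):
    print(f"{name}: {clusters * sps} SPs, {clusters * ta} TA, {clusters * tf} TF")

totals("actual G80      ", clusters=8,  sps=16, ta=4, tf=8)   # 128 SPs, 32 TA, 64 TF
totals("hypothetical G80", clusters=12, sps=16, ta=4, tf=4)   # 192 SPs, 48 TA, 48 TF
```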
As for the thread scheduling, G84 could simply double the warp size to match ATI. That would halve their scheduling cost with almost no performance cost. I think they wanted to prove to the world that they could lead technologically in dynamic branching, and that's why they did it. Now they can go back to a more sensible warp size, and I suspect they'll do so with G92.
The G8x architecture was settled too far back for the "prove to the world" thing. Don't forget that a batch in G80 is actually 16 objects in size. A warp is two batches married, because that makes the more complex register operations of pixel shaders kinder on the register file. Pixel shaders will tend towards vec2 or scalar operands, while vertex shaders will tend towards vec4 operands.
There are 2 parameters in warp size: clocks per instruction and SIMD width. The problem with making the SIMD wider, say 16 objects, is that the register file needs to get twice as wide. G80 fetches 16 scalars per clock (and 16 constants per clock and 16 scalars per clock from PDC). All of these would have to be doubled if the SIMD gets wider.
So I suspect it's easier for NVidia to go with a longer instruction duration, 4 clocks in future GPUs. This also has the side-effect of reducing the read after write etc. hazards against the register file, which also makes it easier to fetch operands for those tricky corner cases of MAD+MUL co-issue. At this point the batch size and warp size will coincide: 32 objects, and that will be that for the foreseeable future, I suspect.
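A tiny sketch of the trade-off as I see it; the widths and operand counts are illustrative assumptions, only the "warp = SIMD width x instruction duration" framing and the 16-scalars-per-clock figure come from above:

```python
# The two knobs on batch/warp size: SIMD width and clocks per instruction.
# Register-file fetch width scales with SIMD width only, so lengthening the
# instruction grows the batch without widening the RF.
SCALARS_PER_OBJECT_PER_CLOCK = 2   # assumed operand fetch per object per clock

def config(name, simd_width, clocks_per_instr):
    batch = simd_width * clocks_per_instr
    rf_width = simd_width * SCALARS_PER_OBJECT_PER_CLOCK
    print(f"{name}: batch of {batch}, RF fetch {rf_width} scalars/clock")

config("G80-ish today     ", simd_width=8,  clocks_per_instr=2)  # batch 16, 16 scalars/clock
config("wider SIMD        ", simd_width=16, clocks_per_instr=2)  # batch 32, RF width doubles
config("longer instruction", simd_width=8,  clocks_per_instr=4)  # batch 32, RF width unchanged
```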
Jawed