Demalion-
Yeah, I've been sort of avoiding responding to your very worthwhile arguments regarding (originally) 4x2/8z vs. 8x1 on NV30, because I want to do a bit more thinking and analysis first. Unfortunately, all the interesting news of the past couple days has put this further and further off.
But in order to avoid continually ignoring your comments, a quickie response. (BTW, for the purposes of the discussion I'll talk mostly about 4x2 vs. 8x1, but the ideas are general.)
1)
Why do I keep ignoring the effect of 4 proxel/shixel pipelines vs. 8? Initially it was because I was thinking that the fixed-function organization (i.e. 4x2) did not necessarily have anything to do with the shader pipeline organization, and so for the purposes of the discussion I wanted to keep them separate. I think I developed this view back when it was unclear whether NV30's PS 1.4+ shader pipeline was somehow decoupled such that it really did have 8 pipelines (in the sense that it could have 8 pixels moving through the pipeline in parallel). We now know that's not true, and further reflection on what's actually going on has made me come to the conclusion that it's not likely to be true on any near-future GPU.
Instead it seems likely that what I've been calling "the fixed-function organization" does have a very specific effect on the shader pipeline: it describes the texturing and texture filtering portion of the shader pipeline. To be specific, this is no different from its role in the fixed-function pipeline; the difference, though, is that all the other math and memory operations in the fixed-function pipeline are done by dedicated fixed-function units arrayed in such a way that it is only the number of texture applications (specifically, in most modern cards, the number of bilinear-filtered texture applications) plus any memory latency constraints which fully describe the throughput of the pixel pipeline. In other words, even though e.g. 4x2 really only describes the texturing portion of the fixed-function pipeline, that combined with the memory hierarchy is the only thing that determines performance.
With shaders it's different of course; the math and/or the performance of dependent texture reads is what's generally going to bottleneck performance in pixel shaders of any complexity. OTOH, it's my impression that most pixel shaders as they are used today (i.e. PS 1.1-1.3) are going to be mostly texturing limited, as they are not really all that different from multitexturing with a few math ops thrown in. So in that sense, 4x2 vs. 8x1 might indeed be a useful differentiation for a PS 1.1 shader whose performance is mainly determined by the 3 textures it reads rather than the 2 or 3 math ops it does with them. (i.e. 8x1 should be an advantage because of the whole odd/even thing. Assuming none of the three textures is trilinear/anisotropic filtered...) OTOH, "4 proxel/shixel pipelines" would not be a useful description in this case, because 4x1 and 4x2 will give very different results.
And in general I'm afraid I don't think "N shixel pipelines" is ever a useful term without discussing what constitutes each pipeline. Again, with fixed-function pipelines, the only relevant constituent has been the number of TMUs. (And the amount of filtering they can do. And, for MSAA, the number of z-units. And maybe something else I'm forgetting.) But with shixel pipelines, you also need to talk about the ALU resources in each pipeline as well. And if you want to be accurate, you need to talk about them in some detail. I'm afraid that, for PS 2.0 type shaders at least, counting up the number of pipelines will have nearly zero bearing on overall performance, as the only thing it really limits is the number of pixels issued or retired on any given clock, and these limits are never going to be a limiting factor. (Well it also limits texturing efficiency to the degree that you're reading e.g. an odd number of textures with a 4x2 vs. an 8x1 organization. But we already have 4x2 and 8x1 as terms.)
It's sort of like...it's sort of like we're trying to measure the area of a rectangle and you're just telling me the width.
[Wait a sec: just to be sure, by "proxel"(/"shixel") we are counting the number of fragments shaded in parallel, not the number of ops applied per clock, right? Because now that I think about it, the analogy with "texel" would probably suggest the latter definition. Anyways, it still doesn't capture all the necessary constraints.]
Um, so while I take back the notion that 4x2 vs. 8x1 in and of itself isn't important for UT2003 performance just because shaders are being used, I do think it doesn't begin to tell us enough to differentiate meaningfully between performance on a nice hefty PS 2.0 shader.
2)
If feature lists didn't influence buying decisions, why would they bother printing them on the box? Clever, my boy, very clever.
But I have a question for you: if feature lists do influence buying decisions, why does the Xabre list include "Xmart", "Vertexilizer" and "Xminator-II"??? 8)
3)
How can I argue "4x2/8z vs. 8x1 doesn't matter" so long as there are situations where they will result in different performance? Just because there are situations where they result in the same performance doesn't cut it: there are situations where an E&S 4-way R300 and an S3 ViRGE will result in the same performance (e.g. Nethack)!
First things first: it does matter that Nvidia lied about it, in the sense that, well, they lied about it. Very bad. But, been there, done that. Not that I've forgotten--just that I don't care to talk about it any more.
New question is, does it matter in terms of it adversely affecting performance (given the rest of the NV30 design)? And my still tentative answer is, not in any significant way. (By "significant" I mean "meaningful" rather than just "pretty big".)
I still want to analyze this (or try) in greater detail, but some points to think about before you dismiss the idea:
1) 4x2 is more flexible than 8x1 in the sense that it gets a reduced triangle-edge penalty. (Because the pixels in each "pixel cohort" have to be from the same triangle, and in general are taken in a specific rectangular pattern as I understand it; pixels outside the triangle mean wasted pipelines.) This gets more important with small triangles and low resolutions (i.e. in future games), so it can even be seen as being forward-looking in some way. Point is, there are performance benefits in common situations as well as performance drawbacks.
2) I don't care about any situation where performance is already above, say, 70fps. Or, at least, 85fps. As per above I'm going to ignore performance of long shaders, because 4x2 or 8x1 only tells us a little about their performance characteristics.
3) Bandwidth limitations mean there would be probably no real-world single-textured situation where 4x2 vs. 8x1 would make a difference...except in the sense that the triangle-edge effect would probably make the former faster. (Actually I may be wrong on this. Can someone come up with a reason to apply a single-texture pass with a highly magnified texture and no z reads/writes?)
Same with color (but no texture) + z.
4) Z/stencil-only is at 8 zixels/clock of course. Is there any situation where you would want to write non-textured color without z? I dunno. If so that would represent a shortfall with this design. I don't know enough to say whether this occurs on any sort of a regular basis.
5) The performance hit, then, if there is one, is going to come exclusively when applying an odd number (higher than 1) of bilinear-filtered texels to a fragment in a single pass. (Assuming I've correctly dismissed all the other situations above.) The theoretical fillrate hit is 1/4 fillrate for 3 textures, 1/6 for 5, etc.
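For anyone who wants to check the 1/4 and 1/6 figures in point 5, here's a quick back-of-the-envelope sketch. It assumes a deliberately idealized model: each pipeline applies its TMUs' worth of bilinear texels per clock, a pixel with n textures ties up its pipeline for ceil(n / TMUs) clocks, and bandwidth, latency and shader math are ignored entirely. The function names are mine, purely for illustration.

```python
from fractions import Fraction


def clocks_per_pixel(pipes, tmus_per_pipe, textures):
    """Average clocks per finished pixel in the idealized model.

    Assumption: a pixel occupies its pipeline for ceil(textures / tmus_per_pipe)
    clocks, and all pipelines work on different pixels in parallel.
    """
    clocks_in_pipe = -(-textures // tmus_per_pipe)  # integer ceiling
    return Fraction(clocks_in_pipe, pipes)


def fillrate_hit_4x2_vs_8x1(textures):
    """Fraction of 8x1's pixel rate that 4x2 loses for this texture count."""
    slow = clocks_per_pixel(4, 2, textures)
    fast = clocks_per_pixel(8, 1, textures)
    return 1 - fast / slow


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, "bilinear textures -> 4x2 hit vs 8x1:", fillrate_hit_4x2_vs_8x1(n))
```

In this model even texture counts come out even, and the single-texture case shows the largest theoretical gap, which is the one point 3 argues is hidden by bandwidth limits in practice.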
First thing to note is that all trilinearly filtered and/or anisotropically filtered textures result in even numbers of bilinear-filtered texture applications. Arjan helpfully pointed out that there are some types of textures that don't get trilinear, but I'm not sure if that goes for aniso as well (I would tend to guess it doesn't, but I really don't know).
But if we combine the fact that I don't want to talk about shader performance because there are too many other variables with the fact that I only want to talk about performance hits that actually mean something, I think you end up almost necessarily in a situation with trilinear and/or AF. Unless you turn on MSAA or something, but in that case you're going to be bandwidth limited, so no point in blaming anything on low fillrate!
Hell: in general with NV30 you're going to be bandwidth limited. Or shader op limited. Or platform limited, of course. The set of circumstances where you're limited by texel fillrate is pretty small with NV30, and generally implies AF. I don't know enough to know whether this necessarily means an even number of bilinear texels per pass (assuming no shaders), or whether it just means a pretty darn high number, which may be just as likely to be odd as even but which represents a small performance hit for the Nx2 pipeline inefficiency in any case.
Not to mention the triangle-edge benefit. No, let's mention it: the triangle-edge benefit!!
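To put a rough number on that, here's a toy sketch of the triangle-edge effect from point 1 of the list above. Everything specific in it is an assumption on my part rather than a known hardware detail: that a 4-pipe design processes screen-aligned 2x2 cohorts, that an 8-pipe design processes screen-aligned 4x2 cohorts, that coverage is decided at pixel centers, and the example triangle itself. Pipelines assigned to pixels outside the triangle count as wasted.

```python
def covered_pixels(v0, v1, v2, width, height):
    """Return the set of (x, y) pixels whose centers lie inside the triangle."""
    def edge(a, b, p):
        # Signed area test: which side of edge a->b the point p is on.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

    pixels = set()
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)
            w0, w1, w2 = edge(v1, v2, p), edge(v2, v0, p), edge(v0, v1, p)
            # Accept either winding order.
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                pixels.add((x, y))
    return pixels


def lane_utilization(pixels, tile_w, tile_h):
    """Fraction of issued pipeline slots that shade a pixel inside the triangle.

    Assumption: every screen-aligned tile_w x tile_h cohort touched by the
    triangle is issued in full, so uncovered pixels in it are wasted slots.
    """
    tiles = {(x // tile_w, y // tile_h) for x, y in pixels}
    issued = len(tiles) * tile_w * tile_h
    return len(pixels) / issued


if __name__ == "__main__":
    tri = covered_pixels((1, 1), (9, 2), (3, 7), 12, 10)  # made-up small triangle
    print("2x2 cohorts (4 pipes):", lane_utilization(tri, 2, 2))
    print("4x2 cohorts (8 pipes):", lane_utilization(tri, 4, 2))
```

With aligned grids, every touched 2x2 cohort sits inside a touched 4x2 cohort, so in this model the smaller cohort can never waste more slots than the larger one; how big the gap is depends on triangle size and shape.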
I rest my case! And (long overdue) also my body. Hopefully this made some small amount of sense...