Dave H said:
demalion-
(N.B. I've decided to use the word "pipes" instead of "pipelines", in order to avoid confusion with the unrelated term "pipelined".)
Sorry I haven't responded specifically to your defense of treating proxel pipe count as a useful indicator of clock-normalized PS performance. But I don't feel it really responds to any of my points.
First off, I can't really foresee a situation where the number of proxel pipes would be any different from the number of pixel pipes.
You can't? Well, this seems the crux of the matter then. Hmm....well, the nv30 seems to be heading in that direction to me. In fact, it is what marketing would have you believe the nv30 shows you now. You can't see the performance difference this would make if a chip could calculate shaders for 8 elements at a time, even if the input and output were limited to 4 per clock, or you can't see an architecture being able to be designed in such a way?
For a different example, what if the nv35's 8x1 (which would be pixel pipelines) has a processing unit capable of only handling the processing for 4 elements for each of the pair of 4 pixel pipes, but with the difference (based on one of the theories about the nv30) that it allowed texture fetching to occur independently of FP calculation? Wouldn't describing it as only either 8x1 or 4x? pipes be a disservice?
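To put rough numbers on the first hypothetical above (shading 8 elements at a time while input and output stay limited to 4 per clock), here is a minimal sketch. It assumes a deliberately simplified model in which sustained throughput is the smaller of the I/O rate and the compute rate; the function name and parameters are made up for illustration and do not describe any real chip.

```python
def pixels_per_clock(io_per_clock, proxel_pipes, clocks_per_element):
    """Sustained pixels per clock in a simplified, hypothetical model.

    io_per_clock: pixels that can be dispatched/retired each clock.
    proxel_pipes: elements that can be shaded independently each clock.
    clocks_per_element: clocks of shader work one element needs.
    """
    compute_rate = proxel_pipes / clocks_per_element
    # Whichever stage is slower limits the sustained rate.
    return min(io_per_clock, compute_rate)


for clocks in (1, 2, 4, 8):
    four = pixels_per_clock(4, 4, clocks)
    eight = pixels_per_clock(4, 8, clocks)
    print(f"{clocks} clocks/element: 4 proxel pipes -> {four:.2f} px/clk, "
          f"8 proxel pipes -> {eight:.2f} px/clk")
```

Under these assumptions the two designs only tie for single-clock shaders, where the 4-per-clock input/output limit dominates; for anything longer, the design with 8 proxel pipes sustains twice the rate.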
You sometimes talk about "effective" proxel pipe count, but I don't remember any specific explanation of what you mean. (Sorry if I've just missed it.) So to this extent, "proxel pipes" is a redundant measure.
Ack! The link I provided leads to a rather thorough set of "specific explanations", and if you have a specific question please clarify?
AFAICT, the only information that "pipe counting" gets us are the maximum pixel dispatch and retire rates. In general, these rates are much "too high"; that is, they will almost never be the limiting factor in PS performance.
No, that's what pixel pipe counting gets us. Proxel pipe counting gets us the amount of independent element calculations that can be processed in one clock. One of the possible advantages of the nv35 is that it would be an 8x1 pixel pipe with 4 proxel pipes organized such that stall situations are reduced compared to the nv30 (remember the relatively small transistor count change that has been quoted). In fact, depending on how the nv35 really changes, we might end up with a situation where proxel pipe count changes depending on the data type being processed.
This is a big difference from fixed-function rasterizing; counting pixel pipes is somewhat useful there, because fragments that actually only take one clock are not terribly uncommon, and thus dispatch/retire becomes the limiting factor.
Hmm? Are you saying instructions that take one clock aren't common? I'm confused a bit...did you really look at all the information in that link? I tried to collect everything so I wouldn't have to re-explain the reasoning behind such things as the maximum, minimum, and standardized proxel fillrates I also proposed.
Now, when it comes to fixed-function performance, pipe counting gives us more information because all the TMUs in a particular pipe are constrained to be texturing the same pixel at any given time. [as I understand it]
But this type of constraint can be a characteristic of calculations as well...
We are rehashing the comments I tried to cover before, it seems to me.
The reason for this is nothing to do with the pipe count per se--remember, pipes are pipelined. Instead it's because allowing every TMU to texture an independent pixel means having as many texture address units as TMUs, and texture address units are relatively expensive.
To me that reads as saying "it doesn't have anything to do with pipe count, except as it has to do with the count of a defining characteristic of a pipe"...?
(Indeed, once you've spent the transistors to implement the extra texture address units you might as well just add the extra pixel pipes as well, which is why texture address units and pixel pipe count are considered synonymous with respect to fixed-function functionality.)[/as I understand it] AFAICT there's no reason for the same restriction to apply to shader op execution units.
Hmm...I think you've repeated the heart of the issue here. Programmable functionality is significantly more demanding of resources than fixed function blending, so it does not fit this idea of "might as well just add the extra pipes".
If the previous paragraph makes no sense whatsoever, basically what I'm saying is this: with a 4x2 fixed-function design, triple-textured throughput is the same as quad-textured throughput and hence worse than on an 8x1. [as I understand it]But with a 4 proxel pipe design where each pipe has 2 execution units, throughput on a three instruction shader should be better than on a four instruction shader and just as good as on an 8 pipe design with 1 functional unit per pipe.[/as I understand it] The only difference, as I said, would be dispatch/retire performance, which only makes a difference with one instruction shaders, i.e. never.
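The arithmetic behind the quoted paragraph can be sketched as follows. This models only the argument as stated there, not any real hardware: the fixed-function case assumes all TMUs in a pipe must work on the same pixel, and the shader case assumes (as the quote does) that execution units can be kept busy regardless of which pipe they sit in, with dispatch/retire capped at one pixel per pipe per clock. All function names are hypothetical.

```python
import math


def fixed_function_rate(pipes, tmus_per_pipe, textures):
    """Pixels per clock when every TMU in a pipe textures the same pixel."""
    # A pixel occupies its pipe for a whole number of clocks.
    clocks = math.ceil(textures / tmus_per_pipe)
    return pipes / clocks


def pooled_shader_rate(pipes, units_per_pipe, instructions):
    """Pixels per clock under the quoted assumption of freely usable units."""
    compute_rate = pipes * units_per_pipe / instructions
    # Assumed dispatch/retire limit: at most one pixel per pipe per clock.
    return min(pipes, compute_rate)


print("fixed 4x2, 3 textures:", fixed_function_rate(4, 2, 3))
print("fixed 4x2, 4 textures:", fixed_function_rate(4, 2, 4))
print("fixed 8x1, 3 textures:", fixed_function_rate(8, 1, 3))
print("shader 4 pipes x 2 units, 3 instr:", pooled_shader_rate(4, 2, 3))
print("shader 8 pipes x 1 unit, 3 instr:", pooled_shader_rate(8, 1, 3))
print("shader 4 pipes x 2 units, 1 instr:", pooled_shader_rate(4, 2, 1))
print("shader 8 pipes x 1 unit, 1 instr:", pooled_shader_rate(8, 1, 1))
```

In this model the 4x2 fixed-function design gets 2 pixels per clock for both three and four textures, while the two shader designs match each other except for one-instruction shaders, which is the claim in the quote. Whether real execution units can actually be shared this freely is exactly what is being disputed.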
Hmm, no, you are ignoring the idea of independent element calculation, and ignoring, for example, that the 9700 can perform up to 3 different types of operations in one clock for one element. This is distinctly different than being able to perform 1 op for 24 elements in one clock, which by your argument would be equivalent since proxel pipe count doesn't matter. The count of proxel pipes is a determining characteristic of the applicability of the throughput for different circumstances (just as for the case of 4x2 compared to 8x1), and the symmetry present on the 9700 between pixel and proxel pipe characteristics need not be replicated everywhere.
To me, it reads as if you've answered your own initial question about not being able to see why proxel and pixel pipe count would deviate... so, for now I'm going to stop here. If you want me to address the rest of your text before continuing to discuss, just say so, and then we can discuss selections from both responses at once for the sake of avoiding repetition, but I really feel that this reply and the collation I linked to answer what I see as the questions you've raised.