Tom's Hardware: GeForce 5200 is 2x2?

Assuming "shixel" refers to shader ops, "texel" and "voxel" suggest that the vowel quality should be preserved, hence "shaxel"!

I would also hate to see "zixel" (or "stenxel") spreading into PR. Maybe we can just use "Nvixel" vs. "Atixel/Matroxel" pipelines...
 
I take it then (from reading unclesam's conclusions) that the NV3X architecture contains general purpose processing units which can be dynamically allocated. They are not limited in function to "texture fetching", "shading", etc. It is a cool concept, but can/will it be made practical?
NV31 is

4x1 in blend stages only
2x2 on all p shaders with 2 stage ALUs per pipe probably
What about procedural shading? Wouldn't all 4 ALUs be allocated to completing the color ops?

It would be nice if someone could write a procedural shader which required no texture fetches/addressing. This would be a good use for the Cg compiler (in FP30 mode), so that the pool of resources on the processor is more efficiently allocated and the code runs close to the metal. Anyone want to give it a try?

In theory, then, should we expect the NV31 to be as fast as the NV30 in shader color ops/cycle?
 
Why is that, Chalnoth? Doesn't the NV31 contain 4 general-purpose ALUs which can be dynamically allocated (4 color ops possible), while the NV30 is stuck with a 4-color-op and 2-texture-op configuration (unless it is purely a driver limitation)? I understood this from the following post from unclesam:
NV31 is

4x1 in blend stages only
2x2 on all p shaders with 2 stage ALUs per pipe probably
all commands on same ALU's - so 1.1 and 1.4 and 2.0 speed equal
its NV35 approach

NV34 is true half of NV30 in terms of pixel pipes organisation

2x2 always.
1.4 and 2.0 twice slower like in NV30
 
I don't think so. I think the entire NV3x line is highly-programmable, but different configurations are chosen in different situations for performance reasons (or just that the driver team hasn't yet figured out all the issues with using the higher-performing configurations).
 
Besides, that post was wrong. Single-textured fillrate tests show NV34 is a true 4x1 in fixed-function mode.
 
Dave H said:
Besides, that post was wrong. Single-textured fillrate tests show NV34 is a true 4x1 in fixed-function mode.

that would still lead us to believe that the nv31 should have the same performance per clock as the nv30 in pixel shader operations, shouldn't it?
 
that would still lead us to believe that the nv31 should have the same performance per clock as the nv30 in pixel shader operations, shouldn't it?

No. NV30 can do 2 texture lookups per pixel per clock, for one thing; NV31/34 can only do one. And while the specifics aren't entirely clear, it appears almost certain that NV30 can do more shader ops per clock than NV31 which can in turn do more per clock than NV34.

This could be the result of NV30 having more functional units as part of the shader pipeline, or NV31/34 having some units not fully pipelined, or perhaps something else like too-small buffers on NV31/34, although the results seem too robust for that.

This has always been my problem with the notion that we can count up the number of "proxel pipelines" and get a meaningful figure: there's absolutely no reason why "1 proxel pipeline" on architecture A should have a theoretical instruction throughput at all similar to "1 proxel pipeline" on architecture B. We still don't know the details of the NV30 implementation, but it seems NV31/34 are cut down in some way. While it's pretty certainly untrue that NV30 has "8 proxel pipelines" (as Nvidia still appears to want to claim), NV31 has 6, and NV34 has 4 (as someone reliable--MuFu IIRC--had it, or rather had internal Nvidia descriptions having it), the notion that the execution resources of each shader pipeline in NV30/NV31/NV34 are different from each other, such that the average clock-normalized throughput has a ratio of 2 : 1.5 : 1 respectively, actually fits quite well with observed PS benchmarks.

It's possible that this can all be explained by the differences in texture read throughput and NV34's lack of z-compression, but I doubt it.

To sum up: NV30 definitely has greater per-clock shader resources than NV31 and NV34; NV31 and NV34 are definitely 4x1 in terms of their fixed-function pipeline; and those two statements are in no way conflicting.
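
Purely as illustration, here is what that 2 : 1.5 : 1 clock-normalized hypothesis would predict once clocks are factored back in. A minimal sketch: the clocks are the commonly quoted reference clocks (assumptions, not measurements), and the per-clock ratios are the hypothesis itself:

[code]
# All numbers are assumptions for illustration.
clocks = {"NV30": 500, "NV31": 350, "NV34": 325}      # MHz (assumed)
per_clock = {"NV30": 2.0, "NV31": 1.5, "NV34": 1.0}   # hypothesized ratio

baseline = clocks["NV34"] * per_clock["NV34"]
for chip in ("NV30", "NV31", "NV34"):
    predicted = clocks[chip] * per_clock[chip] / baseline
    print(f"{chip}: ~{predicted:.2f}x NV34 in PS throughput")
[/code]

If observed PS benchmark ratios land near those predictions, that is consistent with a resource-ratio story rather than a pipe-count story.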
 
Hmm...well, Dave H, I'll point you towards the last response I gave, from when you complained about the proxel before, since the way you refer to things reads to me as if you might have missed the explanation. Namely, I specifically discuss, in more than one place, how your concern is addressed.
 
If anyone takes a look at the pixel shader 2.0 3DMark03 bench for the 5600 Ultra in Tom's article, and this one: http://www.extremetech.com/article2/0,3973,922668,00.asp at ExtremeTech, it becomes obvious that the NV31 does, indeed, have 4 completely configurable pixel pipelines, which can texture or shade either way.

The 3DMark03 pixel shader 2.0 benchmark generates an object procedurally, meaning no texture fetches. Looking at NV31's performance, we see that, compared to the 9500 Pro, which has 8 shader units, a 75MHz lower clock speed, and a comparable bus (128-bit), the NV31 is just a bit more than half as fast. Theoretically, if it sports 4 shading ALUs which execute at the same rate as the 9500 Pro's, it should be half as fast. This goes to show that (assuming the R300 core uses all 8 pipelines) the NV3X core is capable of outputting single-cycle shader ops; also that the NV31 can arrange its pipelines for shading in a 2x2 manner which functions as a 4x0 (if only shader ops are issued), a conclusion made in unclesam's theory:
NV31 is

4x1 in blend stages only
2x2 on all p shaders with 2 stage ALUs per pipe probably
all commands on same ALU's - so 1.1 and 1.4 and 2.0 speed equal
its NV35 approach
i think nv31 have 4 universal alu's at all. we can use it as 2 stage 2x2 or one stage on 4x1. for hard texturing seems they can get some optimisations from 2 stage 2 tmu scheme so they use it.

rule simply - deeper pipe - probably lower fetch latency from one command standpoint.
As for precision, according to pocketmoon:
Halves (or PP) and Floats operate at the same speed IF register usage is the same. The gain offered by halves is that they use less register resource. And less register usage means more 'potential' speed.

Thus, it seems the NV31 (NV3X) can execute fp32 ops as fast as the R3XX cores, but at the price of texturing ability (although they have a smaller pool of units than the R3XX cores). In other words, NV31 can potentially (not sure of NV30, but NV35 is speculated to) allocate any of its processing resources for shading, but it would yield no benefit when texturing/texture reads are in question. Since this is a procedural pixel shader, it is a great measure of whether or not the NV31 can delegate its 4 pipelines to shading. We know the NV31 can be arranged as a 2x2 for multitexturing purposes.
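
As a quick sanity check of the "just over half as fast" observation above, here's the back-of-envelope arithmetic. A minimal sketch, assuming one shader op per ALU per clock on both parts and the commonly quoted clocks (350MHz for the 5600 Ultra, 275MHz for the 9500 Pro):

[code]
# Assumed for illustration: 4 shader ALUs on NV31 vs. 8 on the 9500 Pro,
# one shader op per ALU per clock on both.
nv31_alus, nv31_mhz = 4, 350   # GeForce FX 5600 Ultra (assumed clock)
r300_alus, r300_mhz = 8, 275   # Radeon 9500 Pro (assumed clock)

ratio = (nv31_alus * nv31_mhz) / (r300_alus * r300_mhz)
print(f"NV31 / 9500 Pro = {ratio:.2f}")  # ~0.64: a bit more than half,
                                         # matching the benchmark result
[/code]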

Would the NV31 be able to arrange its pipelines in a 2x1 setup, calculating an fp shader op, texture op, and combiner op per pipeline, rather than only 2 texture ops with a combiner op (per pipe)?
 
This thread offers a lot of useful information. Before reading it, I just could not figure out the architecture of the NV3X processors; now I feel as if I'm working on the refresh. :)
 
Lum, aren't those the drivers that are presumed to be using 16-bit fp at all times?

Also, pocketmoon's statement has a caveat about register usage for the nv30...under which circumstances does that occur for the nv31 and nv34? It seems you are implying your comments address fp32 performance, and I'm not clear on why you propose they do.
 
Well, demalion, if you take a look at pocketmoon's benchmarks here, you'll find that the discrepancies between partial and full precision on the NV30 are small, if present at all, under FP30 mode. We know that the R3XX architecture tends to excel under DX9 code, while NV3X's performance is fully maximized in FP30 mode. So it seems the NV31 is at a disadvantage with this mark, particularly because it is DX9, R3XX's best field, not NV30's (although this was Nvidia's own doing). There are some cases (benchmark 4 in pocketmoon's Cg suite) where DX9 performance is nowhere near NV3X's full capabilities. This is a reason why the performance we observe may be more indicative of NV31 in fp32 mode, having the shader written and compiled using Cg under FP30 mode. This is not the main point, however.

My reasoning is this:
The procedural shader in question (in the 3DMark03 benchmark) most probably does not exceed the R3XX's 96-instruction limit (or else R3XX would have to multipass). If true, this would translate to the absence of r/w latencies that would differentiate performance between NV31's fp32 mode and fp16 mode. The supposed performance delta between the two modes is derived from the number of available registers for each. More registers means better flow, which translates into a bit more performance, but I don't think this benchmark presents such a case.
 
demalion-

(N.B. I've decided to use the word "pipes" instead of "pipelines", in order to avoid confusion with the unrelated term "pipelined".)

Sorry I haven't responded specifically to your defense of treating proxel pipe count as a useful indicator of clock-normalized PS performance. But I don't feel it really responds to any of my points.

First off, I can't really foresee a situation where the number of proxel pipes would be any different from the number of pixel pipes. You sometimes talk about "effective" proxel pipe count, but I don't remember any specific explanation of what you mean. (Sorry if I've just missed it.) So to this extent, "proxel pipes" is a redundant measure.

AFAICT, the only information that "pipe counting" gets us are the maximum pixel dispatch and retire rates. In general, these rates are much "too high"; that is, they will almost never be the limiting factor in PS performance. This is a big difference from fixed-function rasterizing; counting pixel pipes is somewhat useful there, because fragments that actually only take one clock are not terribly uncommon, and thus dispatch/retire becomes the limiting factor.

Now, when it comes to fixed-function performance, pipe counting gives us more information because all the TMUs in a particular pipe are constrained to be texturing the same pixel at any given time. [as I understand it]The reason for this is nothing to do with the pipe count per se--remember, pipes are pipelined. Instead it's because allowing every TMU to texture an independent pixel means having as many texture address units as TMUs, and texture address units are relatively expensive. (Indeed, once you've spent the transistors to implement the extra texture address units you might as well just add the extra pixel pipes as well, which is why texture address units and pixel pipe count are considered synonymous with respect to fixed-function functionality.)[/as I understand it] AFAICT there's no reason for the same restriction to apply to shader op execution units.

If the previous paragraph makes no sense whatsoever, basically what I'm saying is this: with a 4x2 fixed-function design, triple-textured throughput is the same as quad-textured throughput and hence worse than on an 8x1. [as I understand it]But with a 4 proxel pipe design where each pipe has 2 execution units, throughput on a three instruction shader should be better than on a four instruction shader and just as good as on an 8 pipe design with 1 functional unit per pipe.[/as I understand it] The only difference, as I said, would be dispatch/retire performance, which only makes a difference with one instruction shaders, i.e. never.
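
To make the contrast concrete, here's a toy throughput model of the two cases, a minimal sketch under the assumptions above (fixed-function TMUs in a pipe loop back in whole passes; proxel execution units are fully pipelined; dispatch/retire and latency effects ignored):

[code]
import math

def fixed_function_rate(pipes, tmus_per_pipe, textures):
    # All TMUs in a pipe texture the same pixel, so extra textures
    # cost whole extra loop-back passes.
    passes = math.ceil(textures / tmus_per_pipe)
    return pipes / passes                     # pixels per clock

def pipelined_shader_rate(pipes, units_per_pipe, instructions):
    # Fully pipelined units: average throughput is total units divided
    # by shader length, with no rounding up to whole passes.
    return pipes * units_per_pipe / instructions   # pixels per clock

print(fixed_function_rate(4, 2, 3))    # 4x2, triple-textured: 2.0
print(fixed_function_rate(4, 2, 4))    # 4x2, quad-textured:   2.0 (no better)
print(fixed_function_rate(8, 1, 3))    # 8x1, triple-textured: ~2.67

print(pipelined_shader_rate(4, 2, 3))  # 3-instruction shader: ~2.67
print(pipelined_shader_rate(4, 2, 4))  # 4-instruction shader:  2.0
print(pipelined_shader_rate(8, 1, 3))  # same as 8 pipes x 1 unit: ~2.67
[/code]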

(Note: if such a design is what you would call "8 effective proxel pipes", then we have much less of a disagreement, although I think that's a poor way to describe the design--there really are only 4 pipes, it's just that they are fully pipelined. This is in contrast to e.g. a 4x2, which is fully pipelined with respect to texture mapping but not with respect to texture addressing.)

The point, then, is that all the important aspects of shader performance are essentially independent of proxel pipe count. (The one exception is texture lookups, but we already know the restrictions on those from our NxM description of the fixed-function functionality.) How many execution units does each proxel pipe have--vec4/vec3/scalar? What are the latencies and throughputs on each type of instruction? What are the restrictions on calculating various combinations of things in parallel (e.g. on NV3x there is evidence that a single pipe can't calculate a texture address and do an FP shader operation in parallel)? Are there performance advantages (and what are they?) to using less precise data formats? How many registers are available? What're the instruction length limits and what are the penalties under the various schemes to avoid them (i.e. ATI's F-buffer and Nvidia's apparent streaming of shader code)? How are constants dealt with? How are branches dealt with and what is their performance impact? What is the buffer/cache behavior like? How is dependent texturing handled and what is the performance impact?

These (and probably some others I've missed), in combination with pipe count, are what determines shader performance. Pipe count alone tells you nothing. To the extent that these answers are similar across different architectures, then pipe count might be a proxy for performance; but R3x0 vs. NV30 vs. NV31/34 already shows the inapplicability of this approach, and these designs are IMO much more alike than competing shader architectures are likely to be a couple generations down the road.

Hence we should dump proxel pipe count.

Finally, to the argument that proxel pipe count may not be perfect, but surely it's more convenient than answering that big long list of questions above, particularly as far as the layman is concerned: sorry, I don't buy it. As long as it's not a useful proxy for overall shader performance, it's not "convenient" at all to the layman, only confusing. By analogy, the number of parallel execution units in a CPU is just as simple a number (easily understood by the layman!), but because it tells us nothing useful about realized performance, it's never used in that fashion.

Shader performance is getting to be as complicated to understand as CPU performance, and it should be summarized for the layman in the same way: through the results of a few well-chosen, well-understood, diverse and representative shader benchmarks. And the details of the actual implementations should be released (hi, Nvidia) so that nerds like us can learn and debate the fine points (just as nerds who are interested in CPUs do). In fact, in many ways shaders will probably be more complicated to understand than CPUs, because while the hardware and its interactions with workloads is actually much simpler than for CPUs, the variation in designs will probably be considerably greater and thus any decent armchair analysis will probably require a more complete understanding of all the issues involved.

[edited to specify which parts I pulled out of my ass]
 
Nice explanation, Dave. One question, though: why can't texturing/texture accessing and pixel shading happen concurrently in the NV3X architectures? Is it because they are using the same ALU? Are all the ALUs in the CineFX architecture generic and general-purpose enough to be pooled according to necessity?
 
Nice explanation, Dave.

Thanks! 8) It was mostly speculation though. :oops: I think I'm gonna edit it to reflect that...

One question, though: why can't texturing/texture accessing and pixel shading happen concurrently in the NV3X architectures? Is it because they are using the same ALU?

First off, I should say that I've just seen this "no texturing while FP calculating" thing mentioned a few times around here; I don't have any confirmation that it's true. But, having said that, calculating a texture address involves FP math and so it's quite plausible that they are indeed using the same ALUs for texture addressing that they use for FP shader ops. This is the only reason I can think of why the restriction would be in place, particularly because it is said not to exist for texture addressing + int shader ops.
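
For a sense of why texture addressing is floating-point work, here's a minimal, hypothetical sketch of a projective texture lookup; it is not NV3x's actual addressing path, but it shows the FP divide and multiplies that happen before any texel is fetched:

[code]
def texel_address(s, t, q, tex_w, tex_h):
    u, v = s / q, t / q                # perspective divide (FP)
    x = u * (tex_w - 1)                # scale to texel space (FP multiply)
    y = v * (tex_h - 1)
    return int(x + 0.5), int(y + 0.5)  # only now does it become integer work

print(texel_address(0.25, 0.75, 1.0, 256, 256))   # -> (64, 191)
[/code]

If those FP divide/multiply resources are the same ALUs used for FP shader ops, the "no texturing while FP calculating" restriction falls out naturally.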

Are all the ALUs in the CineFX architecture generic and general-purpose enough to be pooled according to necessity?

I doubt it. But a lot depends on definitions. First off, by "the CineFX architecture" do you mean NV3x in general or just the PS 1.4+ functionality on NV3x? Second, it depends what you mean by an ALU. The traditional definition of ALU more or less implies that it is general purpose, i.e. performs many different bits of arithmetic, which are selected between by a control signal. If you're asking whether there are hard-coded circuits doing a lot of arithmetic on NV3x, then I'd have to say yes; such circuits exist on all microprocessors, although I would imagine that on a GPU there are many more of them doing what we might call "real math" (as opposed to, say, sign extension).

As for "pooled according to necessity", I don't think e.g. functional units from one shader pipe can "donate their services" to another pipe. And I don't think this matters much, because as I understand it all the pipes are going to be executing the same instruction(s) simultaneously--certainly all pixels being shaded at any given time are running the same PS program, and as I understand it, branches and the like are implemented by predication (i.e. all instructions are run and not-supposed-to-have-been-run instructions just have their results thrown out) rather than true branching. (Although I wonder if this applies to instructions that have a real cost outside the pixel shader engine, i.e. texture fetches. And I wonder if the entire pipeline stalls if say 3 of 4 simultaneous texture fetches hit in cache and the last needs to fetch from DRAM. Hmm...)
 
Dave H said:
demalion-

(N.B. I've decided to use the word "pipes" instead of "pipelines", in order to avoid confusion with the unrelated term "pipelined".)

Sorry I haven't responded specifically to your defense of treating proxel pipe count as a useful indicator of clock-normalized PS performance. But I don't feel it really responds to any of my points.

First off, I can't really foresee a situation where the number of proxel pipes would be any different from the number of pixel pipes.

You can't? Well, this seems the crux of the matter then. Hmm....well, the nv30 seems to be heading in that direction to me. In fact, it is what marketing would have you believe the nv30 shows you now. You can't see the performance difference this would make if a chip could calculate shaders for 8 elements at a time, even if the input and output were limited to 4 per clock, or you can't see an architecture being able to be designed in such a way?

For a different example, what if the nv35's 8x1 (which would be pixel pipelines) has a processing unit capable of handling the processing for only 4 elements for each group of 4 pixel pipes, but with the difference (based on one of the theories about the nv30) that it allowed texture fetching to occur independently of FP calculation? Wouldn't describing it as only either 8x1 or 4x? pipes be a disservice?

You sometimes talk about "effective" proxel pipe count, but I don't remember any specific explanation of what you mean. (Sorry if I've just missed it.) So to this extent, "proxel pipes" is a redundant measure.

Ack! The link I provided leads to a rather thorough set of "specific explanations", and if you have a specific question please clarify? :-?

AFAICT, the only information that "pipe counting" gets us are the maximum pixel dispatch and retire rates. In general, these rates are much "too high"; that is, they will almost never be the limiting factor in PS performance.

No, that's what pixel pipe counting gets us. Proxel pipe counting gets us the amount of independent element calculations that can be processed in one clock. One of the possible advantages of the nv35 is that it would be a 8x1 pixel pipe with 4 proxel pipes organized such that stall situations are reduced compared to the nv30 (remember the relatively small transistor count change that has been quoted). In fact, depending on how the nv35 really changes, we might end up with a situation where proxel pipe count changes depending on the data type being processed.

This is a big difference from fixed-function rasterizing; counting pixel pipes is somewhat useful there, because fragments that actually only take one clock are not terribly uncommon, and thus dispatch/retire becomes the limiting factor.

Hmm? Are you saying instructions that take one clock aren't common? I'm confused a bit...did you really look at all the information in that link? I tried to collect everything so I wouldn't have to re-explain the reasoning behind such things as the maximum, minimum, and standardized proxel fillrates I also proposed.

Now, when it comes to fixed-function performance, pipe counting gives us more information because all the TMUs in a particular pipe are constrained to be texturing the same pixel at any given time. [as I understand it]

But this type of constraint can be a characteristic of calculations as well...

We are rehashing the comments I tried to cover before, it seems to me.

The reason for this is nothing to do with the pipe count per se--remember, pipes are pipelined. Instead it's because allowing every TMU to texture an independent pixel means having as many texture address units as TMUs, and texture address units are relatively expensive.

To me that reads as saying "it doesn't have anything to do with pipe count, except as it has to do with the count of a defining characteristic of a pipe"...?

(Indeed, once you've spent the transistors to implement the extra texture address units you might as well just add the extra pixel pipes as well, which is why texture address units and pixel pipe count are considered synonymous with respect to fixed-function functionality.)[/as I understand it] AFAICT there's no reason for the same restriction to apply to shader op execution units.

Hmm...I think you've repeated the heart of the issue here. Programmable functionality is significantly more demanding of resources than fixed function blending, so it does not fit this idea of "might as well just add the extra pipes".

If the previous paragraph makes no sense whatsoever, basically what I'm saying is this: with a 4x2 fixed-function design, triple-textured throughput is the same as quad-textured throughput and hence worse than on an 8x1. [as I understand it]But with a 4 proxel pipe design where each pipe has 2 execution units, throughput on a three instruction shader should be better than on a four instruction shader and just as good as on an 8 pipe design with 1 functional unit per pipe.[/as I understand it] The only difference, as I said, would be dispatch/retire performance, which only makes a difference with one instruction shaders, i.e. never.

Hmm, no, you are ignoring the idea of independent element calculation, and ignoring, for example, that the 9700 can perform up to 3 different types of operations in one clock for one element. This is distinctly different than being able to perform 1 op for 24 elements in one clock, which by your argument would be equivalent since proxel pipe count doesn't matter. The count of proxel pipes is a determining characteristic of the applicability of the throughput for different circumstances (just as for the case of 4x2 compared to 8x1), and the symmetry present on the 9700 between pixel and proxel pipe characteristics need not be replicated everywhere.

To me, it reads as if you've answered your own initial question about not being able to see why proxel and pixel pipe count would deviate...so, for now I'm going to stop here. If you want me to address the rest of your text before continuing to discuss, just say so, and then we can discuss selections from both responses at once for the sake of avoiding repetition, but I really feel that this reply and the collation I linked to answer what I see as the questions you've raised.
 
Luminescent said:
Well, demalion, if you take a look at pocketmoon's benchmarks here, you'll find that the discrepancies between partial and full precision on the NV30 are small, if present at all, under FP30 mode.

Hmm...I don't see those figures as supporting that interpretation clearly, though it may just be insufficient understanding on my part.

For the fake noise, it is actually faster with no precision hint, which to me indicates a problem with scheduling, but it seems the strongest support for your statement. The problem is that the behavior is too odd to provide a clear indication (to me).

The difference is non-existent for the multiple dependent texture read benchmark, which makes sense regardless of a difference in partial and full precision calculation performance differences AFAICS.

The difference for the Sobel filter is slight, which seems likely to be associated with texture read stalls associated with each sample. This seems to point out that the texture sampling is the primary workload, not that partial and full calculation performance is equivalent.

The bilinear filter strikes me as warranting the same basic description.

The difference for the median filter is rather large, and looking at the shader code it seems to have significantly more math operations than texture sampling instructions. Also, based on the Cg output instruction count, it seems to not be using a concept like modifiers at all for expressing the GT/LT/EQ comparisons (EDIT: for the nv30 output), so the Cg output performance to me seems indicative of actual calculation performance difference between the two formats. This is one of the reasons I'm curious about directly comparing the output results for the compilers, as I think some of the assumptions might not be accurate based on the instruction count alone.

We know that the R3XX architecture tends to excel under DX9 code, while NV3X's performance is fully maximized in FP30 mode. So it seems the NV31 is at a disadvantage with this mark, particularly because it is DX9, R3XX's best field, not NV30's (although this was Nvidia's own doing). There are some cases (benchmark 4 in pocketmoon's Cg suite) where DX9 performance is nowhere near NV3X's full capabilities.

Well, if you count "nv30 instruction execution" versus "DX 9 instruction execution" speeds, instead of actual performance. But when you look at the full precision performance compared to half for the nv30 results, each with the same listed instruction count, what do you think that means?

Did you mistake the table for fps results, and overlook the bargraphs? The axes could have been more clearly labelled.

This is a reason why the performance we observe may be more indicative of NV31 in fp32 mode, having the shader written and compiled using Cg under FP30 mode. This is not the main point, however.

My reasoning is this:
The procedural shader in question (in the 3DMark03 benchmark) most probably does not exceed the R3XX's 96-instruction limit (or else R3XX would have to multipass). If true, this would translate to the absence of r/w latencies that would differentiate performance between NV31's fp32 mode and fp16 mode. The supposed performance delta between the two modes is derived from the number of available registers for each. More registers means better flow, which translates into a bit more performance, but I don't think this benchmark presents such a case.

Hmm...I don't understand how your reasoning relates to pocketmoon's Cg results at all, at this time.
 
Hmm...I don't understand how your reasoning relates to pocketmoon's Cg results at all, at this time.

Demalion, I'm sorry if I didn't make it clear enough, but my "reasoning" was given as the second and main point in my response to address this doubt:
It seems you are implying your comments address fp32 performance, and I'm not clear on why you propose they do.
My main reasoning behind the fp16 and fp32 performance delta has nothing to do with pocketmoon (that was extra info which also supports my point somewhat). What I did was put in my $.02 of reasoning explaining why the pixel shader 2.0 3DMark bench would not accurately exploit the difference between the two types of fp formats on NV3X. In a nutshell, this is why:
Because it is not a long shader (I'm assuming it does not exceed R3XX's instruction limit). A long shader would require a lot of data streaming and register access, which would more readily expose the pros and cons faced by NV3X when exploiting certain precision formats over others.

Note: Yikes, I feel NV3X is some sort of collective similar to something found in Anthem. Oh my, NV3X might not like all this speculation. :D

As for Dave H's question, yes, I was referring to the NV3X architecture. By "general purpose" I mean that the units' tasks can be intermingled; one unit could fetch a texture one cycle and run pixel shader ops the next. From what you answered, it seems this is present in the NV30's architecture.
 