NV35 pipeline organization

Eck... definitely a case of driver optimization. If we go into "conspiracy theory mode", we can speculate that NVidia purposely broke ARB_fragment_program support so that hardware sites would have no choice at all but to use the NV30 path for the benchmark...

EDIT: misread a post...
 
But are those 12 fp units just the shader units, the number of shader units and texture units combined, or are they all capable of functioning as either?
 
Luminescent said:
But are those 12 fp units just the shader units, the number of shader units and texture units combined, or are they all capable of functioning as either?

No. It's just that the TEX and FP units are intrinsically linked in a way that allows 3 FP ops/pipe/clock OR 2 FP ops + 2 TEX lookups. I suppose this is most likely due to shared physical logic. I doubt they are discrete, "multi-purpose" units as that would rule out the inter-dependency.

MuFu.
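
A minimal sketch of that co-issue constraint, in Python; the rule encoded here is just a reading of the post above, not a confirmed NV35 scheduling behaviour:

```python
# Toy model of the per-pipe co-issue rule described above: each pipe can do
# either 3 FP ops per clock, or 2 FP ops plus 2 texture lookups.  This is a
# reading of the post, not a documented NV35 scheduling rule.

def legal_issue(fp_ops: int, tex_lookups: int) -> bool:
    """True if this per-pipe, per-clock mix fits the hypothesised constraint."""
    if tex_lookups == 0:
        return fp_ops <= 3                      # pure-arithmetic clock
    return fp_ops <= 2 and tex_lookups <= 2     # shared FP/TEX logic is busy texturing

print(legal_issue(3, 0))   # True  - 3 FP ops, no texturing
print(legal_issue(2, 2))   # True  - 2 FP ops + 2 TEX lookups
print(legal_issue(3, 1))   # False - texturing costs the third FP slot
```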
 
Any increase in actual floating-point performance is a very good thing, as long as testing bears out that it is real... none of the other performance issues were nearly as significant as this one and its impact on DX 9 moving forward. Thinking back to the GDC slides and what they propose for the HLSL ps_2_a target, this evolution seems natural and in line with nVidia's original plan for NV3x (hmm... and also in line with some speculation I had intended for the forums, but restricted to some PMs due to a disappearing thread).

I don't see nVidia blatantly lying about this, and it makes sense within the assumptions about the NV30 transistor count that I abandoned a while ago as unrealistic, and the good news is that Wavey has an NV35 to put through its paces.

The bad news is that he won't have as much to tease us with regarding surprises in the results until he finishes.
Oh, wait, that's only bad news for him :p.
 
When the NV35 carries out RGBA computations involving complex ops (ddx, ddy, rsq, lrp), do you guys believe it has to perform the ops on all four components, due to the apparent lack of independent scalar support (in contrast to R3xx), or is there a separate special-function unit for this type of computation?
 
I didn't expect it either. Means there's probably quite a bit of scope for compiler optimisation now as well.

Luminescent mentioned "general-purpose" units and Uttar touched on the idea as well. Can we think about it like this, per pipe...

~>NV30 has one FP unit and one general-purpose unit that can either look up two textures or execute one FP pixel op.

~>NV35 has two FP units and one general-purpose unit.

:?:

Logically, it is quite neat.

MuFu.
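
A quick sketch of that per-pipe split, purely as a way of writing the speculation down; the `Pipe` class and its behaviour are illustrative, not a confirmed block diagram:

```python
# Sketch of the per-pipe split proposed above: some dedicated FP units plus one
# general-purpose unit that spends each clock on EITHER one extra FP op OR two
# texture lookups.  Speculative, not a documented block diagram.

from dataclasses import dataclass

@dataclass
class Pipe:
    dedicated_fp: int    # FP units that are always available

    def peak(self, texturing: bool) -> tuple:
        """Return (FP ops, TEX lookups) this pipe can issue per clock."""
        if texturing:
            return (self.dedicated_fp, 2)       # shared unit is doing lookups
        return (self.dedicated_fp + 1, 0)       # shared unit does one more FP op

nv30 = Pipe(dedicated_fp=1)
nv35 = Pipe(dedicated_fp=2)
print("NV30:", nv30.peak(False), nv30.peak(True))   # (2, 0) or (1, 2)
print("NV35:", nv35.peak(False), nv35.peak(True))   # (3, 0) or (2, 2)
```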
 
Funny thing is that the fp "shader" units are general-purpose enough already, as they are composed of 4 general fmads (perhaps with support for other instructions). Wouldn't 4 fmads with ddx/ddy and other special capabilities (thepkrl confirmed NV30 could execute 4 ddx/ddy's per fragment pipeline) be up to the challenge of texture fetching and filtering? It seems to me that an fp shader unit can be used as a texture unit, but the reverse isn't always true (in previous architectures, at least). Texture addressing/filtering units often use fixed-function/optimized logic, which is configurable but not entirely programmable.

It might be that the fp shader units can be used for texture fetches as well as general shading ops (tex ops don't seem too complicated for general-purpose shading logic).
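
To illustrate why fmads are plausible candidates for the filtering half of that job, here is the bilinear weighting step written purely in terms of subtracts and FMADs; addressing and texel decode are left out, which is where fixed-function logic still matters:

```python
# Once four texels are in registers, bilinear filtering is just lerps, and a
# lerp is a subtract plus one FMAD -- exactly the kind of work a general fmad
# unit could absorb.

def mad(a, b, c):
    return a * b + c                 # the basic FMAD primitive

def lerp(a, b, t):
    return mad(t, b - a, a)          # a + t*(b - a)

def bilinear(t00, t10, t01, t11, fx, fy):
    """Blend four texels using the fractional texel coordinates (fx, fy)."""
    top    = lerp(t00, t10, fx)
    bottom = lerp(t01, t11, fx)
    return lerp(top, bottom, fy)

print(bilinear(0.0, 1.0, 0.0, 1.0, 0.25, 0.5))   # 0.25
```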
 
Well, yes, but I think nVidia is using quite a few more tricks in the texturing system than your average Joe GPU company's architecture.
I think they have a very effective latency-hiding system for it. Obviously, two would cost more than one :)
So something that might still be discussed is whether ONE of the three FP units can do TEX ops, or whether all of them can, but only once per pass. That might be revealed quite easily with shader tests.

Uttar
 
I just read this from Mufu, which makes sense and disproves my previous theory.
I doubt they are discrete, "multi-purpose" units as that would rule out the inter-dependency.
If this holds true, then the texturing logic is probably dependent on the ddx/ddy filtering capabilities of one of the shader pipelines, being able to address only two textures while the shader performs the derivative calculations for filtering; check this for possible evidence in NV30. According to thepkrl:
Texture fetches and FP-ops do not work in parallel, so FP unit is probably involved in texture fetches somehow (perhaps DDX,DDY calculation). FX-ops do work in parallel with texture fetches.
Check out his diagram here, which indicates that in the Nvidia implementation of texturing, at least one shader's logic is used for something.
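
For reference, this is roughly the derivative math a texture fetch needs for mip selection; it follows the generic textbook LOD formula, not any confirmed NV3x datapath:

```python
# Why FP math gets tangled up in a texture fetch: mip level selection needs the
# screen-space derivatives of the texture coordinates.

import math

def mip_lod(ddx_uv, ddy_uv, tex_width, tex_height):
    """LOD = log2 of the longer of the two texel-space derivative vectors."""
    dx = math.hypot(ddx_uv[0] * tex_width, ddx_uv[1] * tex_height)
    dy = math.hypot(ddy_uv[0] * tex_width, ddy_uv[1] * tex_height)
    return math.log2(max(dx, dy, 1e-20))

# One texel per pixel across a 256x256 texture -> LOD 0 (base level)
print(mip_lod((1/256, 0.0), (0.0, 1/256), 256, 256))   # ~0.0
# Texture minified 4x in both directions -> LOD 2
print(mip_lod((4/256, 0.0), (0.0, 4/256), 256, 256))   # ~2.0
```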
 
The way I think of it is different (you know about this, I think, Mufu):

I think what we see here is the way the vertex processing pipeline works being transferred more completely to the fragment processing pipeline.

Namely, I view both pipelines as architected as one "uber" scheduling unit attached to a floating-point unit (all the branching and special "2.0+" functionality), plus one processing unit (set?) with a much narrower range of simple calculation functionality attached to a simpler processing handler (register combiner). The difference was that the vertex processing pipeline had the second unit with fp32-precision processing, while the fragment processing pipeline was limited to fx12 (for the NV30).

I also thought there were 2 tex op units but that the "uber" unit was tied up when using them for anything other than fixed texture fetching (in NV3x, limited to fragment processing usage).

I then thought that this design of the NV3x facilitated that the NV40 would achieve effective symmetry between the vertex and fragment processing pipelines, and then possibly remove redundancy by being able to use the resources for either dynamically, lending itself easily to a unified shading model.

What it seems like now is that the NV35 took the first step in this direction... the mystery is not how it did this, but how the NV30 failed to do it with its transistor budget, as it is only the NV30 transistor count and restricted capabilities that hid this possibility, AFAICS. Given the ability to achieve this in the NV35, it opens up the possibilities for NV40 again...
Simply allowing the "tex op" resources to be used by the vertex programming pipelines would move a great deal in this direction, wouldn't it?
What would be needed for a primitive processor... some sort of expanded "tex op"-alike unit treating vertices in a texture-like fashion? What else?

Anyway, unless we see functionality or peak ("low"-precision) performance dropped in the NV35 relative to the NV30, it seems, IMO, that the NV30 holds the record for the most wasteful chip design released, and that people who bought into the NV3x hype before the NV35 are getting burned in a major way o_O. It has been clear that the worst of nVidia has been on display in full force in the handling of the NV3x, but at least this indicates that engineering competitiveness is no longer absent :-?.
 
Interesting view, demalion. I don't know if you meant something similar, but I see the NV35 pixel pipeline as an array (similar to its vertex shader) consisting of 4 sets of 3 pipelined fp units, with some texture addressing logic added in each pipeline, all governed by a control unit which takes care of the branching, dependencies, etc. The mysterious part of the puzzle is just what the control logic is. Is it a discrete unit which issues commands to the fp pipelines?
 
Luminescent said:
Interesting view, demalion. I don't know if you meant something similar, but I see the NV35 pixel pipeline as an array (similar to its vertex shader) consisting of 4 sets of 3 pipelined fp units, with some texture addressing logic added in each pipeline, all governed by a control unit which takes care of the branching, dependencies, etc. The mysterious part of the puzzle is just what the control logic is. Is it a discrete unit which issues commands to the fp pipelines?

I did mean something similar, but my statement (and conception behind it) doesn't recognize texture addressing logic in the vertex processing pipeline for the NV35 or NV30 already. Is this just a failing in my understanding?

I think the NV30 and NV35 have discrete (but fairly similar) vertex and fragment processing control units, and the NV40 will have units even more similar to each other. The dynamic resource allocation might be too forward looking, depending on what is required for primitive processing and whether the NV40 is supposed to offer it...I had actually thought at NV30 launch that it might already have dynamic resource allocation, but perhaps I just don't properly recognize the hurdles with that in theorizing that it will be in the NV40. Hmm...I suppose it is even possible that this is one more thing that was simply broken, but I'd have to look more at the VS 2.0+ versus PS 2.0+ spec (or maybe the NV30 extensions that correspond for more clarity) to see how much sense it makes, though.
 
I have a quick question. In NV30, when you ran a PS1.1 shader, what units (or whatever) did those calculations?

The reason I'm asking is that even though the clock speed of the NV35 is 50 MHz lower than the NV30's, according to the ShaderMark test on [H], the fixed-function portion is about 10% faster than on NV30.

Is any of this connected? Or is the increase in that type of shader due to drivers and/or just optimizations of whatever units do this kind of fixed-function work?

Since I have no idea what I'm talking about here, hopefully someone can make sense out of that question and give me an answer, lol.
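
For what it's worth, a quick back-of-the-envelope check on those numbers; the exact clocks (500 MHz NV30 Ultra vs 450 MHz NV35 Ultra) are assumed from the "50 MHz lower" remark, and the "about 10% faster" figure is taken at face value:

```python
# Per-clock comparison implied by the observation above (clocks assumed).
nv30_mhz, nv35_mhz = 500, 450
score_ratio = 1.10                               # NV35 vs NV30 in the [H] ShaderMark run
per_clock_ratio = score_ratio * (nv30_mhz / nv35_mhz)
print(f"NV35 per-clock PS 1.1 throughput: ~{per_clock_ratio:.2f}x NV30's")
# ~1.22x -- too big to come from clock speed, so it has to be drivers and/or the units
```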
 
According to the test results by my friends, NV30 uses register combiners to do PS 1.1. It performs almost like a GF4 Ti of the same clock rate.
 
Demalion:
I did mean something similar, but my statement (and conception behind it) doesn't recognize texture addressing logic in the vertex processing pipeline for the NV35 or NV30 already. Is this just a failing in my understanding?
If you were trying to understand my "theory", it is not a failing of your understanding; in fact, we are in agreement. Like you, I only see the texturing logic as a possibility in the pixel shader of the NV30/NV35, not the vertex shader (or it would be VS 3.0 compliant). In the future, however, I think all units will include such logic in the pipeline, and resources will be allocated dynamically (like you stated).
 
Yeah, dynamic resource allocation was a goal of the R400, no? Too bad it's gonna be delayed so much.

Anyway, I think nVidia uses the same FX12 units (although not shared, just the same design) in many parts of the pipeline. Did you ever wonder why they have 12-bit subpixel precision? Maybe they use it for the lighting in T&L too.

I'm still wondering exactly what's happening for T&L in the NV30 (and NV35, too, since the same phenomenon is present). Dedicated units would just seem way too expensive... Maybe a pool of units shared between Triangle Setup and T&L, or something strange like that? I guess that'd be nearly unverifiable.

My theory for the NV3x currently is that it has no pipelines. Just control logic and units.
So that control logic would have one instruction cache of 1024 instructions and a cache for registers, and would then send whatever has to be calculated to those units.
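
A toy version of that "control logic plus a pool of units" idea, just to make the speculation concrete; the unit names and counts here are placeholders:

```python
# Per clock, a scheduler hands out work to whatever units in the pool are still
# free; anything left over waits for the next clock.  Pure speculation about
# how such control logic might behave, nothing NV3x-specific.

def issue_one_clock(pool, wanted):
    """pool/wanted: dicts of unit_type -> count.  Returns (issued, leftover)."""
    issued = {u: min(n, pool.get(u, 0)) for u, n in wanted.items()}
    leftover = {u: n - issued[u] for u, n in wanted.items() if n > issued[u]}
    return issued, leftover

pool = {"fp32_tex": 4, "fx12_mad": 8}
print(issue_one_clock(pool, {"fx12_mad": 12, "fp32_tex": 2}))
# ({'fx12_mad': 8, 'fp32_tex': 2}, {'fx12_mad': 4}) -- the rest waits a clock
```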

My guess for why David Kirk said 32 functional units in the NV30 is:
4 Color Output units
8 Z Output units (or at least, 8 you can use without AA)
4 FP32/TEX units
8 FX12 MUL/ADD units
8 FX12 MUL units

The FX12 MUL units serve to enable 8 LRP ops/clock in FX12 mode instead of 4. They are also less sophisticated than the other parts of the pipeline: heck, they can't even be used when the op is dependent, unlike the rest of the pipeline. They're obviously different, but I'm wondering in what way...

For the NV35, it'd be:
4 Color Output units
8 Z Output units
4 FP32/TEX units
8 FP32 units

So my guess is that the 8 FP units are different from the 4 FP32/TEX units: they probably can't do ddx, ddy, etc. in one clock, and they obviously can't be used for texturing.
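
Just tallying the guessed counts to show they do land on the quoted figure; this is plain arithmetic on the numbers in this post, nothing more:

```python
# The unit breakdowns guessed above, summed.
nv30_units = {"color_out": 4, "z_out": 8, "fp32_tex": 4, "fx12_madd": 8, "fx12_mul": 8}
nv35_units = {"color_out": 4, "z_out": 8, "fp32_tex": 4, "fp32": 8}

print(sum(nv30_units.values()))   # 32 -- matches David Kirk's NV30 figure
print(sum(nv35_units.values()))   # 24 -- the NV35 breakdown as guessed here
```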


Uttar
 
That might just be it, Uttar. I never thought of the fact that Nvidia could simply preclude the other units from having ddx/ddy ability.

The only problem I see with this conjecture is that ddx/ddy is a shader op and not a texture op in the CineFX architecture. For example, the R3xx architecture probably does partial derivatives in its texture unit, but there it is fixed and not configurable. In CineFX it is a shader instruction. If it is a shader instruction, then shouldn't all the fp units have single-cycle ddx/ddy ability (for parallel ddx/ddy performance when applying textures)?
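
For context, this is what ddx/ddy compute in the usual quad-based implementation: plain finite differences between neighbouring pixels in a 2x2 block, which is why a unit needs access to its neighbours' values to provide them (generic GPU behaviour, not an NV3x-specific detail):

```python
# ddx/ddy as finite differences across a 2x2 pixel quad.

def quad_derivatives(quad):
    """quad[y][x] holds a per-pixel value for the 2x2 footprint."""
    ddx = quad[0][1] - quad[0][0]     # right pixel minus left pixel
    ddy = quad[1][0] - quad[0][0]     # bottom pixel minus top pixel
    return ddx, ddy

# Texture u coordinate increasing by 1/256 per pixel horizontally:
quad_u = [[0.500, 0.500 + 1/256],
          [0.500, 0.500 + 1/256]]
print(quad_derivatives(quad_u))       # (0.00390625, 0.0)
```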
 