NV35 pipeline organization
Hey everyone,
Just thought I should start a new thread about this, since it might become a fairly big subject. Didn't see any yet.
The NV30 has 1 FP/TEX unit and 2 FX units per pipe.
There are three possible things nVidia could have done, since they kept their 4 pipes:
- 1FP/TEX unit, 1 true FP unit and 1 FX unit/pipe
- 2FP/TEX units and 1 FX unit, with the FP/TEX units only being able to do 1 independent fetch/clock instead of 2.
- 1FP unit, 2 TEX units, 2 FX units with a ridiculous amount of cheating.
My guess is actually number two.
With the old configuration, there was some sharing between FP & Tex, but TEX could do 8/clock, so I guess it had quite a few additional transistors too. So, with this, you wouldn't need as many additional transistors for the texturing, and the whole design thus becomes possible at 130M transistors with other overall optimizations.
Any feedback, comments, ideas?
Uttar
Uttar
LeStoffer
13-May-2003, 18:19
I'm thinking about the same thing, but I haven't seen any info or benchmark that gives any hint at what has been changed. Option number two does look promising, but right now I feel clueless. Sorry.
Joe DeFuria
13-May-2003, 18:27
Uttar,
Heh...could you clarify your definition of FP/TEX, FX and "True FP" units?
Yeah, we do have very very little info about it ( still more than about the NV40, though, hehe! )
The most we have is:
http://www.hardocp.com/article.html?art=NDcyLDEy
It seems 100% obvious nVidia is capable of 2FP/clock, but got lower efficiency ( register usage performance hits remain I guess, although they might have been lowered, who knows ) in most cases. The cases where it wins would likely be when it benefits from its bigger native instruction set.
This would slightly increase efficiency with FX too, because you could do 1FX/FP and 1TEX op in parallel, instead of always having to do 2TEX ops to get max efficiency.
So now, the NV35 is a lot nearer to an 8x1 than the NV30, even though it's still practically a 4-pipeline architecture. Funny, eh?
Uttar
EDIT: For Joe: FP/TEX: a unit that can do either FP or TEX ops, not in parallel. In the case of the NV30, you could do either 4 FP ops or 8 TEX ops.
True FP: a unit that can do FP ops in 1 clock, no sharing with texturing.
FX: a unit that can do FX ops in 1 clock
Joe DeFuria
13-May-2003, 19:01
EDIT: For Joe: FP/TEX: a unit that can do either FP or TEX ops, not in parallel. In the case of the NV30, you could do either 4 FP ops or 8 TEX ops.
OK, but I'm confused because you listed NV30 as "1FP/TEX unit and 2FX units/pipe". Doesn't that indicate only 1 texture operation/read per pipe total? (Doesn't NV30 have the ability to do two?) Is the TEX unit more analogous to the traditional TMU, or is the FX unit? I'm not clear on what the purpose of the "FX" unit is....
Well, based on the NV30 pipeline threads, I think it was finally agreed that there was a unit which could do either 1FP op/clock/pipe or 2TEX ops/clock/pipe
Or at least, that's the practical POV. There are obviously some dedicated transistors for each type of operation, but much of it is probably shared.
My idea is that with the NV35, it's 1FP op/clock/pipe or 1TEX op/clock/pipe for 2 FP/TEX units.
The FX unit is obviously the integer unit, for INT12 operations.
Uttar
Joe DeFuria
13-May-2003, 19:37
OK, I think we're on the same wavelength now. ;)
Options 2 and 3 really seem like the only feasible possibilities to me. It might actually be somewhat of a combination of the two.
I think the only way to really ascertain what's going on, is to have both the 5800 and 5900 side by side and run through several pixel shading tests...with several sets of drivers.
I think it's either the first or the second variant. But I tend to believe the first. That would mean either 8xFP or 8xTex + 4xFP per clock, which IMO best explains why the FX5900 is close to R9800Pro, but rarely surpasses it in 2.0 shaders although the FX is clocked higher.
I think it's either the first or the second variant. But I tend to believe the first. That would mean either 8xFP or 8xTex + 4xFP per clock, which IMO best explains why the FX5900 is close to R9800Pro, but rarely surpasses it in 2.0 shaders although the FX is clocked higher.
Actually, that's the second variant ;) The first is ( 4xFP or 8x Tex ) + 4xFP + 4xFX
There's two serious variants, and the third which is really much more of a paranoid dream.
Uttar
MDolenc
13-May-2003, 21:09
I actually got a reply on this from NVidia 2 hours ago. :wink:
Joe DeFuria
13-May-2003, 21:19
I actually got reply on this from NVidia 2 hours ago. :wink:
That would mean then, that unlike the NV30, the NV35 should be able to run the ARB2 path of Doom3 at the same speed as the NV30 path. So the "default" path for NV35 should, like R3xx, be ARB2, where the default path of NV30 will be NV30....correct?
I think it's either the first or the second variant. But I tend to believe the first. That would mean either 8xFP or 8xTex + 4xFP per clock, which IMO best explains why the FX5900 is close to R9800Pro, but rarely surpasses it in 2.0 shaders although the FX is clocked higher.
Actually, that's the second variant ;) The first is ( 4xFP or 8x Tex ) + 4xFP + 4xFX
There's two serious variants, and the third which is really much more of a paranoid dream.
(4xFP or 8xTex) + 4xFP is equal to 8xFP or (8xTex + 4xFP). I left out the FX units.
The second variant would be 8xFP or (4xTex + 4xFP) or 8xTex
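To make the two notations easier to compare, here is a minimal sketch that enumerates the per-clock issue mixes of both serious variants. It only restates the arithmetic above; the function names and the (FP, TEX) tuple layout are made up for illustration, FX units are left out, and nothing here is confirmed hardware behaviour.

```python
# Hypothetical per-clock issue model for the two "serious" variants.
# Each result is a (fp_ops, tex_ops) pair for the whole chip (4 pipes).
PIPES = 4


def variant_one_modes():
    """(4xFP or 8xTEX) + 4xFP: one shared unit (1 FP or 2 TEX) plus one dedicated FP unit per pipe."""
    shared_unit = [(1, 0), (0, 2)]  # the shared unit issues either 1 FP op or 2 TEX ops
    dedicated_fp = 1                # the dedicated unit always issues 1 FP op
    return sorted({(PIPES * (fp + dedicated_fp), PIPES * tex) for fp, tex in shared_unit})


def variant_two_modes():
    """(4xFP or 4xTEX) x 2: two shared units per pipe, each issuing 1 FP op or 1 TEX op."""
    unit = [(1, 0), (0, 1)]
    modes = set()
    for fp_a, tex_a in unit:
        for fp_b, tex_b in unit:
            modes.add((PIPES * (fp_a + fp_b), PIPES * (tex_a + tex_b)))
    return sorted(modes)


if __name__ == "__main__":
    print("variant 1:", variant_one_modes())
    print("variant 2:", variant_two_modes())
```

Variant one comes out as 8 FP, or 4 FP + 8 TEX; variant two as 8 FP, 4 FP + 4 TEX, or 8 TEX, which matches the two lines above.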
MDolenc,
interesting information. If that's true it should be significantly faster than R300 in shaders that use few registers.
No it's not :)
The difference all lies in parallelism. It's easier to get parallelism with ( 4x FP or 4x TEX ) x 2 than with (4xFP or 8xTex) + 4xFP
MDolenc: VERY interesting info! That would most definitely justify the "Force FP16" flag nVidia has got MS to put in a future revision of DX9!
That most certainly explains the "12 ops/clock" number from the outdated PR docs I leaked a while back.
Anyway, very nice info. I guess nVidia is gonna have a fair bit of trouble with the new FP16/FP32 switching though. I guess the hit comes when there's switching in the same pass. Funny performance hit, hehe.
Uttar
I actually got reply on this from NVidia 2 hours ago. :wink:
That would mean then, that unlike the NV30, the NV35 should be able to run the ARB2 path of Doom3 at the same speed as the NV30 path. So the "default" path for NV35 should, like R3xx, be ARB2, where the default path of NV30 will be NV30....correct?
Possibly. Maybe NV35 would still be faster in Doom3 when using FP16. Then even a modified NV30 path would make sense.
No it's not :)
The difference all lies in parallelism. It's easier to get parallelism with ( 4x FP or 4x TEX ) x 2 than with (4xFP or 8xTex) + 4xFP
True, dependent texture reads are easier with (4xFP or 4xTex) x 2, which is your second variant. But (4xFP or 8xTex) + 4xFP can do more per clock.
LeStoffer
13-May-2003, 21:56
I actually got reply on this from NVidia 2 hours ago. :wink:
It actually seems that integer logic is gone from NV35 pixel shaders. It is capable of 3 floating point instructions per pipe per clock (12 floating point instructions per clock total, and it doesn't care that much about fp16 vs. fp32 either) or 2 floating point instructions + 2 texture look-ups per pipe per clock.
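As a rough illustration of what those two issue modes would mean, the sketch below turns them into chip-wide totals and an idealised lower bound on clocks per pixel. It assumes 4 pipes, perfect instruction packing, and no dependency or register-pressure stalls; the names `MODES` and `min_clocks_per_pixel` are invented for this example.

```python
from fractions import Fraction

PIPES = 4
# Per-pipe, per-clock issue modes as reported: (fp_ops, tex_ops).
MODES = {"all_fp": (3, 0), "fp_plus_tex": (2, 2)}


def per_clock_totals(pipes=PIPES):
    """Chip-wide (fp, tex) throughput per clock for each mode."""
    return {name: (fp * pipes, tex * pipes) for name, (fp, tex) in MODES.items()}


def min_clocks_per_pixel(fp_ops, tex_ops):
    """Idealised lower bound on clocks one pipe spends on one pixel.

    Texture lookups force the 2 FP + 2 TEX mode; whatever FP work is left
    over afterwards runs in the 3 FP mode. Real shaders will do worse.
    """
    tex_clocks = Fraction(tex_ops, 2)
    fp_left = max(0, fp_ops - 2 * tex_clocks)  # FP ops not hidden behind lookups
    return tex_clocks + Fraction(fp_left) / 3


if __name__ == "__main__":
    print(per_clock_totals())
    print(min_clocks_per_pixel(5, 2))
```

With these assumptions a shader of 2 lookups and 2 FP ops needs one clock per pixel per pipe, and every extra three FP ops add one more.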
Woah! If this is true - and why not? - it makes the original CineFX look somewhat outdated already. Thus nVidia's claim for CineFX version 2.0. I'm all for going full FP if performance allows (like on R3x0), but I really wonder where this leaves the NV31 and NV34 in terms of developer support now that NV30 - and the int12 lead with it - is de facto a dead end. :?
Nice thread :)
I don't have any numbers that would let me talk about the NV35 pipeline organization without any (or not too much) doubt. Actually my guess was that NVIDIA had kept the same pipeline as NV30 (including FX units) with one more unit per pipeline: a floating point one or a FP/tex one (or FP/address processor). Judging by the HOCP Shadermark results, it seems like there is another change to increase FP shader power. I thought that NVIDIA had doubled the number of registers usable without a performance hit.
But MDolenc information makes sense too (but isn't it a too big change from NV31-34-30 ?). If it's true I think that it's a pretty nice design. This way, the NV35 has the same theoretical throughput that the Radeon 9800/9700 has in case of 2 texture lookups + 2 FP ops. The NV35 has an advantage when there's more FP ops than texture lookups but on the other side needs more optimised shader with less dependence.
If it's true, the only drawback compared to the NV30 would be the loss of the double FX multiplication power in fixed point units (5 multiplication FX ops per cycle possible). Everything else should be faster or a lot faster. One possible question is: are the new FP units able to do every operation? Maybe they can just do simple operations and only the FP/tex unit is able to do every complex operation? (it's just a question I'm asking myself ;) )
The FP16/32 question remains. If NVIDIA has kept the same register access organization, FP16 remains very gainful as it allows access with no performance drop to 4 registers instead of 2. Using FP16 and FP32 in the same pipeline could be a problem when dealing with register usage optimisation. So it should be better to use only FP32 or only FP16.
I actually got reply on this from NVidia 2 hours ago. :wink:
That would mean then, that unlike the NV30, the NV35 should be able to run the ARB2 path of Doom3 at the same speed as the NV30 path.
Due to a bug, ARB2 currently does not work with NVIDIA's DX9 cards when using the preview version of the Detonator FX driver. According to NVIDIA, ARB2 performance with the final driver should be identical to that of the NV30 code.
MuFu.
demalion
13-May-2003, 22:22
Woah, I hadn't expected that until NV40. I had no idea the NV30 was that broken. Well, I did, but I dismissed the possibility too soon, it appears.
Woah, I hadn't expected that until NV40.
I hadn't expected it either :P I thought that NVIDIA would try to use a pipeline very similar to the NV30/1/4 pipelines to "help" developers make shaders that every NV3x likes.
It's great if NV35 can work properly at full speed with the ARB2 path.
Eck. . . definitely a case of driver optimization. If we go into "conspiracy theory mode", we can speculate that NVidia purposely broke ARB_fragment_program support so that hardware sites would have no choice at all but to use the NV30 path for the benchmark. . .
EDIT: misread a post. . .
Luminescent
13-May-2003, 22:50
But are those 12 fp units just the shader units, the number of shader units and texture units combined, or are they all capable of functioning as either?
But are those 12 fp units just the shader units, the number of shader units and texture units combined, or are they all capable of functioning as either?
No. It's just that the TEX and FP units are intrinsically linked in such a way that allows 3 FP/pipe/clock OR 2 FP + 2 TEX lookup. I suppose this is most likely due to shared physical logic. I doubt they are discrete, "multi-purpose" units as that would rule out the inter-dependency.
MuFu.
demalion
13-May-2003, 22:59
Any increase in actual floating point performance is a very good thing, as long as testing bears out that it is real...all the other performance issues were not nearly as significant as this and the impact on DX 9 moving forward. Thinking back to the GDC slides and what it proposes for the HLSL ps_2_a target, this evolution seems natural and according to nVidia's original plan for NV3x (hmm...and also in line with some speculation I had intended for the forums, but restricted to some PMs due to a disappearing thread).
I don't see nVidia blatantly lying about this, and it makes sense within the assumptions about the NV30 transistor count that I abandoned a while ago as unrealistic, and the good news is that Wavey has an NV35 to put through its paces.
The bad news is that he won't have as much to tease us about with regards to surprises with the results until he finishes.
Oh, wait, that's only bad news for him :P.
Luminescent
13-May-2003, 23:02
I hope he puts NV35 through various shader benchmarks, including the ones developed by our very own forum members. :wink:
Luminescent
13-May-2003, 23:08
When the NV35 carries out RGBA computations involving complex ops (ddx, ddy, rsq, lrp), do you guys believe it has to perform the ops on all four components, due to the apparent lack of the independent scalar support (in contrast to R3xx), or is there a separate special function unit for this type of computation?
I didn't expect it either. Means there's probably quite a bit of scope for compiler optimisation now as well.
Luminescent mentioned "general-purpose" units and Uttar touched on the idea as well. Can we think about it like this, per pipe...
~>NV30 has one FP unit and one general-purpose unit that can either lookup two textures or execute one FP pixel op.
~>NV35 has two FP units and one general purpose unit.
:?:
Logically, it is quite neat.
MuFu.
Luminescent
14-May-2003, 02:29
Funny thing is that fp "shader" units are general purpose enough, as they are composed of 4 general fmads (perhaps with support for other instructions). Wouldn't 4 fmads with ddx/ddy and other special capabilities (thepkrl confirmed NV30 could execute 4 ddx/ddy's per fragment pipeline) be up to the challenge of texture fetching and filtering? It seems to me that an fp shader unit can be used as a texture unit, but it is not always the other way around (in previous architectures, at least). Many times texture addressing/filtering units use fixed function/optimized logic, which is configurable, but not entirely programmable.
It might be that the fp shader units can be used for texture fetches and general shading ops (tex ops don't seem to be so complicated for general purpose shading logic).
Well, yes, but I think nVidia is using quite a few more tricks in the texturing system than your average Joe GPU company's architecture.
I think they've got a very effective latency hiding system for it. Obviously, two would cost more than one :)
So something that might still be discussed is whether ONE of the three FP units can do Tex ops, or if all of them can do them, but they could only be done once per pass. That might be revealed quite easily with shader tests.
Uttar
Luminescent
14-May-2003, 06:19
I just read this from Mufu, which makes sense and disproves my previous theory.
I doubt they are discrete, "multi-purpose" units as that would rule out the inter-dependency.
If this holds true, then the texturing logic is probably dependent on the ddx/ddy filtering capabilities of one of the shader pipelines, being able to address only two textures, while the shader performs the derivative calculations for filtering; check this (http://www.beyond3d.com/forum/viewtopic.php?p=99900#99900) for possible evidence in NV30. According to thepkrl:
Texture fetches and FP-ops do not work in parallel, so FP unit is probably involved in texture fetches somehow (perhaps DDX,DDY calculation). FX-ops do work in parallel with texture fetches.
Check out his diagram here (http://www.beyond3d.com/forum/viewtopic.php?p=100559#100559), which indicates that in the Nvidia implementation of texturing, at least one shader's logic is used for something.
demalion
14-May-2003, 07:02
The way I think of it is different (you know about this, I think, Mufu):
I think what we see here is the way the vertex processing pipeline works more completely transferred to the fragment processing pipeline.
Namely, I view both pipelines as architected as one "uber" scheduling unit attached to a floating point unit (all the branching and special "2.0+" functionality) + one processing unit (set?) with a much narrower range of simple calculation functionality attached to a simpler processing handler (register combiner). The difference was that the vertex processing pipeline had the second unit with fp32 precision processing, and the fragment processing pipeline was limited to fx12 (for the NV30).
I also thought there were 2 tex op units but that the "uber" unit was tied up when using them for anything other than fixed texture fetching (in NV3x, limited to fragment processing usage).
I then thought that this design of the NV3x facilitated that the NV40 would achieve effective symmetry between the vertex and fragment processing pipelines, and then possibly remove redundancy by being able to use the resources for either dynamically, lending itself easily to a unified shading model.
What it seems like now is that the NV35 took the first step in this direction ...the mystery is not how it did this, but how the NV30 failed to do it with its transistor budget, as it is only the NV30 transistor count and restricted capabilities that hid this possibility AFAICS. Given the ability to achieve this in the NV35, it opens up the possibilities for NV40 again...
Simply allowing the "tex op" resources to be used by the vertex programming pipelines would move a great deal in this direction, wouldn't it?
What would be needed for a primitive processor...some sort of expanded "tex op"-alike unit treating vertices in a texture-like fashion? What else?
Anyways, unless we see functionality or peak ("low" precision) performance dropped in the NV35 in relation to the NV30, it seems, IMO, that the NV30 holds the record for the most wasteful chip design released, and that people who bought into the NV3x hype before the NV35 are getting burned in a major way :x. It has been clear that the worst of nVidia has been evident in full force in the handling of the NV3x, but at least this indicates that engineering competitiveness is no longer absent :-?.
Luminescent
14-May-2003, 07:33
Interesting view, demalion. I don't know if you meant something similar, but I see the NV35 pixel pipeline as an array (similar to its vertex shader) consisting of 4 sets of 3 pipelined fp units, with some texture addressing logic added in each pipeline, all governed by a control unit which takes care of the branching, dependencies, etc. The mysterious part of the puzzle is just what the control logic is. Is it a discrete unit which issues commands to the fp pipelines?
demalion
14-May-2003, 07:50
Interesting view, demalion. I don't know if you meant something similar, but I see the NV35 pixel pipeline as an array (similar to its vertex shader) consisting of 4 sets of 3 pipelined fp units, with some texture addressing logic added in each pipeline, all governed by a control unit which takes care of the branching, dependencies, etc. The mysterious part of the puzzle is just what the control logic is. Is it a discrete unit which issues commands to the fp pipelines?
I did mean something similar, but my statement (and conception behind it) doesn't recognize texture addressing logic in the vertex processing pipeline for the NV35 or NV30 already. Is this just a failing in my understanding?
I think the NV30 and NV35 have discrete (but fairly similar) vertex and fragment processing control units, and the NV40 will have units even more similar to each other. The dynamic resource allocation might be too forward looking, depending on what is required for primitive processing and whether the NV40 is supposed to offer it...I had actually thought at NV30 launch that it might already have dynamic resource allocation, but perhaps I just don't properly recognize the hurdles with that in theorizing that it will be in the NV40. Hmm...I suppose it is even possible that this is one more thing that was simply broken, but I'd have to look more at the VS 2.0+ versus PS 2.0+ spec (or maybe the NV30 extensions that correspond for more clarity) to see how much sense it makes, though.
Lezmaka
14-May-2003, 09:00
I have a quick question. In NV30, when you ran a PS1.1 shader, what units (or whatever) did those calculations?
The reason I'm asking is because even though the clockspeed of NV35 is 50 MHz lower than the NV30, according to the shadermark test on [H], the fixed function portion is about 10% faster than NV30.
Is any of this connected? Or is the increase in that type of shader due to drivers and/or just optimizations of whatever units do this kind of fixed function?
Since I have no idea what I'm talking about here, hopefully someone can make sense out of that question and give me an answer, lol.
According to the test results by my friends, NV30 uses register combiners to do PS 1.1. It performs almost like a GF4 Ti of the same clock rate.
Last edited by MDolenc on 14 May 2003 08:18, edited 1 time in total
Hmm ... :wink:
Luminescent
14-May-2003, 15:56
Wow, could it be there are Nvidia lurkers here?
Luminescent
14-May-2003, 16:17
Demalion:
I did mean something similar, but my statement (and conception behind it) doesn't recognize texture addressing logic in the vertex processing pipeline for the NV35 or NV30 already. Is this just a failing in my understanding?
If you were trying to understand my "theory", it is not a failing of your understanding, in fact, we are in agreement. Like you, I only see the texturing logic as a possibility in the pixel shader of the NV30/NV35, not the vertex shader (or it would be PS 3.0 compliant). In the future, however, I think all units will include such logic in the pipeline, and resources will be allocated dynamically (like you stated).
Yeah, dynamic resource allocation was a goal of the R400, no? Too bad it's gonna be delayed so much.
Anyway, I think nVidia uses their same FX12 units ( although not shared, just same design ) in many parts of the pipeline. Did you ever wonder why they got 12-bit subpixel precision? Maybe they use it for the Lighting in T&L too.
I'm still wondering exactly what's happening for T&L in the NV30 ( & NV35, too, since the same phenomenon is present ) . Dedicated units would just seem way too expensive... Maybe a pool of units shared between Triangle Setup and T&L, or something strange like that? I guess that'd be nearly unverifiable.
My theory for the NV3x currently is that it has no pipelines. Just control logic and units.
So, that control logic would have one instruction cache of 1024 instructions and a cache for registers, and would then send what has to be calculated to those units.
My guess for why David Kirk said 32 functional units in the NV30 is:
4 Color Output units
8 Z Output units ( or at least, 8 you can use without AA )
4 FP32/TEX units
8 FX12 MUL/ADD units
8 FX12 MUL units
The FX12 MUL units have the use of enabling 8 LRP ops/clock in FX12 mode instead of 4. They are also less sophisticated than the other parts of the pipeline: heck, they can't even be used when the op is dependent, contrary to the rest of the pipeline. They're obviously different, but I'm wondering in which way they are...
For the NV35, it'd be:
4 Color Output units
8 Z Output units
4 FP32/TEX units
8 FP32 units
So my guess is the 8FP units are different than the 4 FP32/TEX units, they probably can't do ddx, ddy, ... in one clock and they obviously can't be used for texturing.
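The unit tallies above are easy to sanity-check. This small sketch just sums the guessed unit lists; the dictionary keys are labels invented for the example, and the counts are the guesses from this post, not confirmed figures.

```python
# Guessed functional-unit breakdowns from the post above (not confirmed specs).
NV30_UNITS = {
    "color_output": 4,
    "z_output": 8,
    "fp32_tex": 4,
    "fx12_mul_add": 8,
    "fx12_mul": 8,
}
NV35_UNITS = {
    "color_output": 4,
    "z_output": 8,
    "fp32_tex": 4,
    "fp32": 8,
}


def total_units(units):
    """Total number of functional units in a breakdown."""
    return sum(units.values())


def fp_capable_units(units):
    """Units able to issue an FP32 op: the shared FP32/TEX units plus any pure FP32 units."""
    return units.get("fp32_tex", 0) + units.get("fp32", 0)


if __name__ == "__main__":
    print("NV30 total:", total_units(NV30_UNITS))
    print("NV35 total:", total_units(NV35_UNITS))
    print("NV35 FP-capable:", fp_capable_units(NV35_UNITS))
```

The NV30 list adds up to the 32 units quoted, the NV35 guess adds up to 24, and its 12 FP-capable units line up with the "12 ops/clock" figure.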
Uttar
Luminescent
14-May-2003, 17:45
That might just be it, Uttar. I never thought of the fact that Nvidia could just preclude the other units from having ddx/ddy ability.
The only problem I observe by looking at this conjecture deals with the fact that ddx/ddy is a shader and not a texture op in the CineFX architecture. For example, the R3xx architecture probably does partial derivatives in its texture unit, but it is fixed and not configurable. In CineFX it is a shader instruction. If it is a shader instruction, then shouldn't all the fp units have single cycle ddx/ddy ability (for parallel ddx/ddy performance, when applying textures)?
LeStoffer
14-May-2003, 18:02
Wow, could it be there are Nvidia lurkers here?
Absolutely. The company just doesn't seem too keen on letting them post here. :oops:
8 FX12 MUL/ADD units
8 FX12 MUL units
The FX12 MUL units have the use of enabling 8 LRP ops/clock in FX12 mode instead of 4. They are also less sophisticated than the other parts of the pipeline: heck, they can't even be used when the op is dependent, contrary to the rest of the pipeline. They're obviously different, but I'm wondering in which way they are...
They are most likely register combiner units as described in NV_register_combiner extension specs.
They take four inputs (each with modifiers: negate, bias, invert, etc.) and provide up to 3 outputs:
- (A * B), (C * D), ((A * B) + (C * D))
- (A * B), (C * D), (A * B) or (C * D) conditional
- (A dot B), (C dot D)
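A minimal sketch of one such combiner stage, covering the three output configurations listed above. Input modifiers and the fixed-point range clamping are left out, the `mode` and `select_ab` parameters are made up for illustration (real hardware picks the conditional result from a register value, not a flag), and plain floats stand in for FX12 values.

```python
def _mul(u, v):
    """Component-wise product of two equal-length vectors."""
    return tuple(x * y for x, y in zip(u, v))


def _dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def combiner_stage(a, b, c, d, mode="sum", select_ab=True):
    """Return the outputs of one simplified general combiner stage.

    mode="sum": (A*B, C*D, A*B + C*D)
    mode="mux": (A*B, C*D, A*B if select_ab else C*D)
    mode="dot": (A dot B, C dot D)
    """
    if mode == "dot":
        return _dot(a, b), _dot(c, d)
    ab, cd = _mul(a, b), _mul(c, d)
    if mode == "sum":
        return ab, cd, tuple(x + y for x, y in zip(ab, cd))
    if mode == "mux":
        return ab, cd, ab if select_ab else cd
    raise ValueError("unknown mode: " + mode)


if __name__ == "__main__":
    a, b = (0.5, 0.5, 0.5), (1.0, 0.0, 0.5)
    c, d = (0.25, 0.25, 0.25), (1.0, 1.0, 0.0)
    print(combiner_stage(a, b, c, d, mode="sum"))
    print(combiner_stage(a, b, c, d, mode="dot"))
```

Note that one stage performs two multiplies and at most one add, which fits the MUL/ADD plus MUL split guessed at above.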
Hmm, yes, you're probably right Xmas. Although it wouldn't be the same register combiners as in the NV25, since it's FX12 and not FX9.
Uttar
Luminescent
14-May-2003, 18:43
If it is true that in NV35, only the fp32/tex units have ddx/ddy filtering capability, what about all other special functions in the remaining fp32 shaders? They won't be capable of cos/sin/rsq, etc.?
If it is true that only the fp/tex unit has ddx/ddy filtering capability, what about all the other special functions? Does this mean the other fp units will not be capable of cos/sin/rsq, etc.?
No idea. I'd be surprised if it couldn't do most of them, though. I guess we'll need more info on all those things.
Uttar
One thing about ddx/ddy.
It seems dirt-simple to do, one of the arithmetically simplest instructions in the set. It's just a subtraction, and the result is shared between pipes. The only odd thing about it is that it needs data transferred between the pipes.
Basic: remember that interview with David Kirk stating that the NV30 performed most ops 4 pixels at a time (and could have up to 32 pixels in flight in the shader units)?
I strongly suspect that shader ops (including ddx/ddy) are executed on 2x2 pixel blocks - i.e. each "pipe" operates not on pixels, but on pixel stamps.
This makes obtaining the inputs for ddx/ddy instructions "easy", plus allows you to hide some latency of the shader ops (4 cycles sounds about right for a pipelined fmac).
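To show why a 2x2 stamp makes ddx/ddy "just a subtraction", here is a minimal sketch that takes one scalar value per pixel of a quad and differences the neighbours. The layout and the per-row/per-column choice are assumptions made for the example; actual hardware may well compute a single derivative per quad instead.

```python
def quad_derivatives(quad):
    """Screen-space derivatives of a value over a 2x2 pixel stamp.

    quad is ((top_left, top_right), (bottom_left, bottom_right)).
    Returns (ddx, ddy): ddx holds one horizontal difference per row,
    ddy one vertical difference per column.
    """
    (tl, tr), (bl, br) = quad
    ddx = (tr - tl, br - bl)  # right neighbour minus left neighbour
    ddy = (bl - tl, br - tr)  # lower neighbour minus upper neighbour
    return ddx, ddy


if __name__ == "__main__":
    # e.g. one texture coordinate component evaluated at the four pixels of a stamp
    print(quad_derivatives(((1.0, 3.0), (2.0, 6.0))))
```

Every input needed is already present inside the stamp, so no data has to come from outside the group of four pixels being processed together.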
Luminescent
14-May-2003, 21:32
In accordance with thepkrl's (http://www.beyond3d.com/forum/viewtopic.php?p=100559#100559) NV30 model and MDolenc's information, which claimed Nvidia basically scrapped the integer hardware for NV35, a single pixel pipeline of NV35 probably resembles this:
temporary registers (R0,R1,..,H0,H1,..)
|
FLOAT (perhaps does DDX/DDY for dependent fetches)
| \
| TEXTURE <-- f[TEX0],f[TEX1],.. (DDX/DDY is free)
| /
FLOAT <-- temp registers
|
FLOAT <-- temp registers
|
(loopback to temporary registers or output)
Here the two lower FLOAT units are the new additions, replacing the two INTEGER units of the NV30 pipeline. For comparison, an NV30 pipeline resembles this:
(thepkrl)
temporary registers (R0,R1,..,H0,H1,..)
|
FLOAT (perhaps does DDX/DDY for dependent fetches)
| \
| TEXTURE <-- f[TEX0],f[TEX1],.. (DDX/DDY is free)
| /
INTEGER <-- f[COL0],f[COL1]
|
INTEGER <-- f[COL0],f[COL1]
|
(loopback to temporary registers or output)
These models are in accordance with thepkrl's architectural findings and are consistent with the data we have to date.
Alongside that, from the referenced NV30 thread, I found this interesting:
(thepkrl)
When textures are fetched, nothing else can be done in the FLOAT/TEXTURE unit. The following INTEGER unit can use the fetched texture result freely. FLOAT/TEXTURE unit can perform one fetch if the coordinates come from a previous calculation (meaning 4 textures/cycle). If the coordinates come directly from inputs as in PS1.1, the unit can perform a pair of texture fetches (meaning 8 textures/cycle).
Basically the texture unit can only make two fetches when the texture coordinates are given directly as inputs and there are no pending results from a previous operation (dependency). In NV30, one fp shader unit seems to share its resources with the texture unit (maybe ddx/ddy, as previously surmised), perhaps explaining NV35's 2fp+2tex vs. 3fp tradeoff.
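A small sketch of that fetch rule for the NV30, using thepkrl's figures quoted above. It assumes 4 pipes and that direct fetches can always be paired up; the function names are invented for the example.

```python
import math

PIPES = 4


def textures_per_clock(coords_from_calculation, pipes=PIPES):
    """Peak chip-wide texture fetches per clock under the quoted rule.

    One fetch per pipe when the coordinates come from an earlier calculation,
    a pair of fetches per pipe when they come straight from the inputs.
    """
    per_pipe = 1 if coords_from_calculation else 2
    return pipes * per_pipe


def fetch_clocks(direct_fetches, dependent_fetches):
    """Clocks one pipe spends fetching for one pixel (assumes direct fetches pair perfectly)."""
    return math.ceil(direct_fetches / 2) + dependent_fetches


if __name__ == "__main__":
    print(textures_per_clock(coords_from_calculation=False))
    print(textures_per_clock(coords_from_calculation=True))
    print(fetch_clocks(direct_fetches=3, dependent_fetches=2))
```

This reproduces the 8 textures/cycle and 4 textures/cycle cases, and shows how a shader mixing both kinds of fetch lands somewhere in between.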
Luminescent
14-May-2003, 23:53
psurge:
I strongly suspect that shader ops (including ddx/ddy) are executed on 2x2 pixel blocks - i.e. each "pipe" operates not on pixels, but on pixel stamps.
Would this explain why shader benchmarks such as rightmark (see pixel shader results here (http://www.digit-life.com/articles2/gffx/5900u.html)) run suboptimally on NV35 (considering its 12 fp shader units)? It seems either the drivers or the applications are not yet ready to properly distribute the fp ops to the 4 pipelines of 3 pipelined fp units each. The R3xx's pipeline is already configured to distribute all operations evenly, with 8 parallel fp shaders. Given its parallel nature, a pixel block is less necessary to make full use of resources (there is one shader assigned to at least 1 pixel every clock), while NV35, on the other hand, should require the pixel blocks to exploit the serial nature of its pipelined fp units.
If no more than 1 pixel can be fed to each pipeline for shading, 2 shaders will remain idle per pipeline. If a larger pixel block can be fed, all the units can be busily chewing on the supplied pixels. This is all of course in my rationale; anyone with knowledge on the subject should feel free to correct me.
Luminescent
15-May-2003, 02:12
Wow, I just ran into Uttar's NV35 leaked spec page (http://www.beyond3d.com/forum/viewtopic.php?p=105197#105197), which pretty much confirms the 12 fp ops per clock number as the NV35's possible shader execution rate. Check it out:
Accelerated pixel shaders allow for up to 12 pixel shader operations/clock
Luminescent
15-May-2003, 03:21
I've been trying to justify Digit-life's low pixel shader 2.0 (Rightmark) scores all day and I think I've arrived at a conclusion. I arrived at it by contrasting the R350 and NV35 architectures through analogy (the following information should be a given to the techheads at B3D, but being that 3D hardware is only a hobby of mine, as of now, I've only given the subject serious thought recently).
Here is the didactic, mock scenario (in a world of conventional/stereotypical pipelines):
Let us say two vpu's (A & B) have equal clockspeeds. Vpu A has a fillrate of 1.5 while B has a fillrate of 3. Assuming a simple world, we can conclude that if vpu A has n pipelines, vpu B has 2n pipelines. Vpu A, however, has 2 texture units per pipeline while vpu B has 1 texture unit per pipeline (both are capable of loopback). Vpu A and B attain the same Gtexel rate; vpu A by pipelining its two texture units per pipe and vpu B through more extensive superscalar (8-way) execution with one texture unit per pipeline.
Now the benchmarks come rolling in. On benchmarks with extensive use of multitexturing, vpu A is put to use most efficiently, its texture units can be used simultaneously, and its resources are well employed. Vpu B will have to loopback to multitexture, although it will produce twice as many pixels when it outputs its final results, so the performance of vpu's A and B is relatively equal. When the heavy single texturing marks begin to arrive, vpu A struggles because it can only use one of its texture units per pipe and the extra logic is put to no use. Vpu B excels compared to A, because all its resources are employed efficiently and it has twice the amount of crucial resources that A does.
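The mock scenario can be put into numbers with a very small model. It assumes vpu A has 4 pipelines with 2 texture units each and vpu B has 8 pipelines with 1 texture unit each (the 8-way part mentioned above), equal clocks, and one loopback pass per extra group of texture layers; the function and constant names are made up for this sketch.

```python
import math


def clocks_to_texture(pixels, layers, pipes, tex_units_per_pipe):
    """Clocks needed to apply `layers` textures to `pixels` pixels.

    Each pipe handles one pixel at a time and loops back until every layer
    has been applied, using up to tex_units_per_pipe layers per pass.
    """
    batches = math.ceil(pixels / pipes)
    passes_per_pixel = math.ceil(layers / tex_units_per_pipe)
    return batches * passes_per_pixel


# Assumed configurations for the analogy (not real chip specs).
VPU_A = {"pipes": 4, "tex_units_per_pipe": 2}
VPU_B = {"pipes": 8, "tex_units_per_pipe": 1}

if __name__ == "__main__":
    for layers in (1, 2, 3):
        a = clocks_to_texture(8, layers, **VPU_A)
        b = clocks_to_texture(8, layers, **VPU_B)
        print(layers, "layer(s): A =", a, "clocks, B =", b, "clocks")
```

For 8 pixels the two tie at 2 layers, B is twice as fast at 1 layer, and B still leads at 3 layers because A wastes half of its second pass.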
By equating textures to instructions and texture units to fp shader units:
We can observe NV35 suffering from the same fate endured by vpu A. NV35 may contain 2 fp shaders per pipeline, but if the benchmark's pixel shaders consist of short instruction counts, the vpu will incur a disadvantage. Each pixel will need less shading (instructions), and unless those instructions come in multiples of two, NV35 will not effectively use its 2 fp units per pipeline. Each pipeline's resources will have a greater likelihood of being unemployed, as opposed to the other leading brand vpu, which can address more pixels simultaneously with only 1 fp shader per pixel, in shaders with instruction counts which are multiples of 1 per pixel.
This may have seemed obvious, but it sure offered me some speculation venting therapy. Please correct me if any of the comparisons or concepts are not valid.
Luminescent
15-May-2003, 04:53
I just read the following from Anandtech's review (http://www.anandtech.com/video/showdoc.html?i=1821&p=8) of the NV35, which seems to confirm the given NV35 conjectures and information about the pixel shader pipeline:
If you correctly pack the instructions that are dispatched to the execution units in this stage you can yield significantly more than 8 pixel shader operations per clock. For example, in NVIDIA's architecture a multiply/add can be done extremely fast and efficiently in these units, which would be one scenario in which you'd yield more than 8 pixel shader ops per clock.
It all depends on what sort of parallelism can be extracted from the instructions and data coming into this stage of the pipeline.
antlers
15-May-2003, 05:32
Anyone want to change their opinion on whether extensive and blatant cheating, rather than architectural changes, is responsible for the observed performance improvements in NV35 now that NVidia has been caught, well, blatantly and extensively cheating?
It's enough to make you think that 50% of a video card driver's CPU load is due to benchmark-detection schemes.
Luminescent
15-May-2003, 05:37
Dangit, you just had to suck the fun out of it, didn't you!!
Being that this architectural thread has focused more on the NV35's pixel shading arrangement, alongside the fact that it seems harder to cheat in the pixel shader 2.0 mark of 3DMark03 (due to the immediate appearance of rendering artifacts), the Detonator FX ordeal would seem to not be a major concern (with respect to changes in architecture). I do believe that fp32 precision is being used for the 5900 FX this time around, but the cheating (if any :wink: ) would seem to stem from hand tailoring the amount rendered rather than the rendering quality.
demalion
15-May-2003, 12:50
Your analogy works for the TMU example, but the shader example, when applied to the R3xx versus NV3x, fails to recognize that the R3xx can do more than 1 op per cycle as well.
The shader doesn't have to be short, it just has to be limited in applicability to what the architecture can perform in one clock cycle: the instructions have to meet specific criteria. With the R3xx, when that fails, you can still process 8 pixel ops per clock...with the NV3x (presumably all of them, we need some testing) you can only do 4.
NV30: Best case, 12. Worst case, 4.
R300: Best case, 24. Worst case, 8.
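The best and worst case figures above can be reproduced with a simple pipes-times-ops model. This is only a sketch: the per-pipe breakdown (up to 3 co-issued ops in the best case, 1 in the worst case) is an assumption picked to match the quoted numbers, and the dictionary keys are hypothetical names.

```python
# Per-pipe co-issue limits are assumptions chosen to reproduce the figures in the post.
ARCHITECTURES = {
    "NV30": {"pipes": 4, "best_ops_per_pipe": 3, "worst_ops_per_pipe": 1},
    "R300": {"pipes": 8, "best_ops_per_pipe": 3, "worst_ops_per_pipe": 1},
}


def ops_per_clock(arch: dict) -> tuple[int, int]:
    """Return (best case, worst case) ops per clock for one architecture."""
    best = arch["pipes"] * arch["best_ops_per_pipe"]
    worst = arch["pipes"] * arch["worst_ops_per_pipe"]
    return best, worst


if __name__ == "__main__":
    for name, arch in ARCHITECTURES.items():
        best, worst = ops_per_clock(arch)
        print(f"{name}: best case {best}, worst case {worst}")
```

What a flat model like this cannot show is the point made next: which instruction mixes actually reach the best case differs between the two designs.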
The NV30 opportunities depend on restriction, even if instruction count increases (use a tex op OR use a complex fp op...also, limit register usage even if it increases instruction count, as long as the instruction count increase fits the restricted template of additional ops that can be performed). This seems to remain true for the NV35, except its template allows fp processing where the NV30 required integer.
The R300 opportunities depend only on having the opportunities made visible (it needs to know that there is a tex op that it could be doing in the same clock cycle, it needs to know that some ops are scalar and some are vec 3 if that is all that is required).
The R300's specific opportunities allow the NV30 opportunities to be similarly visible to its template (the scalar op would allow optimizing register usage, and it can also try to reschedule tex ops if the instructions fit its further template requirements), but not vice versa (extra instruction count to reduce register usage could decrease its performance for cases where it would not for the NV30, and the low level optimizer would have to spend significant resources to analyze for that, which would introduce more CPU speed and software dependency). This can't be expressed in just counting textures or instructions per clock alone.
The day for hasty examples;
It is more like adding some hypothetical characteristics to your TMU comparison besides the textures per clock...have a 4 by 2 TMU architecture where each TMU can bilinear sample in 2 clocks or can be used together to produce one trilinear filtered sample in 2 clocks, and an 8 by 1 TMU architecture where that TMU can trilinear or bilinear sample in 2 clocks. It's not just the 1 texture case where the 1 TMU architecture might show advantage, but any number of texture applications requiring trilinear filtering.
As far as visibility to each other's template: You could write an application to manually perform trilinear filtering by asking for 2 specific bilinear filtered samples, instead of asking for one trilinear filtered sample.
I'm not proposing that this TMU example necessarily reflect any actual architecture, or even that it completely maps to NV30 versus R300 shader functionality, just that it is more like it than just counting TMUs and texture application...please forgive the sloppiness of it in that regard.
Luminescent
15-May-2003, 15:57
For clarification, when (or if) I said operations in the past, I meant instructions ( :oops: sorry about that; instructions is the term MDolenc used and the one which logically fits).
Then, hypothetically, if a control unit issues 1 fmad instruction to a 128-bit fpu, and the fpu can operate on 4 components (simultaneously) with that 1 command (let us say, in one cycle), we can then affirm that it is effectively capable of 4 muls plus 4 adds in one cycle, giving 8 opc (8 operations per cycle, or 8 flopc to be more specific). Note: the term "opc" is not conventionally recognized.
This in mind, demalion, when I mentioned 12 ops per clock for the NV35, I meant 12 instructions per clock ability. It is to say, the processor is capable of 12 shader only instructions per clock (fp) in its pixel shader ( I believe you compared NV35's total shader ability with R300's total texturing and shading ability), if NV35 has 12 fpus and the control unit sends out 12 128-bit instructions, each executing in 1 cycle.
In NV35, these instructions are vector (according to thepkrl's NV30 analysis), so 4 operands are operated on for every type of fp shader instruction (although there probably is a special functions unit for functions which require table lookups and such). R350, on the other hand, can issue two instructions per fp shader (a 4 component vector op and a scalar of which there are 8 units of each); so it has a maximum potential throughput of 16 instructions per clock (fp shader only).
Then, we are looking at 1 architecture capable of 12 (4-component) vector ops and 1 capable of 8 vector ops + 8 scalar ops. This is, however, beside the main intent, which is to resolve the following: what is the difference (pros and cons) between a 4 pipeline processor with 2 fp shaders in each pipeline (assuming both have equal vector and scalar execution abilities) and one with 8 pipelines containing 1 fp shader in each? This is what I was intending to show with the analogy, but as Demalion pointed out, the comparison was not as accurate as it could be.
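To keep instructions, component operations and flops apart, here is a small counting sketch. It takes the unit counts exactly as stated in this post (12 four-component units for NV35, 8 four-component plus 8 scalar units for R350), which are themselves speculation; `per_clock` is a hypothetical helper, and every instruction is assumed to be a mad so that each component contributes one multiply and one add.

```python
FLOPS_PER_MAD_COMPONENT = 2  # one multiply plus one add


def per_clock(vector_units: int, scalar_units: int, vector_width: int = 4) -> tuple[int, int, int]:
    """Return (instructions, component ops, flops) issued per clock if every unit runs a mad."""
    instructions = vector_units + scalar_units
    component_ops = vector_units * vector_width + scalar_units
    flops = component_ops * FLOPS_PER_MAD_COMPONENT
    return instructions, component_ops, flops


if __name__ == "__main__":
    print("single 128-bit fpu:", per_clock(1, 0))
    print("NV35 as described: ", per_clock(12, 0))
    print("R350 as described: ", per_clock(8, 8))
```

The single-fpu case gives the 8 flops per cycle mentioned above. Counted this way, the described R350 issues more instructions per clock (16 against 12) while the described NV35 touches more components, which is why the unit of counting matters in this comparison.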
I believe answering this would facilitate our understanding of the NV35's performance in pixel shader benchmarks, alongside this important tidbit which explains the possible penalties of register usage with shaders (in NV30, which contains 4 128-bit fp shaders):
thepkrl:
Register usage is the key to performance, as has been mentioned earlier. For maximum performance, it seems you can only use 2 FP32-registers or 4 FP16-registers. Every two new registers slow down things, and going over 8 regs slows even more:
4.2 cyc/pix: 1reg (2 movs, 16 adds)
4.5 cyc/pix: 2reg (2 movs, 16 adds)
5.8 cyc/pix: 3reg (2 movs, 16 adds)
5.5 cyc/pix: 4reg (2 movs, 16 adds)
7.5 cyc/pix: 5reg (2 movs, 16 adds)
7.1 cyc/pix: 6reg (2 movs, 16 adds)
9.9 cyc/pix: 7reg (2 movs, 16 adds)
9.9 cyc/pix: 8reg (2 movs, 16 adds)
15.0 cyc/pix: 9reg (2 movs, 16 adds)
In the above test the N registers are used in order. If the register usage order is very mixed, performance seems to drop even more. This suggest there are about 2-4 real registers for each pixel in flight (depending if output register is counted or if extra temporaries are reserved). If more registers are used, data is moved between active registers and some slower memory buffer, which adds extra instructions.
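The figures quoted from thepkrl can be turned into relative slowdowns with a few lines. This is only a sketch over the numbers listed above: using the 1-register result as the baseline and dividing by all 18 instructions (2 movs plus 16 adds) are choices made here, not something the quoted test states.

```python
# Cycles per pixel from thepkrl's test, keyed by number of registers used.
CYCLES_PER_PIXEL = {1: 4.2, 2: 4.5, 3: 5.8, 4: 5.5, 5: 7.5, 6: 7.1, 7: 9.9, 8: 9.9, 9: 15.0}
INSTRUCTIONS_PER_PIXEL = 2 + 16  # 2 movs + 16 adds in every run


def slowdown_vs_baseline(table: dict[int, float], baseline_regs: int = 1) -> dict[int, float]:
    """Cycles per pixel for each register count, relative to the baseline register count."""
    baseline = table[baseline_regs]
    return {regs: cycles / baseline for regs, cycles in table.items()}


if __name__ == "__main__":
    for regs, factor in slowdown_vs_baseline(CYCLES_PER_PIXEL).items():
        per_instruction = CYCLES_PER_PIXEL[regs] / INSTRUCTIONS_PER_PIXEL
        print(f"{regs} reg(s): {factor:.2f}x baseline, {per_instruction:.2f} cyc/instruction")
```

Read this way, the cost rises in steps of roughly two registers, as the quote says, and the 9-register run is well over three times slower than the 1-register run.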
demalion
17-May-2003, 21:03
Hmm, I was considering texture ops as instructions, like "texld". If the improvements consist of the register combiners being upgraded to floating point, as I understood that comment to indicate, keep in mind that the NV3x architecture seems to be a "PS 2.0+" OR texture op architecture (the other ops are a restricted set). Anyways, if it can do 8 texture loads per clock usefully for the shader execution, my peak for it should have been 16.
If you mean to just consider arithmetic ops, it seems you need to introduce the qualifier of "no texture ops" in your peak figure of twelve to reflect contrast properly. I think that with floating point register combiners, the NV35's special case is far more competitive to that of the R3xx, but depends on limited texture usage. Unlike the NV30 with its integer dependency, that dependency is realistic and useful (IMO) on the NV35. To me, FX12 just directly killed the point of longer shaders that allowed this dependency on limited texture usage and calculations for details to be useful at all.
Of course, someone (perhaps with plenty of caffeine and a frustrated penchant for teasing) needs to investigate to see all the things that are fixed, but my current concept of the NV35 is an actual delivery of "better quality pixels, not faster" (or however it goes) with regards to the NV30.
My own thinking on the register usage is an issue of a stack storage system for values, and/or of access limited to the beginning and end of the pipeline, rather than to all of the "in flight" stages (i.e., register usage is simply exposing latency). This seems to make sense with the idea of constants for the architecture, and comments from thepkrl about how some register MOVs are free and some aren't. You should probably consider this unsupported and wild speculation, though. :P
Actually, typing that gives me deja vu, like there was a hypothetical discussion about that very idea. I checked Zephyr's R300/NV30 article discussion briefly, but did not see something that seemed to be what I'm thinking of. I really could swear there was something related to that type of idea that someone else proposed last year, but my searching success rate here is pretty poor. :-?
Luminescent
18-May-2003, 14:14
With texture ops included, the NV35 is only capable of 8 (128-bit) fp ops per clock. Remember, MDolenc stated the NV35 was capable of 12 ops if only fp shaders are used and 8 fp ops, plus two textures per pipeline, if texture fetches are included. So the peak, full precision fp, arithmetic-shader op performance of NV35 should be 12 ops per clock, not 16.
Analyzing the pdf (http://developer.nvidia.com/docs/IO/1310/ATT/AUserProgrammableVertexEngine.pdf) document on the NV2x architecture's vertex shader and its internal diagrams leads me to speculate it is very similar in functionality to the NV3x's fp pixel shader pipelines. I remember Nvidia stating in a CineFX document that the pixel shader would receive the abilities of the previous generation's vertex shader. Taking a look at the NV2x's vertex shader should give us some insight about the NV3x's fp pixel shader and the information below should show why.
According to the pdf document:
-The vertex pipelines consist of a pipelined vector core composed of a simd vector unit and a special function fp unit. Each component is computed with 32-bit, fp precision.
-All instructions have a perceived single cycle execution rate (and only 1 instruction per clock is sent to the shader unit); scalar instructions are replicated across all four vector components.
These facts seem to hold true for NV3x's pixel pipelines; you can verify them here (http://www.beyond3d.com/forum/viewtopic.php?p=99900#99900) and here (http://oss.sgi.com/projects/ogl-sample/registry/NV/fragment_program.txt).
(note: I believe rsq instructions and lrp require a bit more than one instruction (3 or 4), so they should have more than a 1 cycle latency within the architecture).
That is why I have decided to take a look at these vertex pipelines a little more closely, so we'll have an idea of what exactly may compose these elusive "floating-point pixel shaders".
Here is a diagram of the NV2x's vertex pipeline internals, which should be very similar to NV3x's pixel pipeline internals:
http://bbs.gzeasy.com/uploads/post-1-1043238922.jpg
In an extremetech interview (found here (http://www.extremetech.com/article2/0,3973,36878,00.asp)), we find the data which gives meaning to the diagram above and the reason why each instruction (consisting of 1 or 2 operations) has a perceived single cycle execution time:
"Each vertex engine can simultaneously process three vertices, and the workload is divvied up such that the free vertex processor takes the next incoming chunk of vertex processing work. So there can be six vertices in flight within the two vertex pipelines, and during every clock cycle, each vertex engine performs one instruction on each vertex. According to Kirk, "The pipeline stages are as deep as the slowest operation. The architecture is designed to deliver single-cycle performance for all of the instructions, so latency is effectively hidden. A divide takes more than three instructions, but the latency is hidden, so it appears to take one cycle." In fact, every vertex shader instruction now has a perceived single-cycle execution time. Several pipeline stages were added to hide latency of more complex operations, and one example Kirk gave was when doing a divide, reading back the result and doing another operation."
By looking at the diagram of a single vertex shader (which may closely resemble a pixel shader of the NV3x) we see why the shader can have 3 vertices in flight. It has three units, which are probably pipelined. Since the pipelines are deep enough to facilitate single cycle execution of something like a divide, it makes sense that there is an inverse logic unit, an Alu, and an Mlu (I believe the mlu and alu compose the simd vector core and the ilu composes the special floating-point core). In a hypothetical scenario, such as the divide example David Kirk gave, it makes sense that the Ilu would take the reciprocal of one operand in an operation and the Mlu and Alu would take care of the madding (if there is such a term to express the pipeline's multiply-add of an fp multiplication) of the Ilu's result and the other operand. All this would be done ("effectively") in a single cycle, using the concept of pipelining. The NV3x pixel shading unit most probably consists of the same units, but offers extra abilities such as ddx/ddy, free conditional updates, and variable precision.
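The "perceived single cycle" idea can be illustrated with a toy timing model. Everything here is an assumption for illustration: a 3-stage pipe that accepts one instruction per clock, each vertex running a chain of dependent instructions, and round-robin issue across the vertices in flight. It does not describe the real NV2x scheduler.

```python
def total_clocks(chain_length: int, pipeline_depth: int, vertices_in_flight: int) -> int:
    """Clocks until every vertex has finished a chain of dependent instructions.

    A vertex may issue its next instruction only after its previous one has
    left the pipe, so each round of issues lasts max(vertices, depth) clocks.
    """
    round_period = max(vertices_in_flight, pipeline_depth)
    last_issue = (chain_length - 1) * round_period + (vertices_in_flight - 1)
    return last_issue + pipeline_depth


def clocks_per_instruction(chain_length: int, pipeline_depth: int, vertices_in_flight: int) -> float:
    """Average clocks spent per instruction across all vertices in flight."""
    issued = chain_length * vertices_in_flight
    return total_clocks(chain_length, pipeline_depth, vertices_in_flight) / issued


if __name__ == "__main__":
    for vertices in (1, 2, 3):
        cpi = clocks_per_instruction(chain_length=10, pipeline_depth=3, vertices_in_flight=vertices)
        print(f"{vertices} vertex/vertices in flight: {cpi:.2f} clocks per instruction")
```

In this model a lone vertex pays the full 3-clock latency on every dependent instruction, while three interleaved vertices keep the pipe full and the average cost approaches one clock per instruction, which is the effect the interview describes.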
Now we know what the pipeline might have in store for the CineFX pixel shader. I'm wondering if all the fp shading pipelines of the NV35 are fully loaded with the simd vector core (Mlu+Alu) and the special floating point ops core (Ilu) (to have comparable performance with R3xx, it should). Unlike R3xx's pixel simd and scalar cores, I do not think the NV2x's vector simd core and special fp ops core are able to function simultaneously, per clock cycle.
Dave Baumann
18-May-2003, 15:16
At the moment, given the Rightmark scores, the low difference in transistor count and the tour of the NVIDIA offices I had you can say I'm a little skeptical as to NV35's.
One of the areas they showed us was the actual silicon verification labs, where they go around finding the issues and potential resolutions with new silicon. The guy who runs this lab was saying that obviously NV30 was a very difficult bring up and they spent lots of time with it. However, he also stated that NV35, conversely, was easy as it was so close to NV30 in the first place.
I still can't see that if you have 3 times the float power then that wouldn't somehow translate into more performance in a new shader benchmark. I'd like to see a few more new shader benchmarks for the NV35 preview here :)
LeStoffer
18-May-2003, 16:37
I still can't see that if you have 3 times the float power then that wouldn't somehow translate into more performance in a new shader benchmark. I'd like to see a few more new shader benchmarks for the NV35 preview here :)
Yes, the differences between NV30 and NV35 are less than stellar except in Pixel Shader 7:
http://www.digit-life.com/articles2/gffx/5900u.html
Pixel Shader 7 uses many more texture samples and samples some data out of 3D textures, according to ixbt. They argue that the difference has something to do with the greater bandwidth of the NV35. While that may well be true, I would point out that the difference could be due to a change to the NV35 so the FP shaders aren't sharing logic with the FP texturing unit.
BTW: It is interesting to note that the jump in performance in ShaderMark and 3dMark03 (PS2 test) isn't really reflected in Rightmark. Better drivers as promised?
Dunno, so I'm looking forward to your review with a host of shader investigations! 8)
demalion
18-May-2003, 17:08
With texture ops included, the NV35 is only capable of 8 (128-bit) fp ops per clock. Remember, MDolenc stated the NV35 was capable of 12 ops if only fp shaders are used and 8 fp ops, plus two textures per pipeline, if texture fetches are included. So the peak, full precision fp, arithmetic-shader op performance of NV35 should be 12 ops per clock, not 16.
Ok, I'm missing how that contradicts, but maybe because I've forgotten what he originally said.
I understand 12 instructions when counting arithmetic ops, and the 8 texture ops are precluded.
I am assuming that the 8 tex ops (even if they are restricted to PS 1.3 texture load usage) do not preclude the 8 register combiner ops (2 per pipe when 4 pipe) newly allowed to be floating point (for my current understanding of NV35). That's why I was offering the correction of "16 ops", inclusive of texture and arithmetic ops to match the R3xx's peak that I quoted, as an alternative to saying 12 arithmetic ops and necessitating saying that texture ops were precluded for that to occur to contrast it with the R3xx. Is it just a matter of my viewing it as (2 tex ops / 1 fp op) + (2 fp ops) per pipe when 4 pipes, when I should be viewing the nv35 as (2 tex ops / 2 fp ops) + (1 op, maybe)? I'd thought MDolenc's comment had been edited, but if that info is indicated in something remaining I'll try to find it.
For instance, I'm not currently under the impression that it is established that the NV35 can't be 8x1 for PS 1.3 shading (but maybe at floating point precision), which would be 16 ops per clock peak if you count texture load as an op as well.
The vertex pipeline picture presents interesting information, but I think branching and register control functionality are the key to the performance characteristics and aren't represented in any detail (that I can discern) in it. However, I haven't followed the detail in the information you've linked to yet, and it looks like there will be a wealth of information there on that. Perhaps that's where I'll find the reason you propose my 16 op correction is incorrect. :P
And speak of the caffeine addict who likes teasing, he seems to be up and around...
Remember the register usage problems. Wouldn't be surprised if nVidia optimized ShaderMark to inflate their scores there. Although if all they did was changing the shaders to give the same result but with less register usage, I'd hardly call that cheating. I guess without clear facts, it's hard to say what they did though...
Uttar
demalion
18-May-2003, 18:26
I still can't see that if you have 3 times the float power then that wouldn't somehow translate into more performance in a new shader benchmark. I'd like to see a few more new shader benchmarks for the NV35 preview here :)
Yes, the differences between NV30 and NV35 are less than stellar except in Pixel Shader 7:
http://www.digit-life.com/articles2/gffx/5900u.html
Pixel Shader 7 uses many more texture samples and samples some data out of 3D textures, according to ixbt. They argue that the difference has something to do with the greater bandwidth of the NV35. While that may well be true, I would point out that the difference could be due to a change to the NV35 so the FP shaders aren't sharing logic with the FP texturing unit.
The type of lead for the NV35 does seem to be explained by the NV30 simply using FX12 and both architectures stalling on texture dependency for further calculations...the 6 and 7 are frontloaded with texture loads to which calculations are then applied. With the nV35's bandwidth allowing it to overcome the processing clock speed deficit, I'd expect a larger lead if they were truly decoupled...it should offer a significant performance boost.
6 and 7 seem to be the procedural marble and fire shaders...if so, this should be the applicable code:
texld r0, t0, s0
mul r7.w, c0.x, r0.x
texld r2, t1, s0
mad r4.w, c0.y, r2.x, r7.w
texld r11, t2, s0
mad r1.w, c0.z, r11.x, r4.w
texld r8, t3, s0
mad r10.w, c0.w, r8.x, r1.w
mul r5.w, c2.x, r10.w
mad r7.w, c1.x, t0.x, r5.w
mad r9.w, r7.w, c4.x, c4.w
frc r4.w, r9.w
mad r6.w, r4.w, c4.y, c4.z
mul r1.w, r6.w, r6.w
mad r3.w, r1.w, c5.x, c5.w
mad r5.w, r1.w, r3.w, c5.y
mad r7.w, r1.w, r5.w, c5.z
mad r9.w, r1.w, r7.w, c3.x
mad r11.w, r1.w, r9.w, c3.w
mov r3.xy, r11.w
texld r6, r3, s1
mov oC0, r6
(Declarations omitted for brevity).
I think the difference would be significantly greater if 1) the nv30 actually did restrict itself to floating point processing for all instructions 2) the nv35 did have texture ops completely decoupled, given the code. With 2), I think the nv35 would be closer to the 9800 than it was, as well (119.6 versus 197.4 fps), but possible register usage issues do make that a bit harder to evaluate.
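Since register usage keeps coming up, a quick tally of the listing above may help. The sketch below only counts what is visible in the posted assembly: texture loads, arithmetic instructions, and distinct temporary register names. The `summarize` helper is hypothetical, and the number of distinct names is not necessarily the number of registers the driver ends up allocating, since a compiler can reuse registers.

```python
import re

# The ps_2_0 listing as posted above (declarations omitted there as well).
SHADER = """\
texld r0, t0, s0
mul r7.w, c0.x, r0.x
texld r2, t1, s0
mad r4.w, c0.y, r2.x, r7.w
texld r11, t2, s0
mad r1.w, c0.z, r11.x, r4.w
texld r8, t3, s0
mad r10.w, c0.w, r8.x, r1.w
mul r5.w, c2.x, r10.w
mad r7.w, c1.x, t0.x, r5.w
mad r9.w, r7.w, c4.x, c4.w
frc r4.w, r9.w
mad r6.w, r4.w, c4.y, c4.z
mul r1.w, r6.w, r6.w
mad r3.w, r1.w, c5.x, c5.w
mad r5.w, r1.w, r3.w, c5.y
mad r7.w, r1.w, r5.w, c5.z
mad r9.w, r1.w, r7.w, c3.x
mad r11.w, r1.w, r9.w, c3.w
mov r3.xy, r11.w
texld r6, r3, s1
mov oC0, r6
"""


def summarize(listing: str) -> tuple[int, int, int]:
    """Return (texture loads, other instructions, distinct temp register names)."""
    tex_ops = 0
    other_ops = 0
    temps = set()
    for line in listing.strip().splitlines():
        opcode, operands = line.split(None, 1)
        if opcode == "texld":
            tex_ops += 1
        else:
            other_ops += 1
        # Temporaries are named r0, r1, ...; swizzles such as ".w" are ignored.
        temps.update(re.findall(r"\br\d+", operands))
    return tex_ops, other_ops, len(temps)


if __name__ == "__main__":
    tex, other, temp_count = summarize(SHADER)
    print(f"texld: {tex}, other instructions: {other}, distinct temps: {temp_count}")
```

As written, the listing names far more temporaries than the 2 FP32 registers thepkrl measured as the fast case, so how much of that the driver can fold back down is an open question.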
BTW: It is interesting to note that the jump in performance in ShaderMark and 3dMark03 (PS2 test) isn't really reflected in Rightmark. Better drivers as promised?
It is unclear whether Rightmark 3D actually fully depends on floating point precision for the pixel filling test output quality requirements, and therefore how visible the NV30 using FX 12 would be, and no screenshots were provided for comparison.
We also have no screenshots for comparison for any of the results in question, that I know of, for the new drivers.
Dunno, so I'm looking forward to your review with a host of shader investigations! 8)
Who foots the bill for Wavey's hot beverages? :P
LeStoffer
18-May-2003, 19:03
The type of lead for the NV35 does seem to be explained by the NV30 simply using FX12 and both architectures stalling on texture dependency for further calculations...the 6 and 7 are frontloaded with texture loads to which calculations are then applied. With the nV35's bandwidth allowing it to overcome the processing clock speed deficit, I'd expect a larger lead if they were truly decoupled...it should offer a significant performance boost.
So you suspect/suggest that NV35 is only using pure FP calculations while the NV30 is using FX12, but that NV35 is still being handicapped by having to share some of the FP logic with its FP texture units?
Maybe, just note that the readme only states that you can change between FP16/FP32 (not FX12). Anyway, I feel we need more evidence to solve this case.
I think the difference would be significantly greater if 1) the nv30 actually did restrict itself to floating point processing for all instructions 2) the nv35 did have texture ops completely decoupled, given the code. With 2), I think the nv35 would be closer to the 9800 than it was, as well (119.6 versus 197.4 fps), but possible register usage issues do make that a bit harder to evaluate.
Good point.
BTW: Yes, we should start to collect some money to pay for Wavey's hot beverages! 8)
Luminescent
19-May-2003, 04:13
Demalion, I see where I misunderstood you. I thought you were evaluating maximum fp shader throughput, which is not attainable when textures are involved. If you include texture ops, you lose 1 fp shader op per pipeline but you gain 2 texture ops, so you are right when you say 16 possible ops per clock.
Now, as for that pdf, it only elaborates upon some of the internals of the NV2x vertex shader, restrictions, and capabilities. Reading that pdf pointed out, to me at least, many similarities in functionality and performance between a NV2x vertex shader's performance and NV3x's fp pixel shader's (which is why I posted it). Here are some close similarities which lead me to my conclusion:
-The vertex shader cannot branch, only evaluate conditional calls
-It executes most instructions in one cycle, with scalar and vector ops requiring roughly the same time.
-The vertex pipeline executes only 1 instruction per cycle (per pipeline), no vector + scalar pairing.
All in all, I saw many similarities between the two, and I thought including a detailed pdf would prove this point. Knowing this, I wanted to post the microcode diagram to give insight to those who have never really seen what goes on in these units. As you can tell from other posts in a variety of threads, hardware architecture is one of my favorite facets of 3D tech.
Hope that cleared things up.
Luminescent
19-May-2003, 04:46
As to whether the shader benchmarks in Digit-life's review seem reflective of the NV35's supposed fp shader performance - at first glance, they seem to refute the 12 ops per clock capability. However, considering the penalty NV35 pays for more than 2 registers at fp32 precision, it wouldn't surprise me if registers were a big reason for Digit Life's Rightmark results.
If we can find a pattern between the different tests, like a correlation between the results and the amount of registers used or the amount of texture ops compared to arithmetic ops, we should be able to effectively evaluate NV35's performance with respect to R350, which would help Wavey to asses (if that's how you spell it) this in his review, unlike any of the other reviewers out there.
For example, let us say in test 1 there are a series of light attenuation instructions and one of them is a dp3 (which should have a 1 cycle execution latency, per pipeline, with a max of 2 registers on NV35). If 3 registers are used instead of 2, then according to thepkrl's numbers the instruction will take 1.45 cycles as opposed to 1 cycle.
Note: For those who are skeptical, this is how the numbers added up:
Here (http://www.beyond3d.com/forum/viewtopic.php?p=116135#116135) it says that for 16 adds (or 16 1 cycle ops, for the NV3x) and the use of 3 registers, the NV30 takes 5.8 cycles. Since the NV30 has 4 pipelines and each add instruction takes 1 cycle, the performance should be 4 cycles. This means that per pipeline each instruction is taking 5.8/4 cycles, yielding 1.45 cycles (almost 50% more time for using 3 versus 2 or 1 registers).
Can someone count the registers used per instruction and see if, on average, they exceed 2 (which is the mark at which performance penalties are supposed to kick in, compared to 1 or 2 registers)? Most arithmetic can be done in 1 cycle (with NV3x), so losing out to R350 that badly on instructions like dp3's, mads, etc. (not rsq), which should be cake for NV35 and 12 fp units, indicates to me register usage might be to blame (even with textures enabled, NV35's raw fp shader performance should be on par with if not above R350's, considering clockspeed).
A careful assessment of the shader benchmarks vs. the results should give us a better idea of the detrimental performance impact which registers can have on NV3x, as opposed to R3xx. I say NV3x because MDolenc affirmed NV35 also has the register performance drawbacks of the NV30.
Dave Baumann
19-May-2003, 12:06
I ran both Ilfirin's and MDolenc's Shader benchmarks last night on both NV30 and NV35 and NV35 did display an improved performance in both, however this was not a 2X performance increase by any means but a performance delta (+25% or so at a guess, don't have the numbers in front of me now). Now, I'm wondering about the temp registers as well. NV30 reported a very odd number of registers through DX Caps and I wonder if the drivers were using a number of them to assist in a number of workarounds for some buggy hardware - I'll have a check tonight to see if NV35 reports a different number than NV30 did.
LeStoffer
19-May-2003, 12:19
Dave, interesting. Keep up the investigations, they are highly appreciated. 8)
Luminescent
19-May-2003, 12:23
Very nice, Dave; keep up the good work. :D
LeStoffer
19-May-2003, 15:29
Regarding Rightmark's shader test, here's what I got with a Radeon 9700 Pro:
Benchmark "RightMark 3D: Pixel Shading"
Test Time: 10.00
Width: 1024
Height: 768
Window: OFF
Shader: 1
Shader Profile: Pixel Shader 2.0
FPS: 231.93
Shader: 4
Shader Profile: Pixel Shader 2.X (16fp)
FPS: 102.94
Shader: 2
Shader Profile: Pixel Shader 2.X (32fp)
FPS: 131.39
Shader: 5
Shader Profile: Pixel Shader 2.0
FPS: 55.11
Shader: 3
Shader Profile: Pixel Shader 2.X (16fp)
FPS: 90.76
Shader: 7
Shader Profile: Pixel Shader 2.X (32fp)
FPS: 162.25
Dave Baumann
20-May-2003, 02:46
Nope, same number of PS temps in both NV30 and NV35: 28.
LeStoffer, where is the download for Rightmark? Cheers.
Luminescent
20-May-2003, 03:44
The download links for all Rightmark tests are found here (http://www.digit-life.com/articles2/gffx/5900u.html), at Digit-Life.
Specifically they are:
FillingRate (http://www.digit-life.com/rm3d/DX9Synth/FillingRate.zip)
GeometryProcessing Speed (http://www.digit-life.com/rm3d/DX9Synth/GeometryProcessingSpeed.zip)
HSR (http://www.digit-life.com/rm3d/DX9Synth/HSR.zip)
PixelShaders (http://www.digit-life.com/rm3d/DX9Synth/PixelShaders.zip)
PointSprites (http://www.digit-life.com/rm3d/DX9Synth/PointSprites.zip)
demalion
20-May-2003, 14:51
That's out of date. www.rightmark.org has the latest Direct3D benchmark somewhere, and "UncleSam" posted a link recently (a search on his user name should turn it up).
I think all links, to the "Cg Rightmark3D" and "Direct3D Rightmark3D" both, can be found in the recent Rightmark 3D thread. I'm not sure if the one on the rightmark.org download page is the most recent, but the most recent one that I know of can be found through one of the aforementioned links.