NV35 pipeline organization

Arun

Hey everyone,

Just thought I should start a new thread about this, since it might become a fairly big subject. Didn't see any yet.

The NV30 has 1 FP/TEX unit and 2 FX units per pipe.
There are two serious possibilities for what nVidia could have done, since they kept their 4 pipes, plus a third, much less serious one:

- 1FP/TEX unit, 1 true FP unit and 1 FX unit/pipe
- 2FP/TEX units and 1 FX unit, with the FP/TEX units only being able to do 1 independent fetch/clock instead of 2.
- 1FP unit, 2 TEX units, 2 FX units with a ridiculous amount of cheating.

My guess is actually number two.
With the old configuration, there was some sharing between FP and TEX, but TEX could do 8/clock, so I guess it needed quite a few additional transistors too. With option two, you wouldn't need as many additional transistors for texturing, and the whole design thus becomes possible at 130M transistors with other overall optimizations.


Any feedback, comments, ideas?


Uttar

Uttar
 
I'm thinking along the same lines, but I haven't seen any info or benchmark that gives any hint at what has been changed. Option number two does look promising, but right now I feel clueless. Sorry.
 
Yeah, we do have very very little info about it ( still more than about the NV40, though, hehe! )
The most we have is:
http://www.hardocp.com/article.html?art=NDcyLDEy

It seems 100% obvious nVidia is capable of 2FP/clock, but gets lower efficiency ( register usage performance hits remain I guess, although they might have been lowered, who knows ) in most cases. The cases where it wins would likely be those where it benefits from its bigger native instruction set.

This would slightly increase efficiency with FX too, because you could do 1FX/FP and 1TEX op in parallel, instead of always having to do 2TEX ops to get max efficiency.

So now, the NV35 is a lot nearer to an 8x1 than the NV30, even though it's still practically a 4-pipeline architecture. Funny, eh?


Uttar

EDIT: For Joe: FP/TEX: unit that can do either FP or TEX ops, not in parallel. In the case of the NV30, you could do either 4 FP ops or 8 TEX ops.
True FP: unit that can do FP ops in 1 clock, no sharing with texturing.
FX: unit that can do FX ops in 1 clock.
 
Uttar said:
EDIT: For Joe: FP/TEX: unit that can do either FP or TEX ops, not in parallel. In the case of the NV30, you could do either 4 FP ops or 8 TEX ops.

OK, but I'm confused because you listed NV30 as "1FP/TEX unit and 2FX units/pipe". Doesn't that indicate only 1 texture operation/read per pipe total? (Doesn't NV30 have the ability to do two?) Is the TEX unit more analogous to the traditional TMU, or is the FX unit? I'm not clear on what the purpose of the "FX" unit is....
 
Well, based on the NV30 pipeline threads, I think it was finally agreed that there was a unit which could do either 1FP op/clock/pipe or 2TEX ops/clock/pipe
Or at least, that's the practical POV. There are obviously some dedicated transistors for each type of operation, but much of it is probably shared.

My idea is that with the NV35, it's 1FP op/clock/pipe or 1TEX op/clock/pipe for 2 FP/TEX units.

The FX unit is obviously the integer unit, for INT12 operations.


Uttar
 
OK, I think we're on the same wavelength now. ;)

Options 2 and 3 really seem like the only feasible possibilities to me. It might actually be somewhat of a combination of the two.

I think the only way to really ascertain what's going on is to have both the 5800 and 5900 side by side and run through several pixel shading tests...with several sets of drivers.
 
I think it's either the first or the second variant. But I tend to believe the first. That would mean either 8xFP or 8xTex + 4xFP per clock, which IMO best explains why the FX5900 is close to R9800Pro, but rarely surpasses it in 2.0 shaders although the FX is clocked higher.
 
Xmas said:
I think it's either the first or the second variant. But I tend to believe the first. That would mean either 8xFP or 8xTex + 4xFP per clock, which IMO best explains why the FX5900 is close to R9800Pro, but rarely surpasses it in 2.0 shaders although the FX is clocked higher.

Actually, that's the second variant ;) The first is ( 4xFP or 8x Tex ) + 4xFP + 4xFX
There's two serious variants, and the third which is really much more of a paranoid dream.


Uttar
 
MDolenc said:
I actually got reply on this from NVidia 2 hours ago. ;)

That would mean then, that unlike the NV30, the NV35 should be able to run the ARB2 path of Doom3 at the same speed as the NV30 path. So the "default" path for NV35 should, like R3xx, be ARB2, where the default path of NV30 will be NV30....correct?
 
Uttar said:
Xmas said:
I think it's either the first or the second variant. But I tend to believe the first. That would mean either 8xFP or 8xTex + 4xFP per clock, which IMO best explains why the FX5900 is close to R9800Pro, but rarely surpasses it in 2.0 shaders although the FX is clocked higher.

Actually, that's the second variant ;) The first is ( 4xFP or 8x Tex ) + 4xFP + 4xFX
There's two serious variants, and the third which is really much more of a paranoid dream.
(4xFP or 8xTex) + 4xFP is equal to 8xFP or (8xTex + 4xFP). I left out the FX units.

The second variant would be 8xFP or (4xTex + 4xFP) or 8xTex
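
To make the bookkeeping explicit, here is a minimal Python sketch that enumerates the peak per-clock mixes for the two variants as described in this thread. The unit capabilities are the thread's guesses, not confirmed hardware specs; the FX units are left out, and the names (`VARIANT_1`, `VARIANT_2`, `peak_modes`) are made up for illustration. It also assumes all 4 pipes issue the same mix in a given clock.

```python
from itertools import product

PIPES = 4  # both variants keep 4 pipes

# Each unit is listed as the alternatives it could issue in one clock,
# written as (fp_ops, tex_ops). These are guesses from the discussion.
VARIANT_1 = [
    [(1, 0), (0, 2)],  # shared FP/TEX unit: 1 FP op or 2 TEX ops
    [(1, 0)],          # dedicated FP unit: 1 FP op
]
VARIANT_2 = [
    [(1, 0), (0, 1)],  # FP/TEX unit: 1 FP op or 1 TEX op
    [(1, 0), (0, 1)],  # second, identical FP/TEX unit
]

def peak_modes(units, pipes=PIPES):
    """Return the distinct (fp, tex) totals per clock across all pipes."""
    modes = set()
    for choice in product(*units):  # one alternative picked per unit
        fp = sum(c[0] for c in choice) * pipes
        tex = sum(c[1] for c in choice) * pipes
        modes.add((fp, tex))
    return sorted(modes, reverse=True)

print(peak_modes(VARIANT_1))  # [(8, 0), (4, 8)]
print(peak_modes(VARIANT_2))  # [(8, 0), (4, 4), (0, 8)]
```

Read as (FP, TEX) per clock, the first line is "8xFP or (8xTex + 4xFP)" and the second is "8xFP or (4xTex + 4xFP) or 8xTex".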


MDolenc,
interesting information. If that's true it should be significantly faster than R300 in shaders that use few registers.
 
No it's not :)
The difference all lies in parallelism. It's easier to get parallelism with ( 4x FP or 4x TEX ) x 2 than with (4xFP or 8xTex) + 4xFP

MDolenc: VERY interesting info! That would most definitely justify the "Force FP16" flag nVidia has got MS to put in a future revision of DX9!
That most certainly explains the "12 ops/clock" number from the outdated PR docs I leaked a while back.
Anyway, very nice info. I guess nVidia is gonna have a fair bit of trouble with the new FP16/FP32 switching though. I guess the hit comes when there's switching in the same pass. Funny performance hit, hehe.

Uttar
 
Joe DeFuria said:
MDolenc said:
I actually got reply on this from NVidia 2 hours ago. ;)

That would mean then, that unlike the NV30, the NV35 should be able to run the ARB2 path of Doom3 at the same speed as the NV30 path. So the "default" path for NV35 should, like R3xx, be ARB2, where the default path of NV30 will be NV30....correct?
Possibly. Maybe NV35 would still be faster in Doom3 when using FP16. Then even a modified NV30 path would make sense.
 
Uttar said:
No it's not :)
The difference all lies in parallelism. It's easier to get parallelism with ( 4x FP or 4x TEX ) x 2 than with (4xFP or 8xTex) + 4xFP
True, dependent texture reads are easier with (4xFP or 4xTex) x 2, which is your second variant. But (4xFP or 8xTex) + 4xFP can do more per clock.
 
MDolenc said:
I actually got reply on this from NVidia 2 hours ago. ;)
It actually seems that integer logic is gone from NV35 pixel shaders. It is capable of 3 floating point (and it doesn't care that much about fp16 vs. fp32 either) instructions per pipe per clock (12 floating point instructions per clock total) or 2 floating point instructions + 2 texture look-ups per pipe per clock.

Woah! If this is true - and why not? - it makes the original CineFX look somewhat outdated already. Thus nVidia's claim for CineFX version 2.0. I'm all for going full FP if performance allows (like on R3x0), but I really wonder where this leaves the NV31 and NV34 in the eyes of developer support now that NV30 - and the int12 lead with it - is de facto a dead end. :?
 
Nice thread :)

I don't have any numbers that would let me talk about the NV35 pipeline organization without any (or without too much) doubt. Actually my guess was that NVIDIA had kept the same pipeline as NV30 (including FX units) with one more unit per pipeline: a floating point one or a FP/tex one (or FP/address processor). Going by the HOCP Shadermark results, it seems like there is another change to increase FP shader power. I thought that NVIDIA had doubled the number of registers usable without a performance hit.

But MDolenc's information makes sense too (but isn't it too big a change from NV31-34-30 ?). If it's true I think that it's a pretty nice design. This way, the NV35 has the same theoretical throughput as the Radeon 9800/9700 in the case of 2 texture lookups + 2 FP ops. The NV35 has an advantage when there are more FP ops than texture lookups, but on the other hand it needs more optimised shaders with fewer dependencies.
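
As a rough illustration of that trade-off, here is a small Python sketch that estimates the best-case clocks per pixel on one pipe under the two modes MDolenc reported (3 FP ops per clock, or 2 FP ops + 2 texture lookups per clock). It is only a sketch: the function name `ideal_clocks` is hypothetical, and it assumes perfect scheduling with no dependencies and no register-usage penalties, which a real shader would not get.

```python
import math

def ideal_clocks(fp_ops, tex_ops):
    """Best-case clocks for one pixel on one pipe (idealized, assumption-laden).

    Each clock is either a "TEX clock" (2 FP ops + 2 texture lookups)
    or an "FP clock" (3 FP ops). Dependencies are ignored.
    """
    # Texture lookups can only be issued in TEX clocks, 2 per clock.
    tex_clocks = math.ceil(tex_ops / 2)
    # Those TEX clocks also carry up to 2 FP ops each for free.
    remaining_fp = max(0, fp_ops - 2 * tex_clocks)
    # Whatever FP work is left goes into pure FP clocks, 3 per clock.
    fp_clocks = math.ceil(remaining_fp / 3)
    return tex_clocks + fp_clocks

print(ideal_clocks(2, 2))   # 1: the balanced "2 tex + 2 FP" case
print(ideal_clocks(12, 0))  # 4: pure FP runs at 3 ops per clock
print(ideal_clocks(8, 2))   # 3: FP-heavy shader with a couple of lookups
```

The FP-heavy cases are where the third FP op per clock pays off, but only if the shader has enough independent instructions to fill all the slots.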

If it's true, the only drawback compared to NV30 would be the loss of the double FX multiplication power in the fixed point units (5 FX multiplication ops per cycle possible). Everything else should be faster or a lot faster. One possible question is: are the new FP units able to do every operation? Maybe they can just do simple operations and only the FP/tex unit is able to do every complex operation? (it's just a question I'm asking myself ;) )

The FP16/32 question remains. If NVIDIA has kept the same register access organization, FP16 remains very beneficial as it allows access with no performance drop to 4 registers instead of 2. Using FP16 and FP32 in the same pipeline could be a problem when optimising register usage. So it should be better to use only FP32 or only FP16.
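
One way to read the "4 registers instead of 2" figure is as a fixed storage budget. The Python sketch below assumes each register is a 4-component vector and that the limit is simply a byte budget; both are assumptions made for illustration, not something confirmed here, and `register_bytes` is a made-up helper.

```python
def register_bytes(count, bits_per_component, components=4):
    """Storage taken by `count` registers (4-component vectors assumed)."""
    return count * components * bits_per_component // 8

fp16_budget = register_bytes(4, 16)  # 4 FP16 registers
fp32_budget = register_bytes(2, 32)  # 2 FP32 registers

print(fp16_budget, fp32_budget)  # 32 32
```

Under those assumptions both cases fit the same 32 bytes, which would also explain why mixing FP16 and FP32 registers in one shader makes the budget harder to reason about.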
 
Joe DeFuria said:
MDolenc said:
I actually got reply on this from NVidia 2 hours ago. ;)

That would mean then, that unlike the NV30, the NV35 should be able to run the ARB2 path of Doom3 at the same speed as the NV30 path.

THG said:
Due to a bug, ARB2 currently does not work with NVIDIA's DX9 cards when using the preview version of the Detonator FX driver. According to NVIDIA, ARB2 performance with the final driver should be identical to that of the NV30 code.

MuFu.
 
Woah, I hadn't expected that until NV40. I had no idea the NV30 was that broken. Well, I did, but I dismissed the possibility too soon, it appears.
 
demalion said:
Woah, I hadn't expected that until NV40.

I hadn't expected it either :p I thought that NVIDIA would try to use a pipeline very similar to the NV30/1/4 pipelines to "help" developers make shaders that every NV3x likes.


It's great if NV35 can work properly at full speed with the ARB2 path.
 