Old 13-May-2003, 18:15   #1
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,883
NV35 pipeline organization

Hey everyone,

Just thought I should start a new thread about this, since it might become a fairly big subject. Didn't see any yet.

The NV30 has 1 FP/TEX unit and 2 FX units per pipe.
There are three possible things nVidia could have done, since they kept their 4 pipes:

- 1FP/TEX unit, 1 true FP unit and 1 FX unit/pipe
- 2FP/TEX units and 1 FX unit, with the FP/TEX units only being able to do 1 independent fetch/clock instead of 2.
- 1FP unit, 2 TEX units, 2 FX units with a ridiculous amount of cheating.

My guess is actually number two.
With the old configuration, there was some sharing between FP and TEX, but TEX could do 8/clock, so I guess it needed quite a few additional transistors too. With this option, you wouldn't need as many additional transistors for the texturing, so the whole design becomes possible at 130M transistors with other overall optimizations.


Any feedback, comments, ideas?


Uttar

Old 13-May-2003, 18:19   #2
LeStoffer
Senior Member
 
Join Date: Feb 2002
Location: Somewhere not *that* rotten in Denmark
Posts: 1,197
Default

I'm thinking about the same thing, but I haven't seen any info or benchmark that gives any hint at what has been changed. Option number two does look promising, but right now I feel clueless. Sorry.
__________________
Best regards, LeStoffer
Old 13-May-2003, 18:27   #3
Joe DeFuria
Regular
 
Join Date: Feb 2002
Posts: 5,951
Default

Uttar,

Heh...could you clarify your definition of FP/TEX, FX and "True FP" units?
Old 13-May-2003, 18:42   #4
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,883
Default

Yeah, we do have very little info about it (still more than about the NV40, though, hehe!)
The most we have is:
http://www.hardocp.com/article.html?art=NDcyLDEy

It seems 100% obvious nVidia is now capable of 2FP/clock, but gets lower efficiency in most cases (register usage performance hits remain, I guess, although they might have been lowered, who knows). The cases where it wins would likely be those where it benefits from its bigger native instruction set.

This would slightly increase efficiency with FX too, because you could do 1FX/FP and 1TEX op in parallel, instead of always having to do 2TEX ops to get max efficiency.

So now, the NV35 is a lot nearer to an 8x1 than the NV30, even though it's still practically a 4-pipeline architecture. Funny, eh?


Uttar

EDIT: For Joe: FP/TEX: a unit that can do either FP or TEX ops, not both in parallel. In the case of the NV30, you could do either 4 FP ops or 8 TEX ops per clock.
True FP: a unit that can do FP ops in 1 clock, with no sharing with texturing.
FX: a unit that can do FX ops in 1 clock.
Old 13-May-2003, 19:01   #5
Joe DeFuria
Regular
 
Join Date: Feb 2002
Posts: 5,951
Default

Quote:
Originally Posted by Uttar
EDIT: For Joe: FP/TEX: a unit that can do either FP or TEX ops, not both in parallel. In the case of the NV30, you could do either 4 FP ops or 8 TEX ops per clock.
OK, but I'm confused because you listed NV30 as "1FP/TEX unit and 2FX units/pipe". Doesn't that indicate only 1 texture operation/read per pipe total? (Doesn't NV30 have the ability to do two?) Is the TEX unit more analogous to the traditional TMU, or is the FX unit? I'm not clear on what the purpose of the "FX" unit is....
Old 13-May-2003, 19:18   #6
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,883
Default

Well, based on the NV30 pipeline threads, I think it was finally agreed that there was a unit which could do either 1 FP op/clock/pipe or 2 TEX ops/clock/pipe.
Or at least, that's the practical POV. There are obviously some dedicated transistors for each type of operation, but much of the logic is probably shared.

My idea is that with the NV35, there are 2 FP/TEX units per pipe, and each one does either 1 FP op/clock or 1 TEX op/clock.

The FX unit is obviously the integer unit, for INT12 operations.


Uttar
Old 13-May-2003, 19:37   #7
Joe DeFuria
Regular
 
Join Date: Feb 2002
Posts: 5,951
Default

OK, I think we're on the same wavelength now.

Options 2 and 3 really seem like the only feasible possibilities to me. It might actually be somewhat of a combination of the two.

I think the only way to really ascertain what's going on, is to have both the 5800 and 5900 side by side and run through several pixel shading tests...with several sets of drivers.
Old 13-May-2003, 21:00   #8
Xmas
Off-season
 
Join Date: Feb 2002
Location: On the pursuit of happiness
Posts: 3,019
Default

I think it's either the first or the second variant. But I tend to believe the first. That would mean either 8xFP or 8xTex + 4xFP per clock, which IMO best explains why the FX5900 is close to R9800Pro, but rarely surpasses it in 2.0 shaders although the FX is clocked higher.
Old 13-May-2003, 21:07   #9
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,883
Default

Quote:
Originally Posted by Xmas
I think it's either the first or the second variant. But I tend to believe the first. That would mean either 8xFP or 8xTex + 4xFP per clock, which IMO best explains why the FX5900 is close to R9800Pro, but rarely surpasses it in 2.0 shaders although the FX is clocked higher.
Actually, that's the second variant. The first is (4xFP or 8xTex) + 4xFP + 4xFX.
There are two serious variants, and the third, which is really much more of a paranoid dream.


Uttar
Old 13-May-2003, 21:09   #10
MDolenc
Member
 
Join Date: May 2002
Location: Slovenia
Posts: 420
Default

I actually got reply on this from NVidia 2 hours ago.
Old 13-May-2003, 21:19   #11
Joe DeFuria
Regular
 
Join Date: Feb 2002
Posts: 5,951
Default

Quote:
Originally Posted by MDolenc
I actually got reply on this from NVidia 2 hours ago.
That would mean then, that unlike the NV30, the NV35 should be able to run the ARB2 path of Doom3 at the same speed as the NV30 path. So the "default" path for NV35 should, like R3xx, be ARB2, where the default path of NV30 will be NV30....correct?
Old 13-May-2003, 21:28   #12
Xmas
Off-season
 
Join Date: Feb 2002
Location: On the pursuit of happiness
Posts: 3,019
Default

Quote:
Originally Posted by Uttar
Quote:
Originally Posted by Xmas
I think it's either the first or the second variant. But I tend to believe the first. That would mean either 8xFP or 8xTex + 4xFP per clock, which IMO best explains why the FX5900 is close to R9800Pro, but rarely surpasses it in 2.0 shaders although the FX is clocked higher.
Actually, that's the second variant. The first is (4xFP or 8xTex) + 4xFP + 4xFX.
There are two serious variants, and the third, which is really much more of a paranoid dream.
(4xFP or 8xTex) + 4xFP is equal to 8xFP or (8xTex + 4xFP). I left out the FX units.

The second variant would be 8xFP or (4xTex + 4xFP) or 8xTex
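
To make the two formulas above easier to compare, here is a rough Python sketch that treats each variant as a list of chip-wide (FP ops, TEX ops) issue options per clock and brute-forces the fewest clocks needed for a given mix. The names `VARIANT_1`, `VARIANT_2` and `min_clocks` are made up for illustration, the FX units are left out as above, and the model ignores dependencies and register pressure entirely, so treat it as an idealised upper bound and nothing more.

```python
from itertools import combinations_with_replacement

# Chip-wide issue options per clock as (fp_ops, tex_ops), FX units left out.
VARIANT_1 = [(8, 0), (4, 8)]          # (4xFP or 8xTex) + 4xFP
VARIANT_2 = [(8, 0), (4, 4), (0, 8)]  # (4xFP or 4xTex) x 2


def min_clocks(options, fp_ops, tex_ops):
    """Fewest clocks needed to issue fp_ops and tex_ops across all 4 pipes.

    Idealised: assumes perfect scheduling, no dependent reads and no
    register-usage penalties.
    """
    clocks = 0
    while True:
        # Try every multiset of `clocks` issue options.
        for combo in combinations_with_replacement(options, clocks):
            fp_done = sum(fp for fp, _ in combo)
            tex_done = sum(tex for _, tex in combo)
            if fp_done >= fp_ops and tex_done >= tex_ops:
                return clocks
        clocks += 1


print(min_clocks(VARIANT_1, 4, 8))  # 1
print(min_clocks(VARIANT_2, 4, 8))  # 2
print(min_clocks(VARIANT_1, 8, 8))  # 2
print(min_clocks(VARIANT_2, 8, 8))  # 2
```

In this toy model the first variant only pulls ahead on texture-heavy mixes; for pure FP work both peak at 8 FP ops per clock.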


MDolenc,
interesting information. If that's true it should be significantly faster than R300 in shaders that use few registers.
Old 13-May-2003, 21:35   #13
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,883
Default

No, it's not.
The difference all lies in parallelism. It's easier to get parallelism with (4xFP or 4xTex) x 2 than with (4xFP or 8xTex) + 4xFP.

MDolenc: VERY interesting info! That would most definitely justify the "Force FP16" flag nVidia has got MS to put in a future revision of DX9!
That most certainly explains the "12 ops/clock" number from the outdated PR docs I leaked a while back.
Anyway, very nice info. I guess nVidia is going to have a fair bit of trouble with the new FP16/FP32 switching though. I guess the hit comes when there's switching in the same pass. Funny performance hit, hehe.

Uttar
Old 13-May-2003, 21:36   #14
Xmas
Off-season
 
Join Date: Feb 2002
Location: On the pursuit of happiness
Posts: 3,019
Default

Quote:
Originally Posted by Joe DeFuria
Quote:
Originally Posted by MDolenc
I actually got reply on this from NVidia 2 hours ago.
That would mean then, that unlike the NV30, the NV35 should be able to run the ARB2 path of Doom3 at the same speed as the NV30 path. So the "default" path for NV35 should, like R3xx, be ARB2, where the default path of NV30 will be NV30....correct?
Possibly. Maybe NV35 would still be faster in Doom3 when using FP16. Then even a modified NV30 path would make sense.
Old 13-May-2003, 21:46   #15
Xmas
Off-season
 
Join Date: Feb 2002
Location: On the pursuit of happiness
Posts: 3,019
Default

Quote:
Originally Posted by Uttar
No, it's not.
The difference all lies in parallelism. It's easier to get parallelism with (4xFP or 4xTex) x 2 than with (4xFP or 8xTex) + 4xFP.
True, dependent texture reads are easier with (4xFP or 4xTex) x 2, which is your second variant. But (4xFP or 8xTex) + 4xFP can do more per clock.
Old 13-May-2003, 21:56   #16
LeStoffer
Senior Member
 
Join Date: Feb 2002
Location: Somewhere not *that* rotten in Denmark
Posts: 1,197
Default

Quote:
Originally Posted by MDolenc
I actually got reply on this from NVidia 2 hours ago.
It actually seems that integer logic is gone from NV35 pixel shaders. It is capable of 3 floating point instructions per pipe per clock (12 floating point instructions per clock total), and it doesn't care that much about fp16 vs. fp32 either, or 2 floating point instructions + 2 texture look-ups per pipe per clock.
Woah! If this is true - and why not? - it makes the original CineFX look somewhat outdated already. Thus nVidia's claim for CineFX version 2.0. I'm all for going full FP if performance allows (like on R3x0), but I really wonder where this leaves the NV31 and NV34 in terms of developer support now that NV30 - and the int12 lead with it - is de facto a dead end. :?
__________________
Best regards, LeStoffer
Old 13-May-2003, 22:07   #17
Tridam
Regular
 
Join Date: Apr 2003
Location: Louvain-la-Neuve, Belgium
Posts: 523
Default

Nice thread

I don't have any numbers that would let me talk about the NV35 pipeline organization without any (or without too much) doubt. Actually my guess was that NVIDIA had kept the same pipeline as NV30 (including FX units) with one more unit per pipeline: a floating point one or an FP/tex one (or FP/address processor). Judging by the HOCP Shadermark results, it seems like there is another change to increase FP shader power. I thought that NVIDIA had doubled the number of registers usable without a performance hit.

But MDolenc's information makes sense too (but isn't it too big a change from NV30/31/34?). If it's true I think that it's a pretty nice design. This way, the NV35 has the same theoretical throughput as the Radeon 9800/9700 in the case of 2 texture lookups + 2 FP ops. The NV35 has an advantage when there are more FP ops than texture lookups, but on the other hand it needs more optimised shaders with fewer dependencies.
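
A quick back-of-the-envelope Python sketch of that comparison, for anyone who wants to play with other op mixes. The NV35 modes are the ones MDolenc relayed (3 FP, or 2 FP + 2 TEX, per pipe per clock, 4 pipes). The R300 side is my own assumption and not something stated in this thread: 8 pipes, each doing 1 FP op and 1 texture lookup per clock. All function names are hypothetical, clock speeds are ignored, and so are dependencies and register limits.

```python
from fractions import Fraction
from math import ceil


def nv35_clocks_per_pixel(fp_ops, tex_ops):
    """Clocks one NV35 pipe needs per pixel, using the two modes quoted above.

    Each clock is either 3 FP ops, or 2 FP ops + 2 texture lookups.
    """
    tex_clocks = ceil(tex_ops / 2)                 # clocks spent in 2FP+2TEX mode
    fp_left = max(0, fp_ops - 2 * tex_clocks)      # FP ops not covered by those clocks
    return tex_clocks + ceil(fp_left / 3)          # the rest run in 3FP mode


def r300_clocks_per_pixel(fp_ops, tex_ops):
    """ASSUMPTION: one R300 pipe co-issues 1 FP op and 1 texture lookup per clock."""
    return max(fp_ops, tex_ops)


def pixels_per_clock(pipes, clocks_per_pixel):
    """Theoretical chip-wide pixel rate; needs at least one op per pixel."""
    return Fraction(pipes, clocks_per_pixel)


for fp, tex in [(2, 2), (3, 0), (12, 2)]:
    nv35 = pixels_per_clock(4, nv35_clocks_per_pixel(fp, tex))
    r300 = pixels_per_clock(8, r300_clocks_per_pixel(fp, tex))
    print(f"{fp} FP + {tex} TEX: NV35 {nv35} px/clk, R300 {r300} px/clk")
```

Under these assumptions the 2 FP + 2 TEX case comes out at 4 pixels per clock on both chips, and the NV35 only moves ahead per clock once the shader is clearly FP-heavy.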

If it's true, the only drawback compared with NV30 would be the loss of the double FX multiplication power in the fixed point units (5 FX multiplication ops per cycle possible). Everything else should be faster or a lot faster. One possible question is: are the new FP units able to do every operation? Maybe they can just do simple operations and only the FP/tex unit is able to do every complex operation? (It's just a question I'm asking myself.)

The FP16/32 question remains. If NVIDIA has kept the same register access organization, FP16 remains very beneficial as it allows access to 4 registers instead of 2 with no performance drop. Using FP16 and FP32 in the same pipeline could be a problem when optimising register usage. So it should be better to use only FP32 or only FP16.
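
The register point can be put as a tiny Python sketch. It assumes the simplest possible packing, where an FP32 register takes two FP16-sized slots and the full-speed budget is 4 slots (i.e. the 2 FP32 or 4 FP16 registers mentioned above). The names `SLOT_COST`, `FULL_SPEED_SLOTS` and `runs_at_full_speed` are invented for illustration, and real hardware may well not pack mixed precisions this neatly, which is exactly the worry about mixing FP16 and FP32.

```python
# Cost of one live temporary register, in FP16-sized slots (assumed packing model).
SLOT_COST = {"fp16": 1, "fp32": 2}

# Full-speed budget: 2 FP32 registers or 4 FP16 registers, as quoted above.
FULL_SPEED_SLOTS = 4


def runs_at_full_speed(registers):
    """registers: list of "fp16"/"fp32" strings, one per live temporary."""
    used = sum(SLOT_COST[precision] for precision in registers)
    return used <= FULL_SPEED_SLOTS


print(runs_at_full_speed(["fp32", "fp32"]))          # True
print(runs_at_full_speed(["fp32", "fp32", "fp32"]))  # False
print(runs_at_full_speed(["fp16"] * 4))              # True
```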
Old 13-May-2003, 22:17   #18
MuFu
Chief Spastic Baboon
 
Join Date: Jun 2002
Location: Location, Location with Kirstie Allsopp
Posts: 2,258
Default

Quote:
Originally Posted by Joe DeFuria
Quote:
Originally Posted by MDolenc
I actually got reply on this from NVidia 2 hours ago.
That would mean then, that unlike the NV30, the NV35 should be able to run the ARB2 path of Doom3 at the same speed as the NV30 path.
Quote:
Originally Posted by THG
Due to a bug, ARB2 currently does not work with NVIDIA's DX9 cards when using the preview version of the Detonator FX driver. According to NVIDIA, ARB2 performance with the final driver should be identical to that of the NV30 code.
MuFu.
Old 13-May-2003, 22:22   #19
demalion
Senior Member
 
Join Date: Feb 2002
Location: CT
Posts: 2,024
Default

Woah, I hadn't expected that until NV40. I had no idea the NV30 was that broken. Well, I did, but I dismissed the possibility too soon, it appears.
Old 13-May-2003, 22:36   #20
Tridam
Regular
 
Join Date: Apr 2003
Location: Louvain-la-Neuve, Belgium
Posts: 523
Default

Quote:
Originally Posted by demalion
Woah, I hadn't expected that until NV40.
I hadn't expected it either :P I thought that NVIDIA would try to use a pipeline very similar to the NV30/31/34 pipelines to "help" developers make shaders that every NV3x likes.


It's great if NV35 can work properly at full speed with the ARB2 path.
Old 13-May-2003, 22:38   #21
Ostsol
Senior Member
 
Join Date: Nov 2002
Location: Edmonton, Alberta, Canada
Posts: 1,765
Default

Eck. . . definitely a case of driver optimization. If we go into "conspiracy theory mode", we can speculate that NVidia purposely broke ARB_fragment_program support so that hardware sites would have no choice at all but to use the NV30 path for the benchmark. . .

EDIT: misread a post. . .
Old 13-May-2003, 22:50   #22
Luminescent
Senior Member
 
Join Date: Aug 2002
Location: Miami, Fl
Posts: 1,036
Default

But are those 12 fp units just the shader units, the number of shader units and texture units combined, or are they all capable of functioning as either?
__________________
"Friendship is unnecessary, like philosophy, like art... It has no survival value; rather it is one of those things that give value to survival."
-C.S. Lewis
Old 13-May-2003, 22:58   #23
MuFu
Chief Spastic Baboon
 
Join Date: Jun 2002
Location: Location, Location with Kirstie Allsopp
Posts: 2,258
Default

Quote:
Originally Posted by Luminescent
But are those 12 fp units just the shader units, the number of shader units and texture units combined, or are they all capable of functioning as either?
No. It's just that the TEX and FP units are intrinsically linked in such a way that allows 3 FP/pipe/clock OR 2 FP + 2 TEX lookup. I suppose this is most likely due to shared physical logic. I doubt they are discrete, "multi-purpose" units as that would rule out the inter-dependency.

MuFu.
Old 13-May-2003, 22:59   #24
demalion
Senior Member
 
Join Date: Feb 2002
Location: CT
Posts: 2,024
Default

Any increase in actual floating point performance is a very good thing, as long as testing bears out that it is real...all the other performance issues were not nearly as significant as this and the impact on DX 9 moving forward. Thinking back to the GDC slides and what it proposes for the HLSL ps_2_a target, this evolution seems natural and according to nVidia's original plan for NV3x (hmm...and also in line with some speculation I had intended for the forums, but restricted to some PMs due to a disappearing thread).

I don't see nVidia blatantly lying about this, and it makes sense within the assumptions about the NV30 transistor count that I abandoned a while ago as unrealistic, and the good news is that Wavey has an NV35 to put through its paces.

The bad news is that he won't have as much to tease us about with regards to surprises with the results until he finishes.
Oh, wait, that's only bad news for him :P.
Old 13-May-2003, 23:02   #25
Luminescent
Senior Member
 
Join Date: Aug 2002
Location: Miami, Fl
Posts: 1,036
Default

I hope he puts NV35 through various shader benchmarks, including the ones developed by our very own forum members.
__________________
"Friendship is unnecessary, like philosophy, like art... It has no survival value; rather it is one of those things that give value to survival."
-C.S. Lewis
