NV35 might be misunderstood...

Tridam · Jun 15, 2003

McElvis said:
Hi Tridam,

If the NV30 is -
4 pipelines : 2 text units or 1 FP32 unit + 2 FX12 units (with double mul ability)

And the NV34 is -
2 pipelines : 2 text units or 1 FP32 unit + 2 FX12 units (with double mul ability)

What is the NV31?

I don't have one to test

I think it has the same overall architecture as the NV34. Maybe the pipelines of every unit are smaller and don't hide latency as well. Maybe the buffers are smaller... I don't really know.

WaltC · Jun 15, 2003

DaveBaumann said:
Honestly, I wish people would get over this "it's not pipelined" thing - where is the proof that it's not pipelined? I'd like to see something that really suggests this.

On the other hand I've seen lots to suggest that it is pipelined - such as the inability to texture and do an FP operation at the same time (in NV30) and the fact that at the D2D developer conference they were talking about all the operations that rely on a 2x2 configuration (ddx/ddy, filtering, etc.).

Seriously, can anyone prove that it differs from this?

Exactly. I remember in an nV30 interview with somebody at nVidia (Kirk, Spock, Perez--somebody--I think it was a question proffered by Tech Report), the very precise question was put to nVidia:

"Is nv30 architecture an 8x1 pipeline organization?"

Which question was answered by nVidia thusly:

"Yes, we do 8 ops per clock."

This kind of thing is typical for nVidia. They first advertise an R300-like 8x1 pipeline organization and then deliberately confuse the issue by talking about ops instead of pixels rendered per clock (which is what the original question concerned instead of "ops" as defined here--nVidia's definition of "ops" is also fluid it would appear but that's another story.)

Anyway, nVidia has danced around the question with the grace of a professional ballet performer ever since.

This gives rise to all kinds of "connect-the-dot" hypotheses which attempt to make sense out of the contradictory and evasive "answers" nVidia provides to such questions (forgetting all about their liking for the "trade secret" mantra.) nVidia's never said "we don't use traditional pipelines in our nv3x architecture." Never said anything close to it.

Instead, they equivocate and insinuate and people pick this up and attempt to make some kind of rational apology for it--such as the "virtual" pipeline theory or the "non-traditional" pipeline theory or the "conditional" pipeline theory, etc. ad infinitum. nVidia has never stated anything of the kind.

IMO, they just don't want it known that the 8x1 organization they literally advertise for these products isn't correct.

So why do they do it? What's behind all of their evasiveness on this and other topics? R3xx.

MuFu · Jun 15, 2003

Luminescent said:
I was sitting on the toilet when it hit me!

My toilet's never done that, but once the lid fell down when I was taking a piss and nearly severed my meat truncheon.

MuFu.

Arun · Jun 15, 2003

MuFu said:
Luminescent said:

I was sitting on the toilet when it hit me!

My toilet's never done that, but once the lid fell down when I was taking a piss and nearly severed my meat truncheon.

MuFu.

There you have it, people - MuFu's only post for several days is about "meat truncheons"

And no, I don't have the time to reply to the rest of the topic now. Don't worry, I'm not trying to escape it, just don't have the time

Uttar

Hyp-X · Jun 15, 2003

Tridam said:
NV25 : 4 pipelines : 2 text units + 2 FX units (with double mul ability)
NV30 : 4 pipelines : 2 text units or 1 FP32 unit + 2 FX12 units (with double mul ability)
NV34 : 2 pipelines : 2 text units or 1 FP32 unit + 2 FX12 units (with double mul ability)
R350 : 8 pipelines : 1 text unit + 1 FP24 unit + 1 FP24-just-MUL unit (MOV seems free)
R200 : 4 pipelines : 2 text units + 2 FX12 units
RV250 : 4 pipelines : 1 text unit + 1 FX12 unit

I have strange results with R350/R300. It seems able to do one MUL for free with every instruction. And MOV seems free. I'll do more testing when I find time.

NV25 is FX9.

I believe the NV34 is proved to be reconfigurable to
4 pipelines : 1 text unit + 1 FX12 unit
but this configuration is used for single texturing only.

R200 and RV250 are FX16 not FX12.
(Although RV250 is slightly less precise in arithmetic calculations.)

Tridam · Jun 15, 2003

Hyp-X said:
Tridam said:

NV25 : 4 pipelines : 2 text units + 2 FX units (with double mul ability)
NV30 : 4 pipelines : 2 text units or 1 FP32 unit + 2 FX12 units (with double mul ability)
NV34 : 2 pipelines : 2 text units or 1 FP32 unit + 2 FX12 units (with double mul ability)
R350 : 8 pipelines : 1 text unit + 1 FP24 unit + 1 FP24-just-MUL unit (MOV seems free)
R200 : 4 pipelines : 2 text units + 2 FX12 units
RV250 : 4 pipelines : 1 text unit + 1 FX12 unit

I have strange results with R350/R300. It seems able to do one MUL for free with every instruction. And MOV seems free. I'll do more testing when I find time.

NV25 is FX9.

I believe the NV34 is proved to be reconfigurable to
4 pipelines : 1 text unit + 1 FX12 unit
but this configuration is used for single texturing only.

R200 and RV250 are FX16 not FX12.
(Although RV250 is slightly less precise in arithmetic calculations.)

Ok, thank you. I haven't tested precision on these cards.

For the NV34, I don't think it can be configured the way you say. I think that when it uses 4 pipelines, it can't do any arithmetic calculations. I'll try it: answer in 5 minutes.

Tridam · Jun 15, 2003

You were right. NV34 can work like that :

4 pipelines with 1 text unit and 1 FX12 unit in ps_1_1

It's not possible in ps_1_4. The FX12 units can't work in ps_1_4, so the results are similar to the FP32/FP16 results: 3-4 times slower than the ps_1_1 results!

Dave H · Jun 16, 2003

So the conclusion on NV34 is:

2 tex/FP32 units, where each unit can do either 1 FP op or 2 texture ops per clock; if available, the 2 texture ops can be on different pixels (i.e. in "4 pipeline mode")
4 FX12 register combiner units
ability to output up to 4 pixels/clock

In fixed-function or PS <=1.3, it operates as a 4x1; in PS >=1.4 it operates as a 2x2, because there are only 2 FP arithmetic units available.

Right?

If this is the case, I wonder if, using NV_fragment_program in FX12 mode, you can get NV34 to behave as a 4x1 when using PS >=1.4 functionality?

Hyp-X · Jun 16, 2003

Dave H said:
In fixed-function or PS <=1.3, it operates as a 4x1; in PS >=1.4 it operates as a 2x2, because there are only 2 FP arithmetic units available.

Right?

If this is the case, I wonder if, using NV_fragment_program in FX12 mode, you can get NV34 to behave as a 4x1 when using PS >=1.4 functionality?

I think it is much simpler:
If you use single texturing and 1-2 PS ops that fit in a single FX12 (register combiner) operation, then it is 4x1, otherwise 2x2.
I don't see why nVidia should implement loopback for the 4x1 mode...

That means 4x1 should even work in PS 1.4, but there's no reason to use a PS 1.4 program that is so simple (read: no one will do that).

Dave H · Jun 16, 2003

Hyp-X said:
I think it is much simpler:
If you use single texturing and 1-2 PS ops that fit in a single FX12 (register combiner) operation, then it is 4x1, otherwise 2x2.
I don't see why nVidia should implement loopback for the 4x1 mode...

In theory there should still be a performance advantage when applying an odd number of textures. Granted it becomes pretty insignificant once you have a shader of any real length, but I don't see why they would pass it up. Triple texturing in particular is not terribly uncommon for DX7-style games.

(In fact, this suggests a way to test the theory: in (e.g.) MDolenc's fillrate tester, if the triple texturing fillrate is above 1 pixel/clock, we know it is using three loopbacks of a 4x1 configuration rather than 2 loopbacks of a 2x2 configuration. Of course, if the fillrate is below 1 PPC that doesn't prove anything. Meanwhile, I can't seem to find any results posted for a 5200--could someone who has one throw some up?)
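
To put numbers on that test, here is a minimal Python sketch. It assumes each pass through the pipeline costs exactly one clock and that loopback has no overhead, which nobody in this thread has confirmed. `pixels_per_clock` is a name made up for this illustration.

```python
from fractions import Fraction
from math import ceil

def pixels_per_clock(pipes, tex_per_pipe, n_textures):
    """Idealised fillrate: pixels finished per clock for a pipes x tex_per_pipe layout.

    Assumption: one clock per pass, no loopback overhead, no bandwidth limit.
    """
    # Each pass applies at most tex_per_pipe textures to every pixel in flight.
    passes = max(1, ceil(n_textures / tex_per_pipe))
    return Fraction(pipes, passes)

# Triple texturing, the case discussed above.
four_by_one = pixels_per_clock(4, 1, 3)  # three passes through a 4x1 layout
two_by_two = pixels_per_clock(2, 2, 3)   # two passes through a 2x2 layout
print(four_by_one, two_by_two)
```

Under those assumptions 4x1 with loopback gives 4/3 pixels per clock for three textures and 2x2 gives exactly 1, so a measured triple-texturing result above 1 pixel/clock would point to 4x1 with loopback.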

Of course it's possible that loopback involves some overhead that makes it more efficient to use 2x2 whenever a single time through the 4x1 pipeline is not enough. But there's no indication (that I know of) why that would necessarily be the case.

That means 4x1 should even work in PS 1.4, but there's no reason to use a PS 1.4 program that is so simple (read: no one will do that).

Others have said that FX12 can't be used for PS 1.4 because it doesn't provide the necessary range (i.e. it is [-2, 2] and PS 1.4 requires [-8, 8]). I don't know enough to comment definitively one way or the other, but you might.
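
For illustration only, here is a small Python sketch of what that range limit would mean in practice. It assumes FX12 is a signed fixed-point format with 10 fractional bits, so it covers -2 up to just under 2. That bit layout is an assumption on my part, not something measured in this thread, and `to_fx12` is a hypothetical helper.

```python
FRACTION_BITS = 10                    # assumed layout: sign + 1 integer bit + 10 fraction bits
SCALE = 1 << FRACTION_BITS            # 1024 steps per unit
RAW_MIN = -2 * SCALE                  # represents -2.0
RAW_MAX = 2 * SCALE - 1               # represents just under +2.0

def to_fx12(value):
    """Round a float to the assumed FX12 grid and clamp it to the representable range."""
    raw = round(value * SCALE)
    raw = max(RAW_MIN, min(RAW_MAX, raw))
    return raw / SCALE

for v in (0.5, 1.75, 5.0, -8.0):
    print(v, "->", to_fx12(v))
```

In this model anything a PS 1.4 shader computes outside [-2, 2), such as 5.0 or -8.0, is clamped, which is the loss the range argument is about.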

In any case, if you're right about no loopback in 4x1, it's a moot point as you mention.

Tridam · Jun 16, 2003

Hyp-X said:
Dave H said:

In fixed-function or PS <=1.3, it operates as a 4x1; in PS >=1.4 it operates as a 2x2, because there are only 2 FP arithmetic units available.

Right?

If this is the case, I wonder if, using NV_fragment_program in FX12 mode, you can get NV34 to behave as a 4x1 when using PS >=1.4 functionality?

I think it is much simpler:
If you use single texturing and 1-2 PS ops that fit in a single FX12 (register combiner) operation, then it is 4x1, otherwise 2x2.
I don't see why nVidia should implement loopback for the 4x1 mode...

That means 4x1 should even work in PS 1.4, but there's no reason to use a PS 1.4 program that is so simple (read: no one will do that).

I agree with you: 4x1 if there is 1 texture and 1 operation (or 2 MULs), and 2x2 for anything else.

4x1 can't work in PS 1.4. I tried it. It's not possible. When you use PS 1.4 the FX12 units don't work. PS 1.4 arithmetic calculations are done by the FP unit.
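
The rule above can be written down as a short Python sketch. This only restates the test results reported in this thread; it is not a description of how the driver or the hardware actually decides. `nv34_mode` and its parameters are hypothetical names.

```python
def nv34_mode(ps_version, n_textures, ops):
    """Guess the NV34 configuration from the observations in this thread.

    ps_version: tuple such as (1, 1) or (1, 4); fixed function can be passed as (0, 0).
    n_textures: number of textures sampled per pixel.
    ops: list of arithmetic instruction names, e.g. ["ADD"] or ["MUL", "MUL"].
    """
    ops = list(ops)
    fx12_usable = ps_version < (1, 4)        # FX12 units reportedly idle in PS 1.4
    single_texture = n_textures == 1          # no loopback observed in 4x1
    fits_one_combiner = len(ops) <= 1 or ops == ["MUL", "MUL"]  # 1 op, or 2 MULs
    if fx12_usable and single_texture and fits_one_combiner:
        return "4x1"
    return "2x2"

print(nv34_mode((1, 1), 1, ["ADD"]))         # single texture, one op
print(nv34_mode((1, 4), 1, ["ADD"]))         # same shader in PS 1.4
print(nv34_mode((1, 1), 3, ["ADD", "ADD"]))  # multitexturing
```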

Arun · Jun 16, 2003

Now that is very interesting stuff at least!

So if the NV34 can do 4x(TEX+FX12) - that means it's fairly obvious the NV30 is really 8x(TEX+FX12) - with the difference that it cannot do 8 color outputs, only 8 Z outputs.

So we get another question here: Does the NV30 operate like an 8-pipeline architecture when using at least one FP32 operation without color output, only Z? Or is the 8 zixels thingy only good for FX12 & texturing?

I think that this really makes the whole ILDP thing less likely. But still, saying it's 100% traditional doesn't seem accurate either. So...

What about imagining an 8-pipeline design, each pipeline with, let's say, 4 stages, but where the units aren't actually IN the pipeline? The pipeline just holds "calling" units that request the required unit from a pool.

Now, the obvious question here is: Why in the world would you want to do that?
First, let's conceptualize it:

-------
register pool
-------
calculators pool
-------
pipeline 1/2/3/4/5/6/7/8, each with multiple "calling" units
-------
instruction pool
-------

So, the pipeline has NO idea what the heck it's ordering the calculators to do. It reads from the instruction pool and then just says things like "Pixel 487 use unit 21 with operation 56, outputting in register 3 and inputting from registers 2 and 5".
Then, calculation unit 21 would read registers 2 and 5, do the operation and output the result to register 3.
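
As a toy illustration of that idea (purely a sketch of the hypothesis, not of any real hardware), here is some Python in which the "pipeline" only forwards instructions to a pool of units and never looks at what they compute. `Instruction`, `UNIT_POOL` and `run` are names invented for this example.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Instruction:
    unit: str               # which pooled unit to call
    dst: int                # output register index
    srcs: Tuple[int, ...]   # input register indices

# Hypothetical calculator pool: the pipeline only knows the unit names.
UNIT_POOL = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

def run(program, registers):
    """Dispatch each instruction to its unit; any unit order is allowed."""
    regs = list(registers)
    for ins in program:
        unit = UNIT_POOL[ins.unit]
        regs[ins.dst] = unit(*(regs[s] for s in ins.srcs))
    return regs

# "Use unit add, outputting in register 3 and inputting from registers 2 and 5."
result = run([Instruction("add", 3, (2, 5))], [0, 0, 1.5, 0, 0, 2.0])
print(result[3])
```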

Why do that?
As I explained before, this way, you can do things in ANY order. You can do Texturing->FX or FX->Texturing or whatever. The GPU doesn't care. It just does what it's supposed to do.
Now, of course, if I'm being stupid again and the R300 is also capable of that, please tell me before I embarrass myself further. Thanks!

Uttar

Tridam · Jun 16, 2003

Dave H said:
(In fact, this suggests a way to test the theory: in (e.g.) MDolenc's fillrate tester, if the triple texturing fillrate is above 1 pixel/clock, we know it is using three loopbacks of a 4x1 configuration rather than 2 loopbacks of a 2x2 configuration. Of course, if the fillrate is below 1 PPC that doesn't prove anything. Meanwhile, I can't seem to find any results posted for a 5200--could someone who has one throw some up?)

In any case, if you're right about no loopback in 4x1, it's a moot point as you mention.

I just ran some more tests. I tested shaders with 3 textures and 3 calculations (or 3 textures and 2 calculations, or 1-2 textures and 3 calculations). There's no loopback in 4x1.

Tridam · Jun 16, 2003

Uttar said:
Now that is very interesting stuff at least!

So if the NV34 can do 4x(TEX+FX12) - that means it's fairly obvious the NV30 is really 8x(TEX+FX12) - with the difference that it cannot do 8 color outputs, only 8 Z outputs.

So we get another question here: Does the NV30 operate like an 8-pipeline architecture when using at least one FP32 operation without color output, only Z? Or is the 8 zixels thingy only good for FX12 & texturing?

A pixel has color values. The NV30 can't have a throughput of more than 4 pixels per clock. It's not similar to the NV34.

NV34 could be 4x1 or 2x2
NV30 is always 4x2

... when rendering pixels.

Dave Baumann · Jun 16, 2003

Tridam said:
I just ran some more tests. I tested shaders with 3 textures and 3 calculations (or 3 textures and 2 calculations, or 1-2 textures and 3 calculations). There's no loopback in 4x1.

<scratches head>
I gotta ask - what the hell were they thinking? If there's no loopback at all with 4x1 this means that in any multitexturing operation it can't use 4 pipes. Its operation as a 4x1 is basically marginalised to the Fill-rate test in 3DMark, a few sky boxes and blend effects - why not spend the silicon on something more useful?

Tridam · Jun 16, 2003

Uttar said:
Why do that?
As I explained before, this way, you can do things in ANY order. You can do Texturing->FX or FX->Texturing or whatever. The GPU doesn't care. It just does what it's supposed to do.
Now, of course, if I'm being stupid again and the R300 is also capable of that, please tell me before I embarrass myself further. Thanks!

It's very difficult to verify what you say because the shader engine in the drivers can change instruction order to increase efficiency. And in PS 1.1 you can't have a TEX instruction after a calculation instruction. So every TEX instruction has to be at the beginning of the shader.

I made some tests with the R350. Per cycle and per pipeline it can do 1 instruction + 1 multiplication. The multiplication can be done before the general instruction or after. With some instructions like MAD it can't be done before.

So you could do ADD + MUL or MUL + ADD or MAD + MUL but not MUL + MAD. Of course, if there's no dependency, the driver changes the order of the instructions. Maybe this example can help.
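
The pairing rule can be sketched in Python as a rough cycle counter for one pipeline. It assumes the instructions are issued in the order given, ignores dependencies and driver reordering, and treats MAD as the only instruction that can't take a MUL in front of it (the post says "some instructions like MAD", so the real list may be longer). `count_cycles` is a hypothetical name.

```python
NO_MUL_BEFORE = {"MAD"}  # assumption: only MAD is known to block a leading MUL

def count_cycles(instructions):
    """Greedy estimate of cycles when each cycle can hold 1 instruction + 1 MUL."""
    cycles = 0
    i = 0
    while i < len(instructions):
        if i + 1 < len(instructions):
            first, second = instructions[i], instructions[i + 1]
            mul_after = second == "MUL"                                # X + MUL
            mul_before = first == "MUL" and second not in NO_MUL_BEFORE  # MUL + X
            if mul_after or mul_before:
                i += 2          # both instructions share one cycle
                cycles += 1
                continue
        i += 1                  # instruction issues alone
        cycles += 1
    return cycles

for pair in (["ADD", "MUL"], ["MUL", "ADD"], ["MAD", "MUL"], ["MUL", "MAD"]):
    print(pair, count_cycles(pair))
```

In this model the first three pairs take one cycle and MUL + MAD takes two, which matches the results described above.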

Tridam · Jun 16, 2003

DaveBaumann said:
Tridam said:

I just ran some more tests. I tested shaders with 3 textures and 3 calculations (or 3 textures and 2 calculations, or 1-2 textures and 3 calculations). There's no loopback in 4x1.

<scratches head>
I gotta ask - what the hell were they thinking? If there's no loopback at all with 4x1 this means that in any multitexturing operation it can't use 4 pipes. Its operation as a 4x1 is basically marginalised to the Fill-rate test in 3DMark, a few sky boxes and blend effects - why not spend the silicon on something more useful?

lol I think exactly the same

Hyp-X · Jun 16, 2003

Single texturing is more common than you would expect.
A lot of models in a lot of games are using single texturing.
Also the HUD is likely to be single textured and it can take fillrate too.
If they have the bandwidth to support it why shouldn't they do it?

Even as a 2 pipeline part, it's likely still working on pixels in 2x2 blocks for texturing purposes. So it was possibly cheap (esp. without loopback) to add this special case feature.

Adding 8x1 on the NV30 would have probably cost much more, and wouldn't make much sense due to lack of bandwidth.

Arun · Jun 16, 2003

DaveBaumann said:
Tridam said:

I just ran some more tests. I tested shaders with 3 textures and 3 calculations (or 3 textures and 2 calculations, or 1-2 textures and 3 calculations). There's no loopback in 4x1.

<scratches head>
I gotta ask - what the hell were they thinking? If there's no loopback at all with 4x1 this means that in any multitexturing operation it can't use 4 pipes. Its operation as a 4x1 is basically marginalised to the Fill-rate test in 3DMark, a few sky boxes and blend effects - why not spend the silicon on something more useful?

Marketing, I guess.
But you're also forgetting that it isn't limited to single texturing: it can also work as a 4x0 in the case of Z passes, for example. So it certainly is nowhere near as good as it would be with loopback, but it isn't entirely useless either.

Also, something I'd really like to test is whether the register usage performance hit is bigger with 8 pipelines than with 4 pipelines.

Uttar

Dave Baumann · Jun 16, 2003

Hyp-X said:
Single texturing is more common than you would expect.
A lot of models in a lot of games are using single texturing.

Oh, guess I paid too much attention to NVIDIA marketing there!
