NV35 might be misunderstood...

:idea: :oops:
I was sitting on the toilet when it hit me! The puzzle of NV35's rather odd FillrateTester scores has been solved! While relaxing, I came to the conclusion that fragment (not pixel) output and throughput are two completely different performance metrics. Output is the tangible result obtained from an input, while throughput is
Dictionary.com said:
the amount passing through a system from input to output (especially of a computer program over a period of time)
Thus, the fragment throughput of a system can be used to measure the VPU/GPU's pipeline efficiency: the accessibility of a given output.

If a conveyor belt has 2 lanes of 2 workers each (all workers and lanes being equal), assembling a two-piece component, while another has 4 lanes of 1 worker each, assembling the same component, the 2-lane set of workers has a greater likelihood of upholding a certain ratio of output per lane, since there are 2 workers assigned to every single (global) task.

Similarly, we find NV35 (the little cinematic engine that tried to do the right thing) only capable of writing 4 pixels per clock (color ops, with z buffering). Although NV35 could work on 8 pixels simultaneously (on apps with shaders of more than 1 instruction or texture per pixel) because it contains more than 1 fp unit per pipeline (like the 2*2 lane employees), it can only output 4 pixels per clock; it contains only 4 parallel pipelines (lanes). In contrast, R300/350 is able to output 8 pixels per clock because it houses 8 discrete pipelines (although each pipeline is composed of only 1 fp unit). Thus, if there is a two-instruction pixel shader (assuming both instructions are micro/core ops, executable in 1 cycle), NV35 should be able to sustain its fillrate (assuming the ops are dependent/optimized for 2 serially arranged units), but R300/350's will drop by a factor of 2, since R3* can only execute 1 fp op per pipeline per cycle and there are two (per pixel) in the shader.
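The lane/worker comparison above can be put into a few lines of C. This is only a sketch of the model as described in the post: the struct, the function names, and the "one cycle per op, ops chain perfectly across serial units" rule are assumptions for illustration, not a description of how either chip really schedules work.

```c
#include <stdio.h>

/* Hypothetical model: a chip has `pipelines` lanes, each with
   `units_per_pipe` serially arranged fp units; every op takes 1 cycle. */
struct chip {
    const char *name;
    int pipelines;
    int units_per_pipe;
};

/* Cycles one pipeline spends on a pixel: ceil(ops / units), at least 1. */
int cycles_per_pixel(const struct chip *c, int shader_ops)
{
    int cycles = (shader_ops + c->units_per_pipe - 1) / c->units_per_pipe;
    return cycles < 1 ? 1 : cycles;
}

/* Pixels written per clock across the whole chip under this model. */
double pixels_per_clock(const struct chip *c, int shader_ops)
{
    return (double)c->pipelines / cycles_per_pixel(c, shader_ops);
}

/* Print the modelled output for a shader of `shader_ops` instructions. */
void report(const struct chip *c, int shader_ops)
{
    printf("%s: %d-op shader -> %.1f pixels/clock\n",
           c->name, shader_ops, pixels_per_clock(c, shader_ops));
}
```

Calling `report` with a chip set up as `{"NV35", 4, 2}` and another as `{"R300", 8, 1}` shows the claim in the paragraph: with a two-op shader the 4x2 layout keeps its 4 pixels per clock, while the 8x1 layout falls from 8 to 4.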

Now, because of the inherent nature of fillrate testers (such as MDolenc's), and the fact that they are limited to obtaining the number of pixels written, not ops executed, per second, we must do a little computational work on our own to determine the average number of instructions a certain processor is executing per cycle.

One way to do this, with a fillrate tester, is by dividing the maximum pixel fillrate (with no shaders/texturing) by the fillrate result obtained for the particular fragment program in question. The result should tell us the factor by which the original fillrate was cut (the number of cycles it took to complete). Next, count the number of instructions in the fragment program and compare the two results. Assuming each fragment op takes a roughly known number of cycles (usually 1 for any pixel shader op, excluding things like pow, lit, lrp, etc.), divide the number of pixel shader instructions by the factor of "cut" fillrate, and the average instruction execution rate for each test will be available.

If I were to write a program which did this work for the end user/reviewer, the code would read as follows (I only know 1 semester's worth of basic C programming :oops: ):
Luminescent said:
#include <stdio.h>

int
main (void)
{
    /* double, so that the "%lf" conversions below match the variable type */
    double max_rate, app_rate, number_cycles, exec_rate;
    int number_instruc;

    /* Prompt user for necessary fillrate tester inputs */
    printf("Enter maximum fillrate: ");
    if (scanf("%lf", &max_rate) != 1 || max_rate <= 0) return 1;
    printf("Enter application fillrate: ");
    if (scanf("%lf", &app_rate) != 1 || app_rate <= 0) return 1;
    printf("Enter number of instructions in application: ");
    if (scanf("%d", &number_instruc) != 1) return 1;

    /* Determine the number of cycles required for desired application */
    number_cycles = max_rate / app_rate;

    /* Extrapolate the average number of instructions per cycle */
    exec_rate = number_instruc / number_cycles;

    /* Return results to user */
    printf("Your vpu averages a shader execution rate of %f instructions/cycle/pipeline\n", exec_rate);

    return 0;
}
 
First of all the nv30/nv35 series do not have any fp units per pipeline from what I understand. They have a pool of fp units and a scheduler that dynamically issues fp instructions.

Because Nvidia's fp and ATI's fp may be pipelined, we really don't know how they compare. It would depend on how many stages they have and how efficient they are. We know that Nvidia has 32 units and control lines that let them be used as 32 fp32 units or 64 fp16 units. (If I understand correctly)

ATI can do multiple parallel operations other than just floating point in each pipeline. (e.g. texture lookup, color operation)

I am sure that Nvidia can also do multiple parallel operations in one "pipeline". (I believe they have 8 texture units)

Without knowing more about Nvidia's mystery architecture I think you can't really do an apples-to-apples comparison...

By the way, I don't know why Nvidia insists on keeping their architecture a secret. I am sure by now that ATI has looked at the NV30 under their electron microscope to see what they are doing....
 
I don't know why Nvidia insists on keeping their architecture a secret.

I believe that there are 2 reasons:

1. They buggered up the architecture and don't want to look like fools.

2. They have something in store in the far away future with their NV3x architecture.

I have no hard evidence to support either case, hence I used the word "believe". If anyone has any hard evidence to suggest either of my beliefs are right then please show me.
 
What I mean by pipelines are "virtual" pipelines. The only fact I am disclosing regarding NV35 is its 4 pipeline 2-3 fp shader op capability.

The point of this thread was not even to discuss NV35's architecture but to shed some light on how we should absorb any fillrate tester's results.
 
Luminescent:

  • The NV3x (for the purpose of this, NV30 or NV35) would have a throughput of 12 ops per cycle, maximum.

    The R3xx (for the purpose of this, R300 or R350) would have a throughput of 24 ops per cycle, maximum.
  • The NV3x would have a throughput of 4 pixels. The ops are on the same pixels.

    The R3xx would have a throughput of 8 pixels. The NV35 might have a throughput of 8 pixels maximum as well, if the architecture is flexible enough to configure the operation throughput for this, but this seems likely limited to PS 1.3 and less (it doesn't seem to offer it now, at least not with any efficiency...which seems to indicate that the architecture is not flexible enough).
  • The NV3x would have a throughput of 8 z elements and 8 stencil elements. The ops are on different elements, because the architecture has that flexibility.

    The R3xx would still have the same throughput of 8 z elements and 8 stencil elements. Actually, I'm not sure how much hardware redesign it would take to enable the scalar/vec3 co-issue to work on discrete z and stencil elements. Probably a lot, as it hasn't been done yet.

I'm not seeing what is new in what you said.

Your calculations apply to the R3xx as well. Looks exactly like my proxel pipeline/throughput discussion too, but the name seems to have relegated my discussion to a permanent "quickly forgotten" status. :-?

Pretty similar to an 8x1 versus 4x2 discussion as well, with "proxels" or "ops" resembling the roles of texels (but not completely similar, again part of the aforementioned discussion).
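One way to read the op and pixel maxima listed above is as two separate ceilings on pixel output. The sketch below takes the peak numbers from the post (12 ops and 4 pixels for NV3x, 24 ops and 8 pixels for R3xx); the min() model and all names in it are assumptions made for illustration.

```c
/* Peak per-clock figures as quoted in the post; the model is an assumption. */
struct throughput {
    double pixels_per_clock;  /* peak pixels written per clock */
    double ops_per_clock;     /* peak shader ops ("proxels") per clock */
};

/* Sustained pixels per clock for a shader needing `ops_per_pixel` ops:
   whichever of the pixel ceiling or the op ceiling is hit first. */
double sustained_pixels(struct throughput t, int ops_per_pixel)
{
    double op_bound;

    if (ops_per_pixel <= 0)
        return t.pixels_per_clock;
    op_bound = t.ops_per_clock / ops_per_pixel;
    return op_bound < t.pixels_per_clock ? op_bound : t.pixels_per_clock;
}
```

Under this reading, both chips hold their pixel maximum up to 3 ops per pixel, and at 6 ops per pixel the model gives 2 pixels per clock for `{4, 12}` and 4 for `{8, 24}`.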
 
Re: NV35 mistreated; find out why...

Luminescent said:
If I were to write a program which did this work for the end user/reviewer, the code would read as follows (I only know 1 semester's worth of basic C programming :oops: ):
Luminescent said:
#include <stdio.h>

int
main (void)
{
    /* double, so that the "%lf" conversions below match the variable type */
    double max_rate, app_rate, number_cycles, exec_rate;
    int number_instruc;

    /* Prompt user for necessary fillrate tester inputs */
    printf("Enter maximum fillrate: ");
    if (scanf("%lf", &max_rate) != 1 || max_rate <= 0) return 1;
    printf("Enter application fillrate: ");
    if (scanf("%lf", &app_rate) != 1 || app_rate <= 0) return 1;
    printf("Enter number of instructions in application: ");
    if (scanf("%d", &number_instruc) != 1) return 1;

    /* Determine the number of cycles required for desired application */
    number_cycles = max_rate / app_rate;

    /* Extrapolate the average number of instructions per cycle */
    exec_rate = number_instruc / number_cycles;

    /* Return results to user */
    printf("Your vpu averages a shader execution rate of %f instructions per cycle\n", exec_rate);

    return 0;
}

well there's no reason to make the main an int, you can just void it and not bother with the return 0 ;)
 
Well demalion, I guess your pixel/proxel pipeline flew over my head at the time it was written and I wasn't able to grasp the concept until now; I was a bit "fuzzy" myself ;) . I guess I'll stick to your terminology from now on, but the different take on the concept might help someone else.

You mention that NV35 is not flexible enough to throughput 8 "proxels" (pixel fragments) per clock cycle, but AnteP's benchmark seems to prove such a theory wrong. Here are the calculations for the per-pixel test (using my aforementioned algorithm, which you seem to understand):
NV35:
Maximum fillrate=1772.702026M pixels/sec
Per-pixel fillrate=105.561607M pixels/sec (at fp16 for the sake of neglecting fp32's register overhead)
Maximum fillrate/per-pixel fillrate=~16.79
Considering there are 21 instructions:
21/16.79=~1.25 instructions/cycle per pipeline

Now, compare this to NV30:
Maximum fillrate=1957.946899M pixels/sec
Per-pixel fillrate=67.032890M pixels/sec
(at fp16 for the sake of neglecting fp32's register overhead)
Maximum fillrate/per-pixel fillrate=~29.209
Considering there are 21 instructions:
21/29.209=~.7189 instructions/cycle per pipeline

Clock for clock, the improvement between NV35 and NV30 in number of instructions executed per clock, for the per pixel shader test is:
1.25*1.10/.7189=1.91, or an improvement of almost 91%, which is close to two-fold.
Seems Nvidia wasn't too far off with their shader performance claims. However, it seems they were referring to throughput rather than output.
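For anyone who wants to re-run the arithmetic, here is a minimal C sketch using the fillrate figures quoted above. The function names are made up for illustration. One caveat worth checking: each per-cycle figure already divides by that chip's own maximum fillrate, which seems to account for the clock difference already, so the sketch prints the plain ratio (roughly 1.74) alongside the 1.10-scaled figure from the post.

```c
#include <stdio.h>

/* Instructions per cycle per pipeline, by the method described earlier:
   instruction count divided by (maximum fillrate / shader fillrate). */
double instr_per_cycle(double max_rate, double shader_rate, int instructions)
{
    return instructions / (max_rate / shader_rate);
}

/* Recompute the NV35 vs NV30 comparison from the quoted numbers. */
void print_comparison(void)
{
    /* Mpixels/s, copied from the post; 21-instruction shader at fp16. */
    double nv35 = instr_per_cycle(1772.702026, 105.561607, 21);
    double nv30 = instr_per_cycle(1957.946899, 67.032890, 21);

    printf("NV35: %.4f instructions/cycle/pipeline\n", nv35);
    printf("NV30: %.4f instructions/cycle/pipeline\n", nv30);
    printf("NV35/NV30 per-cycle ratio: %.2f\n", nv35 / nv30);
    printf("Ratio with the post's 1.10 factor: %.2f\n", nv35 * 1.10 / nv30);
}
```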

By the way, thanks for the programming tip, gokickrocks.
 
well there's no reason to make the main an int, you can just void it and not bother with the return 0

Why even bother with C when you can do this...
Code:
#include <iostream>
using namespace std;

int main()
{
	float max_rate, app_rate, number_cycles, exec_rate;
	int number_instruc; 

	cout << "Enter maximum fillrate: ";
	cin >> max_rate;
	cout << "Enter application fillrate: ";
	cin >> app_rate;
	cout << "Enter number of instructions in application: ";
	cin >> number_instruc;

	number_cycles=max_rate/app_rate;

	exec_rate=number_instruc/number_cycles; 
	cout << "Your vpu averages a shader execution rate of: " <<  exec_rate << " instructions per cycle" << endl;
}

The point of this thread was not even to discuss NV35's architecture but to shed some light on how we should absorb any fillrate tester's results.

Fillrate scores for the NV3x chips are really strange. I interpret them as is. In other words, I see the fillrate score and believe that's how much there is for that specific test.

I also want to know how much fillrate is being utilised in games compared to benchmarks.
 
Yes, Lum, but that's not pixel throughput, that's "proxel" throughput. That's for 4 pixel pipelines, not 8.

BTW, it's only fair to warn you: if you start using "proxel", beware the wrath and foul language of cranky old men.
 
Honestly, I wish people would get over this "it's not pipelined" thing - where is the proof that it's not pipelined? I'd like to see something that really suggests this.

On the other hand I've seen lots to suggest that it is pipelined - such as the inability to texture and do an FP operation at the same time (in NV30) and the fact that at the D2D developer conference they were talking about all the operations that rely on a 2x2 configuration (ddx/ddy, filtering, etc.).

Seriously, can anyone prove that it differs from this?
 
DaveBaumann said:
Honestly, I wish people would get over this "it's not pipelined" thing - where is the proof that it's not pipelined? I'd like to see something that really suggests this.

On the other hand I've seen lots to suggest that it is pipelined - such as the inability to texture and do an FP operation at the same time (in NV30) and the fact that at the D2D developer conference they were talking about all the operations that rely on a 2x2 configuration (ddx/ddy, filtering, etc.).

Seriously, can anyone prove that it differs from this?

I admit we don't have enough proof right now, sorry if you're getting annoyed by this.
And I never said it's "not pipelined" - it's just that the PS and VS aren't pipelined. The overall structure *is* still a pipeline ( Input->VS->Triangle Setup->Rasterization->PS->Output ) - it's just that PS and VS are, IMO, not really pipelines.
I'm also NOT saying, as you once seemed to understand, that the NV3x only works on one pixel at once very fast. It does work on 4 pixels, or 8 pixels - practically, it *emulates* as many pipelines as it wishes to ( with of course some limitations, being able to emulate 128 pipelines is, err, unlikely ). That means making it look like a 2x2 configuration is really no problem I believe.
Yes, yes, I know, I'm gonna have trouble proving my claims here... And thus, I agree not to insist too much on this whole story anymore until I can get an NV35 in my hands and test a few things.

BTW, in reference to your link...
http://www.tt-hardware.com/img/divers01/geffx_5600.gif
Would it really be optimal to do that if it was traditional pipelining? Accessing the units of another pipeline like that? Seems possible, but rather icky in a traditional design.
And no, I'm not claiming this to be any kind of proof - it frankly isn't - just one of the many strange things happening in the NV3x.


Uttar
 
Hey, that is one interesting diagram, Uttar. Now I guess we'll have to endlessly reference that pic. ;)

However, Uttar, if NV35 contains what are essentially 8 fp shader units, two of them per pipeline, what makes you so sure it is a 2x2 architecture?

It seems the pic you linked to displays 4 independent pathways. How is it that the 2nd and 3rd units per pipeline get their instructions? How can an instruction go to each one simultaneously if there are only 4 independent pathways? I still haven't figured that out.
 
Re: NV35 mistreated; find out why...

gokickrocks said:
well there's no reason to make the main an int, you can just void it and not bother with the return 0 ;)

You mean apart from the fact that main() must be declared as returning an int?

1.25b: What's the right declaration for main()?
Is void main() correct?

A: See questions 11.12a to 11.15. (But no, it's not correct.)

[...]

11.12a: What's the correct declaration of main()?

A: Either int main(), int main(void), or int main(int argc,
char *argv[]) (with alternate spellings of argc and *argv[]
obviously allowed). See also questions 11.12b to 11.15 below.

References: ISO Sec. 5.1.2.2.1, Sec. G.5.1; H&S Sec. 20.1 p.
416; CT&P Sec. 3.10 pp. 50-51.

11.12b: Can I declare main() as void, to shut off these annoying
"main returns no value" messages?

A: No. main() must be declared as returning an int, and as
taking either zero or two arguments, of the appropriate types.
If you're calling exit() but still getting warnings, you may
have to insert a redundant return statement (or use some kind
of "not reached" directive, if available).

Declaring a function as void does not merely shut off or
rearrange warnings: it may also result in a different function
call/return sequence, incompatible with what the caller (in
main's case, the C run-time startup code) expects.

(Note that this discussion of main() pertains only to "hosted"
implementations; none of it applies to "freestanding"
implementations, which may not even have main(). However,
freestanding implementations are comparatively rare, and if
you're using one, you probably know it. If you've never heard
of the distinction, you're probably using a hosted
implementation, and the above rules apply.)

References: ISO Sec. 5.1.2.2.1, Sec. G.5.1; H&S Sec. 20.1 p.
416; CT&P Sec. 3.10 pp. 50-51.

11.13: But what about main's third argument, envp?

A: It's a non-standard (though common) extension. If you really
need to access the environment in ways beyond what the standard
getenv() function provides, though, the global variable environ
is probably a better avenue (though it's equally non-standard).

References: ISO Sec. G.5.1; H&S Sec. 20.1 pp. 416-7.

11.14: I believe that declaring void main() can't fail, since I'm
calling exit() instead of returning, and anyway my operating
system ignores a program's exit/return status.

A: It doesn't matter whether main() returns or not, or whether
anyone looks at the status; the problem is that when main() is
misdeclared, its caller (the runtime startup code) may not even
be able to *call* it correctly (due to the potential clash of
calling conventions; see question 11.12b).

It has been reported that programs using void main() and
compiled using BC++ 4.5 can crash. Some compilers (including
DEC C V4.1 and gcc with certain warnings enabled) will complain
about void main().

Your operating system may ignore the exit status, and
void main() may work for you, but it is not portable and not
correct.

11.15: The book I've been using, _C Programing for the Compleat Idiot_,
always uses void main().

A: Perhaps its author counts himself among the target audience.
Many books unaccountably use void main() in examples, and assert
that it's correct. They're wrong.
Oh, and it's not valid in C++ either; in fact, the standard expressly disallows it.
 
Uttar said:
And I never said it's "not pipelined" - it's just that the PS and VS aren't pipelined.

VS is probably the single largest change from their previous generations; clearly that was developed in near isolation from the PS pipeline, otherwise they probably would both have had the same capabilities. However, an array of processors does not necessarily mean that they are not pipelined; they may be very widely pipelined. Take a good read of the P10 pipeline article.

The overall structure *is* still a pipeline ( Input->VS->Triangle Setup->Rasterization->PS->Output )

This is not in dispute.

I'm also NOT saying, as you once seemed to understand, that the NV3x only works on one pixel at once very fast. It does work on 4 pixels, or 8 pixels - practically, it *emulates* as many pipelines as it wishes to ( with of course some limitations, being able to emulate 128 pipelines is, err, unlikely ). That means making it look like a 2x2 configuration is really no problem I believe.
Yes, yes, I know, I'm gonna have trouble proving my claims here... And thus, I agree not to insist too much on this whole story anymore until I can get an NV35 in my hands and test a few things.

And this is where it all falls to bits. I'm saying that things are much the same as the diagram I linked to earlier (although I'd dispute the location of the texture units in relation to the FP32 unit in that diagram, but that's another issue).

You are introducing a level of complexity into the understanding here that just isn't required or needed, and would probably be ungodly to manage. Fundamentally the PS pipeline architecture does not appear to have changed significantly from NV25; what they have done is extend the functionality of the units to include the new shader processing functionality. If we compare NV25 to NV30 the actual pipeline arrangement isn't significantly different - they both had floating point texture address units (but NV30's is expanded now for FP processing) and they both had integer units further down. What's missing from the diagram is the register combiners that can combine the pipelines at the back end.

Your argument that it could be shifting the number of pipelines around earlier in the pipe than the register combiners surely must fall over, given the dependence on the PS units also doing the texture addressing.

BTW, in reference to your link...
http://www.tt-hardware.com/img/divers01/geffx_5600.gif
Would it really be optimal to do that if it was traditional pipelining? Accessing the units of another pipeline like that? Seems possible, but rather icky in a traditional design.

But that's old technology - that's exactly how TNT/TNT2/VSA100 did multitexturing - why would it be any more difficult now?
 
For NV35, pretty much scrap those 2 fx12 units per pipe, and replace them with a 128-bit fp32 one.

On a separate, but related note, does anyone have an idea as to how a command processor issues instructions to 4 pipelines with 2 units each; is it the same as issuing instructions to 8 discrete units?
 
I think Dave is right.

NV3x has pixel pipelines. I haven't seen any proof of Uttar's theory. I've tested a lot of different shaders to highlight the different units. NV30/35 has 4 pipelines. They can't have a throughput of more than 4 pixels in any case. Of course, they work on many pixels at the same time to hide latency (that is what "pipeline" means) and to reduce the latency of complex shaders.

(Dave -> I'm not sure of the location of the different units, of course. But it doesn't really matter.)

I still haven't tried a GeForce FX 5900, so I don't really know what changed in NV35. But I'm sure it's still capable of 2 texture ops/cycle/pipeline AND 2 FX12 (or 4 FX12 mul) ops/clock/pipeline.


I made a mistake in my diagrams. The NV34 diagrams are wrong. I'll correct them. NV34 has 2 full pipelines with 1 FP/2 tex units and 2 FX12 units (with double mul ability). When there's no math, it can work with 4 pipelines. But this possibility doesn't seem very efficient in the tests I made.


Why do I think NV3x must have pipelines? NV3x would be very inefficient if that were not the case. I tested everything I could and every test says it's a fixed architecture. Of course NV3x could have some units partly shared with other units.


I also think that NV30 doesn't differ that much from NV25/28. PS testing with NV30 and NV25 at the same frequency should show that. When you use PS 1.1, NV25 and NV30 use the same pipeline architecture (with a little advantage to NV30 when no texturing is used).
NV25's 4 pipelines have: 2 texturing units and 2 FX9 (I've not tested if it's FX9 or another precision) units (double mul able)
NV30's 4 pipelines have: 2 texturing units (or one FP/FX instruction) and 2 FX12 units (double mul able)

The really big change is the VS. More units and more complex ones. And the vertex buffer seems to be bigger.

The big problem of NV3x is the register problem. The no-perf-drop limit is 2 registers... which is the PS1.1 limit! We're far from DX9's 12 registers, DX9+'s 32 registers and DX8.1's 6 registers. Of course, they can use 4 FP16 registers (and 4 FX12 I assume). It seems like NVIDIA targeted DX8-style register use when they designed NV3x.


Here are my first results (I did some tests very quickly on ATI boards, I'll do more when I have time):

NV25 : 4 pipelines : 2 text units + 2 FX9 units (with double mul ability)
NV30 : 4 pipelines : 2 text units or 1 FP32 unit + 2 FX12 units (with double mul ability)
NV34 : 2 pipelines : 2 text units or 1 FP32 unit + 2 FX12 units (with double mul ability)
R350 : 8 pipelines : 1 text unit + 1 FP24 unit + 1 FP24-just-MUL unit (MOV seems free)
R200 : 4 pipelines : 2 text units + 2 FX16 units
RV250 : 4 pipelines : 1 text unit + 1 FX16 unit
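To make the list above easier to compare, it can be transcribed into a small C table. This is a sketch only: the struct layout and function names are invented for illustration, the fixed-point precisions are left out, and R350's extra MUL-only unit is noted in a comment rather than modelled.

```c
#include <stddef.h>
#include <string.h>

/* Per-pipeline unit counts transcribed from the list above.
   Field names are illustrative, not from any vendor documentation. */
struct pixel_arch {
    const char *name;
    int pipelines;
    int tex_units;      /* texture units per pipeline */
    int fp_units;       /* full fp units per pipeline (0 = fixed-point part) */
    int fx_units;       /* fixed-point units per pipeline */
    int tex_shares_fp;  /* 1 if listed as "2 text units or 1 FP32 unit" */
};

static const struct pixel_arch archs[] = {
    {"NV25",  4, 2, 0, 2, 0},
    {"NV30",  4, 2, 1, 2, 1},
    {"NV34",  2, 2, 1, 2, 1},
    {"R350",  8, 1, 1, 0, 0},  /* plus one FP24 MUL-only unit, not modelled */
    {"R200",  4, 2, 0, 2, 0},
    {"RV250", 4, 1, 0, 1, 0},
};

/* Look up a chip by name; returns NULL if it is not in the table. */
const struct pixel_arch *find_arch(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof archs / sizeof archs[0]; i++)
        if (strcmp(archs[i].name, name) == 0)
            return &archs[i];
    return NULL;
}

/* Peak texture ops per clock: pipelines times texture units per pipeline. */
int peak_tex_ops_per_clock(const struct pixel_arch *a)
{
    return a->pipelines * a->tex_units;
}
```

With these numbers, NV30 and R350 both come out at 8 texture ops per clock, by different routes: 4 pipelines with 2 units each versus 8 pipelines with 1 unit each.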

I have strange results with R350/R300. It seems able to do one MUL for free with every instruction. And MOV seems free. I'll do more testing when I find time.

[edit : precision of NV25/R200/RV250. thanks hyp-x :) ]
 
Luminescent said:
Very nice K.i.l.e.r. This thread, however, is for the nerds who bother to find needles in a haystack. :)

What you are trying to find out is fruitless without the proper NV3x design notes.

See this thread why it is fruitless.
 
Hi Tridam,

If the NV30 is -
4 pipelines : 2 text units or 1 FP32 unit + 2 FX12 units (with double mul ability)

And the NV34 is -
2 pipelines : 2 text units or 1 FP32 unit + 2 FX12 units (with double mul ability)

What is the NV31?
 