Tom's Hardware: GeForce 5200 is 2x2?

Luminescent said:
Hmm...I don't understand how your reasoning relates to pocketmoon's Cg results at all, at this time.

Demalion, I'm sorry if I didn't make it clear enough, but my "reasoning" was given as the second and main point in my response to address this doubt:
It seems you are implying your comments address fp32 performance, and I'm not clear on why you propose they do.
My main reasoning behind the fp16 and fp32 performance delta has nothing to do with pocketmoon (that was extra info which seems to support my point somewhat).

But I still don't understand your reasoning, since pocketmoon's shaders seem to illustrate the opposite of what you seem to be saying they do...that is what I was addressing in the main body of my text. I addressed that in detail because you mentioned it and I don't understand your interpretation of it.

I put in my $.02 of reasoning explaining why the pixel shader 2.0 3DMark bench would not really exploit the difference between the two fp formats. In a nutshell: it is not a really long shader (I'm assuming). A long shader, which would require a lot of data streaming and register access, would be more fit to expose the pros and cons NV3X faces when using less or more shading precision.

Hmm...well, you said: "if you take a look at pocketmoon's benchmarks...you'll find that the discrepancies between partial and full precision on the NV30 are little, if anything, under FP30 mode". I don't see this at all, and in fact see the opposite as far as I am able to understand right now.

This extends to your 3dmark 03 benchmark comments, since I'm not sure how you are relating "does not even exceed...96 instruction count" as applying to that, and not applying to the pocketmoon benchmark you mentioned by name (Shader test 4, the median filter) which seems to directly contradict my understanding of what you propose.

Again, I state that, AFAIK, "nv30 instruction count" and "dx 9 instruction count" need not necessarily correspond for the same functionality to be expressed, which seems to be the root of some of the assumptions being made.

Again, I ask, are you mistaking the numbers in the table which indicate instruction count results with fps results?
 
Ahh, I did not realize those numbers were instructions (thought they were some sort of FPmark). I should have paid more attention. :oops:

I'll take a look at pocketmoon's page and be right back.
 
I am assuming the greater the number, the better the performance. Therefore, I will cite examples straight off of pocketmoon's page and use them to back up my previous statements:

-For the first benchmark, NV30 gets a rating of 169 for fp32 and 159 for fp16 (in FP30 mode). This mark shows a negligible difference in system taxation between the two formats. The fp32 value was actually higher than the fp16 one.

-The second benchmark: fp32=87 while fp16=107. There is a difference, but not a major one.

-In the third benchmark, fp32 and fp16 results are almost the same: 344 and 345, respectively.

-The 24 vs. 49 result from the fourth benchmark is significant, I have no point there.

-Finally we have 236 vs. 300 (fp32 and fp16 respectively) in the last benchmark. This isn't such a large disparity; not even a 33% difference.

Then, how is it that the benchmarks refute the claim that fp16 performance is not much greater than fp32 performance in the NV3X architecture (under the same compiler)?
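
For what it's worth, the gaps quoted above can be checked with a quick calculation. This is only a minimal Python sketch: it assumes the numbers are fps (higher is better), it uses nothing but the figures cited in this post, and the names `results` and `fp16_gain_percent` are made up for illustration.

```python
# fp32 and fp16 scores as quoted in this post (NV30, FP30 mode).
# Assumption: the scores are frames per second, so higher is better.
results = {
    1: (169, 159),
    2: (87, 107),
    3: (344, 345),
    4: (24, 49),
    5: (236, 300),
}


def fp16_gain_percent(fp32, fp16):
    """Percentage by which the fp16 score exceeds the fp32 score."""
    return (fp16 - fp32) / fp32 * 100.0


# Map each shader test number to its relative fp16 gain.
gains = {test: fp16_gain_percent(*scores) for test, scores in results.items()}

for test, gain in gains.items():
    print(f"Shader test {test}: fp16 vs fp32 = {gain:+.1f}%")
```

Under that assumption the gaps work out to roughly -6%, +23%, +0.3%, +104% and +27% for tests 1 through 5, so only the fourth test shows fp16 running at about twice the fp32 rate.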
 
Luminescent said:
I'm confused now, you seem to indicate that the marks are given in terms of instruction accesses? What exactly does each mark measure?

The bar graphs are fps, and are performance metrics.

The table after Shader test 4 is instruction count of the output as generated by Cg and DX 9 HLSL.

A good place for discussing shader instruction count differences, but a bad place for presenting the instruction count results for all the tests with such ambiguous labelling.

My comments about "instruction execution speeds" were in the nature of trying to make sense of what metric you were using in relation to pocketmoon's benchmarks when you said: "There are some cases (benchmark 4 in pocketmoon's cg suite) where DX9 performance is nowhere near NV3X's full capabilities." In that benchmark, DX 9 HLSL outperforms all floating point precision nv30 paths, and ties the fixed result.
 
What I didn't realize before was that PS 2X is just a label for the pixel shader output generated by the Cg compiler; PS2A was the result given under HLSL. I thought PS 2X meant HLSL with full float, PS 2A with half float and so on.

Why did pocketmoon only provide the instruction counts for test 4? It would be a nicer measure if we had them for all the benchmarks.

Observing these results once more, I still believe that, under the same compiler, there is not a major difference between full and half float precision. Whether or not NV30 DX9 drivers are forcing full or half float is another story, but I believe it would not make a significant performance difference. You may argue that Nvidia disables fp32 for a reason (if they do, in fact), but performance is already low with fp16. Imagine the psychological damage which would result from observing benchmarks made using fp32 (even lower performance, but not by much, I claim).

I retract, however, my theory that Cg output is more streamlined for NV3X than HLSL. If 3DMark03 was indeed compiled with HLSL, I see how it could remain faster even though Cg is more optimized. If a shader uses significantly fewer instructions than another, with similar operations, it will most definitely be faster.
 
Luminescent said:
...
Why did pocketmoon only provide the instruction counts for test 4? It would be a nicer measure if we had them for all the benchmarks.
...

The table has instruction counts for all the shader tests. That is why it is such an odd place for it (i.e., before Shader 5 is discussed).

What is missing is a link to what the instruction outputs actually were.

Observing this, I still say that, under the same compiler, there is not a major difference between full and half float precision.

And I still don't understand this at all, but maybe reading this again now will clarify what I was trying to discuss regarding that comment.
 
Demalion, I previously failed to see the data analysis provided in the comment you referred me to in the above post. Now I see what you are saying. Basically, the only mark which is truly representative of fp16 v. fp32 would be the fourth, because it uses fewer texture reads and more math ops. I see your point. 8)

Forgive my ignorance, but now that I see this, does it completely rule out the possibility of single cycle fp32 in NV3X, or does it indicate that there is only a greater latency (and not execution) penalty for it? According to pocketmoon's findings, it seems to be the latter. I'll refer you to pocketmoon and tED's posts here.
 
Dave H:
But, having said that, calculating a texture address involves FP math and so it's quite plausible that they are indeed using the same ALUs for texture addressing that they use for FP shader ops. This is the only reason I can think of why the restriction would be in place, particularly because it is said not to exist for texture addressing + int shader ops.
What about int shader ops + fp shader ops? According to this info, if the texture and shader ALUs are assumed to be the same, wouldn't the NV30 be theoretically capable of 8 fp16/fp32 ops per clock when texture accesses are not required? If so, why does unclesam write this? He states (implicitly referencing NV30) that it can only accomplish 4[4] fp16/fp32 ops per cycle. Could it be that shader commands are more difficult to issue per clock?
 
About 2x2 / 4x1 configuration of NV34/NV31, here is some data to help out (not sure if anyone pointed it out yet):

http://www.digit-life.com/articles2/gffx/nv31-nv34.html

Some unusual behaviour, but for the most part it matches what was said in this thread, no doubt due to UncleSam's input (I guess he worked on the article). NV34 is 2x2 always, NV31 can be 4x1 for single or no texturing.

EDIT: Typo
 
Yikes! An avalanche of graphs with an abundance of data with an absence of coherency!

What is the reasoning behind the color/hatching schemes? There doesn't seem to be any coherent scheme to facilitate the user's comprehension of the data, and the wording of some conclusions is anemic and not related well to the data being discussed. For example, even with the ample amount of data, there still seem to be some gaping holes in the aniso data that aren't addressed by the analyses. Also, a possible effect of aggressive shader optimizations is recognized at one point, but no analysis of how that could relate, or not relate, to results is provided anywhere else.

But...for a reference for data to be analyzed for this discussion, it is, as usual, a good resource if you can spend the time sifting through everything, though I really have no idea why they actually seem to try and make the data hard to analyze.
 
Posted: Fri Mar 14, 2003 10:39 am
About 2x2 / 4x1 configuration of NV34/NV31, here is some data to help out (not sure if anyone pointed it out yet):

http://www.digit-life.com/articles2/gffx/nv31-nv34.html

Those data do indeed strongly imply that NV34 is 2x2.

However, the data from hardware.fr, Tom's, and Extreme Tech indicate that it is 4x1.

Interestingly, the benches at ET show a huge difference between 3DMark01 and 03 single texture fillrate; the 03 tests indicate a potential 2x2 but the 01 tests prove it's 4x1. What could be the cause of this? I dunno, but I wonder if it has something to do with NV34's lack of z-compression. This shouldn't matter for the 3DMark01 fillrate test, which AFAICT doesn't use the z-buffer. Perhaps the 3DMark03 fillrate test does? Just an idea.

So, there are definitely some interesting questions here, but it seems impossible to claim NV34 is 2x2 when it is obviously achieving >3 pixels per clock in some of these single-textured fillrate tests. I suppose it is plausible that there is some sort of "2x2 mode" that is for some reason being triggered by the 3DMark03 fillrate test but not by the 3DMark01 test. But it seems more likely to me that there's a simpler explanation.
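
To spell out the arithmetic behind the ">3 pixels per clock" argument, here is a rough Python sketch. The core clock and fillrate below are hypothetical placeholders, not numbers from any of the reviews mentioned above, and the helper names are invented for illustration. The only assumption about the hardware is that a chip with N pixel pipelines cannot write more than N pixels per clock.

```python
def pixels_per_clock(fillrate_mpixels_per_s, core_clock_mhz):
    """Average pixels written per core clock cycle."""
    return fillrate_mpixels_per_s / core_clock_mhz


def consistent_with(pipelines, measured_ppc):
    """A chip with N pixel pipelines cannot exceed N pixels per clock."""
    return measured_ppc <= pipelines


# Hypothetical inputs, NOT taken from any review: substitute the measured
# single-texture fillrate and the card's actual core clock.
core_clock_mhz = 250.0
single_texture_fillrate = 800.0  # Mpixels/s

ppc = pixels_per_clock(single_texture_fillrate, core_clock_mhz)
print(f"{ppc:.2f} pixels per clock")
print("fits 2 pipelines (2x2):", consistent_with(2, ppc))
print("fits 4 pipelines (4x1):", consistent_with(4, ppc))
```

With these placeholder inputs the result is above 2 pixels per clock, which a fixed 2x2 layout could not deliver in a single-textured test; a measurement at or below 2 would fit either configuration and so would not settle the question.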
 