Weird NV40 Fillrate results

jolle · Apr 18, 2004

I ran into this yesterday; someone noted some oddities in this benchmark:
http://www.xbitlabs.com/articles/video/display/nv40_18.html

At the bottom there, with only Z writes, it gets 19800 MPixels/sec.
Since it is supposedly only running with 32 pipes in that situation, it
should be closer to 12800 MPixels/sec, but it's overshooting its own theoretical
limit.

It might just be some irregularity with the fillrate tester they used, but I
did some math about it, remembering an old rumour about virtual pipelines via VS units:

32 pipes + 16 virtual = 48 pipes, and 48 x 400 MHz = 19200 MPixels/sec, which is closer to
the result.
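The arithmetic above can be checked with a minimal Python sketch. It assumes the simplest model, where every unit writes one pixel per clock and the core runs at exactly 400 MHz; the function name and constants are illustrative, not taken from any tool.

```python
# Simplest model (an assumption): each unit writes one pixel per clock,
# so fill rate in MPixels/sec = units per clock * core clock in MHz.
def theoretical_mpixels(units_per_clock, clock_mhz):
    return units_per_clock * clock_mhz

CLOCK_MHZ = 400    # core clock used in the post
MEASURED = 19800   # Z-only figure quoted from the xbitlabs chart

for units in (32, 32 + 16):
    rate = theoretical_mpixels(units, CLOCK_MHZ)
    print(units, "units per clock ->", rate, "MPixels/sec")

# Working backwards: how many Z writes per clock would the measured figure need?
implied_units = MEASURED / CLOCK_MHZ
print("measured result implies", implied_units, "Z writes per clock")
```

Under these assumptions the measured figure works out to 49.5 writes per clock, so it sits slightly above even the 48-unit number.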

Can anyone who has any knowledge about this stuff give me some info
on what this could be? Is it just an anomaly?

EDIT
Beyond3D's preview did not show this result btw, so it might just be the
tool Xbitlabs used that is weird.
http://www.beyond3d.com/previews/nvidia/nv40/index.php?p=18

radar1200gs · Apr 18, 2004

Interesting.

But tell me, how did you arrive at 16 virtual units thru the vertex shaders? (there are 6 actual VS and 6 does not divide neatly into 16).

jolle · Apr 18, 2004

radar1200gs said:
Interesting.

But tell me, how did you arrive at 16 virtual units thru the vertex shaders? (there are 6 actual VS and 6 does not divide neatly into 16).

Just out of the air really, I have no idea how that stuff works.
I just remembered that there was some rumour about such a thing a while
back, and knowing as little about it as I do, I thought I'd go ask where there
are people that might know, hehe.

I also remembered some writing on the NV30, from Uttar I think it was,
about how it was an effort to move towards a more flexible architecture where
the different parts of the GPU could dynamically offload each other.
Seeing how NV40 is more evolved, it might not be entirely impossible, dunno.

But yeah, 16 pipes from 6 VS units sounds odd, unless it works in some way that isn't that dependent on how many units there are. No idea.

EDIT
Who knows, IF there is anything to it, maybe it's the PS units; there are 2 of them on each pipeline, aren't there?
IF 1 PS unit on each pipeline is the reason it can do 32x0, then
maybe the 2nd one can be used that way as well.
Ehm, unless 32x0 is due to something totally different, hehe, I dunno.

radar1200gs · Apr 18, 2004

There has to be something wrong with the test they performed. NV40 isn't capable of doing what you suggested.

You can see how the pixel (ROP) engine works here:
http://www.beyond3d.com/previews/nvidia/nv40/index.php?p=11

jolle · Apr 18, 2004

well, it was just shots in the dark anyway.. hehe

991060 · Apr 18, 2004

I didn't read the whole review. Did they use the same fillrate tester as Dave, or did they write one themselves?

jolle · Apr 18, 2004

No idea really, it's not the same one I guess, though.

Fillrate Tester opens our cycle of synthetic tests by measuring the scene fill-rate speed and the speed of executing pixel shaders.

Dunno if that means it's a tool called "Fillrate Tester" or if they just mean
a fillrate tester.
The other results seem to be in order; it's just the FX6800 Ultra in
Z writes that is odd.

KimB · Apr 18, 2004

Check this out from Extremetech:

The Z and Color ROP units have another very cool attribute. If a color operation is not being carried out, the scheduler can reassign that color ROP processor to be a Z ROP, thereby doubling the 6800's Z-checking performance. Because Z-depth testing is so memory-intensive, memory bandwidth may prevent the 6800 from actually achieving a 2X speedup.

Now, I think their calculation is a bit off, but here's a new one:
The 6800 is capable of 2 Z ROPs per pipeline per clock. It is capable of 1 color ROP per pipeline per clock.

So, if no color write is needed, it may be able to instead do 3 Z ROPs per pipeline per clock, for a total of 48 Z ROPs. That would coincide well with xbitlabs' results. I don't know why Marko Dolenc's fillrate tester didn't show the same.
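A rough Python sketch of the two per-pipeline readings, for comparison against the xbitlabs number. The 16 pipelines follow from "48 Z ROPs" at 3 per pipeline; the 400 MHz clock and the 19800 figure are taken from earlier in the thread, and the model names are only labels for illustration.

```python
PIPELINES = 16     # implied by 48 Z ROPs at 3 per pipeline
CLOCK_MHZ = 400    # assumed core clock, from earlier in the thread
MEASURED = 19800   # xbitlabs Z-only figure, MPixels/sec

def z_rate(z_ops_per_pipe):
    """Theoretical Z-only rate in MPixels/sec for a given per-pipeline count."""
    return PIPELINES * z_ops_per_pipe * CLOCK_MHZ

# Two candidate readings of what one pipeline can do per clock with colour off.
models = {
    "2 Z ops per pipeline": 2,
    "2 Z ROPs plus the colour ROP doing Z": 3,
}

for name, per_pipe in models.items():
    rate = z_rate(per_pipe)
    print(f"{name}: {rate} MPixels/sec, measured/theory = {MEASURED / rate:.2f}")
```

A ratio close to 1 only says the model is consistent with that one measurement; it does not rule out a problem in the test tool.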

Edit 2:
nevermind

jolle · Apr 18, 2004

Sooo.. in short, it can actually do something similar to 48 pipelines in such
situations?

A bit odd that it doesn't show in every tool, though.
I guess it could be some driver thing, at this early stage and all that.

KimB · Apr 18, 2004

jolle said:
Sooo.. in short, it can actually do something similar to 48 pipelines in such
situations?

Provided xbitlabs' results are not erroneous, yes.

Note that Dave got a different impression on how this works:

The NV40 pipeline has both a Z ROP, which does the Z writing, and a C ROP. The C ROP is a combined Z and Colour ROP. The use of the C ROP is what achieves NV2A's, NV3x's, and now NV4x's optimised Z / Stencil rendering path such that during non-colour rendering situations the C ROP can be utilised to write a second Z/Stencil value per clock cycle, but will be used for colour writes when value need to be written to the frame buffer.

....we need more testing!

jolle · Apr 18, 2004

Interesting stuff. How useful it will be depends on how games are rendered,
I guess. Is there any other engine that uses the same approach as Doom 3,
with a separate quick Z-pass?
Because if not, well, it's a good idea provided games will operate under the
conditions necessary to expose it.

And exactly how much difference would it actually make?
If the Z pass is very fast in any case, and the time spent on it is a very small
percentage of the total spent on a full frame, then the difference between doing
the Z pass in 48 pipes and the rest in 16,
and doing everything in 16, might not be a whole lot of gain, right?
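One way to put rough numbers on that question is an Amdahl's-law style estimate, sketched in Python below. The Z-pass share of frame time is purely hypothetical (the thread gives no measurements), and the 3x figure simply takes the 48 vs 16 pipes comparison at face value.

```python
def overall_speedup(z_fraction, z_speedup):
    """Whole-frame speedup when only the Z pass gets faster.

    z_fraction: share of the original frame time spent in the Z pass (0..1)
    z_speedup:  how many times faster the Z pass becomes
    """
    return 1.0 / ((1.0 - z_fraction) + z_fraction / z_speedup)

Z_SPEEDUP = 48 / 16  # Z pass on 48 pipes instead of 16, taken at face value

# Hypothetical Z-pass shares of frame time; none of these are measured.
for z_fraction in (0.05, 0.10, 0.25, 0.50):
    gain = overall_speedup(z_fraction, Z_SPEEDUP)
    print(f"Z pass = {z_fraction:.0%} of frame -> frame is {gain:.2f}x faster")
```

If the Z pass is only a small slice of the frame, the overall gain stays small no matter how fast that slice gets, which is the intuition in the question above.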

991060 · Apr 18, 2004

Chalnoth, nVIDIA should claim NV40 to be a 16/48-pipe chip if your speculation turns out to be true.

I think it's more like there's a flaw in the software used in xbitlabs's test.

jolle · Apr 18, 2004

991060 said:
Chalnoth, nVIDIA should claim NV40 to be a 16/48-pipe chip if your speculation turns out to be true.

I think it's more like there's a flaw in the software used in xbitlabs's test.

As Uttar mentioned earlier, there might be a lot of sneaky tactics and hit-and-run stuff
going on this time: NV shows off NV40 (at 400 MHz core), then releases it with
a 475 MHz core.. maybe.
It might be one of those things they're saving for after ATI's next move, and so
the dance continues.. but who knows.

But an additional 16 pipes in the Z-pass, which seems to be utilized by only one
engine atm, doesn't seem to me to carry a ton of PR value, I guess.

Arun · Apr 18, 2004

Or perhaps, 48 is reachable only in very specific cases, perhaps not even those of Doom 3; maybe when there is no Z read, just writes? Such as in an initial Z pass, but not in stencil passes thus? Or perhaps the opposite? No idea, just speculating here.

Uttar

jolle · Apr 18, 2004

Uttar said:
Or perhaps, 48 is reachable only in very specific cases, perhaps not even those of Doom 3; maybe when there is no Z read, just writes? Such as in an initial Z pass, but not in stencil passes thus? Or perhaps the opposite? No idea, just speculating here.

Uttar

Yeah, that would explain the benchmark results,
if Xbit's tool only tested while doing writes, and the tools not showing this
did both.

But what would the point of such an implementation be?
Where does this situation show up? Is it a common thing in games?
And would the speed gain in such a pass be tangible in the final performance of a game, with 16 vs 48 pipes in that pass?
Or just a small percent gain on the same card, what with the other passes
running in 16-pipe mode?

KimB · Apr 18, 2004

Uttar said:
Or perhaps, 48 is reachable only in very specific cases, perhaps not even those of Doom 3; maybe when there is no Z read, just writes? Such as in an initial Z pass, but not in stencil passes thus? Or perhaps the opposite? No idea, just speculating here.

Uttar

Well, first of all, I don't think there'd ever be a situation without z reads but with z writes.

Anyway, one possibility may be that leaving blending enabled while disabling color writes could prevent the use of the color pipelines as z-pipelines.

Regardless, what we do know is that the NV40 can use both its color and z pipelines as z pipelines. But if we take the case of FSAA, the z-pipes can each do two simultaneous z tests/writes.

So, the question then is: with FSAA disabled, can the z pipelines still do two z reads/writes per clock?

jolle · Apr 18, 2004

Without FSAA, fillrate isn't that critical, right?
So if it can't, will it matter?

The Z pass doesn't seem that critical to me, maybe it is?

EDIT
Could the 2 shaders on each pipeline be related to this in any way?

KimB · Apr 18, 2004

jolle said:
Without FSAA, fillrate isn't that critical, right?
So if it can't, will it matter?

The Z pass doesn't seem that critical to me, maybe it is?

Z-only rendering is important for stencil shadowing.

As for fillrate without FSAA, well, it is interesting, but no, it's not all that critical. It would be interesting to see the fillrate tests with FSAA enabled.

Evildeus · Apr 18, 2004

I don't know how they obtained that score. From HFR, i can see that:

jolle · Apr 18, 2004

Again, could one of the PS units, since there seem to be 2 of them per
pipeline, act like an extra pipeline when it comes to Z passes?
I need someone to poke a hole in that theory so I can move on, hehe.
