NVidia Z Perf

Dark Helmet

Newcomer
On a GeForce FX 5950 Ultra with a stock ortho projection, does anyone know why the performance of this loop, which renders unoccluded screen-filling polygons, varies by 20% if you just swap the comments to change which z values are used (-512, -511, ... -257 versus -64, -63.875, ... -32.125)?

Code:
  glOrtho        ( 0, HRES, 0, VRES, 0, 1000 ) ;
  ...
  for ( i = 0 ; i < 256 ; i++ )
  {
    float z = -512 + i;           /* z steps of 1:   -512 .. -257   */
    //float z = -64 + i / 8.0;    /* z steps of 1/8: -64 .. -32.125 */
    glTexCoord2f(0,0);glVertex3f ( 0   , 0   , z ) ;
    glTexCoord2f(1,0);glVertex3f ( HRES, 0   , z ) ;
    glTexCoord2f(1,1);glVertex3f ( HRES, VRES, z ) ;
    glTexCoord2f(0,1);glVertex3f ( 0   , VRES, z ) ;
  }

This is with a 24-bit depth buffer cleared to 1, depth range set to 0..1, and depth testing is on. I'm at a loss to explain this one.
 
"Hierarchical Z" I think. I've seen this sort of performance difference depending on Z, too, on all sorts of different chips.
 
I think this has to do with the lower precision of Hierarchical Z, though I thought NVidia didn't use HiZ. zeckensack seems pretty sure about it.

AFAIK, ATI uses 8-bit for HiZ, which is why they say to keep the near and far planes as close as possible for best results in their performance guide.
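Roughly the idea, as I understand it (the tile granularity, the 8-bit quantisation and the names below are my assumptions for illustration, not the actual hardware): a small per-tile summary of the z-buffer lets whole tiles be rejected without ever reading the real 24-bit buffer.
Code:
  /* Sketch of a coarse-z reject, assuming an 8-bit per-tile summary.
     Not the real hardware, just the general scheme.                  */
  #define COARSE_STEPS 256      /* 8-bit coarse z over the 0..1 range */

  unsigned char coarse ( float depth01 )           /* depth in 0..1   */
  {
    return (unsigned char)( depth01 * ( COARSE_STEPS - 1 ) ) ;
  }

  /* tile_max = quantized farthest depth already written in the tile.
     For a GL_LESS test, an incoming primitive whose nearest depth is
     clearly behind that can be thrown away without touching the
     z-buffer at all.                                                  */
  int reject_tile ( unsigned char tile_max, float incoming_min01 )
  {
    return coarse ( incoming_min01 ) > tile_max ;
  }

  /* Surfaces closer together than one coarse step (1/256 of the
     near..far range) are ambiguous here and fall through to the full
     24-bit test, which is why ATI tell you to pull near/far in tight. */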
 
Well, for one, don't ever set the near range to zero. That's bad. (Depth error depends upon the ratio of the far plane to the near plane: if the near plane is set to zero, the error diverges).
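To put a number on that (a standalone illustration, not part of the snippet above): a glFrustum-style projection maps a point at eye-space distance dist to window depth f/(f-n) * (1 - n/dist), so as n goes to zero every distance ends up at roughly the same depth.
Code:
  /* Window-space depth (0..1) that a glFrustum-style projection gives
     a point at eye-space distance 'dist', for near/far planes n and f.
     Illustration only; an ortho projection maps depth linearly instead. */
  float frustum_depth ( float n, float f, float dist )
  {
    return f / ( f - n ) * ( 1.0f - n / dist ) ;
  }

  /* n = 1, f = 1000 : dist 10 -> ~0.901, dist 20 -> ~0.951
     n -> 0          : the n/dist term vanishes and every dist maps to
                       roughly 1.0, so no depth precision is left.      */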

Secondly, it tends to be dangerous to rely upon your compiler to properly cast the expression:
i / 8.0

...This may end up always returning integer values, even if the "i" is supposed to be cast to a float first.
 
Chalnoth said:
Secondly, it tends to be dangerous to rely upon your compiler to properly cast the expression:
i / 8.0

...This may end up always returning integer values, even if the "i" is supposed to be cast to a float first.
Which compiler does this wrong?
 
All I know is that I occasionally have problems with automatic casting, so I simply do all my casting manually.
 
Mintmaster said:
I think this has to do with the lower precision of Hierarchical Z, though I thought NVidia didn't use HiZ. zeckensack seems pretty sure about it.

Last time I asked, they stated it wasn't HiZ. It performs a similar task and is on-chip, but it's organised differently. That's the best I've got so far.
 
Not sure if it's really HierZ. That's my best guess.
However, I am sure that it's not GeforceFX specific. I've seen this behaviour on Radeon 8500/9200/9500Pro, on GeforceFX5700 and on a DeltaChrome S8. To be more precise, z rejection performance depends on the distance from occluder to incoming fragment.

Chalnoth said:
Well, for one, don't ever set the near range to zero. That's bad. (Depth error depends upon the ratio of the far plane to the near plane: if the near plane is set to zero, the error diverges).
True for frustum type projections, but irrelevant for ortho projections.
 
Chalnoth said:
Well, for one, don't ever set the near range to zero. That's bad. (Depth error depends upon the ratio of the far plane to the near plane: if the near plane is set to zero, the error diverges).

Secondly, it tends to be dangerous to rely upon your compiler to properly cast the expression:
i / 8.0

...This may end up always returning integer values, even if the "i" is supposed to be cast to a float first.

Chalnoth, this is an ortho projection here; z is linearly mapped.
 
DaveBaumann said:
Mintmaster said:
I think this has to do with the lower precision of Hierarchical Z, though I thought NVidia didn't use HiZ. zeckensack seems pretty sure about it.

Last time I asked, they stated it wasn't HiZ. It performs a similar task and is on-chip, but it's organised differently. That's the best I've got so far.

Yep that's pretty accurate ;)

It is possible on NV2X or later to reject based on Z with no external memory references. When this can happen and when it can't is not obvious and depends on a whole host of variables.
 
Chalnoth said:
All I know is that I occasionally have problems with automatic casting, so I simply do all my casting manually.
Are you sure it wasn't something else you were doing? That code was perfectly OK (although I would code it as i * (1.0f/8.0f) :) )
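For what it's worth, the C promotion rules agree with that. A quick standalone check (the values are just examples):
Code:
  int    i = 3 ;
  double a = i / 8.0 ;         /* 8.0 is a double, so i is promoted: 0.375 */
  double b = i / 8   ;         /* both operands int: integer division, 0.0 */
  float  z = -64 + i / 8.0 ;   /* the thread's expression: -63.625         */
  float  w = -64 + i * (1.0f/8.0f) ;  /* same value, stays single precision */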
 
Simon F said:
Are you sure it wasn't something else you were doing? That code was perfectly OK (although I would code it as i * (1.0f/8.0f) :) )
Fairly sure. This is exactly the code between the timer snapshots:
Code:
  glBegin ( GL_QUADS ) ;
  for ( i = 0 ; i < 256 ; i++ )
  {
    //float z = -512 + i;
    float z = -64 + i / 8.0;
    glVertex3f ( 0   ,    0, z ) ;
    glVertex3f ( HRES,    0, z ) ;
    glVertex3f ( HRES, VRES, z ) ;
    glVertex3f ( 0   , VRES, z ) ;
  }
  glEnd () ;
  glFlush () ; glFinish () ;
The only thing changing is the Zs. There is also a glFlush/glFinish right before we start the timer above the loop, and right before that we glClear the depth and color buffers.

Thanks for all the responses. This one is apparently more subtle than I thought.
 
Dark Helmet said:
Simon F said:
Are you sure it wasn't something else you were doing? That code was perfectly OK (although I would code it as i * (1.0f/8.0f) :) )
Fairly sure. This is exactly the code between the timer snapshots:
Oh, I was answering Chalnoth! It was in response to his concern about the implicit typecasting.
 
zeckensack said:
To be more precise, z rejection performance depends on the distance from occluder to incoming fragment.
I've run a bunch of tests, and in this case it has nothing to do with where in Z these polys (stacked closer and closer to the viewer) are located but rather how closely together they're bunched.

That is, change the spacing from 1 to 1/2.0, 1/4.0, 1/8.0, ... 1/128.0 and you see a gradual 20% performance improvement. For any given spacing, try -999, -512, or -256 as the base Z value and there is no change in performance.

Sure wish I knew what was going on here.
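In case anyone wants to reproduce it, the sweep looks roughly like this (a sketch only: the spacing table, base value and timer placement are paraphrased, the rest matches the earlier snippet):
Code:
  float spacing[] = { 1.0f, 1/2.0f, 1/4.0f, 1/8.0f,
                      1/16.0f, 1/32.0f, 1/64.0f, 1/128.0f } ;
  float base = -512 ;           /* -999 or -256 time exactly the same */
  int   i, s ;

  for ( s = 0 ; s < 8 ; s++ )
  {
    /* glClear color+depth, glFlush/glFinish, start timer here...     */
    glBegin ( GL_QUADS ) ;
    for ( i = 0 ; i < 256 ; i++ )
    {
      float z = base + i * spacing[s] ;
      glVertex3f ( 0   ,    0, z ) ;
      glVertex3f ( HRES,    0, z ) ;
      glVertex3f ( HRES, VRES, z ) ;
      glVertex3f ( 0   , VRES, z ) ;
    }
    glEnd () ;
    glFlush () ; glFinish () ;
    /* ...stop timer: the 1/128 spacing comes out ~20% faster than 1  */
  }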
 
Is performance higher when they're closer together or further apart?

I could easily explain a performance increase if they're further apart, as that would indicate that there is some low-precision buffer that is used in addition to the z-buffer for certain approximations. The further apart the surfaces are, the more often the hardware can rule out writing before going to the full-precision z-buffer.
 
Dark Helmet said:
zeckensack said:
To be more precise, z rejection performance depends on the distance from occluder to incoming fragment.
I've run a bunch of tests, and in this case it has nothing to do with where in Z these polys (stacked closer and closer to the viewer) are located but rather how closely together they're bunched.
Yup, that's why I mentioned precision and 8-bit z-values in an on-chip buffer. I really didn't know NVidia did this, but apparently they do something similar to HiZ as Dave/ERP/zeckensack mentioned. They may just store a value alongside their z-compression flags for each tile. (Yes, I'm making assumptions about NV architecture)

ATI's optimization document:
Last but not least, for the highest Hierarchical Z efficiency place near and far clipping planes to enclose scene geometry as tightly as possible,
This is analogous to what you're doing by changing the spacing.

EDIT: Sorry Chalnoth, I guess I repeated some of the stuff you were saying.
 
Mintmaster said:
Dark Helmet said:
I've run a bunch of tests, and in this case it has nothing to do with where in Z these polys (stacked closer and closer to the viewer) are located but rather how closely together they're bunched.
Yup, that's why I mentioned precision and 8-bit z-values in an on-chip buffer. I really didn't know NVidia did this, but apparently they do something similar to HiZ as Dave/ERP/zeckensack mentioned. They may just store a value alongside their z-compression flags for each tile. (Yes, I'm making assumptions about NV architecture)
Hmmm. Ok.

ATI's optimization document:
Last but not least, for the highest Hierarchical Z efficiency place near and far clipping planes to enclose scene geometry as tightly as possible,
This is analogous to what you're doing by changing the spacing.
Really? It seems to be the opposite of what I'm doing. Pushing the clip planes closer together effectively increases the distance between two Z values in clip space (0..1). What I'm doing is reducing the distance between the slices in clip space to increase performance (i.e. leaving the clip planes alone and pushing the slices closer together). Am I missing something?
 
Closer? Hmmm, that is strange. I suppose it could be due to nVidia assuming that most optimizations would be required when geometry was close, not when it was far away, and so decided to store 1/z (for example) in a low-precision buffer.
 