NVidia Z Perf

Dark Helmet

Newcomer
On a GeForce FX 5950 Ultra with a stock ortho projection, does anyone know why the performance of this loop, which renders unoccluded screen-filling polygons, varies by 20% if you just swap the comments to change which z values are used (-512, -511, ... -257 versus -64, -63.875, ... -32.125)?

Code:
  glOrtho        ( 0, HRES, 0, VRES, 0, 1000 ) ;
  ...
  for ( i = 0 ; i < 256 ; i++ )
  {
    float z = -512 + i;           /* z steps of 1:   -512 .. -257   */
    //float z = -64 + i / 8.0;    /* z steps of 1/8: -64 .. -32.125 */
    glTexCoord2f(0,0);glVertex3f ( 0   , 0   , z ) ;
    glTexCoord2f(1,0);glVertex3f ( HRES, 0   , z ) ;
    glTexCoord2f(1,1);glVertex3f ( HRES, VRES, z ) ;
    glTexCoord2f(0,1);glVertex3f ( 0   , VRES, z ) ;
  }

This is with a 24-bit depth buffer cleared to 1, depth range set to 0..1, and depth testing is on. I'm at a loss to explain this one.
 
"Hierarchical Z" I think. I've seen this sort of performance difference depending on Z, too, on all sorts of different chips.
 
I think this has to do with the lower precision of Hierarchical Z, though I thought NVidia didn't use HiZ. zeckensack seems pretty sure about it.

AFAIK, ATI uses 8-bit for HiZ, which is why they say to keep the near and far planes as close as possible for best results in their performance guide.
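Roughly the idea, as I understand it (the tile granularity, the 8-bit quantisation and the names below are my assumptions for illustration, not the actual hardware): a small per-tile summary of the z-buffer lets whole tiles be rejected without ever reading the real 24-bit buffer.
Code:
  /* Sketch of a coarse-z reject, assuming an 8-bit per-tile summary.
     Not the real hardware, just the general scheme.                  */
  #define COARSE_STEPS 256      /* 8-bit coarse z over the 0..1 range */

  unsigned char coarse ( float depth01 )           /* depth in 0..1   */
  {
    return (unsigned char)( depth01 * ( COARSE_STEPS - 1 ) ) ;
  }

  /* tile_max = quantized farthest depth already written in the tile.
     For a GL_LESS test, an incoming primitive whose nearest depth is
     clearly behind that can be thrown away without touching the
     z-buffer at all.                                                  */
  int reject_tile ( unsigned char tile_max, float incoming_min01 )
  {
    return coarse ( incoming_min01 ) > tile_max ;
  }

  /* Surfaces closer together than one coarse step (1/256 of the
     near..far range) are ambiguous here and fall through to the full
     24-bit test, which is why ATI tell you to pull near/far in tight. */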
 
Well, for one, don't ever set the near range to zero. That's bad. (Depth error depends upon the ratio of the far plane to the near plane: if the near plane is set to zero, the error diverges).
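To put a number on that (a standalone illustration, not part of the snippet above): a glFrustum-style projection maps a point at eye-space distance dist to window depth f/(f-n) * (1 - n/dist), so as n goes to zero every distance ends up at roughly the same depth.
Code:
  /* Window-space depth (0..1) that a glFrustum-style projection gives
     a point at eye-space distance 'dist', for near/far planes n and f.
     Illustration only; an ortho projection maps depth linearly instead. */
  float frustum_depth ( float n, float f, float dist )
  {
    return f / ( f - n ) * ( 1.0f - n / dist ) ;
  }

  /* n = 1, f = 1000 : dist 10 -> ~0.901, dist 20 -> ~0.951
     n -> 0          : the n/dist term vanishes and every dist maps to
                       roughly 1.0, so no depth precision is left.      */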

Secondly, it tends to be dangerous to rely upon your compiler to properly cast the expression:
i / 8.0

...This may end up always returning integer values, even if the "i" is supposed to be cast to a float first.
 
Chalnoth said:
Secondly, it tends to be dangerous to rely upon your compiler to properly cast the expression:
i / 8.0

...This may end up always returning integer values, even if the "i" is supposed to be cast to a float first.
Which compiler does this wrong?
 
All I know is that I occasionally have problems with automatic casting, so I simply do all my casting manually.
 
Mintmaster said:
I think this has to do with the lower precision of Hierarchical Z, though I thought NVidia didn't use HiZ. zeckensack seems pretty sure about it.

Last time I asked, they stated it wasn't HiZ. It performs a similar task and is on-chip, but it's organised differently. That's the best I've got so far.
 
Not sure if it's really HierZ. That's my best guess.
However, I am sure that it's not GeforceFX specific. I've seen this behaviour on Radeon 8500/9200/9500Pro, on GeforceFX5700 and on a DeltaChrome S8. To be more precise, z rejection performance depends on the distance from occluder to incoming fragment.

Chalnoth said:
Well, for one, don't ever set the near range to zero. That's bad. (Depth error depends upon the ratio of the far plane to the near plane: if the near plane is set to zero, the error diverges).
True for frustum type projections, but irrelevant for ortho projections.
 
Chalnoth said:
Well, for one, don't ever set the near range to zero. That's bad. (Depth error depends upon the ratio of the far plane to the near plane: if the near plane is set to zero, the error diverges).

Secondly, it tends to be dangerous to rely upon your compiler to properly cast the expression:
i / 8.0

...This may end up always returning integer values, even if the "i" is supposed to be cast to a float first.

Chalnoth, this is an ortho projection here; z is linearly mapped.
 
DaveBaumann said:
Mintmaster said:
I think this has to do with the lower precision of Hierarchical Z, though I thought NVidia didn't use HiZ. zeckensack seems pretty sure about it.

Last time I asked, they stated it wasn't HiZ. It performs a similar task and is on-chip, but it's organised differently. That's the best I've got so far.

Yep that's pretty accurate ;)

It is possible on NV2X or later to reject based on Z with no external memory references. When this can happen and when it can't is not obvious and depends on a whole host of variables.
 
Chalnoth said:
All I know is that I occasionally have problems with automatic casting, so I simply do all my casting manually.
Are you sure it wasn't something else you were doing? That code was perfectly OK (although I would code it as i * (1.0f/8.0f) :) )
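For what it's worth, the C promotion rules agree with that. A quick standalone check (the values are just examples):
Code:
  int    i = 3 ;
  double a = i / 8.0 ;         /* 8.0 is a double, so i is promoted: 0.375 */
  double b = i / 8   ;         /* both operands int: integer division, 0.0 */
  float  z = -64 + i / 8.0 ;   /* the thread's expression: -63.625         */
  float  w = -64 + i * (1.0f/8.0f) ;  /* same value, stays single precision */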
 
Simon F said:
Are you sure it wasn't something else you were doing? That code was perfectly OK (although I would code it as i * (1.0f/8.0f) :) )
Fairly sure. This is exactly the code between the timer snapshots:
Code:
  glBegin ( GL_QUADS ) ;
  for ( i = 0 ; i < 256 ; i++ )
  {
    //float z = -512 + i;
    float z = -64 + i / 8.0;
    glVertex3f ( 0   ,    0, z ) ;
    glVertex3f ( HRES,    0, z ) ;
    glVertex3f ( HRES, VRES, z ) ;
    glVertex3f ( 0   , VRES, z ) ;
  }
  glEnd () ;
  glFlush () ; glFinish () ;
The only thing changing is the Zs. There is also a glFlush/glFinish right before we start the timer above the loop, and right before that we glClear the depth and color buffers.

Thanks for all the responses. This one is apparently more subtle than I thought.
 
Dark Helmet said:
Simon F said:
Are you sure it wasn't something else you were doing? That code was perfectly OK (although I would code it as i * (1.0f/8.0f) :) )
Fairly sure. This is exactly the code between the timer snapshots:
Oh, I was answering Chalnoth! It was in response to his concern about the implicit typecasting.
 
zeckensack said:
To be more precise, z rejection performance depends on the distance from occluder to incoming fragment.
I've run a bunch of tests, and in this case it has nothing to do with where in Z these polys (stacked closer and closer to the viewer) are located but rather how closely together they're bunched.

That is, change the spacing from 1 to 1/2.0, 1/4.0, 1/8.0, ... 1/128.0 and you see a gradual 20% performance improvement. For any given spacing, try -999, -512, or -256 as the base Z value and there is no change in performance.

Sure wish I knew what was going on here.
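In case anyone wants to reproduce it, the sweep looks roughly like this (a sketch only: the spacing table, base value and timer placement are paraphrased, the rest matches the earlier snippet):
Code:
  float spacing[] = { 1.0f, 1/2.0f, 1/4.0f, 1/8.0f,
                      1/16.0f, 1/32.0f, 1/64.0f, 1/128.0f } ;
  float base = -512 ;           /* -999 or -256 time exactly the same */
  int   i, s ;

  for ( s = 0 ; s < 8 ; s++ )
  {
    /* glClear color+depth, glFlush/glFinish, start timer here...     */
    glBegin ( GL_QUADS ) ;
    for ( i = 0 ; i < 256 ; i++ )
    {
      float z = base + i * spacing[s] ;
      glVertex3f ( 0   ,    0, z ) ;
      glVertex3f ( HRES,    0, z ) ;
      glVertex3f ( HRES, VRES, z ) ;
      glVertex3f ( 0   , VRES, z ) ;
    }
    glEnd () ;
    glFlush () ; glFinish () ;
    /* ...stop timer: the 1/128 spacing comes out ~20% faster than 1  */
  }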
 
Is performance higher when they're closer together or further apart?

I could easily explain a performance increase if they're further apart, as that would indicate that there is some low-precision buffer that is used in addition to the z-buffer for certain approximations. The further apart the surfaces are, the more often the hardware can rule out writing before going to the full-precision z-buffer.
 
Dark Helmet said:
zeckensack said:
To be more precise, z rejection performance depends on the distance from occluder to incoming fragment.
I've run a bunch of tests, and in this case it has nothing to do with where in Z these polys (stacked closer and closer to the viewer) are located but rather how closely together they're bunched.
Yup, that's why I mentioned precision and 8-bit z-values in an on-chip buffer. I really didn't know NVidia did this, but apparently they do something similar to HiZ as Dave/ERP/zeckensack mentioned. They may just store a value alongside their z-compression flags for each tile. (Yes, I'm making assumptions about NV architecture)

ATI's optimization document:
Last but not least, for the highest Hierarchical Z efficiency place near and far clipping planes to enclose scene geometry as tightly as possible,
This is analogous to what you're doing by changing the spacing.

EDIT: Sorry Chalnoth, I guess I repeated some of the stuff you were saying.
 
Mintmaster said:
Dark Helmet said:
I've run a bunch of tests, and in this case it has nothing to do with where in Z these polys (stacked closer and closer to the viewer) are located but rather how closely together they're bunched.
Yup, that's why I mentioned precision and 8-bit z-values in an on-chip buffer. I really didn't know NVidia did this, but apparently they do something similar to HiZ as Dave/ERP/zeckensack mentioned. They may just store a value alongside their z-compression flags for each tile. (Yes, I'm making assumptions about NV architecture)
Hmmm. Ok.

ATI's optimization document:
Last but not least, for the highest Hierarchical Z efficiency place near and far clipping planes to enclose scene geometry as tightly as possible,
This is analogous to what you're doing by changing the spacing.
Really? It seems to be the opposite of what I'm doing. Pushing the clip planes closer together effectively increases the distance between two Z values in clip space (0..1). What I'm doing is reducing the distance between the slices in clip space to increase performance (i.e. leaving the clip planes alone and pushing the slices closer together). Am I missing something?
 
Closer? Hmmm, that is strange. I suppose it could be due to nVidia assuming that most optimizations would be required when geometry was close, not when it was far away, and so decided to store 1/z (for example) in a low-precision buffer.
 