DeltaChrome Two Pass rendering scheme

Inspirer

Did anybody understand the "Advanced Deferred Rendering with a two-pass rendering scheme" feature?

read here

please don't comment on S3 in general or compare their past products to ATI's, Nvidia's or 3dfx's;
this is a question specifically about a certain technological concept.

what do they mean by "analyzing depth complexity", and why do they repeatedly refer to the Z-buffer in the plural?

is it possible that the card is dynamically allocating Z storage space in order to store multiple Z values for each pixel?
 
what I think is going on is that in the first pass they only calculate Z values for all the pixels in the scene.
instead of storing the Z values, they use them to build, for every pixel, an array of 1-bit values that determine whether the pixel is drawn (1) or discarded (0).
the order of the values in the array is the same order in which the app is sending the commands.
this way you effectively don't need a Z-buffer: you use a regular single Z-buffer in the first pass and use it to build the bit arrays,
but once the arrays are built (together they act as multiple Z-buffers) you don't need your original Z-buffer anymore, so you can ditch it.

in the second pass, when the card has to decide whether or not to render a pixel, it reads the relevant entry of the array
(indexed by the number of times the current pixel has been touched so far in the current frame).
if the entry is 0, the pixel is skipped.
if the entry is 1, the pixel is processed and blended with the currently stored pixel.
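
to make that concrete, here's a little CPU-side sketch of how I picture it (purely my own toy model in C++ - obviously not how the chip would actually implement it, and all the names are mine):

[code]
// Toy CPU model of the hypothesised two-pass scheme: pass 1 does ordinary
// Z tests but, instead of colour, appends one visibility bit per pixel per
// covering fragment; pass 2 replays the same submission order and shades
// only the fragments whose bit is set.
#include <cstddef>
#include <vector>

struct PixelHistory {
    std::vector<bool> visible;   // one bit per time this pixel was covered in pass 1
    std::size_t cursor = 0;      // read position during pass 2
};

struct TwoPassModel {
    int width, height;
    std::vector<float> depth;            // throwaway Z-buffer, only used in pass 1
    std::vector<PixelHistory> history;   // the per-pixel bit arrays

    TwoPassModel(int w, int h)
        : width(w), height(h),
          depth(std::size_t(w) * h, 1.0f),
          history(std::size_t(w) * h) {}

    // Pass 1: Z-test only; record whether this fragment won at the moment it arrived.
    void pass1_fragment(int x, int y, float z) {
        std::size_t i = std::size_t(y) * width + x;
        bool wins = z < depth[i];
        if (wins) depth[i] = z;
        history[i].visible.push_back(wins);
    }

    // Pass 2: replay the same submission order; shade only the flagged fragments.
    bool pass2_should_shade(int x, int y) {
        PixelHistory& h = history[std::size_t(y) * width + x];
        return h.visible[h.cursor++];
    }
};
[/code]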

things I saw on S3's site that made me come up with this:
* they said that in the first pass, no values are stored in the frame buffer.
* they used the phrase "depth complexity" (the length of the bit array equals the scene's maximum depth complexity)
 
Inspirer said:
what I think is going on is that in the first pass they only calculate Z values for all the pixels in the scene.
instead of storing the Z values, they use them to build, for every pixel, an array of 1-bit values that determine whether the pixel is drawn (1) or discarded (0).
Where is this per-pixel mask stored? How many bits per pixel are stored/managed? Is it a static or dynamic resource? And the most important thing... who are you, Inspirer? :)

ciao,
Marco
 
Not sure what that saves you, as you still need to do multiple reads/writes of a full Z-buffer while you're building up your arrays of "next object to land on this pixel is valid" bits. Might as well just do a Z-only pass, keep the Z, and then re-render the scene, a la Doom 3.
 
JohnH said:
Not sure what that saves you, as you still need to do multiple reads/writes of a full Z-buffer while you're building up your arrays of "next object to land on this pixel is valid" bits. Might as well just do a Z-only pass, keep the Z, and then re-render the scene, a la Doom 3.
That is why I asked him where and how this weird buffer is stored. If it were stored on-chip, it could save some Z-read bandwidth on unseen fragments in the second (the real) pass. More unseen fragments per pixel would save more Z-read bandwidth, but the higher the number of per-pixel bits needed to store the pass mask, the lower the probability that it fits on chip... at least with a fixed budget per pixel. In the end it doesn't make much sense.

ciao,
Marco
 
Inspirer said:
what I think is going on is that in the first pass they only calculate Z values for all the pixels in the scene.
instead of storing the Z values, they use them to build, for every pixel, an array of 1-bit values that determine whether the pixel is drawn (1) or discarded (0).

I have been thinking out loud about something like this before, but I came to realize that you still need to check against Z values at the full 24 bits sooner or later. :(
 
to nAo:
first and foremost, I'm a nobody 8) (why, would you like to hire me?)
I'm an Israeli 21-year-old, fresh out of the army.
I'm into programming, math and physics and have been following 3D tech ever since Voodoo Graphics.

second, my guess would be that it's a dynamic resource.

third, I never said it was on-chip!

to JohnH:
it definitely doesn't save you Z reads/writes!
for that it has the hierarchical Z-buffer that it claims to be using.
what it does save you is the texture reads, anisotropic filtering, pixel shader programs, bump mapping calculations, reads/writes of color values to the frame buffer and all the other things you would otherwise have done in vain on an invisible pixel.

once again: in the first pass it does Z tests like every other card, but that's only to build the arrays; after that, the final Z values are irrelevant.

regarding Doom 3:

Carmack hasn't disclosed his magical mechanism in full detail. I don't understand what I've read about it; some things, like transparency, don't seem to fit. :rolleyes:
 
side note: thanks a lot for participating in this thread, it really helps me gather my thoughts.

I'm also glad to see that the main idea is understood by most, so it can't be complete nonsense ;)
 
oops !!
it just hit me that my method doesn't cull pixels that are drawn back to front. :oops:

in order to cull back to front, a bit array won't do.
:idea: But suppose you do store (never mind how, I can think of many ways), for every pixel, how many times you draw on it and which of those times are valid.

In the second pass, you draw the application's commands in the opposite order,
and so, when you draw a pixel:
if it's invalid, it's skipped.
if it's valid, it's rendered and then tested for transparency:
if transparent AND not last on the list, it's stored for future blending.
if opaque OR last on the list, it's blended with all previously stored transparent pixels and written to the frame buffer, and then all the following draws listed for that pixel are marked invalid.
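
something along these lines (again just my own toy model in C++, not anything S3 has described - the structures and names are all made up):

[code]
// Sketch of the revised idea: pass 2 walks the application's submissions in
// reverse order, keeping per pixel a list of which submissions are still valid
// and a small stash of transparent fragments waiting to be blended onto the
// next opaque one. Blending of a transparent "last" fragment against the clear
// colour is ignored for brevity.
#include <vector>

struct Fragment { float rgb[3]; float alpha; };

struct PixelState {
    std::vector<bool> valid;        // indexed by original submission number
    std::vector<Fragment> pending;  // transparent fragments awaiting blending
};

// 'index' is the original submission number for this pixel; pass 2 calls this
// with decreasing indices. 'isLastForPixel' means index == 0.
void pass2_pixel(PixelState& p, int index, const Fragment& shaded,
                 Fragment& framebuffer, bool isLastForPixel) {
    if (!p.valid[index]) return;                 // marked invalid: skipped

    if (shaded.alpha < 1.0f && !isLastForPixel) {
        p.pending.push_back(shaded);             // transparent: keep for later
        return;
    }

    // Opaque (or the final submission for this pixel): composite the stored
    // transparent fragments over it, farthest first, then write out.
    Fragment out = shaded;
    for (auto it = p.pending.rbegin(); it != p.pending.rend(); ++it)
        for (int c = 0; c < 3; ++c)
            out.rgb[c] = it->alpha * it->rgb[c] + (1.0f - it->alpha) * out.rgb[c];
    framebuffer = out;
    p.pending.clear();

    // Everything submitted before this (i.e. behind it) is now invalid.
    for (int i = 0; i < index; ++i) p.valid[i] = false;
}
[/code]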
 
During Pass One evaluation of depth complexity occurs, but no pixels are actually rendered to the destination or Z-buffers. During Pass Two final processing to the destination and Z-buffers is optimized to occur with minimal reads.

So the first pass does not write to the Z-buffer(s)...

They say that
Deferred Two-Pass rendering virtually eliminates opaque (redundant) back-to-front overdraw and saves precious conventional Z-write traffic.

Could the first pass simply be a front to back sort?

The wording isn't very consistent and it doesn't sound like 100% of overdraw is eliminated.
 
psurge said:
Could the first pass simply be a front to back sort?

The problem is that I cannot see how this could be done without the use of Z storage (a buffer). For a sort you still have to check nearly every damn pixel's Z value against the other pixels' Z values - both the pixels that have already been checked at that point and those that are about to be checked for that specific frame. (There are optimizations like those ATI and Nvidia are doing with coarse Z-reject, but you get my point.)

Dunno what the hell they're talking about to be honest.
 
The people who wrote up the text for the website probably don't know what the hell they're talking about either, which is why I find it amusing that everyone is spending so much time and energy trying to decrypt it. :)
 
LeStoffer,

somewhat OT: I don't understand how to generate an accurate (usable) coarse Z-pyramid without generating a Z-buffer down to the target render resolution. (For visible triangles, you have to propagate the modifications to the lowest level of the Z-pyramid upwards - or am I missing something?)
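
For reference, this is roughly how I picture that upward propagation (just a toy C++ model of my own, nothing DeltaChrome-specific; it assumes power-of-two dimensions and Z in [0,1] with larger meaning farther) - and note that it still needs the full-resolution level at the bottom, which is exactly my problem:

[code]
// Toy hierarchical-Z pyramid: each coarser texel stores the *farthest* Z of
// its 2x2 children, so a primitive whose nearest Z is behind that value can
// be rejected conservatively. Updating a fine-level pixel propagates upwards.
#include <algorithm>
#include <cstddef>
#include <vector>

struct HZPyramid {
    std::vector<std::vector<float>> level;  // level[0] is full resolution
    std::vector<int> width;                 // width of each level (power of two)

    void update(int x, int y, float z) {
        level[0][std::size_t(y) * width[0] + x] = z;
        for (std::size_t l = 1; l < level.size(); ++l) {
            x /= 2; y /= 2;
            int cw = width[l - 1];
            // Farthest of the four children covered by this coarse texel.
            float farthest = 0.0f;
            for (int dy = 0; dy < 2; ++dy)
                for (int dx = 0; dx < 2; ++dx)
                    farthest = std::max(farthest,
                        level[l - 1][std::size_t(y * 2 + dy) * cw + (x * 2 + dx)]);
            float& coarse = level[l][std::size_t(y) * width[l] + x];
            if (coarse == farthest) break;  // nothing changed; stop propagating
            coarse = farthest;
        }
    }
};
[/code]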

2. A sort could be something as simple as a painter's sort, not necessarily involving any coverage/visibility computations at all... but, more along the lines of what I was thinking: how about a primitive-level sort?

Pass 1: For each primitive, record vertex shader/pixel shader ID. Process geometry such that a bounding box for the primitive is computed (perhaps screenspace x,y,z min/max). Reorder such that primitives are drawn in front-to-back order.

Pass 2: as normal. Generate HZ pyramid.

That has the same geometry cost as a Z-only first-pass scheme, but does not require rasterization and needs much less memory than a full Z-buffer. In addition, you can use the bounding box info generated in pass 1 for rejection of entire primitives (using the HZ buffer), so that you are not in fact duplicating all of the geometry processing performed in pass 1.

What do you think?
Serge

EDIT: another speedup: in pass1, all vertex programs could be pruned of instructions not contributing to output vertex position.
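
To make the pass-1 bookkeeping above a bit more concrete, here is a rough sketch of what I have in mind (my own illustration in C++, not a real driver or hardware interface; the record layout is invented):

[code]
// Per-primitive record captured in pass 1, plus the reordering step: opaque
// primitives are sorted front-to-back by the near edge of their bounding box,
// transparent ones keep their original submission order and come last.
#include <algorithm>
#include <cstdint>
#include <vector>

struct PrimitiveRecord {
    uint32_t vertexShaderId;
    uint32_t pixelShaderId;
    uint32_t inputDataRef;     // reference back to the untransformed input set
    float minX, minY, minZ;    // screenspace bounding box
    float maxX, maxY, maxZ;
    bool transparent;
};

std::vector<PrimitiveRecord> buildPass2Order(std::vector<PrimitiveRecord> prims) {
    // Move opaque primitives to the front, preserving relative order of each group.
    auto firstTransparent = std::stable_partition(
        prims.begin(), prims.end(),
        [](const PrimitiveRecord& p) { return !p.transparent; });
    // Sort only the opaque range, nearest bounding box first.
    std::sort(prims.begin(), firstTransparent,
              [](const PrimitiveRecord& a, const PrimitiveRecord& b) {
                  return a.minZ < b.minZ;
              });
    return prims;
}
[/code]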
 
there are problems with sorting 3D data in the Video Card.

1. it's a slow process which doesn't fit a real-time solution.
2. the application is already sending most of the data back to front, so there's no reason to do it all over again.

if you want to correctly display transparency, you have to send data to the video card back to front, because if you draw foreground pixel 'a' and then background pixel 'b', 'b' won't be seen even if 'a' is completely transparent.

this is true due to the limitations of immediate mode rendering,

which cause early Z tests on immediate mode renderers to be too early, and thus not cull back to front.

all apps are designed to expect this and act accordingly,
which could cause visual artifacts on an accelerator that doesn't have this limitation, such as one that sorts the draw commands.
 
couple comments.

- making the process of sorting a list of screen-space, axis-aligned bounding boxes too slow for real-time operation requires skill. Consider that there are far fewer primitives than triangles, vertices, and pixels.

- the first pass is precisely for those cases where the app does not send things front to back (presumably what you meant). Otherwise, yes, the first pass can be skipped... Depending on the vertex shaders and depth complexity for a scene, computing bounding boxes in pass 1 might actually lead to a performance win versus a single-pass solution (which has no bounding box info and hence cannot reject entire primitives at a time).

- primitives with pixel programs that generate transparent fragments are not reordered. They should be rendered in original order, after all opaque primitives have been rendered.

I'm not saying this is perfect, but I'm not seeing how this would be too slow for real-time use, or lead to artifacts. After all, if PVR can reorder in similar fashion (in hardware) on a triangle by triangle basis, why shouldn't it be possible in the above scenario?
 
psurge said:
LeStoffer,

somewhat OT: I don't understand how to generate an accurate (usable) coarse Z-pyramid without generating a Z-buffer down to the target render resolution. (For visible triangles, you have to propagate the modifications to the lowest level of the Z-pyramid upwards - or am I missing something?)

Well, that's exactly my problem too! I cannot see how we can skip having a z-buffer at full resolution somewhere to check against either.

psurge said:
How about a primitive level sort?

Pass 1: For each primitive, record vertex shader/pixel shader ID. Process geometry such that a bounding box for the primitive is computed (perhaps screenspace x,y,z min/max). Reorder such that primitives are drawn in front-to-back order.

Pass 2: as normal. Generate HZ pyramid.

That has the same geometry cost as a Z-only first-pass scheme, but does not require rasterization and needs much less memory than a full Z-buffer. In addition, you can use the bounding box info generated in pass 1 for rejection of entire primitives (using the HZ buffer), so that you are not in fact duplicating all of the geometry processing performed in pass 1.

What do you think?

Well, why work per bounding box? While we're at reordering (which needs that darn storage!), why don't we just go full deferred rendering a la PowerVR? I don't think I really understand your idea. :oops:

Anyway, I think some of those really clever folks here might wanna step in? hint... ;)
 
is it clear to everyone that what I'm suggesting IS in fact full deferred rendering and completely eliminates overdraw??

Tile-based solutions have the benefit of doing all the Z-buffering on chip, but they need to store more data in their first phase.

----------

sorting bounding boxes can be done fast.
building the bounding boxes, on the other hand, sounds to me like pretty complicated math.
 
LeStoffer,

- all you need to store per primitive is bounding box info, pixel shader ID, vertex shader ID, and a reference to the input data set. So the amount of bandwidth/space required for scene capture is negligible compared to storing post-transform vertices (which I think is how PowerVR operates).

- sorting primitives is efficient because there are orders of magnitude fewer primitives than vertices.

- The sort presents primitives in somewhat-optimal order (front to back) for a hierarchical-Z rendering system, i.e. a fair number of occluded pixels won't be shaded.

- Since the first pass is only concerned with determining primitive position, the vertex shaders need not compute texture coordinates, vertex normals, or lighting information. In pass2, bounding box info is available per primitive, so it can be used to cull entire primitives at a time. For culled primitives, texture coordinates, normals, lighting, etc... will never be computed. If you are storing post-transform vertices, then the full vertex shader must be run for every vertex in the scene, even if the vertex does not belong to any visible triangles.

AFAICS the geometry read bandwidth for this scheme and storing post-transform vertices is similar. But it doesn't write very much scene data at all, and limits full vertex shading to vertices "likely" to be visible.

I guess that instead of using an HZ buffer, you could tile the primitives and proceed a la PowerVR (saving even more frame/Z-buffer bandwidth). I think MfA proposed something like this, with hierarchical, program-specifiable bounding boxes, a while ago.


Edit: Inspirer - what I'm suggesting is this :
(Pass 1)

For each primitive (all vertices processed in a given GPU state) :

For first vertex :
- run vertex program, obtain screenspace (x,y,z). Set Min = Max = (x,y,z)

For each following vertex :
- run vertex program, obtain screenspace (x,y,z).
Minx = min(Minx, x);
Miny = min(Miny, y);
Minz = min(Minz, z);
similarly for Max

after the last vertex in the primitive, Min,Max is your bounding box.

Output bounding box and state information for each primitive.
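
in runnable C++ that loop might look something like this (illustrative only - the position-only vertex program is passed in as a stand-in, since pass 1 only needs positions):

[code]
// Computes the screenspace bounding box of one primitive, running only the
// position part of the vertex program for each vertex. Assumes the primitive
// has at least one vertex.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
struct BoundingBox { Vec3 min, max; };

BoundingBox computePrimitiveBounds(const std::vector<Vec3>& vertices,
                                   Vec3 (*runVertexProgramPositionOnly)(const Vec3&)) {
    Vec3 first = runVertexProgramPositionOnly(vertices.front());
    BoundingBox box{first, first};               // Min = Max = first vertex
    for (std::size_t i = 1; i < vertices.size(); ++i) {
        Vec3 p = runVertexProgramPositionOnly(vertices[i]);
        box.min.x = std::min(box.min.x, p.x);
        box.min.y = std::min(box.min.y, p.y);
        box.min.z = std::min(box.min.z, p.z);
        box.max.x = std::max(box.max.x, p.x);
        box.max.y = std::max(box.max.y, p.y);
        box.max.z = std::max(box.max.z, p.z);
    }
    return box;   // emitted along with the primitive's state information
}
[/code]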
 