Does OpenGL 2.0 automatically generate multipass???

Mephisto

Newcomer
A question mainly for the OpenGL gurus here. Will OpenGL 2.0 automatically generate multipass if a shader can't be mapped to the underlying hardware, the way the Stanford shading language does?
 
I believe that's up to the driver, so I imagine almost all OpenGL 2.0 implementations will just fail. Resorting to multi-pass in the driver is a very bad idea on many levels. It may make things easier for developers, but performance (and potentially quality) would suffer dramatically as a result, potentially even hurting performance when multi-pass isn't needed.
 
Actually, according to the OpenGL 2.0 Shading Language document at 3dlabs.com (see section 3.5.1), an OpenGL 2.0 driver, when faced with a shader that is too large, is required to split it into multiple passes. The driver is also required to report back to the client program whether or not it compiled the given shader into a multipass setup.
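Purely as a hypothetical sketch, the client-side check could look something like this. The query token below is invented for illustration, and a GL 2.0-style glGetProgramiv is assumed; the real mechanism is whatever the 3dlabs document specifies:

#include <GL/gl.h>
#include <stdio.h>

// invented placeholder token, NOT a real OpenGL enum
#define GL_SHADER_WAS_MULTIPASSED 0x0000

void checkMultipass(GLuint program)
{
    GLint multipassed = 0;
    glGetProgramiv(program, GL_SHADER_WAS_MULTIPASSED, &multipassed);
    if (multipassed)
        printf("driver compiled this shader into a multipass setup\n");
}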
 
gking said:
Resorting to multi-pass in the driver is a very bad idea on many levels. It may make things easier for developers, but performance (and potentially quality) would suffer dramatically as a result, potentially even hurting performance when multi-pass isn't needed.

Sure, performance will go down, but today's high-end DX8 cards and the coming DX9 cards (which these languages are targeted at) certainly do not have any performance problems. Just have a look at Anand's latest UT2K3 scores: a Ti 4600 is already blazing fast in the latest 3D engine.

Further, performance problems can be compensated for by the user by lowering the resolution, lowering geometry detail, lowering rendering quality and so on. Dealing with the limits of the hardware is just a pain for developers when coding with a high-level language.
 
Further, performance problems can be compensated for by the user by lowering the resolution, lowering geometry detail, lowering rendering quality and so on. Dealing with the limits of the hardware is just a pain for developers when coding with a high-level language.

This seems to be the obvious answer, but you are overlooking a large class of problems that cannot be trivially multi-passed. Some problems simply cannot be solved with multipass, and even those that can shouldn't be multipassed at the driver level.

When you multipass, you typically want to render the whole scene with the first pass, and then render the subsequent passes with an equals depth test in order to take advantage of early Z rejection hardware to minimize the number of cycles spent on shading unseen pixels. Unfortunately, the hardware doesn't know when it renders a triangle whether a later triangle will obscure it, so all passes will be rendered on all visible triangles. This is *much, much* less efficient than letting the application developer handle the multipassing.
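Concretely, the pattern looks something like this in plain OpenGL; a minimal sketch, where drawScene() stands in for submitting all the geometry:

#include <GL/gl.h>

void drawScene();   // placeholder: submits every triangle for this shader

void renderMultipass(int numPasses)
{
    // Pass 1 establishes depth and the first shading term.
    glDepthFunc(GL_LESS);
    glDepthMask(GL_TRUE);
    glDisable(GL_BLEND);
    drawScene();

    // Passes 2..n re-shade only fragments that already won the depth test.
    glDepthFunc(GL_EQUAL);
    glDepthMask(GL_FALSE);        // depth buffer is already correct
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);  // accumulate further terms additively
    for (int i = 1; i < numPasses; ++i)
        drawScene();
}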

Additionally, some problems that can be solved in hardware on some GPUs (e.g., power functions per-pixel) may require textures on other GPUs. How does the driver decide what resolution texture to create for this lookup function (assuming a lookup can be created at all)? If the game has detected a card with 64MB of free memory, what happens when the driver creates a 1MB lookup texture?
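To make that concrete, here is a minimal sketch of baking pow() into a 1D lookup texture; the 256-entry size and 8-bit precision are exactly the kind of arbitrary choices a driver would have to make blind:

#include <GL/gl.h>
#include <math.h>

// Uploads pow(x, n) into the currently bound 1D texture. SIZE and the
// byte precision are arbitrary; the driver has no way to know what the
// shader actually needs.
void buildPowerLookup(float n)
{
    const int SIZE = 256;
    unsigned char lut[SIZE];
    for (int i = 0; i < SIZE; ++i)
        lut[i] = (unsigned char)(255.0f * powf(i / (float)(SIZE - 1), n));
    glTexImage1D(GL_TEXTURE_1D, 0, GL_LUMINANCE, SIZE, 0,
                 GL_LUMINANCE, GL_UNSIGNED_BYTE, lut);
}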

Also, should the driver look for all differentiable functions in order to collapse extra math passes into textures? If it finds such a function, can it collapse it on hardware that doesn't require multi-pass?

Most people assume that turning shaders into multiple passes is just some simple flag that should be done automatically. Sure, it's nice for application developers; however, it has *many* negative consequences at the driver level, and you really don't want unnecessary baggage weighing down your drivers.

The best solution for allowing graceful multipass would be to make the driver-level interface hardware-limited (like it is currently), and then introduce a GLU-like layer that can gracefully create multipass when requested. That way, it's the application developer's fault if their program runs like crap.

Sure, performance will go down, but today's high-end DX8 cards and the coming DX9 cards (which these languages are targeted at) certainly do not have any performance problems. Just have a look at Anand's latest UT2K3 scores: a Ti 4600 is already blazing fast in the latest 3D engine.

UT2K3 is currently a high-poly DX7 app. The Ti 4600 doesn't need to resort to multi-pass. Game developers want to be given as much performance as they can get. Taking performance away from them (and from users) in order to gracefully multipass shaders on older hardware, when the better solution would be to just run a simpler shader, is a bad idea on many levels. And just because hardware is fast doesn't mean you should throw away performance. That's just ridiculous.
 
gking said:
Most people assume that turning shaders into multiple passes is just some simple flag that should be done automatically. Sure, it's nice for application developers; however, it has *many* negative consequences at the driver level, and you really don't want unnecessary baggage weighing down your drivers.

Well, NVIDIA has already shown an implementation of the Stanford shading language running (http://www.pentondigital.com/dmr/showreports/siggraph2.asp#part2), so I guess it is not that much of a problem for the drivers.

And just because hardware is fast doesn't mean you should throw away performance. That's just ridiculous.

I see your point, but as a developer, if you have the skill and the time, you can still write different shaders for different hardware so that there is no need for multipass. But writing shaders that can be split automatically should be an option.
 
gking said:
When you multipass, you typically want to render the whole scene with the first pass, and then render the subsequent passes with an equals depth test in order to take advantage of early Z rejection hardware to minimize the number of cycles spent on shading unseen pixels. Unfortunately, the hardware doesn't know when it renders a triangle whether a later triangle will obscure it, so all passes will be rendered on all visible triangles. This is *much, much* less efficient than letting the application developer handle the multipassing.

This is why the compiler has to report back that multipass was enabled. As long as the software knows that multipass is required, and each compiled pass is stored in a different location, there should be no performance problems (other than less-than-perfect optimizations).

Oh, and UT2K3 does need multipass on the GF4 Ti 4600 in some situations. I'll see if I can dig up a quote on that to be certain...
 
gking said:
When you multipass, you typically want to render the whole scene with the first pass, and then render the subsequent passes with an equals depth test in order to take advantage of early Z rejection hardware to minimize the number of cycles spent on shading unseen pixels. Unfortunately, the hardware doesn't know when it renders a triangle whether a later triangle will obscure it, so all passes will be rendered on all visible triangles. This is *much, much* less efficient than letting the application developer handle the multipassing.

yes, the developer would handle this at the material level, i.e. each pass would go through all surfaces sharing a given shader. why wouldn't the driver/HLSL compiler carry out the same obvious solution? just for the sake of being IMR-politically-correct? nah.

Additionally, some problems that can be solved in hardware on some GPUs (e.g., power functions per-pixel) may require textures on other GPUs. How does the driver decide what resolution texture to create for this lookup function (assuming a lookup can be created at all)? If the game has detected a card with 64MB of free memory, what happens when the driver creates a 1MB lookup texture?

through HLSL pragma-like constructs, e.g. #pragma power_func_A_lookup_granularity 10

Also, should the driver look for all differentiable functions in order to collapse extra math passes into textures? If it finds such a function, can it collapse it on hardware that doesn't require multi-pass?

yes, i'd like the HLSL compiler to.

Most people assume that turning shaders into multiple passes is just some simple flag that should be done automatically.

i believe that many of the regulars on this forum don't belong to that category.

Sure, it's nice for application developers; however, it has *many* negative consequences at the driver level, and you really don't want unnecessary baggage weighing down your drivers.

it's not just "nice for application developers". with all your knowledge and proficiency in CG (which i get the impression you have), how viable do you think it would be for you (as i also get the impression you're not an app developer) to code a 3MB renderman shader into, say, dx9-like assembly shaders? apparently many see Cg in its present form as defying its very purpose (or at least as a half-step in the right direction) - imagine a C compiler which required you to know the exact register model of the architecture you code for?

as to whether this whole functionality should be implemented exactly at the driver level or a little bit above it (or even in GLU), i don't really care, neither does it really matter, as long as i have the choice of both HLSL and assembly.

The best solution for allowing graceful multipass would be to make the driver-level interface hardware-limited (like it is currently), and then introduce a GLU-like layer that can gracefully create multipass when requested. That way, it's the application developer's fault if their program runs like crap.

you mean exactly the way it's exposed in ogl2?

Taking performance away from them (and from users) in order to gracefully multipass shaders on older hardware, when the better solution would be to just run a simpler shader, is a bad idea on many levels.

a "simple shader"? somehow thriggers in my mind a recollection of the notorious "who'd ever need more than 640KB" statement..

And just because hardware is fast doesn't mean you should throw away performance. That's just ridiculous.

so if it were up to you, higher-level computer languages wouldn't have been invented, so that we would not throw away performance. or did i miss your point?
 
Auto-multipass is a pipedream until the APIs operate at the scenegraph level. Yes, it can be done at the triangle level, but it will be woefully inefficient. You could get pathological cases out of relatively simple shaders that require dozens or more passes! And each pass is inefficient because of all the state changes, no ability to be intelligent about Z, transparency, stencil, etc.

Sure, the fallback could be made to "work", but so does software rendering of pixel shaders! I bet such fallbacks will be so horribly inefficient that developers will simply avoid the possibility and code two paths: one for OpenGL 2.0 hardware that has no resource limits, and one for OpenGL 1.1.

Moreover, they will probably choose one or two OpenGL 2.0 cards, make sure the shaders work on those (with no driver multipass), and all other cards will default to using the OpenGL 1.1 path.
 
DemoCoder said:
Auto-multipass is a pipedream until the APIs operate at the scenegraph level. Yes, it can be done at the triangle level, but it will be woefully inefficient. You could get pathological cases out of relatively simple shaders that require dozens or more passes! And each pass is inefficient because of all the state changes, no ability to be intelligent about Z, transparency, stencil, etc.

1. As long as it's up to the developer to manage the multiple passes, there's no problem with state changes.
2. In theory, a good compiler wouldn't use significantly more passes than pure assembly.

But I think the biggest problem right now is that current DX8 hardware just isn't programmable enough for you not to need to know the target hardware. Hopefully all DX9 and above hardware (which means, hopefully, R300 and NV30) will be programmable enough to forever hide assembly from developers.
 
DemoCoder said:
Auto-multipass is a pipedream until the APIs operate at the scenegraph level. Yes, it can be done at the triangle level, but it will be woefully inefficient. You could get pathological cases out of relatively simple shaders that require dozens or more passes! And each pass is inefficient because of all the state changes, no ability to be intelligent about Z, transparency, stencil, etc.

care to pin-point the arguments you have re auto-multipass inefficiency so we could go through them more thoroughly?
 
As far as I can see, many people are assuming that HLSLs will compile to a multipass algorithm where all passes would be executed for each primitive at once (i.e. the program has no knowledge of or control over multipass execution).

The problem here is the same as one of the problems that exists in Unreal Tournament. You see, 3D renderers are far, far better at rendering the entire scene one pass at a time. If you can render the entire scene without changing any rendering state (# of lights, textures, etc.), you will get optimal performance. However, games like the original Unreal render all passes on each primitive before moving on to the next. This causes massive stalls and performance problems, and the same would happen if multipass in HLSLs were transparent to the game developer.

However, any decent HLSL that compiles to multipass should certainly give the developer control over the passes, along these lines (pseudocode):

compileProgram();
// numpasses is reported back by the compiler
for (int x = 0; x < numpasses; ++x)
    renderPass(x);

With something of this form, there shouldn't be any problem with state change stalls.

Some people are also stating that some programs will compile to an infeasible number of passes. All I can say is, hopefully compiler technology will be good enough that this doesn't have to happen.
 
Here. With Cg, GL2 and other such publications becoming available, I believe widespread adoption of HW pixel shaders is inevitable REAL SOON NOW :D
I will never ever advise anyone to buy graphics HW that doesn't have pixel shaders :p
 
bah, nvidia just wanna hold back technology. for them to say multi-pass is bad just because their GPUs can't handle it is loads of poop! :rolleyes:

why should everybody have to write simple shaders and dumb everything down to nvidia's level just to accommodate the limitations of nvidia's hardware? :rolleyes:

but if some coder wants to write a back-end for certain crappy GPUs to avoid multi-passing, they can do so and it'll work in OGL2. :)

what's the matter, can't nvidia design hardware to handle it, or is their driver team just too lazy and really bad? :)

anyways, OpenGL 2.0 is moving on, with or without nvidia. and as john carmack says, "get over it" :)
 
Chalnoth said:
However, any decent HLSL that compiles to multipass should certainly give the developer control over the passes, along these lines (pseudocode):

compileProgram();
// numpasses is reported back by the compiler
for (int x = 0; x < numpasses; ++x)
    renderPass(x);

With something of this form, there shouldn't be any problem with state change stalls.

Actually, the DX8 "Effects and Techniques" framework tries to do the same thing. It does not automatically compile into multiple passes, but it allows similar operation.

Unfortunately, this does not always work. For example, if you want to draw transparent triangles, a multipass algorithm may fail. You'll need to render the triangles into a temporary off-screen buffer and blend them into the frame buffer when they are all processed.
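A rough sketch of that workaround in GL 1.x terms; the two helper functions are placeholders, and tempTex is assumed to be an already-allocated texture of the right size:

#include <GL/gl.h>

void renderAllPassesOfTransparentObject();  // placeholder: the n shader passes
void drawScreenAlignedQuad(GLuint tex);     // placeholder: composites the result

void compositeTransparent(GLuint tempTex, int w, int h)
{
    // Build the fully shaded transparent surface off-screen (or in a spare
    // region of the frame buffer)...
    renderAllPassesOfTransparentObject();
    glBindTexture(GL_TEXTURE_2D, tempTex);
    glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, w, h);

    // ...then blend the finished result into the frame buffer exactly once.
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    drawScreenAlignedQuad(tempTex);
}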
 
pcchen said:
Actually, the DX8 "Effects and Techniques" framework tries to do the same thing. It does not automatically compile into multiple passes, but it allows similar operation.

Unfortunately, this does not always work. For example, if you want to draw transparent triangles, a multipass algorithm may fail. You'll need to render the triangles into a temporary off-screen buffer and blend them into the frame buffer when they are all processed.

or, alternatively, the underlying driver could behave in a scene-capturing manner: re-arranging all primitive draws by the context of their "material" _and_ "blendability", honoring the original submission order within each such context (as it may have been deliberately set for an IMR architecture, early-z checks, etc.), and deferring the blendable primitives' passes until after the passes for the non-blendable primitives. all that off the top of my head.
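roughly like this, as a sketch (the Draw record and the captured list are invented for illustration):

#include <algorithm>
#include <vector>

// invented stand-in for one captured draw call
struct Draw {
    int  material;   // which compiled shader / pass set it uses
    bool blended;    // does it blend against the frame buffer when drawn?
};

void reorderCapturedScene(std::vector<Draw>& draws)
{
    // stable_sort keeps the original submission order inside each bucket,
    // preserving any deliberate front-to-back ordering for early-z.
    std::stable_sort(draws.begin(), draws.end(),
        [](const Draw& a, const Draw& b) {
            if (a.blended != b.blended) return !a.blended;  // opaque first
            return a.material < b.material;                 // group by material
        });
}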
 
darkblu said:
DemoCoder said:
Auto-multipass is a pipedream until the APIs operate at the scenegraph level. Yes, it can be done at the triangle level, but it will be woefully inefficient. You could get pathological cases out of relatively simple shaders that require dozens or more passes! And each pass is inefficient because of all the state changes, no ability to be intelligent about Z, transparency, stencil, etc.

care to pin-point the arguments you have re auto-multipass inefficiency so we could go through them more thoroughly?

I posted them a while ago on these forums; I'm not about to rewrite such a huge post. Essentially, #1: you want to group primitives together which share like state. Unless the application takes control of pass rendering (polling the driver for each pass, getting the next set of compiled shaders, and doing the rendering itself), you can't do this without scene capture.


Chalnoth keeps talking about this API, but I don't see it in the OpenGL 2.0 specs online. Maybe someone can point me to it. When I talk about auto-multipass, I am talking about the driver multipassing the scene without ANY control from the application.


State changes are one of the biggest performance killers today, as is detailed in every DX/OGL programming FAQ (that, and mismanaging vertex buffers/stalling the GPU). However, immediate mode renderers are, for the most part, procedural/stream-based APIs, not object/stateful APIs, so any driver wanting to do the right thing has to do scene capture.

#2 transparency kills you.

#3 There are very simple operations, like a user-implemented version of pow(), noise(), image shading, or procedural textures, that require a very high number of passes and/or render-to-texture. Take computing Perlin noise/turbulence: how will a fragment shader that has a loop with 128 iterations, unrolled onto the GF4 into god knows how many passes, perform if this procedural texture is used all over the place?
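For concreteness, the kind of loop I mean; noise2() is a placeholder for a 2D Perlin noise primitive, and the 128-octave count just mirrors the example above:

#include <math.h>

float noise2(float x, float y);  // placeholder: 2D Perlin noise primitive

// Octave turbulence. On hardware with no fragment-level looping, every one
// of these iterations becomes unrolled instructions, or extra passes.
float turbulence(float x, float y)
{
    float sum = 0.0f, freq = 1.0f;
    for (int i = 0; i < 128; ++i) {
        sum += fabsf(noise2(x * freq, y * freq)) / freq;
        freq *= 2.0f;
    }
    return sum;
}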



Yes, if the API supports polling, e.g.

while ((shaderpart = compiledShader.getNextPass()) != null) {
    renderPass(shaderpart);   // render one pass with this piece of the shader
}


Then things will be more efficient (though it can still degenerate if you allow looping constructs). But we're talking about hypothetical application assists to the API, which I haven't seen yet.
 
I think most of the arguments so far can be summed up with 'Multipass is a Good Thing. Unfortunately, it's also hard to handle in all but simple cases because there are so many interactions'.

Multipass vs. multitexture can be easily handled if the algorithm is initially expressed as a multipass algorithm and then collapsed to a multitexture algorithm (as Q3 does it). Going the other way gets hairy pretty fast - you need to get into solutions like the Peercy et al. paper.
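For example, the classic lightmap case collapses cleanly when written as passes first. A sketch, assuming the ARB_multitexture entry points are available, with bindAndDraw() and drawSurface() as placeholders:

#include <GL/gl.h>
#include <GL/glext.h>

void bindAndDraw(GLuint tex);  // placeholder: bind tex and draw the surface
void drawSurface();            // placeholder: draw with current texture state

void lightmapBothWays(GLuint diffuseTex, GLuint lightmapTex)
{
    // Written as two passes: runs on any GL 1.1 hardware.
    glDisable(GL_BLEND);
    bindAndDraw(diffuseTex);
    glEnable(GL_BLEND);
    glBlendFunc(GL_DST_COLOR, GL_ZERO);   // frame buffer *= lightmap
    bindAndDraw(lightmapTex);

    // Collapsed to one pass on multitexture hardware.
    glActiveTextureARB(GL_TEXTURE0_ARB);
    glBindTexture(GL_TEXTURE_2D, diffuseTex);
    glActiveTextureARB(GL_TEXTURE1_ARB);
    glBindTexture(GL_TEXTURE_2D, lightmapTex);
    glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_MODULATE);  // unit1 *= unit0
    drawSurface();
}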
 
DemoCoder said:
I posted them a while ago on these forums; I'm not about to rewrite such a huge post.

as i've apparently missed your original post, thanks for the effort of re-stating those in this thread!

Essentially, #1: you want to group primitives together which share like state. Unless the application takes control of pass rendering (polling the driver for each pass, getting the next set of compiled shaders, and doing the rendering itself), you can't do this without scene capture.

Chalnoth keeps talking about this API, but I don't see it in the OpenGL 2.0 specs online. Maybe someone can point me to it. When I talk about auto-multipass, I am talking about the driver multipassing the scene without ANY control from the application.

State changes are one of the biggest performance killers today, as is detailed in every DX/OGL programming FAQ (that, and mismanaging vertex buffers/stalling the GPU). However, immediate mode renderers are, for the most part, procedural/stream-based APIs, not object/stateful APIs, so any driver wanting to do the right thing has to do scene capture.

so we have a commonly-recognised solution for this and it's called 'scene capturing'.

#2 transparency kills you.

transparency kills you anyway with an IMR, so? i don't see it as being a greater problem than usual.

#3 There are very simple operations, like a user-implemented version of pow(), noise(), image shading, or procedural textures, that require a very high number of passes and/or render-to-texture. Take computing Perlin noise/turbulence: how will a fragment shader that has a loop with 128 iterations, unrolled onto the GF4 into god knows how many passes, perform if this procedural texture is used all over the place?

wait a moment here, we were talking about the mechanics of auto-multipassing per se, not how this particular worst-case sample function would actually be carried out on that particularly-challenged gpu. see, just like w/ every computing system (be it a von neumann one, or a quantum calculator, or what not) there are particular tasks and classes of algorithms which happen to be prohibitively expensive. in which case it's the developer's task and responsibility to decide "this will do / this won't do" (and to sit down and tailor the problem at hand to what the computer system can do reasonably). nevertheless, it's the obligation of the compiler/higher-abstraction tool to provide working output from whatever valid code the developer passes to its input. imagine the following hypothetical situation:

i'm doing visual systems research, using proprietary custom-coded shading algorithms. my attempts to run those on a cpu showed prohibitively low performance. i find out that a standard (for my time) gpu can do them 1000x faster, even if that's at 0.01 FPS. so i sit down and code my stuff, and though i don't get "playable" frame rates i'm still more than happy to get a 1000x speed-up for my research. i don't care how many passes my algorithm was compiled into. ...and best of all, in 5 years i can take the code of my ultra-sophisticated shaders and, w/o (or with very little) modification, compile it for the new cutting-edge tech just made available, and maybe enjoy FPS in the tens. oh, miracle! i will have just benefited from the fruits of higher-level abstraction coding! ;)

But we're talking about hypothetical application assists to the API, which I haven't seen yet.

yes, indeed, that's exactly what we're talking about. as well as the general steps and directions that need to be followed to get us where we'd like to be in the future.
 
pikkachu said:
bah, nvidia just wanna hold back technology. for them to say multi-pass is bad just because their GPUs can't handle it is loads of poop! :rolleyes:

No, multipass would be especially good for nVidia's current GPUs. The problem most of the people here have with it is that it would be far from easy for the compiler/driver to handle the multipass totally internally.

The limitations I spoke of were with regard to other things, where I suspect there are some shaders that are simply impossible to run on GF3- or GF4-level hardware.
 