Joe DeFuria said:
Well, the only real issue with that is (correct me if I'm wrong) that this is not a transparent thing to do on different sets of hardware. I would agree if you're basically developing for one hardware target, but I'd think the transparent "driver controlled" technique, while probably not ideal for performance, is probably ideal from a usability standpoint.
And with shaders of that length, I'm not sure the "difference" in performance between the two techniques will be all that interesting. (Going to be "slow" regardless.)
It's not transparent, but you would normally just target DX9. Once you go beyond the DX9 limits (to, say, 512 or 1024 instructions), there are few shaders, if any, that will ever bump up against them, especially once you add looping constructs to the mix.
The performance difference will be quite large, actually, since the 2D post-process approach has the benefit of eliminating overdraw, storing only the data that is needed (instead of 784 bytes per pixel), and not retransforming any vertices (which is what you might be thinking of).
Consider, for example, a shader that iteratively computes some value. Typically, each step is a function of just a few variables: X_n = G(X_{n-1}, Y_{n-1}, Z_{n-1}, ...), repeated over and over. Examples: Newton-Raphson approximation, fractals, noise, procedural textures, etc.
You can save the few variables that are required to continue the iteration in the frame buffer. You can then perform a whole bunch of 2D passes over the frame buffer (very, very quick) to incrementally update these values. No retransforming of geometry is necessary. The only issue is how this interacts with AA.
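Something like this rough HLSL sketch is what I have in mind (assumptions for illustration only: the iterated function is a Mandelbrot-style step, and the live variables are packed into one float4 render target that each pass reads and writes back out):

[code]
sampler2D stateTex : register(s0);   // previous pass's output, bound as a texture

// One "iteration" pass: read this pixel's live variables, update them, write them back.
// Assumed packing: (z.x, z.y, c.x, c.y) in RGBA.
float4 IteratePS(float2 uv : TEXCOORD0) : COLOR0
{
    float4 s = tex2D(stateTex, uv);
    float2 z = s.xy;
    float2 c = s.zw;

    // z_n = z_{n-1}^2 + c  -- just a couple of MULs/MADs per pass
    float2 zNew = float2(z.x * z.x - z.y * z.y, 2.0 * z.x * z.y) + c;

    return float4(zNew, c);
}
[/code]

You'd draw a full-screen quad with that shader once per iteration, ping-ponging between two render targets (you can't read and write the same surface in DX9). The only state that survives between passes is those four floats per pixel, not the whole pipeline.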
Basically, in any given program, whether it is HLSL, C, or Java, there are only a few "live" variables at any given point, and I can't really see the need to ever save the entire pipeline state to temporary storage.