Does the F-buffer increase the # of instructions per pass?

None, but JC has stated that he has already hit the instruction limits of the R300. This probably doesn't mean that DOOM3 will make use of shaders that require multipass on the R300, but JC's next project will.
 
Chalnoth said:
None, but JC has stated that he has already hit the instruction limits of the R300. This probably doesn't mean that DOOM3 will make use of shaders that require multipass on the R300, but JC's next project will.

But didn't someone here at B3D program a small application that tested super long shaders?

Kinda sucks if no one can test it.
 
Chalnoth said:
None, but JC has stated that he has already hit the instruction limits of the R300. This probably doesn't mean that DOOM3 will make use of shaders that require multipass on the R300, but JC's next project will.

...in 4 years
 
Judging by the speed JC develops new engines we'll have second generation DX10 cards by the time we'll see the next id game...
 
Chalnoth said:
None, but JC has stated that he has already hit the instruction limits of the R300. This probably doesn't mean that DOOM3 will make use of shaders that require multipass on the R300, but JC's next project will.

Everybody has hit the limit in 'research',
e.g. I've got an effect where I want to use 64 texture lookups just to work out the light colour, but I don't think that's fast enough just yet :( I get away with 4, but the full filter would make it so much nicer.

Speed is the issue; no architecture currently here or on the horizon is fast enough to really use long shaders (though it would be nice for the odd effect etc). To really use longer shaders we need at least a 10x speed increase.

Say 3 years, about time for a 'proper' Dx9/OGL2 engine (like JC's post Doom3)
 
arjan...

This seems to me to be one of the simplest ways of doing data dependent branching, since you can still use a single control unit and i-cache for all 8 "pipes" simultaneously, without having to stall pipes. I think predication would help keep the average kernel size large enough that spilling to an F-buffer doesn't become a major factor in performance.
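The predication idea above can be sketched in a few lines. This is a minimal illustration (not any vendor's actual hardware): all 8 "pipes" execute both sides of a branch in lockstep, and a per-pipe mask selects which result each lane keeps, so a single control unit suffices and no pipe ever stalls.

```python
# Sketch of predication on a SIMD pixel array: every lane computes
# both sides of the branch; a per-lane mask selects the result kept.

def predicated_select(cond, then_vals, else_vals):
    """Per-lane select: lane i keeps then_vals[i] where cond[i] is true."""
    return [t if c else e for c, t, e in zip(cond, then_vals, else_vals)]

# Example: if (x > 0) y = x * 2; else y = -x;  across 8 lanes
xs = [3, -1, 4, -1, 5, -9, 2, -6]
mask = [x > 0 for x in xs]
then_side = [x * 2 for x in xs]   # every lane computes the "then" side
else_side = [-x for x in xs]      # every lane computes the "else" side
ys = predicated_select(mask, then_side, else_side)
print(ys)  # [6, 1, 8, 1, 10, 9, 4, 6]
```

The cost is that both sides always execute, which is exactly why predication only pays off when kernels stay reasonably large relative to the branched portion.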

If (in addition) the standard interpolated PS inputs (or even the VS inputs) are viewed as elements of a record in an input stream-buffer, and PS/VS outputs are viewed as elements of a record in an output stream-buffer, then (provided the appropriate math units are available) you can execute any VS/PS program on these units.

Assuming the stream-buffer is shared by all the pipes and connected to the control unit, you effectively have a centralized location in which you can store a set of dynamically sized FIFOs. The control unit can monitor the size of the FIFO being generated and do a context switch once it reaches a certain size (probably chosen so that on-chip stream-buffer storage is not exceeded). This way you don't run into the situation where PS pipes stall waiting for VS data and vice versa, and you also avoid off-chip memory traffic.
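The watermark-driven context switch described above can be simulated in miniature. This is a hypothetical sketch (the names and the watermark value are made up, not ATI's design): vertex work fills a shared on-chip FIFO, and a control unit switches the pipes over to pixel work once the FIFO depth hits a threshold, so the FIFO never has to spill off chip.

```python
from collections import deque

HIGH_WATER = 8          # assumed on-chip FIFO capacity threshold
fifo = deque()
processed = []          # pixels produced by the "PS" side
max_depth = 0

def run_vs(v):
    """Stand-in vertex shader: push a transformed vertex into the FIFO."""
    global max_depth
    fifo.append(v * 2)
    max_depth = max(max_depth, len(fifo))

def run_ps():
    """Stand-in pixel shader: consume one FIFO entry."""
    processed.append(fifo.popleft())

def control_unit(vertices):
    for v in vertices:
        run_vs(v)
        if len(fifo) >= HIGH_WATER:   # watermark hit: context switch
            while fifo:               # and let the PS side drain it
                run_ps()
    while fifo:                       # drain whatever is left at the end
        run_ps()

control_unit(range(20))
print(max_depth, len(processed))  # depth never exceeds the watermark
```

Neither side ever waits indefinitely on the other, and the FIFO depth is bounded by the watermark, which is the property that keeps all the traffic on chip.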

The main problem with this AFAICS is that you lose the 2x2 (or 4x2) pixel stamp coherency across the pipes when data dependent branching is involved...

Edit:
Might as well just be more general and do something along the lines of
Imagine
I think this basically allows more than one stream input/output per kernel, conditional stream inputs as well as outputs, etc...
 
I think a developer-controlled technique will offer better performance. That is, the developer manually splits his shader, saves only the data he needs in a framebuffer, and runs a 2D post-process on only the pixels that need it. It requires an "f-buffer" the size of the entire screen, but few shaders need all shader state saved, and with 4 128-bit FP render targets, you can save up to 16 FP32 values or 4 vectors.

The f-buffer sounds like it will cost performance in the few situations where it will be used, and if those situations are rare, the developer can do it manually. It's only a win for developers if a huge portion of their shaders exceed HW limits.
 
I think a developer-controlled technique will offer better performance. That is, the developer manually splits his shader, saves only the data he needs in a framebuffer, and runs a 2D post-process on only the pixels that need it.

Well, the only real issue with that is (correct me if I'm wrong) that this is not a transparent thing to do across different sets of hardware. I would agree if you're basically developing for one hardware target. But I'd think the transparent "driver controlled" technique, while probably not performance ideal, is probably ideal from a usability standpoint.

And with shaders that length, I'm not sure the "difference" in performance between the two techniques will be all that interesting. (Going to be "slow" regardless.)
 
DeanoC said:
Speed is the issue, no architecture currently here or on the horizon is fast enough to really use long shaders (but it would be nice for the odd effect etc). To really use longer shaders we need at least a x10 speed increase.

But on the other hand, it could be useful for non-game applications, both for research and actual work. Being able to run any shader you want in hardware without having to worry about length could be a great boon for integrating graphics cards into actual rendering environments.
 
Joe DeFuria said:
Well, the only real issue with that is (correct me if I'm wrong) that this is not a transparent thing to do across different sets of hardware. I would agree if you're basically developing for one hardware target. But I'd think the transparent "driver controlled" technique, while probably not performance ideal, is probably ideal from a usability standpoint.

And with shaders that length, I'm not sure the "difference" in performance between the two techniques will be all that interesting. (Going to be "slow" regardless.)

It's not transparent, but you would normally just target DX9. Once you start to go beyond DX9, there are few shaders, if any, that will bump up against the limits (e.g. 512 or 1024 instructions). Especially if you add looping constructs to the mix.

The performance difference will be quite large actually, since the 2D post-process approach has the benefit of eliminating overdraw, storing only the data that is needed (instead of 784 bytes per pixel), and it doesn't require retransforming any vertices either (which is what you might be thinking).

Consider, for example, a shader that iteratively computes some value. Typically, this is a function of just a few variables: F(X_n) = G(X_n-1, Y_n-1, Z_n-1, ...) repeated over and over. Examples: Newton-Raphson approximation, fractals, noise, procedural textures, etc.

You can save the few variables that are required to continue the iteration in the frame buffer. You can then perform a whole bunch of 2D passes over the frame buffer (very very quick) to incrementally update these values. No retransforms of geometry necessary. The only issue is how to interact with AA.
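The iteration-splitting idea above is easy to simulate. Here's a minimal sketch (plain Python standing in for shader passes): Newton's method for square roots, x_n+1 = (x_n + a/x_n)/2, split across multiple "2D passes", carrying only the live variables (a, x) per pixel between passes rather than the whole pipeline state.

```python
# Each function models one full-screen pass; the list models the
# per-pixel data saved in the frame buffer between passes.

def first_pass(framebuffer_inputs):
    # Initial pass: write the input a plus the starting estimate x0 = a.
    return [(a, a) for a in framebuffer_inputs]

def iteration_pass(saved):
    # One cheap 2D pass: update x from the saved live variables only.
    return [(a, 0.5 * (x + a / x)) for (a, x) in saved]

pixels = [2.0, 9.0, 16.0, 25.0]     # per-pixel inputs to sqrt
state = first_pass(pixels)
for _ in range(6):                  # six cheap 2D passes
    state = iteration_pass(state)

results = [round(x, 4) for (_, x) in state]
print(results)  # [1.4142, 3.0, 4.0, 5.0]
```

Note that only two values per pixel ever need to be stored, which is the whole argument for the developer-controlled split over saving full pipeline state.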

Basically, in any given program, whether it is HLSL, C, or Java, there are only a few "live" variables at any point in the program, and I can't really see the need to ever save the entire pipeline state to temporary storage.
 
Basically, in any given program, whether it is HLSL, C, or Java, there are only a few "live" variables at any point in the program, and I can't really see the need to ever save the entire pipeline state to temporary storage.

If you're working in a pre-emptive system, you must save the entire context because you have no idea what registers are active at any given time.

How exactly that applies to graphics pipelines, I have no idea. :)

It could be because you might have hit the actual limits of the shader at any time (requiring a context switch, i.e. saving off the state), and there's no easy way to know which registers are active and which aren't--so you must save them all.
 
At least it is very easy to find out which registers are used at all in a shader at compile time. And without branching, it should be equally easy to precisely know which registers are 'active' at any given time.
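For a branch-free shader this is classic backward liveness analysis, and it really is a few lines of compiler code. A toy sketch (the instruction format here, a list of (dest, sources) pairs, is made up for illustration):

```python
# Backward scan over a straight-line program: a register is live at a
# point if it will be read later before being overwritten. Only the
# live set needs saving at a context switch / pass boundary.

def live_registers(program):
    """Return the set of live registers before each instruction."""
    live = set()
    live_before = [None] * len(program)
    for i in range(len(program) - 1, -1, -1):
        dest, srcs = program[i]
        live.discard(dest)        # dest is (re)defined here, kills it
        live.update(srcs)         # sources must be live coming in
        live_before[i] = set(live)
    return live_before

shader = [
    ("r0", ["v0"]),          # r0 = interpolant v0
    ("r1", ["r0", "c0"]),    # r1 = r0 * c0
    ("r2", ["r1", "r1"]),    # r2 = r1 * r1
    ("oC0", ["r2"]),         # output = r2
]
print(live_registers(shader))
```

Before the final instruction only r2 is live, so a split there would need to save one register, not the whole register file. With data-dependent branching the analysis becomes conservative (you must keep the union over possible paths), which is presumably why a pre-emptive scheme would just save everything.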
 
DemoCoder,

Unless I have misunderstood, you can't really compare frame-buffers and f-buffers, since the f-buffer stores data at the fragment level (not the pixel level). So if the first pass writes its output to an f-buffer, that data will be stored even if the fragment is occluded by another fragment arriving later. You only save geometry work if each fragment carries its position and all necessary interpolants with it in the f-buffer.

To get overdraw elimination and save on geometry, I think you would have to proceed as follows:

pass 1: z-only

pass 2: retransform geometry (with lighting this time), output position, interpolants (possibly a program id?) to f-buffer.

pass 3 - N: taking an f-buffer as input, write output to f-buffer (your 2D post processing passes).

pass N+1: composite fragments to pixels in framebuffer.

So the number of fragments in the f-buffer (if you let it cover the entire screen) will be
(number of screen pixels)*(average # of opaque fragments per pixel) + total number of transparent fragments (no overdraw reduction here).

Let's assume 1600*1200, with no transparent fragments, ~1.25 opaque fragments per pixel, and the shader requiring 2 textures and a normal as input on average.

That would be ~128*3 + 32 bits per f-buffer entry, or 416 bits = 52 bytes of storage per fragment.

Storage space would be ~125 MB for the entire screen. Given that this will be written once and read once, each frame requires ~250MB of bandwidth. A target of 60fps on a card with 30GB/s gives you 500MB per frame to play with.
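Checking that arithmetic (using the post's assumed figures: 1600x1200, ~1.25 opaque fragments per pixel, 52 bytes per entry, a 30GB/s card at 60fps):

```python
# Back-of-the-envelope f-buffer sizing under the stated assumptions.
fragments = 1600 * 1200 * 1.25        # screen pixels * opaque overdraw
entry_bytes = 52                      # 3 x 128-bit values + 32 bits
storage_mb = fragments * entry_bytes / 1e6
bandwidth_mb = 2 * storage_mb         # written once, read once
budget_mb = 30e9 / 60 / 1e6           # per-frame bandwidth budget

print(round(storage_mb))    # 125  MB of f-buffer storage
print(round(bandwidth_mb))  # 250  MB of traffic per frame
print(round(budget_mb))     # 500  MB/frame budget at 60fps
```

So the full-screen f-buffer alone eats roughly half the per-frame bandwidth budget before any texturing or framebuffer traffic, which supports the point that keeping f-buffer traffic on chip matters.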

It seems to me that giving the hardware the complete sequence of shaders to run on a primitive, and then letting it decide how many fragments to run each shader on before context switching would be more efficient than this (since it should be possible to keep all F-buffer traffic on chip).
 
ET said:
DeanoC said:
Speed is the issue, no architecture currently here or on the horizon is fast enough to really use long shaders (but it would be nice for the odd effect etc). To really use longer shaders we need at least a x10 speed increase.

But on the other hand, it could be useful for non-game applications, both for research and actual work. Being able to run any shader you want in hardware without having to worry about length could be a great boon for integrating graphics cards into actual rendering environments.

I agree, but my post was in response to which games/demos need or use long shaders. I sometimes have to do research on the Ref device, so I know about slow rendering. (Currently 5 minutes per frame :( )

I've just read the 9800 review and I'm not happy that the f-buffer is being confined to 'high end' cards by the drivers. Why IHVs have to decide that they know best is beyond me! I have several areas where I already want to use long shaders; while their use may be academic on the current cards, I'd still like to support them as an option. Time to hassle ATI...
 
I too will be extremely disappointed if the f-buffer is confined solely to the workstation cards. It doesn't require much effort at all to bump into the 96-instruction limit of the 9700Pro, and manually multi-passing everything can be truly annoying and slow.

Maybe if we yell enough at devrel they will activate it? :LOL:
 
I'm still not sure on the status of the F-Buffer driver support. One thing that ATI were keen to point out is that RenderMonkey will automatically produce multi-pass shader code, so you might be able to get long shaders running if that is used.
 
When they get over the GDC hangovers I'll see what the status is and what the possible solutions are.

I'd imagine one problem is that HLSL only has PS_2_0 and all-caps PS_2_X support; I suspect we would need a PS_2_0 path but with unlimited instructions. That shouldn't be hard to convince MS of if ATI set the right caps bits.

I kind of suspect they will support it; it's what the PS_2_0 extended cap bits were designed for. Can't think of a reason not to support it... unless it's very, very slow and would give NVIDIA some easy PR (not that that is a good reason not to expose it to us).

I'm also curious about some other things. F-Buffers presumably remove the texture dependency limit? (Does >4 just force a flush to the F-Buffer?) Does the number of texture loads and samplers per shader go up? (When you upload a new shader you could also rebind the texture samplers?)

I'll ask the questions and see what they say.
 
Is the f-buffer able to prefetch the next pass's shader instructions after writing intermediate pixel results to video memory? The instructions are held in FIFO (or some other) order, right, which means prefetching wouldn't require any fancy hardware speculation. It would effectively hide the latency inherent to multipassing, right?

I was also wondering what the f-buffer, or processor for that matter, does with the intermediate pixel results before the final results are recorded in the framebuffer. If the final pass is a function or an operation of the previous passes, what occurs? Are the intermediate results thrown back on to the onchip registers and used as operands?
 