Does the F-buffer increase the # of instructions per pass?

In theory, would the R350's speculated F-buffer allow an unlimited number of pixel shader instructions per pass (no multipassing)? I am assuming the buffer is a sort of memory pool outside the chip that backs the read/write registers on chip. If true, shouldn't it allow an arbitrarily long pixel shader to keep loading/offloading intermediate pixel data to/from the buffer without multipassing? What do you guys think (correct my perception of the F-buffer, if necessary)?
 
I wonder how many transistors ATI spent on this implementation? ATI supposedly (if rumors hold true) improved aniso, AA, render target, texture, and stencil support (hopefully TruForm too), so it seems the R350 will be around 5 to 10 million transistors larger than the R300.
 
Re: Does the F-buffer increase the # of instructions per pass?

Luminescent said:
In theory, would the R350's speculated F-buffer allow an unlimited number of pixel shader instructions per pass (no multipassing)? I am assuming the buffer is a sort of memory pool outside the chip that backs the read/write registers on chip. If true, shouldn't it allow an arbitrarily long pixel shader to keep loading/offloading intermediate pixel data to/from the buffer without multipassing? What do you guys think (correct my perception of the F-buffer, if necessary)?

I'll take a crack at it, though I'm no wiz. ;)

Yes, an F-buffer is a buffer that contains temporary data outside the frame buffer.
The actual buffer could be on-chip cache, in VRAM, or in normal system DRAM, preferably a combination of two of those (on-chip plus VRAM, for example; it could even be all three).

As for your explanation of what it's for, I'd say that's pretty much it.
Instead of reprocessing geometry and other unnecessary information each pass, it only fetches the needed fragment data from the F-buffer.
It's still "multipassing"; it's just storing and fetching temporary results much more efficiently.
In short: it only multipasses the pixels that absolutely need it. So to answer the topic question, I'd say it increases the number of instructions per traditional pass to a virtually unlimited number.

There are many different ways to implement an F-buffer, and there are also variations in what kind of data you want it to store (perhaps just RGBA values, or perhaps depth values for those fragments too, etc.).
There are also a few variations in where you want to feed the buffer data back into the pixel pipeline.
So I guess it's pretty hard to give any straight answers about what it will do for this or that architecture unless you have pretty specific, detailed information.

Sorry if this sounded dumbed down or anything; it's not my intention to give any lectures.
I'm mostly in read-only mode when it comes to the more technical bits here at B3D for a reason, ya know (i.e. feel free to correct all my misunderstandings hehe) :LOL:
 
The basic idea of the F-buffer is to store intermediate results of pixel shading on a per-pixel basis, in a FIFO-like buffer, where data are both written in and read out in rasterization order. An F-buffer cannot by itself reduce the number of passes needed per pixel in hardware, as it doesn't affect the number of instructions the hardware can perform per pass. It can, however, make the multipassing transparent to software, so that at least the application sees a shader that can support arbitrarily many instructions per pass.

Since the F-buffer needs to contain the complete per-pixel shader state for each pixel, it can grow absolutely huge unless some care is taken: for the R300/R350, the worst-case state is at least 32 temporary registers * 96 bits per register = 384 bytes per pixel ( > 700 MB @ 1600x1200 ). So we need some mechanism to limit the number of pixels in the F-buffer. Given that a large enough polygon can overflow an F-buffer of any reasonable size, we need some hardware mechanism that can swap between render states in the middle of rasterization of the polygon, so that the F-buffer can be emptied before we reach the end of the polygon.
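To put numbers on "absolutely huge", here is the worst-case arithmetic above as a quick check in C++ (the register count and width are the R300 figures quoted; the resolution is just the example used):

```cpp
#include <cstdio>

int main() {
    // Worst-case per-pixel state quoted above for R300/R350:
    const int tempRegisters   = 32;   // temporary registers per pixel
    const int bitsPerRegister = 96;   // 4 components x 24-bit float
    const int bytesPerPixel   = tempRegisters * bitsPerRegister / 8;

    const long long pixels = 1600LL * 1200LL;        // example screen size
    const long long total  = pixels * bytesPerPixel; // F-buffer footprint

    std::printf("%d bytes/pixel -> %.1f MB at 1600x1200\n",
                bytesPerPixel, total / (1024.0 * 1024.0));
    // Prints: 384 bytes/pixel -> 703.1 MB at 1600x1200
    return 0;
}
```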

This implies a number of things about the F-buffer. First, it must be managed completely by hardware, as follows:
  1. Rasterize N pixels (as many as the F-buffer can hold) and run pass 1 of the shader program on all of these pixels, storing the results to the F-buffer.
  2. Stall the rasterizer and switch render state, loading the pixel shader program for pass 2.
  3. Run pass 2 on all N pixels from the F-buffer, storing the results in the framebuffer and/or the F-buffer, depending on whether this pass was the last or not.
  4. Repeat steps 2-3 for each pass of the pixel shader program, until we have run all the necessary passes.
  5. Load the pixel shader program for pass 1 and go back to step 1, until we have no pixels left to rasterize.
Also, for this scheme to work, the framebuffer location for every pixel in the F-buffer needs to be passed along with all the other per-pixel data. This scheme will require the driver to allocate some memory for the F-buffer at the start of operation (a few megabytes at most), but will otherwise support arbitrary-length shaders in a manner fully transparent to both application and driver (except that performance will likely take a massive hit once the F-buffer comes into play). In particular, it will not require geometry data to be passed multiple times.
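To make the loop concrete, here is a toy C++ simulation of steps 1-5 (all the types, sizes, and "shader math" are invented stand-ins; real hardware would use dedicated state machines, not software):

```cpp
#include <cstdio>
#include <deque>
#include <vector>

// Toy simulation of the hardware-managed F-buffer loop (steps 1-5 above).
// Everything here is an illustrative stand-in, not a real interface.
struct PixelState { int fbLocation; float temp; };  // location + live temps

int main() {
    const size_t FBUF_CAPACITY = 4;       // pixels the F-buffer can hold (N)
    const int NUM_PASSES = 3;             // passes the long shader needs
    std::vector<int> toRasterize = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    std::deque<PixelState> fbuffer;       // FIFO: rasterization order

    size_t next = 0;
    while (next < toRasterize.size()) {
        // Step 1: rasterize up to N pixels and run pass 1 on each,
        // parking the intermediate state in the F-buffer.
        while (fbuffer.size() < FBUF_CAPACITY && next < toRasterize.size()) {
            int loc = toRasterize[next++];
            fbuffer.push_back({loc, loc * 1.0f});   // "pass 1 result"
        }
        // Steps 2-4: stall the rasterizer; for each later pass, switch
        // shader state and run the pass over every parked pixel.
        for (int pass = 2; pass <= NUM_PASSES; ++pass) {
            size_t n = fbuffer.size();
            for (size_t i = 0; i < n; ++i) {
                PixelState s = fbuffer.front();
                fbuffer.pop_front();
                s.temp += pass;                     // "run" this pass
                if (pass == NUM_PASSES)             // last pass: commit
                    std::printf("pixel %d -> framebuffer (%.0f)\n",
                                s.fbLocation, s.temp);
                else
                    fbuffer.push_back(s);           // recirculate
            }
        }
        // Step 5: F-buffer drained; reload pass 1, rasterize next batch.
    }
    return 0;
}
```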

This scheme will exhibit better texture cache locality than a long-shader implementation without the F-buffer, but at the cost of consuming lots of bandwidth for the F-buffer.

The F-buffer scheme will, however, fail when flow control/data-dependent jumps are allowed in the pixel shaders (Pixel Shader 3.0; if you allow N instructions per pass, a for-loop with N+1 instructions will break the F-buffer).
 
What is it, then, that determines the number of instructions the hardware can perform per pass? I thought the F-buffer would eliminate this bottleneck. Isn't it the cache which determines the number of instructions available per pass? If there is a pool of virtual memory to replace this cache, wouldn't a single pass be unrestricted in terms of instruction count?
 
Luminescent said:
What is it, then, that determines the number of instructions the hardware can perform per pass? I thought the F-buffer would eliminate this bottleneck. Isn't it the cache which determines the number of instructions available per pass? If there is a pool of virtual memory to replace this cache, wouldn't a single pass be unrestricted in terms of instruction count?

The size of the cache itself doesn't restrict the number of instructions it can process in a single pass.
It's still doing another pass; it's just not doing traditional multipassing, i.e. it's not passing through the whole rendering pipeline more than once. So I guess what I meant to say is that the number of instructions per whole pass through the pipeline is virtually unlimited.

Sorry if I'm being a 'tard. :)
 
Ante P said:
The size of the cache itself doesn't restrict the number of instructions it can process in a single pass.
It's still doing another pass; it's just not doing traditional multipassing, i.e. it's not passing through the whole rendering pipeline more than once. So I guess what I meant to say is that the number of instructions per whole pass through the pipeline is virtually unlimited.

Sorry, the "cache" I was referring to there was the on-chip ALU cache (if there is one), which feeds the pixel processing units, not the "effective" F-buffer cache. So my question still stands: what is the limiting factor on instruction length per pixel pass? Is it not the "cache"? Doesn't the F-buffer "effectively" extend this "cache", sort of like virtual memory on the hard drive extends the functionality of system RAM?

You are definitely not a "'tard" :LOL: Ante; it is great to have someone else's opinion in the mix. You seem to be quite knowledgeable nonetheless. I'm just trying to learn what there is from a variety of perspectives and synthesize the information. This is just one of the many questions regarding architecture that come up from reading articles, such as the one on the F-buffer.
 
If the F-buffer works as I described, the factor that will limit shader length per HW pixel pass will most likely be the size of the cache that holds the active portion of the shader on-chip. Running passes smaller than that would be too inefficient to make sense, and running them larger would cause the cache to thrash, which is something you want to avoid with the F-buffer already sucking up loads and loads of bandwidth. How large this cache is may be impossible to find out without analyzing shader performance as a function of shader program length.

You can probably view the F-buffer as a way of virtually extending the memory used to hold the pixel shader program.
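As a toy illustration of that view (cache size and shader length here are invented numbers, chosen only to show the arithmetic):

```cpp
#include <cstdio>

int main() {
    // Invented numbers, chosen only to show the arithmetic:
    const int CACHE_INSTRUCTIONS = 64;   // assumed on-chip instruction cache
    const int SHADER_LENGTH      = 500;  // a long shader

    // Number of hardware passes = number of cache-sized chunks; the
    // F-buffer carries the live per-pixel state across each boundary.
    int passes =
        (SHADER_LENGTH + CACHE_INSTRUCTIONS - 1) / CACHE_INSTRUCTIONS;
    std::printf("%d-instruction shader -> %d passes of <= %d instructions\n",
                SHADER_LENGTH, passes, CACHE_INSTRUCTIONS);
    // Each boundary costs one F-buffer write plus one read of the live
    // state per pixel, which is why you want chunks that fill, but do
    // not overflow, the cache.
    return 0;
}
```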
 
arjan de lumens said:
Running passes smaller than that would be too inefficient to make sense, and running them larger would cause the cache to thrash, which is something you want to avoid with the F-buffer already sucking up loads and loads of bandwidth.

Loads and loads? In relation to what?
 
DaveBaumann said:
Loads and loads? In relation to what?
Unless I have completely misunderstood the F-buffer, it has to write out and read back the entire per-pixel shader state for every pixel entering or leaving the buffer. For R300 and derivatives, this is at least 32 temp registers * 96 bits per register = 384 bytes. I would count that as 'loads and loads'.
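For a rough upper bound (the fillrate below is an assumed round number for illustration, not a confirmed spec for any chip):

```cpp
#include <cstdio>

int main() {
    // Per-pixel state as above: 32 registers x 96 bits = 384 bytes.
    const double bytesPerPixel = 32 * 96 / 8.0;
    // Each pass boundary costs a write out plus a read back in:
    const double bytesPerBoundary = 2 * bytesPerPixel;   // 768 bytes

    // Assumed throughput, purely for illustration: 8 pixels per clock
    // at 300 MHz (not a confirmed spec for any chip).
    const double pixelsPerSecond = 8 * 300e6;

    // Worst case, if every pixel crossed one pass boundary per clock:
    std::printf("%.0f GB/s\n",
                pixelsPerSecond * bytesPerBoundary / 1e9);
    // Prints 1843 GB/s -- 'loads and loads' indeed. In practice a long
    // shader takes many clocks per pixel, so real F-buffer traffic
    // would be a small fraction of this.
    return 0;
}
```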
 
arjan,

AFAICS, the compiler could split the for-loop in half (temps for each loop iteration => F-buffer, read on the next "pass"). It's not immediately obvious to me why data-dependent control would break an F-buffer (so long as a program can write an identifier for the next program to run on the values it has output to the F-buffer). Could you explain?

I was thinking that if you allow each pass to output such a program ID, then something similar to an F-buffer could in fact be used to support data-dependent branching:

- Split the program into blocks with only static control flow (let's just call them kernels :)). The execution of one such kernel K1 will produce state of some fixed size. You allow this state to be saved to one of two possible output streams/F-buffers, F1_1 or F1_2, based on some data-dependent condition...

Once K1 has been executed as many times as needed (all pixels in a triangle are done) or allowed (to keep the size of the F-buffer manageable), context-switch to kernel K2, taking F1_1 as input. Run K2 as many times as necessary/allowed, then context-switch to kernel K3, taking F1_2 as input.
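Something like this, in toy form (the kernel bodies and the branch condition are made up; the point is only the routing into two streams):

```cpp
#include <cstdio>
#include <vector>

// Toy version of the kernel-splitting idea: K1 has static control flow
// and routes each pixel's state into one of two F-buffers at the end.
// Kernel bodies and the branch condition are invented for illustration.
struct State { int fbLocation; float value; };

int main() {
    std::vector<State> input = {{0, 0.2f}, {1, 0.9f}, {2, 0.5f}, {3, 0.1f}};
    std::vector<State> f1_1, f1_2;   // the two output streams of K1

    // K1: straight-line computation, then a data-dependent route.
    for (State s : input) {
        s.value *= 2.0f;
        (s.value > 1.0f ? f1_1 : f1_2).push_back(s);
    }
    // K2: consumes F1_1 once K1 has drained (context switch here).
    for (State& s : f1_1) s.value -= 1.0f;
    // K3: consumes F1_2 after K2 (second context switch).
    for (State& s : f1_2) s.value += 0.5f;

    for (const State& s : f1_1)
        std::printf("pixel %d took the branch: %.1f\n", s.fbLocation, s.value);
    for (const State& s : f1_2)
        std::printf("pixel %d skipped it:      %.1f\n", s.fbLocation, s.value);
    return 0;
}
```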

Maybe you could also use the F-buffer to hide texture access latency, as follows:

- Allocate space in an F-buffer for a filtered texture result (but defer writing the actual result).
- Arrange the kernel such that the last instruction is the texture lookup.
- Pipeline the texture unit so that it can have many texture accesses in flight.
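A toy model of that arrangement (the latency figure and the "filtering" are invented; the point is that lookups issue every cycle and retire into their reserved slots several cycles later):

```cpp
#include <cstdio>
#include <queue>

// Toy model of the latency-hiding idea: each kernel's last instruction
// is a texture lookup whose result is deferred into a reserved F-buffer
// slot, so many lookups stay in flight at once. The latency figure and
// the "filtering" are invented.
struct Fetch { int slot; int texCoord; int readyAtCycle; };

int main() {
    const int TEX_LATENCY = 4;             // assumed pipeline depth, cycles
    const int NUM_PIXELS  = 8;
    float fbufferSlots[NUM_PIXELS] = {};   // reserved result slots
    std::queue<Fetch> inFlight;

    for (int cycle = 0; cycle < NUM_PIXELS + TEX_LATENCY; ++cycle) {
        // Issue one lookup per cycle while pixels remain; the kernel
        // just reserves a slot and moves on to the next pixel.
        if (cycle < NUM_PIXELS)
            inFlight.push({cycle, cycle * 3, cycle + TEX_LATENCY});

        // Retire lookups whose latency has elapsed: write the filtered
        // result into the slot the next kernel will read from.
        while (!inFlight.empty() && inFlight.front().readyAtCycle <= cycle) {
            Fetch f = inFlight.front();
            inFlight.pop();
            fbufferSlots[f.slot] = f.texCoord * 0.5f;   // "filtered" texel
            std::printf("cycle %2d: slot %d <- %.1f\n",
                        cycle, f.slot, fbufferSlots[f.slot]);
        }
    }
    return 0;
}
```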


Regards,
Serge
 
OK... if you store a program counter along with all the other per-pixel state data, you can make F-buffering work with data-dependent branching as arbitrary as you want.
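In toy form, that per-pixel program counter looks something like this (the two-block "program" and the per-pixel loop limits are invented):

```cpp
#include <cstdio>
#include <deque>

// Toy version of the per-pixel program counter: every F-buffer record
// carries the block its pixel resumes at, so pixels can take different,
// data-dependent numbers of passes. The two-block "program" and the
// per-pixel loop limits are invented.
struct Record { int fbLocation; int pc; float acc; float limit; };

int main() {
    // Three pixels whose loops run for different numbers of iterations:
    std::deque<Record> fbuffer = {{0, 0, 0.0f, 1.0f},
                                  {1, 0, 0.0f, 3.0f},
                                  {2, 0, 0.0f, 2.0f}};

    while (!fbuffer.empty()) {
        Record r = fbuffer.front();
        fbuffer.pop_front();
        switch (r.pc) {
        case 0:                                 // block 0: the loop body
            r.acc += 1.0f;
            r.pc = (r.acc < r.limit) ? 0 : 1;   // data-dependent jump
            fbuffer.push_back(r);               // recirculate with its own PC
            break;
        case 1:                                 // block 1: write out, retire
            std::printf("pixel %d retired with acc = %.0f\n",
                        r.fbLocation, r.acc);
            break;
        }
    }
    return 0;
}
```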

But: the way I see it, F-buffering mostly makes sense as a method to avoid pixel shader instruction cache thrashing (which IIRC hits NV30's performance hard on long shaders) by making sure to process as much data as practically possible each time you reload the instruction cache. If you allow arbitrary data-dependent jumps into the mix, you at worst risk running into a situation where you get BOTH the bandwidth overhead of F-buffering and instruction cache thrashing.

You may be able to mitigate the damage most of the time, as you suggest, by placing rendezvous points such that all pixels must reach such a point before further processing is allowed. This should fix most smaller loops/kernels and the N+1-instruction for-loop, but you lose when I put that for-loop into a while-loop within a switch statement within a function called from within an if-statement :devilish: OK, it will probably work fairly well most (>99%?) of the time, for most shaders except truly evil ones, but when it fails, it fails spectacularly.

For non-data-dependent flow control, like that found in VS2.0, F-buffering should always work fine, though.
 
If bandwidth were an issue, don't you think they would combine the F-buffer with lossless compression? That would minimize the impact a great deal.
 
Given that most of the per-pixel state is floating-point numbers, I don't expect the F-buffer data to be very compressible (except possibly data freshly read out of a texture map, and the framebuffer location data). What you can do, however, is run a live-variable analysis at shader compile time, split the pixel shader up at points with few live variables, and at each of those points F-buffer only the live variables. I would guess that you can get away with 5-10 registers much of the time, giving a 70-85% saving over naively storing all 32 registers.
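Checking that guess against the numbers quoted earlier in the thread:

```cpp
#include <cstdio>

int main() {
    // Figures quoted in the thread: 96 bits = 12 bytes per register,
    // 32 registers = 384 bytes of naive per-pixel state.
    const double bytesPerReg = 96 / 8.0;
    const double fullState   = 32 * bytesPerReg;

    // Split points with 5-10 live registers (the guess above):
    const int liveCounts[] = {5, 10};
    for (int live : liveCounts) {
        double bytes  = live * bytesPerReg;
        double saving = 100.0 * (1.0 - bytes / fullState);
        std::printf("%2d live regs: %3.0f bytes/pixel, %2.0f%% saving\n",
                    live, bytes, saving);
    }
    // Prints 60 bytes / 84% and 120 bytes / 69% -- roughly the 70-85%
    // ballpark quoted above.
    return 0;
}
```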
 
This would only be a serious issue if every pixel (or a large number) used a long shader program though.
 
rwolf said:
This would only be a serious issue if every pixel (or a large number) used a long shader program though.

Why would it be an issue then?
If something is using long enough shaders to require the F-buffer, don't you think the R300 might have bandwidth to spare?

If it has to sit there for a 500-instruction shader, how many clocks is that?
I'd bet that because of shader throughput limitations, bandwidth is a non-concern.
 
Just running a long shader program is quite a hit even without having to swap information out to memory. But most shaders, I would imagine, are short, so the F-buffer would handle the exception rather than the rule.
 