Can F-buffer mask the importance of single-pass abilities?

Luminescent · Mar 6, 2003

If the F-buffer is able to pre-fetch incoming fragment instructions (organized in FIFO order), it seems it would be able to relieve the per-pass performance limitations of the R3xx (particularly R350) architecture. The R3xx core is limited to 4 dependent texture reads per pass, with diminishing performance returns if more than 2 dependencies are required. Assuming an arbitrarily long shader program with multiple texture dependencies, wouldn't the F-buffer "effectively" be able to fetch the instructions following the first 2 dependencies (even if 4 can be computed per pass) and hold them while the fragment pipeline finishes computing the first two, loading the next 2 into the registers as they are cleared to main memory (repeating this however many times it takes to complete the shader)? The first half of this pass (results for 2 dependent lookups and arbitrary shader computations) would write to video memory (through the F-buffer), while the next half of the shader pass continues (again, repeating this as many times as necessary throughout the program). This would mask the performance penalty of going over 2 texture dependencies per shader, no? Does this dilute the value of the GeForce FX's supposed superiority to the R300 in long, complex shaders?

demalion · Mar 6, 2003

I'm not sure to what degree the GF FX has an advantage, as the R300 is slower relative to itself with over 2 dependencies. Presumably, it is slower relative to the GF FX for over 4, but even that hasn't been established clearly to my knowledge.

After that, it seems the F-buffer could present interesting on-chip optimization possibilities, but do we have any idea of latency issues with it?

Forgive me if I'm missing the obvious (I'm not understanding how your explanation can be universal), but I'm a bit rushed for an appointment.

Luminescent · Mar 6, 2003

I am mainly referring to the possibilities of the F-buffer in the R350 (or Rxxx architecture). From what it seems, it allows the R350 to attain the major advantages of the NV30 (in terms of shader length and resource management in the fragment pipeline), minus all the NV30's vices.

Even if F-buffer usage and management introduces a small latency overhead (reading from and writing to the buffer), the performance of the R350 on shaders longer than it supports per pass should improve significantly (in comparison to the NV30). With the F-buffer, only fragment multipassing is necessary, and the latency can be hidden by pipelining the memory reads and writes.

The pass splitting for performance gain in my dependent texture example (splitting what could execute in a single pass into two) would seem to require developer support. However, masking the latency of shader changes per pass should be possible and very effective; it would essentially make it seem as though the R350 can execute arbitrarily long shaders with dependencies and pay no significant penalty aside from those it takes in a single pass.

Cool (if and only if the F-buffer can work in such a way; seems logical). 8)

sireric · Mar 6, 2003

In our implementation of the F-Buffer, we can completely hide the latency of accessing the buffer. Writes from the fragment shader to the F-Buffer are similar to other outputs, and have no effect on the fragment execution. F-Buffer reads are similar to texture reads and we already, by architecture, hide that latency from the shader execution.

The only issue that is left is BW. The thing to note is that the F-Buffer will be invoked when the instruction count exceeds the 160 instruction limit. That means that the F-Buffer reads/writes only occur a few times every 160 instruction pass (which is at most 64 cycles). That means that F-Buffer BW is very low. Texture reads from the shader program would still dominate the BW.
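A rough sketch of the bandwidth argument above, assuming a naive split in which every pass is filled to the 160-instruction limit and each pass boundary costs one F-buffer write plus one read per live value. The function names and the `values_per_boundary` parameter are hypothetical and not part of any ATI interface.

```python
import math

PASS_LIMIT = 160  # per-pass instruction limit quoted in the post above


def passes_needed(instruction_count: int, limit: int = PASS_LIMIT) -> int:
    """Number of passes under a naive split (assumption: every pass is filled)."""
    if instruction_count <= 0:
        raise ValueError("instruction_count must be positive")
    return math.ceil(instruction_count / limit)


def fbuffer_accesses(instruction_count: int, values_per_boundary: int = 1) -> int:
    """F-buffer reads plus writes per fragment under a hypothetical cost model.

    Each boundary between two passes writes `values_per_boundary` values and
    reads them back at the start of the next pass.
    """
    boundaries = passes_needed(instruction_count) - 1
    return 2 * values_per_boundary * boundaries


if __name__ == "__main__":
    for n in (24, 160, 161, 480):
        print(n, passes_needed(n), fbuffer_accesses(n))
```

Under this model a shader that fits in one pass never touches the F-buffer, and a longer one performs only a handful of accesses per fragment, which is small next to the texture reads issued inside each pass.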

In general, real time applications will not take advantage of the F-Buffer, since real time applications will limit their shader instruction count to, at most, 1 to 2 dozen instructions (e.g. 3DMark03 or D3). Of course, they could use it for small high-complexity objects. That being said, using our F-Buffer, we were able to execute a compiled RenderMan shader (~500 instructions) at 50 FPS (our Quadro FX board executed it at 2.7 FPS -- must be a driver bug?). In the same way, but at lower fps, we can execute much more complex shaders (10's of thousands of instructions).

Later

Joe DeFuria · Mar 6, 2003

Can you clarify if the F-Buffer will be supported on the 9800/Pro, or only the FireGL version of the product? (Assuming a GL product is forthcoming). And in both DirectX and/or OpenGL?

Nebuchadnezzar · Mar 6, 2003

yay for sireric.

_GeLeTo_ · Mar 6, 2003

TEXKILL & F-Buffers

Would it be possible to implement dynamic flow control by using TEXKILL in the F-Buffer passes?

sireric · Mar 6, 2003

For now, the plan is to support the F-Buffer in all products, in the GL2 and possibly as an extension (i.e. uber-buffers) in GL1.x.

We haven't committed to bringing it to DX yet.

Static flow control is easy to implement with F-Buffers. Dynamic flow control is possible, but a little more tricky. You would need to program all alternatives as a separate group of passes, and use a kill function to prevent the writes. Since, in general, F-Buffers are not real time, it should work fine. If it's just a few hundred instructions, then even real time should be fine. Depends on the complexity.
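One way to picture the kill-based approach is a pure-Python simulation (not real shader code): each branch runs as its own pass over every fragment, and a kill test discards the write from the branch whose condition does not hold. All names here are hypothetical, and the branch condition is an arbitrary example.

```python
from typing import Callable, Dict, List

Fragment = Dict[str, float]


def run_pass(
    fragments: List[Fragment],
    keep: Callable[[Fragment], bool],
    shade: Callable[[Fragment], float],
    output: List[float],
) -> None:
    """Run one branch as its own pass; killed fragments leave `output` untouched."""
    for i, frag in enumerate(fragments):
        if not keep(frag):
            # Stands in for a kill: no write happens for this fragment.
            # (The sketch also skips the shading work; hardware may not.)
            continue
        output[i] = shade(frag)


def shade_with_branch(fragments: List[Fragment]) -> List[float]:
    """Emulate `2*x if x > 0.5 else x + 1` with two kill-guarded passes."""
    output = [0.0] * len(fragments)
    run_pass(fragments, lambda f: f["x"] > 0.5, lambda f: 2.0 * f["x"], output)
    run_pass(fragments, lambda f: f["x"] <= 0.5, lambda f: f["x"] + 1.0, output)
    return output


if __name__ == "__main__":
    print(shade_with_branch([{"x": 0.25}, {"x": 0.75}]))
```

The cost of this scheme is that every alternative is issued for every fragment, which fits the remark that it suits non-real-time or moderately sized shaders.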

Luminescent · Mar 6, 2003

cool, sireric, your help is always appreciated.

Joe DeFuria · Mar 6, 2003

[In the requisite Homer Simpson voice]

"Mmmmm.... uber-buffers...."

thanks for the info, sireric!

Xmas · Mar 6, 2003

sireric, could you give us some info on how big that F-buffer is/can be and whether it is (partly) on-chip?

Bambers · Mar 7, 2003

Presumably it doesn't matter which limit the shader runs into (i.e. constants etc.)? It will just start a new 'pass'?

Luminescent · Mar 7, 2003

Xmas, the F-buffer is just temporary storage for the next pixel instructions going to the fragment processor registers. Following each necessary pass, the intermediate results are taken from the fragment registers and stored in the F-buffer; the data is then used as the input for the next series of shader instructions.

Xmas · Mar 7, 2003

How does that relate to my question?

arjan de lumens · Mar 7, 2003

Hmm .. I do have some questions myself about the F-buffer to sireric/whoever else may be qualified to answer:

How much per-fragment state does it hold? Are we talking about just 1-4 RGBA tuples or the full set of 32 or so floating-point pixel shader registers?
How transparent is it to the programmer? Does the shader programmer need to know it's there and how it works in order to take advantage of it at all? Or can I just write a 1000-instruction shader, constantly using 25+ temporary pixel shader registers, and expect the R350 and its driver to just sort it out for me?

The reason I ask is that I am a little bit confused as to how it operates: if it stored the full 32-register per-fragment state, we would be talking about several hundred bytes per fragment, but sireric here seems to claim that it is about as expensive as a framebuffer write + a texture lookup per pass. So what is going on here?
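The size gap behind this question can be sketched with simple arithmetic, assuming four-component registers at 32 bits per component (a 24-bit fragment format would give a smaller figure). `state_bytes` is a hypothetical helper, not a description of the actual hardware layout.

```python
def state_bytes(registers: int, components: int = 4, bits_per_component: int = 32) -> int:
    """Bytes needed to hold `registers` vector registers for one fragment."""
    if registers < 0:
        raise ValueError("registers must be non-negative")
    return registers * components * bits_per_component // 8


if __name__ == "__main__":
    print("1 RGBA tuple:      ", state_bytes(1), "bytes")
    print("4 RGBA tuples:     ", state_bytes(4), "bytes")
    print("32 registers:      ", state_bytes(32), "bytes")
    print("32 regs at 24 bits:", state_bytes(32, bits_per_component=24), "bytes")
```

Saving the full register file lands in the "several hundred bytes" range, while saving one to four tuples is comparable to an ordinary framebuffer write.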

Luminescent · Mar 7, 2003

Xmas, I guess I forgot to state that because it is only temporary storage, it is probably no bigger than a small cache. This is pure speculation though. Sireric would be able to provide the facts.

Xmas · Mar 7, 2003

Luminescent said:
Xmas, I guess I forgot to state that because it is only temporary storage, it is probably no bigger than a small cache. This is pure speculation though. Sireric would be able to provide the facts.

Yes, temporary storage, but for how many pixels? If it's only a few KiB and you need to store lots of active registers, it may only be a few dozen pixels before you have to switch to the next pass for these pixels.

Luminescent · Mar 7, 2003

Sireric wrote:
The thing to note is that the F-Buffer will be invoked when the instruction count exceeds the 160 instruction limit. That means that the F-Buffer reads/writes only occur a few times every 160 instruction pass (which is at most 64 cycles).

The fact that the F-buffer is not loaded frequently, and only when instruction counts exceed 160, seems to indicate it contains the full amount of 32 r/w temps for the data it receives as input and the data it loads from video memory.

I've not quite figured out the difference between the way instructions are stored and the way inputs are stored. How are their registers different? After the outputs of the fragment pipeline are read back into the F-buffer (following a pass), are the results written back to the pipeline as input, along with the next set of instructions?

Does the F-buffer receive the data for all coming passes, even the first, if instruction counts exceed 160?

shaderman · Mar 7, 2003

stuff

Luminescent said:
Sireric wrote:
The thing to note is that the F-Buffer will be invoked when the instruction count exceeds the 160 instruction limit. That means that the F-Buffer reads/writes only occur a few times every 160 instruction pass (which is at most 64 cycles).

The fact that the F-buffer is not loaded frequently, and only when instruction counts exceed 160, seems to indicate it contains the full amount of 32 r/w temps for the data it receives as input and the data it loads from video memory.

I've not quite figured out the difference between the way instructions are stored and the way inputs are stored. How are their registers different? After the outputs of the fragment pipeline are read back into the F-buffer (following a pass), are the results written back to the pipeline as input, along with the next set of instructions?

Does the F-buffer receive the data for all coming passes, even the first, if instruction counts exceed 160?

It's not necessary to write out (to the F-buffer) ALL the register states between passes. Only the registers that need to live between passes.
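To make "registers that need to live between passes" concrete, here is a minimal liveness sketch for straight-line code, assuming each instruction is a (destination, sources) pair and the pass boundary is a simple instruction index. The representation and the name `live_across` are hypothetical; a real driver would work on its own intermediate form.

```python
from typing import List, Set, Tuple

# Each instruction is (destination register, source registers).
Instruction = Tuple[str, Tuple[str, ...]]


def live_across(program: List[Instruction], split: int) -> Set[str]:
    """Registers written before `split` and read after it before being overwritten."""
    written_before = {dest for dest, _ in program[:split]}
    redefined: Set[str] = set()
    live: Set[str] = set()
    for dest, sources in program[split:]:
        for reg in sources:
            if reg in written_before and reg not in redefined:
                live.add(reg)
        redefined.add(dest)  # later reads of `dest` see the second-pass value
    return live


if __name__ == "__main__":
    program: List[Instruction] = [
        ("r0", ("t0",)),
        ("r1", ("r0",)),
        ("r2", ("r1", "r0")),
        # pass boundary here (split = 3)
        ("r3", ("r2",)),
        ("r0", ("r3",)),
        ("out", ("r0", "r3")),
    ]
    print(live_across(program, 3))
```

In this example three registers are written in the first pass, but only `r2` has to be saved, because `r1` is never read again and `r0` is overwritten before its next read.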

Also, it's not necessary to involve shader code in the F-buffer. It's all data.

If you construct a code DAG from a shader, some portions resolve to single values (paths) and bifurcations in the graph result from conditionals and independent code paths. The trick is (obviously) to keep dynamic conditionals in a "sub"-pass, since handling them across passes gets a bit more complicated. But you could do this by storing the conditional predicate in the F-buffer.

You handle the independent code paths in the shader by pushing values into the F-buffer. So the F-buffer really is just to handle the independent code paths in a shader.

Imagine that a primitive generates 10 fragments in a ~3 pass (~500 instructions) shader. The first 100 instructions could resolve to a single result, e.g. computing a color from compositing cubic_env_maps, xformed normals, xformed eye/lights, some dependent texture reads, resulting in a single color value.

The next 150 instructions do something similar to get another color result. And so on until instruction 500.

Each section of code that results in a few (or one) outstanding value (AND needs to be read by a subsequent pass) gets pushed into the F-buffer. The subsequent passes would POP, IFF they needed them for some reason, otherwise there would be no need to F-buffer the data, and you could write the result to output registers (since no subsequent passes would need to write that particular output). Maybe one pass does primary color and the second pass does the secondary color. No F-buffer needed. But you do need HW multipassing support. The final combine would handle the blend/etc.
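The push/pop scheme described above can be sketched as a two-pass simulation, assuming fragments reach the second pass in the same order they left the first so that a plain FIFO pairs each fragment with its own intermediate value. `two_pass_shade` and its callbacks are hypothetical names for illustration only.

```python
from collections import deque
from typing import Callable, Deque, List


def two_pass_shade(
    inputs: List[float],
    first_pass: Callable[[float], float],
    second_pass: Callable[[float, float], float],
) -> List[float]:
    """Shade every fragment in two passes, carrying one value through a FIFO."""
    fbuffer: Deque[float] = deque()

    # Pass 1: every fragment pushes its single live intermediate value.
    for x in inputs:
        fbuffer.append(first_pass(x))

    # Pass 2: fragments arrive in the same order, so popping from the front
    # hands each fragment the value it pushed in pass 1.
    outputs = [second_pass(x, fbuffer.popleft()) for x in inputs]

    assert not fbuffer  # everything pushed was consumed
    return outputs


if __name__ == "__main__":
    print(two_pass_shade([1.0, 2.0, 3.0], lambda x: x * x, lambda x, mid: mid + x))
```

If the second pass did not need the intermediate value, the push could be skipped entirely and the first pass could write straight to an output register, as the post notes.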

It seems the best way to expose F-buffers, is to allow arbitrary length shaders. No need for further (explicit) support.

Also, there's lots of talk about long shaders being non-realtime. Rubbish. There are potentially lots of uses for expensive pixels to do finishing touches. Especially in a deferred shading system.

What's particularly interesting is that DCC people can replace roomfuls of Pentiums (render farms) with roomfuls of R300s to make serious render farms.

I would estimate that one r300 can replace (in pure FLOP/dollar terms) ~10 P4's.

R300 = 8 pipes = 32 FP ops per cycle @ 400 MHz = ~ 13 GF
P4 = 2 FP ops (scalar) + 2 FP ops (SSE) ~ 4 FP ops per cycle @ 3000 MHz = ~12 GF

R300 @ 400 MHz = ~ $60
P4 @ 3000 MHz = $600

Assuming you can fully utilize FP/SSE mixing in the P4 (not likely). Math is probably wrong. People in the know can correct this...
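The estimate above can be reproduced with a few lines, taking the poster's own ops-per-cycle, clock and price figures at face value (they are rough guesses, as the post itself says). The helper name `peak_gflops` is hypothetical.

```python
def peak_gflops(ops_per_cycle: int, clock_mhz: int) -> float:
    """Theoretical peak in GFLOPS: ops per cycle times clock in MHz, over 1000."""
    return ops_per_cycle * clock_mhz / 1000.0


# Figures taken from the post above, not measured values.
r300_gflops = peak_gflops(32, 400)
p4_gflops = peak_gflops(4, 3000)

r300_per_dollar = r300_gflops / 60.0
p4_per_dollar = p4_gflops / 600.0
ratio = r300_per_dollar / p4_per_dollar

if __name__ == "__main__":
    print(f"R300: {r300_gflops:.1f} GF, P4: {p4_gflops:.1f} GF")
    print(f"FLOP/dollar ratio: {ratio:.1f}")
```

With these inputs the two chips have nearly the same peak throughput, so the ratio of roughly ten comes almost entirely from the assumed price difference.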

- SM

Luminescent · Mar 7, 2003

A bit off topic, but just for clarification:

The R300 (and R350, presumably) holds 60 programmable floating point processors (fmad/frcp/flog/etc.). This is how the numbers add up:

In the vertex shader: there are 5 units per vertex pipeline, 4 fmads, and 1 scalar (complex function) fp core. Each fmad can execute 2 fp instructions per-cycle and the scalar fp unit can execute 1. This yields 9 fp ops per vertex unit per-cycle. Four vertex units would yield a theoretical fp performance rate of (9*4*400) 14.4 gflops at 400 MHz.

In the fragment shader: there are three major floating point cores. A texture filtering and coordinate unit, a texture address unit, and a fragment color unit. The major contributor to floating point ops is the fragment color unit, which consists of 4 fmads and a special purpose fp core (like the vertex shader's pipeline, organized a bit differently). Each fmad contributes a potential 2 fp ops per cycle, and the complex fp core can kick in 1 fp op. This also sums to a total of 9 fp ops possible per clock. Eight pixel pipelines mean 8*9 fp color ops per cycle. This yields 28.8 gflops at 400 MHz. Counting the fp texture address unit, which is capable of 1 fp op itself, would increase the per-cycle fp count to (10x8) 80 fp ops. This would add up to a whopping 32 gflops of capability in the R300's pixel shader. (Man that was long!)

These are all fully programmable units, so they are, in essence, fully programmable flops (at least for the tasks they were meant to handle). If we add the max theoretical outputs of the vertex and pixel pipelines of the R300/R350, we arrive at 46.4 gflops, which is nothing to scoff at. Remember, this does not include the triangle set-up unit throughput or texture filtering unit throughput, which are floating point capable. However, these units, to my knowledge, are not fully programmable. The same goes for the anti-aliasing portions of the processor.
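For anyone checking the sums, the unit counts given above can be tallied as follows. This only restates the poster's figures (4 fmads at 2 ops plus 1 scalar op per pipeline, 4 vertex and 8 pixel pipelines, 400 MHz); the function names are hypothetical.

```python
CLOCK_MHZ = 400


def ops_per_pipeline(fmads: int = 4, scalar_units: int = 1, extra_ops: int = 0) -> int:
    """Each fmad counts as 2 fp ops per cycle, each scalar unit as 1."""
    return fmads * 2 + scalar_units + extra_ops


def gflops(pipelines: int, ops: int, clock_mhz: int = CLOCK_MHZ) -> float:
    """Theoretical peak in GFLOPS for a set of identical pipelines."""
    return pipelines * ops * clock_mhz / 1000.0


vertex = gflops(4, ops_per_pipeline())                    # 4 vertex units, 9 ops each
fragment_color = gflops(8, ops_per_pipeline())            # 8 pixel pipes, color unit only
fragment_total = gflops(8, ops_per_pipeline(extra_ops=1)) # plus the texture address op
total = vertex + fragment_total

if __name__ == "__main__":
    print(f"vertex {vertex:.1f}, fragment color {fragment_color:.1f}, "
          f"fragment total {fragment_total:.1f}, combined {total:.1f}")
```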

Let us not forget that out of those 46.4 potential gflops, 28.8 of them (the ones coming from the fragment color processor) are at 24-bit per color component (96-bit total) max precision, while the others, from the vertex units and texture address unit, are 32-bit per color component (128-bit total) max precision.

I came to this information by reading a few informative threads in the Beyond 3D forum, and consulting with a certain reliable individual for clarification.

Information resources may be found here:
http://www.beyond3d.com/forum/viewtopic.php?p=28682#28682
http://www.beyond3d.com/forum/viewtopic.php?p=53279#53279
http://firingsquad.gamers.com/hardware/radeon_9700/default.asp

P4 = 2 FP ops (scalar) + 2 FP ops (SSE) ~ 4 FP ops per cycle @ 3000 MHz =~12 GF

So at $60 vs $600, the R300/R350 blows the competition (PIV) away. Are you ready ... for more!!
