Can F-buffer mask the importance of single-pass abilities?

dominikbehr · Mar 8, 2003

MfA said:
I've seen it said (http://groups.google.com/groups?q=renderman+single-precision&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=87a9n5%2482b%241%40sherman.pixar.com&rnum=7) that prman uses single precision floating point almost exclusively.

Wow! You have amazing google skills

And straight from the horse's mouth. I agree that prman is high-end rendering. If 32-bit fp is good enough for prman, especially for geometry processing, that's really cool.

but,

Larry Gritz said:
Of course, both use doubles occasionally as
temporaries for intermediate calculations in certain parts of the
renderers where that last little bit of precision is vital.

I think this may be quite important here. He admits that there are places where doubles are vital. Unfortunately we do not know what the consequences of using floats instead would be. Maybe what he considers vital we wouldn't even notice. Or maybe we would.
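To make the "doubles as temporaries" point concrete, here is a minimal sketch. It is not anything from prman; the helper names and the one-million-term sum are assumptions chosen for illustration. It rounds to fp32 through the `struct` module and compares an fp32 running total against an fp64 temporary that is rounded to fp32 only once at the end.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_f32(values):
    """Accumulate with the running total rounded to fp32 after every add."""
    total = 0.0
    for v in values:
        total = to_f32(total + to_f32(v))
    return total

def sum_f64_temp(values):
    """Same fp32 inputs, but an fp64 temporary holds the running total."""
    total = 0.0
    for v in values:
        total += to_f32(v)
    return to_f32(total)  # round to fp32 once, at the end

values = [0.1] * 1_000_000          # hypothetical workload
err_f32 = abs(sum_f32(values) - 100_000.0)
err_f64 = abs(sum_f64_temp(values) - 100_000.0)
print("fp32 accumulator error:", err_f32)
print("fp64 temporary error:  ", err_f64)
```

Whether an error of this kind would be visible in a rendered image depends entirely on what the value feeds into, which is exactly the open question here.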

LeStoffer:
It seems Microsoft designed DX9 to last a while, and the shaders 3.0 spec tells us how hardware will evolve over the next few years. I consider it an incremental change over shaders 2.0. We just experienced a big shift from the fixed to the programmable pipeline. I think that we have entered a new era in hardware accelerated computer graphics. Judging by how long we used the fixed pipeline, the current era will last a very long time. Actually there will be little incentive to change, because there is nothing new on the horizon and everything seems to be possible now.

Reverend:
Your comments on precision issues are quite valid. Personally I don't find 32-bit fp good enough for everything either.

Now, you may ask what is this positional data if it isn't an input to the vertex shader (which isn't supported yet by any hardware currently available)? Well, it could be in vertex or pixel shader constants, vertex stream values, textures (either manually generated or coming from render-to-texture).

See the render-to-vertex-array demo shown recently at GDC using a Radeon 9700 and the uberbuffers extension.

------------
In summary:
- Computers work with finite precision numbers.
- It is a programmer's job to know the architecture and write code that doesn't run into problems.
- Certain classes of calculations require some minimal precision.

I like geometry done in doubles or more (I think all the doubles input in OpenGL is there because of precision issues in flight/space sims), but fp32 seems to be good enough for most cases. fp24 is good for displacement mapping too. Even fp16 could be used for storing source object vertices, normals, normal maps, displacement mapping. But usually you want your matrices in fp32 or fp64.

For color/lighting calculations, 8-bit int, fp16, fp24 and maybe even fp32 in extreme cases will do, depending on what you are doing.

Texture addressing depends on lots of factors, usually on the size of the texture and the number of repetitions. I liked the 3dlabs demo showing the precision of the texture interpolators on their Wildcat line. But the message I got was really: if you have a dumb programmer or artist, they can screw up even a simple scene with a conveyor belt and ducklings.
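A rough way to see how texture size and repetition count eat into coordinate precision is to count the mantissa bits left over once the integer part of the coordinate and the texel index are accounted for. The sketch below is an estimate only. The fp24 layout (1 sign, 7 exponent, 16 mantissa bits) is an assumption, the function name is made up, and power-of-two texture sizes are assumed.

```python
# Explicit mantissa bits per format (hidden bit not counted).
# The fp24 entry is an assumed layout, not something stated in this thread.
MANTISSA_BITS = {"fp16": 10, "fp24": 16, "fp32": 23}

def subtexel_bits(fmt: str, texture_size: int, repeats: int) -> int:
    """Estimate the sub-texel precision left at the far end of a tiled texture.

    A coordinate near `repeats` spends floor(log2(repeats)) bits on its
    integer part and log2(texture_size) bits on selecting the texel.
    Whatever remains is available for filtering between texels.
    A negative result means neighbouring texels can no longer be addressed.
    """
    integer_bits = repeats.bit_length() - 1      # floor(log2(repeats)), repeats >= 1
    texel_bits = texture_size.bit_length() - 1   # log2 for power-of-two sizes
    return MANTISSA_BITS[fmt] - integer_bits - texel_bits

for fmt in ("fp16", "fp24", "fp32"):
    print(fmt, subtexel_bits(fmt, texture_size=1024, repeats=64))
```

Under these assumptions a 1024-texel texture tiled 64 times leaves nothing for filtering at fp24, which is the sort of scene setup that can be avoided by keeping coordinates small.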

MfA · Mar 8, 2003

Would it take GPUs much more hardware to support double precision floating point anyway? You'd think supporting it at 1/4th or 1/9th the throughput of single precision wouldn't add too much (you need 26-bit multipliers for 1/4th, which would mean some wasted bits when using single precision; you can reuse the multipliers from the single precision units as is if you settle for 1/9th). It doesn't need to be fast, it just needs to be able to accumulate some high precision values occasionally.
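The 1/4 and 1/9 figures come from how many narrow partial products a wide significand multiply breaks into. Here is a minimal sketch on plain integers. The function name and test values are made up, and it uses 27-bit limbs for the two-limb case, an assumption made so that two limbs cover a full 53-bit significand with the hidden bit included. Rounding and exponent handling are not modelled.

```python
def limb_multiply(a: int, b: int, limb_bits: int, total_bits: int = 53):
    """Multiply two significands using only limb_bits x limb_bits products.

    Returns (product, number_of_partial_products).
    """
    n_limbs = -(-total_bits // limb_bits)  # ceiling division
    mask = (1 << limb_bits) - 1
    a_limbs = [(a >> (i * limb_bits)) & mask for i in range(n_limbs)]
    b_limbs = [(b >> (i * limb_bits)) & mask for i in range(n_limbs)]

    product = 0
    count = 0
    for i, x in enumerate(a_limbs):
        for j, y in enumerate(b_limbs):
            # Each narrow product is shifted into place and accumulated.
            product += (x * y) << ((i + j) * limb_bits)
            count += 1
    return product, count

# Two arbitrary 53-bit significands with the hidden bit set.
a = (1 << 52) | 0x000F_1234_5678_9ABC
b = (1 << 52) | 0x0003_DEAD_BEEF_CAFE

print(limb_multiply(a, b, 27)[1])  # partial products with wide limbs
print(limb_multiply(a, b, 24)[1])  # partial products with fp32-sized limbs
```

With 24-bit limbs, the width of an fp32 significand multiplier, three limbs per operand are needed, hence nine narrow multiplies per double precision multiply.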

demalion · Mar 8, 2003

I'm assuming the data for such can't be calculated relative to and separately from a geometry face calculation? Also, I suppose I'm not certain about the range of applicability for texture address calculations in the pixel shader.

In any case, for subtraction/addition, how many ops would it take to carry the "lost precision" data separately, keeping in mind the modifiers the R300 supports?
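For addition, the textbook answer to "how many ops" is the error-free transformation usually called TwoSum, which recovers the rounding error of one add with five more add/subtract operations. The sketch below runs on Python floats (fp64). Whether the same sequence is exact on R300 depends on how its arithmetic rounds, which this thread does not establish, so treat that as an open assumption.

```python
def two_sum(a: float, b: float):
    """Return (s, err) where s is the rounded sum and s + err equals a + b exactly.

    Six floating point operations in total: one add for the sum,
    then four subtracts and one add to reconstruct what was lost.
    Exactness relies on round-to-nearest arithmetic.
    """
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    err = (a - a_virtual) + (b - b_virtual)
    return s, err

s, err = two_sum(1e16, 1.0)
print(s, err)  # the 1.0 that the sum dropped is carried in err
```

Carrying `err` alongside `s` through a chain of additions is what makes a pair of low precision values behave like one higher precision value.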

What would be really interesting is some input from some ATI people on their thoughts for these types of calculations (I hate how the search doesn't allow less than 3 digit numbers for searching, since I'm pretty sure they have participated at least in part in prior discussions... ).

Oh, and I guess I need to go figure out render to vertex array (where the need for full 32-bit FP precision seems more obvious to me). :-?

arjan de lumens · Mar 8, 2003

While you can do double-precision multiplies by combining a number of single-precision multipliers, you will need a substantial additional amount of hardware for DP adders (which cannot be made by combining SP adders). Also, for operations like reciprocal, rsq, pow, sin, cos, etc, there are reasonably fast hardware implementations available for single-precision operation that become much slower and more expensive with double-precision operation (by factors I would estimate to be between 4 and 10, depending on operation and how you balance performance against circuit size). Also, if you intend to split double-precision instructions over multiple cycles, you complicate the control logic that issues the shader instructions a great deal.

Reverend · Mar 8, 2003

dominikbehr said:
- It is a programmer's job to know the architecture and write code that doesn't run into problems.

What a friggin' hard job that is when not all of an architecture is revealed to them from the outset. I love being excited by new architectures but I hate it when I have to waste time sending emails back and forth trying to find out why something is so... only to be told "Oh, that's the limitation... didn't we tell you about this before?"

PS. Sorry for an unconstructive post... lots of beer in the system from a dinner & dance with a new bloody sexy work colleague.

BRiT · Mar 8, 2003

Rev, I didn't know you worked with Kristof.

shaderman · Mar 8, 2003

loops/branches

DemoCoder said:
Not quite. There is still the issue of handling per-pixel data dependent branching. The biggest problem of course being loops with data-dependent conditional exits. Many branches can be handled with predicates, but dynamic loop conditionals can't be 'unrolled' as easily.

Handling this with the F-Buffer requires a little more sophistication since you have a different number of "passes" per pixel which can't be determined ahead of time, but only as a result of the last pass.

F-Buffer is a good way to remove register and instruction length limits for PS3.0 support, but we still need more native HW support for data dependent branching. Yes, it will kill performance, but, like the F-Buffer, hopefully 90% of shaders won't use any data dependent branching.

The goal here is to handle the special cases so that programmers don't get annoyed. Better that the shader runs (albeit non-optimally) on some HW than not run at all and generate a compile error.

Think of it as a "catch all" that steps in to allow degenerate shaders to run on all HW, regardless of how many instructions or registers the HW has. This would truly be great if all HW had it.

For Nvidia, it's unlikely they will implement it, since once you get into the thousands of instructions, the limits are effectively "infinite".

You are right and I made a mistake in a previous post. Branches out of a sub-pass are more tricky to handle. You could push an address into the F-buffer and resume on an address. Ideally, you want all branching and looping to stay in a sub-pass.

Data-dependent branching can be handled by pushing target addresses into the F-buffer. Data-dependent looping (if this even exists in current DX versions) can be handled by pushing the loop counter into the F-buffer. Again, the HW sequencer would have to explicitly support these kinds of features. These would perform very badly and it only buys you the ability to branch/loop over large numbers of instructions (probably not that useful). It seems that the compiler could break large branches and loops to make them fit in a sub-pass.
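The loop counter idea can be sketched as a toy model. This is not how any real sequencer works. The FIFO, the per-pass budget and all names below are hypothetical, and only the bookkeeping is modelled: a fragment that runs out of budget is pushed back with its counter and resumes in a later pass.

```python
from collections import deque

PASS_BUDGET = 4  # hypothetical: loop iterations that fit in one sub-pass

def run_with_f_buffer(trip_counts):
    """Run one data-dependent loop per fragment across as many passes as needed.

    trip_counts[i] is the number of iterations fragment i turns out to need.
    Returns (results, passes), where results[i] is the iteration count
    fragment i finished with.
    """
    fifo = deque((frag_id, 0) for frag_id in range(len(trip_counts)))
    results = [0] * len(trip_counts)
    passes = 0
    while fifo:
        passes += 1
        for _ in range(len(fifo)):          # drain what this pass was handed
            frag_id, counter = fifo.popleft()
            steps = 0
            while counter < trip_counts[frag_id] and steps < PASS_BUDGET:
                counter += 1                # stand-in for the loop body
                steps += 1
            if counter < trip_counts[frag_id]:
                fifo.append((frag_id, counter))  # spill the loop counter
            else:
                results[frag_id] = counter
    return results, passes

print(run_with_f_buffer([1, 4, 9, 0]))
```

The number of passes is set by the slowest fragment, and every pass re-reads and re-writes the spilled state, which is where the poor performance would come from.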

- SM

shaderman · Mar 8, 2003

we're really oversimplifying this -- oh well

Luminescent said:
A bit off topic, but just for clarification:

The R300 (and R350, presumably) holds 60 programmable floating point processors (fmad/frcp/flog/etc.). This is how the numbers add up:

.. lots of good stuff ...

Information resources may be found here:
http://www.beyond3d.com/forum/viewtopic.php?t=1902&highlight=128+bit
http://www.beyond3d.com/forum/viewtopic.php?t=3042&highlight=4+components
http://firingsquad.gamers.com/hardware/radeon_9700/default.asp

P4 = 2 FP ops (scalar) + 2 FP ops (SSE) ~ 4 FP ops per cycle @ 3000 MHz =~12 GF

So at $60 vs $600, the R300/R350 blows the competition (PIV) away. Are you ready ... for more!!

We also can't ignore all the other processing done by the GPU that must be explicitly done in a CPU, i.e. LOD, Texture Filtering, Alpha Blend, Texture Blend, Fog, ...

These are significant numbers of adders/multipliers (integer mainly). So our number for the GPU MIPS/FLOPS are probably low.

- SM

MfA · Mar 8, 2003

The size of adders gets lost in the margins. Instructions apart from multiply/adds could be implemented with "microcode" (if that adds too much complexity, just do the translation during loading in the driver) and maybe a couple of LUTs to speed up convergence ... it doesn't need to be fast. Some instructions already have a latency of more than 1 cycle, and can of course be stalled because of memory access, so I don't see why it would make scheduling much harder. But if it is a real problem, just add some NOPs after each double precision instruction.

arjan de lumens · Mar 8, 2003

Wide FP adders are about an order of magnitude more expensive than similar-width integer adders, mostly due to barrel shifters to normalize numbers before and after the actual addition. For DP RCP/RSQ, you can generally do first SP RCP/RSQ, then a couple of passes of Newton iterations. For exp/log/sin/cos/pow, you run into more trouble: for SP calculations, you can generally do LUTs and just interpolate between LUT entries, which is rather cheap; for DP, you pretty much need Taylor series (with, IIRC, ~10 terms for sufficient precision for DP), CORDIC or something similar, which tends to be slow and resource-intensive.
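The "SP RCP, then Newton iterations" recipe can be sketched as follows. The fp32 seed is produced by rounding through the `struct` module as a stand-in for a hardware single precision reciprocal, and the function names are assumptions. Each Newton step roughly squares the relative error, so two steps take a 24-bit seed past 53 bits.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def rcp_dp(a: float, iterations: int = 2) -> float:
    """Double precision 1/a from a single precision seed plus Newton steps."""
    x = to_f32(1.0 / to_f32(a))      # stand-in for the hardware SP RCP
    for _ in range(iterations):
        x = x * (2.0 - a * x)        # Newton-Raphson step for f(x) = 1/x - a
    return x

a = 3.0
for n in range(3):
    rel_err = abs(rcp_dp(a, n) - 1.0 / a) * a
    print(n, "iterations, relative error", rel_err)
```

Each step costs two multiplies and a subtract, all of which must themselves run at double precision, so the recipe only helps once DP multiply and add exist in some form.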

MfA · Mar 8, 2003

Given that we already accepted slow down you could combine the existing barrel shifters and do the shift in 2/3 passes (not unlike combining the existing multipliers into an array multiplier).

As for the intrinsic functions, as I said before ... I don't think it really matters how slow they are.

arjan de lumens · Mar 8, 2003

Hmmm ... combining the adders, barrel shifters and other support circuits (such as rounding circuitry, leading zeros detectors, Inf/NaN/Zero handling circuits, etc) of two SP adders in order to operate as 1 DP adder sounds ... incredibly hairy, but when I think of it, actually doable (but requiring much more design effort than, say, adding glue to 4 SP multipliers to get 1 DP multiplier).

For the more complicated operations, yes, you can handle them with microcode or expanded macro instructions if a ~10x slowdown is acceptable.

shaderman · Mar 9, 2003

arjan de lumens said:
Wide FP adders are about an order of magnitude more expensive than similar-width integer adders, mostly due to barrel shifters to normalize numbers before and after the actual addition. For DP RCP/RSQ, you can generally do first SP RCP/RSQ, then a couple of passes of Newton iterations. For exp/log/sin/cos/pow, you run into more trouble: for SP calculations, you can generally do LUTs and just interpolate between LUT entries, which is rather cheap; for DP, you pretty much need Taylor series (with, IIRC, ~10 terms for sufficient precision for DP), CORDIC or something similar, which tends to be slow and resource-intensive.

all i said was that you have to take into account all the processing that's going on in a renderer. modern out-of-order, superscalar CPUs (P6 and beyond) have lots of execution resources -- multiple ALUs, AGUs, FPUs, vector FPUs (SSE). and, in comparing one to an r300 or FX class processor, you have to take into account all the resources that the GPU offers, although the GPU is pipelined differently than a P4 class processor.

i would assume that an r300 class processor still implements some form of "out-of-order" execution, otherwise I would not be able to "hide the latency of texture fetches" (as mentioned by people here in the know). i would guess that the r300 can handle multiple shader threads. that's why DX has all that ALU and fetch clausing baloney.

all those miscellaneous functions (rsq, sqrt, rcp, etc) are probably handled with point-slope LUTs and lerps, and they still result in 1 FLOP per cycle. but you make a good point (in a roundabout way) -- CPUs generally don't handle these FLOPs with LUTs and lerps (too expensive), so they resort to short Taylor expansions or NR iterations which are slower (and < 1 op per cycle) but not resource intensive (LUTs and lerps are HW resource intensive).
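A point-slope table with a lerp can be sketched in a few lines. The 64-entry table size, the choice of reciprocal as the function and the names are all assumptions for illustration, not a description of any shipping GPU. The input is taken to be a mantissa already reduced to [1, 2).

```python
TABLE_BITS = 6                      # hypothetical table size: 64 segments
N = 1 << TABLE_BITS

# One (value, slope) pair per segment of [1, 2).
TABLE = []
for i in range(N):
    x0 = 1.0 + i / N
    x1 = 1.0 + (i + 1) / N
    TABLE.append((1.0 / x0, (1.0 / x1 - 1.0 / x0) * N))

def rcp_lut(m: float) -> float:
    """Approximate 1/m for m in [1, 2) with one table lookup and one multiply-add."""
    i = int((m - 1.0) * N)                   # top mantissa bits select the segment
    base, slope = TABLE[i]
    return base + slope * (m - 1.0 - i / N)  # lerp from the segment start

worst = max(abs(rcp_lut(1.0 + k / 1000) * (1.0 + k / 1000) - 1.0)
            for k in range(1000))
print("worst relative error over the samples:", worst)
```

Halving the segment width cuts the interpolation error by roughly a factor of four, so table size trades directly against precision, which is why the approach gets expensive well before double precision.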

- SM

psurge · Mar 9, 2003

Err... why go all the way to double precision? In PRMan they don't have the option of designing the hardware around their precision needs, but for GPUs you do.

I don't know how much you need where, but i do remember complaints about 24 bits not being enough for the z-buffer. So how about doing everything at say 48bit fp? That way you can use those fp pipes for 32 bit integer math, (full precision for z-writes in pixel shader), you get increased geometry precision, etc...
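The link between float width and integer math is the significand: a format can hold every integer exactly only up to 2 raised to its significand width. A minimal sketch for fp32 follows. The "48bit fp" layout is not specified in the post, so the premise that it would carry at least 32 significand bits is an assumption, and the function names are made up.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def exact_integer_limit(significand_bits: int) -> int:
    """Largest n such that every integer in [0, n] is exactly representable."""
    return 1 << significand_bits

limit_f32 = exact_integer_limit(24)   # fp32: 23 stored bits plus the hidden bit
print(limit_f32)
print(to_f32(float(limit_f32 + 1)) == float(limit_f32))  # 2**24 + 1 is not representable
```

A format with a significand of 32 bits or more would push that limit to at least 2**32, which is what using the fp pipes for 32-bit integer math would require.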

just my 2c, serge

moichi · Mar 9, 2003

F-buffer questions

sireric, I have some questions about the F-buffer of the RADEON 9800.

1) Can we (application programmers) create multiple F-buffers?

2) Can we bind an F-buffer to each texture stage?

3) Can we fetch from the F-buffer with the texld shader instruction any number of times?

4) Can we bind an F-buffer to each render target?

5) Can we write to the F-buffer with the mov shader instruction any number of times?

6) I think the F-buffer must generate fragments for subsequent passes.
At the least, each fragment needs a screen coordinate (x,y) and a depth value.
We would prepare the F-buffer by writing this information in the fragment shader,
and then activate the F-buffer as the fragment shader input of the subsequent pass.
Is this process correct?

7) Can we support multi-pass fragment shaders (through the F-buffer) together with stencil values?

8) Can we use the F-buffer directly as a vertex buffer?

Sorry for so many questions.

Xmas · Mar 10, 2003

As I understand it, you cannot explicitly use the F-buffer as some kind of per-pixel stack (or for another use).

It is simply there to support 'unlimited length' fragment shaders fully transparent to the application. You don't even notice there is an F-buffer. The driver manages all the F-buffer stuff, you have no access to it.
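From the application's point of view, the driver's job can be pictured as cutting one long shader into sub-passes that each fit the hardware limit. The sketch below shows only that cut. The 160-instruction figure is the pass size sireric is quoted with elsewhere in this thread, the 400-instruction shader is hypothetical, and the harder part (saving live temporaries to the F-buffer between passes) is not modelled.

```python
def split_into_passes(instructions, max_per_pass=160):
    """Cut a long instruction list into consecutive sub-passes."""
    return [instructions[i:i + max_per_pass]
            for i in range(0, len(instructions), max_per_pass)]

shader = [f"op{i}" for i in range(400)]   # hypothetical 400-instruction shader
passes = split_into_passes(shader)
print([len(p) for p in passes])
```

The application would submit the 400-instruction shader once and never see the pass boundaries.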

Luminescent · Mar 10, 2003

What if ATI hands developers who require it the microcode necessary to expose custom functionality of the F-buffer (if there is, indeed, no way of hand coding for it)?

moichi · Mar 10, 2003

general use of F-buffer

sireric said at this thread:

>Writes from the fragment shader to the F-Buffer are similar to other outputs,
>and have no effect on the fragment execution.
>F-Buffer reads are similar to texture reads and we already,
>by architecture, hide that latency from the shader execution.

So I thought we could fetch from the F-buffer as a texture and write to it as a render target.

He also said:
> That means that the F-Buffer reads/writes only occur a few times every 160 instruction pass (which is at most 64 cycles).

I interpreted "few times" to mean that we can read and write the F-buffer any number of times.
The F-buffer is a FIFO, so it is natural to read and write it any number of times.

He also said:

> For now, the plan is to support the F-Buffer in all products,
> in the GL2 and possibly as an extension (i.e. uber-buffers) in GL1.x.

"uber-buffers"(super buffer) was mentioned by Rob Mace(ATI) at OpenGL ARB meeting December.
http://www.opengl.org/developers/about/arb/notes/meeting_note_2002-12-10.html

I don't have the super buffer white paper,
but the meeting notes describe a super buffer as a "repurposable memory buffer".

ARB meeting notes said:
> Formed working group to develop an extension for memory buffers that are repurposable within the graphics pipeline,
> starting with the "uber buffers" white paper and earlier 3Dlabs work in this area.

I interpreted this to mean that a super buffer can be repurposed among pixel buffer/vertex array/texture/etc.
If we can write to the F-buffer any number of times and repurpose it as a vertex array,
we can easily generate triangles in the fragment shader.

Mephisto · Mar 24, 2003

Luminescent said:
The R300 (and R350, presumably) holds 60 programmable floating point processors (fmad/frcp/flog/etc.). This is how the numbers add up:

Dumb question, but what's an "FMAD"? What does it stand for? Floating Point Multiply and Divide unit?

Simon F · Mar 24, 2003

Mephisto said:
Luminescent said:

The R300 (and R350, presumably) holds 60 programmable floating point processors (fmad/frcp/flog/etc.). This is how the numbers add up:

Dumb question, but what's an "FMAD"? What does it stand for? Floating Point Multiply and Divide unit?

MAD usually means "Multiply Add" and appears as an opcode in some CPUs.

In terms of shaders the "frcp" (reciprocal) would be the equivalent of a divide.
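As a small illustration of both points, here is a sketch with made-up helper names that mirror the opcodes. `mad` computes a multiply followed by an add, and a divide is expressed as a multiply by the reciprocal. Real hardware may round the intermediate steps differently, so this shows the structure only.

```python
def mad(a: float, b: float, c: float) -> float:
    """Multiply-add: a * b + c."""
    return a * b + c

def rcp(x: float) -> float:
    """Reciprocal: 1 / x."""
    return 1.0 / x

def divide(a: float, b: float) -> float:
    """Division expressed as a multiply by the reciprocal."""
    return a * rcp(b)

print(mad(2.0, 3.0, 4.0))
print(divide(6.0, 4.0))
```

A divide therefore costs one reciprocal plus one multiply, and the multiply can often be folded into a following MAD.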
