Can F-buffer mask the importance of single-pass abilities?

Discussion in 'Architecture and Products' started by Luminescent, Mar 6, 2003.

  1. dominikbehr

    Newcomer

    Joined:
    Apr 19, 2002
    Messages:
    72
    Likes Received:
    0
    Location:
    Sunnyvale, CA
    Wow! You have amazing Google skills :) And straight from the horse's mouth. I agree that prman is high-end rendering. If 32-bit fp is good enough for prman, especially for geometry processing, that's really cool.

    but,
    I think this may be quite important here. He admits that there are places where doubles are vital. Unfortunately we do not know what the consequences of using floats instead would be. Maybe what he considers vital we wouldn't even notice. Or maybe we would.

    LeStoffer:
    It seems Microsoft designed DX9 to last a while, and the shaders 3.0 spec tells us how hardware will evolve over the next few years. I consider it an incremental change over shaders 2.0. We just experienced a big shift from the fixed to the programmable pipeline, and I think we have entered a new era in hardware-accelerated computer graphics. Judging by how long we used the fixed pipeline, the current era will last a very long time. Actually, there will be little incentive to change, because there is nothing new on the horizon and everything seems to be possible now.

    Reverend:
    Your comments on precision issues are quite valid. Personally I don't find 32-bit fp good enough for everything either.

    See the render-to-vertex-array technique demonstrated lately at GDC using a Radeon 9700 and the uberbuffers extension.

    ------------
    In summary:
    - Computers work with finite precision numbers.
    - It is the programmer's job to know the architecture and write code that doesn't run into problems.
    - Certain classes of calculations require some minimal precision.

    I like geometry done in doubles or more (I think all the doubles input in OpenGL is there because of precision issues in flight/space sims), but fp32 seems to be good enough for most cases. fp24 is good for displacement mapping too. Even fp16 could be used for storing source object vertices, normals, normal maps, displacement mapping. But usually you want your matrices in fp32 or fp64.
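
To make the flight/space sim concern concrete, here is a minimal Python sketch. It is an illustration added alongside the post, not something the poster wrote. It rounds a double to fp32 through `struct` and shows that about 10,000 km from the origin, with positions in meters, adjacent fp32 values are a whole meter apart. The helper names `to_fp32` and `fp32_spacing` are made up for the example.

```python
import math
import struct

def to_fp32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def fp32_spacing(x):
    """Gap between adjacent fp32 values near x (normal, nonzero x)."""
    exponent = math.frexp(abs(x))[1] - 1   # abs(x) = m * 2**exponent, 1 <= m < 2
    return 2.0 ** (exponent - 23)          # fp32 stores 23 fraction bits

position = 1.0e7   # meters from the origin
step = 0.25        # a 25 cm move

print(fp32_spacing(position))                 # size of one fp32 step out here
print(to_fp32(position + step) == position)   # does the move vanish in fp32?
print((position + step) == position)          # does it vanish in fp64?
```

Keeping object-local coordinates small and applying the large offset in the matrices is the usual way around this, which fits the remark about wanting matrices in fp32 or fp64.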

    For color/lighting calculations, use 8-bit int, fp16, fp24, or maybe even fp32 in extreme cases, depending on what you are doing.

    Texture addressing depends on lots of factors, usually on the size of the texture and the number of repetitions. I liked the 3dlabs demo showing the precision of the texture interpolators on their Wildcat line. But the message I got was really: if you have a dumb programmer|artist, they can screw up even a simple scene with a conveyor belt and ducklings.
     
  2. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,808
    Likes Received:
    473
    Would it take GPUs much more hardware to support double precision floating point anyway? You'd think supporting it at 1/4th or 1/9th the throughput of single precision wouldn't add too much (you need 26-bit multipliers for 1/4th, which would mean some wasted bits when using single precision; you can reuse the multipliers from the single precision units as-is if you settle for 1/9th). It doesn't need to be fast, it just needs to be able to accumulate some high precision values occasionally.
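
A rough sketch of where the 1/4 and 1/9 figures come from, using Python integers to stand in for significands. The assumptions are mine, not the poster's: 24-bit limbs match the fp32 significand, and 27-bit limbs are used for the two-limb case so that the full 53-bit fp64 significand (hidden bit included) fits in two pieces, whereas the 26-bit figure in the post covers the 52 stored fraction bits. Exponent handling, rounding, and normalization are ignored. `split_limbs` and `limb_multiply` are hypothetical names.

```python
def split_limbs(value, limb_bits, total_bits=53):
    """Split an integer significand into little-endian limbs of limb_bits each."""
    count = -(-total_bits // limb_bits)   # ceiling division
    mask = (1 << limb_bits) - 1
    return [(value >> (i * limb_bits)) & mask for i in range(count)]

def limb_multiply(a, b, limb_bits):
    """Multiply two 53-bit significands using only limb_bits-wide multiplies.

    Returns (product, number_of_narrow_multiplies).
    """
    a_limbs = split_limbs(a, limb_bits)
    b_limbs = split_limbs(b, limb_bits)
    product = 0
    multiplies = 0
    for i, x in enumerate(a_limbs):
        for j, y in enumerate(b_limbs):
            # Each partial product is shifted into place and accumulated.
            product += (x * y) << ((i + j) * limb_bits)
            multiplies += 1
    return product, multiplies

a = (1 << 52) + 0x123456789ABCD   # two arbitrary 53-bit significands
b = (1 << 53) - 1

print(limb_multiply(a, b, 27)[1])   # narrow multiplies needed with 27-bit limbs
print(limb_multiply(a, b, 24)[1])   # narrow multiplies needed with 24-bit limbs
```

Two limbs per operand give four partial products and three limbs give nine, so one narrow multiplier issuing one partial product per cycle lands at roughly 1/4 or 1/9 of single precision throughput.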
     
  3. demalion

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,024
    Likes Received:
    1
    Location:
    CT
    I'm assuming the data for such can't be calculated relative to and separately from a geometry face calculation? Also, I suppose I'm not certain about the range of applicability for texture address calculations in the pixel shader.

    In any case, for subtraction/addition, how many ops would it take to carry the "lost precision" data separately, keeping in mind the modifiers the R300 supports?
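
One known answer to the "how many ops" question for addition is Knuth's TwoSum, which recovers the rounding error of a sum in six add/subtract operations. The sketch below runs it in Python's fp64 purely for convenience. The sequence relies on IEEE-style round-to-nearest arithmetic, and whether the R300's fp24 pipeline rounds in a way that keeps it exact is not something this thread establishes.

```python
from fractions import Fraction

def two_sum(a, b):
    """Knuth's TwoSum: returns (s, e) with s = fl(a + b) and s + e == a + b exactly."""
    s = a + b                  # op 1: the rounded sum
    b_virtual = s - a          # op 2: the part of b that made it into s
    a_virtual = s - b_virtual  # op 3: the part of a that made it into s
    b_round = b - b_virtual    # op 4: what was lost from b
    a_round = a - a_virtual    # op 5: what was lost from a
    return s, a_round + b_round   # op 6: total lost precision

s, e = two_sum(1.0, 1e-20)
print(s, e)   # the small addend is dropped from s but carried in e

# Exact check with rationals: nothing is lost across the pair (s, e).
print(Fraction(s) + Fraction(e) == Fraction(1.0) + Fraction(1e-20))
```

Subtraction is the same routine called as `two_sum(a, -b)`, so a free negate modifier on the source operand would keep the count at six.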

    What would be really interesting is some input from some ATI people on their thoughts for these types of calculations (I hate how the search doesn't allow less than 3 digit numbers for searching, since I'm pretty sure they have participated at least in part in prior discussions... ).

    Oh, and I guess I need to go figure out render to vertex array (where the need for full 32-bit FP precision seems more obvious to me). :-?
     
  4. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    While you can do double-precision multiplies by combining a number of single-precision multipliers, you will need a substantial additional amount of hardware for DP adders (which cannot be made by combining SP adders). Also, for operations like reciprocal, rsq, pow, sin, cos, etc, there are reasonably fast hardware implementations available for single-precision operation that become much slower and more expensive with double-precision operation (by factors I would estimate to be between 4 and 10, depending on operation and how you balance performance against circuit size). Also, if you intend to split double-precision instructions over multiple cycles, you complicate the control logic that issues the shader instructions a great deal.
     
  5. Reverend

    Banned

    Joined:
    Jan 31, 2002
    Messages:
    3,266
    Likes Received:
    24
    What a friggin' hard job that is when not all of an architecture is revealed to them from the onset. I love being excited by new architectures but I hate it when I have to waste time sending emails back and forth trying to find out why something is so... only to be told "Oh, that's the limitation... didn't we tell you about this before?"

    PS. Sorry for an unconstructive post... lots of beer in the system from a dinner & dance with a new bloody sexy work colleague.
     
  6. BRiT

    BRiT (╯°□°)╯
    Moderator Legend Alpha Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    12,514
    Likes Received:
    8,718
    Location:
    Cleveland
    Rev, I didn't know you worked with Kristof. :lol:
     
  7. shaderman

    Newcomer

    Joined:
    Jan 3, 2003
    Messages:
    19
    Likes Received:
    0
    loops/branches

    You are right and I made a mistake in a previous post. Branches out of a sub-pass are more tricky to handle. You could push an address into the F-buffer and resume at that address. Ideally, you want all branching and looping to stay in a sub-pass.

    Data-dependent branching can be handled by pushing target addresses into the F-buffer. Data-dependent looping (if this even exists in current DX versions) can be handled by pushing the loop counter into the F-buffer. Again, the HW sequencer would have to explicitly support these kinds of features. These would perform very badly and it only buys you the ability to branch/loop over large numbers of instructions (probably not that useful). It seems that the compiler could break large branches and loops to make them fit in a sub-pass.
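
The idea of pushing a resume address into the F-buffer can be illustrated with a toy model. Everything here is an assumption made for illustration: the F-buffer is modeled as a plain FIFO of (resume address, live registers) entries, the per-pass instruction budget `PASS_LIMIT` is arbitrary, and instructions are Python callables. It says nothing about how ATI's hardware sequencer actually works.

```python
from collections import deque

PASS_LIMIT = 4   # hypothetical per-pass instruction budget

def run_multipass(program, fragments, pass_limit=PASS_LIMIT):
    """Run `program` over all fragments, suspending each one at the pass limit.

    program: list of callables f(regs) that return a branch target, or None
    to fall through to the next instruction.
    Returns (finished register sets, number of passes used).
    """
    # Each FIFO entry holds the resume address and the fragment's live registers.
    fbuffer = deque((0, dict(regs)) for regs in fragments)
    finished = []
    passes = 0
    while fbuffer:
        passes += 1
        next_fbuffer = deque()
        while fbuffer:
            pc, regs = fbuffer.popleft()
            executed = 0
            while pc < len(program) and executed < pass_limit:
                target = program[pc](regs)
                pc = pc + 1 if target is None else target
                executed += 1
            if pc < len(program):
                next_fbuffer.append((pc, regs))   # push resume address and state
            else:
                finished.append(regs)
        fbuffer = next_fbuffer
    return finished, passes

# A data-dependent loop: acc = x added n times (the loop body runs at least once).
def init(r): r["acc"] = 0
def add(r): r["acc"] += r["x"]
def dec(r): r["n"] -= 1
def loop(r): return 1 if r["n"] > 0 else None   # branch back to `add`

program = [init, add, dec, loop]
fragments = [{"id": 0, "x": 2, "n": 3}, {"id": 1, "x": 5, "n": 1}]
finished, passes = run_multipass(program, fragments)
print(passes, sorted((r["id"], r["acc"]) for r in finished))
```

Fragments with different trip counts finish in different passes, so results come out in a different order than they went in. A real design that needs in-order output would have to deal with that, which is one more reason to keep loops inside a sub-pass where possible.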

    - SM
     
  8. shaderman

    Newcomer

    Joined:
    Jan 3, 2003
    Messages:
    19
    Likes Received:
    0
    we're really oversimplifying this -- oh well

    We also can't ignore all the other processing done by the GPU that must be explicitly done in a CPU, i.e. LOD, Texture Filtering, Alpha Blend, Texture Blend, Fog, ...

    These are significant numbers of adders/multipliers (integer mainly). So our numbers for the GPU MIPS/FLOPS are probably low.

    - SM
     
  9. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,808
    Likes Received:
    473
    The size of adders gets lost in the margins, instructions apart from multiply/adds could be implemented with "microcode" (if that adds too much complexity just do the translation during loading in the driver) and maybe a couple of LUTs to speed up convergence ... it doesn't need to be fast. Some instructions already have a latency of more than 1 cycle, and can of course be stalled because of memory access, so I don't see why it would make scheduling much harder. But if it is a real problem just add some NOPs after each double precision instruction.
     
  10. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    Wide FP adders are about an order of magnitude more expensive than similar-width integer adders, mostly due to barrel shifters to normalize numbers before and after the actual addition. For DP RCP/RSQ, you can generally do first SP RCP/RSQ, then a couple of passes of Newton iterations. For exp/log/sin/cos/pow, you run into more trouble: for SP calculations, you can generally do LUTs and just interpolate between LUT entries, which is rather cheap; for DP, you pretty much need Taylor series (with, IIRC, ~10 terms for sufficient precision for DP), CORDIC or something similar, which tends to be slow and resource-intensive.
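
The "SP RCP, then a couple of Newton iterations" recipe can be sketched in a few lines. In this hedged Python version the single-precision reciprocal is faked by rounding through fp32 with `struct`, standing in for whatever a hardware RCP unit would return, and the refinement runs in fp64. The names `to_fp32` and `rcp_refined` are illustrative only.

```python
import struct

def to_fp32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def rcp_refined(a, iterations=2):
    """Reciprocal of a: fp32-accurate seed, then Newton steps in fp64."""
    x = to_fp32(1.0 / to_fp32(a))   # stand-in for a hardware SP RCP result
    for _ in range(iterations):
        x = x * (2.0 - a * x)       # Newton step for f(x) = 1/x - a
    return x

for n in range(3):
    # Residual |a * x - 1| after n refinement steps.
    print(n, abs(3.0 * rcp_refined(3.0, n) - 1.0))
```

Each step roughly squares the relative error, so a seed good to about 24 bits reaches fp64 accuracy in two steps. The RSQ counterpart uses the step `x = x * (1.5 - 0.5 * a * x * x)`.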
     
  11. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,808
    Likes Received:
    473
    Given that we already accepted a slowdown, you could combine the existing barrel shifters and do the shift in 2 or 3 passes (not unlike combining the existing multipliers into an array multiplier).

    As for the intrinsic functions, as I said before ... I don't think it really matters how slow they are.
     
  12. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    Hmmm ... combining the adders, barrel shifters and other support circuits (such as rounding circuitry, leading zeros detectors, Inf/NaN/Zero handling circuits, etc) of two SP adders in order to operate as 1 DP adder sounds ... incredibly hairy, but when I think of it, actually doable (but requiring much more design effort than, say, adding glue to 4 SP multipliers to get 1 DP multiplier).

    For the more complicated operations, yes, you can handle them with microcode or expanded macro instructions if a ~10x slowdown is acceptable.
     
  13. shaderman

    Newcomer

    Joined:
    Jan 3, 2003
    Messages:
    19
    Likes Received:
    0
    all i said was that you have to take into account all the processing that's going on in a renderer. modern out-of-order, superscalar CPUs (P6 and beyond) have lots of execution resources -- multiple ALUs, AGUs, FPUs, vector FPUs (SSE). and, in comparing one to an r300 or FX class processor, you have to take into account all the resources that the GPU offers, although the GPU is pipelined differently than a P4 class processor.

    i would assume that an r300 class processor still implements some form of "out-of-order" execution, otherwise I would not be able to "hide the latency of texture fetches" (as mentioned by people here in the know). i would guess that the r300 can handle multiple shader threads. that's why DX has all that ALU and fetch clausing baloney.

    all those miscellaneous functions (rsq, sqrt, rcp, etc) are probably handled with point-slope LUTs and lerps, and they still result in 1 FLOP per cycle. but you make a good point (in a roundabout way) -- CPUs generally don't handle these FLOPs with LUTs and lerps (too expensive), so they resort to short Taylor expansions or NR iterations which are slower (and < 1 op per cycle) but not resource intensive (LUTs and lerps are HW resource intensive).
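
A table-plus-lerp evaluator is easy to sketch. The Python below approximates rsq on [1, 4), assuming the exponent has already been stripped off by range reduction, with a 256-entry table that is an arbitrary choice rather than a figure from any real GPU. `rsq_lut` and the constants are hypothetical names.

```python
import math

TABLE_BITS = 8
N = 1 << TABLE_BITS            # 256 intervals (assumed size)
LO, HI = 1.0, 4.0              # reduced argument range
STEP = (HI - LO) / N
TABLE = [1.0 / math.sqrt(LO + i * STEP) for i in range(N + 1)]

def rsq_lut(x):
    """Approximate 1/sqrt(x) for x in [1, 4) by table lookup plus a lerp."""
    t = (x - LO) / STEP
    i = min(int(t), N - 1)     # table index, clamped to the last interval
    frac = t - i               # position inside the interval
    return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])

samples = [LO + (HI - LO) * k / 1000 for k in range(1000)]
worst = max(abs(rsq_lut(x) * math.sqrt(x) - 1.0) for x in samples)
print(worst)   # worst relative error seen over the samples
```

The lookup costs one subtract, one multiply and one add per result, which is why it can sustain one result per cycle in hardware, but the table itself is the expensive part. Halving the error bound of a plain lerp takes roughly 1.4 times as many entries, so pushing this toward double precision gets impractical quickly.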

    - SM
     
  14. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    Err... why go all the way to double precision? In PRMan they don't have the option of designing the hardware around their precision needs, but for GPUs you do.

    I don't know how much you need where, but i do remember complaints about 24 bits not being enough for the z-buffer. So how about doing everything at say 48bit fp? That way you can use those fp pipes for 32 bit integer math, (full precision for z-writes in pixel shader), you get increased geometry precision, etc...

    just my 2c, serge
     
  15. moichi

    Newcomer

    Joined:
    Jul 22, 2002
    Messages:
    36
    Likes Received:
    0
    Location:
    Japan
    F-buffer questions

    sireric, I have some questions about the F-buffer of the RADEON 9800.

    1) Can we (application programmers) create multiple F-buffers?

    2) Can we set an F-buffer for each texture stage?

    3) Can we fetch from the F-buffer with the texld shader instruction any number of times?

    4) Can we set an F-buffer for each render target?

    5) Can we write to the F-buffer with the mov shader instruction any number of times?

    6) I think the F-buffer must generate fragments for subsequent passes.
    At the least, each fragment needs a screen coordinate (x,y) and a depth value.
    We prepare the F-buffer by writing this information in the fragment shader.
    Then we activate the F-buffer as the fragment shader input of the subsequent pass.
    Is this process correct?

    7) Can we support multi-pass fragment shaders (through the F-buffer) with stencil values?

    8) Can we use the F-buffer as a vertex buffer immediately?

    Sorry for my many questions.
     
  16. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,298
    Likes Received:
    137
    Location:
    On the path to wisdom
    As I understand it, you cannot explicitly use the F-buffer as some kind of per-pixel stack (or for another use).

    It is simply there to support 'unlimited length' fragment shaders fully transparent to the application. You don't even notice there is an F-buffer. The driver manages all the F-buffer stuff, you have no access to it.
     
  17. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    What if ATI hands developers who require it the microcode necessary to expose custom functionality of the F-buffer (if there is, indeed, no way of hand coding for it)?
     
  18. moichi

    Newcomer

    Joined:
    Jul 22, 2002
    Messages:
    36
    Likes Received:
    0
    Location:
    Japan
    general use of F-buffer

    sireric said in this thread:

    >Writes from the fragment shader to the F-Buffer are similar to other outputs,
    >and have no effect on the fragment execution.
    >F-Buffer reads are similar to texture reads and we already,
    >by architecture, hide that latency from the shader execution.

    So I thought we could fetch from the F-buffer as a texture and write to the F-buffer as a render target.


    He also said:
    > That means that the F-Buffer reads/writes only occur a few times every 160 instruction pass (which is at most 64 cycles).

    I interpreted "few times" to mean that we can do F-buffer reads/writes any number of times.
    The F-buffer is a FIFO, so it would be natural to read/write it any number of times.


    He also said:

    > For now, the plan is to support the F-Buffer in all products,
    > in the GL2 and possibly as an extension (i.e. uber-buffers) in GL1.x.

    "uber-buffers" (super buffers) were mentioned by Rob Mace (ATI) at the December OpenGL ARB meeting.
    http://www.opengl.org/developers/about/arb/notes/meeting_note_2002-12-10.html

    I don't have the super buffer white paper.
    But the meeting notes said a super buffer is a "repurposable memory buffer".


    ARB meeting notes said:
    > Formed working group to develop an extension for memory buffers that are repurposable within the graphics pipeline,
    > starting with the "uber buffers" white paper and earlier 3Dlabs work in this area.

    I interpreted the super buffer as being repurposable among pixel buffer/vertex array/texture/etc.
    If we can write to the F-buffer any number of times and repurpose the F-buffer as a vertex array,
    we can easily generate triangles in the fragment shader.
     
  19. Mephisto

    Newcomer

    Joined:
    Feb 7, 2002
    Messages:
    200
    Likes Received:
    0
    Dumb question, but what's an "FMAD"? What does it stand for? Floating Point Multiply and Divide unit?
     
  20. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,560
    Likes Received:
    157
    Location:
    In the Island of Sodor, where the steam trains lie
    MAD usually means "Multiply Add" and appears as an opcode in some CPUs.

    In terms of shaders the "frcp" (reciprocal) would be the equivalent of a divide.
     