Question: Back-Face Culling: How do I get rid of the if-statements?

Discussion in 'CellPerformance@B3D' started by OzzyBC42, Aug 13, 2007.

  1. OzzyBC42

    Newcomer

    Joined:
    Jul 8, 2007
    Messages:
    25
    Likes Received:
    0
    Hi guys,

    I'm currently writing a realtime renderer on the PS3 and it's pretty close to reaching a milestone, but right now I need some suggestions on how to implement the back-face culling.

    I would like to process and draw each triangle (actually up to 42 at a time, due to the 16KB DMA limit) via texture caching on the 6 SPEs. So far, the back-face culling code is written and gives me something like this (simplified for the purpose):

    process_42triangles(...)
    {
        send_vertices_through_world_view_proj_transform;
        ...
        float bf[42];

        get_face_normal for triangle i;
        do_backface_calc_on_i;
        store_result in bf[i];
        ...
        homogeneous_divide;
        viewport_transform;
        ...
    }

    That gives me one float for each of the 42 triangles.
    Now I would like to draw each triangle i only if bf[i] >= 0.0f.

    I have unrolled and eliminated the for loop for the drawing part, but I still need to do

    if (bf[i] >= 0.0f) draw_triangle(i);

    up to 42 times.

    I would like to get rid of the if statements without simply computing the fall-through case (the body of the if) anyway, because actually making the draw call would certainly hurt more than the branch and would defeat the purpose of doing back-face culling at all.

    My initial idea was to create an array of 42 draw-function pointers right after calculating the back-facing triangles, placing a dummy function in it for each non-visible triangle. The array would be filled well before the program loops through it, but I don't know whether that isn't just pushing the actual problem around.
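
    Roughly what I have in mind, just as a sketch (dummy_draw and draw_table are placeholder names, nothing I've actually written yet):

    typedef void (*draw_fn)(int tri);

    static void dummy_draw(int tri) { (void)tri; }   /* no-op for back-facing triangles */

    draw_fn draw_table[42];

    /* filled right after the back-face pass: */
    for (int i = 0; i < 42; ++i)
        draw_table[i] = (bf[i] >= 0.0f) ? draw_triangle : dummy_draw;

    /* the draw loop itself is branch-free, but every triangle still pays an indirect call: */
    for (int i = 0; i < 42; ++i)
        draw_table[i](i);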
     
  2. ebola

    Newcomer

    Joined:
    Dec 13, 2006
    Messages:
    99
    Likes Received:
    0
    Am I to assume you're talking about a software renderer, since you mention texture caching on the SPEs?


    One useful pattern for avoiding such ifs is to conditionally store the indices of items that pass a test. Not sure it matters much in this case compared to the cost of the triangle draw call, but anyway, here's one possible technique:

    int renderIndices[MAX_TRIS + 1];
    int *pIndexOut = renderIndices;

    // CULL - write out the indices of front-facing tris to 'renderIndices'
    for (each triangle i) {
        get_face_normal for triangle(i);
        bool bTestResult = do_backface_calc_on(i);
        *pIndexOut = i;                     // always writes out one more index than is needed
        pIndexOut += bTestResult ? 1 : 0;   // or better, use an int 1 or 0 for the result in the first place
    }

    // RENDER - each index in 'renderIndices'
    for (int *pIndex = renderIndices; pIndex < pIndexOut; pIndex++) {
        DrawTriangle(*pIndex);
    }

    I'd wrap this mess in a handy macro ..
    Maybe you want to unroll the drawTriangle tri-setup code, which would now be possible..?

    I don't think function pointers would work well - pretty sure they'll be worse than branches :) - the main point of eliminating ifs is to let the compiler interleave instructions around them.

    Interesting to see what other suggestions you get. Bitfields are usually a good way to do this sort of thing..
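
    For example, a bitmask variant might look something like this (completely untested, just to illustrate the idea; bf and draw_triangle as in the first post, and the bit-scan is a GCC builtin - swap in whatever you have):

    unsigned long long mask = 0;
    for (int i = 0; i < 42; ++i)
        mask |= (unsigned long long)(bf[i] >= 0.0f) << i;   // bit i set = triangle i is front-facing

    while (mask) {
        int i = __builtin_ctzll(mask);   // index of the lowest set bit
        draw_triangle(i);
        mask &= mask - 1;                // clear that bit
    }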
     
    #2 ebola, Aug 13, 2007
    Last edited by a moderator: Aug 14, 2007
  3. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    605
    Likes Received:
    56
    I agree. But I don't see why not to use indexed primitives from the beginning.
     
  4. OzzyBC42

    Newcomer

    Joined:
    Jul 8, 2007
    Messages:
    25
    Likes Received:
    0
    I decided against it (for now), because I didn't want to keep track of which indices have been processed and which haven't, but I'll come back to it once everything else is set up. Thanks for the advice, though.

    Second, I have pretty much unrolled and inlined everything I could get my hands on, and now it has come back to bite me. Even the process_42triangles function alone is now somewhere near 72KB as a compiled binary, and I haven't even set up a cache or a draw function yet! It's definitely going to end up bigger than 256KB (the SPU local store size) this way.

    So, I see three options for this now:

    1. go back and don't be so aggressive on the loop unrolling
    2. DMA the process_XXvertices functions in as needed, and consider doing the same with the draw functions
    3. do both

    So far I have this:

    PPU Part:

    1. Parse the Collada document for the first mesh and read in the vertex data of the first triangle array using the Collada_DOM API
    2. Read the first image of the Collada document (external 24-bit uncompressed TGA only) and put it in main memory
    3. Get the framebuffer information
    4. Open the joypad device
    5. Send the offsets from steps 1 and 3 to the SPU

    SPU Part:
    1. Double-buffer DMA the vertices in while setting up the transformation matrices (rough sketch of the double-buffering below)
    2. Shuffle the vertices to SoA form and process them while there are any
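
    Roughly what I mean by the double-buffering in step 1 - just a sketch, assuming the usual spu_mfcio.h calls; CHUNK_BYTES, stream_vertices and process_chunk are made-up names, not code from the project:

    #include <spu_mfcio.h>
    #include <stdint.h>

    enum { CHUNK_BYTES = 16 * 1024 };   /* a single DMA transfer is limited to 16KB */

    /* two local-store buffers, ping-ponged; 128-byte alignment for DMA */
    static float vtx_buf[2][CHUNK_BYTES / sizeof(float)] __attribute__((aligned(128)));

    extern void process_chunk(float *chunk);   /* placeholder: SoA shuffle + transform */

    static void stream_vertices(uint64_t ea, int num_chunks)
    {
        int cur = 0;
        mfc_get(vtx_buf[cur], ea, CHUNK_BYTES, cur, 0, 0);   /* kick off the first transfer */

        for (int c = 0; c < num_chunks; ++c) {
            int next = cur ^ 1;
            if (c + 1 < num_chunks)                          /* prefetch the next chunk on the other tag */
                mfc_get(vtx_buf[next], ea + (uint64_t)(c + 1) * CHUNK_BYTES, CHUNK_BYTES, next, 0, 0);

            mfc_write_tag_mask(1 << cur);                    /* wait only for the buffer we're about to use */
            mfc_read_tag_status_all();

            process_chunk(vtx_buf[cur]);                     /* work on this buffer while the other one streams in */
            cur = next;
        }
    }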

    TODO:
    clip triangles to the frustum
    draw to the backbuffer (set up the cache, get the image data, ...)
    send a draw-finished signal to the PPU
    get the joypad data
    change the matrices
    flip the backbuffer
    go to SPU Part 1 if ( !PS-Button ), else clean up and exit


    TODO when everything else is done:

    consider using indexed triangles
    mip-mapping support for textures
    consider getting some sleep

    Thanks for your suggestions, but I guess this will take some time. :razz:
     
    #4 OzzyBC42, Aug 19, 2007
    Last edited by a moderator: Aug 19, 2007
  5. Vitaly Vidmirov

    Newcomer

    Joined:
    Jul 9, 2007
    Messages:
    108
    Likes Received:
    10
    Location:
    Russia
    Even the process_42triangles function is now somewhere near 72KBs

    You don't need that much unrolling. It's just a waste of LS.
    Design for indexed primitives, and optimize later unless you already know exactly what you're doing.
     
  6. ebola

    Newcomer

    Joined:
    Dec 13, 2006
    Messages:
    99
    Likes Received:
    0
    Tangent/off topic... (I agree his solution here is to unroll less, though, not to start writing a code compressor..)


    Has anyone ever thought of trying something exotic like code compression? I certainly don't have time to try it...

    It seems like unrolled loops would contain the information of one iteration, with the same instructions replicated 6x or whatever, just with incremented register indices.

    I wonder if there'd be a practical angle on this using code overlays: each loop body to decompress would be put in a separate overlay? Decompressing the next code chunk should take less time than dragging data out of main memory, right? It would keep you from burning bandwidth, and give you more freedom to create bigger tasks that in turn keep more working data in the same place..

    Could you negate the code-density disadvantages of in-order RISC using such a technique?
     
  7. Vitaly Vidmirov

    Newcomer

    Joined:
    Jul 9, 2007
    Messages:
    108
    Likes Received:
    10
    Location:
    Russia
    Seems like unrolled loops would contain the information of 1 iteration with the same instructions replicated 6x or whatever with incremented register indices

    These instructions are not just replicated, they're scheduled for latency/dual-issue.
    So we'd need an algorithm with a code table and indices, or a code generator.
    I don't see any significant gain here, and I'm too lazy to try it =)
     
  8. ebola

    Newcomer

    Joined:
    Dec 13, 2006
    Messages:
    99
    Likes Received:
    0

    I had this sort of case in mind
    for (i = 0; i < count; i += N)
    { DoSomething(i); DoSomething(i+1); ... DoSomething(i+N-1); }

    If DoSomething(i) == A(i), B(i), C(i),
    you'd expect the unrolling/scheduling to produce
    A(i) A(i+1) .. A(i+N-1)  B(i) B(i+1) .. B(i+N-1)  C(i) C(i+1) .. C(i+N-1)

    i.e. the individual instructions are replicated in a very predictable pattern.. seems like the sort of thing some RLE / differencing style custom compressor could deal with.

    Granted, SoA-type code dealing with X, Y, Z would probably look quite different (fewer unrolls).. but as I understand it, the completely general case needs 6 unrolls to hide the float latency (e.g. if you have dot-product calcs..)

    Most of the optimized code I've been dealing with is of the above pattern, AoS, coming from the Xbox 360.

    Having said that, I don't know the ratio of 'setup' to 'loop body' code.


    Thinking about it some more, there are parallels between designing for the SPEs (RAM->LS) and designing in-game data streaming (DVD->RAM), and I've seen this principle - 'load coarse-grained compressed, decompress finer-grained on demand' - used to good effect there...

    But you're right, this isn't exactly low-hanging fruit.
    It's probably most likely to appear in some demo-coders' challenge to see how much can be done with a single code upload :)

    Ah ok, after writing this long reply I can see the dual-issue scheduling complicating the simple pattern I present above somewhat. Maybe it still works with A, B, C being pairs of instructions rather than individual instructions.. I'll have to think about this some more :) (not that I'm likely to get round to trying it)
     
    #8 ebola, Aug 21, 2007
    Last edited by a moderator: Aug 21, 2007