Larrabee at GDC 09

Discussion in 'Architecture and Products' started by bowman, Feb 16, 2009.

  1. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    Which paper? I was referring to Abrash's article. At least in my understanding, the introduction clearly points to an initial lack of knowledge of parallel rasterization algorithms. Parallel or vectorized rasterization isn't 'challenging' or 'impossible' in any way if you have previous knowledge of the Pixel Planes algorithms.

    I'm not faulting Abrash or his team. Nobody knows everything; well, maybe Jawed, who knows all the graphics patents in this world :).

    In any case, why even bother with such details? The point of my post was to promote my simulator. I haven't been posting here in years, and now that I'm working on it again (at least for a little while) people must be reminded of such a wonder :wink:.
     
  2. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    I was referring to the paper that you linked (and I quoted your link) by McCool et al.
     
  3. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,405
    Likes Received:
    402
    Location:
    New York
    I'm sure this is a silly question, but how do tiling approaches like the one proposed for Larrabee, with limited buffers (L2), scale to multiple render targets? Is the available buffer simply allocated equally to the various targets, and what does this mean for scalability?
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    In theory to generate the data for MRTs you're doing more work (mix of ALU and TU) which helps to hide the extra latencies caused by both writing more data to the MRTs and dealing with the significantly reduced number of qquads that each core can support.

    In Seiler the comparison of binned rendering and immediate mode rendering shows a huge bandwidth saving. So this theoretically means Larrabee has significant (nay, monstrous) leeway due to the render back end running entirely out of cache.

    In traditional GPUs the render target cache (colour buffer and z/stencil buffer caches) routinely thrashes, even with a single render target (though colour rate is often significantly less than max and multiple quads of pixels will be output per thrash). In Larrabee the cached tile won't thrash; it's really functioning as a tile buffer with only minimal latency.

    Seiler talks about 32x32 being the typical smallest tile size, and explicitly talks about an RTset with many colour channels (i.e. MRTs) or high-precision formats. 32x32 is at most 64 qquads in flight. A tile with 4 MRTs (1x 32-bit depth + 4x 32-bit colour = 20 bytes per pixel) would only use 20KB, or 80KB with 4xMSAA. You'd have 2 or 3 threads sharing those 64 qquads plus 1 or 2 threads doing rasterisation/resolve, I suppose. You have to balance the number of in-flight qquads against the per-strand state (register allocation, not forgetting allocation for moving data to/from TUs).
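
As a quick sanity check of the tile arithmetic, here is a minimal Python sketch. The function names are made up for illustration, and it assumes that MSAA multiplies both colour and depth storage by the sample count (which is what the 80KB figure implies) and that a qquad covers 16 pixels.

```python
def tile_bytes(tile_w, tile_h, colour_targets,
               bytes_per_colour=4, bytes_per_depth=4, msaa_samples=1):
    """Bytes needed to keep one tile of an RTset resident in cache.

    Assumption: all colour targets share a single depth/stencil buffer,
    and every sample stores its own colour and depth values.
    """
    bytes_per_sample = colour_targets * bytes_per_colour + bytes_per_depth
    return tile_w * tile_h * msaa_samples * bytes_per_sample


def qquads_in_tile(tile_w, tile_h, pixels_per_qquad=16):
    """Upper bound on qquads in flight: one per 16-pixel region of the tile."""
    return (tile_w * tile_h) // pixels_per_qquad


print(tile_bytes(32, 32, colour_targets=4) // 1024)                  # 20 (KB)
print(tile_bytes(32, 32, colour_targets=4, msaa_samples=4) // 1024)  # 80 (KB)
print(qquads_in_tile(32, 32))                                        # 64
```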

    ---

    If a single render target is tiled as 128x128 per core, that consumes 128KB of L2. That's a maximum of 1024 qquads in flight, but then each strand could only have 8 bytes of state in L2. So in reality there'll be substantially fewer qquads in flight.
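
The 8-bytes-per-strand figure can be reproduced with a rough budget calculation. This is only a sketch under stated assumptions: 256KB of L2 per core (the figure used elsewhere in this thread), 8 bytes per pixel (32-bit colour plus 32-bit depth), 16 strands per qquad, and all L2 left over after the tile going to strand state. The function name is hypothetical.

```python
L2_BYTES_PER_CORE = 256 * 1024  # assumed per-core L2 size
STRANDS_PER_QQUAD = 16          # one strand per pixel of a 16-pixel qquad


def strand_state_budget(tile_w, tile_h, bytes_per_pixel,
                        l2_bytes=L2_BYTES_PER_CORE):
    """Return (tile_bytes, max_qquads, state_bytes_per_strand).

    Assumes every pixel of the tile has a strand in flight and that
    whatever L2 the tile does not occupy is split evenly between strands.
    """
    pixels = tile_w * tile_h
    tile = pixels * bytes_per_pixel
    max_qquads = pixels // STRANDS_PER_QQUAD
    strands = max_qquads * STRANDS_PER_QQUAD
    return tile, max_qquads, (l2_bytes - tile) // strands


tile, qquads, per_strand = strand_state_budget(128, 128, bytes_per_pixel=8)
print(tile // 1024, qquads, per_strand)  # 128 1024 8
```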

    So 64 qquads for the 4x MRTs tile, with an implicit 16:1 ALU:TEX (since Larrabee is serial scalar, 4:1 in vec4 terms) is still a substantial amount of latency hiding. If a modern game's MRT-generating pass is using a substantially lower ALU:TEX then that's just sucky.

    Put another way, GT200 is quite happy with 32 warps per multiprocessor, equivalent to 64 qquads, at what I guess are similar clockspeeds to those we'll see in Larrabee.
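
The warp-to-qquad equivalence is just a thread count conversion. A small sketch, assuming 32 threads per warp and 16 strands per qquad (the helper name is made up):

```python
THREADS_PER_WARP = 32   # NVIDIA warp width
STRANDS_PER_QQUAD = 16  # pixels (strands) per qquad


def warps_to_qquads(warps):
    """Convert a number of resident warps into the equivalent qquad count."""
    return warps * THREADS_PER_WARP // STRANDS_PER_QQUAD


print(warps_to_qquads(32))  # 64
```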

    Jawed
     
  5. cho

    cho
    Regular

    Joined:
    Feb 9, 2002
    Messages:
    416
    Likes Received:
    2
    If you are using an fp16 render target, that will consume 256KB of L2.
     
  6. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,405
    Likes Received:
    402
    Location:
    New York
    Ah thanks, guess I should've run the math to see that you can actually do a lot with 256KB of L2. Are you sure that 4 MRTs are only 20KB though? I get 80KB even without MSAA (20 KB for each).

    Is the Larrabee model strictly one strand per pixel or is it possible to have relatively fewer persistent qquads iterate over sub-tiles within the tile? I guess nothing is strict when it comes to LRB....

    Good point. I still wonder whether it would make sense for shared memory to double as a tile buffer. State storage won't be a problem as the register file will presumably continue to exist. But I guess for that to work you'd want a bit more than the 32KB mandated by DX11.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    You'd use a smaller tile in that case. And yet smaller if doing MSAA.
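
To illustrate the idea of shrinking the tile, here is a hypothetical sketch that picks the largest square, power-of-two tile fitting a given cache budget. The 128KB budget (half of an assumed 256KB L2), the 32 to 128 side range, and the 12 bytes per pixel for an fp16 RGBA target plus 32-bit depth are all assumptions for the example, not figures from Seiler.

```python
def largest_tile_side(budget_bytes, bytes_per_pixel, msaa_samples=1,
                      min_side=32, max_side=128):
    """Largest power-of-two tile side whose footprint fits the budget.

    Returns None when even the smallest allowed tile does not fit.
    """
    side = max_side
    while side >= min_side:
        if side * side * bytes_per_pixel * msaa_samples <= budget_bytes:
            return side
        side //= 2  # halve the side, i.e. quarter the tile area
    return None


BUDGET = 128 * 1024  # assumed share of L2 reserved for the tile

print(largest_tile_side(BUDGET, bytes_per_pixel=8))                   # 128
print(largest_tile_side(BUDGET, bytes_per_pixel=12))                  # 64
print(largest_tile_side(BUDGET, bytes_per_pixel=12, msaa_samples=4))  # 32
```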

    Jawed
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    A colour pixel in a normal render target is 4 bytes + z/stencil is 4 bytes. 4x MRTs share z/stencil, so 4*4+4=20 bytes per pixel for 32*32=1024 pixels = 20KB.

    There's four (hardware) threads per core, so each thread can support multiple qquads (fibres in the general sense, i.e. could be sets of vertices, etc.). So a strand is supporting a pixel from each of numerous qquads. Each qquad is, effectively, a different region of 16 pixels in the tile. So a thread with 8 qquads in flight, say, has 8 different pixels spread across the tile being processed by a single strand.

    qquads could easily iterate over the tile as you say, but each thread would normally have multiple qquads in flight at a given time, as four hardware threads, in their own right, aren't enough to hide most memory/texture latencies.

    I think shared memory, under graphics pipeline configuration (i.e. VS-GS-PS etc.), is used to hold triangle attributes ready for just-in-time interpolation by the pixel shader.

    But more importantly, Larrabee bins geometry into screen-space tiles. Seiler describes a tiled forward renderer as Larrabee doesn't attempt to Z-sort/cull triangles like a tiled deferred renderer does, instead relying upon a Z-buffer and un-ordered triangle rasterisation/pixel-shading.

    So once pixel shading of a tile is started it doesn't stop until all triangles in that tile have been rasterised and shaded. ATI and NVidia GPUs don't do any such binning, so putting a tile into shared memory would only last a short while before it's evicted for another tile. This exact process (thrashing tiles from the render target into on-die cache) is done by the ROPs. ATI functional diagrams contain blocks called "colour buffer cache" and "z/stencil buffer cache". Each of these is an independent cache dedicated to the named buffer.

    Jawed
     
  9. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Keep in mind that as long as you can prefetch your frame buffer data early enough from memory to L2, you can have almost arbitrarily big tiles. Obviously this is not an optimal solution, as it requires more memory bandwidth.
     
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,405
    Likes Received:
    402
    Location:
    New York
    Yes, of course. Excuse my fuzzy math :oops:
     
  11. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,634
    Likes Received:
    144
    #311 Megadrive1988, Jul 30, 2009
    Last edited by a moderator: Jul 30, 2009
  12. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,672
    Likes Received:
    441
    That didn't have a very high information content.
     
  13. repi

    Newcomer

    Joined:
    Dec 7, 2004
    Messages:
    203
    Likes Received:
    34
    Location:
    Sweden
    Mike is a great guy though!
     