X1800/7800gt AA comparisons

Discussion in 'Architecture and Products' started by Nite_Hawk, Oct 5, 2005.

Thread Status:
Not open for further replies.
  1. Nite_Hawk

    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    1,202
    Likes Received:
    35
    Location:
    Minneapolis, MN
    Ack, I hadn't had time yet to visit the site! :oops:

    Some of the research papers on that site look positively delicious! :twisted: The real-time global illumination work looks especially interesting. I need to get my home computer set up again so I can get back to building a raytracer. I was going to try to implement KD-trees, but it sounds like, at least as far as GPUs are concerned, the bounding volume hierarchy traversal technique is better...

    So much to read, so little time!!! UGGH!!!! I need those 256 hour days Rys was talking about...

    Nite_Hawk
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    The primary reason why I suspect this kind of organisation is not implemented in R520 is textures. Since MC channels enforce a tiling scheme in memory, and since textures will be spread randomly across tiles, a large portion of all memory accesses will not be channel-coherent.

    Indeed, memory accesses are best spread across as many channels as possible for texture reads, as far as I can tell.

    Additionally, R520's Ring Bus is designed explicitly to support a many-to-many relationship between memory clients and memory tiles.

    R520's 32-bit channels make the minimum memory access half the size of that in previous GPUs. But so far we have no clear description of the specific benefits that affords, particularly as there is (seemingly) a general push within R520 to "bulk up" memory accesses to make the best use of the latency-tolerant clients (texture pipes, RBEs, Hierarchical-Z ...). There must be certain kinds of tasks that thrive on these very small memory accesses, but what are they :?:

    At the heart of that question is how memory-tiling is utilised. If you take the 8 bytes of a typical pixel (4 bytes colour, 3 bytes Z, 1 byte stencil) it looks like it's possible to tile pixels "one byte" at a time - so a single quad of pixels is spread across all 8 memory channels, 4 bytes at a time per memory tile. That's 8 separate tiles altogether.
    • Tile 1 - one byte of red for four pixels
    • Tile 2 - one byte of green for four pixels
    • ...
    • Tile 5 - byte 1 (out of 3) of Z for four pixels
    • ...
    • Tile 8 - stencil byte for four pixels
    But that's just a guess (though a recent ATI patent suggests something pretty similar as far as I can tell)... (How does memory tiling work with AA? And FP16 HDR?...)
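
    The byte-per-tile guess above can be sketched as a small packing routine. This only illustrates the guess and does not describe R520: the field order (R, G, B, A, three Z bytes, stencil), the assumption that the fourth colour byte is alpha, and the names `pack_quad` and `unpack_quad` are all made up for the example.

```python
# Hypothetical byte-planar layout: byte k of every pixel in a quad goes to channel k.
FIELDS = ["R", "G", "B", "A", "Z0", "Z1", "Z2", "S"]  # assumed order, 8 bytes per pixel
NUM_CHANNELS = len(FIELDS)  # one channel per byte of the pixel
QUAD_PIXELS = 4


def pack_quad(quad):
    """Split a quad (4 pixels of 8 bytes) into 8 four-byte words, one per channel."""
    if len(quad) != QUAD_PIXELS or any(len(p) != NUM_CHANNELS for p in quad):
        raise ValueError("expected 4 pixels of 8 bytes each")
    return [bytes(pixel[ch] for pixel in quad) for ch in range(NUM_CHANNELS)]


def unpack_quad(words):
    """Inverse of pack_quad: rebuild the 4 pixels from the 8 per-channel words."""
    return [bytes(word[i] for word in words) for i in range(QUAD_PIXELS)]


# Dummy quad: pixel i holds the byte values 8*i .. 8*i+7.
quad = [bytes(range(i * 8, i * 8 + 8)) for i in range(QUAD_PIXELS)]
words = pack_quad(quad)
for name, word in zip(FIELDS, words):
    print(name, list(word))
```

    In this layout each channel holds one 4-byte word per quad, so reading back any single pixel touches all 8 channels.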

    Jawed
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I agree with all that, rasterisation/interpolation can be multi-threaded, following directly on from set-up.

    The only demerit I can think of is that texture data needs to be duplicated in texture caches multiple times when the same textures are being used across multiple screen-tiles, simultaneously. I expect that happens quite a lot.

    I expect there's a protocol in the MC to handle that case, so that texture reads can be aggregated if multiple pixel units (quads) request the same textures (presumably within a short timespan).

    Jawed
     
  4. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    That's what the L2 cache does, on NV4x.
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    And what's interesting is that ATI haven't bothered with an L1/L2 design since R300, so there must be some secret sauce we're missing.

    Jawed
     
  6. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    ROPs being tied to MCs doesn't mean TMUs are. I'm not sure which disadvantage you see here.

    Compressed texture tiles (64 bits per 4x4 tile for DXT1 and 3Dc).

    Then why multiple channels at all? I think this distribution would be a very bad idea: it increases granularity, and every framebuffer access means switching pages on every memory channel.
     
  7. Nite_Hawk

    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    1,202
    Likes Received:
    35
    Location:
    Minneapolis, MN
    Jawed,

    Forgive me as I'm not as well versed in this as you. Any mistakes I make are quite unintentional. ;)

    Well, the arrangement you describe seems like it would be a very good one for 8-byte pixels, as everything lines up nicely. For something like FP16 pixels (same amount of Z/stencil data?), the downside seems to be that your tiles would no longer align: all of a sudden you'd have partially empty channels carrying Z/stencil information because you are waiting on RGB data.

    So perhaps this kind of arrangement doesn't make sense for FP16 pixels. Perhaps instead you'd want a packed format, where you break up your pixels first on 4-byte boundaries to send via separate channels, and then on 32-byte boundaries. So 8 pixels in 3 transfers rather than 4 in one...
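
    The "8 pixels in 3 transfers" figure can be checked with a little arithmetic. This is a sketch under stated assumptions: a transfer is taken to be one 4-byte word on each of 8 channels (32 bytes), and an FP16 pixel is assumed to keep the same 4 bytes of Z/stencil next to 8 bytes of colour. The function name is hypothetical.

```python
from math import gcd

CHANNELS = 8             # memory channels, as discussed in the thread
BYTES_PER_CHANNEL = 4    # one 32-bit word per channel
TRANSFER_BYTES = CHANNELS * BYTES_PER_CHANNEL  # 32 bytes per full-width transfer


def smallest_aligned_group(bytes_per_pixel):
    """Return (pixels, transfers) for the smallest pixel group that fills whole transfers."""
    # Least common multiple of the pixel size and the transfer size.
    total = bytes_per_pixel * TRANSFER_BYTES // gcd(bytes_per_pixel, TRANSFER_BYTES)
    return total // bytes_per_pixel, total // TRANSFER_BYTES


RGBA8_PIXEL = 4 + 4      # 4 bytes colour + 4 bytes Z/stencil
FP16_PIXEL = 8 + 4       # 8 bytes colour + 4 bytes Z/stencil (assumed unchanged)

print(smallest_aligned_group(RGBA8_PIXEL))  # (4, 1)
print(smallest_aligned_group(FP16_PIXEL))   # (8, 3)
```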

    Nite_Hawk
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    No, I was only saying that texture reads, being a large proportion of all memory accesses, wouldn't benefit from a design where all textures "fit" into a single tile. They can't, anyway, obviously.

    The consensus, as far as I can tell, is that the best performance with large textures is for them to be spread across all memory tiles fairly evenly.

    Sadly I've got no idea what proportion of a game's textures (or workload, if you prefer) would consist of such small textures :oops:

    That's what's puzzling me - it looks nice to start with, but is such fine granularity in the back-buffer ever desirable? I don't think so either... It contradicts my earlier points about the CBC being used to deal with large areas of pixels instead of the RBE working on quads.

    What are the chances we'll find out from ATI how render targets are packed into memory tiles?...

    Jawed
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    It was a "loaded-guess" :twisted: designed to prompt a bit of discussion.

    The problem we've got is we're way out in the dark. Back-buffer memory-tiling is highly relevant to ATI's OGL AA performance gains, but I think we're out of luck as far as definitive answers are concerned.

    But yes, you're right, FP16 isn't a neat fit.

    And AA looks awkward, too, since AA comes in 2xAA lumps (each AA sample is 8 bytes, so that's 16 bytes) so 2xAA and 6xAA get messy.

    One solution to the mess, naturally, is an asymmetric packing - where 8 or 16 or more pixels in a block solve the "page" problem for non-RGBA8 back-buffers. In which case CBC comes into its own. Erm... :razz:

    Jawed
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    This may be related to why there is no difference at 2x and 6xAA in ATI's OGL-AA tweak:

    http://www.guru3d.com/news.html#3182

    :twisted:

    Jawed
     
  11. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    215
    Location:
    Uffda-land
    Oy. They nicked sireric's thots from here without attribution to the site. Bad Form. :sad:
     
  12. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    I guess the optimal distribution of textures across channels also depends on how multiple textures of different size are used together in shaders.

    But how is texturing related to having ROPs tied to MCs? And I'm not sure about what you consider a "memory tile".

    Small textures? DXT1/3Dc compressed textures consist of 4x4 texel tiles that are encoded in a 64 bit block. That doesn't mean the textures are small.

    ATI claims a best-case 24:1 compression ratio for Z data (with 6xAA enabled, so this means 8 bits per pixel). This compression is block- or tile-based. With tiles of 4x4 pixels, we get 128 bits per Z-tile, which is 32 bits times a burst length of 4.
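
    The arithmetic behind those numbers can be spelled out as follows. The 32 bits per Z sample (24-bit Z plus 8-bit stencil) is an assumption: it is the uncompressed size that makes 24:1 at 6xAA come out at 8 bits per pixel. All names are illustrative.

```python
Z_BITS_PER_SAMPLE = 32   # assumed: 24-bit Z plus 8-bit stencil per sample
AA_SAMPLES = 6
BEST_CASE_RATIO = 24     # ATI's claimed best-case compression ratio
TILE_PIXELS = 4 * 4      # Z-tile size used in the post
CHANNEL_BITS = 32
BURST_LENGTH = 4


def compressed_bits_per_pixel(samples, bits_per_sample, ratio):
    """Best-case compressed Z size per pixel, in bits."""
    raw_bits = samples * bits_per_sample
    if raw_bits % ratio:
        raise ValueError("ratio does not divide the raw size evenly")
    return raw_bits // ratio


bits_per_pixel = compressed_bits_per_pixel(AA_SAMPLES, Z_BITS_PER_SAMPLE, BEST_CASE_RATIO)
tile_bits = bits_per_pixel * TILE_PIXELS   # compressed size of one 4x4 Z-tile
burst_bits = CHANNEL_BITS * BURST_LENGTH   # data moved by one burst on one channel

print(bits_per_pixel, tile_bits, burst_bits)  # 8 128 128
```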
     
  13. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    There's no point. With NV4x/G70's dispatch the quads are likely to be working very closely in region to one another, so an L1/L2 cache design works here - ATI's quads are working on completely different regions, making it much less likely that they'll need to share much texture data between one another.
     
  14. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    A couple of months ago the algorithm implemented in the simulator for distributing quads to shader units was basically round robin (skipping fully occupied shader units) on a per-quad basis. Not surprisingly, when I finally decided to check what that was doing, the bandwidth consumed for textures was N times the texture data footprint, N being the number of texture units (or shader units, as I was testing the usual 1:1 arrangement).

    If you use that kind of random/round robin distribution (not on a single-quad basis, as that also reduces the hit rate a lot) and there is a lot of texture data being accessed by many of the texture units, an L2 cache makes a lot of sense. In the case, which I think may not be that frequent, where you can keep the 'current' texture working set in the L2 (not for a whole frame, for sure, but maybe for similar batches that use the same texture data that can't be stored in the small L1 texture caches), this second level would also help to further reduce texture bandwidth. I would like to test the L2 arrangement but I don't have the time now.

    Related to ATI: when I was testing how their texture caches worked in the R350 (and they do some really funny things that I can't fully explain), I discovered something like three bandwidth steps. The first was at 8 KB (the texture cache size), and the next two (I would have to look up that data, as I don't remember at which sizes they happened) were like a gradual reduction towards the available memory bandwidth, which made me wonder whether they really implemented an L2 or not. Of course there was a fourth when you hit AGP bandwidth.

    With 16x16 tiles, texture units only share data at the borders, at a much reduced rate. Now, when I test 8x8 tiles as the distribution unit, the excess texture bandwidth consumed is reduced.

    However, even if tile-based shader work distribution can help in accessing data (and removes the requirement for a crossbar to order the quads back to their proper ROPs or MCs), random/round robin distribution is better for load balancing in the shader units (the checkerboard case, for example :)). The tile algorithm carries the danger, when queues aren't large enough, of imbalance between the pipelines, or of a slow start for some of the pipelines if the first fragments/triangles miss them. What is better? It depends.
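
    The round robin versus tiled comparison can be reproduced with a toy model. This is not the simulator's algorithm. It assumes each unit has a private cache of unlimited size, each quad touches exactly one texture block, and the texture maps 1:1 onto the screen; screen size, block size and tile size are made-up values measured in quads.

```python
def round_robin(width, units):
    """Assign quads to units in raster order, one quad at a time."""
    return lambda x, y: (y * width + x) % units


def tiled(width, units, tile):
    """Assign whole tile x tile squares of quads to units, tile by tile."""
    tiles_per_row = -(-width // tile)  # ceiling division
    return lambda x, y: ((y // tile) * tiles_per_row + x // tile) % units


def texture_traffic(width, height, units, assign, block=4, offset=0):
    """Return (blocks fetched summed over all units, unique blocks touched)."""
    per_unit = [set() for _ in range(units)]
    for y in range(height):
        for x in range(width):
            texture_block = ((x + offset) // block, (y + offset) // block)
            per_unit[assign(x, y)].add(texture_block)
    footprint = len(set().union(*per_unit))
    fetched = sum(len(blocks) for blocks in per_unit)
    return fetched, footprint


W = H = 32   # screen size in quads (toy value)
UNITS = 4    # shader/texture units in a 1:1 arrangement

rr = texture_traffic(W, H, UNITS, round_robin(W, UNITS))
tl = texture_traffic(W, H, UNITS, tiled(W, UNITS, tile=8))
print("round robin:", rr)  # (256, 64): every block fetched once per unit
print("tiled 8x8:  ", tl)  # (64, 64): every block fetched once
```

    Passing a non-zero `offset` (for instance `offset=2`) misaligns the texture blocks with the tiles, so neighbouring units start sharing blocks at tile borders. In this model the tiled scheme then fetches more than the footprint but still less than round robin.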
     
    #314 RoOoBo, Oct 14, 2005
    Last edited by a moderator: Oct 14, 2005
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I'm basing my ideas on this:

    http://www.graphicshardware.org/presentations/bando-hexagonal-gh05.pdf

    It refers to both texture storage and framebuffer organisation across memory channels.

    I am suggesting that if texturing is "multi-tile" and a large consumer of bandwidth, optimisations for ROPs tied to MCs would prolly be sub-optimal for the same kinds of reasons that single-tiled texturing wouldn't make sense (though that's impossible, anyway).

    A tile is a contiguous region of memory in one channel, corresponding to one or more units of burst. So a tile might equal the minimum burst-length in a channel, or a multiple of that.

    I'm not sure. I don't know enough about this subject and what the typical constraints on memory access (banking, paging, bursting etc.) are. :cry:

    :oops:

    Earlier I was forgetting that these memory devices have a burst length of 8, I think.

    It would be so much easier if Eric would explain it all :!:

    Jawed
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    What about ground textures (e.g. repeating cobbles) - won't there be multiple instances in the texture caches of ATI GPUs?

    Jawed
     
  17. ERK

    ERK
    Regular

    Joined:
    Mar 31, 2004
    Messages:
    287
    Likes Received:
    10
    Location:
    SoCal
    But no penalty for this for R5XX series, right? Don't know about previous gen.
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I think the other missing ingredient in this discussion is how the NVidia and ATI architectures construct batches.

    If a triangle is too small to fill a batch (i.e. there are fewer quads in the triangle than the nominal batch size for the architecture), does the GPU fill the batch with more triangles (e.g. the succeeding triangles in a mesh)? Or is the empty space in the batch just entirely lost cycles?
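
    To put rough numbers on what is at stake in that question, here is a sketch comparing the two policies. The batch size and triangle sizes are invented, since the thread does not give real ones, and the "packed" case is an idealised upper bound that lets any triangle's quads top up any batch.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def batches_one_triangle_each(tri_quads, batch_size):
    """Batches needed if a batch never mixes quads from different triangles."""
    return sum(ceil_div(q, batch_size) for q in tri_quads)


def batches_packed(tri_quads, batch_size):
    """Batches needed if succeeding triangles may top up a partly filled batch."""
    return ceil_div(sum(tri_quads), batch_size)


def utilisation(tri_quads, batch_size, batches):
    """Fraction of batch slots that hold a real quad."""
    return sum(tri_quads) / (batches * batch_size)


BATCH_SIZE = 16                # hypothetical; real batch sizes are not given here
mesh = [3, 5, 2, 40, 1, 13]    # quads per triangle, made-up numbers

unpacked = batches_one_triangle_each(mesh, BATCH_SIZE)
packed = batches_packed(mesh, BATCH_SIZE)
print(unpacked, utilisation(mesh, BATCH_SIZE, unpacked))  # 8 0.5
print(packed, utilisation(mesh, BATCH_SIZE, packed))      # 4 1.0
```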

    Jawed
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I'm not sure how you conclude that, since R5xx has the same per-quad texture cache organisation as R3xx...R4xx.

    The difference in R5xx is that the caches are larger (I think) and fully associative.

    Jawed
     
  20. Nite_Hawk

    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    1,202
    Likes Received:
    35
    Location:
    Minneapolis, MN
    Seems like it would be incredibly wasteful to not fill the batch, but again, I suppose it depends on how hard it is to fill the batch with triangles from the succeeding mesh. It probably also depends on how much space is left. Do you worry about it if you can only cram one more triangle in?

    Speaking of which, how big are the triangle batches? Do we know?

    Nite_Hawk
     
