Tile-based Rasterization in Nvidia GPUs

Discussion in 'Architecture and Products' started by dkanter, Aug 1, 2016.

  1. Mat3

    Newcomer

    Joined:
    Nov 15, 2005
    Messages:
    163
    Likes Received:
    8
    What's the difference between a TBDR's parameter buffer of binned triangles and what's in a G buffer? Once you've got the G buffer, why isn't that enough to start rasterizing tile by tile?
     
  2. Rodéric

    Rodéric a.k.a. Ingenu
    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,986
    Likes Received:
    846
    Location:
    Planet Earth.
    If you filled your g-buffer you already rasterized the scene...
    And translucent objects aren't in the g-buffer.

I suspect the tiles also improve data compression, besides everything already mentioned. (more coherent texture reads = reduced bandwidth, translucent overdraw in L2 = reduced bandwidth, fewer ROP "exports" = reduced bandwidth...)
     
    Heinrich04 and Mat3 like this.
  3. PixResearch

    Regular

    Joined:
    May 20, 2010
    Messages:
    187
    Likes Received:
    46
    Location:
    London, UK
    Bit off topic as it's not how NVidia's approach works but...

The parameter buffer is the intermediate storage written after the geometry processing phase but before the rasterization/3D phase, i.e. all the primitives that might be in any tile, plus a per-tile entry indicating which of those primitives might be visible in that tile. It's just a big pile of geometry, pointers and masks that have had no pixel shading applied, much of which may ultimately not be visible in the final render.

A G-buffer, on the other hand, is an intermediate set of data that is the result of a rasterization/3D process that has already determined what is visible at each fragment location (normally it's limited to storing only one of the things visible at each location, which is usually fine but not in cases like transparency). It's multiple render targets' worth of data written out from the pixel shaders, holding all kinds of values needed for a subsequent lighting pass.

Really, the closest equivalent to a G-buffer in a TBDR system is the on-chip tile buffers in the 3D phase; they effectively store much of the same data. BTW, if you're sensible, on several TBDR/TBIR architectures you can exploit that to create the G-buffer data for the current tile only in the internal memories, use it immediately, and throw it away without ever needing to write it out and read it back in, for a huge bandwidth saving. (see modern APIs and pixel local storage extensions)
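To make the contrast concrete, here is a deliberately simplified sketch (my own illustration; all type and field names are hypothetical, not any real API) of what one entry of each structure holds:

```python
from dataclasses import dataclass

# Hypothetical, simplified layouts contrasting the two intermediates.

@dataclass
class ParameterBufferTile:
    """TBDR intermediate: pre-raster geometry, stored per tile."""
    primitive_indices: list   # which scene primitives may touch this tile
    coverage_masks: list      # conservative per-primitive coverage bits

@dataclass
class GBufferSample:
    """Deferred-shading intermediate: post-raster, stored per sample."""
    depth: float
    normal: tuple             # (x, y, z), often already normal-mapped
    albedo: tuple             # (r, g, b)
    material_id: int
```

The point of the sketch is the axis of storage: the parameter buffer is indexed by tile and holds unshaded primitives, while the G-buffer is indexed by screen sample and holds already-resolved surface attributes.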
     
  4. milk

    Veteran Regular

    Joined:
    Jun 6, 2012
    Messages:
    2,988
    Likes Received:
    2,560
I think he got hung up on the "G" in G-buffer and wrongly assumed it stores geometry information as in verts and polys, which it doesn't. It stores geometric information as in depth and normals, on a per-sample basis. The normals in the buffer are usually already modulated by normal maps, and in some next-gen engines even the depth might have been displaced by a parallax shader. Once written, it's completely agnostic to the actual underlying polygonal geometry.
     
    Simon F and Mat3 like this.
  5. AnomalousEntity

    Newcomer

    Joined:
    Jun 6, 2016
    Messages:
    38
    Likes Received:
    25
    Location:
    Silicon Valley
That depends. If your architecture can run a position-only vertex shader, then you don't need to store anything: just run the vertex shader and bin your triangles. A G-buffer, on the other hand, stores everything (albedo, normals, gloss, material ID, ...) required to compute lighting on those triangles.

It's not agnostic to the underlying geometry, otherwise you wouldn't be able to compute SSAO or even lighting. You only store the Z, as you already know the X,Y of the pixel.
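As an illustration of why storing only Z suffices, here is a small sketch (my own, with a hypothetical helper name, assuming a simple symmetric perspective projection) of recovering a view-space position from the pixel coordinates plus the stored depth:

```python
# Sketch: view-space position reconstructed from (pixel x, pixel y, depth).
# All parameters are illustrative; real engines bake these into a matrix.

def reconstruct_view_pos(px, py, depth, width, height, tan_half_fov, aspect):
    # map the pixel centre to normalized device coordinates in [-1, 1]
    ndc_x = (px + 0.5) / width * 2.0 - 1.0
    ndc_y = 1.0 - (py + 0.5) / height * 2.0
    # scale by linear view-space depth to undo the perspective divide
    vx = ndc_x * tan_half_fov * aspect * depth
    vy = ndc_y * tan_half_fov * depth
    return (vx, vy, depth)

# a pixel at the screen centre reconstructs to a point on the view axis
print(reconstruct_view_pos(0, 0, 5.0, 1, 1, 1.0, 1.0))  # (0.0, 0.0, 5.0)
```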
     
    sebbbi likes this.
  6. milk

    Veteran Regular

    Joined:
    Jun 6, 2012
    Messages:
    2,988
    Likes Received:
    2,560
What I meant was: once written, the G-buffer knows nothing about the tris and verts that were used to build it.
     
  7. AnomalousEntity

    Newcomer

    Joined:
    Jun 6, 2016
    Messages:
    38
    Likes Received:
    25
    Location:
    Silicon Valley
That's true generally, as you've moved from object space to screen space. However, you can store anything inside your G-buffer, even primitive IDs; it depends on your use case.
     
    sebbbi likes this.
  8. HTupolev

    Regular

    Joined:
    Dec 8, 2012
    Messages:
    936
    Likes Received:
    564
    The point milk is making is that it's rasterizing the geometry and storing properties in screen-space, rather than being pre-raster data in geometry primitive space.
     
    sebbbi and milk like this.
  9. milk

    Veteran Regular

    Joined:
    Jun 6, 2012
    Messages:
    2,988
    Likes Received:
    2,560
    Thanks HTupolev
     
  10. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    859
    Likes Received:
    262
I wonder if it's already possible to do efficient pixel caching with a TBDR system. Reducing the pixel set to all pixels with unique parameter vectors shouldn't be too hard at some smallish tile size. The acting-like-a-pixel-shader part could only be done with compute, of course, with each unique vector to update (not already in the cache) assigned to one lane. Then scatter the results from the cache vector as a separate pass.
Are there any APIs which let you program the pipeline, its stages, and the (re-)scheduler between the stages yourself?
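The idea above can be sketched on the CPU (a toy model of my own, not real GPU compute): collapse a tile's pixels to their unique parameter vectors, shade each unique vector once, then scatter the cached results back out:

```python
# Toy model of per-tile shading deduplication: shade once per unique
# parameter vector, then a scatter pass fans the results back to pixels.

def shade(params):
    # stand-in "pixel shader": any pure function of the parameter vector
    return sum(params)

def shade_tile_deduplicated(tile_params):
    """tile_params: list of per-pixel parameter tuples for one tile."""
    cache = {p: shade(p) for p in set(tile_params)}  # one lane per vector
    return [cache[p] for p in tile_params]           # scatter pass

pixels = [(1, 2), (1, 2), (3, 4), (1, 2)]
print(shade_tile_deduplicated(pixels))  # [3, 3, 7, 3] from 2 shades, not 4
```

The saving depends entirely on how often parameter vectors repeat within a tile; with all-unique vectors this does strictly more work than shading directly.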
     
  11. Philip

    Joined:
    Aug 3, 2016
    Messages:
    5
    Likes Received:
    15
    From my testing on a GTX 970, I think the explanation for the weirdness is something like:

    The screen is split into tiles of 16x16 px, which are assigned to 4 interleaved partitions. (My terminology is probably all wrong but never mind). Each partition corresponds to a GPC, and the number of tiles in each is proportional to the number of SMMs. The 970 has 13 SMMs in total so the partitions are unequal. Each partition gets rasterised almost completely independently.

    I don't know how to tell whether pixel shaders for each tile are restricted to running on the SMMs in the GPC corresponding to that tile's partition, but I guess it would make sense if they were.

    On my device I believe the assignment pattern is
    Code:
    p = [0,1,2,3,0,2,3,0,1,3,0,1,2]
    partition(x, y) = p[(x + y*2) % 13]
    
    where x,y are the tile index starting from the top left of the screen. That gives partition sizes in the ratio 4:3:3:3.

    On David's video, his looks more like
    Code:
    p = [0,1,2,3,0,1,2,0,1,3,0,1,2]
    partition(x, y) = p[(x + y*2) % 13]
    
    That gives the ratio 4:4:3:2. I assume that corresponds to a different arrangement of disabled SMMs in his device.

    The smaller partitions finish quicker, so the pattern becomes clearly visible as the partitions diverge.

    From the video, the GTX 1070 (3 GPCs) looks more like
    Code:
    partition(x, y) = x % 3
    
    Those partitions are equal over an infinite area, but don't fit uniformly into the ~512x512 px region that gets rasterised first, so the pattern becomes visible when the partition that's smaller in the first region starts the next region before the others do. Devices with 2 or 4 GPCs should have a much less visible pattern, since everything divides nicely there.
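The mismatch with the ~512 px region is easy to see with a few lines (my own sketch, assuming the 16x16 px tiles mentioned above):

```python
from collections import Counter

# Hypothesis from the post: on the 1070, partition(x, y) = x % 3.
# A ~512 px wide region holds 32 tile columns, and 32 % 3 != 0,
# so one partition gets one column fewer per row in that region.
tiles_per_row = 512 // 16
counts = Counter(x % 3 for x in range(tiles_per_row))
print(dict(counts))  # {0: 11, 1: 11, 2: 10}
```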

    (I'm mildly surprised they don't do something like "(x + y) % 3" on the 1070 to make thin vertical objects get distributed more evenly between the partitions.)

    It also looks compressed to me - I see it handling a lot more triangles per pass if I put duplicated values in the vertex shader outputs, than if the values are all unique. So that makes it even more complicated to analyse :(

    (But I'm certainly not an expert so I'd be happy to learn if I'm misinterpreting all this stuff!)
     
    Gubbi, Newguy, CSI PC and 5 others like this.
  12. Philip

    Joined:
    Aug 3, 2016
    Messages:
    5
    Likes Received:
    15
    Hmm, I see there are screenshots from other devices on http://www.neogaf.com/forum/showthread.php?t=1256067 which look to me like:

    1080 (4 GPC, 20 SM): partition(x, y) = (x + 3*y) % 4
    980Ti (6 GPCs, 22 SMM): partition(x, y) = [0,1,2,3,4,0,1,2,3,5,0,1,2,3,4,5,0,1,2,3,4,5][(x + 3*y) % 22] or something a bit like that
    Titan X (6 GPCs, 24 SMM): can't really tell
    840M (1 GPC, 3 SMM): partition(x, y) = 0

    The 1070 seems like the odd one out, the others are all interleaving the partitions sensibly when possible.
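The guessed 980 Ti table implies a particular split of 22 SMMs across the 6 GPCs; tallying it (my own sketch of the table quoted above):

```python
from collections import Counter

# Philip's guessed 980 Ti lookup table (6 GPCs, 22 SMMs)
p = [0, 1, 2, 3, 4, 0, 1, 2, 3, 5,
     0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
counts = Counter(p)
print(sorted(counts.values(), reverse=True))  # [4, 4, 4, 4, 3, 3]
```

That would mean four GPCs with 4 SMMs each and two with 3, consistent with 22 enabled SMMs.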
     
  13. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    That's a surprise. I expected the buffer to hold full triangles in unpacked form.

Full compression seems unlikely though; more likely it's just de-duplication of vertices (by hash?) and (re)indexing?
     
    #53 Ext3h, Aug 6, 2016
    Last edited: Aug 6, 2016
  14. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Maybe they push the vertex data through DCC or similar. DCC could actually benefit vertex buffers as well.
     
    Heinrich04 likes this.
  15. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,435
    Likes Received:
    263
    Couldn't the compiler be removing duplicates?
     
  16. cho

    cho
    Regular

    Joined:
    Feb 9, 2002
    Messages:
    416
    Likes Received:
    2
Hmm, could you guys tell me how to determine from that video whether Maxwell/Pascal do tile-based deferred rendering or not?
     
  17. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    Unlikely, unless you expect the compiler to evaluate which vertex IDs would result in identical vertex data, and the result of that evaluation would then be used to reorganize the geometry completely?

    @cho
Simple: you would expect the GPU to render the geometry in the order it was submitted, in this case from the bottom-most triangle to the top-most. Instead the GPU forms tiles, and starts rendering the triangles intersecting one tile first, before moving on to the next tile.

It's not strictly tile-based in the classic meaning, as the whole render pipeline isn't tiled, and it can even revisit the same tile multiple times within a single batch of geometry if locality is insufficient, thrashing the cache.

    Furthermore, the observations made are in line with a patent Nvidia filed: https://www.google.com/patents/US20140118366
     
  18. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
If I may push back a little: there are thousands and thousands of patents and research papers about tiled rasterization and tiled rendering, starting in the late 90s, from ATI, 3DFX, Intel, AMD and of course Nvidia (nearly everyone, really), so it's hard to say whether this patent is the basis of what Nvidia is using right now. The problem is that since Nvidia never revealed that they were using it, or how they execute it, we don't know how they do it. This patent can give some information, but I really doubt the technique used today is exactly the one described there. A lot of things, even the cache architecture, could have changed in between.
     
    #58 lanek, Aug 7, 2016
    Last edited: Aug 7, 2016
  19. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,435
    Likes Received:
    263
    I read Philip's post differently than you did. It seems to me that he's outputting the same data for multiple attributes, from a single vertex. There's no need to compare data across vertices.
     
  20. Philip

    Joined:
    Aug 3, 2016
    Messages:
    5
    Likes Received:
    15
I saw the apparent compression in both cases: using the same value for multiple attributes in one vertex, and using the same value for one attribute in multiple vertices.

But... after more testing, I'm not convinced that it isn't really just the compiler being smart. E.g. the original code cycles through colours with "(input.VertexID / 3) % 7". If I add a line like "output.Color *= (1.0 + (input.VertexID / 3) * 1e-16);", which should (I think) have zero effect on the output values but (I guess) makes it harder for the compiler to optimise, then it renders significantly fewer triangles per pass. I don't understand what optimisations the compiler could possibly be doing with the cyclic colours, though, so it still seems mysterious to me.
     