Tile-based Rasterization in Nvidia GPUs

Discussion in 'Architecture and Products' started by dkanter, Aug 1, 2016.

  1. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    http://www.realworldtech.com/tile-based-rasterization-nvidia-gpus/

    Starting with the Maxwell GM20x architecture, Nvidia high-performance GPUs have borrowed techniques from low-power mobile graphics architectures. Specifically, Maxwell and Pascal use tile-based immediate-mode rasterizers that buffer pixel output, instead of conventional full-screen immediate-mode rasterizers. Using simple DirectX shaders, we demonstrate the tile-based rasterization in Nvidia’s Maxwell and Pascal GPUs and contrast this behavior to the immediate-mode rasterizer used by AMD.
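    For reference, here is a minimal sketch of the kind of DirectX pixel shader the article describes (names and exact structure are guesses, not the article's actual code): every invocation bumps a global atomic counter, and pixels are colored by whether they were shaded before a UI-controlled cutoff, so sweeping the cutoff replays the order in which the rasterizer visited the screen.

        // Hypothetical reconstruction of the visualization shader: the atomic
        // add gives every pixel shader invocation a global sequence number,
        // which a threshold turns into a visible rasterization-order map.
        RWByteAddressBuffer gCounter : register(u1); // u0 is taken by the render target

        cbuffer Params : register(b0)
        {
            uint gCutoff; // swept by the UI slider from 0 to total invocations
        };

        float4 main(float4 pos : SV_Position) : SV_Target
        {
            uint order;
            gCounter.InterlockedAdd(0, 1, order); // side effect: one add per invocation
            // Invocations numbered below the cutoff show white, the rest black.
            return (order < gCutoff) ? float4(1, 1, 1, 1) : float4(0, 0, 0, 1);
        }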
     
    Alexko, Heinrich04, Razor1 and 12 others like this.
  2. Kaarlisk

    Regular Newcomer Subscriber

    Joined:
    Mar 22, 2010
    Messages:
    293
    Likes Received:
    49
  3. Rodéric

    Rodéric a.k.a. Ingenu
    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,986
    Likes Received:
    847
    Location:
    Planet Earth.
    That wouldn't be very useful, as it's basically commenting on what's shown on screen ^^
    nVidia is using some kind of tiling in its newest GPUs, which pretty much explains the gains in efficiency (power & occupancy).
     
    #3 Rodéric, Aug 1, 2016
    Last edited: Aug 2, 2016
    Alexko likes this.
  4. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Great video but (we thought) we knew this already :)

     
    pharma, Arun, Razor1 and 2 others like this.
  5. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    I would assume that the tile size matches the ROP cache size. However, Nvidia hardware doesn't have dedicated ROP caches, so I'd assume that the tile buffer resides in the L2 cache (where they usually keep the ROP outputs). Did you pixel count the tile sizes? My guess would be something between [32x32, 128x128], as that's close to the footprint of traditional ROP caches.

    Some years ago I did ROP cache experiments with AMD GCN (7970) in order to optimize particle rendering. GCN has dedicated ROP caches (16 KB color, 4 KB depth). In my experiment I split the rendering into 64x64 tiles (= 16 KB). This resulted in huge memory bandwidth savings (and a more than 100% performance increase), especially when the overdraw was large (lots of full screen alpha blended particles close to the camera). You can certainly get big bandwidth advantages on AMD hardware as well, as long as you sort your workload (by screen locality) before submitting it.
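    A minimal sketch of that binning idea as an HLSL compute shader (the buffer names and two-pass structure here are illustrative, not the code actually used). This is just the counting pass; a prefix sum + scatter pass would follow to build the per-tile particle lists that get drawn tile by tile.

        // Counting pass of a hypothetical two-pass particle binning, assuming
        // 64x64 pixel tiles so each tile's 32bpp color footprint (16 KB) fits
        // the GCN ROP color cache.
        struct Particle { float2 screenPos; float radius; };

        StructuredBuffer<Particle> gParticles  : register(t0);
        RWStructuredBuffer<uint>   gTileCounts : register(u0); // one counter per tile
        cbuffer Screen : register(b0) { uint2 gTilesDim; uint gParticleCount; };

        [numthreads(64, 1, 1)]
        void main(uint3 id : SV_DispatchThreadID)
        {
            if (id.x >= gParticleCount)
                return;
            // Bin by particle center; a full version would bin against the
            // particle's screen bounds so big particles land in every tile
            // they touch.
            uint2 tile = min(uint2(gParticles[id.x].screenPos) / 64, gTilesDim - 1);
            InterlockedAdd(gTileCounts[tile.y * gTilesDim.x + tile.x], 1);
        }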

    It's hard to draw 100% accurate conclusions from the results. This doesn't yet prove whether Nvidia is just buffering some work + reordering on the fly to reach a better ROP cache hit ratio, or whether they actually do hidden surface removal as well (saving pixel shader invocations in addition to bandwidth). This particular test shader doesn't allow the GPU to perform any hidden surface removal, since it increments an atomic counter (it has a side effect).

    To test HSR, you'd have to enable z-buffering (or stencil) and use the [earlydepthstencil] tag in the pixel shader. This tag allows the GPU to skip shading a pixel even when the shader has side effects (the DX documentation is incorrect about this). Submit triangles in back-to-front order to ensure that early depth doesn't cull anything with immediate mode rendering. I would be interested to see whether this results in zero overdraw on Maxwell/Kepler (in this simple test with some overlapping triangles, and also with higher triangle counts).
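    As a concrete sketch of that probe (buffer names assumed, untested):

        // With [earlydepthstencil] the GPU may kill the invocation before the
        // shader runs, side effects included, so the counter only advances for
        // pixels that pass depth/stencil. Drawn back-to-front with a LESS depth
        // test, an immediate mode renderer shades every covered pixel of every
        // triangle; a renderer doing tiled HSR would shade each pixel only once.
        RWByteAddressBuffer gCounter : register(u1);

        [earlydepthstencil]
        float4 main(float4 pos : SV_Position) : SV_Target
        {
            uint unused;
            gCounter.InterlockedAdd(0, 1, unused); // counts surviving invocations
            return float4(0, 1, 0, 1);
        }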

    It would also be interesting to know how many (vertex output) attributes fit in the buffer.
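    A crude way to probe it (the NUM_EXTRA knob and the semantics below are made up for the sketch): pad the vertex output with dummy interpolants and watch, in the rasterization-order visualization, when tile passes start splitting at lower triangle counts.

        // Vertex shader that bloats its output with dummy interpolants;
        // recompile with a growing NUM_EXTRA and see when fewer triangles fit
        // per tile pass. The paired pixel shader should nominally read the pad
        // values so the compiler can't strip them.
        #define NUM_EXTRA 8

        struct VSOut
        {
            float4 pos            : SV_Position;
            float4 pad[NUM_EXTRA] : PAD; // PAD0..PAD7, pure buffer ballast
        };

        VSOut main(float3 p : POSITION)
        {
            VSOut o;
            o.pos = float4(p, 1.0);
            [unroll]
            for (uint i = 0; i < NUM_EXTRA; ++i)
                o.pad[i] = float4(0, 0, 0, 0);
            return o;
        }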

    The new (Nvidia and Oculus) multiview VR extensions would definitely benefit from separating the SV_Position part of the vertex shader into its own shader. This would also greatly benefit tiled rendering (do tile binning first, execute the attribute shader later). I wouldn't be surprised if Nvidia already did something like this in Maxwell or Pascal, as both GPUs introduced lots of new multiview VR extensions.

    I just wish Nvidia would be as open as AMD regarding their GPU architecture :)
     
    #5 sebbbi, Aug 1, 2016
    Last edited: Aug 1, 2016
    Pete, spworley, Silent_Buddha and 8 others like this.
  6. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Wouldn't they have to go out of their way to not save pixel shader invocations with this approach? Seems the most straightforward thing to do is submit finished tiles to the pixel shader.
     
  7. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    The most straightforward approach is just to rasterize the triangles of the tile to the ROP cache (with no sorting or HSR inside the tile). This already gives you all the bandwidth gains (as only the ROP cache is touched, not main memory).

    The test application didn't use depth buffering and had a side effect (which was clearly handled properly). In this test case, the GPU executed the pixel shader multiple times for each pixel (and not just once for the last invocation); otherwise the atomic counter would have increased only once per pixel (not once per overdrawn pixel), and that was clearly not happening. The percentage slider worked fine, so the GPU executed the pixel shader multiple times per pixel, as instructed. There was no HSR. Side effects still need to be handled properly, and this test case proves that they are. The rendering result was legit. The ordering was of course different compared to a pure immediate mode renderer.

    Tiled HSR needs some additional on-chip memory, as you first need to rasterize all the tile's triangles to the tile buffer to determine per pixel visibility. A 16 bit triangle id (per pixel) is enough (tile sizes up to 256x256 can be supported). The GPU can simply fetch + interpolate the vertex attributes from the on-chip memory by indexing it with the triangle id. A custom (software) tiled renderer can do the same, but a hardware solution can efficiently cache the vertex attribute calculations, and the 16 bpp tile buffer stays on-chip during the whole process.
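    As a software illustration of that scheme (the 16x16 tile size, names, and the flat per-triangle depth and color are simplifications for the sketch, not a claim about the hardware):

        // Each thread owns one pixel of the tile. Phase 1 walks the tile's
        // pre-binned triangle list keeping only the nearest hit (the per-pixel
        // bestId stands in for the on-chip 16 bpp triangle id buffer). Phase 2
        // shades each pixel exactly once, indexing attributes by id, so
        // occluded triangles cost no shading work at all.
        struct Tri { float2 v0, v1, v2; float z; float4 color; }; // screen space, CCW

        StructuredBuffer<Tri> gTileTris : register(t0);
        cbuffer TileInfo : register(b0) { uint2 gTileOrigin; uint gTriCount; };
        RWTexture2D<float4> gOutput : register(u0);

        float EdgeFn(float2 a, float2 b, float2 p)
        {
            return (p.x - a.x) * (b.y - a.y) - (p.y - a.y) * (b.x - a.x);
        }

        [numthreads(16, 16, 1)]
        void main(uint3 tid : SV_GroupThreadID)
        {
            float2 p = float2(gTileOrigin + tid.xy) + 0.5;
            uint  bestId = 0xFFFF; // 16-bit "no triangle" sentinel
            float bestZ  = 1.0;    // far plane

            // Phase 1: visibility only; a flat z per triangle keeps it short.
            for (uint i = 0; i < gTriCount; ++i)
            {
                Tri t = gTileTris[i];
                bool inside = EdgeFn(t.v0, t.v1, p) >= 0 &&
                              EdgeFn(t.v1, t.v2, p) >= 0 &&
                              EdgeFn(t.v2, t.v0, p) >= 0;
                if (inside && t.z < bestZ) { bestZ = t.z; bestId = i; }
            }

            // Phase 2: one shade per pixel; the flat color stands in for
            // interpolating the winning triangle's vertex attributes.
            if (bestId != 0xFFFF)
                gOutput[gTileOrigin + tid.xy] = gTileTris[bestId].color;
        }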

    Handling of alpha blending and side effects, however, needs special care, as you need to handle overdraw: a single triangle id per pixel is not enough. If Nvidia has tiled HSR, they still must be able to handle this case (possibly by disabling HSR and falling back to an alternative solution).
     
    Silent_Buddha, Heinrich04 and BRiT like this.
  8. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    Really interesting; hoping more information comes out soon about how they have implemented it.
     
    #8 lanek, Aug 1, 2016
    Last edited: Aug 1, 2016
  9. tangey

    Veteran

    Joined:
    Jul 28, 2006
    Messages:
    1,457
    Likes Received:
    214
    Location:
    0x5FF6BC
    Obvious question(s):
    Who was the first to do tiling, do they have the patent, and has it expired?
     
  10. Putas

    Regular Newcomer

    Joined:
    Nov 7, 2004
    Messages:
    392
    Likes Received:
    59
    Which tiling? Even if Nvidia did TBDR, they should have the old IP through the Gigapixel-3dfx heritage.
     
  11. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    It depends on the SKU and framebuffer format/MRTs of course, but they get up to ~512^2 in size for a single 32bpp single-sample render target. Almost certainly related to the addition of the larger L2$ in Maxwell. There's definitely some weirdness in the 970 that David tested, though, which is almost certainly related to its disabled clusters. On "fully enabled" parts you don't see any of that weird hashed run-ahead of multiple tiles - it's all very balanced, and it goes from one tile to the next.

    I'm pretty sure this was meant mainly to get the conversation going, since NVIDIA still denies anything is even going on ;) It does actually get much more complicated when you start looking at non-full-screen triangles. It's definitely not just simple ROP cache stuff. In fact, as various tech sites observed, if you stay within a tile Maxwell can actually exceed its theoretical ROP rate! Thus it's likely they aren't even using the ROPs while they are able to do things "in tile", until they have to dump the tile or similar.

    There's no hidden surface removal (at least that I've ever seen in any test), this is all basically just rescheduling to capture coherence.

    Vertices/triangles are fully buffered (with all attributes) on-chip, up to ~2k triangles (depending on the SKU and vertex output size), before a tile "pass" is run. Again, this gets a lot more complicated when not considering full-screen triangles, but I think keeping the original article high level makes sense.

    There's no indication they are doing any position-only shading in Maxwell, but I agree that this is an obvious next step and I'm guessing if desktop/mobile architectures do converge at some point they will end up in a middle-ground with something like position only shading running ahead and TBIMR/DR depending on state following.

    Yep, no kidding, which is why I think it's good to at least get some of the info out there, so that others can investigate and maybe NVIDIA can stop denying anything is happening and be a bit more open about legitimately cool tech :)
     
  12. tangey

    Veteran

    Joined:
    Jul 28, 2006
    Messages:
    1,457
    Likes Received:
    214
    Location:
    0x5FF6BC
    Tiling appears to be the mechanism of breaking the screen up into smaller regions for the purposes of more efficient processing. Although it is often referred to in association with deferred rendering, in particular when talking about IMG's IP, the article indicates that Nvidia is using tile-based immediate rendering. I am asking whether tiling itself is protected by patent; if so, who holds it, and has it expired?

    Did Gigapixel-3dfx own the patent rights?

    I suspect that if there is a patent on it, it's been around long enough that the protection it provides may have expired.
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,125
    Likes Received:
    2,884
    Location:
    Well within 3d
    The concept of tiling itself is widespread. Even specific elements of the graphics pipeline can be tiled, which may very well be the case for the majority of vendors when it comes to items like rasterizers, render backends, and the mapping of address space to physical controllers.
    More specific methods (deferred, immediate, hybrid) can have patents, and those are held by a wide swath of graphics vendors--even AMD (with or without any possible holdover from its Adreno days).
     
    Heinrich04 likes this.
  14. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,420
    Likes Received:
    179
    Location:
    Chania
    Not that it really matters, but the Gigapixel I remember wasn't that much different from the ARM Mali. There has been more than one tile-based IMR architecture around ever since, Adrenos included.

    Here's a bit of the good old Gigapixel philosophy which survived into the early Tegra:

    https://forum.beyond3d.com/posts/1377355/

    https://forum.beyond3d.com/posts/1377394/

    https://forum.beyond3d.com/posts/1377362/

    Of course "chunkers" weren't much "hip" in the early Tegra days since the mantra was that for anything over DX7 tiling was questionable and for anything below DX9 unified shader cores useless. In any case I'm glad that even Imagination got rid of the filrate * scene complexity nonsense, since Gigapixel was also amongst those that calculated everything with a factor 3x or higher overdraw and that in 1998 and earlier. Besides that their technology got never licensed by anyone and always remained nothing more but a huge chain of wild exaggerations for vaporware.

    --------------------------------------------
    Andrew,

    Thank you for the clarifications.
     
    Razor1, pharma and BRiT like this.
  15. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    The last picture on this page:

    https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline

    seems to indicate some sort of tile boundaries for the allocation of SMs and warps in graphics rendering for Kepler and Maxwell, and there are tiles with multiple SMs and warps. I know the article is about the fixed-function portion of the pipeline, but this partitioning along tiles seems like it could be consistent with playing well with the ROPs.
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,125
    Likes Received:
    2,884
    Location:
    Well within 3d
    Partitioning of resources and memory would go to one definition of tiling that is very common, along a dimension of physical or spatial locality: that is, tiled memory formats, and tiling in terms of partitioning the hardware or linking units to specific areas in screen space.

    The use of the word tiling in this case would be differentiated by the measures the GPU takes to capture or create temporal locality in the stream of primitives going through it, by changing the order of issue or accumulating data on-chip for a specified window of primitives and their shaders.

    A more straightforward tiled GPU with caches and a long pipeline could capture some amount of locality even without special measures, but this is taking things further by massaging execution to get beyond the somewhat coincidental coalescing of accesses during the time data happens to be resident in the texture cache or in a ROP tile. There seem to be rather clear benefits to doing this, given how Nvidia's efficiency has improved since its introduction.
     
  17. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,496
    Likes Received:
    910
  18. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    582
    Likes Received:
    285
    I wonder how much of this has to do with SM 6.0 requirements :D
     
    Heinrich04 likes this.
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,125
    Likes Received:
    2,884
    Location:
    Well within 3d
    Is there something besides the shader/wavefront operations linked elsewhere? Otherwise, it's asking for things like wavefront ballot operations to feed back to the tiling stage that called the shader in the first place.

    There are API-level constructs like Vulkan's render passes that help tiled renderers, but those predate 6.0.
    AMD uses the context provided to help reduce pipeline bubbles, even though the driver is targeting an immediate mode renderer.
     
  20. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    582
    Likes Received:
    285
    I am not aware of anything that is public for SM 6.0, but I guess we can expect more things to be added before the final preview. Also, keep in mind that MS claimed the shader compiler will be unbound from Windows SDK releases. So anything could be possible.
     