Tile-based Rasterization in Nvidia GPUs

Discussion in 'Architecture and Products' started by dkanter, Aug 1, 2016.

  1. jra101

    Joined:
    Apr 6, 2016
    Messages:
    2
    Likes Received:
    3
    Heinrich04, Anarchist4000 and sebbbi like this.
  2. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    Why would features in a 2+ year old architecture be motivated by SM6.0?
     
  3. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    For that picture of the GPU, it's likely that the plastic shroud region uses a single material shader program and its operands; there are some patterns in that region. For example, there are pure black tiles that presumably each contain just one SM id, and this tile pattern repeats in a semi-regular way, presumably reusing the same program in that SM's cache. There also appears to be some regularity in the assignment of SM and warp ids to the geometry if you look at the connector pins on that model: some of the pins seem to use just one SM while the surrounding PCB part looks like it's handled by another SM, but other analogous regions look like they're handled entirely in one SM. Perhaps there's some kind of initial smart binning of programs and then an attempt to fill in remaining resources?

    As for DK's video, just a random thought: is it possible he's picking up some artifact of the memory color compression here? Maybe rasterizing a very noisy texture could make this clearer.
     
  4. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    Is this a ~2k triangle window in submission order, or has it been binned and tile bins contain up to 2k triangles?

    Because if it's the former that doesn't sound like it'd be that useful for normal game scenes.
     
  5. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    In submission order (and does not cross draw call boundaries). Not sure why you think that wouldn't be useful for normal game scenes - there's a lot of spatial coherence in meshes that have been sorted for vcache efficiency (i.e. all of them) and this sort of design captures the vast majority of it without any real downsides.
     
    homerdog likes this.
  6. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    This is especially true for lower LOD objects in the distance: fewer than 2k triangles plus high spatial coherence. This is also a common case where modern GPUs have serious bottlenecks (trouble filling the compute units with work). Big triangles close to the camera are already handled efficiently.
     
  7. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    582
    Likes Received:
    285
    All FL 12_0+ hardware will support SM 6.0. But I'm not saying those architecture features were built for SM 6.0; rather the opposite: SM 6.0 is made to take advantage of current architectures (at least FL 12_0+ GPUs).
    I also would like to see support for something equivalent of the AMD "Out-of-Order Rasterization". But this is a lil OT.
     
    Heinrich04 and RecessionCone like this.
  8. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    At launch DirectX 12 focused solely on rendering (ROV, conservative raster), CPU performance and resource model improvements. No improvements to compute shaders at all. Compute oriented people have been waiting for this update for a long time.

    To me it seems that SM 6.0 goal was to bring DirectX compute shaders to feature parity with OpenCL 2.0 and CUDA and add missing features that were already exposed on consoles and/or as OpenCL/Vulkan extensions. Soon we can (almost) fully utilize the Radeon 7970 feature set. It's been almost 5 years since it launched :D
     
  9. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    No doubt 2k triangles will have pixels that are more often than not next to each other, but isn't the main driver of tiling to capture overdraw? The caches will buffer the data from those triangles whether they're drawn in a tile order or not. I can see how tiling would increase utilization within the cache lines themselves and reduce associativity collisions but I wouldn't expect this to make a huge difference...

    I suppose there's inter-mesh overdraw too, but I figured most overdraw comes from triangles that are pretty far apart (unless there aren't many triangles in the scene to begin with). Or does quad-based shading result in internal overdraw?
     
  10. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    HiZ already handles the large scale "far away" overdraw problem. Modern GPUs reject failed HiZ tiles at a very high rate (at very low BW cost). As far as I understand, this new tiling optimization should improve memory read and write locality (better cache hit rates for ROP writes, Z-buffer and texture sampling) and improve ROP/lane utilization.
     
    Heinrich04 likes this.
  11. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    Do games still use depth pre-passes, or any kind of depth sorting? I've heard that both have fallen out of fashion but I haven't kept up to date at all.

    At any rate, HiZ doesn't do anything to save bandwidth for color RMW ops, and I doubt this technique would do much to change that, while conventional tiling would. Given the comments that rendering in this mode isn't even ROP-limited I wonder if alpha blended fragments are even being tiled.
     
  12. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Inclusive of features currently available from all IHVs or just wrapping up console features to facilitate porting without intrinsics? Curious if there are new capabilities or just the full console feature set being exposed.

    Deferred renderers would. Sorting would be up to the engine and developer and would depend on just how much optimization occurs. Rough sorting likely happens in most games alongside basic frustum culling.

    In the case of this tiling, bandwidth would be saved as the tile should be resident in cache until complete and written out. All RMW ops should hit the cache unless they're from a subsequent pass. This technique seems along the lines of the ESRAM on XB.
     
    Heinrich04 likes this.
  13. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Doom does a depth prepass (their SIGGRAPH 2016 presentation slides just became available). There are also renderers that use super thin full-ROP-rate g-buffers with only a triangle id (Intel research) or UV + tangent (RedLynx, Eidos).

    GPU-driven pipelines (GPU culling + submit just a few massive multidraw commands) are able to efficiently sort objects (and sub-objects) on GPU. Sub-object depth sort results in similar HiZ efficiency as depth prepass, without the need to render geometry twice. It is definitely worth doing, especially with SM 6.0 as it makes GPU sorting (radix sort = prefix sum) more efficient.

    For more traditional pipelines, depth sorting objects on the CPU is also a no-brainer. A fast CPU radix sort is able to sort all your objects in less than 1 millisecond (of CPU time), saving more than 1 ms of GPU time in the worst case. If you don't depth sort, your frame rate will fluctuate much more.
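    A minimal sketch of such a CPU-side depth radix sort (object layout and function names are hypothetical; this assumes non-negative view-space depths, so the IEEE-754 bit pattern already sorts in float order):

```python
import struct

def float_to_sortable_u32(f):
    """Map a non-negative float to a u32 that sorts in the same order."""
    (bits,) = struct.unpack("<I", struct.pack("<f", f))
    return bits  # for non-negative IEEE-754 floats, the raw bits sort correctly

def radix_sort_by_depth(objects, depth_of):
    """Stable LSD radix sort (four 8-bit passes) of objects by camera depth."""
    keyed = [(float_to_sortable_u32(depth_of(o)), o) for o in objects]
    for shift in (0, 8, 16, 24):
        buckets = [[] for _ in range(256)]
        for key, obj in keyed:
            buckets[(key >> shift) & 0xFF].append((key, obj))
        keyed = [pair for bucket in buckets for pair in bucket]
    return [obj for _, obj in keyed]

# Usage: sort draw records front to back by view-space depth.
draws = [{"name": "c", "z": 30.0}, {"name": "a", "z": 1.5}, {"name": "b", "z": 7.0}]
print([d["name"] for d in radix_sort_by_depth(draws, lambda d: d["z"])])
# → ['a', 'b', 'c']
```

    Four counting passes over the key bytes keep the cost linear in object count, which is why a frame's worth of draws sorts in well under a millisecond.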
    Particle systems are the heaviest color RMW (alpha blend) sources. A tiling limit of 2000 triangles = 1000 particles. Particles are obviously depth sorted (alpha blending back to front). Sorting makes spatially (3d) close particles also close in Z. So the particles in the same particle system (for example a single explosion plus its smoke) are likely found at the same time in the 2000 triangle tiling buffer. There might of course be several particle effects (at the same depth) that are rendered simultaneously. But it is likely that this small subslice of all particles fits in the 2000 triangle budget, meaning that the Nvidia tiling method should be highly efficient.

    The Nvidia method of course doesn't remove all RMW overdraw. However, if we assume that a single particle explosion has 10x+ overdraw (and is tiled at once), the tiling already reduces the render target memory reads and writes by 10x+, meaning that the rendering is no longer memory bound -> full compute and ROP unit utilization. The end result should in practice be almost as good as a TBDR.
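    The bandwidth arithmetic behind this claim can be made concrete with a back-of-envelope sketch (resolution, RGBA8 format, and the 100x/10x/1x overdraw factors are assumptions mirroring the discussion, not measured numbers):

```python
# Back-of-envelope render-target traffic for 10 stacked explosions,
# each with ~10x local overdraw, blending into an RGBA8 target.
pixels = 1920 * 1080
bytes_px = 4
explosions, local_overdraw = 10, 10

# IMR: every blended fragment does a read-modify-write against memory.
imr = explosions * local_overdraw * pixels * bytes_px * 2
# NV-style tiling: each explosion's ~10 layers blend inside the cache;
# memory sees roughly one read + one write per explosion.
nv = explosions * pixels * bytes_px * 2
# TBDR: the whole stack resolves on chip; roughly one read + one write total.
tbdr = pixels * bytes_px * 2

for name, b in [("IMR", imr), ("NV tiling", nv), ("TBDR", tbdr)]:
    print(f"{name:10s} {b / 2**30:.2f} GiB of RT traffic")
```

    The ratios come out 100:10:1, matching the naive approximation below; the absolute NV-tiling number is small enough that a discrete GPU's memory system is unlikely to be the bottleneck.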
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    With this kind of tiled rasterisation is there any need for NVidia to do any kind of hierarchical-Z?
     
  15. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    This kind of tiling doesn't replace HiZ. Objects close to the camera are rendered first and the background is rendered last. There are several thousand draw calls and up to several million triangles between the foreground (= potential big occluder) and the background. A 2000 triangle tiling buffer doesn't help this (common) situation at all. HiZ handles it perfectly.
     
    Alessio1989 and Heinrich04 like this.
  16. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    Okay, so you're saying that HiZ reject rate is close to perfect in current games, or they're using deferred renderers that have overdraw in their G-buffer pass but with low byte/fragment output.

    That's good to know. I wonder if that's the current standard on mobile as well (outside of IMG optimized stuff where it's less relevant)

    The idea that depth sorting particles also brings them close to each other spatially seems counterintuitive to me, but I'll take your word for it.

    When you say that the end result is as good as TBDR, do you mean that it's enough to keep current high end discrete GPUs with comparatively huge amounts of bandwidth from being bandwidth limited? Or do you mean that the raw render target and texture bandwidth demands are as low? If it's the latter this could be a big deal for mobile, because there the conventional (full scene) tiler method increases memory accesses for vertex data and adds complexity and edge cases to deal with that this wouldn't. Could we see Qualcomm and ARM move in this direction?
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    This tiled rasterisation presents an opportunity to fetch Z from memory at the start of tile processing and, at the end, write all of the data from the finished tile back to both the depth and colour buffers. There are likely to be other triangles that hit this tile later, so when that later pass (or more likely, one of many later passes) occurs, both depth and colour at full resolution are fetched first.

    So, there is no need to keep low resolution Z on die, when the count of memory transactions per tile pass is so low. Tiling integrates all render target data to the extent that no other on-die data structure is required. Hierarchical-Z is effectively redundant.

    That's my hypothesis.
     
  18. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Easy example: there are 100 explosions placed randomly in a one square kilometer area (camera view frustum). All particles in one explosion are inside a one meter radius bounding sphere. Now you sort them by distance to the camera depth plane. On average there's one explosion per 10 meters of depth range, no matter where you are looking. The particles of one explosion have Z values inside a 1 meter range. Thus it is highly probable that particles from the same explosion are close to each other in the Z sorted list, and spatially (XY screen) local particles (from the same explosion) likely fit in the 2000 triangle buffer.
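    This geometric argument can be checked numerically. A toy Monte Carlo sketch (the counts mirror the example above; particle count per explosion and the seed are hypothetical):

```python
import random

random.seed(7)

# 100 explosions scattered over a 1000 m depth range; each explosion's
# particles lie within 1 m of its centre, as in the example.
explosions = [random.uniform(0.0, 1000.0) for _ in range(100)]
particles = [(eid, z + random.uniform(-1.0, 1.0))
             for eid, z in enumerate(explosions)
             for _ in range(50)]

# Back-to-front depth sort, as a particle renderer would do.
particles.sort(key=lambda p: p[1], reverse=True)

# How spread out is each explosion in the sorted list? Span of list indices
# its particles occupy, vs the 50 it would occupy if perfectly contiguous.
span = {}
for idx, (eid, _) in enumerate(particles):
    lo, hi = span.get(eid, (idx, idx))
    span[eid] = (min(lo, idx), max(hi, idx))
avg_span = sum(hi - lo + 1 for lo, hi in span.values()) / len(span)
print(f"average index span per explosion: {avg_span:.0f} (50 = fully contiguous)")
```

    With explosions ~10 m apart and particle spreads of ~2 m, most explosions come out contiguous or nearly so in the sorted list, so each one fits comfortably inside a 1000-particle tiling window.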
    As good = the bottleneck is elsewhere = performance is not BW limited. Let's assume that we are rendering 10 back to back explosions with 10x overdraw each. The naive approximation (above) would result in 100x overdraw on an IMR, 10x on NV tiling and 1x on a TBDR. The TBDR's memory controller would be practically idling. Nvidia would perform one read and one write per 10 layers of particles. Which is in practice as good as TBDR, as BW should be nowhere near a bottleneck. Of course you'd burn a little bit of extra power.
    HiZ improves hidden (rejected) geometry fill rate by 16x to 64x (depending on tile size). You don't want to do precise per-pixel Z checks (or load per-pixel Z from memory) when not needed. Modern GPUs don't have fixed on-chip memory pools for HiZ; HiZ is cached similarly to depth data. One 128 byte cache line fits a pretty big HiZ tile. I would assume that the GPU fetches the HiZ cache line of the current tile and does a coarse check before fetching any per-pixel depth or color data of the tile.
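    The coarse check described here could be sketched as follows. This is a toy model, not Nvidia's actual scheme: tile size, layout, and function names are assumptions, and it assumes a LESS depth function so HiZ stores a per-tile zmax.

```python
# Toy hierarchical-Z: one coarse zmax per 8x8 tile, depth function LESS.
# A primitive whose conservative nearest depth over a tile is >= the tile's
# stored zmax cannot pass the per-pixel test anywhere in that tile, so the
# whole tile is rejected without touching per-pixel depth data.
TILE = 8

def make_hiz(depth, w, h):
    """Per-tile max of the per-pixel depth buffer (row-major, w*h floats)."""
    tw, th = w // TILE, h // TILE
    hiz = [0.0] * (tw * th)
    for ty in range(th):
        for tx in range(tw):
            hiz[ty * tw + tx] = max(
                depth[(ty * TILE + y) * w + tx * TILE + x]
                for y in range(TILE) for x in range(TILE))
    return hiz, tw

def coarse_reject(hiz, tw, tx, ty, tri_zmin):
    """True if the primitive can be skipped for this whole tile."""
    return tri_zmin >= hiz[ty * tw + tx]

w = h = 16
depth = [0.5] * (w * h)          # scene already holds an occluder at z = 0.5
hiz, tw = make_hiz(depth, w, h)
print(coarse_reject(hiz, tw, 0, 0, 0.7))  # behind the occluder -> True
print(coarse_reject(hiz, tw, 0, 0, 0.2))  # in front of it -> False
```

    Note the storage ratio: one float per 64 pixels is how a single 128 byte cache line ends up covering a large screen region, which is exactly why the coarse test is so cheap.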
     
    Silent_Buddha, milk and Heinrich04 like this.
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Where would that hierarchical-Z data in the cache line come from?

    There aren't definitive facts on Maxwell and Pascal's use of hierarchical Z are there?
     
  20. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
    HiZ data is mostly stored the same as other depth data, though the allocation might or might not be separate; this varies IIRC. That said, I don't really know anything about nvidia chips there, just amd (which indeed had an on-chip buffer for it ages ago). I don't actually know when the chip updates the min/max (*) depth values in the HiZ data, maybe just always when the (fine-grained) depth test succeeds.

    (*) Some earlier chips could only store either min or max values depending on the actual depth function, meaning that if you switched the depth function mid-frame you had to disable HiZ.
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.