Tile-based Rasterization in Nvidia GPUs

One very common definition of tiling is the partitioning of resources and memory along a dimension of physical or spatial locality: tiled memory formats, and tiling in the sense of partitioning the hardware or linking units to specific areas of screen space.

The use of the word tiling in this case is different: it refers to the measures the GPU takes to capture or create temporal locality in the stream of primitives going through it, by changing the order of issue or by accumulating data on-chip for a bounded window of primitives and their shaders.

A GPU with more straightforward tiling, caches, and a long pipeline could capture some amount of locality even without special measures, but this takes things further by massaging execution to get beyond the somewhat coincidental coalescing of accesses that happens while data is resident in the texture cache or in a ROP tile. There seem to be rather clear benefits to doing this, given how Nvidia's efficiency improved after its introduction.

For that picture of the GPU, it's likely that the plastic shroud region uses a single material shader program and its operands. There are some patterns in that region: for example, the pure black tiles, which presumably contain just one SM ID, appear in a semi-regular way, presumably reusing the same program in that SM's cache. There also appears to be some regularity in the assignment of SM and warp IDs to the geometry if you look at the connector pins on the model: some of the pins seem to use just one SM while the surrounding PCB area looks like it's handled by another SM, but other analogous regions look like they're handled entirely by one SM. Perhaps there's some kind of initial smart binning of programs and then an attempt to fill in remaining resources?

As for DK's video, just a random thought: is it possible he's picking up some artifact of the memory color compression here? Maybe rasterizing a very noisy texture could make this clearer.
 
Vertices/triangles are fully buffered (with all attributes) on-chip, up to about ~2k triangles (depending on the SKU and vertex output size) before a tile "pass" is run.

Is this a ~2k triangle window in submission order, or has it been binned and tile bins contain up to 2k triangles?

Because if it's the former, that doesn't sound like it'd be that useful for normal game scenes.
 
Is this a ~2k triangle window in submission order, or has it been binned and tile bins contain up to 2k triangles?
In submission order (and does not cross draw call boundaries). Not sure why you think that wouldn't be useful for normal game scenes - there's a lot of spatial coherence in meshes that have been sorted for vcache efficiency (i.e. all of them) and this sort of design captures the vast majority of it without any real downsides.
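To make that concrete, here's a rough sketch of how a submission-order window like this could behave. All names, the fixed 2048 limit and the flush points are illustrative guesses based on the description above, not the actual hardware:

```cpp
// Hypothetical sketch of a submission-order binning window: primitives are
// buffered in API order until a limit is hit (or the draw call ends), then a
// tile pass is run over the buffered window. Purely illustrative.
#include <cstddef>
#include <vector>

struct Triangle { float x[3], y[3], z[3]; /* plus whatever vertex attributes are output */ };

constexpr std::size_t kMaxBufferedTris = 2048;  // "~2k"; the real limit depends on SKU and vertex output size

class BinningWindow {
public:
    void submit(const Triangle& t) {
        buffered_.push_back(t);
        if (buffered_.size() >= kMaxBufferedTris)
            flush();                            // window full: run a tile pass now
    }
    void endDrawCall() { flush(); }             // the window does not cross draw call boundaries

private:
    void flush() {
        if (buffered_.empty()) return;
        // Walk the screen tile by tile and rasterize only the buffered
        // triangles overlapping each tile, so a tile's color/depth traffic
        // stays on-chip for the duration of the pass.
        rasterizeWindowTileByTile(buffered_);
        buffered_.clear();
    }
    static void rasterizeWindowTileByTile(const std::vector<Triangle>&) { /* stand-in for fixed-function HW */ }

    std::vector<Triangle> buffered_;
};
```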
 
In submission order (and does not cross draw call boundaries). Not sure why you think that wouldn't be useful for normal game scenes - there's a lot of spatial coherence in meshes that have been sorted for vcache efficiency (i.e. all of them) and this sort of design captures the vast majority of it without any real downsides.
This is especially true for lower-LOD objects in the distance: fewer than 2k triangles plus high spatial coherence. And this is a common case where modern GPUs have serious bottlenecks (trouble filling the compute units with work). Big triangles close to the camera are already handled efficiently.
 
Why would features in a 2+ year old architecture be motivated by SM6.0?
All FL 12_0+ hardware will support SM 6.0. But I'm not saying those architectural features were built for SM 6.0; rather the opposite: SM 6.0 is made to take advantage of current architectures (at least FL 12_0+ GPUs).
I also would like to see support for something equivalent to AMD's "Out-of-Order Rasterization". But this is a little OT.
 
Why would features in a 2+ year old architecture be motivated by SM6.0?
At launch, DirectX 12 focused solely on rendering (ROV, conservative raster), CPU performance and resource model improvements. No improvements to compute shaders at all. Compute-oriented people have been waiting for this update for a long time.

To me it seems that SM 6.0's goal was to bring DirectX compute shaders to feature parity with OpenCL 2.0 and CUDA, and to add missing features that were already exposed on consoles and/or as OpenCL/Vulkan extensions. Soon we can (almost) fully utilize the Radeon 7970 feature set. It's been almost 5 years since it launched :D
 
In submission order (and does not cross draw call boundaries). Not sure why you think that wouldn't be useful for normal game scenes - there's a lot of spatial coherence in meshes that have been sorted for vcache efficiency (i.e. all of them) and this sort of design captures the vast majority of it without any real downsides.

No doubt 2k triangles will have pixels that are more often than not next to each other, but isn't the main driver of tiling to capture overdraw? The caches will buffer the data from those triangles whether they're drawn in a tile order or not. I can see how tiling would increase utilization within the cache lines themselves and reduce associativity collisions but I wouldn't expect this to make a huge difference...

I suppose there's inter-mesh overdraw too, but I figured most overdraw comes from triangles that are pretty far apart (unless there aren't many triangles in the scene to begin with). Or does quad-based shading result in internal overdraw?
 
No doubt 2k triangles will have pixels that are more often than not next to each other, but isn't the main driver of tiling to capture overdraw? The caches will buffer the data from those triangles whether they're drawn in a tile order or not. I can see how tiling would increase utilization within the cache lines themselves and reduce associativity collisions but I wouldn't expect this to make a huge difference...

I suppose there's inter-mesh overdraw too, but I figured most overdraw comes from triangles that are pretty far apart (unless there aren't many triangles in the scene to begin with). Or does quad-based shading result in internal overdraw?
HiZ is already handling the large scale "far away" overdraw problem. Modern GPUs reject failed HiZ tiles at a very high rate (and at very low BW cost). As far as I understand, this new tiling optimization should improve memory read and write locality (better cache hit rates for ROP writes, Z-buffer and texture sampling) and give better ROP/lane utilization.
 
HiZ is already handling the large scale "far away" overdraw problem. Modern GPUs reject failed HiZ tiles at a very high rate (and at very low BW cost). As far as I understand, this new tiling optimization should improve memory read and write locality (better cache hit rates for ROP writes, Z-buffer and texture sampling) and give better ROP/lane utilization.

Do games still use depth pre-passes, or any kind of depth sorting? I've heard that both have fallen out of fashion but I haven't kept up to date at all.

At any rate, HiZ doesn't do anything to save bandwidth for color RMW ops, and I doubt this technique would do much to change that, while conventional tiling would. Given the comments that rendering in this mode isn't even ROP-limited I wonder if alpha blended fragments are even being tiled.
 
To me it seems that SM 6.0 goal was to bring DirectX compute shaders to feature parity with OpenCL 2.0 and CUDA and add missing features that were already exposed on consoles and/or as OpenCL/Vulkan extensions. Soon we can (almost) fully utilize the Radeon 7970 feature set. It's been almost 5 years since it launched
Inclusive of features currently available from all IHVs or just wrapping up console features to facilitate porting without intrinsics? Curious if there are new capabilities or just the full console feature set being exposed.

Do games still use depth pre-passes, or any kind of depth sorting? I've heard that both have fallen out of fashion but I haven't kept up to date at all.
Deferred renderers would. Sorting would be up to the engine and developer and would depend on just how much optimization occurs. Rough sorting likely happens in most games alongside basic frustum culling.

At any rate, HiZ doesn't do anything to save bandwidth for color RMW ops, and I doubt this technique would do much to change that, while conventional tiling would. Given the comments that rendering in this mode isn't even ROP-limited I wonder if alpha blended fragments are even being tiled.
In the case of this tiling, bandwidth would be saved as the tile should be resident in cache until complete and then written out. All RMW ops should hit the cache unless they're from a subsequent pass. This technique seems to be along the lines of the ESRAM on XB.
 
Do games still use depth pre-passes, or any kind of depth sorting? I've heard that both have fallen out of fashion but I haven't kept up to date at all.
Doom does a depth prepass (their SIGGRAPH 2016 presentation slides just became available). There are also renderers that use super-thin, full-ROP-rate g-buffers with only a triangle id (Intel research) or UV + tangent (RedLynx, Eidos).

GPU-driven pipelines (GPU culling + submit just a few massive multidraw commands) are able to efficiently sort objects (and sub-objects) on GPU. Sub-object depth sort results in similar HiZ efficiency as depth prepass, without the need to render geometry twice. It is definitely worth doing, especially with SM 6.0 as it makes GPU sorting (radix sort = prefix sum) more efficient.

For more traditional pipelines, depth sorting objects on the CPU is also a no-brainer. A fast CPU radix sort is able to sort all your objects in less than 1 millisecond (of CPU time), saving more than 1 ms of GPU time in the worst case. If you don't depth sort, your frame rate will fluctuate much more.
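For illustration, a minimal CPU-side sketch of that kind of depth sort: quantize each object's view-space depth to a 16-bit key and radix sort front to back. The DrawItem struct and the key width are just assumptions for the example, not any particular engine's code:

```cpp
#include <array>
#include <cstdint>
#include <vector>

struct DrawItem { uint16_t depthKey; uint32_t objectIndex; };  // quantized view-space depth + object id

// Stable 16-bit LSD radix sort (two 8-bit counting-sort passes), O(n).
static void radixSortByDepth(std::vector<DrawItem>& items) {
    std::vector<DrawItem> tmp(items.size());
    for (int shift = 0; shift < 16; shift += 8) {
        std::array<uint32_t, 257> count{};                        // histogram of one key byte
        for (const DrawItem& it : items)
            ++count[((it.depthKey >> shift) & 0xFF) + 1];
        for (int i = 1; i < 257; ++i) count[i] += count[i - 1];   // prefix sum -> start offsets
        for (const DrawItem& it : items)
            tmp[count[(it.depthKey >> shift) & 0xFF]++] = it;     // stable scatter
        items.swap(tmp);
    }
}

// Usage: fill 'items' each frame with (quantized depth, object index) pairs,
// call radixSortByDepth, then submit draw calls in the sorted order so near
// objects populate HiZ before anything they might occlude is rasterized.
```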
At any rate, HiZ doesn't do anything to save bandwidth for color RMW ops, and I doubt this technique would do much to change that, while conventional tiling would. Given the comments that rendering in this mode isn't even ROP-limited I wonder if alpha blended fragments are even being tiled.
Particle systems are the heaviest color RMW (alpha blend) sources. A tiling limit of 2000 triangles = 1000 particles. Particles are obviously depth sorted (alpha blending back to front). Sorting makes spatially (3D) close particles also close in Z. So the particles in the same particle system (for example a single explosion plus its smoke) are likely found at the same time in the 2000 triangle tiling buffer. There might of course be several particle effects (at the same depth) that are rendered simultaneously. But it is likely that this small subslice of all particles fits into the 2000 triangle budget, meaning that the Nvidia tiling method should be highly efficient.

The Nvidia method of course doesn't remove all RMW overdraw. However, if we assume that a single particle explosion has 10x+ overdraw (and is tiled at once), the tiling already reduces the render target memory reads and writes by 10x+, meaning that the rendering is no longer memory bound -> full compute and ROP unit utilization. The end result should in practice be almost as good as a TBDR.
 
With this kind of tiled rasterisation is there any need for NVidia to do any kind of hierarchical-Z?
 
With this kind of tiled rasterisation is there any need for NVidia to do any kind of hierarchical-Z?
This kind of tiling doesn't replace HiZ. Objects close to the camera are rendered first and the background is rendered last. There are several thousand draw calls and up to several million triangles between the foreground (= potential big occluder) and the background. A 2000 triangle tiling buffer doesn't help this (common) situation at all. HiZ handles this perfectly.
 
Doom does a depth prepass (their SIGGRAPH 2016 presentation slides just became available). There are also renderers that use super-thin, full-ROP-rate g-buffers with only a triangle id (Intel research) or UV + tangent (RedLynx, Eidos).

GPU-driven pipelines (GPU culling + submit just a few massive multidraw commands) are able to efficiently sort objects (and sub-objects) on GPU. Sub-object depth sort results in similar HiZ efficiency as depth prepass, without the need to render geometry twice. It is definitely worth doing, especially with SM 6.0 as it makes GPU sorting (radix sort = prefix sum) more efficient.

For more traditional pipelines, depth sorting objects on the CPU is also a no-brainer. A fast CPU radix sort is able to sort all your objects in less than 1 millisecond (of CPU time), saving more than 1 ms of GPU time in the worst case. If you don't depth sort, your frame rate will fluctuate much more.

Okay, so you're saying that the HiZ reject rate is close to perfect in current games, or they're using deferred renderers that have overdraw in their G-buffer pass but with low byte/fragment output.

That's good to know. I wonder if that's the current standard on mobile as well (outside of IMG-optimized stuff where it's less relevant).

Particle systems are the heaviest color RMW (alpha blend) sources. A tiling limit of 2000 triangles = 1000 particles. Particles are obviously depth sorted (alpha blending back to front). Sorting makes spatially (3D) close particles also close in Z. So the particles in the same particle system (for example a single explosion plus its smoke) are likely found at the same time in the 2000 triangle tiling buffer. There might of course be several particle effects (at the same depth) that are rendered simultaneously. But it is likely that this small subslice of all particles fits into the 2000 triangle budget, meaning that the Nvidia tiling method should be highly efficient.

The Nvidia method of course doesn't remove all RMW overdraw. However, if we assume that a single particle explosion has 10x+ overdraw (and is tiled at once), the tiling already reduces the render target memory reads and writes by 10x+, meaning that the rendering is no longer memory bound -> full compute and ROP unit utilization. The end result should in practice be almost as good as a TBDR.

The idea that depth sorting particles also brings them close to each other spatially seems unintuitive to me, but I'll take your word for it.

When you say that the end result is as good as TBDR, do you mean that it's enough to keep current high end discrete GPUs with comparatively huge amounts of bandwidth from being bandwidth limited? Or do you mean that the raw render target and texture bandwidth demands are as low? If it's the latter this could be a big deal for mobile, because there the conventional (full scene) tiler method increases memory accesses for vertex data and adds complexity and edge cases to deal with that this wouldn't. Could we see Qualcomm and ARM move in this direction?
 
This kind of tiling doesn't replace HiZ. Objects close to the camera are rendered first and the background is rendered last. There are several thousand draw calls and up to several million triangles between the foreground (= potential big occluder) and the background. A 2000 triangle tiling buffer doesn't help this (common) situation at all. HiZ handles this perfectly.
This tiled rasterisation presents an opportunity to fetch Z from memory at the start of tile processing and, at the end, write all of the data from the finished tile back to both the depth and colour buffers. There are likely to be other triangles that hit this tile later, so when that later pass (or, more likely, one of many later passes) occurs, both depth and colour are fetched at full resolution first.

So, there is no need to keep low resolution Z on die, when the count of memory transactions per tile pass is so low. Tiling integrates all render target data to the extent that no other on-die data structure is required. Hierarchical-Z is effectively redundant.

That's my hypothesis.
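In pseudocode, the hypothesis would look something like this; everything here is speculative and the types, tile size and function names are placeholders, not a description of any real GPU:

```cpp
#include <cstdint>
#include <vector>

struct Triangle { float x[3], y[3], z[3]; };
struct Tile { float depth[16][16]; uint32_t color[16][16]; };   // on-chip copy of one screen tile

// Stand-ins for fixed-function hardware.
static Tile loadDepthAndColor(int /*tileX*/, int /*tileY*/) { return Tile{}; }  // one full-res read per pass
static void rasterizeIntoTile(const Triangle&, Tile&) {}        // per-pixel Z test + blend against the on-chip copy
static void storeDepthAndColor(const Tile&, int, int) {}        // one write-back per pass

static void tilePass(const std::vector<Triangle>& window, int tileX, int tileY) {
    Tile tile = loadDepthAndColor(tileX, tileY);    // fetch full-resolution depth + colour once
    for (const Triangle& tri : window)
        rasterizeIntoTile(tri, tile);               // no separate low-resolution Z would be consulted
    storeDepthAndColor(tile, tileX, tileY);         // write depth + colour back once at the end
}
```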
 
The idea that depth sorting particles also brings them close to each other spatially seems unintuitive to me, but I'll take your word for it.
Easy example: there are 100 explosions placed randomly in a one square kilometer area (the camera view frustum). All particles in one explosion are inside a one meter radius bounding sphere. Now you sort them by distance to the camera depth plane. On average there's one explosion per 10 meters of depth range, no matter where you are looking. The particles of one explosion have Z values within a 1 meter range. Thus it is highly probable that particles from the same explosion are close to each other in the Z-sorted list, and so spatially (XY screen) local particles (from the same explosion) likely fit into the 2000 triangle buffer.
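A quick way to convince yourself is to simulate it. The sketch below places 100 explosions randomly over a 1000 m depth range with 20 particles each (the particle count, spread and seed are arbitrary choices for the example), sorts every particle by Z, and counts how often neighbours in the sorted list come from the same explosion:

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> explosionDepth(0.f, 1000.f);  // explosion centres along Z
    std::uniform_real_distribution<float> localOffset(-1.f, 1.f);       // particle offset within ~1 m radius

    struct Particle { float z; int explosion; };
    std::vector<Particle> particles;
    for (int e = 0; e < 100; ++e) {
        const float centreZ = explosionDepth(rng);
        for (int i = 0; i < 20; ++i)
            particles.push_back({centreZ + localOffset(rng), e});
    }

    std::sort(particles.begin(), particles.end(),
              [](const Particle& a, const Particle& b) { return a.z < b.z; });

    int sameAsPrev = 0;
    for (std::size_t i = 1; i < particles.size(); ++i)
        if (particles[i].explosion == particles[i - 1].explosion) ++sameAsPrev;

    // With ~10 m average spacing between explosions and only ~1 m of spread per
    // explosion, the vast majority of sorted neighbours share an explosion.
    std::printf("%d of %zu sorted neighbours come from the same explosion\n",
                sameAsPrev, particles.size() - 1);
    return 0;
}
```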
When you say that the end result is as good as TBDR, do you mean that it's enough to keep current high end discrete GPUs with comparatively huge amounts of bandwidth from being bandwidth limited? Or do you mean that the raw render target and texture bandwidth demands are as low? If it's the latter this could be a big deal for mobile, because there the conventional (full scene) tiler method increases memory accesses for vertex data and adds complexity and edge cases to deal with that this wouldn't. Could we see Qualcomm and ARM move in this direction?
As good = the bottleneck is elsewhere = performance is not BW limited. Let's assume that we are rendering 10 back-to-back explosions with 10x overdraw each. The naive approximation (above) would result in 100x overdraw on an IMR, 10x with NV tiling and 1x on a TBDR. The TBDR's memory controller would be practically idling. Nvidia would perform one read and one write per ~10 layers of particles, which is in practice as good as a TBDR, since BW should be nowhere near a bottleneck. Of course you'd burn a little bit of extra power.
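Roughly in numbers, under the (made-up) assumption of a 4-byte colour read plus a 4-byte write per blended layer:

```cpp
#include <cstdio>

int main() {
    const double layers      = 10.0 * 10.0;  // 10 back-to-back explosions x 10x overdraw = 100 covering layers
    const double bytesPerRMW = 4.0 + 4.0;    // one 4-byte colour read + one 4-byte write per blended pixel

    const double imr  = layers * bytesPerRMW;           // IMR: every layer is blended through DRAM
    const double nv   = (layers / 10.0) * bytesPerRMW;  // NV-style tiling: ~one DRAM round trip per explosion
    const double tbdr = bytesPerRMW;                    // TBDR: roughly one read+write per pixel for the frame

    std::printf("per-pixel RT traffic: IMR %.0f B, NV tiling %.0f B, TBDR %.0f B\n", imr, nv, tbdr);
    return 0;
}
```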
So, there is no need to keep low resolution Z on die, when the count of memory transactions per tile pass is so low. Tiling integrates all render target data to the extent that no other on-die data structure is required. Hierarchical-Z is effectively redundant.
HiZ improves hidden (rejected) geometry fill rate by 16x to 64x (depending on tile size). You don't want to do precise per-pixel Z checks (or load per-pixel Z from memory) when not needed. Modern GPUs don't have fixed on-chip memory pools for HiZ; HiZ is cached similarly to depth data. One 128 byte cache line fits a pretty big HiZ tile. I would assume that the GPU fetches the HiZ cache line of the current tile and does a coarse check before fetching any per-pixel depth or color data of the tile.
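Conceptually something like the check below, where the per-tile min/max layout and the entry size are guesses on my part rather than how any specific GPU stores it:

```cpp
#include <cstdint>

// One HiZ entry covers a whole screen tile with a conservative depth range;
// at e.g. 8 bytes per entry, a 128-byte cache line covers 16 tiles' worth.
struct HiZEntry { float minZ, maxZ; };

// Assuming a LESS depth test and "smaller Z = closer": if even the nearest
// point of the incoming primitive is at or behind the farthest depth already
// stored in the tile, no pixel in the tile can pass, so the whole tile can be
// rejected without touching per-pixel depth or colour.
inline bool coarseRejectTile(const HiZEntry& hiz, float primMinZ) {
    return primMinZ >= hiz.maxZ;
}
```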
 
HiZ improves hidden (rejected) geometry fill rate by 16x to 64x (depending on tile size). You don't want to do precise per-pixel Z checks (or load per-pixel Z from memory) when not needed. Modern GPUs don't have fixed on-chip memory pools for HiZ; HiZ is cached similarly to depth data. One 128 byte cache line fits a pretty big HiZ tile. I would assume that the GPU fetches the HiZ cache line of the current tile and does a coarse check before fetching any per-pixel depth or color data of the tile.
Where would that hierarchical-Z data in the cache line come from?

There aren't definitive facts on Maxwell and Pascal's use of hierarchical Z are there?
 
Where would that hierarchical-Z data in the cache line come from?

There aren't definitive facts on Maxwell and Pascal's use of hierarchical Z are there?
HiZ data is mostly stored the same way as other depth data, although the allocation might or might not be separate; this varies IIRC. That said, I don't really know anything about Nvidia chips here, just AMD (which indeed had an on-chip HiZ buffer ages ago). I don't actually know when the chip updates the min/max (*) depth values in the HiZ data; maybe just whenever the (fine-grained) depth test succeeds.

(*) Some earlier chips could only store either min or max values depending on the actual depth function, meaning that if you switched the depth function mid-frame you had to disable HiZ.
 