Tile-based Rasterization in Nvidia GPUs

dkanter

http://www.realworldtech.com/tile-based-rasterization-nvidia-gpus/

Starting with the Maxwell GM20x architecture, Nvidia high-performance GPUs have borrowed techniques from low-power mobile graphics architectures. Specifically, Maxwell and Pascal use tile-based immediate-mode rasterizers that buffer pixel output, instead of conventional full-screen immediate-mode rasterizers. Using simple DirectX shaders, we demonstrate the tile-based rasterization in Nvidia’s Maxwell and Pascal GPUs and contrast this behavior to the immediate-mode rasterizer used by AMD.
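For reference, here is a minimal sketch of the kind of DirectX pixel shader such a test can use (hypothetical names and color mapping, not the article's actual shader): every shaded pixel atomically increments a global counter and colors itself by the order in which it was shaded, which makes the GPU's rasterization order visible on screen.

Code:
// Bound as a UAV next to the render target; holds a single uint at byte offset 0.
RWByteAddressBuffer drawCounter : register(u1);

float4 VisualizeOrder(float4 pos : SV_Position) : SV_Target
{
    uint order;
    drawCounter.InterlockedAdd(0, 1, order);    // side effect: global shading order

    // Map the order to a repeating gray ramp so the pattern stays visible.
    float shade = (order % 100000) / 100000.0;
    return float4(shade, shade, shade, 1.0);
}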
 
Is there a transcript of the video available?
That wouldn't be very useful, as it's basically commentary on what's shown on screen ^^
Nvidia is using some kind of tiling in its newest GPUs, which pretty much explains the gain in efficiency (power & occupancy).
 
Great video but (we thought) we knew this already :)

Hah! :p

GM107 doesn't show any of the performance characteristics of a traditional TBR with off-chip binning, so it's certainly not quite like Gigapixel, but it does show some very unusual characteristics that strongly imply they're doing on-chip binning for a relatively small number of triangles. I'm skeptical the 2MB L2 cache would make sense without that architecture (see http://forum.beyond3d.com/showthread.php?p=1856670#post1856670 and patents linked above my post).
 
I would assume that the tile size matches the ROP cache size. However, Nvidia hardware doesn't have dedicated ROP caches, so I'd assume the tile buffer resides in the L2 cache (where they usually keep the ROP outputs). Did you pixel count the tile sizes? My guess would be somewhere between 32x32 and 128x128, as that's close to the footprint of traditional ROP caches.

Some years ago I did ROP cache experiments with AMD GCN (7970) in order to optimize particle rendering. GCN has dedicated ROP caches (16 KB color, 4 KB depth). In my experiment I split the rendering to 64x64 tiles (= 16 KB). This resulted in huge memory bandwidth savings (and over 100% performance increase), especially when the overdraw was large (lots of full screen alpha blended particles close to the camera). You can certainly get big bandwidth advantages also on AMD hardware, as long as you sort your workload (by screen locality) before submitting it.
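For illustration, a rough sketch of one way to sort such a particle workload by screen locality before submission (all names are hypothetical, not the original experiment's code, and it assumes row-vector matrix math): a compute shader writes a per-particle 64x64-tile index, and the particles are then sorted by that key before drawing.

Code:
StructuredBuffer<float4>  particlePosRadius : register(t0); // xyz = world position, w = radius
RWStructuredBuffer<uint>  sortKeys          : register(u0); // one sort key per particle

cbuffer Camera : register(b0)
{
    float4x4 viewProj;
    float2   screenSize; // in pixels
};

[numthreads(64, 1, 1)]
void BuildTileSortKeys(uint3 id : SV_DispatchThreadID)
{
    // Project the particle center to pixel coordinates.
    float4 clip = mul(float4(particlePosRadius[id.x].xyz, 1.0), viewProj);
    float2 ndc  = clip.xy / clip.w;
    float2 pix  = (ndc * float2(0.5, -0.5) + 0.5) * screenSize;
    pix = clamp(pix, 0.0, screenSize - 1.0);

    // 64x64 pixel tiles (~16 KB of 32bpp color, matching GCN's color ROP cache).
    uint2 tile   = (uint2)(pix / 64.0);
    uint  tilesX = (uint)ceil(screenSize.x / 64.0);
    sortKeys[id.x] = tile.y * tilesX + tile.x;  // sort particles by this key, then draw
}

A real particle renderer would also need to keep a back-to-front order within each tile for correct alpha blending; the sketch only shows the screen-locality part.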

It's hard to draw 100% accurate conclusions from the results. This doesn't yet prove whether Nvidia is just buffering some work + reordering on the fly to reach a better ROP cache hit ratio, or whether they actually do hidden surface removal as well (saving pixel shader invocations in addition to bandwidth). This particular test shader doesn't allow the GPU to perform any hidden surface removal, since it increases an atomic counter (it has a side effect).

To test HSR, you'd have to enable z-buffering (or stencil) and use the [earlydepthstencil] tag in the pixel shader. This tag allows the GPU to skip shading a pixel even when the shader has side effects (the DX documentation is incorrect about this). Submit triangles in back-to-front order to ensure that early depth doesn't cull anything with immediate-mode rendering. I would be interested to see whether this results in zero overdraw on Maxwell/Kepler (in this simple test with some overlapping triangles, and also with higher triangle counts).
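A hedged sketch of what such a test shader could look like (hypothetical names, not the poster's actual code): depth testing is enabled, triangles are submitted back-to-front, and [earlydepthstencil] lets the GPU reject pixels before shading despite the side effect. If the final counter equals the number of visible pixels, the GPU removed hidden surfaces; if it grows with overdraw, it did not.

Code:
RWByteAddressBuffer shadedPixels : register(u1); // single uint counter at byte offset 0

[earlydepthstencil] // force the depth/stencil test before the shader runs, despite the UAV write
float4 CountShadedPixels(float4 pos : SV_Position) : SV_Target
{
    uint unused;
    shadedPixels.InterlockedAdd(0, 1, unused);   // counts actual pixel shader invocations
    return float4(1.0, 0.0, 0.0, 1.0);
}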

It would also be interesting to know how many (vertex output) attributes fit in the buffer.

The new (Nvidia and Oculus) multiview VR extensions would definitely benefit from separating the SV_Position part of the vertex shader into its own shader. This would also greatly benefit tiled rendering (do tile binning first, execute the attribute shader later). I wouldn't be surprised if Nvidia already did something like this in Maxwell or Pascal, as both GPUs introduced lots of new multiview VR extensions.

I just wish Nvidia would be as open as AMD regarding their GPU architecture :)
 
This doesn't yet prove whether Nvidia is just buffering some work + reordering on the fly to reach a better ROP cache hit ratio, or whether they actually do hidden surface removal as well (saving pixel shader invocations in addition to bandwidth). This particular test shader doesn't allow the GPU to perform any hidden surface removal, since it increases an atomic counter (it has a side effect).

Wouldn't they have to go out of their way to not save pixel shader invocations with this approach? Seems the most straightforward thing to do is submit finished tiles to the pixel shader.
 
Wouldn't they have to go out of their way to not save pixel shader invocations with this approach? Seems the most straightforward thing to do is submit finished tiles to the pixel shader.
The most straightforward thing is just to rasterize the tile's triangles to the ROP cache (with no sorting or HSR inside the tile). This already gives you all the bandwidth gains (as only the ROP cache is touched, main memory is not).

The test application didn't use depth buffering and had a side effect (which was clearly handled properly). In this test case, the GPU clearly executed the pixel shader multiple times for each pixel, not just once for the last covering triangle. Otherwise the atomic counter would have increased only once per pixel (not once per overdrawn pixel), and that was clearly not happening: the percentage slider worked fine. Thus the GPU executed the pixel shader multiple times per pixel as instructed, so there was no HSR. Side effects still need to be handled properly, and this test case proves that they work just fine. The rendering result was legit; only the ordering was different compared to a pure immediate-mode renderer.

Tiled HSR needs some additional on-chip memory, as you first need to rasterize all of the tile's triangles to the tile buffer to determine per-pixel visibility. A 16-bit triangle id per pixel is enough (tile sizes up to 256x256 can be supported). The GPU can then simply fetch + interpolate the vertex attributes from on-chip memory by indexing it with the triangle id. A custom (software) tiled renderer can do the same, but a hardware solution can efficiently cache the vertex attribute calculations, and the 16 bpp tile buffer stays on-chip during the whole process.
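As a rough illustration of the software variant described above (all names hypothetical; this says nothing about what Nvidia's hardware actually does), the resolve step could look roughly like this, assuming a binning/depth pass has already filled a 16-bit triangle-id buffer for one 256x256 tile and copied that tile's screen-space vertices into a buffer:

Code:
struct TileVertex { float2 pos; float4 color; };        // tile-local screen-space position + one attribute

StructuredBuffer<TileVertex> tileVerts : register(t0);  // 3 vertices per binned triangle
Buffer<uint>                 tileIds   : register(t1);  // R16_UINT: triangle id per pixel, 0xFFFF = empty
RWTexture2D<float4>          tileOut   : register(u0);  // tile-sized output surface

[numthreads(8, 8, 1)]  // dispatched over one 256x256 tile, tile-local coordinates
void ResolveTile(uint3 px : SV_DispatchThreadID)
{
    uint id = tileIds[px.y * 256 + px.x];
    if (id == 0xFFFF)
        return;                                         // no visible triangle at this pixel

    TileVertex v0 = tileVerts[id * 3 + 0];
    TileVertex v1 = tileVerts[id * 3 + 1];
    TileVertex v2 = tileVerts[id * 3 + 2];

    // Only the triangle id was stored per pixel, so recompute barycentrics here.
    float2 p   = float2(px.xy) + 0.5;
    float2 e0  = v1.pos - v0.pos, e1 = v2.pos - v0.pos, ep = p - v0.pos;
    float  det = e0.x * e1.y - e0.y * e1.x;
    float  b1  = (ep.x * e1.y - ep.y * e1.x) / det;
    float  b2  = (e0.x * ep.y - e0.y * ep.x) / det;
    float  b0  = 1.0 - b1 - b2;

    tileOut[px.xy] = b0 * v0.color + b1 * v1.color + b2 * v2.color;
}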

Alpha blending and side effects, however, need special care, as you have to handle overdraw: a single triangle id per pixel is not enough. If Nvidia has tiled HSR, they still must be able to handle this case (possibly by disabling the HSR and having an alternative path).
 
Really interesting; hoping some more information comes out soon about how they have implemented it.
 
Obvious question(s):
Who was the first to do tiling, do they hold the patent, and has it expired?
 
Did you pixel count the tile sizes? My guess would be somewhere between 32x32 and 128x128, as that's close to the footprint of traditional ROP caches.
It depends on the SKU and framebuffer format/MRTs of course, but they get up to ~512^2 in size for a single 32bpp, single-sample render target (512x512 at 32 bpp is 1 MB). That's almost certainly related to the addition of the larger L2$ in Maxwell. There's definitely some weirdness in the 970 that David tested, though, which is almost certainly related to there being some disabled clusters. On "fully enabled" parts you don't see any of that weird hashed run-ahead of multiple tiles; it's all very balanced and it goes from one tile to the next.

It's hard to draw 100% accurate conclusions from the results.
I'm pretty sure this was mainly meant to get the conversation going, since NVIDIA still denies anything is even going on ;) It does actually get much more complicated when you start looking at non-full-screen triangles. It's definitely not just simple ROP cache stuff. In fact, as various tech sites observed, if you stay within a tile Maxwell can actually exceed its theoretical ROP rate! Thus it's likely they aren't even using the ROPs while they are able to do things "in tile", until they have to dump the tile or similar.

There's no hidden surface removal (at least that I've ever seen in any test), this is all basically just rescheduling to capture coherence.

Vertices/triangles are fully buffered (with all attributes) on-chip, up to about ~2k triangles (depending on the SKU and vertex output size) before a tile "pass" is run. Again this gets a lot more complicated when not considering full screen triangles but I think keeping the original article high level makes sense.

The new (Nvidia and Oculus) multiview VR extensions would definitely benefit from separating the SV_Position part of the vertex shader into its own shader. This would also greatly benefit tiled rendering (do tile binning first, execute the attribute shader later). I wouldn't be surprised if Nvidia already did something like this in Maxwell or Pascal, as both GPUs introduced lots of new multiview VR extensions.
There's no indication they are doing any position-only shading in Maxwell, but I agree that this is an obvious next step, and I'm guessing that if desktop/mobile architectures do converge at some point, they will end up in a middle ground with something like position-only shading running ahead and TBIMR/DR (depending on state) following.

I just wish Nvidia would be as open as AMD regarding their GPU architecture :)
Yep, no kidding, which is why I think it's good to at least get some of the info out there so that others can investigate, and maybe NVIDIA can stop denying anything is happening and be a bit more open about legitimately cool tech :)
 
Which tiling? Even if Nvidia did TBDR they should have the old IP through Gigapixel-3dfx heritage.

Tiling appears to be the mechanism of breaking the screen up into smaller regions for the purpose of more efficient processing. Although it is often referred to in association with deferred rendering, in particular when talking about IMG's IP, the article indicates that Nvidia is using tile-based immediate rendering. I am asking whether tiling itself is protected by a patent and, if so, who holds it and has it expired.

Did Gigapixel-3dfx own the patent rights ?

I suspect that if there is a patent on it, it's been around long enough that the protection it provides may have expired.
 
The concept of tiling itself is widespread. Even specific elements of the graphics pipeline can be tiled, which may well be the case for the majority of vendors when it comes to items like rasterizers, render back-ends, and the mapping of address space to physical memory controllers.
More specific methods (deferred, immediate, hybrid) can have patents, and those are held by a wide swath of graphics vendors--even AMD (with or without any possible holdover from its Adreno days).
 
Which tiling? Even if Nvidia did TBDR they should have the old IP through Gigapixel-3dfx heritage.

Not that it really matters, but the Gigapixel I remember wasn't that much different from the ARM Mali. There has been more than one tile-based IMR architecture around ever since, Adrenos included.

Here's a bit of the good old Gigapixel philosophy which survived into the early Tegra:

https://forum.beyond3d.com/posts/1377355/

Tegra2 should get in the neighborhood of 1.2Gpix/sec with 50% z-culling, 480Mpix/sec drawn/shaded. You also get 5x CSAA for "free" (~10% hit). The Wii at most has to fill 640x480p @ 60hz, arguably handheld displays are higher resolution and the same framerate. To top it off you've got arbitrary length floating-point shaders and all kinds of goodies (like affine-transformed point sprites) on Tegra, so it should be able to beat the Wii on all fronts.

https://forum.beyond3d.com/posts/1377394/

I can see where SGX would cause you to jump through a few extra hoops (I'm very familiar with chunkers)

Tegra isn't a chunker, and in any case there are very few use cases where you are constrained by chained data dependencies (i.e. you can usually read buffer N after sending commands for N+1 so at least the pipeline doesn't stall out). There are Tegra extensions to render directly to a mappable buffer, so apart from some page-table munging you get copy-free access to rendering results, and I seem to recall they support async readbacks via one of the ARB extensions though I'd have to research that a bit more. The biggest win is that with GLES2.0 you can usually write shaders that handle everything in the pipeline once the initial data has been sent, freeing the CPU to run game logic etc.

https://forum.beyond3d.com/posts/1377362/

I'm guessing the ~3x perf comes from an improved memory infrastructure.

I was assuming 240Mhz/8 Z/clock, 2 pix/clock, 50/50 mix. I've got one sitting on the desk here, I'm working on getting something interesting running on it.

Of course "chunkers" weren't much "hip" in the early Tegra days since the mantra was that for anything over DX7 tiling was questionable and for anything below DX9 unified shader cores useless. In any case I'm glad that even Imagination got rid of the filrate * scene complexity nonsense, since Gigapixel was also amongst those that calculated everything with a factor 3x or higher overdraw and that in 1998 and earlier. Besides that their technology got never licensed by anyone and always remained nothing more but a huge chain of wild exaggerations for vaporware.

--------------------------------------------
Andrew,

Thank you for the clarifications.
 
Partitioning of resources and memory fits one very common definition of tiling, along the dimension of physical or spatial locality: tiled memory formats, and tiling in the sense of partitioning the hardware or tying units to specific areas of screen space.

The use of the word tiling in this case is differentiated by the measures the GPU takes to capture or create temporal locality in the stream of primitives going through it: changing the order of issue, or accumulating data on-chip for a specified window of primitives and their shaders.

A more straightforward tiled GPU with caches and a long pipeline could capture some amount of locality even without special measures, but this is taking things further by massaging execution to get beyond the somewhat coincidental coalescing of accesses during the time data happens to be resident in the texture cache or in a ROP tile. There seem to be rather clear benefits to doing this, given how Nvidia's efficiency has improved since its introduction.
 
I wonder how much of this has to do with SM 6.0 requirements :D
Is there something besides the shader/wavefront operations linked elsewhere? Otherwise, it's asking for things like wavefront ballot operations to feed back to the tiling stage that called the shader in the first place.

There are API-level constructs like Vulkan's render passes that help tiled renderers, but those predate 6.0.
AMD uses the context provided to help reduce pipeline bubbles, even though the driver is targeting an immediate mode renderer.
 
I am not aware of anything that is public for SM 6.0, but I guess we can expect more things to be added before the final preview. Also, keep in mind that MS claimed the shader compiler will be unbound from Windows SDK releases. So anything could be possible.
 