Tile-based Rasterization in Nvidia GPUs

Mat3 · Aug 3, 2016

What's the difference between a TBDR's parameter buffer of binned triangles and what's in a G buffer? Once you've got the G buffer, why isn't that enough to start rasterizing tile by tile?

Rodéric · Aug 3, 2016

Mat3 said:
What's the difference between a TBDR's parameter buffer of binned triangles and what's in a G buffer? Once you've got the G buffer, why isn't that enough to start rasterizing tile by tile?

If you filled your g-buffer you already rasterized the scene...
And translucent objects aren't in the g-buffer.

I suspect the tiles also improve data compression, besides everything that was already mentioned (more coherent texture reads = reduced bandwidth, translucent overdraw in L2 = reduced bandwidth, fewer ROP "exports" = reduced bandwidth...).

PixResearch · Aug 3, 2016

Mat3 said:
What's the difference between a TBDR's parameter buffer of binned triangles and what's in a G buffer? Once you've got the G buffer, why isn't that enough to start rasterizing tile by tile?

Bit off topic as it's not how NVidia's approach works but...

The parameter buffer is the intermediate storage written after the geometry processing phase but before the rasterization/3D phase, i.e. all the primitives that might be in any tile, plus a per-tile entry indicating which of those primitives might be visible in that tile. It's just a big pile of geometry, pointers and masks that have had no pixel shading applied, and many of those primitives may ultimately not be visible in the final render.

A G-buffer, on the other hand, is an intermediate set of data produced by a rasterization/3D process that has already determined what is visible at each fragment location (normally it's limited to storing only one of the things visible at each location, which is usually fine but not in cases like transparency). It's written out as multiple render targets' worth of data by the pixel shaders, which calculate all kinds of values needed for a subsequent lighting pass.

Really - the closest equivalent to a G-buffer in a TBDR system is the on-chip tile buffers in the 3D phase - they effectively store much of the same data. BTW - if you're sensible, on several TBDR/TBIR architectures you can exploit that to create the G-buffer data for the current tile only in the internal memories, use it immediately and throw it away without ever needing to write it out and read it back in, for a huge bandwidth saving (see modern APIs and pixel local storage extensions).
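
To make the contrast concrete, here is a rough Python sketch. It is not how any real GPU lays out its memory: the names (`bin_triangles`, `empty_gbuffer`, `TILE`), the 16 px tile size and the G-buffer fields are all made up for illustration, and binning uses a conservative bounding-box test. The point is only that the parameter buffer holds unshaded primitives plus per-tile lists, while the G-buffer holds one shaded sample per pixel.

```python
from collections import defaultdict

TILE = 16  # hypothetical tile size in pixels


def bin_triangles(triangles, width, height):
    """Toy parameter buffer: the untouched primitive list plus, per tile,
    the indices of primitives whose bounding box overlaps that tile."""
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    per_tile = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Clamp the bounding box to the screen, in tile units.
        tx0 = max(0, int(min(xs)) // TILE)
        tx1 = min(tiles_x - 1, int(max(xs)) // TILE)
        ty0 = max(0, int(min(ys)) // TILE)
        ty1 = min(tiles_y - 1, int(max(ys)) // TILE)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                per_tile[(tx, ty)].append(index)
    return {"primitives": triangles, "per_tile": dict(per_tile)}


def empty_gbuffer(width, height):
    """Toy G-buffer: per-pixel surface data, no reference to triangles."""
    return {
        "depth": [[1.0] * width for _ in range(height)],
        "normal": [[(0.0, 0.0, 1.0)] * width for _ in range(height)],
        "albedo": [[(0.0, 0.0, 0.0)] * width for _ in range(height)],
    }


triangles = [
    ((2.0, 2.0), (12.0, 3.0), (5.0, 14.0)),      # fits inside tile (0, 0)
    ((10.0, 10.0), (40.0, 12.0), (20.0, 30.0)),  # spans several tiles
]
params = bin_triangles(triangles, 64, 64)
gbuf = empty_gbuffer(64, 64)
```

Note that `params` grows with the amount of submitted geometry, whereas `gbuf` has a fixed size set by the resolution.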

milk · Aug 3, 2016

I think he got hung up on the "G" in G-buffer and wrongly assumed it stores geometry information as in verts and polys, which it doesn't. It stores geometrical information as in depth and normals, on a per-sample basis. The normals in the buffer have usually already been modulated by normal maps, and even depth, in some next-gen engines, might have been displaced by a parallax shader. Once written, it's completely agnostic to the actual underlying polygonal geometry.

AnomalousEntity · Aug 3, 2016

Mat3 said:
What's the difference between a TBDR's parameter buffer of binned triangles and what's in a G buffer? Once you've got the G buffer, why isn't that enough to start rasterizing tile by tile?

That depends. If your architecture can run a position-only vertex shader, then you don't need to store anything - just run the vertex shader and bin your triangles. The G-buffer, on the other hand, stores everything (albedo, normals, gloss, materialID, ...) required to compute lighting on those triangles.

milk said:
I think he got hung up on the "G" in G-buffer and wrongly assumed it stores geometry information as in verts and polys, which it doesn't. It stores geometrical information as in depth and normals, on a per-sample basis. The normals in the buffer have usually already been modulated by normal maps, and even depth, in some next-gen engines, might have been displaced by a parallax shader. Once written, it's completely agnostic to the actual underlying polygonal geometry.

It's not agnostic to the underlying geometry, otherwise you wouldn't be able to compute SSAO or even lighting. You only store the Z, as you already know the X,Y of the pixel.
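
As a rough illustration of why Z alone is enough, here is a small Python sketch that rebuilds a view-space position from pixel coordinates and a stored depth. The function name, the symmetric perspective projection, the +Z view direction and the use of linear depth are all assumptions made for the example; real engines differ in conventions.

```python
import math


def view_position(px, py, depth, width, height, fov_y_deg):
    """Rebuild a view-space position from a pixel index and linear depth.

    Assumes a symmetric perspective projection with the camera looking
    down +Z (illustrative conventions only).
    """
    tan_half = math.tan(math.radians(fov_y_deg) / 2.0)
    aspect = width / height
    # Pixel centre mapped to normalised device coordinates in [-1, 1].
    ndc_x = (px + 0.5) / width * 2.0 - 1.0
    ndc_y = 1.0 - (py + 0.5) / height * 2.0
    return (ndc_x * tan_half * aspect * depth,
            ndc_y * tan_half * depth,
            depth)


centre = view_position(1, 1, 5.0, 3, 3, 90.0)  # middle pixel of a 3x3 target
right = view_position(2, 1, 5.0, 3, 3, 90.0)   # pixel to its right
```

Neighbouring reconstructed positions are what screen-space effects such as SSAO work from.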

milk · Aug 3, 2016

What I meant was: once written, the G-buffer knows nothing about the tris and verts that were used to build it.

AnomalousEntity · Aug 3, 2016

That's true generally, as you've moved from object space to screen space; however, you can store anything inside your G-buffer, even primitive IDs. It depends on your use case.

HTupolev · Aug 3, 2016

AnomalousEntity said:
That's true generally, as you've moved from object space to screen space; however, you can store anything inside your G-buffer, even primitive IDs. It depends on your use case.

The point milk is making is that it's rasterizing the geometry and storing properties in screen-space, rather than being pre-raster data in geometry primitive space.

milk · Aug 3, 2016

Thanks HTupolev

Ethatron · Aug 3, 2016

I wonder if it's already possible to do efficient pixel-caching with a TBDR system. Reducing the pixel set to all pixels with unique parameter vectors shouldn't be too hard on some smallish tile size. The part that acts like a pixel shader could only be done with compute ofc, with each unique vector to update (not in cache) assigned to one lane. Then scatter the results from the cache vector as a separate pass.
Are there any APIs which allow programming the pipeline, its stages and the (re-)scheduler between the stages yourself?

Philip · Aug 5, 2016

Andrew Lauritzen said:
There's definitely some weirdness in the 970 that David tested though that is almost certainly related to there being some disabled clusters. On "fully enabled" parts you don't see any of that weird hashed run-ahead of multiple tiles - it's all very balanced and it goes from one tile to the next.

From my testing on a GTX 970, I think the explanation for the weirdness is something like:

The screen is split into tiles of 16x16 px, which are assigned to 4 interleaved partitions. (My terminology is probably all wrong but never mind). Each partition corresponds to a GPC, and the number of tiles in each is proportional to the number of SMMs. The 970 has 13 SMMs in total so the partitions are unequal. Each partition gets rasterised almost completely independently.

I don't know how to tell whether pixel shaders for each tile are restricted to running on the SMMs in the GPC corresponding to that tile's partition, but I guess it would make sense if they were.

On my device I believe the assignment pattern is

Code:

p = [0,1,2,3,0,2,3,0,1,3,0,1,2]
partition(x, y) = p[(x + y*2) % 13]

where x,y are the tile index starting from the top left of the screen. That gives partition sizes in the ratio 4:3:3:3.

On David's video, his looks more like

Code:

p = [0,1,2,3,0,1,2,0,1,3,0,1,2]
partition(x, y) = p[(x + y*2) % 13]

That gives the ratio 4:4:3:2. I assume that corresponds to a different arrangement of disabled SMMs in his device.
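
As a quick check of those ratios, the sketch below counts how many slots each partition gets in the two tables above, and how many tiles each partition receives on a 13x13-tile patch. The function and variable names are made up for the example; only the tables and the `(x + y*2) % 13` indexing come from the post.

```python
from collections import Counter

P_FIRST = [0, 1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 1, 2]   # first table above
P_SECOND = [0, 1, 2, 3, 0, 1, 2, 0, 1, 3, 0, 1, 2]  # table read off the video


def partition(table, x, y):
    """Partition index for the tile at (x, y), counted from the top left."""
    return table[(x + y * 2) % len(table)]


def table_sizes(table, num_partitions=4):
    """Number of slots each partition owns in one 13-entry cycle."""
    counts = Counter(table)
    return [counts[i] for i in range(num_partitions)]


def tiles_on_patch(table, tiles_x, tiles_y, num_partitions=4):
    """Number of tiles per partition over a tiles_x by tiles_y patch."""
    counts = Counter(partition(table, x, y)
                     for y in range(tiles_y) for x in range(tiles_x))
    return [counts[i] for i in range(num_partitions)]


print(table_sizes(P_FIRST))             # [4, 3, 3, 3]
print(table_sizes(P_SECOND))            # [4, 4, 3, 2]
print(tiles_on_patch(P_FIRST, 13, 13))  # [52, 39, 39, 39]
```

Each row of 13 consecutive tiles hits every table entry exactly once, so the per-table counts carry over to the screen.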

The smaller partitions finish quicker, so the pattern becomes clearly visible as the partitions diverge.

From the video, the GTX 1070 (3 GPCs) looks more like

Code:

partition(x, y) = x % 3

Those partitions are equal over an infinite area, but don't fit uniformly into the ~512x512 px region that gets rasterised first, so the pattern becomes visible when the partition that's smaller in the first region starts the next region before the others do. Devices with 2 or 4 GPCs should have a much less visible pattern, since everything divides nicely there.
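
The imbalance inside one region can be counted directly. This sketch assumes the region is exactly 512x512 px and that the 1070 uses the same 16x16 px tiles as the 970; both are guesses, and the function name is made up.

```python
def tiles_per_partition(region_px, tile_px, num_partitions):
    """Count tiles per partition in a square region for partition = x % n."""
    tiles = region_px // tile_px
    counts = [0] * num_partitions
    for y in range(tiles):
        for x in range(tiles):
            counts[x % num_partitions] += 1
    return counts


print(tiles_per_partition(512, 16, 3))  # [352, 352, 320]
print(tiles_per_partition(512, 16, 2))  # [512, 512]
print(tiles_per_partition(512, 16, 4))  # [256, 256, 256, 256]
```

With 32 tile columns, three partitions get 11, 11 and 10 columns, so the third one has about 9% less work in the first region.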

(I'm mildly surprised they don't do something like "(x + y) % 3" on the 1070 to make thin vertical objects get distributed more evenly between the partitions.)
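
To illustrate the thin-vertical-object case, this sketch counts which partitions a single column of tiles lands in under each mapping. The names `column_load`, `observed` and `suggested` are made up; `suggested` is only the hypothetical alternative mentioned above, not something any device is known to use.

```python
from collections import Counter


def column_load(pattern, x, height_in_tiles, num_partitions=3):
    """Tiles per partition for a one-tile-wide column at tile index x."""
    counts = Counter(pattern(x, y) for y in range(height_in_tiles))
    return [counts[i] for i in range(num_partitions)]


def observed(x, y):
    return x % 3          # pattern that the 1070 appears to use


def suggested(x, y):
    return (x + y) % 3    # hypothetical alternative


print(column_load(observed, 5, 30))   # [0, 0, 30]
print(column_load(suggested, 5, 30))  # [10, 10, 10]
```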

Vertices/triangles are fully buffered (with all attributes) on-chip, up to about ~2k triangles (depending on the SKU and vertex output size) before a tile "pass" is run. Again this gets a lot more complicated when not considering full screen triangles but I think keeping the original article high level makes sense.

It also looks compressed to me - I see it handling a lot more triangles per pass if I put duplicated values in the vertex shader outputs than if the values are all unique. So that makes it even more complicated to analyse.

(But I'm certainly not an expert so I'd be happy to learn if I'm misinterpreting all this stuff!)

Philip · Aug 6, 2016

Philip said:
(I'm mildly surprised they don't do something like "(x + y) % 3" on the 1070 to make thin vertical objects get distributed more evenly between the partitions.)

Hmm, I see there are screenshots from other devices on http://www.neogaf.com/forum/showthread.php?t=1256067 which look to me like:

1080 (4 GPCs, 20 SMs): partition(x, y) = (x + 3*y) % 4
980Ti (6 GPCs, 22 SMMs): partition(x, y) = [0,1,2,3,4,0,1,2,3,5,0,1,2,3,4,5,0,1,2,3,4,5][(x + 3*y) % 22] or something a bit like that
Titan X (6 GPCs, 24 SMMs): can't really tell
840M (1 GPC, 3 SMMs): partition(x, y) = 0

The 1070 seems like the odd one out, the others are all interleaving the partitions sensibly when possible.
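
For anyone who wants to compare against the screenshots, this sketch prints a small grid of partition indices for the guessed 1080 and 980Ti patterns. The helper names are made up, and the 980Ti table is only the approximate guess from the list above.

```python
from collections import Counter

TABLE_980TI = [0, 1, 2, 3, 4, 0, 1, 2, 3, 5, 0,
               1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]  # approximate guess


def partition_1080(x, y):
    return (x + 3 * y) % 4


def partition_980ti(x, y):
    return TABLE_980TI[(x + 3 * y) % len(TABLE_980TI)]


def grid(pattern, tiles_x, tiles_y):
    """One string per tile row, one digit per tile."""
    return ["".join(str(pattern(x, y)) for x in range(tiles_x))
            for y in range(tiles_y)]


for row in grid(partition_1080, 4, 4):
    print(row)  # 0123 / 3012 / 2301 / 1230

print(sorted(Counter(TABLE_980TI).items()))
# [(0, 4), (1, 4), (2, 4), (3, 4), (4, 3), (5, 3)]
```

The 980Ti counts come out as 4:4:4:4:3:3, which adds up to the 22 SMMs.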

Ext3h · Aug 6, 2016

Philip said:
It also looks compressed to me - I see it handling a lot more triangles per pass if I put duplicated values in the vertex shader outputs than if the values are all unique. So that makes it even more complicated to analyse.

That's a surprise. I expected the buffer to hold full triangles in unpacked form.

Full compression appears unlikely though, more likely just de-duplication of vertices (by hash?) and (re)indexing?

sebbbi · Aug 6, 2016

Ext3h said:
That's a surprise. I expected the buffer to hold full triangles in unpacked form.

Full compression appears unlikely though, more likely just de-duplication of vertices (by hash?) and (re)indexing?

Maybe they push the vertex data through DCC or similar. DCC could actually benefit vertex buffers as well.

3dcgi · Aug 7, 2016

Couldn't the compiler be removing duplicates?

cho · Aug 7, 2016

Hmm, could you guys tell me how to tell from that video whether Maxwell/Pascal do tile-based deferred rendering or not?

Ext3h · Aug 7, 2016

3dcgi said:
Couldn't the compiler be removing duplicates?

Unlikely, unless you expect the compiler to evaluate which vertex IDs would result in identical vertex data, and the result of that evaluation would then be used to reorganize the geometry completely?

@cho
Simple: you would expect the GPU to render the geometry in the order it was submitted. In this case, from the bottom-most triangle to the top-most. Instead the GPU forms tiles, and renders the slices of all triangles intersecting one tile first, before moving on to the next tile.

It's not strictly tile-based in the classic sense, as not the whole render pipeline is tiled, and it can even revisit the same tile multiple times in a single geometry if locality is insufficient, thrashing the cache.

Furthermore, the observations made are in line with a patent Nvidia filed: https://www.google.com/patents/US20140118366

lanek · Aug 7, 2016

Ext3h said:
Furthermore, the observations made are in line with a patent Nvidia filed: https://www.google.com/patents/US20140118366

Whoa, let me push back a little. There are thousands and thousands of patents and research papers on tiled rasterization and tiled rendering, going back to the end of the '90s, from ATI, 3dfx, Intel, AMD and of course Nvidia; well, nearly everyone. So it's a bit hard to say whether this patent is the basis of what Nvidia uses right now. The problem is that Nvidia has not revealed that they use it or how they implement it, so we are left to work out how they do it. This patent can give some information, but I really doubt the technique used today is the same as the one described here. A lot of things, even in the cache architecture, could have changed in between.

3dcgi · Aug 7, 2016

Ext3h said:
Unlikely, unless you expect the compiler to evaluate which vertex IDs would result in identical vertex data, and the result of that evaluation would then be used to reorganize the geometry completely?

I read Philip's post differently than you did. It seems to me that he's outputting the same data for multiple attributes, from a single vertex. There's no need to compare data across vertices.

Philip · Aug 8, 2016

I saw the apparent compression in both cases - using the same value for multiple attributes in one vertex, and using the same value for one attribute in multiple vertexes.

But... after more testing, I'm not convinced that it isn't really just the compiler being smart. E.g. the original code cycles through colours with "(input.VertexID / 3) % 7". If I add a line like "output.Color *= (1.0 + (input.VertexID / 3) * 1e-16);" which should (I think) have zero effect on the output values but (I guess) makes it harder for the compiler to optimise, then it renders significantly fewer triangles per pass. I don't understand what optimisations the compiler could possibly be doing with the cyclic colours, though, so it still seems mysterious to me.
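
The "zero effect" claim can be checked on the CPU. This Python sketch emulates 32-bit float rounding with `struct`; it assumes the shader evaluates the expression in ordinary 32-bit floats, and the helper names are made up. It says nothing about what the shader compiler actually does.

```python
import struct


def f32(value):
    """Round a number to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


def colour_index(vertex_id):
    """Colour cycle from the test shader: (VertexID / 3) % 7."""
    return (vertex_id // 3) % 7


def scale_factor(vertex_id):
    """32-bit float value of 1.0 + (VertexID / 3) * 1e-16."""
    triangle = vertex_id // 3
    return f32(1.0 + f32(f32(triangle) * f32(1e-16)))


unchanged = all(scale_factor(v) == 1.0 for v in range(0, 300000, 3))
print(unchanged)  # True
```

In 32-bit floats the added term is far below half an ULP of 1.0 (about 6e-8) for any realistic triangle count, so the multiply leaves the colour unchanged.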
