> I saw the apparent compression in both cases - using the same value for multiple attributes in one vertex, and using the same value for one attribute in multiple vertexes.

The driver/compiler could move "(input.VertexID / 3) % 7" and the color lookup code to the pixel shader to reduce the vertex attribute count. But this optimization makes the pixel shader more complex. It's hard to justify, as the driver/compiler has no knowledge of your average vertex and pixel counts (or overdraw).
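For concreteness, here's a minimal HLSL sketch of the kind of VS->PS move being described. The struct names and the colour table are my own assumptions, not the demo's actual code:

```hlsl
// Hypothetical sketch - VSOutputA/B, kColors etc. are assumptions, not the demo's code.
static const float4 kColors[7] =
{
    float4(1,0,0,1), float4(0,1,0,1), float4(0,0,1,1), float4(1,1,0,1),
    float4(1,0,1,1), float4(0,1,1,1), float4(1,1,1,1)
};

// As written: the VS does the lookup and passes a full float4 to the PS.
struct VSOutputA
{
    float4 Position : SV_Position;
    float4 Color    : COLOR0;       // 4 floats of vertex attribute storage
};

float4 PSMainA(VSOutputA input) : SV_Target
{
    return input.Color;
}

// After the hypothetical move: only a triangle index is passed (1 uint instead of
// 4 floats), and the modulo + table lookup now run once per pixel instead.
struct VSOutputB
{
    float4 Position : SV_Position;
    nointerpolation uint TriIndex : TEXCOORD0;   // the VS would write input.VertexID / 3
};

float4 PSMainB(VSOutputB input) : SV_Target
{
    return kColors[input.TriIndex % 7];          // fewer attributes, more PS work
}
```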
But... after more testing, I'm not convinced that it isn't really just the compiler being smart. E.g. the original code cycles through colours with "(input.VertexID / 3) % 7". If I add a line like "output.Color *= (1.0 + (input.VertexID / 3) * 1e-16);", which should (I think) have zero effect on the output values but (I guess) makes it harder for the compiler to optimise, then it renders significantly fewer triangles per pass. I don't understand what optimisations the compiler could possibly be doing with the cyclic colours, though, so it still seems mysterious to me.
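In shader form, the experiment looks roughly like this. Only the two expressions quoted above come from the original code; the structs, the colour table and the placeholder position maths are assumptions:

```hlsl
// Sketch of the perturbation experiment; everything except the two quoted expressions is assumed.
static const float4 kColors[7] =
{
    float4(1,0,0,1), float4(0,1,0,1), float4(0,0,1,1), float4(1,1,0,1),
    float4(1,0,1,1), float4(0,1,1,1), float4(1,1,1,1)
};

struct VSInput  { float3 Position : POSITION; uint VertexID : SV_VertexID; };
struct VSOutput { float4 Position : SV_Position; float4 Color : COLOR0; };

VSOutput VSMain(VSInput input)
{
    VSOutput output;
    output.Position = float4(input.Position, 1.0);          // placeholder transform
    output.Color    = kColors[(input.VertexID / 3) % 7];    // original cyclic colours

    // For any realistic vertex count the scale factor should round back to exactly
    // 1.0 in fp32, so the rendered values shouldn't change - but Color now formally
    // depends on VertexID itself rather than only on (VertexID / 3) % 7, which seems
    // to be enough to stop the compiler proving there are just 7 possible outputs.
    output.Color *= (1.0 + (input.VertexID / 3) * 1e-16);

    return output;
}
```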
> The screen is split into tiles of 16x16 px, which are assigned to 4 interleaved partitions. (My terminology is probably all wrong but never mind). Each partition corresponds to a GPC, and the number of tiles in each is proportional to the number of SMMs. The 970 has 13 SMMs in total so the partitions are unequal. Each partition gets rasterised almost completely independently.

Awesome, yes that's what I was sorta guessing was happening. The imbalance you noted shows up as the "4:3:3:3" pattern (or similar), and it's confirmed nicely by there being different patterns on different 970s. Great analysis!
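To make the mapping described above concrete, here's a debug-visualisation style sketch. The 16x16 tile size and the SMM-proportional 4:3:3:3 split come from the observation; the actual tile-to-partition hash is unknown, so the simple round-robin here (and the assumed 1920-wide framebuffer) is purely illustrative:

```hlsl
// Illustrative only: the real hardware hash is not known.
static const uint TILE_SIZE     = 16;
static const uint TILES_PER_ROW = 120;   // assumed 1920-pixel-wide framebuffer

uint PartitionForPixel(uint2 pixel)
{
    uint2 tile      = pixel / TILE_SIZE;
    uint  tileIndex = tile.y * TILES_PER_ROW + tile.x;

    // Spread tiles over 13 "SMM slots", then fold the slots into 4 GPC partitions:
    // slots 0..3 -> GPC0 (4 SMMs), 4..6 -> GPC1, 7..9 -> GPC2, 10..12 -> GPC3.
    uint slot = tileIndex % 13;
    return (slot < 4) ? 0 : 1 + (slot - 4) / 3;
}

// A debug pixel shader could tint each pixel by its (hypothetical) partition:
float4 PSDebugPartition(float4 pos : SV_Position) : SV_Target
{
    static const float4 kTint[4] =
    {
        float4(1,0,0,1), float4(0,1,0,1), float4(0,0,1,1), float4(1,1,0,1)
    };
    return kTint[PartitionForPixel((uint2)pos.xy)];
}
```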
> I don't know how to tell whether pixel shaders for each tile are restricted to running on the SMMs in the GPC corresponding to that tile's partition, but I guess it would make sense if they were.

The fine-grained hashing is typically static (although sometimes software programmable/tweakable) on most GPUs. I don't know for sure if this is the case on Maxwell, but I wouldn't be surprised either.
> I would guess that the shader compiler does some simple analysis on each output parameter (it needs to do that already for other optimizations). The 1e-16 increment per vertex tells the compiler that each output is unique. If there's any kind of reuse caching, this attribute would be excluded. Indexing into a constant array by a modulo operator (%7), however, results in at most 7 different values, so caching is highly efficient. This kind of vertex output caching is already needed for indexed geometry. It's not hard to believe that Nvidia might have extended it to support some other easily detected safe cases as well.

After a bit more testing: I think the significant output here is just output.Color.a - the compiler's analysis is smart enough to realise that all 7 possible values for it are equal, so it doesn't have to be stored in memory and can be replaced with a constant in the pixel shader. If I simply change one of the 7 alpha constants to a different number, the behaviour changes (and I have to reduce "num floats per vertex" by 1 to get back to the same behaviour as before).
> After a bit more testing: I think the significant output here is just output.Color.a - the compiler's analysis is smart enough to realise that all 7 possible values for it are equal, so it doesn't have to be stored in memory and can be replaced with a constant in the pixel shader. If I simply change one of the 7 alpha constants to a different number, the behaviour changes (and I have to reduce "num floats per vertex" by 1 to get back to the same behaviour as before).

Yeah. Modern compilers can move compile time constants (and math) between VS<->PS. It's a good trick in reducing vertex attribute count.
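A small sketch of the situation being described, with an assumed colour table (not the demo's actual constants). Because every entry shares the same alpha, a compiler that proves the index is always one of these 7 values could strip .a from the interpolated attribute and re-materialise it as a literal in the PS; change any single alpha and that folding is no longer valid:

```hlsl
// Assumed colour table for illustration - all 7 alphas are identical (1.0).
static const float4 kColors[7] =
{
    float4(1,0,0,1), float4(0,1,0,1), float4(0,0,1,1), float4(1,1,0,1),
    float4(1,0,1,1), float4(0,1,1,1), float4(1,1,1,1)
};

// Hypothetical result of the optimisation: only Color.rgb (3 floats) is passed down,
// and the PS rebuilds alpha as a compile-time constant.
float4 PSMainFolded(float3 rgb : COLOR0) : SV_Target
{
    return float4(rgb, 1.0);
}

// Changing, say, kColors[3].a to 0.5 would break the "all alphas equal" proof, so the
// full float4 would have to be interpolated again - matching the observation that
// "num floats per vertex" has to drop by 1 to restore the old behaviour.
```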
So ignore everything I said about compression; the compiler was just being smarter than me.
(But since speculation is fun: it looks like it actually takes a few frames for the compiler to discover that alpha is constant. E.g. I set it up so that when "num floats per vertex" is 20 and "num pixels" is 50%, it's drawn to half the screen; and when it's 22, it's drawn to the entire screen and a second batch of triangles is started. Then, whenever I move the slider from 20 to 21 (and it recompiles the shaders), it very briefly flashes an image that looks the same as 22 before settling down to the same as 20. But if I change it so alpha is not constant, and it can't do that optimisation, there is no flicker any more - 21 always looks the same as 22. I guess that means the analysis is expensive enough that it's only done when recompiling the shader in some background thread or something? Anyway, not really relevant to this topic, just something about the demo application that confuses me.)
> Why are you attributing the glitch to tiled rendering?

Look at the surrounding area of the mini-map at the bottom-right corner. You can see tiles there.
> So because it's tile-shaped, it's a glitch with tiled rendering?

Yeah. All modern GPUs perform several rasterization-related things in smallish rectangular tiles. Also, acceleration structures (such as hiZ) are tile-based. I am guessing this is just a driver bug - some timing issue (missing synchronization) in this special case. Could be anything really.
https://www.techpowerup.com/231129/on-nvidias-tile-based-rendering

Thanks to NVIDIA's public acknowledgement of its usage of tile-based rendering starting with its Maxwell architecture, some design decisions on the Maxwell architecture now make much more sense. Below is a screenshot taken from NVIDIA's "5 Things You Should Know About the New Maxwell GPU Architecture". Take a look at the L2 cache size. From Kepler to Maxwell, the cache size increased 8x, from 256 KB on Kepler to 2048 KB on Maxwell. Now, we can attribute this gigantic leap in cache size to the need for a larger L2 cache to fit the tile-based resources required for the rasterizing process, which allowed NVIDIA the leap in memory performance and power efficiency it achieved with the Maxwell architecture compared to its Kepler predecessor. Incidentally, NVIDIA's GP102 chip (which powers the GTX Titan X and the upcoming, recently announced GTX 1080 Ti) doubles that amount of L2 cache again, to a staggering 4096 KB. Whether or not Volta will continue with the scaling of L2 cache remains to be seen, but I've seen worse bets.
An interesting tangent: the Xbox 360 and Xbox One ESRAM chips (running on AMD-architectured GPUs, no less) can serve as a substitute for the tile-based rasterization process that post-Maxwell NVIDIA GPUs employ.
Tile-based rendering seems to have been a key part of NVIDIA's secret sauce for achieving the impressive performance-per-watt ratings of their last two architectures, and it's expected that their approach to this rendering mode will only improve with time. Some differences can be seen in the tile-based rendering between Maxwell and Pascal already, with the former dividing the scene into triangles, and the latter breaking a scene up into squares or vertical rectangles as needed, so NVIDIA has in fact put some measure of work into the rendering system between these two architectures.
Perhaps we have already seen some seeds of this tile-based rendering in AMD's Vega architecture sneak peek, particularly in regard to its next-generation Pixel Engine: the render back-ends are now clients of the L2 cache, replacing previous architectures' non-coherent memory access in which the pixel engine wrote directly to the memory controller. This could be AMD's way of tackling the same problem, with AMD's improvements to the pixel engine (a new-generation draw-stream binning rasterizer) supposedly helping to conserve clock cycles, whilst simultaneously improving on-die cache locality and memory footprint.
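As a rough sanity check of the L2-cache argument quoted above (my own back-of-envelope numbers, not from the article): at 4 bytes of colour plus 4 bytes of depth per pixel, a 16x16 tile is about 2 KB, so Maxwell's 2048 KB of L2 could keep on the order of a thousand binned tiles' worth of framebuffer data on-chip, where Kepler's 256 KB would fit only around 128 - consistent with the idea that the 8x jump is what made keeping the working set of tiles on-chip practical.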