Tile-based Rasterization in Nvidia GPUs

Any time you put constants as literals in your code, the compiler is likely to "look for patterns", e.g. with loop unrolling. To hide constants from the compiler, pass them to the shader as arguments instead of hard-coding them.
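As a minimal illustration (the names below are made up, not taken from the demo): the same value behaves very differently depending on whether the compiler sees it as a literal or only receives it at runtime through a constant buffer.

    static const float kScale = 0.5f;        // literal: the compiler can fold/specialise around this value

    cbuffer TweakConstants : register(b0)    // runtime constant: opaque to the compiler
    {
        float gScale;                        // filled in by the application every frame
    };

    float4 PSMain(float4 color : COLOR0) : SV_Target
    {
        // return color * kScale;            // value known at compile time -> subject to "pattern" optimisations
        return color * gScale;               // value unknown at compile time -> must be computed as written
    }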
 
I saw the apparent compression in both cases: using the same value for multiple attributes in one vertex, and using the same value for one attribute across multiple vertices.

But... after more testing, I'm not convinced that it isn't really just the compiler being smart. E.g. the original code cycles through colours with "(input.VertexID / 3) % 7". If I add a line like "output.Color *= (1.0 + (input.VertexID / 3) * 1e-16);", which should (I think) have zero effect on the output values but (I guess) makes it harder for the compiler to optimise, then it renders significantly fewer triangles per pass. I don't understand what optimisations the compiler could possibly be doing with the cyclic colours, though, so it still seems mysterious to me.
The driver/compiler could move "(input.VertexID / 3) % 7" and the color lookup code to the pixel shader to reduce the vertex attribute count. But this optimization makes the pixel shader more complex, and it's hard to justify as the driver/compiler has no knowledge of your average vertex and pixel counts (or overdraw).

If the compiler were really clever, it could notice the modulo operator on the SV_VertexID and transform the code to use geometry instancing instead (extracting the color lookup table as an instance vertex buffer). If an attribute is tied only to instance vertex buffer data, it doesn't need to be replicated per vertex.
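A minimal sketch of what that transformed vertex input could look like (purely hypothetical; nothing in the thread confirms the driver actually does this, and the struct/semantic names are invented). The colour comes from a second vertex buffer that the application-side input layout marks as per-instance data, so it is stored once per instance rather than once per vertex:

    struct VSInput
    {
        float3 Position      : POSITION;   // per-vertex stream, advances every vertex
        float4 InstanceColor : COLOR0;     // per-instance stream (marked D3D11_INPUT_PER_INSTANCE_DATA
                                           // in the input layout on the CPU side)
    };

    struct VSOutput
    {
        float4 Position : SV_Position;
        float4 Color    : COLOR0;
    };

    VSOutput VSMain(VSInput input)
    {
        VSOutput output;
        output.Position = float4(input.Position, 1.0);
        output.Color    = input.InstanceColor;   // no per-vertex colour replication needed
        return output;
    }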

I would guess that the shader compiler does some simple analysis on each output parameter (it needs to do that already for other optimizations). A 1e-16 increment per vertex tells the compiler that each output is unique; if there's any kind of reuse caching, this attribute would be excluded. Indexing into a constant array with a modulo operator (%7), however, results in at most 7 different values, so caching is highly efficient. This kind of vertex output caching is already needed for indexed geometry. It's not hard to believe that Nvidia might have extended it to support some other easily detected safe cases as well.
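For reference, a minimal sketch of the kind of vertex shader being discussed (the demo's actual source isn't posted in the thread, so the colour values and the placeholder position are assumptions):

    struct VSOutput
    {
        float4 Position : SV_Position;
        float4 Color    : COLOR0;
    };

    // Seven colours; note that all seven alpha values are equal (1.0), which turns out to matter later in the thread.
    static const float4 kColors[7] =
    {
        float4(1,0,0,1), float4(0,1,0,1), float4(0,0,1,1), float4(1,1,0,1),
        float4(1,0,1,1), float4(0,1,1,1), float4(1,1,1,1)
    };

    VSOutput VSMain(uint vertexID : SV_VertexID)
    {
        VSOutput output;
        output.Position = float4(0, 0, 0, 1);           // placeholder; the real demo computes triangle positions
        output.Color    = kColors[(vertexID / 3) % 7];  // at most 7 distinct values per output -> easy to cache/fold
        // The perturbation from the post above: numerically a no-op at float precision,
        // but it makes every triangle's colour look unique to the compiler's analysis.
        output.Color   *= (1.0 + (vertexID / 3) * 1e-16);
        return output;
    }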
 
The screen is split into tiles of 16x16 px, which are assigned to 4 interleaved partitions. (My terminology is probably all wrong but never mind). Each partition corresponds to a GPC, and the number of tiles in each is proportional to the number of SMMs. The 970 has 13 SMMs in total so the partitions are unequal. Each partition gets rasterised almost completely independently.
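To make the idea concrete, here is a toy model of that partitioning (the real tile-to-GPC mapping is unknown; the hash below is invented purely to illustrate "16x16 pixel tiles interleaved over four partitions weighted by SMM count", with the GTX 970's 13 SMMs assumed to split 4+3+3+3 across its four GPCs):

    // Toy model only: maps a pixel to one of four hypothetical partitions.
    uint TilePartition(uint2 pixel)
    {
        uint2 tile = pixel / 16;                   // 16x16 pixel screen tiles
        uint  slot = (tile.x + tile.y * 4) % 13;   // interleave tiles over 13 SMM-weighted slots
        // Slots 0-3 -> partition 0 (4 SMMs); 4-6 -> 1; 7-9 -> 2; 10-12 -> 3 (3 SMMs each).
        return slot < 4 ? 0 : (slot - 4) / 3 + 1;
    }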
Awesome, yes that's what I was sorta guessing was happening. The imbalance shows up in the "4:3:3:3" pattern (or similar) you noted, and it's confirmed nicely by there being different patterns on different 970s. Great analysis!

The fine-grained 16x16 hashing is similar to what all GPUs do (and have done for quite some time) - it's also what mistakenly trips some people up into saying there's nothing special going on here (there clearly is :)). But how that interacts with the coarse-grained tiling with uneven loads is the neat bit. Obviously in practice the loads will tend to be a lot less even in the first place due to geometry variation, but it's easier to analyse the simple case first.

I don't know how to tell whether pixel shaders for each tile are restricted to running on the SMMs in the GPC corresponding to that tile's partition, but I guess it would make sense if they were.
The fine-grained hashing is typically static (although sometimes software programmable/tweakable) on most GPUs. I don't know for sure if this is the case on Maxwell, but I wouldn't be surprised either.
 
After a bit more testing: I think the significant output here is just output.Color.a - the compiler's analysis is smart enough to realise that all 7 possible values for it are equal, so it doesn't have to be stored in memory and can be replaced with a constant in the pixel shader. If I simply change one of the 7 alpha constants to a different number, the behaviour changes (and I have to reduce "num floats per vertex" by 1 to get back to the same behaviour as before).

So ignore everything I said about compression, the compiler was just being smarter than me :(

(But since speculation is fun: it looks like it actually takes a few frames for the compiler to discover that alpha is constant. E.g. I set it up so when "num floats per vertex" is 20 and "num pixels" is 50%, it's drawn to half the screen; and when it's 22, it's drawn to the entire screen and started a second batch of triangles. Then, whenever I move the slider from 20 to 21 (and it recompiles the shaders), it very briefly flashes an image that looks the same as 22, before settling down to the same as 20. But if I change it so alpha is not constant, and it can't do that optimisation, there is no flicker any more - 21 always looks the same as 22. I guess that means the analysis is expensive enough that it's only done when recompiling the shader in some background thread or something? Anyway, not really relevant to this topic, just something that makes the demo application confuse me.)
 
Yeah. Modern compilers can move compile time constants (and math) between VS<->PS. It's a good trick in reducing vertex attribute count.
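A sketch of doing that trick by hand (hypothetical code, reusing the colour-table example from above): pass only a small table index across the VS->PS boundary instead of a full float4 colour, and do the lookup in the pixel shader.

    static const float4 kColors[7] =
    {
        float4(1,0,0,1), float4(0,1,0,1), float4(0,0,1,1), float4(1,1,0,1),
        float4(1,0,1,1), float4(0,1,1,1), float4(1,1,1,1)
    };

    struct VSOutput
    {
        float4               Position   : SV_Position;
        nointerpolation uint ColorIndex : COLORINDEX;   // one integer attribute instead of four floats
    };

    float4 PSMain(VSOutput input) : SV_Target
    {
        return kColors[input.ColorIndex];               // table lookup moved into the pixel shader
    }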

I have seen similar behavior from Nvidia's shader compiler. I often modify & recompile shaders at runtime, and after a recompile the affected shader is slower for a few frames. I believe Nvidia quickly compiles an unoptimized shader first and then applies heavy optimizations in a background compile process. Automatic PGO is a possibility, if they collect counter stats from shaders. It wouldn't surprise me, as they have researched PGO a lot for Denver.
 
I tried to compile the tile check software that was introduced earlier.
 

Attachments

  • tile.zip (121.2 KB)
I suppose it could just be bad NVidia drivers doing something that just happens to have a tile shape. It doesn't happen on AMD or Intel hardware that I've used. I wish I had access to an older Fermi or Kepler based NVidia GPU so I could see if it does the same thing on those chips.

Regards,
SB
 
So because it's tile shaped it's a glitch with tiled rendering?
Yeah. All modern GPUs perform several rasterization related things in smallish rectangular tiles. Also acceleration structures (such as hiZ) are tile based. I am guessing this is just a driver bug. Some timing issue (missing synchronization) in this special case. Could be anything really.
 
Yeah, I'm guessing it's probably just bad drivers then, or something with Pascal. There are other places in the game that exhibit the same tiled artifacting. I'm guessing it has something to do with the lighting in the game.

In another location

http://imgur.com/a/3eX0Y

It's very annoying, as the tiles are constantly flickering and changing based on where you are looking. It also only happens in certain locations in the game.


Regards,
SB
 
Guild Wars 2 player here... I have been a long-time player of GW1, GW2, and Lineage 2. This is a long-standing bug: sometimes it appears and then disappears. It's due to the engine they are using. When it appears, just alt-tab out and back into the game and it should disappear. It is indeed related to rasterization, but not in the sense you think.
 
On NVIDIA's Tile-Based Rendering


Thanks to NVIDIA's public acknowledgement of its use of tile-based rendering starting with the Maxwell architecture, some design decisions in Maxwell now make much more sense. Below is a screenshot taken from NVIDIA's "5 Things You Should Know About the New Maxwell GPU Architecture". Take a look at the L2 cache size. From Kepler to Maxwell, the cache size increased 8x, from 256 KB on Kepler to 2048 KB on Maxwell. We can attribute this gigantic leap to the need for a larger L2 cache to hold the tile-based resources for the rasterizing process, which allowed NVIDIA the leap in memory performance and power efficiency it achieved with Maxwell compared to its Kepler predecessor. Incidentally, NVIDIA's GP102 chip (which powers the GTX Titan X and the upcoming, recently announced GTX 1080 Ti) doubles that amount of L2 cache again, to a staggering 4096 KB. Whether or not Volta will continue with the scaling of L2 cache remains to be seen, but I've seen worse bets.

An interesting tangent: the Xbox 360's eDRAM and the Xbox One's ESRAM (paired with ATI/AMD GPUs, no less) can serve as a partial substitute for the tile-based rasterization process that Maxwell and later NVIDIA GPUs employ.

Tile-based rendering seems to have been a key part of NVIDIA's secret sauce for achieving the impressive performance-per-watt ratings of its last two architectures, and it's expected that their approach to this rendering mode will only improve with time. Some differences can already be seen in the tile-based rendering between Maxwell and Pascal, with the former dividing the scene into square tiles and the latter breaking the scene up into squares or vertical rectangles as needed, so NVIDIA has in fact put some measure of work into the rendering system between these two architectures.

Perhaps we have already seen the seeds of this tile-based rendering in AMD's Vega architecture sneak peek, particularly with regard to its next-generation Pixel Engine: the render back-ends are now clients of the L2 cache, replacing previous architectures' non-coherent memory access path in which the pixel engine wrote directly through the memory controller. This could be AMD's way of tackling the same problem, with the new-generation draw-stream binning rasterizer supposedly helping to conserve clock cycles while improving on-die cache locality and reducing memory footprint.
https://www.techpowerup.com/231129/on-nvidias-tile-based-rendering
 