> I saw the apparent compression in both cases - using the same value for multiple attributes in one vertex, and using the same value for one attribute in multiple vertexes.

The driver/compiler could move "(input.VertexID / 3) % 7" and the color lookup code to the pixel shader to reduce the vertex attribute count. But this optimization makes the pixel shader more complex. It's hard to justify, as the driver/compiler has no knowledge of your average vertex and pixel counts (or overdraw).
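For concreteness, here's a minimal HLSL sketch of the kind of VS->PS move being described. The struct names and the colour table are my own assumptions, not the demo's actual code:

```hlsl
// Hypothetical sketch - VSOutputA/B, kColors etc. are assumptions, not the demo's code.
static const float4 kColors[7] =
{
    float4(1,0,0,1), float4(0,1,0,1), float4(0,0,1,1), float4(1,1,0,1),
    float4(1,0,1,1), float4(0,1,1,1), float4(1,1,1,1)
};

// As written: the VS does the lookup and passes a full float4 to the PS.
struct VSOutputA
{
    float4 Position : SV_Position;
    float4 Color    : COLOR0;       // 4 floats of vertex attribute storage
};

float4 PSMainA(VSOutputA input) : SV_Target
{
    return input.Color;
}

// After the hypothetical move: only a triangle index is passed (1 uint instead of
// 4 floats), and the modulo + table lookup now run once per pixel instead.
struct VSOutputB
{
    float4 Position : SV_Position;
    nointerpolation uint TriIndex : TEXCOORD0;   // the VS would write input.VertexID / 3
};

float4 PSMainB(VSOutputB input) : SV_Target
{
    return kColors[input.TriIndex % 7];          // fewer attributes, more PS work
}
```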
But... after more testing, I'm not convinced that it isn't really just the compiler being smart. E.g. the original code cycles through colours with "(input.VertexID / 3) % 7". If I add a line like "output.Color *= (1.0 + (input.VertexID / 3) * 1e-16);", which should (I think) have zero effect on the output values but (I guess) makes it harder for the compiler to optimise, then it renders significantly fewer triangles per pass. I don't understand what optimisations the compiler could possibly be doing with the cyclic colours, though, so it still seems mysterious to me.
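In shader form, the experiment looks roughly like this. Only the two expressions quoted above come from the original code; the structs, the colour table and the placeholder position maths are assumptions:

```hlsl
// Sketch of the perturbation experiment; everything except the two quoted expressions is assumed.
static const float4 kColors[7] =
{
    float4(1,0,0,1), float4(0,1,0,1), float4(0,0,1,1), float4(1,1,0,1),
    float4(1,0,1,1), float4(0,1,1,1), float4(1,1,1,1)
};

struct VSInput  { float3 Position : POSITION; uint VertexID : SV_VertexID; };
struct VSOutput { float4 Position : SV_Position; float4 Color : COLOR0; };

VSOutput VSMain(VSInput input)
{
    VSOutput output;
    output.Position = float4(input.Position, 1.0);          // placeholder transform
    output.Color    = kColors[(input.VertexID / 3) % 7];    // original cyclic colours

    // For any realistic vertex count the scale factor should round back to exactly
    // 1.0 in fp32, so the rendered values shouldn't change - but Color now formally
    // depends on VertexID itself rather than only on (VertexID / 3) % 7, which seems
    // to be enough to stop the compiler proving there are just 7 possible outputs.
    output.Color *= (1.0 + (input.VertexID / 3) * 1e-16);

    return output;
}
```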
> The screen is split into tiles of 16x16 px, which are assigned to 4 interleaved partitions. (My terminology is probably all wrong but never mind). Each partition corresponds to a GPC, and the number of tiles in each is proportional to the number of SMMs. The 970 has 13 SMMs in total so the partitions are unequal. Each partition gets rasterised almost completely independently.

Awesome, yes that's what I was sorta guessing was happening. The imbalance you noted shows up as the "4:3:3:3" pattern (or similar), and it's confirmed nicely by there being different patterns on different 970s. Great analysis!
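To make the mapping described above concrete, here's a debug-visualisation style sketch. The 16x16 tile size and the SMM-proportional 4:3:3:3 split come from the observation; the actual tile-to-partition hash is unknown, so the simple round-robin here (and the assumed 1920-wide framebuffer) is purely illustrative:

```hlsl
// Illustrative only: the real hardware hash is not known.
static const uint TILE_SIZE     = 16;
static const uint TILES_PER_ROW = 120;   // assumed 1920-pixel-wide framebuffer

uint PartitionForPixel(uint2 pixel)
{
    uint2 tile      = pixel / TILE_SIZE;
    uint  tileIndex = tile.y * TILES_PER_ROW + tile.x;

    // Spread tiles over 13 "SMM slots", then fold the slots into 4 GPC partitions:
    // slots 0..3 -> GPC0 (4 SMMs), 4..6 -> GPC1, 7..9 -> GPC2, 10..12 -> GPC3.
    uint slot = tileIndex % 13;
    return (slot < 4) ? 0 : 1 + (slot - 4) / 3;
}

// A debug pixel shader could tint each pixel by its (hypothetical) partition:
float4 PSDebugPartition(float4 pos : SV_Position) : SV_Target
{
    static const float4 kTint[4] =
    {
        float4(1,0,0,1), float4(0,1,0,1), float4(0,0,1,1), float4(1,1,0,1)
    };
    return kTint[PartitionForPixel((uint2)pos.xy)];
}
```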
> I don't know how to tell whether pixel shaders for each tile are restricted to running on the SMMs in the GPC corresponding to that tile's partition, but I guess it would make sense if they were.

The fine-grained hashing is typically static (although sometimes software programmable/tweakable) on most GPUs. I don't know for sure if this is the case on Maxwell, but I wouldn't be surprised either.
> I would guess that the shader compiler does some simple analysis on each output parameter (it needs to do that already for other optimizations). The 1e-16 increment per vertex tells the compiler that each output is unique. If there's any kind of reuse caching, this attribute would be excluded. Indexing into a constant array by a modulo operator (%7), however, results in at most 7 different values, so caching is highly efficient. This kind of vertex output caching is already needed for indexed geometry. It's not hard to believe that Nvidia might have extended it to support some other easily detected safe cases as well.

After a bit more testing: I think the significant output here is just output.Color.a - the compiler's analysis is smart enough to realise that all 7 possible values for it are equal, so it doesn't have to be stored in memory and can be replaced with a constant in the pixel shader. If I simply change one of the 7 alpha constants to a different number, the behaviour changes (and I have to reduce "num floats per vertex" by 1 to get back to the same behaviour as before).
> After a bit more testing: I think the significant output here is just output.Color.a - the compiler's analysis is smart enough to realise that all 7 possible values for it are equal, so it doesn't have to be stored in memory and can be replaced with a constant in the pixel shader. If I simply change one of the 7 alpha constants to a different number, the behaviour changes (and I have to reduce "num floats per vertex" by 1 to get back to the same behaviour as before).

Yeah. Modern compilers can move compile time constants (and math) between VS<->PS. It's a good trick in reducing vertex attribute count.
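A small sketch of the situation being described, with an assumed colour table (not the demo's actual constants). Because every entry shares the same alpha, a compiler that proves the index is always one of these 7 values could strip .a from the interpolated attribute and re-materialise it as a literal in the PS; change any single alpha and that folding is no longer valid:

```hlsl
// Assumed colour table for illustration - all 7 alphas are identical (1.0).
static const float4 kColors[7] =
{
    float4(1,0,0,1), float4(0,1,0,1), float4(0,0,1,1), float4(1,1,0,1),
    float4(1,0,1,1), float4(0,1,1,1), float4(1,1,1,1)
};

// Hypothetical result of the optimisation: only Color.rgb (3 floats) is passed down,
// and the PS rebuilds alpha as a compile-time constant.
float4 PSMainFolded(float3 rgb : COLOR0) : SV_Target
{
    return float4(rgb, 1.0);
}

// Changing, say, kColors[3].a to 0.5 would break the "all alphas equal" proof, so the
// full float4 would have to be interpolated again - matching the observation that
// "num floats per vertex" has to drop by 1 to restore the old behaviour.
```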
So ignore everything I said about compression; the compiler was just being smarter than me.
(But since speculation is fun: it looks like it actually takes a few frames for the compiler to discover that alpha is constant. E.g. I set it up so that when "num floats per vertex" is 20 and "num pixels" is 50%, it's drawn to half the screen; and when it's 22, it's drawn to the entire screen and a second batch of triangles is started. Then, whenever I move the slider from 20 to 21 (and it recompiles the shaders), it very briefly flashes an image that looks the same as 22 before settling down to the same as 20. But if I change it so alpha is not constant, and it can't do that optimisation, there is no flicker any more - 21 always looks the same as 22. I guess that means the analysis is expensive enough that it's only done when recompiling the shader in some background thread or something? Anyway, not really relevant to this topic, just something about the demo application that confuses me.)
> Why are you attributing the glitch to tiled rendering?

Look at the surrounding area of the mini-map at the bottom-right corner. You can see tiles there.
> So because it's tile-shaped, it's a glitch with tiled rendering?

Yeah. All modern GPUs perform several rasterization-related things in smallish rectangular tiles. Also, acceleration structures (such as hiZ) are tile-based. I am guessing this is just a driver bug - some timing issue (missing synchronization) in this special case. Could be anything really.
https://www.techpowerup.com/231129/on-nvidias-tile-based-rendering

Thanks to NVIDIA's public acknowledgement of its usage of tile-based rendering starting with its Maxwell architecture, some design decisions on the Maxwell architecture now make much more sense. Below is a screenshot taken from NVIDIA's "5 Things You Should Know About the New Maxwell GPU Architecture". Take a look at the L2 cache size. From Kepler to Maxwell, the cache size increased 8x, from 256 KB on Kepler to 2048 KB on Maxwell. Now, we can attribute this gigantic leap in cache size to the need for a larger L2 cache to fit the tile-based resources required for the rasterizing process, which allowed NVIDIA the leap in memory performance and power efficiency it achieved with the Maxwell architecture compared to its Kepler predecessor. Incidentally, NVIDIA's GP102 chip (which powers the GTX Titan X and the upcoming, recently announced GTX 1080 Ti) doubles that amount of L2 cache again, to a staggering 4096 KB. Whether or not Volta will continue with the scaling of L2 cache remains to be seen, but I've seen worse bets.
An interesting tangent: the Xbox 360 and Xbox One ESRAM chips (running on AMD-architectured GPUs, no less) can serve as a substitute for the tile-based rasterization process that post-Maxwell NVIDIA GPUs employ.
Tile-based rendering seems to have been a key part of NVIDIA's secret sauce for achieving the impressive performance-per-watt ratings of their last two architectures, and it's expected that their approach to this rendering mode will only improve with time. Some differences can be seen in the tile-based rendering between Maxwell and Pascal already, with the former dividing the scene into triangles, and the latter breaking a scene up into squares or vertical rectangles as needed, so NVIDIA has in fact put some measure of work into the rendering system between these two architectures.
Perhaps we have already seen some seeds of this tile-based rendering in AMD's Vega architecture sneak peek, particularly in regard to its next-generation Pixel Engine: the render back-ends are now clients of the L2 cache, replacing previous architectures' non-coherent memory access in which the pixel engine wrote directly to the memory controller. This could be AMD's way of tackling the same problem, with AMD's improvements to the pixel engine (a new-generation draw-stream binning rasterizer) supposedly helping to conserve clock cycles, whilst simultaneously improving on-die cache locality and memory footprint.
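As a rough sanity check of the L2-cache argument quoted above (my own back-of-envelope numbers, not from the article): at 4 bytes of colour plus 4 bytes of depth per pixel, a 16x16 tile is about 2 KB, so Maxwell's 2048 KB of L2 could keep on the order of a thousand binned tiles' worth of framebuffer data on-chip, where Kepler's 256 KB would fit only around 128 - consistent with the idea that the 8x jump is what made keeping the working set of tiles on-chip practical.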