> Approaching TBDR would be more accurate.

AMD's own statements indicate they want to get some, or hopefully most, of the gains of a full TBDR. Your current position would not be disputing what AMD has said, but this position is not what created the point of contention.
> Subdividing until cache usage is maximized or some ideal bin size reached.

It is not sufficient to repeatedly subdivide a 2D screen-space tile in order to control on-chip storage, and each round of subdivision adds context of its own. It isn't clear how the shader would know cache usage is maximized until after the fact, when it is too late. The second condition seems trivially true: one would want an ideal bin size. If the ideal size is known, the trick would be to just set the bins to it from the start; if it is not known, the conditions for finding it have not been given.
With the instruction introduced in the Vega ISA doc, you get a conservative bounding box for a primitive, in pixels. Subdividing the area in pixels does not put a ceiling on how many primitives may have overlapping boxes; cutting the area of overlap ever more finely just gives you an increasing number of bins with no upper bound on their primitive count.
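To make that concrete, here is a small sketch (purely illustrative, not the Vega instruction or AMD's binning algorithm; all names and numbers are invented): conservative pixel-space boxes that share a common point survive every level of quadtree subdivision, so the bins multiply while the per-bin primitive count never drops.

```python
# Illustrative sketch only: conservative screen-space bounding boxes, such as
# a primitive shader might obtain per primitive, against a quadtree split of
# a tile. All names and values here are invented for the example.

def bbox_overlaps(box, tile):
    """Axis-aligned overlap test in pixel coordinates.
    box/tile = (x0, y0, x1, y1), half-open on the max edge."""
    bx0, by0, bx1, by1 = box
    tx0, ty0, tx1, ty1 = tile
    return bx0 < tx1 and tx0 < bx1 and by0 < ty1 and ty0 < by1

def subdivide(tile):
    """Split a tile into four quadrant tiles."""
    x0, y0, x1, y1 = tile
    mx, my = (x0 + x1) // 2, (y0 + y1) // 2
    return [(x0, y0, mx, my), (mx, y0, x1, my),
            (x0, my, mx, y1), (mx, my, x1, y1)]

# 64 primitives whose conservative boxes all overlap pixel (31, 31):
prims = [(31 - i % 8, 31 - i // 8, 33 + i % 8, 33 + i // 8) for i in range(64)]

tile = (0, 0, 64, 64)
for level in range(3):
    children = subdivide(tile)
    # Follow the child containing the common overlap point.
    tile = next(t for t in children if bbox_overlaps((31, 31, 32, 32), t))
    count = sum(bbox_overlaps(p, tile) for p in prims)
    print(f"level {level + 1}: tile {tile} still holds {count} boxes")
```

Every level prints the same count of 64: the subdivision produces more bins, but the bin around the shared point never thins out.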
Or at least it would, if the instruction and everything else were able to subdivide down to quads or individual pixels. The instruction doesn't, and the rest of the pipeline description still mentions quads being important. Getting near this point, rather than stopping at the coarser pixel granularity of the SIMD architecture and rasterizer, seems counterproductive.
Given the sparse description of the corner cases, it's potentially also a case where this subdividing shader goes on forever, if a fully covered tile is not considered optimal and the scheme keeps subdividing areas of full overlap.
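A toy loop makes the termination concern concrete (the threshold, the floor, and every name here are invented assumptions, not from AMD's materials): if the stop condition is a per-bin primitive count, a region of full overlap never satisfies it, and only an explicit granularity floor halts the recursion.

```python
# Toy model of the corner case: every primitive covers the whole region, so
# subdividing never reduces the per-bin count. The loop terminates only
# because of the pixel floor, never because a bin became "optimal".
# MAX_PRIMS_PER_BIN and MIN_EXTENT are assumed values for illustration.

MAX_PRIMS_PER_BIN = 4   # assumed "ideal bin" criterion
MIN_EXTENT = 1          # pixel floor; without it, this would never stop

def bin_depth(extent, prims_covering_everything):
    """Subdivision depth reached when all primitives cover the full region."""
    depth = 0
    while prims_covering_everything > MAX_PRIMS_PER_BIN and extent > MIN_EXTENT:
        extent //= 2    # halving full overlap never thins the bin
        depth += 1
    return depth

# 32 region-covering primitives over a 64-pixel-wide tile: six halvings down
# to a single pixel, with the bin still as crowded as when it started.
print(bin_depth(64, 32))
```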
Even if this avoids subdividing infinitely, it leaves a longer and variable period before the optimal bin size is determined (and only then does it decide whether geometry should be culled?). My impression of AMD's plan is not to drill down this far, but to trust that the existing hardware and culling paths will catch what gets through at primitive assembly and the rasterizer, particularly when using the DSBR's shade-once mode.
> Smaller bins would facilitate more accurate occlusion culling of geometry. Possibly with an uneven distribution.

You wouldn't be getting the necessary information from the Vega instruction for a workable variable sampling frequency between bins, and some amount of luck would be needed to contain the growth in complexity if this so-called primitive shader were changing the sampling, the boundaries, and the context growth of variable portions of the screen, then mapping them back up to the fixed target and execution granularities.
> I've been working off the hypothesis aggregate bin size will exceed hardware capacity somewhat readily in an ideal case. Not simply filling a batch/tile before passing it on.

I do not see this as being supported by AMD's claims, either in patents or in marketing. The fetch-once feature of Vega doesn't work if bins are allowed to overflow the chip: it would necessitate fetching whatever was pushed out again. This hypothesis also apparently requires that AMD's rasterizer scheme not close batches and pass them to the scan converter to be stepped through bin by bin, which is the process AMD's patents describe.
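The batch-then-bin flow attributed to the patents can be sketched roughly as follows. This is a reading of the described process, not AMD's implementation; the capacity, bin size, and every function name are invented for illustration.

```python
# Rough sketch of a batch-then-bin flow: primitives accumulate into a batch
# until a closure condition is met (a capacity stand-in here), then the
# closed batch is stepped through one bin at a time, as a scan converter
# would. Names and limits are illustrative assumptions.
from collections import defaultdict

BATCH_CAPACITY = 8  # assumed on-chip limit that closes a batch

def bins_touched(bbox, bin_size=16):
    """Bin coordinates intersected by a conservative pixel-space box."""
    x0, y0, x1, y1 = bbox
    return [(bx, by)
            for bx in range(x0 // bin_size, (x1 - 1) // bin_size + 1)
            for by in range(y0 // bin_size, (y1 - 1) // bin_size + 1)]

def drain(batch):
    """Step through a closed batch bin by bin."""
    per_bin = defaultdict(list)
    for i, box in batch:
        for b in bins_touched(box):
            per_bin[b].append(i)
    return [(b, i) for b in sorted(per_bin) for i in per_bin[b]]

def rasterize_batched(prim_boxes):
    order = []                            # (bin, primitive) in processed order
    batch = []
    for i, box in enumerate(prim_boxes):
        batch.append((i, box))
        if len(batch) == BATCH_CAPACITY:  # closure: batch is full, pass it on
            order += drain(batch)
            batch = []
    if batch:                             # final partial batch
        order += drain(batch)
    return order

out = rasterize_batched([(0, 0, 20, 20), (10, 10, 30, 30)])
```

The point of contention maps onto the closure condition: under this reading a batch is closed and drained before capacity is exceeded, rather than bins being allowed to overflow and spill off-chip.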
> https://lists.freedesktop.org/archives/mesa-dev/2017-April/152733.html

Part of that is included in what I linked, as well as an ASCII table showing the merged stages. It's been brought up before, and I am left wondering what all the back and forth was about, since the shift in the debate means we've apparently both been arguing for my position. However, I think I could find some quotes indicating that's not what happened.
That's what I recalled specifically. Not quite fully merged, but reduced stages with data in LDS. At least in Vega's case.
> Part of that hardware pipeline no longer exists though. So it is more than just a compiler vs hardware trade-off for optimization as referenced in the link above.

The comments on the code gave their reasons, and use cases, for monolithic and non-monolithic shaders. So long as there is more than one stage in the pipeline, whether the number is somewhat higher or lower does not affect the point I thought we were debating.
That the section is described as being part of the compiler middle-end really makes it very much about compiler trade-offs.