AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Approaching TBDR would be more accurate.
AMD's own statements indicate they want to get some, or hopefully most, of the gains of a full TBDR. Your current position does not dispute what AMD has said, but it is not the position that created the point of contention.

Subdividing until cache usage is maximized or some ideal bin size reached.
It is not sufficient to subdivide a 2D screen space tile repeatedly in order to control on-chip storage, and doing this repeatedly adds context as well. How the shader would know cache usage is maximized unless it is after the fact and too late isn't clear. The second condition seems to be trivially true, one would want an ideal bin size. However, if the ideal size is known, the trick would be to just set it to the ideal bin size. If not known, the conditions for ideal size have not been given.
With the instruction introduced in the Vega ISA doc, you get a conservative bounding box for a primitive--in pixels. Subdividing the area in pixels does not put a ceiling on how many primitives may have overlapping boxes, and cutting the area of overlap even more finely just gives you an increasing number of bins with no upper bound on their primitive count.

Or at least it would if the instruction and everything else were able to subdivide quads or individual pixels. The instruction doesn't, and the rest of the pipeline description still mentions quads being important. Getting near this point rather than stopping at the more coarse pixel granularity of the SIMD architecture and rasterizer seems counterproductive.

Given the sparse description of the corner cases, it's potentially also a case where this sub-dividing shader goes on forever, if a fully covered tile is not considered fully optimal and this scheme keeps on subdividing areas of full overlap.
Even if this avoids subdividing infinitely, it leaves a longer and variable period before optimal bin size is determined (and then it determines if geometry should be culled?). My impression of AMD's plan is to not drill down this far, but trust that the existing hardware and culling paths will catch what gets through at primitive assembly and the rasterizer, particularly if using the DSBR's shade-once mode.

Smaller bins would facilitate more accurate occlusion culling of geometry. Possibly with an uneven distribution.
You wouldn't be getting the necessary information from the Vega instruction to get a workable variable frequency of sampling between bins, and some amount of luck would be needed to contain the growth in complexity by having this so-called primitive shader changing the sampling, boundaries, and context growth of variable portions of the screen, then mapping them back up to the fixed target and execution granularities.

I've been working off the hypothesis that aggregate bin size will exceed hardware capacity somewhat readily in an ideal case. Not simply filling a batch/tile before passing it on.
I do not see this as being supported by AMD's claims, either in patents or in marketing. The fetch-once feature of Vega doesn't work if the bins are allowed to overflow the chip. It would necessitate fetching whatever was pushed out again, and this hypothesis apparently needs AMD's rasterizer scheme to not close batches and then pass them to be stepped through bin by bin by the scan converter, which AMD's patents give as the process.

https://lists.freedesktop.org/archives/mesa-dev/2017-April/152733.html

That's what I recalled specifically. Not quite fully merged, but reduced stages with data in LDS. At least in Vega's case.
Part of that is included in what I linked, as well as an ASCII table showing the merged stages. It's been brought up before, and I am left wondering what all the back and forth was about, since it seems the shift in the debate is that we've apparently both been arguing for my position. However, I think I could find some quotes that indicate that's not what happened.

Part of that hardware pipeline no longer exists though. So it is more than just a compiler vs hardware trade-off for optimization as referenced in the link above.
The comments on the code gave their reasons, and use cases, for monolithic and non-monolithic shaders. So long as there is more than one stage in the pipeline, whether the number is somewhat higher or lower does not affect the point I would think we were debating.
That the section is described as being part of the compiler middle-end really makes it very much about compiler trade-offs.
 
It is not sufficient to subdivide a 2D screen space tile repeatedly in order to control on-chip storage, and doing this repeatedly adds context as well. How the shader would know cache usage is maximized unless it is after the fact and too late isn't clear.
Control the number of bins and limit context more than just aggregate bin size. The size of the bins inevitably will be tied to the quantity of valid triangles. My argument was that pixel overdraw was replaced by paging of bins, ultimately resulting in more efficient execution.

The second condition seems to be trivially true, one would want an ideal bin size. However, if the ideal size is known, the trick would be to just set it to the ideal bin size. If not known, the conditions for ideal size have not been given.
The previous frame could serve as a means to estimate ideal size. Bin size probably needs to be understood in terms of both aggregate data size and screen space dimensions: aggregate size would remain roughly equal while screen area adjusts to the distribution. That is what I was hinting at with block encoding, where bits are focused on significant areas of the screen.

I think it's safe to assume the implementation is going to be iterated on for a while as it is rather complex.

With the instruction introduced in the Vega ISA doc, you get a conservative bounding box for a primitive--in pixels. Subdividing the area in pixels does not put a ceiling on how many primitives may have overlapping boxes, and cutting the area of overlap even more finely just gives you an increasing number of bins with no upper bound on their primitive count.
Pixels or bins? In this context they could be interchangeable. A 16x16 grid of bins and 256 pixels in screen space aren't all that different for example. One invokes a fragment shader, while the latter invokes some process to insert the geometry into appropriate bins. Not unlike PixelSync with OIT, where at some point the entire bin may need reevaluating based on new metrics. Cases where a primitive in the middle of the sequence occludes all current geometry.

Given the sparse description of the corner cases, it's potentially also a case where this sub-dividing shader goes on forever, if a fully covered tile is not considered fully optimal and this scheme keeps on subdividing areas of full overlap.
Not forever, as there would be some limit to context, but I could see a few levels deep. Bin/quadrant within a bin sort of arrangement. Some portions of screen space (skybox, UI, minimap, weapon, etc) would just be a waste of bin space if not dynamically sized.

One consideration we may not have touched on is that with binning the SEs could be remapped to a current bin. All working in relatively close proximity without interference. Until a memory channel comes into play. Solving some ROP alignment issues, but distributing the framebuffer with memory controllers could be fun.

You wouldn't be getting the necessary information from the Vega instruction to get a workable variable frequency of sampling between bins, and some amount of luck would be needed to contain the growth in complexity by having this so-called primitive shader changing the sampling, boundaries, and context growth of variable portions of the screen, then mapping them back up to the fixed target and execution granularities.
The primitive shader wouldn't actively influence the boundaries. The driver would most likely fix them in advance based on prior frames/results, with the PS just interpreting those results for binning. A pre-pass with a video-encoding type of mechanism to determine distribution. I'd agree one iteration of the function would be insufficient, but variable passes based on locality would work.

I do not see this as being supported by AMD's claims, either in patents or in marketing. The fetch-once feature of Vega doesn't work if the bins are allowed to overflow the chip. It would necessitate fetching whatever was pushed out again, and this hypothesis apparently needs AMD's rasterizer scheme to not close batches and then pass them to be stepped through bin by bin by the scan converter, which AMD's patents give as the process.
Not in published patents or marketing at least. But it seems reasonable this would be an area of active development. Current level of DSBR enablement leaves something to be desired.

Fetch once, to my understanding, was in regards to fragment data. No application to bins or geometry, which in most cases would never require multiple fetches anyway. Perhaps ID buffer techniques.

I have a feeling the hardware supports these possibilities, but AMD hasn't fully hashed out the implementation. Which, going back to the original primitive shader argument, I wouldn't consider strictly defined, but rather reflective of their current model at the time.

As for the scan converter, even with what I described some bins could still be flushed and repopulated with new data. Less efficient, but possibly better performing. Any pixel shader modifying depth could bypass the binning process entirely. Allowing final depth values to be read back and influence culling. I don't see my proposal as being inconsistent with what AMD has presented, just perhaps tweaked or loosely defined in current documentation.
 
Layman's question here: Couldn't the compiler draw conclusions about the (approximately) ideal bin size from data gathered by previous execution and then saved (and reused) alongside the shader cache? Would shader pre-warming help with that, if the compiler can adjust the tile size during run-time?
 
Control the number of bins and limit context more than just aggregate bin size.
I do not see any actual meaning in these words. Aggregate bin size and number of bins add up to the same thing, since bins are subdivisions of the same screen space.

The size of the bins inevitably will be tied to the quantity of valid triangles.
I am willing to discuss this if you can cite where this conclusion came from.
I could cite rafts of AMD statements, descriptions, and patches that do not align with this.

Pixels or bins? In this context they could be interchangeable.
The descriptions of a 2D screen space composed of pixels being subdivided into sub-regions called bins (where bins cover multiple pixels) seems to leave little room for confusion.
The documentation for the one instruction in the Vega ISA that mentions primitive shaders is explicitly dealing with extents measured in pixels.

A 16x16 grid of bins and 256 pixels in screen space aren't all that different for example.
If there is a citation for any time that bin=pixel, I'd be interested. Otherwise, I'm going with what's been disclosed before by AMD.

One invokes a fragment shader while the latter some process to insert the geometry into appropriate bins.
Neither a bin nor a pixel do either of these things.

Not in published patents or marketing at least. But it seems reasonable this would be an area of active development. Current level of DSBR enablement leaves something to be desired.
I'm not going to go that far into the weeds in the lack of data, or in defiance of what data there is. So long as future claims are clear that they are not to be contradicted by evidence, I will leave them at that.

Fetch once, to my understanding, was in regards to fragment data.
It's part of AMD's PR about the DSBR, and deals with primitives.

Layman's question here: Couldn't the compiler draw conclusions about the (approximately) ideal bin size from data gathered by previous execution and then saved (and reused) alongside the shader cache? Would shader pre-warming help with that, if the compiler can adjust the tile size during run-time?
If going by the following for Raven Ridge, the logic used for deciding on the tile size is based on matching up the amount of pixel context to what is available to store it on-chip:
https://cgit.freedesktop.org/mesa/mesa/commit/?id=c3ebac68900de5ad461a7b5a279621a435f5bcec

The heuristic seems to be calculating a number from the bytes per pixel being exported times the number of samples in the color case, and from depth and stencil enablement times the number of samples in the depth case.
The sum is then looked up in a series of arrays defined for the number of shader engines and for the number of RBEs per engine at each count.

It appears as if the measure here is how much pixel context can be stored without blowing out the on-chip storage in the worst case, and there's a set of x and y dimensions for tile size. If not there, it may be the case that the driver goes for no binning.
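To make the shape of that heuristic concrete, here is a minimal C sketch of the color-side lookup as I read it; the thresholds and bin dimensions below are placeholders picked for illustration, not the driver's actual tables (which also vary with SE and RBE counts):
Code:
#include <stdio.h>

/* Hypothetical sketch of the color-side heuristic: bytes of pixel context
 * exported per sample times the sample count gives a "sum", which selects
 * a screen-space bin size small enough that the worst case stays on-chip.
 * Thresholds and sizes are placeholders, not the driver's tables. */
struct bin_size { unsigned x, y; };

static struct bin_size lookup_color_bin_size(unsigned bytes_per_pixel,
                                             unsigned nr_samples)
{
    unsigned sum = bytes_per_pixel * nr_samples;

    if (sum <= 4)  return (struct bin_size){ 256, 256 };
    if (sum <= 8)  return (struct bin_size){ 128, 128 };
    if (sum <= 16) return (struct bin_size){ 128, 64 };
    if (sum <= 32) return (struct bin_size){ 64, 64 };
    return (struct bin_size){ 0, 0 };   /* too heavy: driver disables binning */
}

int main(void)
{
    struct bin_size b = lookup_color_bin_size(8 /* e.g. RGBA16F */, 4 /* 4x MSAA */);
    printf("bin = %ux%u\n", b.x, b.y);
    return 0;
}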
 
Layman's question here: Couldn't the compiler draw conclusions about the (approximately) ideal bin size from data gathered by previous execution and then saved (and reused) alongside the shader cache? Would shader pre-warming help with that, if the compiler can adjust the tile size during run-time?
For a canned benchmark, yes. In fact an IHV could stream in the configuration for the duration of a run, having manually evaluated all possibilities. The compiler likely wouldn't draw any conclusions, but could hint at an ideal tile size. That would only be one possible constraint on ideal tile size, however. There would still be the hardware factors, such as number of SEs, cache sizes, etc., that need to be worked around.

Aggregate bin size and number of bins add up to the same thing, since bins are subdivisions of the same screen space.
One is measured in bytes. As bins cover less screen space or triangles become larger, the context data increases as the geometry will more frequently intercept multiple bins. A large primitive could intercept every bin and require storing a reference in each. If a bin holds 16 primitives, there exists a case where 256 bins could only store 16 total primitives. The other issue is that an even bin distribution would be less than ideal. Especially with tessellation, geometric detail should be concentrated around edges. Varying bin dimensions within a frame by corresponding mip levels may be advantageous. If you look at that commit the bins are following the stencils. Small screen space bins should only be desirable when there is a high chance of occlusion, which implies edges. Looking across a barren terrain for example, a third or more of screen space could be empty. The bins would be largely wasted.
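As a toy illustration of the byte-count argument (my own numbers, not AMD's bin format): a primitive whose box spans the whole screen leaves a reference in every bin, so reference storage scales with bin count even when few distinct primitives exist.
Code:
#include <stdio.h>

/* Toy model: a primitive whose bounding box spans the whole screen leaves
 * a reference in every bin, so reference storage scales with bin count
 * even though only a handful of distinct primitives exist. */
#define GRID_X        16
#define GRID_Y        16
#define BIN_CAPACITY  16   /* references each bin can hold in this toy model */

int main(void)
{
    unsigned fullscreen_prims = BIN_CAPACITY;  /* enough to fill every bin */
    unsigned total_refs = fullscreen_prims * GRID_X * GRID_Y;

    printf("%u primitives -> %u bin references (%u bins at capacity)\n",
           fullscreen_prims, total_refs, GRID_X * GRID_Y);
    return 0;
}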

The descriptions of a 2D screen space composed of pixels being subdivided into sub-regions called bins (where bins cover multiple pixels) seems to leave little room for confusion.
The documentation for the one instruction in the Vega ISA that mentions primitive shaders is explicitly dealing with extents measured in pixels.
That's not what I'm saying, however multi-sampling could be viewed that way. I'm suggesting recursively binning into multiple levels alongside a depth/stencil mip tree. Repeatedly testing coverage, essentially scanning out, a primitive at each level of the tree to test occlusion. That's why I said pixels and bin intercepts are synonymous. Only difference is resolution in this case.
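A minimal sketch of the recursive test being proposed here, assuming a conservative max-depth quadtree; the node layout and depth convention are my assumptions, not a description of Vega hardware:
Code:
#include <stdbool.h>

/* Walk a conservative max-depth mip tree, skipping subtrees the primitive's
 * bounding box does not overlap, and stop descending where the box is
 * provably occluded. Assumed data layout, not Vega hardware. */
struct depth_node {
    float max_depth;                      /* farthest depth under this node */
    unsigned x0, y0, x1, y1;              /* screen-space extent of the node */
    const struct depth_node *child[4];    /* NULL at the finest level */
};

static bool box_overlaps(const struct depth_node *n,
                         unsigned x0, unsigned y0, unsigned x1, unsigned y1)
{
    return x0 <= n->x1 && x1 >= n->x0 && y0 <= n->y1 && y1 >= n->y0;
}

/* Returns true if any leaf overlapped by the box might still be visible. */
static bool maybe_visible(const struct depth_node *n, float prim_near_depth,
                          unsigned x0, unsigned y0, unsigned x1, unsigned y1)
{
    if (!box_overlaps(n, x0, y0, x1, y1))
        return false;
    if (prim_near_depth > n->max_depth)
        return false;                 /* whole subtree is behind stored depth */
    if (!n->child[0])
        return true;                  /* finest level reached: bin it here */
    for (int i = 0; i < 4; i++)
        if (maybe_visible(n->child[i], prim_near_depth, x0, y0, x1, y1))
            return true;
    return false;
}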

Consider the following for Early Fragment Test:
Early fragment tests, as an optimization, exist to prevent unnecessary executions of the Fragment Shader. If a fragment will be discarded based on the Depth Test (due perhaps to being behind other geometry), it saves performance to avoid executing the fragment shader. There is specialized hardware that makes this particularly efficient in many GPUs.
...
The OpenGL Specification states that these operations happens after fragment processing. However, a specification only defines apparent behavior, so the implementation is only required to behave "as if" it happened afterwards.
Therefore, an implementation is free to apply early fragment tests if the Fragment Shader being used does not do anything that would impact the results of those tests. So if a fragment shader writes to gl_FragDepth, thus changing the fragment's depth value, then early testing cannot take place, since the test must use the new computed value.

https://www.khronos.org/opengl/wiki/Early_Fragment_Test
This is what I'm arguing the instruction is accelerating. Accelerating the binning process alongside a stencil mip tree. The actual culling process should have existed for some time. AMD only recently added bins and deferred raster a bit along with it.

If there is a citation for any time that bin=pixel, I'd be interested. Otherwise, I'm going with what's been disclosed before by AMD.
Definition of rasterization or scan out? They are both sampling onto an ordered grid, just with different resolution. One detects coverage, the other a bin intercept. Semantics aside they are performing the same operation with different numbers and at their most basic detecting intercepts.

Neither a bin nor a pixel do either of these things.
Those are the ultimate outcomes of the process. Yes there are some added stages in there, but they primarily relate to organizing the data for the architecture.

It's part of AMD's PR about the DSBR, and deals with primitives.
The problem is when and what data is being fetched. Deferred attribute interpolation for example would be part of that. Being that it is likely more efficient to assemble, transform and cull with FP16 on Vega, fetch once is a bit disingenuous. Data is only fetched once, but for the most part not passed along throughout the pipeline anyway. It's not until post-culling that most of the data would be fetched. The bigger savings would be avoiding overdraw unless pushing a lot of culled geometry. Which is why I suggested fetch once with a DSBR approach would seem more applicable: fetching data for a pixel only once and limiting overdraw.
 
One is measured in bytes. As bins cover less screen space or triangles become larger, the context data increases as the geometry will more frequently intercept multiple bins.
The existing disclosures state this is building a batch of primitives before evaluating bins.
The pipeline accumulates primitives into a batch, and stores references to what bins each primitive intercepts.
The batch's context (or estimated context size) grows with each primitive, but the bins are evaluated at a later stage with a more firm limit on their impact since the process works through bins in a sequential fashion.

A large primitive could intercept every bin and require storing a reference in each. If a bin holds 16 primitives, there exists a case where 256 bins could only store 16 total primitives.
The size of the batch is a variable number of primitives, which have references to what bins they intercept. This may be a set or range of bins, or an initial bin intercept with the understanding that further bin intercepts will be calculated as that bin is processed (being processed sequentially helps with this).
If a condition is met where the pipeline determines further batch growth will cause something to exceed on-chip storage, it would close the batch. AMD has not exhaustively listed context types that could trigger this condition.
The bin size appears to be primarily concerned with whether the worst-case amount of pixel context being tracked can overflow on-chip storage, which is built into the sample count and format of the target buffer(s). That determination is based on the number of pixels as determined by the screen-space dimensions of the bin, rather than the primitive count that is more of a concern for the prior batch generation step. With occlusion, there could be more batched primitives being processed for a bin than there are pixels in the 2D area of the bin.
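A minimal sketch of that batch-then-bins flow; the struct layout, capacity, and budget check are assumptions for illustration, since AMD has not published the actual format or trigger conditions:
Code:
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical batch record: primitives accumulate along with the range of
 * bins their bounding boxes intercept; bins are only walked after the batch
 * closes, which bounds the per-bin processing cost. */
#define MAX_BATCH_PRIMS 256

struct prim_ref {
    uint16_t bin_min_x, bin_min_y;   /* conservative bin intercept range */
    uint16_t bin_max_x, bin_max_y;
};

struct batch {
    struct prim_ref prims[MAX_BATCH_PRIMS];
    unsigned count;
    unsigned context_bytes;          /* estimated on-chip context so far */
};

/* Close the batch if adding this primitive would exceed an assumed on-chip
 * budget; the real trigger conditions are not exhaustively documented. */
static bool batch_add(struct batch *b, struct prim_ref p,
                      unsigned prim_context_bytes, unsigned budget_bytes)
{
    if (b->count == MAX_BATCH_PRIMS ||
        b->context_bytes + prim_context_bytes > budget_bytes)
        return false;                /* caller closes batch, starts a new one */

    b->prims[b->count++] = p;
    b->context_bytes += prim_context_bytes;
    return true;
}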

If you look at that commit the bins are following the stencils.
Which commit, and which portion?

Small screen space bins should only be desirable when there is a high chance of occlusion, which implies edges.
There are various parameters that need to be balanced against, such as the granularity of the hardware paths that can most quickly cull primitives, or what parts of the pipeline cannot be subdivided without utilization concerns. Having many smaller bins also raises the cost of formerly trivial conservative checks. If a larger occluding triangle covers multiple bins fully, what would have been a small number of trivial checks scales with the number of additional bins the original set was subdivided into, and increases the encoding size for bin identifiers. If handling even smaller triangles, it also increases the chance of a bin having very few actionable primitives (scan converter works on unchanging pixel/sample point scale) whereas increasing complexity in bin evaluation raises the cost of selecting or using each bin and slows their submission to the back end.

Looking across a barren terrain for example, a third or more of screen space could be empty. The bins would be largely wasted.
If truly empty of geometry, no primitives in any batches would reference those bins, and they would not be stepped through by the DSBR. Even in the absence of a DSBR, what does a vast swath of screen space with no geometry cost, when the front end wouldn't be evaluating it and no shaders would be invoked for it?

That's not what I'm saying, however multi-sampling could be viewed that way. I'm suggesting recursively binning into multiple levels alongside a depth/stencil mip tree. Repeatedly testing coverage, essentially scanning out, a primitive at each level of the tree to test occlusion. That's why I said pixels and bin intercepts are synonymous. Only difference is resolution in this case.
To be clear, this is adding an uncertain number of iterations on a per-triangle basis for every triangle covering one or more bins, and cross-referencing the subdivisions arrived at by prior triangles or ignoring them and then merging checks afterwards, repeatedly accessing or calculating coverage and depth, coupled with a tree traversal of unspecified time complexity and memory access behavior.
This is between vertex/primitive shader export and wavefront launch, where there are 4 points chip-wide that can go through this.

On top of that, this looks to be setting up values as part of a hardware context change. The number of those is limited, and exceeding the chip's support for them is globally expensive. Changing at the granularity of individual primitives does seem expensive.

Consider the following for Early Fragment Test:
This states the overall API-level decision that permits a GPU to suppress fragment shader invocation if the fragment does not affect final rendering, which the DSBR tries to accomplish more frequently. What does this alter in a debate concerning work and culling done prior to fragment shader launch, which I would have assumed has this as a given?

This is what I'm arguing the instruction is accelerating. Accelerating the binning process alongside a stencil mip tree.
What does this instruction provide that gives more context about which bins a triangle covers?
It takes a conservative bounding box composed of 2 bits each from the x_min,y_min and x_max,y_max values of the bounding box, and the output is a 4-bit mask for coverage. The bounding box's dimensions are in pixels measured in terms of overall screen space. Anything that covers 4 or more tiles has the same output.
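For illustration, a guess at the general shape of such an operation; the bit positions and the table contents here are placeholders, not the operand encoding or LUT values from the ISA document:
Code:
#include <stdint.h>

/* Placeholder sketch: two bits each of the bounding box's min/max
 * coordinates form an index into a lookup table that yields a 4-bit
 * shader-engine coverage mask. The bit positions (assumed here to sit
 * around bit 5, the 32-pixel scale) and the table contents are NOT the
 * values from the Vega ISA document. */
static const uint8_t se_mask_lut[256] = { /* placeholder values */ 0 };

static uint8_t se_coverage_mask(unsigned x_min, unsigned y_min,
                                unsigned x_max, unsigned y_max)
{
    unsigned idx = ((x_min >> 5) & 3)
                 | (((y_min >> 5) & 3) << 2)
                 | (((x_max >> 5) & 3) << 4)
                 | (((y_max >> 5) & 3) << 6);
    return se_mask_lut[idx] & 0xf;   /* one bit per shader engine */
}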

What is this process doing with the various mask outcomes? If all SEs are covered, further iteration creates more bins that might be covered. If not all bits are set, further subdivision in effect creates more bins that do not have coverage but put work into a primitive anyway, and more bins that might have coverage and will be performing work anyway. At some point, the system needs to move on and submit something.

Whether the tile size can vary beyond what is documented in the ISA isn't clear, but the granularity isn't inconsistent with the bin sizes offered in the Raven Ridge binning patch, although it is conservative in the narrowest bin case. This may indicate a difference in the "tile" AMD's white paper calls a bin, versus the tile related to the SE backends mentioned by the instruction. This may in the final account change what level of concurrency there is in this part of the process, and in conjunction with the values in the binning patch may mean we've seen most if not all of the options that are available.

Definition of rasterization or scan out? They are both sampling onto an ordered grid, just with different resolution.
A bin is a region of screen space that is many times larger and can vary against the overall subdivided screen space, whereas a pixel is a fixed component of both. Saying they occur at different resolutions when one is literally made of the other seems like a problem in treating them interchangeably. The rules concerning what is allowed to be conservative and inaccurate in spatial terms are also reserved for the bin.
What the scan converter's output actually creates for wave launch and active lanes is also not 1:1 to either. Neither actually performs actions; rather, they are a measure of the alignment and allocations for the elements that perform the actual actions and checks.

The problem is when and what data is being fetched. Deferred attribute interpolation for example would be part of that.
In the description of a visibility buffer type deferred attribute interpolation, the fragment shader would have a primitive ID that would look back into a buffer. That method would be after the DSBR, although the fully deferred DSBR mode behaves like a localized visibility buffer.
For the purposes of outlining a specific feature of the DSBR and not in conjunction with a hypothetical use of the optional primitive shader, I think AMD was not discussing that particular technique. The shader-based methods involve writing to a buffer that the DSBR would not be aware of, at any rate.

How fetch-once is accomplished is not clear, perhaps by storing the necessary geometry stage outputs into dedicated storage or careful management and back-pressure from the DSBR or pixel shader wavefront creation stage.

Being that it is likely more efficient to assemble, transform and cull with FP16 on Vega, fetch once is a bit disingenuous. Data is only fetched once, but for the most part not passed along throughout the pipeline anyway.
This is saying one thing and then immediately stating the other. Primitive-side data being fetched once seems like it can be honestly described as "fetch-once", it's not fetch-once-and-resolve-all-future-concerns. So long as data needs from vertex export through pixel shader invocation for a triangle are kept on-chip, it seems like a fair descriptor.
 
Which commit, and which portion?
Code:
+static struct uvec2 si_get_depth_bin_size(struct si_context *sctx)
+{
+    struct si_state_dsa *dsa = sctx->queued.named.dsa;
+
+    if (!sctx->framebuffer.state.zsbuf ||
+        (!dsa->depth_enabled && !dsa->stencil_enabled)) {
+        /* Return the max size. */
+        struct uvec2 size = {512, 512};
+        return size;
+    }
+
+    struct r600_texture *rtex =
+        (struct r600_texture*)sctx->framebuffer.state.zsbuf->texture;
+    unsigned depth_coeff = dsa->depth_enabled ? 5 : 0;
+    unsigned stencil_coeff = rtex->surface.flags & RADEON_SURF_SBUFFER &&
+                 dsa->stencil_enabled ? 1 : 0;
+    unsigned sum = 4 * (depth_coeff + stencil_coeff) *
+               sctx->framebuffer.nr_samples;

Having many smaller bins also raises the cost of formerly trivial conservative checks. If a larger occluding triangle covers multiple bins fully, what would have been a small number of trivial checks scales with the number of additional bins the original set was subdivided into, and increases the encoding size for bin identifiers. If handling even smaller triangles, it also increases the chance of a bin having very few actionable primitives (scan converter works on unchanging pixel/sample point scale) whereas increasing complexity in bin evaluation raises the cost of selecting or using each bin and slows their submission to the back end.
Not if bin distribution was hinted from prior frames. A large primitive occluding multiple bins should result in fewer bins in that area. That should result in fewer conservative checks at a higher mip level more likely to be cached. The concern would be incorrectly predicting bin sizes. Result in that case just being a loss in efficiency, but still retaining some benefits over not binning.

If truly empty of geometry, no primitives in any batches would reference those bins, and they would not be stepped through by the DSBR. Even in the absence of a DSBR, what does a vast swath of screen space with no geometry cost, when the front end wouldn't be evaluating it and no shaders would be invoked for it?
It wouldn't be truly empty, just sparsely populated with a handful of primitives. A wall could cross several bins with only one or two triangles. Outside of rendering a jungle, there would be large portions of screen space with little geometric detail.

The batches would also be rather large going off the size of the parameter cache. Over 1MB in size as I recall, with what may be pre-culled vertices. In the case a compute shader ran positions at FP16, frustum and zero coverage may already be culled. Possibly even occlusion before hitting the front end. Leaving the front ends to an accurate transform with only z or stencil tests from the current scene.

To be clear, this is adding an uncertain number of iterations on a per-triangle basis for every triangle covering one or more bins, and cross-referencing the subdivisions arrived at by prior triangles or ignoring them and then merging checks afterwards, repeatedly accessing or calculating coverage and depth, coupled with a tree traversal of unspecified time complexity and memory access behavior.
This is between vertex/primitive shader export and wavefront launch, where there are 4 points chip-wide that can go through this.
Conservative Z shouldn't have memory access problems for the traversal depending on tree level used as the entire surface should be compressible and reside entirely in cache. Even if overflowing, the bins should allow front ends to operate in close proximity to share data.

What does this alter in a debate concerning work and culling done prior to fragment shader launch, which I would have assumed has this as a given?
Just that prior architectures likely were scanning out along bin boundaries already, just without actual binning.

What does this instruction provide that gives more context about which bins a triangle covers?
It takes a conservative bounding box composed of 2 bits each from the x_min,y_min and x_max,y_max values of the bounding box, and the output is a 4-bit mask for coverage. The bounding box's dimensions are in pixels measured in terms of overall screen space. Anything that covers 4 or more tiles has the same output.
It would reduce the number of instructions required to perform the test traditionally. Assuming pow2 bin dimensions, bit shifts would be sufficient to calculate screen space in pixels following a quad tree. Testing a primitive against all four quads and recursively narrowing it down for more accurate coverage tests. As a conservative test, the bounding boxes could always be pow2 dimensions.
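A minimal sketch of that shift-based quadrant narrowing, assuming power-of-two region sizes and coordinates local to the parent region; purely illustrative, not AMD's implementation:
Code:
/* Quadrant test with power-of-two dimensions: coordinates are local to the
 * parent region, whose side is 2*half pixels, so "which quadrant" is just a
 * compare against half (equivalently, one bit of the coordinate). To
 * descend, subtract half from the chosen quadrant's coordinates and halve
 * half. Purely illustrative. */
static unsigned quadrant_mask(unsigned x_min, unsigned y_min,
                              unsigned x_max, unsigned y_max,
                              unsigned half)
{
    unsigned qx0 = x_min >= half, qx1 = x_max >= half;
    unsigned qy0 = y_min >= half, qy1 = y_max >= half;
    unsigned mask = 0;

    for (unsigned qy = qy0; qy <= qy1; qy++)
        for (unsigned qx = qx0; qx <= qx1; qx++)
            mask |= 1u << (qy * 2 + qx);   /* one bit per covered quadrant */
    return mask;
}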

What is this process doing with the various mask outcomes? If all SEs are covered, further iteration creates more bins that might be covered. If not all bits are set, further subdivision in effect creates more bins that do not have coverage but put work into a primitive anyway, and more bins that might have coverage and will be performing work anyway. At some point, the system needs to move on and submit something.
Four SEs would seem an incorrect way of looking at this. With binning there would be one bin per SE processed at any given time. The order of bin submission would be irrelevant so long as they weren't overlapping. The result always being to render triangles in submission order. Some care taken with memory alignment of course.

As for binning, further iteration wouldn't create more bins, but more accurately intercept existing bins. The quadtree approach I mentioned above. Keep in mind all the referenced bin sizes were powers of two, keeping the math easy.
 
*code snippet*

That's something I've commented on. There's separate bin size evaluation logic for depth. Stencil adds context that needs to be tracked. All it does is check if stencil is enabled, and slightly increases the cost function for the depth bin.

My interpretation is that both data types are evaluated for bin dimensions in their respective tables, which are likely different because of the differing hardware paths and architectural scaling behaviors.
The smaller area of the two is the one the driver sets as the bin size overall. This is conservatively setting bin size to keep the worst-case data payload in either the depth or color path from spilling.
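In code terms, my reading of that selection is simply the following (a sketch, not a quote from the driver; uvec2 mirrors the type used in the commit):
Code:
/* Compute a candidate bin size for the color path and one for the
 * depth/stencil path, then keep whichever covers the smaller screen-space
 * area, so neither path's worst case can spill on-chip storage. */
struct uvec2 { unsigned x, y; };

static struct uvec2 pick_bin_size(struct uvec2 color, struct uvec2 depth)
{
    return (color.x * color.y <= depth.x * depth.y) ? color : depth;
}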

Not if bin distribution was hinted from prior frames. A large primitive occluding multiple bins should result in fewer bins in that area.
Indirection to a structure built in a prior frame. That's extra references, and since it's from a prior frame its initial placement is resident in memory. Even if on-chip it's going to have a minimum cycle cost and finite number of accesses per cycle.
This gets done for every primitive.
It's a hint, and could be wrong. The omission or movement of that large primitive could mean that those bins could face massively different loads in the next frame.

Also, is this still maintaining the assumption of a 2D grid's level of adjacency?
If so, then a given row's Y-height would need to be constant, same with the stride in the X-dimension for a column. If trying to vary both, it snaps everything to a fixed granularity very quickly, and fewer bins in any area means fewer and larger bins overall.
The math would require referencing one or more variables to determine stride and build a variable list of bin intercepts. It's not particularly flexible since one big sparse bin will constrain the dimensions of those sharing its row and column.
If not, then determining intercepts and affected neighbors becomes more intensive and variable in time&space than a fixed grid.

The concern would be incorrectly predicting bin sizes. Result in that case just being a loss in efficiency, but still retaining some benefits over not binning.
If the GPU spills something it needs for the DSBR or SPI, it can potentially hamper or halt at least 1/4 of the GPU's wavefront creation for hundreds to thousands of cycles, with potential knock-on effects globally. This part of the system is not described as being very concurrent.

The batches would also be rather large going off the size of the parameter cache.
There's a maximum batch size mentioned in the commit, and disclosures elsewhere indicate the amount it would dedicate to a batch is limited.

Conservative Z shouldn't have memory access problems for the traversal depending on tree level used as the entire surface should be compressible and reside entirely in cache.
To be clear, isn't this in the context of a geometry front end's culling and primitive shader? One of the major goals of a primitive shader's culling is to avoid sacrificing a cycle per primitive in the fixed-function front end.
Even checking one more buffer is going to cost cycle time, and this is still iterating or looping at non-zero cost.
How many cycles are we devoting to saving that one cycle in primitive assembly?

Even if overflowing, the bins should allow front ends to operate in close proximity to share data.
The hardware cannot know that once it spills into an autonomous domain of the SOC. It might often be true, but this is a process documented to have much tighter constraints, and it would be left hoping something doesn't thrash. It likely means coverage or depth data the DSBR needs right in the middle of a primitive stream has become 10-100x more expensive to check or modify. The L2's latency is still tens of cycles, even without contention.

Just that prior architectures likely were scanning out along bin boundaries already, just without actual binning.
I'm going to save space and not copy verbatim the text that was put in as relevant.
Let's say the API allowing fragment shaders for fragments that will eventually fail a depth test to be skipped is point A.
Let's say the text citing it and going on about culling and using a MIP tree, where said primitive gets scanned out at every level is point B.
Let's say the above, saying GPUs likely already divided the screen into sections is C.

First off, yes, prior GPUs already subdivided the screen, this is another item that has not been up for debate and bringing it up only confuses matters.
Second, can the path from A-B-C be described concisely? I'm not seeing a tight relationship between the claims.

It would reduce the number of instructions required to perform the test traditionally. Assuming pow2 bin dimensions, bit shifts would be sufficient to calculate screen space in pixels following a quad tree.
The binary-convenient dimensions and the limited number of them can also point to there being hard-wired paths or programmable selection of a small number of simple circuits to calculate many of the values in-pipeline with zero instructions to be fetched or cache accesses to miss. It's particularly easy if it's a hardware mode register that sets a known fixed-stride bin pattern.

As a conservative test, the bounding boxes could always be pow2 dimensions.
The instruction takes 2 bits from the X,Y(max,min) of the screen-space global coordinates of a triangle's max/min bounding box.
These coordinates do not change, and the bits used do not change.
To not be wrong requires that all primitives only ever have power of 2 global coordinates or that their boxes snap to a coarser power of 2-aligned area that still contains the box. Beyond blowing up the size of the conservative bounding box, the output of the instruction would never change.

Four SEs would seem an incorrect way of looking at this. With binning there would be one bin per SE processed at any given time.
SE coverage is the sole purpose of the instruction, and it is explicitly targeted at 4 SEs.
What would an instruction dedicated to determining if more than one SE is involved be used for in this case?
The coverage determination within a bin does not appear to be a major limiter to hardware as-is. I'm not seeing where all this additional shader code is getting savings from.

The order of bin submission would be irrelevant so long as they weren't overlapping. The result always being to render triangles in submission order.
There's been no question as to whether triangles are being rendered in-order. I would note that AMD's disclosures about its binning rasterizer explicitly indicate using a consistent marching order, and there seem to be good functional reasons for it. I do not think that the opposite can be casually asserted as fact.

Keep in mind all the referenced bin sizes were powers of two, keeping the math easy.
The referenced bin sizes are also referenced once per target surface, using resource parameters known to the code or overall application. There's no reference to any geometry or objects being rendered.
 
It's a hint, and could be wrong. The omission or movement of that large primitive could mean that those bins could face massively different loads in the next frame.
It could, but does it significantly matter? The result is just the bin filling more quickly. The concern would be dependencies from bins being in order. Outside of a fully TBDR approach, bins should be readily filling.

Also, is this still maintaining the assumption of a 2D grid's level of adjacency?
If so, then a given row's Y-height would need to be constant, same with the stride in the X-dimension for a column. If trying to vary both, it snaps everything to a fixed granularity very quickly, and fewer bins in any area means fewer and larger bins overall.
The math would require referencing one or more variables to determine stride and build a variable list of bin intercepts. It's not particularly flexible since one big sparse bin will constrain the dimensions of those sharing its row and column.
If not, then determining intercepts and affected neighbors becomes more intensive and variable in time&space than a fixed grid.
Locality more than adjacency I'd say.

With only a handful of levels all aligned to the same pow2 grid, it shouldn't be all that tricky to decode. Dedicated addressing units would definitely speed the process with indirection though. They could reside in TMUs for various swizzling functionality.

In the case of binning, neighbors shouldn't be affected beyond addressing. Some added logic could decide the addresses rather easily.

If the GPU spills something it needs for the DSBR or SPI, it can potentially hamper or halt at least 1/4 of the GPU's wavefront creation for hundreds to thousands of cycles, with potential knock-on effects globally. This part of the system is not described as being very concurrent.
Not necessarily if primitive shaders are viewed as async compute tasks. Each CU could spill to memory, with the results later read back in. Replace the usual front end with async compute tasks, with the actual front ends processing bins of indexed primitives. In the case of fetch once, if running FP16 async positions it holds true. Culling would occur before even bothering with other attributes. That doesn't seem all that different from current async compute tasks. Using idle CU cycles in the face of a front end bottleneck. Leaving the front ends to mostly valid, sorted primitives.

To be clear, isn't this in the context of a geometry front end's culling and primitive shader? One of the major goals of a primitive shader's culling is to avoid sacrificing a cycle per primitive in the fixed-function front end.
Even checking one more buffer is going to cost cycle time, and this is still iterating or looping at non-zero cost.
How many cycles are we devoting to saving that one cycle in primitive assembly?
See above where it wouldn't be the front ends' cycles. I need to think through that approach a bit more, but limiting primitive shaders to assembly from within an established bin might be plausible here. Not apparent from what has been presented though.

The binary-convenient dimensions and the limited number of them can also point to there being hard-wired paths or programmable selection of a small number of simple circuits to calculate many of the values in-pipeline with zero instructions to be fetched or cache accesses to miss. It's particularly easy if it's a hardware mode register that sets a known fixed-stride bin pattern.
I've avoided that part of the implementation for simplicity, but I'd agree dedicated addressing units would speed things along and be rather simple. With 512 possible addresses/bins the logic should be minimal.

The instruction takes 2 bits from the X,Y(max,min) of the screen-space global coordinates of a triangle's max/min bounding box.
These coordinates do not change, and the bits used do not change.
To not be wrong requires that all primitives only ever have power of 2 global coordinates or that their boxes snap to a coarser power of 2-aligned area that still contains the box. Beyond blowing up the size of the conservative bounding box, the output of the instruction would never change.
The bounding box may be larger, but iterating with the function until hitting a bin resolution would retain most of that accuracy. As all results are conservative, accuracy in the case of scan out isn't required.

SE coverage is the sole purpose of the instruction, and it is explicitly targeted at 4 SEs.
What would an instruction dedicated to determining if more than one SE is involved be used for in this case?
If that were the case, why add it to the ISA? Small cost to that function, but Raven isn't exactly running 4 SEs to ever need it. The only way I see it being useful there is if Raven is binning for a dGPU and primitive shaders are more decoupled than we've been assuming. That's why I've been assuming it's binning along a quadtree.

There's been no question as to whether triangles are being rendered in-order. I would note that AMD's disclosures about its binning rasterizer explicitly indicate using a consistent marching order, and there seem to be good functional reasons for it. I do not think that the opposite can be casually asserted as fact.
The functional reasons seem to be the ordering guarantee. There exist options to relax that constraint, and drivers already switch to out-of-order when they can. What effect this would have on binning is unclear, but I'd imagine it scales better if occlusion isn't tied to submission order.
 
It could, but does it significantly matter? The result is just the bin filling more quickly. The concern would be dependencies from bins being in order. Outside of a fully TBDR approach, bins should be readily filling.
If getting it very wrong doesn't matter, why would getting it somewhat right matter? Why bother if it's that irrelevant?

Locality more than adjacency I'd say.
The number of neighbors a bin can have on a 2D plane with fixed grid simplifies a lot of things and allows for fast determination of intercepts in simple hardware without involving extra context or an external program.
How readily would bin intercepts be calculated for a triangle running off the top edge of a bin when the row above could have been subdivided into smaller (variable?) regions multiple powers of 2 smaller or larger? Some optimizations for the coarse allocation step AMD has mentioned include calculating the min and max bins along the axes, to save the effort of checking or calculating anything for the likely vast majority of bin IDs out of range. It's not as straightforward if a max or min value in one row or column has no bearing at all on the rest.
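On a fixed grid that optimization is only a handful of shifts; a sketch with assumed power-of-two bin dimensions (placeholders, not Vega's values):
Code:
/* Fixed power-of-two grid: the only bins a bounding box can intercept lie
 * in [x0..x1] x [y0..y1], computed with four shifts; everything outside
 * that range is skipped without per-bin checks. Bin dimensions below are
 * placeholders. */
#define BIN_SHIFT_X 7   /* 128-pixel-wide bins */
#define BIN_SHIFT_Y 6   /* 64-pixel-tall bins  */

struct bin_range { unsigned x0, y0, x1, y1; };

static struct bin_range bin_intercept_range(unsigned x_min, unsigned y_min,
                                            unsigned x_max, unsigned y_max)
{
    struct bin_range r = {
        x_min >> BIN_SHIFT_X, y_min >> BIN_SHIFT_Y,
        x_max >> BIN_SHIFT_X, y_max >> BIN_SHIFT_Y,
    };
    return r;
}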

With only a handful of levels all aligned to the same pow2 grid, it shouldn't be all that tricky to decode. Dedicated addressing units would definitely speed the process with indirection though. They could reside in TMUs for various swizzling functionality.
This is competing with a fixed scheme that could be done in the pipeline independently and in parallel with a shader, particularly helpful with the limited straightline performance of a CU. Placing it on the texture pipeline also limits throughput, since it is not as wide as a SIMD and can be under heavy contention versus a scheme that can mathematically derive the necessary values without trips to memory.


Not necessarily if primitive shaders are viewed as async compute tasks.
It seems complex since they're in the graphics domain and are directly evaluating or routing geometry data, some of it straight from a hardware stage of the synchronous pipeline. They then export right into another hardware stage of the graphics pipeline.

Each CU could spill to memory, with the results later read back in.
Vega is all about trying to keep as much as it can on-chip, using the LDS to join several stages, culling to help save parameter cache, a setup pipeline that tries to fetch primitives once, and a rasterizer that attempts to accumulate depth and coverage internally before launching pixel shaders for just the ones whose results will matter in the end.
Prior to that, the hardware and driver did try to balance things enough so that the system could at least somewhat frequently keep things local.

It seems counterproductive for a primitive shader to be so retentive that it is forced to stream out and fetch back in everything in all cases.


See above where it wouldn't be the front ends' cycles. I need to think through that approach a bit more, but limiting primitive shaders to assembly from within an established bin might be plausible here. Not apparent from what has been presented though.
I think this is not apparent because this places the primitive shader out of order with its place in the process.
How a primitive shader can be limited to a bin or run at all when that bin is populated by the output of that same shader is unclear.

I've avoided that part of the implementation for simplicity, but I'd agree dedicated addressing units would speed things along and be rather simple. With 512 possible addresses/bins the logic should be minimal.
Could you clarify what you mean by 512 possible addresses?
The small bin sizes and large buffers could yield more bins than that.

The bounding box may be larger, but iterating with the function until hitting a bin resolution would retain most of that accuracy. As all results are conservative, accuracy in the case of scan out isn't required.
This is snapping a conservative bounding box to a power of two in the screen space the triangle's coordinates are in. Is this using powers of 2 the literal powers of 2 in the absolute pixel grid, or snapping to the nearest bin boundary-aligned coordinates?
I think there's a problem with uniformity and how much an already conservative bounding box will stretch.
Per the documented limits of the instruction, the result is trivially 0xf for any extents >32 pixels. A power of 2 alignment requirement would blow up the area and bins evaluated while permitting virtually no relevant output. Aligning to a bin dimension (but this requires knowing the bin size we haven't determined yet) is not insane from the outset, but most bin dimensions would very quickly hit the trivial 0xf condition as well.

If that were the case, why add it to the ISA? Small cost to that function, but Raven isn't exactly running 4 SEs to ever need it. The only way I see it being useful there is if Raven is binning for a dGPU and primitive shaders are more decoupled than we've been assuming. That's why I've been assuming it's binning along a quadtree.
I'm actually not sure why the instruction as it exists was considered that good of an idea.
It's the Vega ISA document, not just Raven Ridge. The document flat out states that GPUs with fewer than 4 SEs use regular math. One SE seems trivial to resolve, while I think two SEs can be determined by whether the extents along any axis exceed 32 pixels, or if bit 5 flips along an axis.
It would shave off some instructions and shader context needed to determine the more complex 4 SE case. The lookup table's values are in the document, so perhaps even that could be implemented without the instruction if need be.
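A sketch of that "regular math" two-SE check, assuming a 32-pixel split between the two engines (the layout is an assumption on my part):
Code:
#include <stdbool.h>

/* Assumed layout: screen space split between two shader engines on a
 * 32-pixel checkerboard. A bounding box then touches both engines if
 * either extent exceeds 32 pixels, or if bit 5 of a coordinate differs
 * between the min and max corner. */
static bool touches_both_se(unsigned x_min, unsigned y_min,
                            unsigned x_max, unsigned y_max)
{
    if (x_max - x_min > 32 || y_max - y_min > 32)
        return true;
    return ((x_min ^ x_max) & 32) || ((y_min ^ y_max) & 32);
}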

I've stated before that I feel that this isn't the sort of low-level detail that gives long-term benefits by bleeding into an ISA. It often points to an amount of short-term thinking, economy of effort, or fragility that can mean its presence could be a cause or symptom of other troubles.
I'm not sure how much speed is added by it, and if it's too modest it may not bode well for the absolute level of improvement a primitive shader that uses it can provide.
 
Well, he sort of contradicts himself in a way, as he hints at a 2019 launch.

Hence it will be more than just a fix; a better process is highly likely too. That alone could have been a reason for the refresh.

(not saying fixing stuff isn't one)
 
https://hardforum.com/threads/amd-plans-to-release-vega-refresh-in-2018.1949215/

About a vega refresh, which fix hardware problem and enable currently disabled features...

Does someone know this mockingbird guy?
Makes no sense.
The dude (well, okay, shitposter) suggests fixing software problems with new silicon?
No, but it makes sense. I've voiced my opinion before that we will have a fixed Vega, just like we had a fixed Fermi.
Fermi was a fucked up node shrink that yielded like total shit.
Vega is built on a very, very mature process.
 
Makes no sense.
The dude (well, okay, shitposter) suggests fixing software problems with new silicon?

Fermi was a fucked up node shrink that yielded like total shit.
Vega is built on a very, very mature process.

What? Fermi was a node shrink? What are you talking about? It was a new architecture that brought things like distributed geometry. Fermi's problems were reportedly due to the interconnect, nothing to do with the node.
And that guy is the shitposter, wow!
 