It makes me wonder if Vega would be better off without hierarchical-Z support. Basically, DSBR and hierarchical-Z are competing to do the same thing (prevent shading of fragments that will have no effect on the render target). Why have two things on chip that are trying to do the same thing?
Without hierarchical-Z, perhaps there'd be fewer paths competing for resources. Hierarchical-Z needs to support two kinds of queries:
- coarse-grained - can any of this triangle be visible in each of the coarse tiles of the render target?
- fine-grained - which fragment-quads (or MSAA fragment-quads) of this triangle can be visible?
Per the GDC presentation from DICE on using compute to do triangle culling, Hi-Z is a 32-bit code word per 8x8 region of pixels.
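As a rough illustration of what such a coarse reject could look like (the exact encoding of that 32-bit word isn't public; the packing below is a hypothetical guess of a conservative min/max depth pair per 8x8 tile, not AMD's actual format):

```python
# Hypothetical sketch of a coarse HiZ reject against per-8x8-tile metadata.
# Assumption: each tile's 32-bit word packs conservative min/max depth as two
# 16-bit fixed-point values. A real implementation would round zmax up when
# quantizing to stay conservative; this sketch truncates for brevity.
TILE = 8

def pack_hiz(zmin, zmax):
    """Quantize [0,1] depths to 16 bits each and pack into one 32-bit word."""
    q = lambda z: min(65535, int(z * 65535))
    return (q(zmax) << 16) | q(zmin)

def tile_zminmax(word):
    """Unpack the (zmin, zmax) pair from a tile's 32-bit word."""
    return (word & 0xFFFF) / 65535.0, (word >> 16) / 65535.0

def coarse_reject(hiz, tx, ty, tri_zmin):
    """With a LESS depth test, the triangle contributes nothing to this tile
    if its nearest depth is no nearer than the tile's farthest stored depth."""
    _, tile_zmax = tile_zminmax(hiz[(tx, ty)])
    return tri_zmin >= tile_zmax
```

For example, with a tile whose stored depths span 0.0-0.25, a triangle whose nearest point sits at depth 0.5 is rejected outright, while one at 0.1 survives to fine-grained testing.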
From the patents on AMD's rasterization and Linux patches for Raven Ridge, the binning process would seem to fall into a coarser level 0, with areas of screen space ranging from thousands to hundreds of thousands of pixels. The choice would seemingly vary per combination of render target and shader features.
The coarse bin intercept stage is so broad that there are still several orders of magnitude between bin determination (0) and sub-pixel coverage (2), leaving potential room for an additional reject: perhaps a shortcut to speed up bin processing before a triangle hits the scan-converter, or a way to remove a triangle from a batch context before it overflows (which some embodiments describe as possible).
The stride of the rasterizer doesn't seem to have grown to match the bin extents.
The Vega ISA instruction for primitive shader SE coverage posits 32-pixel regions, which is somewhat out of alignment with that sizing, although it doesn't align perfectly with the bin sizes given in the Linux patches either.
One reason HiZ could remain is that, despite the rough overlap, its lifetime and terminating conditions differ from those of the DSBR and batching in ways that aren't clear. There are shared conditions under which neither HiZ nor the DSBR applies, such as transparencies or shaders that modify depth. I'm less clear on whether some of those conditions necessarily terminate HiZ versus delaying its use until the pipeline clears. In steady-state operation, the lifetime of an in-process bin is also short, although the most relevant data in that case would live on in the depth buffer; Htile has a lifetime spanning countless bins.
Outside of that, we know the DSBR and batching can give up in instances where HiZ does not, per the descriptions of embodiments and the Linux patch. If pixel or depth context is heavy enough, or some other glass jaw of the DSBR or batching logic is hit, the fallback is to revert to the standard methods--if in a somewhat worse form of them.
There's also a mode where the DSBR doesn't defer shader invocation, where having HiZ would help prevent a regression in performance.
That speculation aside, if the contention is that both HiZ and DSBR walk like a duck and quack like a duck, maybe they aren't different birds?
The bin sizing seems to indicate a goal of keeping whatever color, depth, and stencil context there is on-chip. If this means keeping the relevant tiles in the depth and color caches, wouldn't the metadata like Htile be kept similarly unthrashed?
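As a back-of-the-envelope check on that goal (the cache budget and surface formats here are assumptions for illustration, not Vega's actual figures), the bin area that fits a given on-chip budget falls straight out of bytes-per-pixel:

```python
def max_bin_pixels(cache_bytes, color_bpp, depth_stencil_bpp):
    """Pixels of bin area whose color + depth/stencil context fits the budget.
    All parameters are hypothetical; Vega's real budgets aren't public."""
    return cache_bytes // (color_bpp + depth_stencil_bpp)

# e.g. an assumed 512 KiB on-chip budget, RGBA8 color (4 B/px) + D24S8 (4 B/px):
# 512 * 1024 // 8 = 65536 pixels, i.e. a 256x256 bin -- squarely in the
# thousands-to-hundreds-of-thousands-of-pixels range the Linux patches imply.
```

Heavier formats (fp16 color, MSAA) shrink the bin accordingly, which would fit the observation that bin size varies per combination of render target and shader features.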
The DSBR's depth tracking could be initialized from a snapshot of existing depth, which HiZ could deliver much more rapidly than the depth buffer itself. Additionally, checking fixed 8x8 regions of the screen might be easier for the pipeline hardware to calculate and apply across multiple primitives than the variable bin dimensions, and a handful of 32-bit values every 64 pixels could be kept closer to the front end than a whole bin context.
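Continuing that speculation, seeding a bin's far-depth bound from existing per-8x8 metadata might amount to nothing more than a max-reduction over the tiles the bin covers--a hypothetical sketch (names and layout are my own, not anything documented):

```python
def seed_bin_depth(hiz_zmax, bin_x, bin_y, bin_w, bin_h, tile=8):
    """Hypothetical: derive a conservative far bound for a screen-space bin
    from per-8x8-tile max-depth metadata instead of reading the depth buffer.
    hiz_zmax maps (tile_x, tile_y) -> max depth stored in that tile."""
    tx0, ty0 = bin_x // tile, bin_y // tile
    tx1 = (bin_x + bin_w - 1) // tile
    ty1 = (bin_y + bin_h - 1) // tile
    # The bin can't need anything farther than the farthest covered tile.
    return max(hiz_zmax[(tx, ty)]
               for ty in range(ty0, ty1 + 1)
               for tx in range(tx0, tx1 + 1))
```

A 16x16 bin over four tiles would inherit the worst (farthest) of the four tile bounds, giving the DSBR a usable starting point in a handful of metadata reads.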
Past that, it would seem like writing the bin's depth data back could update Htile at the same time. Since there appears to be software that uses HiZ directly, as on the consoles, the hardware ideal might have to lose to the reality of software already depending on it. Since HiZ and the DSBR both have an early input into the geometry engine's triangle processing, their physical proximity might translate into their being conjoined or being modes of the same hardware. The complexity might be causing problems, although the counter-point may be that a pure implementation's glass jaws would leave it with mediocre improvement anyway.
Some of these elements might be used to reduce the overall latency of the DSBR, which AMD has indicated is a side-effect of the whole batching process. I don't recall seeing results for the DX12 thread's asynchronous compute test. I'm not so curious about AC itself as about whether the frequent trend of AMD's straight-line graphics performance lagging far enough to counteract its compute prowess would be helped or hindered--if we had the ability to toggle this part of the architecture.
But is something really broken? What if everything is working as intended, or what if part of the design simply turned out to yield bad results, creating a need to revisit the architecture?
It has precedent.
x87 floating point stands as a design that was implemented mostly as intended (to an extent--one of the deviations in its implementation may have been the more salvageable part: https://cims.nyu.edu/~dbindel/class/cs279/stack87.pdf). As it turned out, some elements of the design weren't as important or helpful as was thought, and a few complex points that the designers hand-waved as being handled by software turned out to be intractable for software to handle. This left even the best x87 CPUs generally hamstrung to around half the throughput of comparable non-x87 FPUs.
Perhaps, and some rumors did hint at a possible scenario like this, the hardware designers or implementation fobbed off some points for the software to handle, and reality "Noped" back at them in response.
The problem was massively worse with x87, given its heavy ISA and architectural exposure, which the DSBR doesn't seem to have--although if there were such exposure, AMD could have neglected to mention it.
I don't think GCN will have the opportunity to correct any missteps after 20 years like x86 did by deprecating x87 for SSE2.
Then you scrap it/don't talk about it. They very much did talk about Vega.
There were a number of instances where RTG and/or Koduri should have kept their mouths shut but didn't. I appreciated his desire to make RTG more assertive in its role, but at least some rumors make it seem that the gap between results and talking big finally prompted AMD to tell him to take a long dance off a short pier.