It would be relatively straightforward to write a tool that goes through a vertex shader and extracts only the code related to the SV_Position calculation. You could run this code first to determine whether the primitive is visible or not, and only after that allocate parameter cache storage and execute the rest of the vertex shader (fetching all the other vertex parameters and writing them to the parameter cache). I don't know whether this alone is enough to get a big perf boost, or whether you would want to add some coarse-grained tests first (which would need higher-level knowledge of the geometry data and the vertex shader code). If AMD releases an OpenGL or Vulkan primitive shader extension, it would provide better documentation of how this works exactly, which would help the discussion. For now, we can't do anything but speculate about whether it can be automated or not.

Is it using the "Vega Native Pipeline"? That does look like 4 geometry engines at ~1.6 GHz (i.e. the same per-clock performance as Fiji). The tallest bar looks like 17 * 1.6 = ~27.
To be honest, it doesn't really look like the manual mode will be used often (if ever), so I'd say the main question right now should be how much performance AMD can extract from the auto mode. Maybe that's something that will start low and then improve over the coming months/years.
For now, we can't do anything but speculate about whether it can be automated or not.
I confess I don't really understand whether your doubt refers to the specific process you described or to whether it can be automated at all, but @Rys has confirmed that it can and will be automated.
I was mostly thinking about doing it in the most efficient way. A developer could, for example, include lower-precision (conservative) vertex data for culling purposes, saving bandwidth on hidden primitives versus loading full-width vertex positions. The developer would also likely want to separate the position data from the rest of the vertex (SoA-style layout) to maximize data cache utilization for primitives that end up hidden. You don't want to interleave position data and other vertex data into the same cache lines if the GPU only reads the position data; you want to store the position data separately so it packs tightly into cache lines. Things like that. If the driver is automating the early vertex transform + culling, it can't do all the optimizations a developer could, because the driver can't really change your software's data layouts in memory.
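To make the data-layout point concrete, here's a minimal C++ sketch (the struct names are mine, purely illustrative) of an interleaved layout versus a split, SoA-style layout where positions live in their own tightly packed stream:

#include <vector>

// Interleaved (AoS): a position-only culling pass drags the normal/uv/tangent
// data through the cache alongside every position it touches.
struct VertexInterleaved {
    float position[3];
    float normal[3];
    float uv[2];
    float tangent[4];
};

// Split (SoA-style): positions pack tightly into cache lines, so a culling
// pass that only reads positions gets full cache-line utilization. The other
// attribute streams are only touched for vertices of visible primitives.
struct VertexStreams {
    std::vector<float> positions;  // 3 floats per vertex (or fp16 for a
                                   // conservative, lower-precision cull copy)
    std::vector<float> normals;    // 3 floats per vertex
    std::vector<float> uvs;        // 2 floats per vertex
    std::vector<float> tangents;   // 4 floats per vertex
};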
I am especially interested in knowing whether their system could split the SV_Position-related code into a completely separate step that is executed at the tile binning stage. This way you could execute the parameter cache writing part only for visible triangles, and you wouldn't need to store them to tile buffers either. That would be a huge win.

I am guessing this is what the whitepaper means when it says:
We can envision even more uses for this technology in the future, including deferred vertex attribute computation, multi-view/multi-resolution rendering, depth pre-passes, particle systems, and full-scene graph processing and traversal on the GPU
That's awesome. Can't wait to see this in action.
Just two words:
STANDARD SWIZZLE

Just four more words: Not in current driver (17.8.1 WHQL).
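Assuming the usual Windows/D3D12 headers and an already-created device, a minimal sketch of how you'd check whether the driver reports it (this queries the StandardSwizzle64KBSupported cap, which I'm assuming is the bit being referred to):

#include <windows.h>
#include <d3d12.h>

// Returns true if the driver exposes 64KB standard swizzle texture layouts.
bool SupportsStandardSwizzle(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    if (FAILED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                           &options, sizeof(options))))
        return false;
    return options.StandardSwizzle64KBSupported != FALSE;
}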
Cool, but I want to see a sample with a dedicated HBC buffer D:
V_SCREEN_PARTITION_4SE_B32
D.u = TABLE[S0.u[7:0]].
TABLE:
0x1, 0x3, 0x7, 0xf, 0x5, 0xf, 0xf, 0xf, 0x7, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf,
0xf, 0x2, 0x6, 0xe, 0xf, 0xa, 0xf, 0xf, 0xf, 0xb, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf,
0xd, 0xf, 0x4, 0xc, 0xf, 0xf, 0x5, 0xf, 0xf, 0xf, 0xd, 0xf, 0xf, 0xf, 0xf, 0xf,
0x9, 0xb, 0xf, 0x8, 0xf, 0xf, 0xf, 0xa, 0xf, 0xf, 0xf, 0xe, 0xf, 0xf, 0xf, 0xf,
0xf, 0xf, 0xf, 0xf, 0x4, 0xc, 0xd, 0xf, 0x6, 0xf, 0xf, 0xf, 0xe, 0xf, 0xf, 0xf,
0xf, 0xf, 0xf, 0xf, 0xf, 0x8, 0x9, 0xb, 0xf, 0x9, 0x9, 0xf, 0xf, 0xd, 0xf, 0xf,
0xf, 0xf, 0xf, 0xf, 0x7, 0xf, 0x1, 0x3, 0xf, 0xf, 0x9, 0xf, 0xf, 0xf, 0xb, 0xf,
0xf, 0xf, 0xf, 0xf, 0x6, 0xe, 0xf, 0x2, 0x6, 0xf, 0xf, 0x6, 0xf, 0xf, 0xf, 0x7,
0xb, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x2, 0x3, 0xb, 0xf, 0xa, 0xf, 0xf, 0xf,
0xf, 0x7, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x1, 0x9, 0xd, 0xf, 0x5, 0xf, 0xf,
0xf, 0xf, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0xe, 0xf, 0x8, 0xc, 0xf, 0xf, 0xa, 0xf,
0xf, 0xf, 0xf, 0xd, 0xf, 0xf, 0xf, 0xf, 0x6, 0x7, 0xf, 0x4, 0xf, 0xf, 0xf, 0x5,
0x9, 0xf, 0xf, 0xf, 0xd, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x8, 0xc, 0xe, 0xf,
0xf, 0x6, 0x6, 0xf, 0xf, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x4, 0x6, 0x7,
0xf, 0xf, 0x6, 0xf, 0xf, 0xf, 0x7, 0xf, 0xf, 0xf, 0xf, 0xf, 0xb, 0xf, 0x2, 0x3,
0x9, 0xf, 0xf, 0x9, 0xf, 0xf, 0xf, 0xb, 0xf, 0xf, 0xf, 0xf, 0x9, 0xd, 0xf, 0x1
4SE version of LUT instruction for screen partitioning/filtering.
This opcode is intended to accelerate screen partitioning in the 4SE case only. 2SE and 1SE cases use normal ALU instructions.
This opcode returns a 4-bit bitmask indicating which SE backends are covered by a rectangle from (x_min, y_min) to (x_max, y_max).
With 32-pixel tiles the SE for (x, y) is given by { x[5] ^ y[6], y[5] ^ x[6] } . Using this formula we can determine which SEs are covered by a larger rectangle.
The primitive shader must perform the following operation before the opcode is called.
1. Compute the bounding box of the primitive (x_min, y_min) (upper left) and (x_max, y_max) (lower right), in pixels.
2. Check for any extents that do not need to use the opcode:
if ((x_max/32 - x_min/32) >= 3) OR ((y_max/32 - y_min/32) >= 3) (tile size of 32), then all backends are covered.
3. Call the opcode with this 8-bit select: { x_min[6:5], y_min[6:5], x_max[6:5], y_max[6:5] }.
4. The opcode will return a 4-bit mask indicating which backends are covered, where bit 0 indicates SE0 is covered and bit 3 indicates SE3 is covered.
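Putting the formula and the steps together, here's a small C++ sketch (my own illustration with 32x32 tiles, not a reproduction of the hardware table) of the coverage mask the opcode is meant to return:

#include <cstdint>

// SE index for a pixel, per the formula above: SE = { x[5] ^ y[6], y[5] ^ x[6] }
static uint32_t SeForPixel(uint32_t x, uint32_t y)
{
    uint32_t hi = ((x >> 5) & 1) ^ ((y >> 6) & 1); // bit 1
    uint32_t lo = ((y >> 5) & 1) ^ ((x >> 6) & 1); // bit 0
    return (hi << 1) | lo;
}

// 4-bit coverage mask for a pixel-space bounding box (bit N set = SEN covered).
static uint32_t SeMaskForBBox(uint32_t x_min, uint32_t y_min,
                              uint32_t x_max, uint32_t y_max)
{
    // Step 2: boxes spanning 4 or more 32-pixel tiles in either axis cover all SEs.
    if ((x_max / 32 - x_min / 32 >= 3) || (y_max / 32 - y_min / 32 >= 3))
        return 0xF;

    // Otherwise walk the covered 32x32 tiles and OR in each tile's SE bit.
    uint32_t mask = 0;
    for (uint32_t ty = y_min / 32; ty <= y_max / 32; ++ty)
        for (uint32_t tx = x_min / 32; tx <= x_max / 32; ++tx)
            mask |= 1u << SeForPixel(tx * 32, ty * 32);
    return mask;
}

The opcode collapses the tile walk into a single lookup, using the packed 8-bit select from step 3 as the table index.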
Not for a long time; they're only coming to mobile this year, and to desktop next year, apparently.

See, this is the correct way to get a sale of both a Vega GPU and a Ryzen CPU.
Now that you've got it, AMD, please stop this nonsense of bundling Vega64 with games and monitors and CPUs and mouse pads and anti-mining speeches.
AMD is still trying to figure out how to expose the feature to developers in a sensible way. Even more so than DX12, I get the impression that it's very guru-y. One of AMD's engineers compared it to doing inline assembly. You have to be able to outsmart the driver (and the driver needs to be taking a less-than-highly-efficient path) to gain anything from manual control.

I'm curious what tools would be made available, or what architectural/ISA features would enable practical adoption.
I am especially interested in knowing whether their system could split the SV_Position-related code into a completely separate step that is executed at the tile binning stage. This way you could execute the parameter cache writing part only for visible triangles, and you wouldn't need to store them to tile buffers either. That would be a huge win. However, automatic splitting isn't always a win: in the worst case you need to process the position transform code twice (and that might be expensive if you, for example, blend multiple displacement maps in a terrain renderer).

To what level of AMD's DSBR functionality is this taken? Visibility as calculated from the already mentioned frustum and facing tests, or all the way through the deferred HSR step and the sub-pixel-granularity operations after the scan converter?
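As a rough illustration of the split described in the quote above, here's a hypothetical CPU-side C++ sketch (function names invented, standing in for code the compiler/driver would have to generate) in which the position-only phase runs for every triangle and the attribute/parameter-cache phase runs only for survivors:

#include <cstdint>
#include <vector>

struct float4 { float x, y, z, w; };

// Phase 1: only the code needed to produce SV_Position (position fetch + transform).
// This is the part that would run at tile binning / culling time.
float4 ComputeClipPosition(uint32_t vertexIndex);        // assumed defined elsewhere

// Phase 2: everything else the vertex shader outputs (normals, UVs, ...), i.e. the
// work that allocates and writes the parameter cache. Runs only for visible triangles.
void ComputeAndExportAttributes(uint32_t vertexIndex);   // assumed defined elsewhere

// Frustum / back-face / zero-area tests on the clip-space positions.
bool TriangleVisible(const float4& p0, const float4& p1, const float4& p2); // assumed

void ShadeVisibleTriangles(const std::vector<uint32_t>& indices)
{
    for (size_t i = 0; i + 2 < indices.size(); i += 3)
    {
        // Position-only pre-pass for the whole triangle.
        float4 p0 = ComputeClipPosition(indices[i + 0]);
        float4 p1 = ComputeClipPosition(indices[i + 1]);
        float4 p2 = ComputeClipPosition(indices[i + 2]);

        if (!TriangleVisible(p0, p1, p2))
            continue; // culled: no parameter cache allocation, no attribute work

        // Attribute phase only for surviving triangles (a real implementation
        // would deduplicate shared vertices rather than shade them per triangle).
        ComputeAndExportAttributes(indices[i + 0]);
        ComputeAndExportAttributes(indices[i + 1]);
        ComputeAndExportAttributes(indices[i + 2]);
    }
}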
For conventional shaders, the patterns found in the metadata are well defined: there's a well-defined set of vertex buffer usage models ...
So, I believed they had generalized the graphics-API shader already when they did GCN1. I'm surprised they don't see the pixel as a primitive too and blur the distinction between PS (primitive shader) and PS (pixel shader). Oops ...

I think you're right in the sense that GCN's existing stages (something like: IA, VS, HS, TS, DS, GS, SO, PA, RS, PS, RBE) are "wirable" in a variety of ways to make all the kinds of pipelines that current graphics APIs support. And I think your observations on the console design-pull, and the bias in the hardware towards API experimentation that console developers are happy to engage with, are correct. I also think you're right that some clever configuration of the stages can bring about novel virtual pipelines on GCN.
Yes, very much in agreement there.

This is how I see this in the future:
amplification (ff) -> position calculation (code) -> cull homogeneous (ff) -> attribute source caching (code) -> bin/cull tile (ff) -> attribute calculation (code) -> rasterize/amplification (ff) -> attribute interpolation and new calculation (code) -> sort (ff) -> blend (code) -> dump (ff)
I included a possible programmable blend stage in there (in place of the usual fixed-function one). And you can see that most of the ff stages are little more than thread invocation or killing plus a bit extra, while most of the code stages are compute plus a little extra. If you are able to actually specify the pipeline yourself, you can do whatever you want, even a screen-sized compute shader writing to the ROPs without a rasterizer.
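As a toy C++ encoding of that proposed virtual pipeline (purely illustrative; the stage names are just the list above, and nothing here reflects real hardware or API structures):

// Alternating fixed-function (ff) and programmable (code) stages, as proposed above.
enum class StageKind { FixedFunction, Programmable };

struct Stage { const char* name; StageKind kind; };

static const Stage kProposedPipeline[] = {
    { "amplification",                           StageKind::FixedFunction },
    { "position calculation",                    StageKind::Programmable  },
    { "cull homogeneous",                        StageKind::FixedFunction },
    { "attribute source caching",                StageKind::Programmable  },
    { "bin/cull tile",                           StageKind::FixedFunction },
    { "attribute calculation",                   StageKind::Programmable  },
    { "rasterize/amplification",                 StageKind::FixedFunction },
    { "attribute interpolation and calculation", StageKind::Programmable  },
    { "sort",                                    StageKind::FixedFunction },
    { "blend",                                   StageKind::Programmable  },
    { "dump",                                    StageKind::FixedFunction },
};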
Key word: practical.

Edit: I feel it, practical decoupled shading will be possible soon. I'm a believer.