AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Keep in mind "automatic" isn't much more than the driver optimizations that likely already occur behind the scenes. Nvidia's tiled raster, for example, may very well be the equivalent of a primitive shader, the only difference being exposure to devs.
 
Keep in mind "automatic" isn't much more than the driver optimizations that likely already occur behind the scenes. Nvidia's tiled raster, for example, may very well be the equivalent of a primitive shader, the only difference being exposure to devs.

I recall Fermi being criticized in old articles for some kind of "software" tessellation, which may have involved some additional optimizations in the purportedly fixed-function hand-off. Perhaps some element of that is similar.

The automatic generation in this case means creating a set of derivative shaders based on the developers' code and inserting them into a phase of vertex processing the developers did not reference, which seems a bit more involved. AMD's point about primitive shaders is that if you get to the point of involving the rasterizer, it's already too late.
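
To make that idea concrete, here is a purely illustrative sketch of what such a driver-built derivative shader could look like: the developer's vertex transform compiled as a callable function, wrapped by culling work the developer never wrote, all before the rasterizer is involved. All names, buffers, and the constant buffer are invented for the example; this is not AMD's generated code.

```
// Illustrative only: a hypothetical driver-generated "derivative" shader.
// Everything here is invented for the sketch; it is not AMD's implementation.
cbuffer Constants
{
    float4x4 worldViewProj;
};

StructuredBuffer<float3> positions;        // developer's vertex data
StructuredBuffer<uint3>  triangles;        // index triplets
RWStructuredBuffer<uint> visibleTriangles; // compacted surviving triangle IDs
RWStructuredBuffer<uint> visibleCount;     // element 0 holds the running count

// Stand-in for the developer's original vertex shader, compiled as a function.
float4 OriginalVertexTransform(uint vertexId)
{
    return mul(float4(positions[vertexId], 1.0f), worldViewProj);
}

// Assumes the dispatch size matches the triangle count.
[numthreads(64, 1, 1)]
void DerivedPrimitiveShader(uint3 dtid : SV_DispatchThreadID)
{
    uint3 tri = triangles[dtid.x];
    float4 p0 = OriginalVertexTransform(tri.x);
    float4 p1 = OriginalVertexTransform(tri.y);
    float4 p2 = OriginalVertexTransform(tri.z);

    // Driver-inserted step the developer never referenced: drop zero-area
    // (degenerate) triangles before primitive assembly ever sees them.
    float area = determinant(float3x3(p0.xyw, p1.xyw, p2.xyw));
    if (area != 0.0f)
    {
        uint slot;
        InterlockedAdd(visibleCount[0], 1, slot);
        visibleTriangles[slot] = dtid.x;
    }
}
```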
 
I recall Fermi being criticized in old articles for some kind of "software" tessellation, which may have involved some additional optimizations in the purportedly fixed-function hand-off.
I think those were some rumors started by none other than Charlie, claiming Fermi would have bad tessellation performance due to using a software solution. That later proved to be wrong, as Fermi came with so many hardware tessellators that it was leaps and bounds beyond AMD's GPUs for several generations.
 
I think those were some rumors started by none other than Charlie, claiming Fermi would have bad tessellation performance due to using a software solution. That later proved to be wrong, as Fermi came with so many hardware tessellators that it was leaps and bounds beyond AMD's GPUs for several generations.
Exactly. Charlie didn't understand that DX11 tessellation had a lot of shader involvement, which isn't too surprising a comment coming from somebody who also claimed that Nvidia's CPU and GPU would use the same execution pipeline.
 
Neither do I; that's why I was arguing with him in that thread... but he's still sticking to his 'opinion'.
It doesn't matter that it's my opinion if it's correct. In a system designed to take advantage of feedback, it stands to reason they would use everything available. Besides, they even included instructions in the ISA to accelerate what I suggested. Beyond that, the part of the pipeline that discards primitives would seem an ideal place to handle all culling of said primitives, where even 300 instructions to discard one is a performance win. Binning isn't too unlike rasterization with big pixels. Then throw in some Z culling, which has been around forever: testing against the current scene and against the rest of a bin. I'd be shocked if they weren't doing something like what I described, given the triangle and bandwidth constraints.
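
For a sense of scale, here is a minimal HLSL sketch of the textbook frustum and back-face tests such a stage could run on post-transform clip-space positions. These are generic tests, not AMD's primitive-shader code, and even a generous instruction budget here is cheap next to rasterizing and shading a triangle that never contributes.

```
// Generic per-triangle culling predicates on clip-space positions; a sketch,
// not AMD's implementation.

// Cull if all three vertices are outside the same clip plane (conservative:
// triangles that merely cross planes are kept and left to normal clipping).
bool OutsideFrustum(float4 p0, float4 p1, float4 p2)
{
    return (p0.x >  p0.w && p1.x >  p1.w && p2.x >  p2.w)
        || (p0.x < -p0.w && p1.x < -p1.w && p2.x < -p2.w)
        || (p0.y >  p0.w && p1.y >  p1.w && p2.y >  p2.w)
        || (p0.y < -p0.w && p1.y < -p1.w && p2.y < -p2.w)
        || (p0.z >  p0.w && p1.z >  p1.w && p2.z >  p2.w)   // beyond far plane
        || (p0.z <  0.0f && p1.z <  0.0f && p2.z <  0.0f);  // behind near plane (D3D convention)
}

// Back-face test via the determinant of the homogeneous (x, y, w) vertices;
// the sign to reject depends on winding order and projection handedness.
bool BackFacing(float4 p0, float4 p1, float4 p2)
{
    return determinant(float3x3(p0.xyw, p1.xyw, p2.xyw)) <= 0.0f;
}
```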
 
Is the 300-instructions-to-cull-one-triangle figure a rhetorical flourish?

There are many implications of and questions about that and everything else claimed, but before wading far into the weeds I will say my gut instinct is to accept a lost cycle in the geometry engine and have faith that maybe one of the next 299 triangles will launch at least one wavefront that affects final rendering.
 
I guess that figure was pulled out of this. :)

The specific reference I saw for 300 was the nanosecond cost of a memory reference needed for an indirect draw handled by the command processor, which the system has limited ability to hide if there are calls with 0 triangles.

This does show that there are points in the architecture with massively less leeway to handle costs concurrently. One CP versus 2-4 DSBRs is still dwarfed by the parallel capacity of the back end. Given the Vega white paper's placement of primitive shaders, it would require time travel for an intra-draw element like a primitive shader to retroactively decide the command processor shouldn't have launched the vertex processing that contains it.

The architectural descriptions and patents also usually describe the tiled deferred rasterization step as coming after primitive assembly, which, if mapped to the "everything is a primitive shader" claim, would create a shader spanning the VGT, the FIFO, the PA, and everything right up to the point where the SPI instantiates wavefronts.

The balancing act in managing the VGT-to-PA path is where I'm curious whether there's an impact for Vega.
A primitive shader's code is inside the VGT path, and it is culling code that exists in addition to (or redundantly with) code or hardware that implements the VGT and standard culling portion of the process. AMD indicated that the standard pipeline is still there, which may explain some of the complexity in activating or exposing primitive shaders more generally, if it does involve balancing overproduction or starvation across elements like a limited number of inter-stage FIFOs.

I also note that the culling shaders do not try to check every coverage scenario for MSAA, which makes me wonder whether a primitive shader would, either.
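
For reference, the small-primitive test used in compute-based culling (the style shown in the Frostbite GDC16 deck linked later in this thread) reasons only about pixel centers. A rough sketch of that style of test, with invented names and conventions, shows why: a triangle that misses every pixel center can still cover an MSAA sample position, so the test would have to be made conservative or skipped when MSAA is active.

```
// Sketch of a pixel-center "small primitive" test; names and conventions are
// illustrative, not taken from any shipped implementation.
bool CoversAnyPixelCenter(float2 ndcMin, float2 ndcMax, float2 viewportSize)
{
    // Convert the triangle's NDC bounding box to screen-space coordinates.
    float2 smin = (ndcMin * 0.5f + 0.5f) * viewportSize;
    float2 smax = (ndcMax * 0.5f + 0.5f) * viewportSize;

    // If min and max round to the same value on either axis, the box sits
    // between adjacent pixel centers and (at 1x sampling, ignoring boundary
    // cases) covers nothing, so the triangle can be culled.
    return !any(round(smin) == round(smax));
}
```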
 
The specific reference I saw for 300 was the nanosecond cost of a memory reference needed for an indirect draw handled by the command processor, which the system has limited ability to hide if there are calls with 0 triangles.
Do you have a link to that? I'm interested in GPU-driven rendering and was wondering about the cost of zero-triangle draw calls for indirect draws. Given that presentations found it important enough to compact the draw indirect buffer, the info would be interesting.
 
Do you have a link to that? I'm interested in GPU-driven rendering and was wondering about the cost of zero-triangle draw calls for indirect draws. Given that presentations found it important enough to compact the draw indirect buffer, the info would be interesting.
Don't do zero triangle draws. They still cost GPU cycles (I only have console numbers, so I can't give them). ExecuteIndirect supports indirect draw count (filled by GPU). OpenGL 4.3+ supports indirect draw count too (https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_indirect_parameters.txt).

Compacting data with a local+global atomic counter is efficient. There are, however, no ordering guarantees (depth sorting isn't preserved). DX12 SM6.0 has GlobalOrderedCountIncrement. I haven't tried this, but the GCN equivalent instruction (DS_ORDERED_COUNT) works well for this case on consoles. DICE had some info about using it in their culling implementation.
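
A minimal sketch of that local + global atomic counter pattern, as an HLSL compute shader with invented buffer names and a placeholder survival test: each group tallies survivors in LDS, reserves a contiguous output range with a single global atomic, and writes into it. The output order depends on which group's global atomic lands first, which is why any depth sort is lost.

```
StructuredBuffer<uint>   candidates;   // IDs to test (size matches the dispatch)
RWStructuredBuffer<uint> compacted;    // surviving IDs, tightly packed
RWStructuredBuffer<uint> globalCount;  // element 0 holds the running total

groupshared uint localCount;
groupshared uint localBase;
groupshared uint localItems[64];

// Placeholder survival test for the sketch; substitute a real culling test.
bool Survives(uint id)
{
    return (id & 1) == 0;
}

[numthreads(64, 1, 1)]
void Compact(uint3 dtid : SV_DispatchThreadID, uint gtid : SV_GroupIndex)
{
    if (gtid == 0)
        localCount = 0;
    GroupMemoryBarrierWithGroupSync();

    uint id = candidates[dtid.x];
    if (Survives(id))
    {
        uint localSlot;
        InterlockedAdd(localCount, 1, localSlot);   // cheap LDS atomic per survivor
        localItems[localSlot] = id;
    }
    GroupMemoryBarrierWithGroupSync();

    if (gtid == 0)
        InterlockedAdd(globalCount[0], localCount, localBase);  // one global atomic per group
    GroupMemoryBarrierWithGroupSync();

    if (gtid < localCount)
        compacted[localBase + gtid] = localItems[gtid];
}
```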
 
Do you have a link to that? I'm interested in GPU-driven rendering and was wondering about the cost of zero-triangle draw calls for indirect draws. Given that presentations found it important enough to compact the draw indirect buffer, the info would be interesting.

There is an embedded link in the post I replied to, pointing to a GDC16 slide deck for a presentation on GPU-driven culling with compute shaders in the Frostbite engine.
http://www.wihlidal.ca/Presentations/GDC_2016_Compute.pdf
 
Don't do zero triangle draws. They still cost GPU cycles (I only have console numbers, so I can't give them). ExecuteIndirect supports indirect draw count (filled by GPU). OpenGL 4.3+ supports indirect draw count too (https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_indirect_parameters.txt).

Compacting data with a local+global atomic counter is efficient. There are, however, no ordering guarantees (depth sorting isn't preserved). DX12 SM6.0 has GlobalOrderedCountIncrement. I haven't tried this, but the GCN equivalent instruction (DS_ORDERED_COUNT) works well for this case on consoles. DICE had some info about using it in their culling implementation.
@sebbbi Thanks for the input. BTW, do you still work on GPU-driven pipelines, or does Claybook take up all your time? If so, have you experimented with DX12 yet, and do you find it a good fit for GPU-driven pipelines? Also, any advice for breaking a model down into clusters? It's a hard problem to do optimally. How did you handle bounding volumes for clusters in skinned meshes? I haven't come up with a good solution for that.

edit - Sorry for being off topic but I figured I had his attention so why not.

There is an embedded link in the post I replied to, pointing to a GDC16 slide deck for a presentation on GPU-driven culling with compute shaders in the Frostbite engine.
http://www.wihlidal.ca/Presentations/GDC_2016_Compute.pdf
Thanks... I must have missed it, as I already have that presentation, or perhaps I only perused the PowerPoint version of it.

edit - the PowerPoint version doesn't have the notes below the slides.
 
This does show that there are points in the architecture with massively less leeway to handle costs concurrently. One CP versus 2-4 DSBRs is still dwarfed by the parallel capacity of the back end. Given the Vega white paper's placement of primitive shaders, it would require time travel for an intra-draw element like a primitive shader to retroactively decide the command processor shouldn't have launched the vertex processing that contains it.
It wouldn't retroactively decide, but it could poll input from prior geometry as one possible test. Move Z culling into the primitive testing at a per-bin resolution. There would also be some mechanism to analyze or reduce the bins, for example if the last triangle occluded some portion of the earlier triangles. Keep running simple passes until the bin was full of valid geometry, or even spawn a new partition within it.

The architectural descriptions and patents also usually describe the tiled deferred rasterization step as coming after primitive assembly, which, if mapped to the "everything is a primitive shader" claim, would create a shader spanning the VGT, the FIFO, the PA, and everything right up to the point where the SPI instantiates wavefronts.
That's roughly what I'm suggesting: all of that running on a CU until it's satisfied with whatever bins were created, with a single draw call possibly instantiating only one wavefront for establishing bins.
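
For what it's worth, here is a hedged sketch of the kind of per-bin Z test being speculated about: keep one conservative farthest-visible depth per screen bin and reject triangles whose nearest point is still behind it. The storage and names are invented, and whether Vega's binner does anything like this is exactly the open question.

```
// Speculative sketch only. Assumes a D3D-style depth range where larger
// values are farther away, and that binMaxDepth is initialized to asuint(1.0f).
RWStructuredBuffer<uint> binMaxDepth;   // per-bin farthest visible depth, stored as float bits

// Reject a triangle for this bin if its nearest point is still behind
// everything kept so far.
bool OccludedInBin(uint binIndex, float triMinDepth)
{
    return triMinDepth > asfloat(binMaxDepth[binIndex]);
}

// Only tighten the bin's farthest-visible depth for occluders that cover the
// whole bin; otherwise the test above would start rejecting visible geometry.
// (For depth values in [0, 1], unsigned-integer ordering of the bit patterns
// matches float ordering, so InterlockedMin on asuint() is safe.)
void UpdateBinDepth(uint binIndex, float occluderMaxDepth, bool coversWholeBin)
{
    if (coversWholeBin)
        InterlockedMin(binMaxDepth[binIndex], asuint(occluderMaxDepth));
}
```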
 