AMD RDNA3 Specifications Discussion Thread

It seems, according to this slide:

[slide image]


that there is native "fixed function primitive culling hardware to remove SW based culling overhead" which implies that AMD has gone "backwards" with NGG (primitive shaders), adding new hardware for primitive culling.

Is that the correct interpretation?
 
It seems, according to this slide:

[slide image]


that there is native "fixed function primitive culling hardware to remove SW based culling overhead" which implies that AMD has gone "backwards" with NGG (primitive shaders), adding new hardware for primitive culling.

Is that the correct interpretation?
I'm confused that they write 12 prim/clk in the headline and 24 prim/clk in the text. I think it normally runs at 12 prim/clk, and when you use a primitive shader and put it in your program you get 24 prim/clk?

Maybe @Rys can give a little more clarity on this? Maybe he'll write a benchmark ;) Waiting 10 years for Beyond3D Suite 2.0

I'm also wondering about Nvidia and its 12 rasterizers. I was thinking that Ada would have 8 rasterizers/GPC
 
I'm confused that they write 12 prim/clk in the headline and 24 prim/clk in the text. I think it normally runs at 12 prim/clk, and when you use a primitive shader and put it in your program you get 24 prim/clk?
"Primitives" and "vertices" in the text - they're not actually the same thing.

In a mesh there's approximately one triangle (primitive) per vertex, but when a mesh "wraps" around an object, going from front-facing to back-facing there are potentially some triangles that end up being viewed edge-on. They are degenerate (have 0 area measured in screen pixels), so should be culled, though they are not actually back-facing.

Another cause of culling within a mesh is when some triangles end up entirely outside of the screen. The mesh in this case is near the edge and only parts of it fall over the edge. Only triangles whose full set of vertices is outside of the screen can be culled though; you can't cull a triangle which only has one vertex that's outside.
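The two cases above (edge-on/degenerate triangles and fully off-screen triangles) can be illustrated with simple screen-space tests. This is a hypothetical Python sketch of the idea, not AMD's actual culling logic; the function names and the integer screen coordinates are my own assumptions:

```python
def signed_area_2x(v0, v1, v2):
    # Twice the signed screen-space area of a triangle (z of the 2D cross
    # product of its edge vectors). Zero means edge-on / degenerate.
    return (v1[0] - v0[0]) * (v2[1] - v0[1]) - (v2[0] - v0[0]) * (v1[1] - v0[1])

def cull_triangle(v0, v1, v2, width, height):
    # Degenerate / viewed edge-on: zero area measured in screen pixels.
    if signed_area_2x(v0, v1, v2) == 0:
        return True
    # Off-screen: safe to cull only if ALL three vertices are outside the
    # SAME screen edge. A triangle with just one vertex outside may still
    # cover visible pixels, so it must be kept.
    for axis, limit in ((0, width), (1, height)):
        if all(v[axis] < 0 for v in (v0, v1, v2)):
            return True
        if all(v[axis] > limit for v in (v0, v1, v2)):
            return True
    return False
```

Note the asymmetry: the degenerate test only needs one triangle's own vertices, while the off-screen test has to be conservative about triangles straddling the edge.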

But anyway, none of this relates to the question I have: the apparent non-use of NGG, given that there's fixed function hardware to remove software culling overhead.
 
"Primitives" and "vertices" in the text - they're not actually the same thing.

In a mesh there's approximately one triangle (primitive) per vertex, but when a mesh "wraps" around an object, going from front-facing to back-facing there are potentially some triangles that end up being viewed edge-on. They are degenerate (have 0 area measured in screen pixels), so should be culled, though they are not actually back-facing.

Another cause of culling within a mesh is when some triangles end up entirely outside of the screen. The mesh in this case is near the edge and only parts of it fall over the edge. Only triangles whose full set of vertices is outside of the screen can be culled though; you can't cull a triangle which only has one vertex that's outside.

But anyway, none of this relates to the question I have: the apparent non-use of NGG, given that there's fixed function hardware to remove software culling overhead.

Was NGG ever used?
 
Was NGG ever used?
Really don't know.

The reference to "software based culling overhead" is the key here and I can't tell what AMD is actually describing. This overhead has been "removed" in RDNA 3. Does that imply AMD has been using NGG, but with RDNA 3 that's been dropped in favour of fixed function hardware?
 
Really don't know.

The reference to "software based culling overhead" is the key here and I can't tell what AMD is actually describing. This overhead has been "removed" in RDNA 3. Does that imply AMD has been using NGG, but with RDNA 3 that's been dropped in favour of fixed function hardware?
I think they mean that you had to write some special code if you wanted to use the new pipeline in your software. Didn't we discuss here that there are special commands for NGG? And I think now you can use the normal pipelines (DX11, DX12 and Vulkan) without those commands, and the polygons will automatically be routed to NGG?
 
I think they mean that you had to write some special code if you wanted to use the new pipeline in your software. Didn't we discuss here that there are special commands for NGG? And I think now you can use the normal pipelines (DX11, DX12 and Vulkan) without those commands, and the polygons will automatically be routed to NGG?
Was NGG ever exposed in client-side software as something you can program for? AFAIR all geometry went through NGG automatically starting with RDNA1.
 
Didn't we discuss that here:
What you can do on PS5 rarely has any relevance to what you can do on PC.
 

I hadn't seen this article before today. It talks about RDNA, which in effect means RDNA 1 and 2.

There are now only 2 HW shader stages for vertex/geometry processing:
  • Surface shader which is a pre-tessellation stage and is equivalent to what LS + HS was in the old HW.
  • Primitive shader which can feed the rasterizer and replaces all of the old ES + GS + VS stages.

These are referred to as hardware stages, which I suppose at the least implies that they are initiated/scheduled by dedicated hardware, even though they run as shaders.

Compared to the old HW VS, a primitive shader has these new features:
  • Compute-like: they are running in workgroups, and have full support for features such as workgroup ID, subgroup count, local invocation index, etc.
  • Aware of both input primitives and vertices: there are registers which contain information about the input primitive topology and the overall number of vertices/primitives (similar to GS).
  • They have to export not only vertex output attributes (positions and parameters), but also the primitive topology, ie. which primitive (eg. triangle) contains which vertices and in what order. Instead of processing vertices in a fixed topology, it is up to the shader to create as many vertices and primitives as the application wants.
  • Each shader invocation can create up to 1 vertex and up to 1 primitive.
  • Before outputting any vertex or primitive, a workgroup has to tell how many it will output, using s_sendmsg(gs_alloc_req) which ensures that the necessary amount of space in the parameter cache is allocated for them.
  • On RDNA2, per-primitive output params are also supported.
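The allocation step in the list above (s_sendmsg with gs_alloc_req before any export) can be modeled as a simple contract. This is a toy Python sketch of that contract only, under my own assumptions; the class and method names mirror the message names from the quoted article and are not real ISA:

```python
class ParameterCache:
    """Toy model of the gs_alloc_req contract: a workgroup must declare its
    vertex/primitive output counts before any export is accepted, so the
    necessary parameter-cache space can be reserved up front."""

    def __init__(self):
        self.allocated = None   # (num_vertices, num_primitives) once requested
        self.exports = []

    def gs_alloc_req(self, num_vertices, num_primitives):
        # Models s_sendmsg(gs_alloc_req): reserve space before exporting.
        self.allocated = (num_vertices, num_primitives)

    def export_vertex(self, index, attrs):
        # Exporting outside the declared allocation is invalid.
        assert self.allocated is not None and index < self.allocated[0], \
            "export without (sufficient) allocation"
        self.exports.append(("vertex", index, attrs))
```

The point of the sketch is just the ordering constraint: the message traffic exists so the hardware knows how much output space each workgroup needs before anything is written.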

It's notable that RDNA 2 has functionality there that's not present on RDNA 1. Which may be relevant to general conclusions regarding NGG culling being of no benefit on RDNA 1.

Notes about hardware support

  • Vega had something similar, but I haven’t heard of any drivers that ever used this. Based on public info I could find, it’s not even worth looking into.
  • Navi 10 and 12 lack some features such as per-primitive outputs which makes it impossible to implement mesh shaders on these GPUs. We don’t use NGG on Navi 14 (RX 5500 series) because it doesn’t work.
  • Navi 21 and newer have the best support. They have all necessary features for mesh shaders. We enabled shader culling by default on these GPUs because they show a measurable benefit.
  • Van Gogh (the GPU in the Steam Deck) has the same feature set as Navi 2x. It also shows benefits from shader culling, but to a smaller extent.

The original merge request for culling is here:


The VS (or TES) is butchered into two parts:
  1. Top part: ES vertex threads compute only the position output and store that to LDS.
  2. Culling code
    • GS threads load the positions of each vertex that belongs to their triangle and decide whether to accept or cull the triangle.
    • Surviving vertices are repacked if needed.
  3. Bottom part: ES threads of the repacked, surviving vertices compute the other outputs.

which matches, as far as I can tell, the model of culling that was described for Vega (but never used).
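The three-step split described above can be sketched in software. This is an illustrative Python model, assuming a simple position-based cull callback; the function names are hypothetical and LDS is modeled as a plain list:

```python
def ngg_culled_vertex_shader(in_vertices, triangles,
                             compute_position, compute_attrs, cull):
    # 1. Top part: ES vertex threads compute ONLY positions, stored to LDS
    #    (modeled here as a list indexed by vertex).
    lds_positions = [compute_position(v) for v in in_vertices]

    # 2. Culling code: GS threads load the three positions belonging to
    #    their triangle and decide whether to accept or cull it.
    surviving = [t for t in triangles
                 if not cull([lds_positions[i] for i in t])]

    # Surviving vertices are repacked: only vertices still referenced by a
    # surviving triangle stay live, and indices are compacted.
    live = sorted({i for t in surviving for i in t})
    remap = {old: new for new, old in enumerate(live)}

    # 3. Bottom part: only the repacked, surviving vertices compute the
    #    remaining (non-position) outputs.
    out_vertices = [(lds_positions[i], compute_attrs(in_vertices[i]))
                    for i in live]
    out_triangles = [tuple(remap[i] for i in t) for t in surviving]
    return out_vertices, out_triangles
```

The saving is in step 3: attribute work is only done for vertices that survive culling, which is the whole point of splitting the vertex shader in two.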

This article refers explicitly to RDNA 3:


One interesting bit brought up by this patch series is that with GFX11, Next-Gen Geometry (NGG) is now always enabled.

and I think in the speculation thread we discussed NGG as being the only option for RDNA 3 in light of code commits.

The mystery remains: what hardware interaction does NGG use, and is that hardware being referenced in the slide I linked earlier? The only hint at hardware usage that I can find is the call to s_sendmsg, where space in the output buffer is requested for vertices/primitives to be stored. Messages, themselves, have to be buffered (queued) in order to be handled, and maybe the slide is referring to the intense message traffic produced by NGG (independent of culling)? Such intense message traffic could be interpreted as "software overhead", I suppose, and may have been improved in RDNA 3 versus RDNA 2.

Note for @Digidi : sorry, when I quoted the slide earlier I misquoted it, which is how you ended up talking about primitives while I corrected you with vertices...
 
RDNA3 really has dedicated matrix accelerators after all!

[attachment: RDNA 3 Work Group]

That's really exciting!

It's notable that RDNA 2 has functionality there that's not present on RDNA 1. Which may be relevant to general conclusions regarding NGG culling being of no benefit on RDNA 1.
Well of course, that's likely one reason why it doesn't support mesh shaders. RDNA1 is a dinosaur architecture.
 
RDNA3 really has dedicated matrix accelerators after all!

That's really exciting!


Well of course, that's likely one reason why it doesn't support mesh shaders. RDNA1 is a dinosaur architecture.
Yes but no. They're there to support the WMMA instructions, Dot2 (FP16, BF16, INT8) and Dot4 (INT4), but those still employ all the "normal ALUs" in the Compute Unit, not just those accelerators.
 
RDNA3 really has dedicated matrix accelerators after all!


That's really exciting!

I think that conclusion is wrong. This slide is pretty explicit in its title, in my opinion:

[slide image]


saying that the 64 lanes of the vector unit can also function as the "matrix accelerator" with two variants of dot product instructions, either DOT2 or DOT4. Well, these aren't traditional dot product instructions, because they are accumulating variants.
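The "accumulating variants" mentioned above behave like a small dot product fused with an add of a third operand. A hedged scalar sketch in Python of DOT2/DOT4-style semantics (one lane's worth of work; this models the arithmetic only, not the actual instruction encodings or data formats):

```python
def dot2_acc(a, b, c):
    # DOT2-style: two-element dot product plus an accumulator operand.
    # d = a0*b0 + a1*b1 + c
    return a[0] * b[0] + a[1] * b[1] + c

def dot4_acc(a, b, c):
    # DOT4-style: four-element dot product plus an accumulator operand.
    # d = a0*b0 + a1*b1 + a2*b2 + a3*b3 + c
    return sum(x * y for x, y in zip(a, b)) + c
```

The accumulator input is what makes these useful for matrix multiplication: each lane keeps folding partial products into a running sum instead of needing a separate add.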

Well of course, that's likely one reason why it doesn't support mesh shaders. RDNA1 is a dinosaur architecture.
This is factually the case according to one of the articles I've just linked: mesh shaders require per-primitive output parameters.
 