AMD RDNA3 Specifications Discussion Thread

Has AMD hinted at which matrix workloads or libraries they’re targeting? FSR 3.0?

If SIMD throughput is 64x32b or 2x32x32b that would make Navi31 a 12288x32b chip. Why is it marketed as 6144?
 
Possibly related to the fact that while FP capability per CU doubled, INT didn't. Or that the double capability requires some ILP to work.
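
For reference, the lane arithmetic behind the two numbers can be sketched as below. This is only a back-of-envelope check: it assumes the full Navi31 die has 96 CUs with 2 SIMDs each (figures not stated in this thread), and `fp32_lanes` is just an illustrative helper name.

```python
# Back-of-envelope FP32 lane count for Navi31.
# ASSUMPTION: 96 CUs and 2 SIMDs per CU for the full die (not from this thread).
CUS = 96
SIMDS_PER_CU = 2


def fp32_lanes(lanes_per_simd: int) -> int:
    """Total FP32 lanes on the chip for a given per-SIMD lane count."""
    return CUS * SIMDS_PER_CU * lanes_per_simd


if __name__ == "__main__":
    print("32 lanes per SIMD:", fp32_lanes(32))  # the marketed figure
    print("64 lanes per SIMD:", fp32_lanes(64))  # counting the doubled FP units
```

Under those assumptions the marketed 6144 corresponds to counting 32 lanes per SIMD, and 12288 to counting the doubled FP lanes.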
 
I suspect you're right, but it wildly contradicts the "extract maximum value from each transistor" statement on that first slide I included.
Depends on which spends more transistors, the ALUs or the data movement. If getting operands to the units takes more die area than the ALUs themselves, doubling up on compute might be a transistor-efficient way to get more use out of the data paths, even if the extra compute sits idle a lot of the time. It only has to pay off in the rare cases when enough operands can be delivered to run two ops in parallel.
 
64x32b implies ILP isn’t required.
That's "wave64 mode", typically used by pixel shaders.

Of course, if there are integer instructions (per pixel), then those instructions in the pixel shader progress at half rate, because there are only 32 lanes of integer ALU capability.
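
A toy model of that half-rate effect, purely as a sketch: it assumes 64 FP lanes and 32 integer lanes per SIMD as discussed above, and it ignores latency, dual-issue restrictions and everything else a real scheduler cares about. The function name and parameters are made up for illustration.

```python
def issue_cycles(fp_ops: int, int_ops: int, wave_size: int = 64,
                 fp_lanes: int = 64, int_lanes: int = 32) -> int:
    """Issue cycles for one wave under a pure lane-count model.

    ASSUMPTION: an op needs ceil(wave_size / lanes) passes through the ALU,
    and nothing else limits throughput.
    """
    fp_passes = -(-wave_size // fp_lanes)    # ceiling division
    int_passes = -(-wave_size // int_lanes)
    return fp_ops * fp_passes + int_ops * int_passes


if __name__ == "__main__":
    # wave64 shader with 8 FP ops and 2 integer ops
    print(issue_cycles(fp_ops=8, int_ops=2))                 # 12
    # same instruction mix in wave32: integer ops are no longer the slow ones
    print(issue_cycles(fp_ops=8, int_ops=2, wave_size=32))   # 10
```

In this model a wave64 shader only reaches full rate if it is all FP; every integer op costs two passes.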

Super SIMD was supposedly motivated by making more effective use of vector register file bandwidth - a lot of the time that bandwidth was wasted on older architectures.

As it happens, wave64 mode is the common case that will get good value out of these 64 lanes. The concern then rests with compute workloads where workgroups of 32 work items are used. Ray tracing is one of those cases, but maybe the traversal code has high ILP?
 
If all of that is true, why is AMD sandbagging? Is there actually enough operand bandwidth to feed 64 FMAs?
 
My understanding is that the bandwidth is only available if at least one operand comes from a prior instruction, where the result of that instruction is cached and fed to the ALUs without having to be read from a VGPR. There is an arrow on the diagram that implies as much:

FLbapADWgwzNWtiDyaatAc.jpg


As an aside, I think AMD is sandbagging with many of the performance claims in these slide decks. But the margin from +20% to +50% isn't really that exciting so it probably doesn't matter much.

It's not going to change the "roughly RTX 4080 performance for $200 less" conclusion that we have to wait a month for.
 
Are Primitive Shaders now working better than Mesh Shaders, or is it just another name for the same thing?
Primitive shader is a s/w-defined stage of geometry processing in RDNA GPUs. As far as we can tell, all geometry runs through it on RDNA, mesh shaders included. It looks like RDNA3 now has some additional h/w supporting this path, (presumably) freeing the main SIMDs from some calculations.
 
But I thought that mesh shaders have to be in hardware? So AMD now has more hardware than needed for mesh shaders?
 
Hmm, I searched a little bit and found these 3 interesting articles about AMD's implementation:



 
View attachment 7554
Sounds like they can now skip over testing triangles if the ray intersects none of the child boxes?
I don't really understand this, but I suspect they are using spare bits that were perhaps unused in RDNA 2.

Referring back to:


I'm going to guess that 32 bits (4 bytes) of ID is excessive. So by using fewer bits, they can put instance/ray flags into the ID (pointer, as described in the slide).

Also, notice that FP32 box nodes have unused bytes, 16 of them according to that page. So maybe that's where the geometry flags go?

The Phoronix article misses this crucial slide though:

V8RvmwCuwpxQPebqoNakcc.jpg


which means that RDNA 3 adds two new modes, Largest-first and Closest-midpoint.

I think at each traversal step, where it specifies the box node it wants to evaluate, it sets bits relating to instancing and/or ray flags as part of the ID that it sends to the ray accelerator. So the ID is really a composite data field when requesting box node results. When traversal is being performed for shadow rays, say, turn on the bit that specifies largest-first child sorting.

But idk about 'DXR ray flags'.
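
To make the composite-ID guess above concrete, here is a minimal sketch of packing flags into the spare upper bits of a 32-bit node ID. The bit layout, the flag names and the 28/4 split are entirely hypothetical; nothing here is taken from AMD documentation.

```python
# HYPOTHETICAL layout: low 28 bits = box node ID, top 4 bits = flags.
NODE_BITS = 28
NODE_MASK = (1 << NODE_BITS) - 1
FLAG_BITS = 32 - NODE_BITS

# Made-up flag values for the two new sort modes mentioned on the slide.
FLAG_LARGEST_FIRST = 1 << 0
FLAG_CLOSEST_MIDPOINT = 1 << 1


def pack_id(node: int, flags: int) -> int:
    """Combine a node ID and flag bits into one 32-bit value."""
    if not 0 <= node <= NODE_MASK:
        raise ValueError("node ID does not fit in the assumed 28 bits")
    if not 0 <= flags < (1 << FLAG_BITS):
        raise ValueError("flags do not fit in the assumed 4 spare bits")
    return (flags << NODE_BITS) | node


def unpack_id(packed: int) -> tuple[int, int]:
    """Split a packed value back into (node ID, flags)."""
    return packed & NODE_MASK, packed >> NODE_BITS


if __name__ == "__main__":
    # e.g. a shadow ray asking for largest-first child sorting on node 5
    request = pack_id(5, FLAG_LARGEST_FIRST)
    print(hex(request))        # 0x10000005
    print(unpack_id(request))  # (5, 1)
```

The point is only that the traversal shader could change the sort mode per request without any extra operand, as long as the ID never needs the full 32 bits.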
View attachment 7555
I don't understand how discarding empty ray quads could have been a problem on RDNA 2 and why RDNA 3 is better when these quads are discarded.

When a quad is predicated-off, as the slide seems to indicate, it should have no material shading execution under RDNA 2. Which should mean that there is no memory request associated with that quad.

Instead, the slide indicates that RDNA 3 moves distinct materials into separate quads. So, starting with three distinct materials, it puts those materials into three distinct quads. This prevents the first quad from performing two distinct memory requests because it is handling two materials. In this respect it would be an improvement over RDNA 2.

This doesn't really answer what happens when there are a lot of predicated-on work items, such that every quad has two or more materials... I suppose a common material amongst the quads or within quads becomes more likely as the number of predicated-on work items increases.
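
My reading of the slide, as a rough sketch: count distinct materials per quad before and after regrouping the active work items so that each quad holds a single material. This is an interpretation only, not AMD's actual mechanism; `None` stands for a predicated-off work item and all names are made up.

```python
from itertools import groupby

QUAD = 4


def materials_per_quad(lanes):
    """Number of distinct materials in each quad (None = predicated off)."""
    quads = [lanes[i:i + QUAD] for i in range(0, len(lanes), QUAD)]
    return [len({m for m in q if m is not None}) for q in quads]


def regroup_by_material(lanes):
    """Repack active work items so that no quad mixes materials.

    ASSUMPTION: each material gets its own quad(s), padded with
    predicated-off slots; the real hardware policy is unknown.
    """
    active = sorted(m for m in lanes if m is not None)
    out = []
    for _, group in groupby(active):
        items = list(group)
        out.extend(items + [None] * (-len(items) % QUAD))
    return out


if __name__ == "__main__":
    lanes = ["A", "B", None, None,
             None, "C", None, None,
             None, None, None, None]
    print(materials_per_quad(lanes))                       # [2, 1, 0]
    print(materials_per_quad(regroup_by_material(lanes)))  # [1, 1, 1]
```

With many active work items the regrouped list can need more quads than the original, which is the open question in the paragraph above.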
 