> Possibly related to the fact that while FP capability per CU doubled, INT didn't. Or that the double capability requires some ILP to work.

Has AMD hinted at which matrix workloads or libraries they're targeting? FSR 3.0?
If SIMD throughput is 64x32b or 2x32x32b, that would make Navi31 a 12288x32b chip. Why is it marketed as 6144?
Possibly related to the fact that while FP capability per CU doubled, INT didn't. Or that the double capability requires some ILP to work.
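The arithmetic behind the two numbers can be sketched as follows (assuming the commonly reported Navi31 configuration of 96 CUs with two 32-lane SIMDs each; those counts are assumptions here, not from this thread):

```python
# Back-of-envelope for the 6144 vs 12288 question (assumed Navi31 config).
cus = 96
simds_per_cu = 2
lanes_per_simd = 32

# Marketed "shader" count: one FP32 lane per SIMD lane.
marketed = cus * simds_per_cu * lanes_per_simd
print(marketed)  # 6144

# Peak FP32 lanes if every SIMD runs its doubled FP32 path (wave64 / dual-issue):
peak_fp32 = marketed * 2
print(peak_fp32)  # 12288
```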
> Depends on which spends more transistors, the ALUs or the data movement. If getting operands to the units takes more die area than the ALUs themselves, doubling up on compute might be transistor-efficient to get more use of the data paths, even if the compute sits idle a lot of the time. Just to get more use out of the system in the rare cases when enough operands can be delivered to run two ops in parallel.

I suspect you're right, but it wildly contradicts the "extract maximum value from each transistor" statement on that first slide I included.
> That's "wave64 mode", typically used by pixel shaders.

64x32b implies ILP isn't required.
> Super SIMD was supposedly motivated by making more effective use of vector register file bandwidth - a lot of the time the bandwidth with old architectures was wasted.

Depends on which spends more transistors, the ALUs or the data movement. If getting operands to the units takes more die area than the ALUs themselves, doubling up on compute might be transistor-efficient to get more use of the data paths, even if the compute sits idle a lot of the time. Just to get more use out of the system in the rare cases when enough operands can be delivered to run two ops in parallel.
That's "wave64 mode", typically used by pixel shaders.
Of course, if there are integer instructions (per pixel), then those instructions in the pixel shader progress at half rate, because there are only 32 lanes of integer ALU capability.
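The half-rate INT point reduces to a simple issue-cycle calculation (the 64 FP32 / 32 INT32 lane counts are taken from the discussion above; the model itself is a simplification):

```python
import math

# A wave64 instruction needs ceil(64 / ALU lanes) issue cycles.
def issue_cycles(wave_size, alu_lanes):
    return math.ceil(wave_size / alu_lanes)

print(issue_cycles(64, 64))  # FP32 in wave64 mode: 1 cycle
print(issue_cycles(64, 32))  # INT32 in wave64 mode: 2 cycles (half rate)
```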
Super SIMD was supposedly motivated by making more effective use of vector register file bandwidth - a lot of the time the bandwidth with old architectures was wasted.
As it happens, wave64 mode is the common case that will get good value out of these 64 lanes. The concern then rests with compute workloads in cases where 32-work-item workgroups are used. Ray tracing is one of those cases, but maybe the traversal code has high ILP?
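A rough utilization model of the wave64 vs wave32 tradeoff described above (the exact dual-issue conditions are an assumption here, not confirmed behavior):

```python
# Toy model: a 64-wide FP32 path is filled either by one wave64 instruction,
# or by a wave32 instruction plus an independent co-issued wave32 instruction
# (i.e. the ILP requirement mentioned above).
def fp32_lane_utilization(wave_size, independent_pair_available):
    if wave_size == 64:
        return 1.0  # wave64 fills both 32-lane halves by itself
    if wave_size == 32 and independent_pair_available:
        return 1.0  # dual-issue fills the second half
    return 0.5      # second half idles

print(fp32_lane_utilization(64, False))  # 1.0: pixel-shader wave64 case
print(fp32_lane_utilization(32, False))  # 0.5: wave32 with no ILP
```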
> If all of that is true, why is AMD sandbagging? Is there actually enough operand bandwidth to feed 64 FMAs?

My guess is that they want to present it as a "very effective per flop architecture", which would turn into the opposite with 2x the FP32 ALU number - see Turing vs Ampere.
> If all of that is true, why is AMD sandbagging? Is there actually enough operand bandwidth to feed 64 FMAs?

My understanding is that the bandwidth is only available if at least one operand comes from a prior instruction, where the result of that instruction is cached and fed to the ALUs without requiring it to be read from a VGPR. There is an arrow on the diagram that implies as much.
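That reading of the diagram can be sketched as a toy model (the forwarding rule and the read-port budget here are assumptions for illustration, not confirmed RDNA3 behavior):

```python
# Toy model: two FP32 ops can co-issue only if the VGPR reads they need fit
# the assumed port budget; an operand matching the previous instruction's
# destination is "free" because it comes from the forwarding path.
def can_dual_issue(op_a_srcs, op_b_srcs, prev_dst, vgpr_read_ports=3):
    vgpr_reads = [s for s in op_a_srcs + op_b_srcs if s != prev_dst]
    return len(vgpr_reads) <= vgpr_read_ports

# Next pair reuses v5 (previous result): forwarding covers those reads.
print(can_dual_issue(["v5", "v3"], ["v5", "v4"], prev_dst="v5"))  # True
# No operand reuses the previous result: 4 VGPR reads exceed the budget.
print(can_dual_issue(["v0", "v1"], ["v2", "v3"], prev_dst="v9"))  # False
```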
> Is Primitive Shaders now working better than Mesh Shaders, or is it just another name for the same thing?

Primitive shader is a s/w-defined stage of geometry processing in RDNA GPUs. As far as we can tell, all geometry runs through it on RDNA, mesh shaders included. It looks like RDNA3 has some additional h/w supporting this path now, freeing the main SIMDs from some calculations (presumably).
> Short question: what is s/w?

Shaders are s/w (software), thus a primitive shader is also s/w. But it can run on separate h/w (hardware), in part or even fully. In this case it's s/w (a shader) which does some sort of geometry processing.
> But I thought that mesh shaders have to be in hardware?

There are some h/w requirements for supporting mesh shaders, but the processing itself is done on the main FP32 ALUs.
> So AMD now has more hardware than needed for mesh shaders?

Impossible to tell, but unlikely. No reason for them to have something like that.
Is Primitive Shaders now working better than Mesh Shaders, or is it just another name for the same thing?
> But I thought that mesh shaders have to be in hardware? So AMD now has more hardware than needed for mesh shaders?

IIRC mesh shaders require some hardware plumbing that allows certain data sharing between the CUs and the rasterizer blocks, but I'm not positive.
I don't really understand this, but I suspect they are using spare bits that were perhaps unused in RDNA 2.
Sounds like they can now skip over testing triangles if the ray intersects none of the child boxes?
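That early-out can be sketched with a toy BVH traversal (hypothetical structures for illustration, not AMD's actual hardware node format):

```python
# If the ray misses a node's box, every triangle underneath it is skipped
# without any ray/triangle intersection tests.
def slab_hit(ray_o, ray_inv_d, lo, hi):
    # Standard slab test; ray_inv_d holds 1/direction per axis.
    tmin, tmax = 0.0, float("inf")
    for o, inv, l, h in zip(ray_o, ray_inv_d, lo, hi):
        t0, t1 = (l - o) * inv, (h - o) * inv
        tmin = max(tmin, min(t0, t1))
        tmax = min(tmax, max(t0, t1))
    return tmin <= tmax

def traverse(node, ray_o, ray_inv_d):
    if not slab_hit(ray_o, ray_inv_d, *node["box"]):
        return []  # whole subtree skipped: no triangle tests at all
    tested = list(node.get("tris", []))  # stand-in for triangle tests
    for child in node.get("children", []):
        tested += traverse(child, ray_o, ray_inv_d)
    return tested

bvh = {"box": ((0, 0, 0), (2, 2, 2)), "children": [
    {"box": ((0, 0, 0), (1, 1, 1)), "tris": ["t1"]},
    {"box": ((1.5, 1.5, 1.5), (2, 2, 2)), "tris": ["t2"]},
]}
# Ray along +x through the first child only: t2's box is missed,
# so t2 is never triangle-tested.
print(traverse(bvh, (-1, 0.5, 0.5), (1.0, 100.0, 100.0)))  # ['t1']
```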
I don't understand how discarding empty ray quads could have been a problem on RDNA 2, and why RDNA 3 is better when these quads are discarded. But idk about 'DXR ray flags'.