What I'm trying to think about (out loud) is how AMD would achieve more finely-grained scheduling in the seemingly new WGP, which does not contain CUs, while at the same time making ray acceleration work better.
That would be a nice improvement, but what would be the benefit of the WGP structure in that case? Might as well stick to CUs with dedicated resources, as in RDNA 1.
With AMD seemingly moving away from 64-work-item hardware threads to 32, the gains in scheduling granularity for the dynamic branching seen during ray tracing (traversal and hit shaders) won't be realised if RA throughput is too low.
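As a toy illustration of that granularity argument (the utilisation model and the example branch pattern here are mine, not anything AMD has described): narrower waves waste fewer lane-cycles when a branch splits cleanly across lanes, because a wave with mixed branch outcomes has to execute both sides.

```python
def utilisation(outcomes, wave_size):
    """Toy model of divergence cost: lanes are grouped into waves of
    wave_size; a wave containing mixed branch outcomes must issue every
    distinct path, and each lane only does useful work on one of them."""
    useful = issued = 0
    for i in range(0, len(outcomes), wave_size):
        wave = outcomes[i:i + wave_size]
        paths = set(wave)                 # distinct branch directions taken
        issued += len(wave) * len(paths)  # lane-cycles spent issuing all paths
        useful += len(wave)               # lane-cycles that did useful work
    return useful / issued

# 64 lanes where the first half takes one branch and the second half the other:
lanes = [0] * 32 + [1] * 32
utilisation(lanes, 64)  # one wave64 runs both sides for all lanes -> 0.5
utilisation(lanes, 32)  # two uniform wave32s -> 1.0
```

Real ray-traversal divergence is far messier than this clean split, but it shows the direction: smaller hardware threads can only pay off if the units they feed (the RAs) keep up.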
So my interpretation/guess regarding the rumours is specifically that there is no contention across VALU SIMDs for the TMU/RA. Currently the TMU/RA hardware is shared by two SIMDs, which introduces a level of contention that can't be handled exclusively at the WGP level nor exclusively at the SIMD level.
In truth, there's always going to be a similar intermediate contention level if LDS is common to all SIMDs within a WGP.
TMUs and RAs are both "compute-limited": each features a pipeline that implements instructions of some type to achieve its results.
LDS, by contrast, is wiring/banking-limited: memory banks and read/write ports have to support multiple paths to registers/VALUs. That brings varying amounts of latency, which SIMD-level scheduling already has to account for.
So it would seem that while it makes sense to grow LDS and make it dual-mode (shared or private per SIMD), to achieve greater ray tracing throughput AMD has no choice but to implement more RAs. It appears that at least some of the wiring if not the logic (e.g. for addressing in local cache for fetches) is common to both TMUs and RAs, so when AMD implements more RAs, more TMUs come along too.
Rumours seem to suggest that we're looking at 30 WGPs per graphics chiplet, with eight SIMD-32s per WGP. That's a lot of sharing/wiring/banking/variable-latencies for LDS. That makes me feel queasy about the practicalities. Also, with a limit of 128 work items in D3D on PC, supporting 256 across 8 SIMDs seems pointless, adding to my queasiness.
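The rumoured numbers are easy to sanity-check; every constant below comes from the rumours discussed above, not from any confirmed spec.

```python
# Lane arithmetic for the rumoured RDNA 3 configuration (rumour, not spec).
WGPS_PER_CHIPLET = 30
SIMDS_PER_WGP = 8
LANES_PER_SIMD = 32

lanes_per_wgp = SIMDS_PER_WGP * LANES_PER_SIMD        # 256 work items per WGP
lanes_per_chiplet = WGPS_PER_CHIPLET * lanes_per_wgp  # 7680 lanes per chiplet

# With the 128-work-item limit mentioned above, a single group occupies
# only half of the WGP's SIMDs:
simds_per_group = 128 // LANES_PER_SIMD               # 4 of the 8 SIMDs
```

Which is exactly the source of the queasiness: a 256-wide WGP whose largest PC workgroup only spans 4 of its 8 SIMDs.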
To reduce my queasiness: while RA throughput is being boosted within a WGP, it could all be shared by all of the SIMDs, similarly to how LDS is shared. This is a different theory, in which I'm contemplating the use of LDS during ray traversal. Essentially, the shaders that perform ray traversal make heavy use of LDS to track the state of each ray, since some of this state is shared by all rays. So, instead of making TMUs and RAs private per SIMD, it might be preferable to make LDS a "porthole" into the operation of the TMU/RA hardware.
With LDS made a porthole like this (and despite it seeming to be a bottleneck), it would provide SIMDs with a more responsive supply of RA throughput. In simple terms, if one SIMD issues a ray query while the other 7 SIMDs are doing other work (traversal or hit evaluation), that SIMD gains 8x the RA throughput it would otherwise have had with a private RA.
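A trivial way to state that claim (the pooling model is my assumption for the sake of argument, not a known RDNA 3 mechanism):

```python
def ra_throughput_per_simd(active_simds, ra_units=8, pooled=True):
    """Toy model: ra_units RA pipelines per WGP, nominally one per SIMD.
    If pooled, the SIMDs currently issuing ray queries split the whole
    pool; if private, each SIMD is capped at its own single RA unit."""
    if not pooled:
        return 1.0                 # private: one RA unit, full stop
    return ra_units / active_simds # pooled: idle SIMDs donate their share
```

So `ra_throughput_per_simd(1)` gives 8.0 (the lone issuer sees the whole pool), while `ra_throughput_per_simd(8)` gives 1.0, matching the private case only when every SIMD issues at once.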
So in this theory LDS becomes more central to the scheduling of work inside the WGP. Its latencies will probably be worse than in RDNA 2, and LDS is already a bottleneck in RDNA 2 ray tracing due to its size ("barely holds the data required"). So RDNA 3 LDS would need to be much bigger to take on more functionality during ray acceleration scheduling and to cope with more work items in flight.
So, I'm torn. Fine-grained, private-per-SIMD TMU/RA is sort of the trend we've seen from AMD (currently private per CU). But a group of TMUs/RAs shared by all of a WGP is attractive because it can soak up bursts of work better.
NVidia seems to be doing broad sharing of TMUs and ray-traversal hardware, so it would seem more likely that AMD would go the latter way, with everything shared by all SIMDs in a WGP. At which point, maybe the count of TMUs/RAs per SIMD does not need to increase, as their utilisation would be higher...
I was under the impression the bottleneck was more traversing the BVH using shaders (i.e. a highly branching workload on SIMD) rather than the intersection rate being insufficient.
The highly-branching workload associated with ray tracing is not merely in ray traversal, it's also in hit shading. (Shadows are the exception, since there's not really any shading to perform.) So disentangling BVH traversal, ray-triangle intersection and hit shading is quite hard.
In the end, NVidia doubled intersection-testing throughput in Ampere versus Turing, so it would appear it's an element of performance that AMD might want to attack. I'm not saying that by doubling RA throughput per SIMD in RDNA 3, AMD will entirely solve the performance deficit it has. I'm simply thinking about the internal structure of the WGP and whether the rumours hint at a change in RA throughput per SIMD.
It may be that average RA throughput in RDNA 2 really isn't the bottleneck. Instead, performance problems associated with the RAs could be a problem of burstiness, where queue lengths grow and stalls propagate. In that situation, sharing RAs across more SIMDs would be the solution.