> So, we're back to 4 SIMDs per CU (or whatever arbitrary name you will call it). Now all we need is a 4-cycle round robin cadence to be back at GCN with 4x throughput.

That's the Work Group Processor, where two CUs are tightly integrated (sharing constant/instruction caches and the LDS). I think this is analogous to Nvidia's TPC, where two SMs are chained together (three in GT200).
> That's the Work Group Processor

No, that's per 'CU'.
> I think this is analogous to Nvidia's TPC, where two SMs are chained together (three in GT200).

TPC is just a physical layout thingy; you can't just double the shmem available to a single SM with a toggle.
> They just now need to keep re-posting snippets from this thread

I'd say some techtubers already did that with a smattering of Chinese-forum auto-translate (along with good ol' FUD).
> So, we're back to 4 SIMDs per CU (or whatever arbitrary name you will call it). Now all we need is a 4-cycle round robin cadence to be back at GCN with 4x throughput.

Triple confirmed: RDNA 3 has 128-wide wavefronts and yet again zero-cycle ALU instruction latency.
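For reference, here's the sustained lanes-per-clock arithmetic behind the banter: GCN's 4-cycle wave64 cadence never changed per-clock throughput, it only changed how often each SIMD started a new instruction. The 256-ALU WGP figure is the rumour discussed further down this thread, not a confirmed spec; a trivial C++ sketch:

```cpp
// Sustained FP32 lanes per clock for the configurations being compared.
// The RDNA 3 number is a rumour, not a confirmed specification.
#include <cstdio>

int main() {
    const int gcn_cu    = 4 * 16;       // 4x SIMD16, wave64 issued over 4 cycles
    const int rdna_cu   = 2 * 32;       // 2x SIMD32, single-cycle issue
    const int rdna_wgp  = 2 * rdna_cu;  // two CUs per WGP = 128 lanes/clk
    const int rdna3_wgp = 256;          // rumoured RDNA 3 WGP ALU count
    printf("GCN CU: %d  RDNA CU: %d  RDNA WGP: %d  rumoured RDNA3 WGP: %d lanes/clk\n",
           gcn_cu, rdna_cu, rdna_wgp, rdna3_wgp);
    return 0;
}
```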
> We can infer that 2 SIMDs can cooperate in a non-pixel-shader workgroup because it reduces latency

Not sure whether the programming model reflects your speculation:
> Not sure whether the programming model reflects your speculation:

I was referring to CU mode, not WGP mode, though I admit I didn't reinforce that constraint at that point.
> * Memory access within a workgroup is coherent only for requests skipping L0$, or atomic operations.

(except when you specify the CU mode at dispatch time, presumably).
> * Barriers are monitored at the WGP level anyway, since it is a workgroup primitive.

In CU mode barriers are CU level.
> * LDS are equally reachable by all SIMDs in WGP.

Only in WGP mode.
> Though the ISA documentation does say explicitly that CU mode can lead to higher effective LDS bandwidth. But that sounds more like a specific matter of certain memory access patterns disfavoring(?) the WGP mode of the LDS, rather than something to do with some unspecified tight integration within a WGP half.

There are two LDS arrays inside the LDS, each array localised to its parent CU. So in CU mode you get doubled bandwidth.
> I was referring to CU mode, not WGP mode, though I admit I didn't reinforce that constraint at that point.

The nuance here is that these are all currently execution-mode differences at runtime, which could well be emulated on top of hardware that is architecturally WGP-level. You are trying to extrapolate that there are more benefits from this mode, which is possible. But the reason could also be as simple as a compatibility mode for kernels written against programming-model assumptions that have been around for a decade (e.g. wavefront size = 64), which isn't something a shader compiler alone can fix/patch transparently.
In CU mode barriers are CU level.
Only in WGP mode.
There are two LDS arrays inside LDS, each array is localised to the parent CU. So in CU mode you get doubled bandwidth. I didn't see the note in the ISA about reduced bandwidth in WGP mode, but that merely backs up what I was saying before.
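To make the CU mode / WGP mode distinction concrete from the software side: the kernel source is identical either way, because the mode is a compile/dispatch-time property rather than something you write in the shader (for gfx10+ LLVM has a `cumode` subtarget feature, and I believe clang exposes `-mcumode` / `-mno-cumode`, though double-check the flag names). What changes is which SIMDs the workgroup's waves land on, and therefore what the barrier and the LDS allocation span: all 4 SIMDs of the WGP in WGP mode, the 2 SIMDs of one CU in CU mode, which is where the two-LDS-arrays / doubled-bandwidth argument above comes from. A minimal HIP-style sketch, nothing RDNA-specific in the source:

```cpp
// Minimal HIP sketch: a workgroup sum staged through LDS ("__shared__").
// Nothing here selects CU vs WGP mode -- that is a compile/dispatch property,
// not part of the kernel source. Only the hardware mapping of the workgroup
// (which SIMDs it spans, which LDS half backs the allocation) changes.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

constexpr int WG = 256;  // workgroup size

__global__ void block_sum(const float* in, float* out) {
    __shared__ float lds[WG];            // backed by the WGP's LDS (one half of it in CU mode)
    const int tid = threadIdx.x;
    lds[tid] = in[blockIdx.x * WG + tid];
    __syncthreads();                     // workgroup barrier
    for (int stride = WG / 2; stride > 0; stride >>= 1) {
        if (tid < stride) lds[tid] += lds[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = lds[0];
}

int main() {
    const int blocks = 4, n = blocks * WG;
    std::vector<float> h_in(n, 1.0f), h_out(blocks, 0.0f);
    float *d_in = nullptr, *d_out = nullptr;
    hipMalloc((void**)&d_in, n * sizeof(float));
    hipMalloc((void**)&d_out, blocks * sizeof(float));
    hipMemcpy(d_in, h_in.data(), n * sizeof(float), hipMemcpyHostToDevice);
    block_sum<<<blocks, WG>>>(d_in, d_out);
    hipMemcpy(h_out.data(), d_out, blocks * sizeof(float), hipMemcpyDeviceToHost);
    printf("block 0 sum = %g (expect %d)\n", h_out[0], WG);
    hipFree(d_in);
    hipFree(d_out);
    return 0;
}
```

Build with hipcc; the point is simply that CU vs WGP mode never shows up at this level, which is why it could plausibly be just a compatibility/performance toggle rather than a programming-model feature.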
L0$ per WGP, L1$ per shader array. Well, that's how RDNA (2) is configured.
The ALU:RA ratio is pretty high in RDNA 2 already, so making it much higher may well be harmless given that the entire BVH, in the worst case, can't fit into infinity cache (i.e. lots of latency). More WGPs in Navi 31 still means a theoretical increase in intersection throughput (e.g. 3x Navi 21).
We can infer that 2 SIMDs can cooperate in a non-pixel-shader workgroup because it reduces latency (wall clock latency for all the work items in the workgroup).
I think you're going to have to be very specific in pointing out a problem with what I've written. I'm not saying I haven't made a mistake, but while you're hiding the problem you've identified I can't read your mind and I'm not motivated to find the problem in text that I've already spent well over 6 hours writing (I posted version 3, in case you're wondering).
> When AMD puts more shaders in the WGPs, how does this affect their ray tracing approach? I thought they need more WGPs for ray tracing, not fewer?

It's one RA per CU as it is, or two RAs per WGP. They are closely coupled to, if not an integral part of, the TMU. If they don't change the ALU:TEX ratio, there's no reason to believe RT performance won't scale with ALU performance. Bigger ∞$ would only help RT perf, especially if they can make the BVH stick in that $.
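Putting rough numbers on that: Navi 21 has 40 WGPs (80 CUs, so 80 RAs at one per CU), and the rumour below is 30 WGPs per GCD with two GCDs for Navi 31. Assuming the one-RA-per-CU ratio is kept, which is purely my assumption, the intersection-hardware count scales like this:

```cpp
// Ray-accelerator count comparison, assuming the "one RA per CU / two per WGP"
// ratio is unchanged and the rumoured Navi 31 WGP count is right.
#include <cstdio>

int main() {
    const int ra_per_wgp  = 2;              // 1 RA per CU, 2 CUs per WGP
    const int navi21_wgps = 40;
    const int navi31_wgps = 2 * 30;         // rumour: 30 WGPs per GCD, 2 GCDs
    const int navi21_ras  = navi21_wgps * ra_per_wgp;   // 80
    const int navi31_ras  = navi31_wgps * ra_per_wgp;   // 120
    printf("Navi 21: %d RAs, Navi 31 (rumoured): %d RAs -> %.2fx at iso-clock\n",
           navi21_ras, navi31_ras, (double)navi31_ras / navi21_ras);
    return 0;
}
```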
How did you get more WGPs in Navi 31? Navi 21 has 40, latest rumours say Navi 31 has only 30.
30 per die, but Navi31 seems to have two dies, so the total is 60.
Oh yes that’s right. Though that’s not 3x Navi 21 if latest rumors are accurate.
> It should be >3x Navi 21

Scaling hard!
> 5120 ALUs on Navi 21 vs. 15360 ALUs on dual-chiplet Navi 31. Furthermore, Navi 31 has up to 512MB Infinity Cache, 4x the 128MB in Navi 21.

Correct.
> Think the entire lineup gets 4x LLC bumps as one last SRAM huzzah.

4*96MB = 384MB LLC on Navi 32?
> Oh yes that's right. Though that's not 3x Navi 21 if latest rumors are accurate.
It's 60 WGPs but each WGP now has 256 ALUs, whereas one WGP in Navi 21 has 128 ALUs.
5120 ALUs on Navi 21 vs. 15360 ALUs on dual-chiplet Navi 31. Furthermore, Navi 31 has up to 512MB Infinity Cache, 4x the 128MB in Navi 21.
We should also expect N31's clocks to reach higher frequencies, since it's made on N5P instead of Navi 21's N7P.
It should be >3x Navi 21, even if power consumption jumps to RTX 3090 levels.
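Spelling out the arithmetic in the post above (rumoured Navi 31 configuration vs. the shipping Navi 21):

```cpp
// ALU arithmetic from the post above: rumoured Navi 31 config vs. Navi 21.
#include <cstdio>

int main() {
    const int navi21_alus = 40 * 128;   // 40 WGPs x 128 ALUs = 5120
    const int navi31_alus = 60 * 256;   // 60 WGPs x 256 ALUs = 15360 (rumour)
    printf("ALU ratio: %.1fx\n", (double)navi31_alus / navi21_alus);   // 3.0x
    // Peak FP32 per GHz (2 FLOPs per ALU per clock for FMA):
    printf("FP32 per GHz: Navi 21 %.2f TFLOPS, Navi 31 %.2f TFLOPS\n",
           navi21_alus * 2 / 1e3, navi31_alus * 2 / 1e3);
    // Whatever clock bump N5P allows over N7P multiplies on top of this.
    return 0;
}
```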
4*96MB = 384MB LLC on Navi 32?
128MB LLC on Navi 33?
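Those two guesses are just a straight 4x of the RDNA 2 Infinity Cache sizes (128MB Navi 21, 96MB Navi 22, 32MB Navi 23), i.e. the "4x across the lineup" rumour applied per die; a quick check:

```cpp
// "4x LLC across the lineup" rumour applied to the RDNA 2 Infinity Cache sizes.
#include <cstdio>

int main() {
    const int rdna2_llc_mb[] = {128, 96, 32};   // Navi 21, 22, 23
    const char* rdna3[]      = {"Navi 31", "Navi 32", "Navi 33"};
    for (int i = 0; i < 3; ++i)
        printf("%s: %d MB (4 x %d MB)\n", rdna3[i], 4 * rdna2_llc_mb[i], rdna2_llc_mb[i]);
    return 0;
}
```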
> But these lower tier chips will be single chiplet presumably and it's just double the LLC per chiplet from what I've read?

GCDs have no LLC.
> My expectation at this stage is that the single chiplet N31

No such thing.
> The difference here being that hopefully scaling is much less like 2 GPUs in Crossfire and more like 1 big GPU

It is one big GPU.
> the rest of the stack will be a more traditional performance uplift of around 50% (a bit more if we're very lucky).

No, it's the biggest gen-on-gen uplift in eons, across the stack.