I don’t think ALU lanes are the issue. It’s 160 CUs vs 84 SMs. Not even in the same ballpark.
80 WGPs with 4x SIMD-32s vs 84 SMs with 8x SIMD-16s. Hmm...
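For anyone who wants the lane totals behind that comparison, here's a quick sketch. It only multiplies out the unit counts quoted above; the function name is made up for illustration and the 80 WGP figure is still just the rumoured configuration.

```python
def fp32_lanes(units, simds_per_unit, lanes_per_simd):
    """Total FP32 ALU lanes for a GPU built from identical units."""
    return units * simds_per_unit * lanes_per_simd

# Rumoured big RDNA 2 part: 80 WGPs, each with 4x SIMD-32
rdna2_lanes = fp32_lanes(80, 4, 32)    # 10240

# Gaming Ampere: 84 SMs, each with 8x SIMD-16
ampere_lanes = fp32_lanes(84, 8, 16)   # 10752

print(rdna2_lanes, ampere_lanes, round(rdna2_lanes / ampere_lanes, 3))
```

So on raw lane count the two are within about 5% of each other, which is why the CU-versus-SM count on its own doesn't say much.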
Are you saying that 4 TMU lanes in an SM versus 8 TMU lanes in a WGP is a major factor here? Is there something else? I can't read your mind.
Arcturus has 8192 FP32 ALU lanes, probably with 1:2 FP64 throughput. And if RDNA 2's clocks are any indication, it too should clock at around 2GHz, for >32 TFLOPS of FP32.
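The arithmetic behind that figure, as a rough sketch. It assumes the usual convention of counting a fused multiply-add as 2 FLOPs per lane per clock, and the 2GHz clock and 1:2 FP64 rate are guesses from this post, not specs.

```python
def peak_tflops(lanes, clock_ghz, flops_per_lane_per_clock=2):
    """Theoretical peak throughput in TFLOPS (FMA counted as 2 FLOPs, an assumption)."""
    return lanes * flops_per_lane_per_clock * clock_ghz / 1000.0

fp32_tflops = peak_tflops(8192, 2.0)   # guessed ~2GHz clock
fp64_tflops = fp32_tflops / 2          # guessed 1:2 FP64 rate

print(f"FP32: {fp32_tflops:.3f} TFLOPS, FP64: {fp64_tflops:.3f} TFLOPS")
```

That comes out at 32.768 TFLOPS FP32, and half that for FP64 if the 1:2 guess holds.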
There's a spec? There's a die size?
I honestly don't get why gaming Ampere has so many ALUs - many more than the compute-oriented GA100. It definitely doesn't translate into gaming performance.
That said, I don't know why AMD would follow suit. They tried their hand at using chips with lots of compute units to compete in the gaming market, and the result was a chip with comparatively low power efficiency (Vega 10).
RDNA looks like it is focussed on being bandwidth-efficient (in the CUs the focus is on minimised cache-thrashing). RDNA 2 looks like a major iteration on that concept, though perhaps everything in the L1 and L2 papers and patents is already in RDNA.
RDNA is gaming-centric, so it shouldn't have more compute resources than what the other execution units and effective memory bandwidth can keep up with, for rasterization.
What other execution units? ALU:TEX in my proposal is unchanged. My hypothesis includes ALU:colour-fillrate ratio being either doubled or quadrupled. We don't know if it's 128 or 64 ROPs. ALU:zixel-rate is either the same or doubled, because my theory is that AMD will double zixel rate per ROP.
Sounds like a gaming GPU to me. 80 WGPs is a lot of ray tracing, too.
If you disagree with 80 WGPs, then you have to explain a massive die that's been seen in a gaming card with GDDR6. 128MB last level cache could be the answer, but it seems really unlikely to me, simply because a monster cache that is almost the same size as all of the CUs (which are about 112mm²) hasn't been seen in XSX (and PS5's die size corresponds to that). If RDNA 2 has a monster cache then that would appear to imply that the consoles have no RDNA 2 features except for ray tracing. That would make them even more horrible than I thought they were...
Earlier I said XSX CUs are 15% of the die - that's wrong, they're about 31%. Also, I said that there's one L0 per WGP, but the RDNA whitepaper shows that the 4x TMUs (per CU) have a dedicated L0 ("vector L0", which also caches non-textured memory reads). The instruction and scalar data caches, along with the LDS, are shared by both CUs - because a workgroup is the general concept of shared computation state, consisting of multiple wave64s (one to four) or wave32s (one to eight) of work-items.
I don't know why this guy has credibility for AMD leaks, but here's a fresh one:
236mm² is 79mm² larger than Navi 14, which is 157mm². Just in case you've not been paying attention, 40 CUs in Navi 10 take 80mm², or if you prefer, 20 CUs in Navi 14 take 40mm² (there are actually 24).
So, how the hell does a "rumoured 32 CU" Navi 23 spend an extra 79mm², when 8 more CUs should be about 16mm²?
- It has half the huge last level cache size of Navi 21?
- It has RDNA 2 CUs (not seen in XSX or PS5) which are twice the area of RDNA 1 CUs? Damn, that ray tracing had better be godlike.
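Here's the area budget spelled out, as a back-of-envelope sketch. It uses only the die sizes and the 2mm² per RDNA 1 CU figure quoted above; the leaked 236mm² and the 32 CU count are rumours, and the last step just tests the "CUs are twice the area" guess against the numbers.

```python
# Figures quoted in this post (the Navi 23 ones are rumoured)
navi14_area_mm2 = 157.0
navi23_area_mm2 = 236.0
navi14_cus = 24
navi23_cus = 32

mm2_per_rdna1_cu = 80.0 / 40   # 40 CUs in Navi 10 take 80mm² -> 2mm² each

extra_area = navi23_area_mm2 - navi14_area_mm2                  # 79mm²
expected_cu_area = (navi23_cus - navi14_cus) * mm2_per_rdna1_cu # 16mm²
unexplained = extra_area - expected_cu_area                     # 63mm²

# Hypothesis check: what if every RDNA 2 CU is twice the size of an RDNA 1 CU?
doubled_cu_delta = (navi23_cus * 2 * mm2_per_rdna1_cu
                    - navi14_cus * mm2_per_rdna1_cu)            # 80mm²

print(extra_area, expected_cu_area, unexplained, doubled_cu_delta)
```

So about 63mm² is unaccounted for if the CUs are RDNA 1 sized, while doubled-area CUs would add roughly 80mm², which happens to land right on the 79mm² gap. That doesn't rule out the cache explanation, it only shows either one could soak up the area.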