That would be strange.
Navi 32 with 60 CUs had a total die area of 346 mm², and that obviously included four 6nm MCDs totaling about 146 mm².
Navi 48 is supposed to be 64 CUs, but there should be die savings from moving to 4nm
N4 is just a very slightly improved N5. No big area savings from that.
and from putting the memory bus and L3 Infinity Cache on the main 4nm die as well (scaling here isn't great, but it's still something).
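To see why the area question is interesting, here's a back-of-envelope sketch using the Navi 32 figures above. The shrink factors are purely my own assumptions (N4 gives only a small logic density bump over N5, and SRAM/PHYs barely scale), not known specs:

```python
# Navi 32 numbers from the post; everything else is an assumption.
navi32_total_mm2 = 346
navi32_mcds_mm2 = 146
navi32_gcd_mm2 = navi32_total_mm2 - navi32_mcds_mm2  # ~200 mm^2 of 5nm logic

# Assumed shrinks: ~6% for logic going N5 -> N4, ~10% for the
# memory-controller/L3 area folded back onto the monolithic die.
logic_shrink = 0.94
io_sram_shrink = 0.90

est_navi48_mm2 = navi32_gcd_mm2 * logic_shrink + navi32_mcds_mm2 * io_sram_shrink
print(f"Naive monolithic 64 CU estimate: ~{est_navi48_mm2:.0f} mm^2")
```

Even with generous assumptions this lands near Navi 32's total, so any rumored figure well above that implies the CUs themselves grew, which is where the rest of this post goes.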
Perhaps 64 CUs is wrong and they actually upped the CU count a fair bit instead of going narrow with faster clock speeds? Or maybe RDNA4 CUs are a really sizeable chunk wider to accommodate better RT/AI? I don't know, but that doesn't strike me as the sort of architectural or area-efficiency gain they'd have liked, especially if the aim is to be able to price this GPU more aggressively.
The RDNA4 CU is probably quite different from the RDNA3 CU.
What is already publicly known is that the TMUs will be much beefier, capable of twice the texturing speed in many (but not all) situations. The beefier TMUs will both increase performance and take more area.
And then about the shaders themselves:
RDNA3 doubled the number of multipliers per CU, but could dual-issue multiplications only rarely in real-world code, so those dual multipliers brought only a very small performance increase.
What I'm expecting is that they will remove many of the bottlenecks that prevented those dual multipliers from being utilized often. That might mean more register file ports or a beefier frontend on the CUs.
Whatever they do, it also means the CUs might be considerably beefier.
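The point about dual-issue utilization can be put as a toy throughput model. The utilization rates here are made-up illustrations, not measured numbers:

```python
# Toy model: dual-issue doubles peak FLOPs, but real throughput only
# scales with how often the compiler can actually pair two multiplies
# into one issue slot (VOPD-style co-issue).

def effective_throughput(base: float, dual_issue_rate: float) -> float:
    """base = normalized single-issue throughput; rate = fraction of
    issue slots where a second multiply actually co-issues."""
    return base * (1.0 + dual_issue_rate)

print(effective_throughput(1.0, 0.10))  # RDNA3-ish: pairs rarely
print(effective_throughput(1.0, 0.50))  # hypothetical RDNA4 with fewer bottlenecks
```

So the silicon for 2x multipliers is already paid for; removing pairing bottlenecks is what turns it into actual performance.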
Also, they might have finally added dedicated hardware for BVH tree traversal, instead of doing it in software on the shader cores. If added, this will also cost more area.
And the final thing: cache. The L3 cache ("Infinity Cache") can consume lots of die area if it's big. AFAIK we do not know the size of the cache yet. I'd love to see the same 128 MiB the 6800 XT had, though I'm being pessimistic and thinking 64 MiB is more probable than 128 MiB.
A big cache is expensive, but it can both improve performance by easing the bandwidth bottleneck and increase energy efficiency, as DRAM accesses consume quite a lot of energy.
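A rough sketch of the bandwidth argument. The hit rates below are my own illustrative assumptions (loosely in the spirit of AMD's Infinity Cache hit-rate curves), not published figures:

```python
# The DRAM bus only sees the misses; the L3 absorbs the rest.

def dram_traffic(total_traffic_gbs: float, l3_hit_rate: float) -> float:
    """Traffic that actually reaches GDDR, given an L3 hit rate."""
    return total_traffic_gbs * (1.0 - l3_hit_rate)

# Assumed hit rates for a 1 TB/s shader-side demand stream:
for mib, hit in [(32, 0.40), (64, 0.55), (128, 0.65)]:
    print(f"{mib:>3} MiB L3, {hit:.0%} hits -> {dram_traffic(1000.0, hit):.0f} GB/s to DRAM")
```

Every GB/s the cache absorbs is also a GB/s of GDDR accesses not burning power, which is the energy-efficiency half of the argument.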
IMHO 32 improved DCUs (64 CUs) with (often) twice the texturing power, beefier register files and instruction frontend, a dedicated BVH tree traversal engine, and 128 MiB of L3 cache would make for a quite nice, balanced, and not-very-small chip.