I wonder how they cope with the much higher data flow between those compute chiplets
Well, we never got any way to communicate across compute workgroups, other than through VRAM.
Thus we are already used to minimizing such communication, because we can assume it's prohibitively slow.
That's how GPGPU has worked since day one. Nothing changes. For pixel and vertex shaders there isn't even a general way to communicate with other threads in the same group.
Jawed has mentioned rasterization details.
Besides, if there were global ray reordering in HW (nobody does this afaict), moving rays across chiplets would become an issue. Just to mix in some hypothetical speculation.
So the only data flow across compute chiplets I see is getting work from some global queue, and eventually stealing / redistributing work across chiplets.
But that's very little data compared to the flow that happens while doing the actual work, which they have already solved with RDNA3. Basically just an index and a count per work item, for example.
Plus some context on the workloads, some synchronization primitives, etc. Seems like peanuts.
So, being an amateur about HW, my assumption actually is: it should be easy to make a compute-chiplet GPU by iterating on RDNA3, which already addressed the real bandwidth problems.