Speculation and Rumors: AMD RDNA4 ...

As I understand it, the big difference is that GCD contains the command processor, while the SED is just arrays of shaders, and command processor is on the base die.
Yep, that seems to be the differentiation in this instance. But those are just made-up names, not general conventions or universal standards, hence my „however you may call those… “.

But this patent seems to focus on scaling up performance, since adding an active interposer does add cost as well. You don't do that on a mid-range class card (yet).
 
There's of course the obligatory caveat that just because a company patents something doesn't mean that they are building it any time soon, or ever.
Oh, the RDNA3 CU patent was posted here like a zillion times over and it was very real so...
 
After how disappointing the RDNA 3 architecture is, I expect only minor and iterative changes for RDNA 4. Multi-compute-die is a pipe dream IMO; it still seems an insurmountable problem for graphics. Maybe we will see the return of HBM, or at least stacked GDDR, to break the bandwidth wall we are stuck at. Infinity Cache is useful, but it doesn't negate the need for continued bandwidth scaling.
 
After how disappointing the RDNA 3 architecture is, I expect only minor and iterative changes for RDNA 4. Multi-compute-die is a pipe dream IMO; it still seems an insurmountable problem for graphics. Maybe we will see the return of HBM, or at least stacked GDDR, to break the bandwidth wall we are stuck at. Infinity Cache is useful, but it doesn't negate the need for continued bandwidth scaling.
GDDR6W gives you 1.4 TB/s on 8 chips (512-bit, 64-bit per chip; probably still 16-bit real channels, there are just 4 memory dies inside the package instead of 2 in GDDR6).
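For reference, the headline figure works out if you assume Samsung's quoted 22 Gbps per pin: 8 chips × 64 bit = 512 bit, and 512 bit × 22 Gbps ÷ 8 bits/byte = 1408 GB/s ≈ 1.4 TB/s. That's the same bus width you would previously have needed 16 GDDR6 packages for.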
 
Ain't no one ever making 512b buses anymore.
Samsung disagrees in their press release at least
Samsung Electronics is proceeding with standardization for GDDR6W products. It has also announced that it will expand the application of GDDR6W to small form factor devices such as notebooks, as well as new high-performance accelerators used for AI and HPC applications, through cooperation with its GPU partners.
Of course it's possible one would just use 4 chips for a 256-bit bus, but still. I don't know how big a difference it makes that a 512-bit bus now needs just 8 chips rather than 16 like before.
 
the product is a true chiplet ™ design for all RDNA4 parts.
I wonder how they cope with the much, much higher data flow between those compute chiplets, as a lot of wires between shader engines leads to increased power consumption, right? Isn't that the reason why they didn't go for multiple GCDs in RDNA3, because it wasn't feasible, as Sam Naffziger hinted in the video from Gamers Nexus?

The software part would be interesting though: making two or more shader chiplets appear as one single GPU to the driver.
 
I wonder how they cope with the much, much higher data flow between those compute chiplets, as a lot of wires between shader engines leads to increased power consumption, right? Isn't that the reason why they didn't go for multiple GCDs in RDNA3, because it wasn't feasible, as Sam Naffziger hinted in the video from Gamers Nexus?

The software part would be interesting though: making two or more shader chiplets appear as one single GPU to the driver.
SEDs should be sharing almost no data amongst themselves. Global atomics and vertex workloads would involve shared traffic.

Vertices, once shaded and ready for primitive assembly, need to end up at the shader engines that will do rasterisation. In some cases that will be multiple shader engines, as the resulting triangles will touch multiple screen-space tiles. The decision over which tiles to send the triangles to is taken after coarse rasterisation has been performed. Then each tile's shader engine does fine rasterisation.

Algorithms where the GPU generates its own work will also involve SEDs sending work data to each other...

AID to AID connections will presumably be a scaled-up version of what we see between RDNA 3's GCD and each MCD. Clearly three AIDs means that the two AIDs at each end will require two hops for roughly one-third of their memory requests... Clearly Infinity Cache within each AID plays a big part.
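To make the "which tiles, therefore which shader engine" step concrete, here's a minimal CUDA sketch of conservative coarse binning. The tile size, engine count, checkerboard tile-to-engine mapping and all names are assumptions for illustration only, not how RDNA actually distributes triangles:

```
// Toy sketch of coarse-raster binning: after a triangle is assembled, decide
// which screen-space tiles (and therefore which shader engine / SED) must
// receive it for fine rasterisation. Everything below is an illustrative
// assumption, not the real hardware scheme.
struct Tri { float2 v0, v1, v2; };

constexpr int TILE   = 32;   // assumed tile size in pixels
constexpr int SE_NUM = 4;    // assumed number of shader engines / SEDs

__device__ int tileToEngine(int tx, int ty)
{
    // Assumed checkerboard distribution of screen tiles across shader engines.
    return ((tx & 1) | ((ty & 1) << 1)) % SE_NUM;
}

__device__ unsigned coarseBin(const Tri& t)
{
    // Conservative coverage: bounding box of the triangle in tile units.
    float minX = fminf(t.v0.x, fminf(t.v1.x, t.v2.x));
    float maxX = fmaxf(t.v0.x, fmaxf(t.v1.x, t.v2.x));
    float minY = fminf(t.v0.y, fminf(t.v1.y, t.v2.y));
    float maxY = fmaxf(t.v0.y, fmaxf(t.v1.y, t.v2.y));

    unsigned engineMask = 0;  // bit i set -> engine i must see this triangle
    for (int ty = (int)minY / TILE; ty <= (int)maxY / TILE; ++ty)
        for (int tx = (int)minX / TILE; tx <= (int)maxX / TILE; ++tx)
            engineMask |= 1u << tileToEngine(tx, ty);
    return engineMask;
}
```

The point of the sketch is just that one shaded triangle can set several bits in the mask, so its attributes have to be forwarded to several SEDs; that forwarding is the cross-chiplet vertex traffic described above.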
 
I wonder how they cope with the much, much higher data flow between those compute chiplets
Well, we never got any options to communicate across compute workgroups other than using VRAM.
Thus we are already used to minimizing such communication, because we can assume it's prohibitively slow.
It's how GPGPU has worked since day one. Nothing changes. For pixel and vertex shaders there isn't even a general way to communicate with other threads in the same group.

Jawed has mentioned rasterization details.
Besides, if there were global ray reordering in HW (nobody does this afaict), moving rays across chiplets would become an issue. Just to mix in some hypothetical speculation.

So the only data flow across compute chiplets I see is fetching work from some global queue, and possibly stealing / redistributing work across chiplets.
But that's very little data compared to the flow involved in doing the actual work, which they have already solved with RDNA3. Basically just an index and a count per work item, for example.
Plus some context on the workloads, some synchronization primitives, etc. Seems peanuts.

So, being an amateur about HW, my actual assumption is: it should be easy to make a compute-chiplet GPU by iterating on RDNA3, which has already covered the real bandwidth problems.
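To illustrate that "global queue" data flow, here is a minimal CUDA sketch of the persistent-worker pattern: the only traffic that has to be globally visible (and hence potentially cross-chiplet) is one atomic counter in VRAM plus the small per-item descriptor. WorkItem, processItem and all other names are placeholders for illustration, not any real driver or hardware interface:

```
#include <cstdint>

// "An index and a count per work item" -- the tiny descriptor mentioned above.
struct WorkItem { uint32_t firstIndex; uint32_t count; };

__device__ void processItem(const WorkItem& w)
{
    // Placeholder for the actual compute work; this is where the heavy,
    // chiplet-local bandwidth is spent.
}

__global__ void persistentWorker(const WorkItem* queue,
                                 uint32_t queueLength,
                                 uint32_t* nextItem)  // single counter in VRAM, visible to all chiplets
{
    while (true) {
        uint32_t slot = 0;
        if ((threadIdx.x & 31u) == 0)                 // one lane per warp grabs the next item
            slot = atomicAdd(nextItem, 1u);
        slot = __shfl_sync(0xffffffffu, slot, 0);     // broadcast the slot to the rest of the warp
        if (slot >= queueLength)
            return;                                   // queue drained, worker exits
        processItem(queue[slot]);
    }
}
```

A chiplet-aware variant could keep one such counter per chiplet and only touch a neighbour's counter to steal work when the local queue runs dry (the redistribute case above); either way it's only a handful of bytes per work item.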
 
I don't see chiplets ever being more than a one-off advantage, whichever variants of "chiplet" and "stacking" are involved, and however many stages of evolution they go through for GPUs, specifically.

There's no reason to expect NVidia to be more than one generation late with chiplets - it was one generation later with a monster on-die cache.

Maybe RDNA 4 is where this happens:
[attached patent diagram: b3da048.png]



So we have:
  • active interposer die (AID) - graphics control processor and last level cache
  • shader engine die (SED) - corresponds with shader engines seen in current RDNA GPUs
  • multimedia and I/O die (MID) - crap that belongs on a cheap process.
There is no GCD as such in this design.

That certainly looks like the future. It would be a little bizarre for chip packaging to advance this far and still have to settle for off package VRAM.
 
After how disappointing the RDNA 3 architecture is, I expect only minor and iterative changes for RDNA 4. Multi-compute-die is a pipe dream IMO; it still seems an insurmountable problem for graphics. Maybe we will see the return of HBM, or at least stacked GDDR, to break the bandwidth wall we are stuck at. Infinity Cache is useful, but it doesn't negate the need for continued bandwidth scaling.
Architecture disappointing?

I think you got things totally mixed up.

The architecture got many more improvements than expected, but because the leaked numbers were misinterpreted and bad speculation was based on those misinterpreted numbers, people had unrealistic expectations.
And then, when that speculation proved to be false, people were disappointed.

To me, the disappointing part is not the architecture. The disappointing part is exactly the other way around: because of all those changes in the architecture, they just did not increase their core counts enough.
 
GDDR6W gives you 1.4 TB/s on 8 chips (512-bit, 64-bit per chip; probably still 16-bit real channels, there are just 4 memory dies inside the package instead of 2 in GDDR6).
I honestly don't see how G6W is better than the old G6X available for two years now. Don't think it has a future.
 