AMD: RDNA 3 Speculation, Rumours and Discussion

Status
Not open for further replies.
You mean 192 CUs right? Yeah that would be better than 2 GCDs with 96 CUs each and should be a reasonable size on 5nm especially with memory and cache broken off into separate chiplets. It would be a much lower risk option for the first chiplet arch.

Yes 192. I probably had "6" on the mind due to 96 (Navi 21) and 256 (old configuration).
 
It's like all the patents reportedly linked to RDNA 3 describe the main desire as making the MCM implementation opaque to software. Hopefully these claims really do apply to RDNA 3 and not to an RDNA 3 + X.

Found the recent patent describing a relatively straightforward way to split pixel operations between chiplets:

It seems that desire is not easy to fulfill.

Might as well speculate this for Navi 31:
  • 3x GCDs of 16x WGPs and 4096 VALU lanes, each about 100mm² +
  • 3x MCDs of 128-bit GDDR and 128MB infinity cache, each around 150mm² +
  • I/O die of 150mm²
  • 24GB GDDR
That's 900mm² for 7900XT.

Then this for Navi 32, re-using Navi 31 chiplets:
  • 2x GCDs of 16x WGPs and 4096 VALU lanes, each about 100mm² +
  • 2x MCDs of 128-bit GDDR and 128MB infinity cache, each around 150mm² +
  • I/O die of 150mm²
  • 16GB GDDR
That's 650mm² for 7800XT.

Then Navi 33, 7700XT is approaching 400mm² as a monolithic GPU with 16x WGPs, 128MB infinity cache, 128-bit 8GB GDDR.
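The area arithmetic above can be sanity-checked with a quick sketch (all die sizes are the speculative figures from this post, not confirmed numbers):

```python
# Speculative die areas (mm²) from the post above -- not confirmed figures.
GCD = 100  # 16 WGPs, 4096 VALU lanes
MCD = 150  # 128-bit GDDR + 128MB infinity cache
IOD = 150  # I/O die

def package_area(gcds, mcds):
    """Total silicon area for a hypothetical chiplet package."""
    return gcds * GCD + mcds * MCD + IOD

print(package_area(3, 3))  # Navi 31: 900 mm²
print(package_area(2, 2))  # Navi 32: 650 mm²
```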

With the same chiplets used by both Navi 31 and 32, any broken WGPs would end up in the salvage SKUs:
  • 7900 with 44 WGPs = 2x full GCDs + 1x GCD with 4x broken WGPs
  • 7800 with 24 WGPs = 2x GCDs each with 4x broken WGPs
In the end, if AMD has learnt from Ryzen then small compute chiplets with lots of SKUs derived from partly-faulty premium chiplets seems likely. Zen 3 chiplets are 81mm², Zen 2 chiplets are 74mm², both on 7nm, relatively expensive in 2019 when Zen 2 launched.
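The salvage arithmetic works out as follows, using the WGP counts speculated above (a sketch, assuming 4 broken WGPs fused off per salvage GCD):

```python
FULL_GCD = 16         # WGPs per fully-enabled GCD (speculative)
SALVAGE_GCD = 16 - 4  # GCD with 4 broken WGPs fused off

assert 3 * FULL_GCD == 48                # 7900XT: three full GCDs
assert 2 * FULL_GCD + SALVAGE_GCD == 44  # 7900: two full + one salvage
assert 2 * FULL_GCD == 32                # 7800XT: two full GCDs
assert 2 * SALVAGE_GCD == 24             # 7800: two salvage GCDs
assert 1 * FULL_GCD == 16                # 7700XT: monolithic Navi 33
```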

You have to argue against this kind of SKU lineup built mostly from ~100mm² GCDs:
  • 7900XT = 48 WGPs
  • 7900 = 44 WGPs
  • 7800XT = 32 WGPs
  • 7800 = 24 WGPs
  • 7700XT = 16 WGPs
  • 7700 = ...
which maximises the return on 5nm wafers, before you can dismiss the "multi-GCD" rumours.

I'm not familiar with all RDNA3 architecture detail either, but it's a bit of a stretch ...
 
You have to argue against this kind of SKU lineup built mostly from ~100mm² GCDs:
  • 7900XT = 48 WGPs
  • 7900 = 44 WGPs
  • 7800XT = 32 WGPs
  • 7800 = 24 WGPs
  • 7700XT = 16 WGPs
  • 7700 = ...
which maximises the return on 5nm wafers, before you can dismiss the "multi-GCD" rumours.

The argument against multi-GCD isn't about the yield benefits of small chiplets. It's about how much risk AMD will take on their first attempt. Wiring up 6 or 7 chiplets and making it all work efficiently seems like a high bar. My vote is still single GCD + memory chiplets.
 
3x MCDs of 128-bit GDDR and 128MB infinity cache, each around 150mm²
Why so much? 64MB V-cache costs 41 mm² (source, previous articles stated 36 mm²). 64bit interface + 64bit MC + some IF costs ~25 mm² (source: Navi 21 die-shot). Even with some kind of interface, it shouldn't go above ~40 mm². So, base die carrying all interfaces and MC could carry V-cache (Infinity Cache) and the resulting 128MB / 128bit 3D chiplet wouldn't be bigger than 80-85 mm². Maybe I'm just missing something important.
 
The argument against multi-GCD isn't about the yield benefits of small chiplets. It's about how much risk AMD will take on their first attempt. Wiring up 6 or 7 chiplets and making it all work efficiently seems like a high bar. My vote is still single GCD + memory chiplets.
Milan X is 17 chiplets in the top chip.

The best argument against my posting is the difficulty of the layout for three MCDs amongst three GCDs. The "triangular" connections are a problem, which I think kills my suggestion. Unless I can dream up a solution:

rotated IO for MCDs
 
Why so much? 64MB V-cache costs 41 mm² (source, previous articles stated 36 mm²). 64bit interface + 64bit MC + some IF costs ~25 mm² (source: Navi 21 die-shot). Even with some kind of interface, it shouldn't go above ~40 mm². So, base die carrying all interfaces and MC could carry V-cache (Infinity Cache) and the resulting 128MB / 128bit 3D chiplet wouldn't be bigger than 80-85 mm². Maybe I'm just missing something important.
The MCD I'm hypothesising is:
  • 128MB of cache which is somewhere in the region of 80mm² on Navi 21
  • 18mm² for 128-bit PHY
  • 23mm² for MCs + Infinity Fabric
That's 121mm². Then some area for two sets of 1TB/s interfaces. I was being conservative with 150mm².
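The component sums can be sketched like so (figures are the rough Navi 21 die-shot estimates quoted in the thread, not measured data):

```python
# Area estimates (mm²) for the hypothetical MCD, based on Navi 21
# die-shot proportions quoted in the thread -- rough figures only.
CACHE_128MB = 80  # 128MB infinity cache, scaled from Navi 21
PHY_128BIT = 18   # 128-bit GDDR PHY
MC_PLUS_IF = 23   # memory controllers + Infinity Fabric

mcd_core = CACHE_128MB + PHY_128BIT + MC_PLUS_IF
print(mcd_core)  # 121 mm²; ~150 mm² after adding two 1TB/s interfaces
```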

The IO die has all the other interfaces and the graphics scheduling. I think that's my biggest mistake in terms of area; it should be more like 80mm², judging from:

[Navi 21 annotated die shot: N21_annotated_text.jpg, from rdna2 (nemez.net)]

Anyway, I think the specific combination of GCDs and MCDs is unlikely due to the triangular layout problem.
 
Aside from managing the graphics pipeline state machine across two GCDs, the biggest risk in breaking up the GCD is device-coherent traffic, e.g. all the world-space primitive stages and device memory atomics.

But even then, TSVs should be able to match the bandwidth of Navi 21's Infinity Cache quite easily, i.e., 1024 bytes/clk bidirectional at fabric clock: 8192 pins at a 4Gbps data rate (HBM2E/3 speed), assuming a 2GHz fclk and DDR signalling, or fewer pins if one assumes faster SerDes. There is also V-Cache in production, demonstrating directly stacked TSVs delivering 2TB/s inter-chip for one small chiplet. In other words, using multiple bridge chiplets has the potential to match or exceed the current 4-5 TB/s bidirectional GL1-L2 bandwidth in Navi 21 (2 x 16ch x 64B x max 2.5GHz), so device-coherent traffic should at least not get worse.
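The bandwidth arithmetic above can be checked in a few lines (pin count and per-pin speed are the assumptions stated in the post):

```python
# Hypothetical TSV link: 8192 pins at 4 Gbit/s each (HBM2E/3-class signalling)
pins, gbit_per_pin = 8192, 4
link_tb_s = pins * gbit_per_pin / 8 / 1000  # bits -> bytes -> TB/s
print(link_tb_s)  # ~4.1 TB/s aggregate

# Navi 21 GL1-L2 for comparison: 2 directions * 16 channels * 64 B/clk * 2.5 GHz
navi21_tb_s = 2 * 16 * 64 * 2.5e9 / 1e12
print(navi21_tb_s)  # ~5.1 TB/s
```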
 
Might be easier to look at some examples with real data.

In DL2 the 6900xt is 3x as fast with RT off vs on. If RDNA 3 is 3.5x faster that would make it 2.2x as fast as the 3090 which is pretty good. It would also mean twice as fast as the 3090 in Control.

If the 4090 is "only" twice as fast as the 3090 with RT enabled it could be a close fight. It's probably better than that though.
 
Looking at DL2 frametimes is interesting.

6900xt at 4K = 54fps = 18.5ms
6900xt at 4K RT = 18fps = 55.6ms
RT cost = 37.1ms

7900xt at 4K = 54fps x 2 = 108fps = 9.3ms
7900xt at 4K RT = 18fps x 3.5 = 63fps = 15.9ms
RT cost = 6.6ms

7900xt = 37.1/6.6 = 5.6x faster than the 6900xt in pure RT. That would be amazeballs.
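The frametime arithmetic above can be reproduced with a small helper (the 2x raster and 3.5x RT scaling factors are the speculative assumptions from this post, not measured data):

```python
def frametime_ms(fps):
    return 1000.0 / fps

def rt_cost_ms(fps_raster, fps_rt):
    """Incremental frametime cost of enabling RT."""
    return frametime_ms(fps_rt) - frametime_ms(fps_raster)

cost_6900xt = rt_cost_ms(54, 18)            # ~37.0 ms measured in DL2 at 4K
cost_7900xt = rt_cost_ms(54 * 2, 18 * 3.5)  # ~6.6 ms (speculative scaling)
print(cost_6900xt / cost_7900xt)            # ~5.6x lower RT cost
```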
 
"1 GCD 5nm, 2x FP32 cores, 1xFP16 cores, 3ghz, 40% die size reduction"
Those are compared to what?
How do you get 2x FP32 but only 1x FP16 when each FP32 unit can do 2x FP16?
Is all that sitting on an interposer? If so, why not just use HBM memory and forget the L3 cache?
 
Maybe HBM is less cost-effective. GlobalFoundries had a plan to manufacture HBM on 12nm+. That could change the situation a bit.
 
I think these expectations are pure fantasy. 3.5x faster in actual RT games performance is almost certainly not happening.
 
Might as well speculate this for Navi 31:
  • 3x GCDs of 16x WGPs and 4096 VALU lanes, each about 100mm² +
  • 3x MCDs of 128-bit GDDR and 128MB infinity cache, each around 150mm² +
  • I/O die of 150mm²
  • 24GB GDDR
That's 900mm² for 7900XT.

The reticle limit is still around 858 mm² so it needs to be Chip-on-Wafer-on-Substrate (CoWoS).
 
What would it tell us about Navi 31?
It would show you whether your method is valid.

We "know" how much faster Ampere is supposed to be in ray tracing versus Turing, so you can check whether your technique is meaningful.

The reticle limit is still around 858 mm² so it needs to be Chip-on-Wafer-on-Substrate (CoWoS).
Reticle limit has no relevance when dealing with some kinds of chiplet-based designs. Similar to how the largest server processors are already well above 1000mm² in effective die area. Add up the chiplet areas for Milan X!
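As a rough illustration of the "well above 1000mm²" point, summing Milan-X's chiplet areas (CCD and V-cache figures are the ones quoted earlier in the thread; the ~416mm² server IOD size is my assumption):

```python
CCD = 81     # Zen 3 CCD, mm² (figure quoted earlier in the thread)
VCACHE = 41  # 64MB V-cache die, mm² (figure quoted earlier in the thread)
IOD = 416    # server I/O die, mm² -- assumed, roughly the Rome/Milan IOD

milan_x = 8 * CCD + 8 * VCACHE + IOD  # 17 chiplets total
print(milan_x)  # 1392 mm² of silicon, well past the ~858 mm² reticle limit
```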

A good reason for RDNA 3 being based solely upon a single graphics die (GCD) is that all the rumours that originally referred to multiple dies were getting confused between Navi 3x and MI300.

AMD Instinct MI300 to be first generation exascale APU with Zen4 CPU and CDNA3 GPU - VideoCardz.com

H100 is CoWoS, isn't it?
 
It would show you whether your method is valid.

You mean calculating RT cost by looking at actual measured frametimes with RT on and off? What criteria would you use to decide those numbers are invalid?

We "know" how much faster Ampere is supposed to be in ray tracing versus Turing, so you can check whether your technique is meaningful.

How much faster is Ampere supposed to be? The 3080 has the same number of RT cores as the 2080 Ti and has 50-70% faster incremental frametimes for RT.

Either way I don’t see the relevance to the hypothetical 7900xt scenario above.
 
How much faster is Ampere supposed to be? The 3080 has the same number of RT cores as the 2080 Ti and has 50-70% faster incremental frametimes for RT.
You're now getting to my point. We have some nebulous "Ampere is x times faster than Turing for ray tracing" claim. You reference two games, Dying Light 2 and Control, and instead of using the data points for Ampere versus Turing to validate whether those games and their RT performance are even worth discussing, you just ignore the possibility of validating your technique.

Either way I don’t see the relevance to the hypothetical 7900xt scenario above.
In the same way we can validate whether TFLOPS is a useful metric for determining gaming performance, we could validate whether the claim about RT speed-up is useful.

NVidia has made claims about RT speed-up with Ampere. We could use game performance to validate those claims. We might find games that align with those claims.

I have no idea whether it's possible to validate those claims of RT speed-up. With respect to RDNA 3 you've proposed a technique, but not tested it.
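The proposed validation could be sketched as follows: apply the same incremental-frametime technique to Ampere versus Turing and compare the measured speed-up against the claimed figure (the fps numbers and the claimed speed-up below are placeholders, not real benchmark data):

```python
def rt_cost_ms(fps_raster, fps_rt):
    """Incremental frametime cost of enabling RT."""
    return 1000.0 / fps_rt - 1000.0 / fps_raster

# Placeholder numbers -- substitute real 2080 Ti / 3080 benchmark results.
turing_cost = rt_cost_ms(60, 32)
ampere_cost = rt_cost_ms(85, 52)

measured_speedup = turing_cost / ampere_cost
claimed_speedup = 1.7  # placeholder for whatever the marketing claim implies
print(measured_speedup)  # compare against claimed_speedup, per game
```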
 