AMD RDNA4 Architecture Speculation

Not to mention that the original Infinity Cache was significantly slower than the current implementation.
 

AMD Radeon RX 9070 specs listed by UK retailer: RDNA4 may stick to PCIe 4.0×16


According to OCUK, the Radeon RX 9070 XT and 9070 non-XT are both listed with 4096 cores. We cannot confirm this at present, as the specifications we have imply otherwise; specifically, the 9070 non-XT should have 3584 cores. However, it is confirmed that both cards feature 16GB of GDDR6 memory and use a 256-bit memory bus.

RDNA4-SPECS-OCUK.png


The retailer also claims that the RX 9070 XT model should be capable of 4K gaming in titles such as Alan Wake 2, a graphics-heavy game with ray tracing.

RX-9070-FAQ.png


 
Meanwhile, according to the price-aggregation service Geizhals, a popular site associated with many German review sites, AMD may be preparing to launch the Radeon RX 9070 series on January 24, with the product reveal two days earlier. The launch date coincides with the GeForce RTX 5090 review embargo we discussed earlier.
As mentioned, we have no confirmation because AMD is shifting dates daily, and there’s hardly any official or even semi-official information (such as product NDAs). As far as we know, AMD hasn’t even started shipping review samples yet. However, it appears that retailers already have them in stock.
What a weird launch...
 
I swear this whole generation is cursed lol, we had no credible leaks for Blackwell at all and AMD is the most schizo it's been in a while.
There were plenty of credible leaks about Blackwell specs. All the performance claims made by others were obviously nonsense though, and I'm not just saying that with hindsight. It's insane to me how much talk there was of people expecting like a 60%+ performance gain again, on basically the same node as Lovelace. All while knowing how generally unremarkable the Blackwell specs were, outside perhaps the 5090.

We'll have to see with AMD. I do think they got caught out by Nvidia's pricing, which ruined their planned announcements, but there's still potential for them to at least offer something compelling, value-wise. Something that provides a meaningful boost in performance per dollar while offering a reasonable amount of VRAM in the $300-500 range is really what GPUs need most right now. And if FSR4 is good (even simply equaling DLSS2 at launch would be good) and gets decent adoption, then I could see these GPUs becoming popular to recommend. But we'll see.
 
RX 9070 XT (N48 XL)

PCIe 5.0 x16 / 2.0 GHz base clock / 2.4 GHz boost clock / 56 CUs (3584 SPs) / 16GB @ 20Gbps (256-bit)
So they try to separate the lower model's performance in reviews a bit more through lower clocks, but then people discover it can clock somewhat higher quite comfortably? That is exactly how they used to do it, and it made for some great products. Only a slight cut-down in core specs, and a good price ($400, please). That's the way to do it. It won't be a high-margin part, but hey, that's what the AI business is for, right? AMD needs to worry far more about market share with Radeon at this point.
 
Only if it isn't power limited.
Sure, but at 2.4 GHz, it would need a pretty harsh power-limit cap to prevent people from getting more out of it if the other part is doing 2.9 GHz.

It would also be a supremely stupid move when AMD GPUs dearly need a reputation boost more than ever. With such a slight cut-down in actual raw specs, they really don't need to push so many people toward buying the top part anyway. This cut-down version will likely be the one they make much more of in the end.
 
Sure, but at 2.4 GHz, it would need a pretty harsh power-limit cap to prevent people from getting more out of it if the other part is doing 2.9 GHz.
Well, the clock difference implies a significant power difference, which in turn may imply that the power delivery will be different, and this alone may prevent any sort of high OC.
Again, it's hard to tell much from the numbers alone; we need to see the actual hardware.
 
LoL. I think I might have figured out how we got that 240mm2 rumor.

I was questioning the 380-390mm2 figure for N48 that some people are now trying to reinforce, and found a picture of N48 in someone's hand that I hadn't seen before.
I pulled up TechPowerUp's N31 die pictures, measured the short edge of the MCD as 5mm, and compared it to the inside edge of the QR code.
If you assume the QR code is the same size on N31 and N48 (probably a bad idea), then I got roughly 10.6 x 21.5 = 228mm2.
Throw in some MOE (3%) for each measurement and that bumps it up to ~241mm2.
I also double-checked using the bigger caps/resistors on the edge (again, assuming they are the same size on both) and came to a pretty similar measurement.

Edit - Using an N32 picture from TPU and the same method on the same N48 picture, I get 260-280mm2 for N48. So not exactly reliable.
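For concreteness, the arithmetic works out as follows (a minimal sketch; the 5mm MCD reference, the 10.6 x 21.5 mm calibrated edges, and the 3% margin of error are all the assumptions from this post, not verified figures):

```cpp
#include <cstdio>

// Die-size estimate from a photo: a reference feature of known size
// (the N31 MCD short edge, taken as 5 mm) calibrates the pixel scale,
// which then yields the N48 edge lengths measured from the picture.
int main() {
    const double width_mm  = 10.6;  // calibrated short edge of N48
    const double height_mm = 21.5;  // calibrated long edge of N48
    const double moe       = 0.03;  // assumed 3% margin of error per edge

    const double nominal = width_mm * height_mm;  // ~228 mm^2
    const double upper   = width_mm * (1.0 + moe) * height_mm * (1.0 + moe);
    const double lower   = width_mm * (1.0 - moe) * height_mm * (1.0 - moe);

    std::printf("nominal %.0f mm^2, range %.0f-%.0f mm^2\n",
                nominal, lower, upper);  // ~228, ~214-242
}
```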
 
Also, they might have finally added dedicated HW for BVH tree traversal, not doing it in software on the shader cores anymore.
Yes, but also no. The trick to BVH traversal is not an independently running traversal unit, but a decent LRU cache that supports bounding-box queries directly, is somewhat aware of L1 cache contents, and gives you a good estimated intermediate node straight from the cache rather than a full traversal starting at the root node. Likewise, the option to traverse the BVH out of order, based on queries against the local cache and the scoreboard of already-requested cache lines. (I.e., probe multiple addresses and be told which one, if any, is closest to becoming accessible next, rather than only being able to issue a blind load and then being forced to stall until completion when it inevitably misses the cache.)

That's key to enabling deeper BVH trees with narrower nodes by default, which would otherwise be counterproductive due to the increased number of sequential memory dependencies. Bundling of coherent rays simply happens as a desirable side effect of proper cache probes.

The real question is whether RDNA4 will already have cache probes, or even spatial queries against a dedicated cache. But I guess that's still very much unlikely.
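To make the out-of-order idea concrete, here is a toy C++ sketch. It is entirely hypothetical: the probe() function stands in for hardware that can report which pending address is already resident (or closest to arriving), which no current ISA exposes.

```cpp
#include <cstddef>
#include <deque>
#include <unordered_set>
#include <vector>

// Toy BVH node: indices of children whose bounding boxes the ray has
// already been found to intersect.
struct Node { std::vector<int> children; };

// Hypothetical cache probe: report which pending node is already
// resident, or -1 if they would all miss. A stand-in for hardware that
// scoreboards outstanding cache lines.
int probe(const std::deque<int>& pending,
          const std::unordered_set<int>& cache) {
    for (std::size_t i = 0; i < pending.size(); ++i)
        if (cache.count(pending[i])) return static_cast<int>(i);
    return -1;
}

// Out-of-order traversal: keep a pool of pending subtrees and always
// expand whichever node the probe says is resident, instead of
// descending depth-first and stalling on every sequential miss.
void traverse(const std::vector<Node>& bvh, std::unordered_set<int>& cache) {
    std::deque<int> pending{0};  // start at the root (node 0)
    while (!pending.empty()) {
        int hit = probe(pending, cache);
        std::size_t pick = hit >= 0 ? static_cast<std::size_t>(hit) : 0;
        int id = pending[pick];
        pending.erase(pending.begin() + pick);
        cache.insert(id);        // the (possibly stalling) load fills the cache
        for (int child : bvh[id].children)
            pending.push_back(child);
    }
}
```

The point of the reorder is that a miss on one subtree no longer serializes the whole walk; other pending subtrees whose nodes are already cached can make progress in the meantime.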
 
The trick to BVH traversal is not an independently running traversal unit, but a decent LRU cache that supports bounding-box queries directly, is somewhat aware of L1 cache contents, and gives you a good estimated intermediate node straight from the cache rather than a full traversal starting at the root node.
The new temporal hints in RDNA4 seem to do this to some extent: in theory, the traversal kernel has more control over how each individual read interacts with different levels of the cache hierarchy and what its seeding LRU order would be.

So e.g. the TLAS read could be “high temporal” (keep in L0?); reading the first couple of levels of the BLAS could be “regular” (normal LRU in L0?); and reading the rest/leaves could be “non-temporal” (read, no allocate in L0?).
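As a rough illustration of that TLAS/BLAS split, here is a minimal C++ sketch. The TemporalHint enum and load_hinted() helper are invented for illustration only; the real mechanism would presumably be cache-policy bits on the RDNA4 load instructions, not any documented API.

```cpp
#include <cstdint>

// Hypothetical temporal hints mirroring the three behaviors guessed at
// above. None of these names come from AMD documentation.
enum class TemporalHint {
    HighTemporal,  // keep resident in L0 (TLAS nodes, reused by many rays)
    Regular,       // normal LRU insertion (upper BLAS levels)
    NonTemporal,   // read without allocating in L0 (leaves, touched once)
};

// Hypothetical hinted load; a no-op stand-in so the sketch compiles.
// On real hardware the hint would select the cache policy for this read.
template <typename T>
T load_hinted(const T* addr, TemporalHint /*hint*/) {
    return *addr;
}

struct TlasNode { std::uint32_t children[4]; };
struct BlasNode { std::uint32_t children[4]; };
struct LeafPrim { float verts[9]; };

// Usage mirroring the suggested split:
void traversal_step(const TlasNode* t, const BlasNode* b, const LeafPrim* p) {
    TlasNode tlas = load_hinted(t, TemporalHint::HighTemporal);
    BlasNode blas = load_hinted(b, TemporalHint::Regular);
    LeafPrim leaf = load_hinted(p, TemporalHint::NonTemporal);
    (void)tlas; (void)blas; (void)leaf;
}
```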
 