AMD RDNA4 Architecture Speculation

Not to mention that the original infinity cache was significantly slower than the current implementation.
 

AMD Radeon RX 9070 specs listed by UK retailer: RDNA4 may stick to PCIe 4.0 x16


According to OCUK, the Radeon RX 9070 XT and RX 9070 non-XT are both listed with 4096 cores. We cannot confirm this at present, as the specifications we have imply otherwise: specifically, that the 9070 non-XT should have 3584 cores. However, it is confirmed that both cards feature 16GB of GDDR6 memory and use a 256-bit memory bus.

RDNA4-SPECS-OCUK.png


The retailer also claims that the RX 9070 XT model should be capable of 4K gaming, citing titles such as Alan Wake 2, a graphics-heavy game with ray tracing.

RX-9070-FAQ.png


 
Meanwhile, according to price aggregation service Geizhals, a popular site associated with many German review sites, AMD may be preparing to launch the Radeon RX 9070 series on January 24, with the product reveal two days earlier. The launch date coincides with the GeForce RTX 5090 review embargo we discussed earlier.
As mentioned, we have no confirmation because AMD is shifting dates daily, and there’s hardly any official or even semi-official information (such as product NDAs). As far as we know, AMD hasn’t even started shipping review samples yet. However, it appears that retailers already have them in stock.
What a weird launch...
 
I swear this whole generation is cursed lol, we had no credible leaks for Blackwell at all and AMD is the most schizo it's been in a while.
There were plenty of credible leaks about Blackwell specs. All the performance claims made by others were obviously nonsense though, and I'm not just saying that with hindsight. It's insane to me how much talk there was of people expecting like a 60%+ performance gain again, on basically the same node as Lovelace. All while knowing how generally unremarkable the Blackwell specs were, outside perhaps the 5090.

We'll have to see with AMD. I do think they got caught out by Nvidia's pricing, which ruined their planned announcements, but there's still potential for them to at least offer something compelling, value-wise. Something that provides a meaningful boost in performance per dollar while offering a reasonable amount of VRAM in the $300-500 range is really what GPUs need most right now. And if FSR4 is good (even simply equaling DLSS2 at launch would be good) and gets decent adoption, then I could see these GPUs becoming popular to recommend. But we'll see.
 
RX 9070XT (N48 XL)

PCIe 5.0 x16 / 2.0 GHz base clock / 2.4 GHz boost clock / 56 CUs (3584 SPs) / 16GB @ 20Gbps (256-bit)
So they try to separate the lower model's performance in reviews a bit more through lower clocks, but then people discover it can clock somewhat higher comfortably? That is exactly how they used to do it, and it made for some great products. Only a slight cut down in core specs, and a good price ($400 please). That's the way to do it. It won't be a high-margin part, but hey, that's what the AI business is for, right? AMD needs to worry far more about market share with Radeon at this point.
 
Only if it won't be power limited.
Sure, but at 2.4 GHz, it would need a pretty harsh power limit cap to prevent people from getting more out of it if the other part is doing 2.9 GHz.

It would also be a supremely stupid move when AMD GPUs dearly need a reputation boost more than ever. With such a slight cut down in actual raw specs, they really don't need to push so many people toward buying the top part anyway. This cut-down version will likely be the one they make much more of in the end.
 
Sure, but at 2.4 GHz, it would need a pretty harsh power limit cap to prevent people from getting more out of it if the other part is doing 2.9 GHz.
Well, the clock difference implies a significant power difference, which in turn may imply that the power delivery will be different, and that alone may prevent any sort of high OC.
Again, it's hard to tell much from the numbers alone; we need to see the actual hardware.
 
LoL. I think I might have figured out how we got that 240mm2 rumor.

I was questioning the 380-390mm2 N48 that some people are now trying to reinforce, and found a picture of N48 in someone's hand that I hadn't seen before.
I pulled up TechPowerUp's N31 die pictures, measured the short edge of the MCD as 5mm, and compared it to the inside edge of the QR code.
If you assume the QR code is the same size on N31 and N48 (probably a bad idea), then I got roughly 10.6 x 21.5 = 228mm2.
Throw in a margin of error (3%) on each measurement and that bumps up to ~241mm2.
I also double-checked using the bigger caps/resistors on the edge (again, assuming they are the same size on both) and came to a pretty similar measurement.

Edit- Using an N32 picture from TPU and the same method with the same N48 picture, I get 260-280mm2 for N48. So not exactly reliable.
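For anyone who wants to replicate the math, here's the scaling arithmetic as a quick script. The pixel values are made-up stand-ins for whatever you'd measure off the actual photos:

```python
# Rough die-size estimate by scaling pixel measurements against a feature
# of known size (the ~5 mm QR-code edge measured on TPU's N31 shots).
# The pixel values below are hypothetical stand-ins, not real measurements.

def estimate_area_mm2(ref_mm, ref_px, die_w_px, die_h_px):
    """Convert die pixel dimensions to mm^2 via a known reference length."""
    mm_per_px = ref_mm / ref_px
    return (die_w_px * mm_per_px) * (die_h_px * mm_per_px)

# e.g. a 5 mm reference edge spanning 100 px, die measuring 212 x 430 px:
area = estimate_area_mm2(5.0, 100, 212, 430)
print(round(area, 1))                  # 227.9 -> the ~228mm2 figure

# A 3% margin of error on each linear measurement compounds on the area:
print(round(area * 1.03 * 1.03, 1))    # 241.8 -> the ~241mm2 figure
```

Of course, as the N32 cross-check shows, the whole thing stands or falls with the assumption that the reference feature is the same size on both packages.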
 
Also, they might have finally added dedicated HW for BVH tree traversal, no longer doing it in software on the shader cores.
Yes, but also no. The trick with BVH traversal is not an independently running traversal unit, but a decent LRU cache that supports bounding-box queries directly, is also somewhat aware of L1 cache contents, and can give you a good intermediate-node estimate straight from the cache rather than requiring a full traversal starting at the root node. Likewise, the option to traverse the BVH out of order, based on queries against the local cache and the scoreboard of already-requested cache lines. (I.e. probe multiple addresses and be told which one, if any, is closest to being accessible next, rather than only being able to issue a blind load and then being required to stall until completion when it inevitably misses the cache.)

That's key to enabling deeper BVH trees with narrower nodes by default, which would be otherwise counterproductive due to the increased number of sequential memory dependencies. Bundling of coherent rays simply happens as a desirable side effect of proper cache probes.

The real question is if RDNA4 will already have cache probes or even spatial queries against a dedicated cache. But I guess that's still very much unlikely.
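To make the distinction concrete, here's a toy software model of that out-of-order idea. The probe "API" against a plain set is purely hypothetical, standing in for a probe-capable LRU that real hardware would query without software involvement:

```python
# Toy model of out-of-order BVH traversal: instead of issuing one blind
# load and stalling on a miss, probe the pending sibling addresses and
# descend first into whichever node is already resident. The `cache` set
# is a hypothetical stand-in for a probe-capable LRU.

def traverse(bvh, root, ray_intersects, cache):
    """bvh maps addr -> node dict; returns addresses of intersected leaves."""
    pending, hits = [root], []
    while pending:
        # Probe: pick any pending node that is already cached, else fall
        # back to the oldest request (which would stall on real hardware).
        i = next((k for k, a in enumerate(pending) if a in cache), 0)
        addr = pending.pop(i)
        cache.add(addr)            # the load brings the node into cache
        node = bvh[addr]
        if node["leaf"]:
            if ray_intersects(node):
                hits.append(addr)
        else:
            pending.extend(node["children"])   # siblings, order flexible
    return hits
```

The point of the sketch is only the scheduling: cached siblings get visited while the missing ones are still in flight, so the chain of stalls shrinks even though the same nodes are touched.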
 
The trick about BVH traversal is not an independently running traversal unit, but a decent LRU cache which supports bounding box queries directly and that's also somewhat aware of L1 cache contents and give you a good estimate intermediate node straight from the cache rather than a full traversal starting at the root node.
The new temporal hints in RDNA4 seem to do this to some extent — in theory the traversal kernel has more control over how each individual read will interact with different levels of the cache hierarchy & what their seeding LRU order would be.

So e.g. the TLAS read could be “high temporal” (keep in L0?); reading the first couple of levels of the BLAS could be “regular” (normal LRU in L0?); and reading the rest/leaves could be “non-temporal” (read, no allocate in L0?).
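Something like this mapping, sketched in code. The hint names and level cutoffs are my own illustrative assumptions, not AMD's actual encoding:

```python
# Sketch of a level-to-hint mapping for BVH reads. Hint names and the
# depth cutoffs are illustrative assumptions, not AMD's actual ISA encoding.

def temporal_hint(depth, blas_start=2, leaf_depth=6):
    if depth < blas_start:        # TLAS nodes: touched by nearly every ray
        return "HIGH_TEMPORAL"    # keep resident in L0
    if depth < leaf_depth:        # upper BLAS levels: ordinary reuse
        return "REGULAR"          # default LRU insertion
    return "NON_TEMPORAL"         # lower nodes/leaves: read, don't allocate

assert [temporal_hint(d) for d in (0, 3, 8)] == \
       ["HIGH_TEMPORAL", "REGULAR", "NON_TEMPORAL"]
```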
 
The new temporal hints in RDNA4 seem to do this to some extent — in theory the traversal kernel has more control over how each individual read will interact with different levels of the cache hierarchy & what their seeding LRU order would be.

So e.g. the TLAS read could be “high temporal” (keep in L0?); reading first couple level of BLAS could be “regular” (normal LRU in L0?); and reading the rest/leaves could be “non-temporal” (read no allocate in L0?).
That's not even close to what's needed. It's not enough to have better cache retention strategies (which is catching up to a somewhat related technology Nvidia has, even though they do it via a programmable address-range table rather than per instruction).

While there is a theoretical benefit in being able to provide feedback of the form "this node was in the chain up to a hit, keep it in cache", that's only partially sufficient if you can't also say "I got 4 siblings to pick from, give me the best choice" and thereby accept the result of the feedback loop. You need cache content aware out-of-order execution capabilities.

And an ASIC that implements spatial queries over an LRU is yet a different thing, and can't be substituted with pure cache-retention intrinsics.

In the end, ray tracing is at its core a tree traversal with a lower bound of logarithmic steps in the best case, and in practice a multiple of that, as the BVH often isn't overlap-free. Effectively that's a somewhat constant cost of 6-100 memory-dependent indirections you need to follow for each traced ray. The arithmetic fraction of the cost is low, but every shortcut that cuts even a fraction of the memory dependencies pays off manifold due to the reduced need for latency hiding by thread overcommissioning.
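A quick back-of-envelope to illustrate; all latency numbers here are assumptions for the sake of the calculation, not measured RDNA figures:

```python
# Back-of-envelope for why cutting memory dependencies pays off manifold.
# Latencies below are assumed round numbers, not measured RDNA figures.

MISS_LATENCY = 400   # cycles for a node fetch that misses cache (assumed)
HIT_LATENCY = 40     # cycles when the node is already resident (assumed)

def ray_critical_path(indirections, hit_rate):
    """Serialized cycles spent on one ray's chain of dependent fetches."""
    misses = indirections * (1 - hit_rate)
    hits = indirections * hit_rate
    return misses * MISS_LATENCY + hits * HIT_LATENCY

# 30 dependent indirections per ray: every fetch served from cache removes
# a large fixed chunk from the per-ray critical path, which is exactly the
# latency that thread overcommissioning otherwise has to hide.
print(ray_critical_path(30, 0.50))   # 6600.0
print(ray_critical_path(30, 0.75))   # 3900.0
```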
 
FSR 4 Games for RDNA 4 GPUs

  • ARK: Survival Ascended
  • Bellwright
  • Beyond Hanwell
  • Call of Duty: Black Ops 6
  • Caravan SandWitch
  • Creatures of Ava
  • DON'T SCREAM
  • DUCKSIDE
  • Enotria: The Last Song
  • EVERSPACE 2
  • Farming Simulator 25
  • FINAL FANTASY XVI
  • FlipScapes
  • Frostpunk 2
  • Funko Fusion
  • Ghost of Tsushima DIRECTOR'S CUT
  • God of War Ragnarök
  • Gori: Cuddly Carnage
  • GreedFall II: The Dying World
  • Horizon Forbidden West Complete Edition
  • Horizon Zero Dawn Remastered
  • Hunt: Showdown 1896
  • Incursion Red River
  • Lost Records: Bloom & Rage
  • Manor Lords
  • Marvel Rivals
  • Marvel's Spider-Man Remastered
  • Marvel's Spider-Man: Miles Morales
  • MechWarrior 5: Clans
  • Microsoft Flight Simulator 2024
  • New Home: Medieval Village
  • No More Room in Hell 2
  • Pine Harbor
  • Predator: Hunting Grounds
  • Providence
  • Ratchet & Clank: Rift Apart
  • Rem Survival
  • REMNANT II
  • S.T.A.L.K.E.R. 2: Heart of Chornobyl
  • Satisfactory
  • SILENT HILL 2
  • Supermoves
  • Tactical Vengeance: Play The Games
  • Test Drive Unlimited Solar Crown
  • The Axis Unseen
  • The First Descendant
  • The Last of Us Part I
  • Tiny Glade
  • Until Dawn
  • Warhammer 40,000: Darktide
  • Warhammer 40,000: Space Marine 2

 