RDNA4

Frenetic Pony · Jan 21, 2024

Known:
This year
GDDR7
Raytracing Traversal (similar performance to Intel/Nvidia)
Powerdraw is fixed, very high clockspeeds

Leaked?(by you know who, plus some dumb AMD engineer on a youtube comment)
Announcement in H1, maybe Computex
Big chip is in this time, likely stacked SRAM and such
Big chip is really expensive, stupid big, tons of power draw, there solely for PR purposes, >$1k
$1k >= chip is more mainstream high end.

Bondrewd · Jan 21, 2024

Frenetic Pony said:
GDDR7

Maybe.

Frenetic Pony said:
Raytracing Traversal (similar performance to Intel/Nvidia)

Yes but not quite.

Frenetic Pony said:
Powerdraw is fixed, very high clockspeeds

Seems so.

Frenetic Pony said:
Announcement in H1, maybe Computex

Yea.

Frenetic Pony said:
Big chip is in this time, likely stacked SRAM and such

Dead.

Frenetic Pony said:
Big chip is really expensive, stupid big, tons of power draw, there solely for PR purposes, >$1k

It's dead, jim.

Frenetic Pony said:
$1k >= chip is more mainstream high end.

Oh, any stacked Si spam part will be far above $1k.
You'll have to wait for Navi50 for that, which also gonna double as a DC GDDR-based PCIe stick.

Frenetic Pony · Jan 21, 2024

Guesses:

12WGP, 64bit bus, monolithic, 8gb ram, 32mb SRAM LLC, 3.2ghz =<, 4060, $199

32WGP, 256bit bus, 16gb ram (28gbps), 64mb LLC, 3.1ghz, 4080 $699
28WGP, 192bit bus, 12gb ram, 48mb LLC, 2.9ghz, 4070ti, $499
24WGP, 128bit bus, 16gb ram, 32mb LLC, 2.7ghz, 4070, $399
64WGP, 384bit bus, 24gb ram, 192mb LLC, 3.1ghz, 50% faster than 4090, $1599
56WGP, 320bit bus, 15gb ram, 80mb LLC, 2.8ghz, 20% faster than 4090, $1k

Bondrewd said:
Maybe.

Yes but not quite.

Seems so.

Yea.

Dead.

It's dead, jim.

Oh, any stacked Si spam part will be far above $1k.
You'll have to wait for Navi50 for that, which also gonna double as a DC GDDR-based PCIe stick.

Big chip is from our favorite resident leaker, and a second source, who is a confirmed AMD engineer that got stupidly mad on a youtube comment. MLID is just some youtuber after clicks. That being mr resident leaker has kinda been circumspect around a "big chip" other than it being in the cards somewhere soonish.

Bondrewd · Jan 21, 2024

Frenetic Pony said:
Big chip is from our favorite resident leaker, and a second source, who is a confirmed AMD engineer that got stupidly mad on a youtube comment.

N4C's do not exist anymore.
It was a chtonic 120+ WGP something on the upper end that was knifed in favour of just fast tracking RDNA5.
4 is like the orphan family now where only 2 out of 7 parts planned survived.

Frenetic Pony · Jan 22, 2024

Bondrewd said:
N4C's do not exist anymore.
It was a chtonic 120+ WGP something on the upper end that was knifed in favour of just fast tracking RDNA5.
4 is like the orphan family now where only 2 out of 7 parts planned survived.

120??? What would that even be, watercooled 600w at 2.2ghz, maybe more?
I mean, some of the big integrators have gotten air cooling past 450w now, I can see Nvidia pulling up 525w (PCIE5 75w direct and a full 450w connector) for Blackwell 5090, just to flex again ($1800 card yaaay)... but who would try anything more, seems out of control.

Bondrewd · Jan 22, 2024

Frenetic Pony said:
120???

Kinda lowballing really.

Frenetic Pony said:
What would that even be, watercooled 600w at 2.2ghz, maybe more?

Faster and yes, 600W.

Frenetic Pony said:
seems out of control.

Well it's not a mainstream solution at all.

pTmdfx · Jan 22, 2024

By the looks of the GFX12 LLVM patches so far:

1. No patch having mentioned hardware traversal (yet?); still only image bvh intersect ~~+ DS traversal stack instructions~~ * from GFX11.

2. The architecture has the concept of multiple AIDs, though whether any models with multiple AIDs has survived the alleged chop is a separate question…

3. Lots of changes to memory access policy control (SLC/DLC/GLC -> 3-bit temporal reuse hint + 2-bit for scope a la shader-engine/device/system)

4. New WMMA instructions and SWMMAC instructions. Not sure what S stands for in the latter… sparsity? super?

Edit: Somehow `ds_bvh_stack_*` (traversal stack in LDS) is left out for GFX12. An indication of some changes, at least.

techuse · Jan 23, 2024

Will RDNA 4 be topping out at the 400$ tier?

Bondrewd · Jan 23, 2024

techuse said:
Will RDNA 4 be topping out at the 400$ tier?

Higher (shouldn't be by much, though, those are mainstream/mobile parts).

mboeller · Jan 24, 2024

Based on the rumors/leaks only 2 GPU remain for the RDNA4 lineup, the N44 (IMO most likely the successor of the N33) and the N48. Both are said to be monolithic designs in the lowend (N44) or midrange (N48) class.

But, what if the leaks/rumors about the monolithic designs are not entirely correct?

N41 for example was a design with 3 x [chiplets with chiplets], with a two level chiplet-design. Based on some patent drawings the inner level with the /SED/, /memory/ and /CP/ chiplets on top of a bigger base-chiplet and then the outer level with 3 of these base-chiplets forming the N41 GPU, or 2 of these chiplets forming the N42 GPU and 1 chiplet used for the N43 GPU.

Link: https://www.forum-3dcenter.org/vbulletin/showpost.php?p=13374402&postcount=573
the same patent drawing was discussed here in the forum, but I was not able to find the link.

Therefore, for me it seems possible that the inner level of [chiplets with chiplets] was abandoned due to capacity problems but not the outer level with 1-3 chiplets.

N48 could be a “monolithic” chiplet/GPU to replace the N43 chiplet-based design but from my point of view this does not rule out that 2-3 N48 could still be used for some sort of highend GPU’s as long as the distributed command processor and architecture, designed for RDNA4 work as intended.

any thoughts?

Bondrewd · Jan 24, 2024

mboeller said:
N48 could be a “monolithic” chiplet/GPU to replace the N43 chiplet-based design but from my point of view this does not rule out that 2-3 N48 could still be used for some sort of highend GPU’s as long as the distributed command processor and architecture, designed for RDNA4 work as intended.

No you can't, N4m's aren't made for that at all.
Very simple products overall.

mboeller · Jan 25, 2024

damn...

Thanks

Frenetic Pony · Jan 25, 2024

Is there even a difference between RDNA3.5 and 4? If there is it seems like a trivial one.

Kaotik · Jan 25, 2024

Frenetic Pony said:
Is there even a difference between RDNA3.5 and 4? If there is it seems like a trivial one.

Older rumors said RDNA 3.5 would miss RT updates

Bondrewd · Jan 26, 2024

Frenetic Pony said:
Is there even a difference between RDNA3.5 and 4

Yeah they're just completely unrelated.

fellix · Jan 29, 2024

RDNA 4 continues AMD’s GPU ISA evolution. Software prefetch and more flexible scalar loads continue a trend of GPUs becoming more CPU-like as they take on more compute applications. AI gets a nod as well with FP8 and sparsity support. Better cache controls are great to see as well, and more closely match the ISA to RDNA’s more complex cache hierarchy.

Finally, remember nothing is final until a RDNA 4 product is released. All the information here is preliminary.

Examining AMD’s RDNA 4 Changes in LLVM

As 2024 continues on, because time never stops, AMD has been working on their upcoming RDNA 4 architecture. Part of this involves supporting open source projects like LLVM. If done right, merging t…

chipsandcheese.com

Frenetic Pony · Jan 29, 2024

What I don't get about the new chiplet patent (cancelled arch) is why go for the hardest thing first, which is work distribution and coordination independence for each chiplet?

Why not just split out shader engines to their own chiplets then have unique command processor/etc. etc. per SKU. You still get most of the benefits of chiplets cost wise, even if you need to plan/build each specific GPU separately. Why go for the hardest possible arch first?

pTmdfx · Jan 30, 2024

Frenetic Pony said:
What I don't get about the new chiplet patent (cancelled arch) is why go for the hardest thing first, which is work distribution and coordination independence for each chiplet?

Why not just split out shader engines to their own chiplets then have unique command processor/etc. etc. per SKU. You still get most of the benefits of chiplets cost wise, even if you need to plan/build each specific GPU separately. Why go for the hardest possible arch first?

You see CP itself as a separable freestanding block. But the graphics pipeline is a monolithic state machine spanning from the central CPs & geometry processor, shader dispatch network, to per shader engine resources like rasterisers, export buses/caches and render backends. It is a blend of control and data paths and plenty of intermediate state, all of which are outside of the R/W cache hierarchy. Many of these do not naturally enjoy a clean chop like how memory channels or screen-space partitions do.

Imagine trying to break a CPU core front-end and several execution clusters off as individual chiplets. Might as well do the hardest thing, aka going multi-core with complete core(s) as chiplets & inevitably calling for complicated things like cache coherency and DVFS management.

Edit: Doing the “hardest thing” first, you get scalable clean chops (chiplets) using solely standard IP interfaces, i.e., blocks can coordinate themselves solely through the system/device memory hierarchy. (e.g., IMG “Multi-Core” GPU) Meanwhile, going for a compromise and one might end up with chiplets having lots of one-off custom interfaces to serve graphics pipeline internals.

trinibwoy · Jan 30, 2024

pTmdfx said:
You see CP itself as a separable freestanding block. But the graphics pipeline is a monolithic state machine spanning from the central CPs & geometry processor, shader dispatch network, to per shader engine resources like rasterisers, export buses/caches and render backends. It is a blend of control and data paths and plenty of intermediate state, all of which are outside of the R/W cache hierarchy. Many of these do not naturally enjoy a clean chop like how memory channels or screen-space partitions do.

Imagine trying to break a CPU core front-end and several execution clusters off as individual chiplets. Might as well do the hardest thing, aka going multi-core with complete core(s) as chiplets & inevitably calling for complicated things like cache coherency and DVFS management.

Edit: Doing the “hardest” thing, you get scalable clean chops (chiplets) using solely standard IP interfaces, i.e., blocks can coordinate themselves solely through the system/device memory hierarchy. (e.g., IMG “Multi-Core” GPU) Meanwhile, going for a compromise and one might end up with chiplets having lots of one-off custom interfaces to serve graphics pipeline internals.

I’m no hardware engineer but it seems the path of least resistance would be to prove out CP work distribution on die first before splitting into multiple dies. If it doesn’t work on-die it’s definitely not going to work on-package.

Kaotik · Jan 30, 2024

AMD already explained why shader engine chiplets is a bad (expensive) idea when they first brought chiplets to GPUs

RDNA4

Frenetic Pony

Bondrewd

Frenetic Pony

Bondrewd

Frenetic Pony

Bondrewd

pTmdfx

techuse

Bondrewd

mboeller

Bondrewd

mboeller

Frenetic Pony

Kaotik

Drunk Member

Bondrewd

fellix

Examining AMD’s RDNA 4 Changes in LLVM

Frenetic Pony

pTmdfx

trinibwoy

Meh

Kaotik

Drunk Member

Similar threads