Speculation and Rumors: AMD RDNA4 ...

Architecture disappointing?

I think you got things totally mixed up.

The architecture got far more improvements than expected, but because the leaked numbers were misinterpreted and bad speculation was built on those misinterpreted numbers, people had unrealistic expectations.
And then, when that speculation proved to be false, people were disappointed.

To me, the disappointing part is not the architecture. The disappointing part is exactly the opposite: because of all those changes in the architecture, they just did not increase their core counts enough.
The expectation was a completely revamped shader core, not just tacked-on dual issue and some minor RT efficiency tweaks. What exactly are the significant changes?
 
That certainly looks like the future. It would be a little bizarre for chip packaging to advance this far and still have to settle for off-package VRAM.
Well, in theory, it's now possible to connect HBM to a GPU more cheaply as there's no need for a silicon interposer just to make that happen.

Fury X brought 512 GB/s in 2015 (and wasted that bandwidth) and we got to ~1 TB/s in 2020. While Hopper (H100) has vastly more bandwidth, consumer GPUs aren't really pushing bandwidth like they used to. Also, given the way capacity has developed across HBM revisions (2E and 3), useful bandwidths beyond 1 TB/s come with far too much capacity for consumer applications...

So, even if it's now cheaper to put HBM on a GPU's package, it doesn't look like it's going to happen.
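To put some numbers on that, here is a minimal sketch of the usual peak-bandwidth arithmetic (bus width in bits × per-pin data rate / 8). The 2015 figure is Fury X's four HBM1 stacks; the ~1 TB/s 2020 figure assumes a 384-bit GDDR6X card at 19.5 Gbit/s per pin.

```python
# Peak memory bandwidth (GB/s) = bus width (bits) * per-pin data rate (Gbit/s) / 8.

def peak_bandwidth_gbps(bus_width_bits: int, data_rate_gbit_s: float) -> float:
    """Theoretical peak bandwidth in GB/s."""
    return bus_width_bits * data_rate_gbit_s / 8

# Fury X: 4 stacks of HBM1, 1024 bits each, 1 Gbit/s per pin.
print(peak_bandwidth_gbps(4 * 1024, 1.0))   # 512.0 GB/s

# A 2020 flagship with a 384-bit GDDR6X bus at 19.5 Gbit/s per pin (assumed example).
print(peak_bandwidth_gbps(384, 19.5))       # 936.0 GB/s, i.e. ~1 TB/s
```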
 
Well, in theory, it's now possible to connect HBM to a GPU more cheaply as there's no need for a silicon interposer just to make that happen.
That hurdle was left behind years ago when Intel released Kaby Lake-G, which used EMIB to connect HBM to the Vega M GPU.
 
"The GDDR7 memory technology is not expected to be used in graphics cards any time soon, furthermore, Samsung did not specify the timing and plans for the standard adoption yet...."
 
RDNA4 is not coming anytime soon either.

If there is no next-generation memory with increased speeds next time around (~2 years), it's going to be interesting to see how that gets worked around.
 
The expectation was a completely revamped shader core, not just tacked-on dual issue and some minor RT efficiency tweaks. What exactly are the significant changes?

What do you mean by "completely revamped"? If you mean a totally new architecture, then you were just wrong. YOU WERE WRONG. Not AMD.

But dual-issue of FMAs is a significant change anyway. It's not something that can just be "tacked on". It does mean big changes, even though it does not mean a totally new architecture.
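To illustrate what dual-issue buys on paper, here is a back-of-the-envelope peak-FP32 sketch; the 96 CUs, 2× SIMD32 per CU and ~2.5 GHz clock are assumed Navi 31-class figures.

```python
# Back-of-the-envelope peak FP32 throughput, with and without dual-issue FMA.
# Assumes a Navi 31-class part: 96 CUs, 2x SIMD32 per CU, ~2.5 GHz boost clock.

def peak_fp32_tflops(cus: int, simds_per_cu: int, lanes: int,
                     clock_ghz: float, dual_issue: bool) -> float:
    flops_per_lane = 2 * (2 if dual_issue else 1)  # FMA = 2 flops; dual-issue doubles it
    return cus * simds_per_cu * lanes * flops_per_lane * clock_ghz / 1000

print(peak_fp32_tflops(96, 2, 32, 2.5, dual_issue=False))  # ~30.7 TFLOPS
print(peak_fp32_tflops(96, 2, 32, 2.5, dual_issue=True))   # ~61.4 TFLOPS
```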

Also, the RT units got MUCH bigger improvements than just "minor efficiency tweaks"; for example, ray sorting and traversal handling is a BIG change.

So, you seem to be quite clueless about the actual technical changes AMD made with RDNA3; you just rant because the wild speculation based on some misunderstood rumours ended up not being true and your expectations were way too optimistic.
 
RDNA4 is not coming anytime soon either.

If there is no next-generation memory with increased speeds next time around (~2 years), it's going to be interesting to see how that gets worked around.
At the same time, both AMD and Nvidia seem to have more tools in the chest for improving effective bandwidth, so they aren't quite as reliant on memory chip speeds as before.

But yea, the timing could maybe still work out for GDDR7 in late 2024. Not much to go on, though.
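As a toy illustration of one such tool: a large last-level cache effectively amplifies DRAM bandwidth, since only misses reach the memory chips. The hit rates and the 576 GB/s baseline (a 256-bit GDDR6 card at 18 Gbit/s) below are purely assumed.

```python
# Toy model of "effective" bandwidth from a large on-die last-level cache:
# if a fraction hit_rate of the traffic is served on-die, DRAM only sees the rest,
# so DRAM bandwidth is effectively amplified by 1 / (1 - hit_rate).
# The hit rates below are illustrative assumptions, not measured figures.

def effective_bandwidth_gbps(dram_gbps: float, hit_rate: float) -> float:
    return dram_gbps / (1.0 - hit_rate)

for hit_rate in (0.0, 0.4, 0.6):
    print(hit_rate, round(effective_bandwidth_gbps(576, hit_rate)))
# 576 GB/s of GDDR6 behaves like ~960 GB/s at a 40% hit rate, ~1440 GB/s at 60%.
```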
 


Lol, only WCCFtech could get away with saying RGT “has a reliable track record”.
 
I guess that's possible, e.g. by having some sub command processor per compute tile, with one main processor on top. Not much new data flow besides the commands would then be needed across the tiles?

But initially I had assumed chiplets meant the same revolution that happened to CPUs when they went multi-core, so we would indeed see multiple small GPUs.
Harder to saturate, but not impossible. I would still prefer it if it helps with pricing. I don't think this would be as disruptive as many people think. You still have only one VRAM pool for all GPUs, and the driver can help with translation of old games.

Though, it's still a revolution. AMD's market share is too small to establish this, and a single device is their only option. Maybe this adds something to the respect they deserve for pioneering chiplets.
If you think about the mathematics: if a ray hits a pixel on GCD 1 and is reflected to a pixel on GCD 2, this could lead to high bandwidth between the GCDs.
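A rough, purely illustrative estimate of the kind of cross-die traffic those reflected rays could generate; the resolution, frame rate, crossing probability and per-ray payload below are all assumptions.

```python
# Rough estimate of inter-die traffic if reflection rays can land on the "other" GCD.
# All inputs are illustrative assumptions, just to show the order of magnitude.

rays_per_pixel = 1            # one reflection ray per pixel
pixels         = 3840 * 2160  # 4K
fps            = 60
cross_fraction = 0.5          # chance the hit falls on the other die (2 GCDs, naive split)
bytes_per_ray  = 64           # origin, direction, payload, hit record...

traffic_gb_s = rays_per_pixel * pixels * fps * cross_fraction * bytes_per_ray / 1e9
print(round(traffic_gb_s, 1), "GB/s of extra cross-die traffic")  # ~15.9 GB/s

# Note: this only counts forwarding the rays themselves; pulling BVH nodes and
# geometry across the link during traversal would multiply this figure.
```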
 
That's something I really found weird: they didn't opt for a single big unified L3, as in their Ryzen CPU line, to avoid latency. Especially with RT, where several bounces are required, a big unified L3 would allow better efficiency by avoiding the power wasted on all the memory-related ops and the wasted bandwidth.
They just seem too focused on cost reduction, but in the end performance seems to suffer, and I'm not sure that trade-off is worth it.
 
That's something I really found weird: they didn't opt for a single big unified L3, as in their Ryzen CPU line, to avoid latency. Especially with RT, where several bounces are required, a big unified L3 would allow better efficiency by avoiding the power wasted on all the memory-related ops and the wasted bandwidth.
They just seem too focused on cost reduction, but in the end performance seems to suffer, and I'm not sure that trade-off is worth it.
I think this is an issue more related to the L1 and L2 caches. I think that's where most of the values are transferred.
 
That's something I really found weird: they didn't opt for a single big unified L3, as in their Ryzen CPU line, to avoid latency. Especially with RT, where several bounces are required, a big unified L3 would allow better efficiency by avoiding the power wasted on all the memory-related ops and the wasted bandwidth.
They just seem too focused on cost reduction, but in the end performance seems to suffer, and I'm not sure that trade-off is worth it.
The GPU L3 is a memory-side cache, and it is in a different clock domain from the shaders (either fabric or memory clock).

Whether it is “unified” or not has no significant bearing on latency by itself. If anything, the Ryzen CCX L3 cache you mentioned is similarly multi-banked (or “sliced”, as they call it) to provide higher bandwidth and parallelism. It is just architecturally private to the CCX, and clocked and placed accordingly.

(edit: likely lower complexity and power usage as well, compared to a hypothetical “unified”, astronomically high associativity counterpart)
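A minimal average-memory-access-time sketch of that point: the hit rate and the miss penalty dominate, and a few cycles of difference in L3 hit latency (banked vs. a hypothetical unified design) barely move the average. The cycle counts are illustrative, not measured RDNA figures.

```python
# Average memory access time (AMAT) toy model for the L3 discussion above.
# Latencies are illustrative cycle counts, not measured RDNA figures.

def amat(l3_hit_rate: float, l3_latency: float, dram_latency: float) -> float:
    return l3_hit_rate * l3_latency + (1.0 - l3_hit_rate) * dram_latency

print(amat(0.6, 120, 500))  # banked-ish L3 hit latency:        ~272 cycles
print(amat(0.6, 110, 500))  # hypothetical "unified" hit latency: ~266 cycles
print(amat(0.7, 120, 500))  # raising the hit rate instead:       ~234 cycles
```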
 