AMD: RDNA 3 Speculation, Rumours and Discussion

I don't get the ARM craze; it's like everyone suddenly decided to believe Apple's marketing team (remember their "stellar" GPU presentation) and some toy tests from an ARM fan at Anandtech. Pretty sure if it was so good, the big boys would already be doing it (and it seems Keller's K12.3 didn't quite pan out, so the magic ARM performance and efficiency wasn't really there).

ARM isn't interesting. It's the CPUs (and now GPUs) designed by Apple's internal silicon design team that are causing jaws to hit the floor.

The writing is on the wall, the benchmarks are in, and products are actually in consumers' hands. The performance-per-watt and performance-per-core of Apple's chips are astonishing, even when running x86 apps through an emulation layer.

Well, it's good for what it is, but it seems too wide to scale to a proper HEDT/enterprise level.

?
Their transition is going according to plan. I believe Apple stated that by the end of 2021, their consumer Macs will all be on Apple Silicon. So far they've held their end of the bargain:
MacBook Air
MacBook Pro 13"
Mac mini
iMac (24" replacing the 21.5")
are all currently on Apple Silicon.

The only consumer class Macs that need the Apple Silicon treatment are:
MacBook Pro 16"
iMac (27")

Mac Pro will of course need some customised silicon -- most likely because having only integrated RAM would be untenable for the market which the Mac Pro addresses. I suspect we may see an Apple Silicon for the Mac Pro which resembles Sapphire Rapids: integrated cache/memory similar to HBM, backed by access to external DRAM. This chip will also need a truckload of PCIE lanes for expansion cards and Thunderbolt.


Kinda meh that ATi is again abandoning the mid-range market; hopefully a potentially successful RDNA 3 won't be followed by an R600 2.0...

Few chips could ever be as disappointing as R600. Vega comes close though.
 
I don't get the ARM craze; it's like everyone suddenly decided to believe Apple's marketing team (remember their "stellar" GPU presentation) and some toy tests from an ARM fan at Anandtech. Pretty sure if it was so good, the big boys would already be doing it (and it seems Keller's K12.3 didn't quite pan out, so the magic ARM performance and efficiency wasn't really there). Well, it's good for what it is, but it seems too wide to scale to a proper HEDT/enterprise level.

M1 and other Apple CPUs aren't "good" because they use the ARM ISA. They're simply good, strong CPUs that happen to use the ARM ISA. The M1 is proving that it's possible, and that someone is actually succeeding in leaving all the x86 legacy behind (remembering, to be fair, that Apple is in the fortunate position that allows this).
It's not that ARM is "the way", more that it's a viable alternative. There are other makers betting on ARM CPUs, and soon x86 may lose its supremacy in the market. How long until Intel and AMD start making ARM CPUs to not lose market share? And when that happens, how long will it take for x86 to become niche?
 
Why yes, aarch64 toddlers are insufferable, why do you ask?

more that it's a viable alternative.
Wut.
Even the best reference ARM core can at best be called a bad Zen3.
There are other makers betting on ARM CPUs
wut
soon x86 may lose its supremacy in the market.
how?
How long until Intel and AMD start making ARM CPUs to not lose market share?
You're a fuckton of years late, since Seattle 'shipped' like 5 years ago.
And when that happens, how long will it take for x86 to become niche?
major hopium moment
 
Probably both but more of the latter? The 6700 XT clocks ~12% higher than the 6900 XT on average. The VRAM bandwidth-per-WGP and LLC-amount-per-WGP (and probably the LLC bandwidth too) are all 50% higher on Navi 22 vs. Navi 21.
OTOH, it doesn't look like Navi 22 is losing all that much from halving the number of Shader Engines, which might be an indicator of why Navi 3x is reducing the SEs in general (or increasing the WGPs per SE).
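A quick back-of-the-envelope check of those per-WGP ratios, using the public reference specs for the 6900 XT and 6700 XT (just my own arithmetic, not anything from AMD):

```python
# Per-WGP ratio check using the reference 6900 XT / 6700 XT specs.
chips = {
    # name: (WGPs, VRAM bandwidth in GB/s, Infinity Cache in MB)
    "Navi 21 (6900 XT)": (40, 512, 128),
    "Navi 22 (6700 XT)": (20, 384, 96),
}

for name, (wgps, bw_gbs, llc_mb) in chips.items():
    print(f"{name}: {bw_gbs / wgps:.1f} GB/s per WGP, {llc_mb / wgps:.1f} MB LLC per WGP")

# Navi 21: 12.8 GB/s per WGP, 3.2 MB LLC per WGP
# Navi 22: 19.2 GB/s per WGP, 4.8 MB LLC per WGP  -> both ratios are 50% higher on Navi 22
```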

Yeah, that's expected though, given that the rasterizer and triangle setup aren't the bottleneck the vast majority of the time. Navi 22 is still doing 2 tris per clock at a very high clock speed, which is plenty of geometry crunching power. It's a mystery why Nvidia continues to scale up their tessellation and triangle throughput. GA102 has 42 primitive setup units and 7 rasterizers while Navi 21 does just fine with only 4 of each.
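Rough numbers behind the "plenty of geometry crunching power" point, using approximate reference boost clocks (my arithmetic, not a quoted spec):

```python
# Peak setup rate = primitives per clock * clock speed (boost clocks approximate).
def tri_rate_gtris(prims_per_clk, clock_ghz):
    return prims_per_clk * clock_ghz  # billions of triangles per second

print(f"Navi 22: 2 prim/clk * ~2.58 GHz = {tri_rate_gtris(2, 2.58):.1f} Gtri/s")
print(f"Navi 21: 4 prim/clk * ~2.25 GHz = {tri_rate_gtris(4, 2.25):.1f} Gtri/s")
# Even the halved front end on Navi 22 can set up ~5 billion triangles per second,
# far beyond what games actually submit per second at playable frame rates.
```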
 
I was contrasting a workgroup of 64 pixels (work items), which is allocated to a single SIMD, with a non-pixel-shader workgroup of 64 work items, which will be allocated across both SIMDs. I was pointing out that the pixel shader allocation to SIMDs is unusual, and that the LDS appears to these two SIMDs in a CU the same way it does in GCN (where it's shared by four SIMDs). Half of the LDS (one array out of the two) is dedicated to a CU in CU mode.

Ok, I got you. Why limit the conversation to just pixel shaders though? AMD's description of 64-item "wavefronts" appears to apply to compute workloads as well.

But note that a workgroup of 128 work items bound to an RDNA CU necessarily results in 2 work items sharing a SIMD lane, with increasing multiples for larger workgroups up to size 1024. So the pixel shader configured as wave64 appears to be a special case of workgroup. As a special case it's designed explicitly to gain from VGPR, LDS and TMU locality for pixels, whose quad work-item layout and use of quad-based derivatives is a special case of locality.
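To make the lane-sharing arithmetic concrete, a tiny sketch of my own, assuming an RDNA CU with two SIMD32s:

```python
LANES_PER_SIMD = 32
SIMDS_PER_CU = 2  # an RDNA CU pairs two SIMD32s

for workgroup in (64, 128, 256, 512, 1024):
    waves32 = workgroup // LANES_PER_SIMD
    items_per_lane = workgroup // (SIMDS_PER_CU * LANES_PER_SIMD)
    print(f"workgroup {workgroup:4d}: {waves32:2d} wave32s, "
          f"{items_per_lane:2d} work item(s) per SIMD lane on one CU")

# workgroup  128:  4 wave32s,  2 work item(s) per SIMD lane on one CU
# workgroup 1024: 32 wave32s, 16 work item(s) per SIMD lane on one CU
```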

Locality doesn't seem to be a major motivating factor to keep it on the same SIMD, as you get much of the same benefit as long as you're on the same CU. AMD's whitepaper only has this to say on the matter: "While the RDNA architecture is optimized for wave32, the existing wave64 mode can be more effective for some applications." They don't mention which applications benefit from wave64.

This then leads me to believe that "wave64" for pixel shading is a hack that ties together two hardware threads on a SIMD. Not only is there a clue in the gotcha that I described earlier, but one mode of operation (alternating halves per instruction) is just like two hardware threads that issue on alternating cycles - which is a feature of RDNA. You can see the problem AMD has introduced with RDNA: two different wavefront sizes. Are they both hardware threads? Or is wave64 an emulated hardware thread, simulated by running two hardware threads?
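As an illustration of the "alternating halves per instruction" mode being described, here's my own toy model of the idea (not AMD documentation): each wave64 instruction is issued twice on a 32-wide SIMD, once per 32-lane half.

```python
WAVE_SIZE, SIMD_WIDTH = 64, 32

def issue_wave64(program, regs):
    """regs: dict of register name -> list of 64 per-lane values."""
    for dst, op, a, b in program:                    # one logical wave64 instruction...
        for half in range(WAVE_SIZE // SIMD_WIDTH):  # ...issued once per 32-lane half
            lo = half * SIMD_WIDTH
            for lane in range(lo, lo + SIMD_WIDTH):  # the 32-wide SIMD handles one half
                regs[dst][lane] = op(regs[a][lane], regs[b][lane])

# v2 = v0 + v1 across all 64 lanes, executed as two back-to-back 32-lane issues.
regs = {name: list(range(64)) for name in ("v0", "v1", "v2")}
issue_wave64([("v2", lambda x, y: x + y, "v0", "v1")], regs)
assert regs["v2"][63] == 126
```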

Is it important that it's one hardware thread vs two? I guess my only point is that from a software perspective it doesn't matter.

All this came up because I'm trying to work out how RDNA 3, with no CUs inside a WGP, configures TMUs, RAs and LDS. Bearing in mind that such a "WGP" seems as if it would actually be a "compute unit", and a compute unit generally has a single TMU and a single LDS, my question was whether a CU with 8 SIMDs can have adequate performance with a single TMU and a single LDS.

Yep, that's the million dollar question. Does a big fat WGP scale up everything or just the SIMDs and cache.
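The scaling concern in numbers. The GCN and RDNA 1/2 rows below use the documented configurations; the RDNA 3 rows are purely hypothetical layouts for the speculated 8-SIMD WGP, not anything AMD has confirmed:

```python
configs = {
    # name: (SIMDs, lanes per SIMD, TMU blocks, LDS arrays)
    "GCN CU":                     (4, 16, 1, 1),
    "RDNA 1/2 WGP":               (4, 32, 2, 2),
    "RDNA 3 WGP, 1 TMU (hypo)":   (8, 32, 1, 1),
    "RDNA 3 WGP, 4 TMUs (hypo)":  (8, 32, 4, 2),
}

for name, (simds, lanes, tmus, lds_arrays) in configs.items():
    total = simds * lanes
    print(f"{name:27s}: {total:3d} lanes, {total // tmus:3d} lanes per TMU block, "
          f"{total // lds_arrays:3d} lanes per LDS array")

# A single TMU/LDS serving eight SIMD32s would carry 4x the lanes of a GCN CU's TMU,
# which is exactly the adequacy question being asked.
```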
 
One major caveat of this 2.5X~3X raster scaling is that 1440p and lower are going to be irrelevant; our current CPUs are not strong enough to keep up. A 3090 is already CPU limited at 1440p, and 4K is going to be CPU limited in a significant way as well.
 
One major caveat of this 2.5X~3X raster scaling is that 1440p and lower are going to be irrelevant; our current CPUs are not strong enough to keep up. A 3090 is already CPU limited at 1440p, and 4K is going to be CPU limited in a significant way as well.

On last-gen games, sure. Current-gen console games targeting upscaled 4K at 30fps should take advantage of the extra horsepower on PCs shooting for native 4K at 120fps+.
 
Yeah, that's expected though, given that the rasterizer and triangle setup aren't the bottleneck the vast majority of the time. Navi 22 is still doing 2 tris per clock at a very high clock speed, which is plenty of geometry crunching power. It's a mystery why Nvidia continues to scale up their tessellation and triangle throughput. GA102 has 42 primitive setup units and 7 rasterizers while Navi 21 does just fine with only 4 of each.
Perhaps the die space/power requirements are small enough that they don’t deem it worthwhile to change the SM structure yet.
 
Can anybody answer my question about the imbalance between rasterizers and scan converters?

My second question is that the front end is also a black box for me. In the driver you always find the hint that there are 4 rasterizers but 8 scan converters. So the scan converter is the main part that turns polygons into pixels. So when 1 polygon comes from a rasterizer but there are 2 scan converters, is 1 scan converter running empty?
 
Can anybody answer my question about the imbalance between rasterizers and scan converters?

My second question is that the front end is also a black box for me. In the driver you always find the hint that there are 4 rasterizers but 8 scan converters. So the scan converter is the main part that turns polygons into pixels. So when 1 polygon comes from a rasterizer but there are 2 scan converters, is 1 scan converter running empty?

It would run empty on a polygon covering fewer pixels than one converter can generate.
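A toy way to see that; the 16 pixels/clock per scan converter below is a made-up illustrative figure, not a confirmed spec, and only the ceiling-division logic matters:

```python
PIXELS_PER_CONVERTER_PER_CLK = 16   # hypothetical figure, for illustration only
CONVERTERS_PER_RASTERIZER = 2

def busy_converters(pixels_covered):
    """Scan converters that actually get work from one small triangle in a clock."""
    needed = -(-pixels_covered // PIXELS_PER_CONVERTER_PER_CLK)  # ceiling division
    return min(needed, CONVERTERS_PER_RASTERIZER)

for px in (4, 16, 24, 48):
    print(f"triangle covering {px:2d} px -> {busy_converters(px)}/{CONVERTERS_PER_RASTERIZER} converters busy")

# Any triangle covering 16 pixels or fewer keeps only one of the two converters busy.
```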
 
One major caveat of this 2.5X~3X raster scaling is that 1440p and lower are going to be irrelevant; our current CPUs are not strong enough to keep up. A 3090 is already CPU limited at 1440p, and 4K is going to be CPU limited in a significant way as well.
Just in time for Zen4 V-cache CPUs!
 
I think it is possible to check this now with N21 XTXH SKUs, which apparently have a memory clock limit of 2450 MHz instead of 2150 MHz (although it seems that either the memory chips themselves or the IMC can't do much more than 2170 MHz or so).
All the air-cooled 6900XTs have 2000MHz memory. Some (all?) of the liquid cooled cards have 2250MHz, according to this list:

AMD Radeon RX 6900 XT Specs | TechPowerUp GPU Database

I admit my card (XFX USA, xfxforce.com) isn't on that list, so TechPowerUp's list is not comprehensive. I think you've misunderstood the memory clocks.

I can't find an AIB specification page for a liquid cooled card that shows memory faster than 2000MHz (ASUS and Sapphire just say 16Gbps).
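For reference, here's how those listed memory clocks translate to bandwidth on Navi 21's 256-bit bus (assuming GDDR6's usual 8x data rate relative to the listed clock; my arithmetic, not a vendor table):

```python
def gddr6_bandwidth_gbs(mem_clock_mhz, bus_bits=256):
    data_rate_gbps = mem_clock_mhz * 8 / 1000   # per-pin data rate in Gbps
    return data_rate_gbps * bus_bits / 8        # total bandwidth in GB/s

for clk in (2000, 2150, 2250, 2450):
    print(f"{clk} MHz -> {clk * 8 / 1000:.1f} Gbps/pin -> {gddr6_bandwidth_gbs(clk):.0f} GB/s")

# 2000 MHz -> 16 Gbps -> 512 GB/s, 2250 MHz -> 18 Gbps -> 576 GB/s, and so on.
```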

The default assumption is that an RDNA 3 WGP is at least as fast as 2x RDNA 2 WGPs in flops dependent workloads. How games will scale though is a different matter. Ampere doubled flops and L1 bandwidth per SM but that didn’t result in 2x gaming performance. The 46SM 3070 is only 30% faster than the 46SM 2080.
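The Ampere comparison in paper numbers (approximate boost clocks; just my arithmetic to put the 2x-flops-for-1.3x-games point in figures):

```python
def tflops(sms, fp32_per_sm, boost_ghz):
    return sms * fp32_per_sm * 2 * boost_ghz / 1000  # 2 ops per FMA

print(f"RTX 2080 (46 SM,  64 FP32/SM, ~1.71 GHz): {tflops(46, 64, 1.71):.1f} TFLOPS")
print(f"RTX 3070 (46 SM, 128 FP32/SM, ~1.73 GHz): {tflops(46, 128, 1.73):.1f} TFLOPS")

# Roughly 10 vs 20 TFLOPS on paper for a ~1.3x gap in games: paper flops alone
# don't predict how a doubled WGP would scale either.
```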

RDNA 2 scaled very well with clock speed vs RDNA 1. Comparing the similar 40 CU configs of the 6700 XT and 5700 XT, there was a 35% improvement on paper due to higher clocks, and actual results in games were pretty close to that number. This is a great result, especially considering the lower off-chip bandwidth on the 6700 XT. Scaling up RDNA 2 didn't quite hit the same mark: comparing the 40 CU 6700 XT and 80 CU 6900 XT, there was a 75% improvement on paper but only 50% in actual gaming numbers. This leads me to believe the 6700 XT is benefiting from higher clocks on its fixed-function hardware, or the 6900 XT is hitting a bandwidth wall. As mentioned earlier in the thread, it's going to be interesting to see how AMD feeds such a beast.
One way to read it: 6900XT has 33% more bandwidth and 33% more power than 6700XT and is ~50% faster...
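The same kind of paper math behind those percentages (reference boost clocks, which are approximate):

```python
def paper_ratio(cus_a, clk_a_ghz, cus_b, clk_b_ghz):
    return (cus_a * clk_a_ghz) / (cus_b * clk_b_ghz)

print(f"6700 XT (40 CU, ~2.58 GHz) vs 5700 XT (40 CU, ~1.91 GHz): "
      f"{paper_ratio(40, 2.58, 40, 1.91):.2f}x on paper")   # ~1.35x, games land close to that
print(f"6900 XT (80 CU, ~2.25 GHz) vs 6700 XT (40 CU, ~2.58 GHz): "
      f"{paper_ratio(80, 2.25, 40, 2.58):.2f}x on paper")   # ~1.74x on paper, ~1.5x in games
print(f"6900 XT vs 6700 XT bandwidth: {512 / 384:.2f}x")     # ~1.33x, one suspect for the gap
```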

Navi 23's crazy ROP configuration might up-end some of these scaling comparisons. Maybe some reviewer will kindly lock GPU clocks on Navi 23, 22 and 21 and we can compare.

Well, RDNA 2 is looking like an "intermediate" architecture (a bit like Cayman; see TeraScale (microarchitecture) - Wikipedia), so maybe it just isn't worth the effort to project from RDNA 2 to 3.

I don't expect anyone to stand still.
I'm fully expecting Nvidia to use their richer developer influence to push for #2 as hard as they can, because that's where they have an architectural advantage, and for AMD to focus on "console-multipliers" in the expectation that #1 is widely adopted.
#1 is GPU-specific, so it's not coming to PC.

One major caveat of this 2.5X~3X raster scaling is that 1440p and lower are going to be irrelevant; our current CPUs are not strong enough to keep up. A 3090 is already CPU limited at 1440p, and 4K is going to be CPU limited in a significant way as well.
Once upon a time 1070/Vega 56 was "enough" for 1440p gaming.
 
Raytracing is easily scalable; PCs won't have any issues applying all the performance they have. And once AMD gets their RT up to speed, they'll begin to promote its heavier usage in PC versions of multiplatform titles too. So I wouldn't worry about an overabundance of GPU power, really.
 
All the air-cooled 6900XTs have 2000MHz memory. Some (all?) of the liquid cooled cards have 2250MHz, according to this list:
I meant the OC limit in Wattman, not the stock memory clocks. Of course, it's possible to check how the GPU scales in supposedly memory-bound scenarios by simply decreasing the core clocks, but that might be confounded by things like some parts of the chip not really scaling with clocks and so on. That's why I thought a memory OC is the best way to check it, although the IC and some form of built-in error correction (as far as I understood) make this hard to gauge.
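A rough way to frame why the memory OC is the cleaner experiment; the OC and downclock values below are just illustrative, and only the ratios matter:

```python
BASE_BW_GBS, BASE_CORE_GHZ = 512, 2.25   # reference 6900 XT figures

def bw_per_ghz(bw_gbs, core_ghz):
    return bw_gbs / core_ghz

print(f"stock:              {bw_per_ghz(BASE_BW_GBS, BASE_CORE_GHZ):.0f} GB/s per core GHz")
print(f"memory OC to 2150:  {bw_per_ghz(512 * 2150 / 2000, BASE_CORE_GHZ):.0f} GB/s per core GHz")
print(f"core clock -10%:    {bw_per_ghz(BASE_BW_GBS, BASE_CORE_GHZ * 0.9):.0f} GB/s per core GHz")

# Both knobs shift bandwidth-per-compute in the same direction, but the core downclock
# also slows the front end, ROPs and caches, which is the confounding factor mentioned
# above; the memory OC only moves the DRAM side (modulo the IC hit rate and error
# correction making the gains hard to read).
```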
 
#1 is GPU-specific, so it's not coming to PC.
Just thought about this right before.
AMD could patch custom traversal shaders into specific per-game driver updates.
Not what I want, and not sure how much sense it makes, but there are options in theory.
 
One major caveat of this 2.5X~3X raster scaling is that 1440p and lower are going to be irrelevant; our current CPUs are not strong enough to keep up. A 3090 is already CPU limited at 1440p, and 4K is going to be CPU limited in a significant way as well.
You mean higher-res gfx needs more CPU power, up to the point where the CPU throttles?
Which gfx CPU workloads depend on resolution? Animation stays the same, and culling (if still on the CPU) should have no big effect either?
 
Just thought about this right before.
AMD could patch custom traversal shaders into specific per-game driver updates.
Not what I want, and not sure how much sense it makes, but there are options in theory.

Do they have the resources to pull that off? I imagine it takes time, tests, etc...
 