AMD: RDNA 3 Speculation, Rumours and Discussion

Status
Not open for further replies.
You can always check their quarterly results.

Yes, that's always the argument for getting a midrange GPU over an actually futureproof higher-end one.
We all literally just went over explaining how this isn't the case, with a direct example from AMD themselves.
 
Unless it was a slip-up, Su said in the Ryzen 7000 launch show that RDNA3 uses "5 nanometer chiplets", which contradicts most of the recent rumors (1x 5nm GCD + 6x 6nm MCD).
edit: in theory it could just mean there's more than one RDNA3 N5 chiplet, just not in the same GPU
Cache and PHYs are a really bad use of 5nm chiplets, that's for sure. I can't help thinking the "leakers" have gone soft, realising they know nothing, hence the deafening silence.
 
Cache and PHYs are a really bad use of 5nm chiplets, that's for sure. I can't help thinking the "leakers" have gone soft, realising they know nothing, hence the deafening silence.
Those patch notes mentioning 6 compute chiplets and 2 memory ones could prove right.

32x SIMD-32 per compute chiplet, a 192-bit bus per memory chiplet. It sounds right.
 
Those patch notes mentioning 6 compute chiplets and 2 memory ones could prove right.

32x SIMD-32 per compute chiplet, a 192-bit bus per memory chiplet. It sounds right.
I can't remember those patch notes, to be honest.

It would need to be 64x SIMD-32 per compute chiplet to get to 12288 ALU lanes, with 8x WGPs per chiplet and 8x SIMDs per WGP.
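That arithmetic can be checked quickly; a worked version, where every count is the rumoured configuration from this thread rather than anything confirmed:

```python
# Hypothetical RDNA 3 lane count, per the configuration discussed above.
# None of these figures are confirmed; they are the rumoured numbers only.
chiplets = 6
wgps_per_chiplet = 8
simds_per_wgp = 8
lanes_per_simd = 32  # SIMD-32

simds_per_chiplet = wgps_per_chiplet * simds_per_wgp    # 64x SIMD-32
lanes_per_chiplet = simds_per_chiplet * lanes_per_simd  # 2048 lanes
total_lanes = chiplets * lanes_per_chiplet

print(simds_per_chiplet, lanes_per_chiplet, total_lanes)  # 64 2048 12288
```

So 32x SIMD-32 per chiplet would only get to 6144 lanes; 64x is needed for the rumoured 12288.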

With the rumour that an RDNA 3 WGP is the same size on 7nm as an RDNA 2 WGP despite having twice the ALU lanes, and scaling to about 60% of that size on the 5nm node, that would be around 20mm² for the WGPs. Add to that the fine rasteriser, ROPs and L2 to make up a complete shader engine and I suppose we get to around 30mm² for the compute chiplet...

For comparison, a shader engine in Navi 22 is about 54mm²; on 5nm that would be about 32mm².

Centralised control, coarse rasterisation, PCI-Express and other non-GDDR PHYs need to go somewhere, preferably a central 6nm chiplet in my opinion. That would be 120mm² I suppose.

Two memory/cache chiplets would each be around 100mm² with 192-bit GDDR6 and 96MB of L3 cache.
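Putting those rough numbers together, assuming a ~0.6x area scaling factor from 7nm to 5nm (every figure here is an estimate from this post, not a known spec):

```python
# Rough area bookkeeping for the hypothetical chiplet layout above.
# All numbers are speculative estimates, not confirmed die sizes.
N5_SCALING = 0.6  # assumed 7nm -> 5nm area scaling factor

navi22_shader_engine_mm2 = 54
print(round(navi22_shader_engine_mm2 * N5_SCALING, 1))  # ~32.4 mm² on 5nm

compute_chiplet_mm2 = 30   # 5nm: WGPs + fine raster + ROPs + L2
central_chiplet_mm2 = 120  # 6nm: control, coarse raster, PCIe and other PHYs
memory_chiplet_mm2 = 100   # 6nm: 192-bit GDDR6 PHY + 96MB L3, each

total = 6 * compute_chiplet_mm2 + central_chiplet_mm2 + 2 * memory_chiplet_mm2
print(total)  # 500 mm² of silicon across 9 chiplets
```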

The six compute chiplets and two memory/cache chiplets would then be arranged around the periphery of the centralised chiplet...

I can't help thinking that ~30mm² chiplets are too small; that's roughly 5x6mm. Fiddly lickle things whose cost to sort, package and integrate into a GPU assembly seems much worse than with Ryzen or Epyc, whose smallest chiplets have been about 74mm².
 
Unless it was a slip-up, Su said in the Ryzen 7000 launch show that RDNA3 uses "5 nanometer chiplets", which contradicts most of the recent rumors (1x 5nm GCD + 6x 6nm MCD).
edit: in theory it could just mean there's more than one RDNA3 N5 chiplet, just not in the same GPU
I don't think it's a slip-up. I think she just means they consider the GCD to be a chiplet as well. Basically, all the dies are 'chiplets'.
 
OREO (Opaque Random Export Order) sounds interesting: essentially it replaces the re-order buffer (ROB) with a smaller skid buffer, allowing work to be received and executed in any order before being exported to the next stage in-order.
So I think OREO is required to support distributed vertex shading combined with coarse rasterisation.

My theory:

Vertices are distributed by a central scheduler, in groups of hardware threads, to any WGP that's available. Using a cut-down vertex shader, which only exports position, the resulting triangles are then coarse-rasterised. Only after this has been done and the screen-space tiles covered by a triangle have been identified, is the full vertex shader evaluated for each triangle's vertices (to generate all relevant attributes).

To perform the full evaluation of the vertex shader, each triangle is sent to the shader engine that owns the screen space tile touched by the triangle. So the shader engine has to construct hardware threads for the vertices received and assign them to WGPs.

If a triangle touches more than one screen space tile then each shader engine will separately evaluate the full vertex shader, for the triangle's vertices.

Once each shader engine has evaluated the full vertex shader, the triangles can be finally assembled and fine-grain rasterised.
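As a toy illustration of the routing described above, here's a sketch mapping a coarse-rasterised triangle to the shader engines that own its screen-space tiles. The tile size and the interleaved ownership function are invented purely for illustration:

```python
# Toy routing of coarse-rasterised triangles to shader engines by
# screen-space tile ownership. Tile size and the ownership pattern
# are illustrative assumptions, not anything from AMD.
TILE = 32          # pixels per tile edge (assumed)
NUM_ENGINES = 6    # one shader engine per compute chiplet

def owning_engine(tx, ty):
    # Assumed interleaving so adjacent tiles map to different engines.
    return (tx + ty * 3) % NUM_ENGINES

def engines_for_triangle(verts):
    # verts: [(x, y), ...] screen positions from the position-only pass.
    xs = [x for x, _ in verts]
    ys = [y for _, y in verts]
    engines = set()
    # Walk the tiles under the triangle's bounding box (a real coarse
    # rasteriser would test tile/triangle overlap, not just the box).
    for ty in range(int(min(ys)) // TILE, int(max(ys)) // TILE + 1):
        for tx in range(int(min(xs)) // TILE, int(max(xs)) // TILE + 1):
            engines.add(owning_engine(tx, ty))
    return engines  # each listed engine runs the full vertex shader

# A triangle inside one tile goes to a single engine...
print(engines_for_triangle([(2, 2), (20, 4), (10, 25)]))
# ...while a large triangle is duplicated across several engines.
print(engines_for_triangle([(10, 10), (300, 20), (150, 200)]))
```

The second case shows the duplication cost: every engine whose tiles the triangle touches re-runs the full vertex shader for its vertices.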

As a result of the varying workloads of shader engines, fully-assembled triangles will be pixel shaded in an ordering that no longer corresponds with developer intent. This is because adjacent or overlapping triangles will have originally been position-only shaded by any shader engine, and only arrive at the final shader engine for pixel shading after a journey that takes an indeterminate amount of time relative to other relevant triangles.

I believe this is the problem OREO solves: it allows the GPU to pixel shade triangles in an arbitrary order while the result in the render target (and depth buffer) still agrees with developer intent.
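A minimal sketch of that idea, assuming a simple release-in-submission-order skid buffer (the real hardware mechanism is certainly more involved): work items complete in arbitrary order, but the buffer only exports a result once everything submitted before it has also been exported.

```python
# Minimal model of in-order export from out-of-order completion.
# Wave IDs are assigned in submission (API) order; completion order
# is arbitrary. This is an illustrative sketch, not AMD's design.
class SkidBuffer:
    def __init__(self):
        self.next_to_export = 0   # next wave ID allowed to export
        self.held = {}            # completed-but-held results, by wave ID
        self.exported = []

    def complete(self, wave_id, result):
        # Results arrive whenever a shader engine happens to finish.
        self.held[wave_id] = result
        # Drain everything now contiguous with the export pointer.
        while self.next_to_export in self.held:
            self.exported.append(self.held.pop(self.next_to_export))
            self.next_to_export += 1

buf = SkidBuffer()
for wave_id, result in [(2, "c"), (0, "a"), (3, "d"), (1, "b")]:
    buf.complete(wave_id, result)
print(buf.exported)  # ['a', 'b', 'c', 'd'] -- submission order restored
```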

All of this rests upon "next gen geometry" ("primitive shaders") which is something that has been confirmed for RDNA 3: the DirectX/OpenGL vertex processing pipeline is no longer executed in the set of shaders separated by fixed-function hardware that we've known for decades.

Naturally, this makes tessellation and geometry shading more complex, as both of these techniques generate vertices as output from shaders. AMD has solved that problem.

In theory, distributed final vertex shading takes us back to the old problem of multi-GPU rendering (alternate line, split frame, or screen-space tiled rendering): the vertex shader has to be run by multiple shader engines for some vertices, so there is an overhead to distributed final vertex shading when triangles span screen space tiles.

Once you've got a combination of:
  • next gen geometry shading
  • vertex-position-only shading
  • coarse grained rasterisation
  • multiple shader engines each aligned to an exclusive set of screen space tiles
  • final vertex shading
  • fine-grained rasterisation
  • opaque random export order
You then have, in my opinion, all the ingredients required to support a GPU that consists of multiple compute chiplets, each functioning as a shader engine, each aligned with a set of screen space tiles.
 
Bigger L0 Registers and L1 caches at least
I was thinking a larger L2, because of the supposedly smaller RDNA3 WGP vs RDNA2; L2 is outside the WGPs from memory (1MB per SE?), alongside a smaller or same-size L3. Having larger L0/L1 and still reducing WGP size would be impressive.
 
Design is great but it can't compete?
He didn't specify against which SKU. Probably meant full Ada102, and that is not so surprising.
N31 has only 20% more WGPs than N21. It has 2x the shaders per WGP, true, but who knows how much faster an RDNA3 WGP will be versus an RDNA2 WGP; certainly not twice as fast.
On the other hand, the 4090 Ti supposedly has 142 SMs, 69% more than the 3090 Ti; the clock speed increase from Ampere to Ada should also be higher than from RDNA2 to RDNA3, and separating the INT32 from the FP32 units should increase performance further.
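For what it's worth, that unit-count comparison as arithmetic (all counts are the rumoured figures discussed above, none confirmed):

```python
# Rumoured unit counts from the discussion above; none are confirmed.
n21_wgps, n31_wgps = 40, 48       # Navi 21 vs rumoured Navi 31
ga102_sms, ad102_sms = 84, 142    # 3090 Ti vs the supposed full Ada

print(f"N31 WGPs: {n31_wgps / n21_wgps - 1:+.0%}")   # +20%
print(f"Ada SMs:  {ad102_sms / ga102_sms - 1:+.0%}") # +69%
```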
 