AMD: RDNA 3 Speculation, Rumours and Discussion

Status
Not open for further replies.
You can always check their quarterly results.

Yes, that's always the argument for getting a midrange GPU over an actually futureproof higher-end one.
We all literally just went over explaining how this isn't the case, with a direct example from AMD themselves.
 
Unless it was a slip-up, Su said in the Ryzen 7000 launch show that RDNA3 uses "5 nanometer chiplets", which contradicts most of the recent rumors (1x 5nm GCD + 6x 6nm MCD).
edit: in theory it could just mean there's more than one RDNA3 N5 chiplet, just not in the same GPU
Cache and PHYs are a really bad use of 5nm chiplets, that's for sure. I can't help thinking the "leakers" have gone soft, realising they know nothing, hence the deafening silence.
 
Cache and PHYs are a really bad use of 5nm chiplets, that's for sure. I can't help thinking the "leakers" have gone soft, realising they know nothing, hence the deafening silence.
Those patch notes mentioning 6 compute chiplets and 2 memory ones could prove right.

32x SIMD-32 per compute chiplet, a 192-bit bus per memory chiplet. It sounds right.
 
Those patch notes mentioning 6 compute chiplets and 2 memory ones could prove right.

32x SIMD-32 per compute chiplet, a 192-bit bus per memory chiplet. It sounds right.
I can't remember those patch notes, to be honest.

It would need to be 64x SIMD-32 per compute chiplet to get to 12288 ALU lanes, with 8x WGPs per chiplet and 8x SIMDs per WGP.
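That arithmetic can be checked quickly; a worked version, where every count is the rumoured configuration from this thread rather than anything confirmed:

```python
# Hypothetical RDNA 3 lane count, per the configuration discussed above.
# None of these figures are confirmed; they are the rumoured numbers only.
chiplets = 6
wgps_per_chiplet = 8
simds_per_wgp = 8
lanes_per_simd = 32  # SIMD-32

simds_per_chiplet = wgps_per_chiplet * simds_per_wgp    # 64x SIMD-32
lanes_per_chiplet = simds_per_chiplet * lanes_per_simd  # 2048 lanes
total_lanes = chiplets * lanes_per_chiplet

print(simds_per_chiplet, lanes_per_chiplet, total_lanes)  # 64 2048 12288
```

So 32x SIMD-32 per chiplet would only get to 6144 lanes; 64x is needed for the rumoured 12288.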

With the rumour that an RDNA 3 WGP is the same size on 7nm as an RDNA 2 WGP despite having twice the ALU lanes, and scaling to about 60% of that size on the 5nm node, that would be around 20mm² for the WGPs. Add to that the fine rasteriser, ROPs and L2 to make up a complete shader engine and I suppose we get to around 30mm² for the compute chiplet...

For comparison, a shader engine in Navi 22 is about 54mm²; on 5nm that would be about 32mm².

Centralised control, coarse rasterisation, PCI-Express and other non-GDDR PHYs need to go somewhere, preferably a central 6nm chiplet in my opinion. That would be 120mm² I suppose.

Two memory/cache chiplets would each be around 100mm² with 192-bit GDDR6 and 96MB of L3 cache.
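Putting those rough numbers together, assuming a ~0.6x area scaling factor from 7nm to 5nm (every figure here is an estimate from this post, not a known spec):

```python
# Rough area bookkeeping for the hypothetical chiplet layout above.
# All numbers are speculative estimates, not confirmed die sizes.
N5_SCALING = 0.6  # assumed 7nm -> 5nm area scaling factor

navi22_shader_engine_mm2 = 54
print(round(navi22_shader_engine_mm2 * N5_SCALING, 1))  # ~32.4 mm² on 5nm

compute_chiplet_mm2 = 30   # 5nm: WGPs + fine raster + ROPs + L2
central_chiplet_mm2 = 120  # 6nm: control, coarse raster, PCIe and other PHYs
memory_chiplet_mm2 = 100   # 6nm: 192-bit GDDR6 PHY + 96MB L3, each

total = 6 * compute_chiplet_mm2 + central_chiplet_mm2 + 2 * memory_chiplet_mm2
print(total)  # 500 mm² of silicon across 9 chiplets
```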

The six compute chiplets and two memory/cache chiplets would then be arranged around the periphery of the centralised chiplet...

I can't help thinking that ~30mm² chiplets are too small; that's roughly 5x6mm. Fiddly lickle things whose cost to sort, package and integrate into a GPU assembly seems much worse than with Ryzen or Epyc, whose smallest chiplets have been about 74mm².
 
Unless it was a slip-up, Su said in the Ryzen 7000 launch show that RDNA3 uses "5 nanometer chiplets", which contradicts most of the recent rumors (1x 5nm GCD + 6x 6nm MCD).
edit: in theory it could just mean there's more than one RDNA3 N5 chiplet, just not in the same GPU
I don't think it's a slip-up. I think she just means they consider the GCD to be a chiplet as well. Basically, all the dies are 'chiplets'.
 
OREO (Opaque Random Export Order) sounds interesting: essentially it replaces the re-order buffer (ROB) with a smaller skid buffer, allowing work to be received and executed in any order before being exported to the next stage in-order.
So I think OREO is required to support distributed vertex shading combined with coarse rasterisation.

My theory:

Vertices are distributed by a central scheduler, in groups of hardware threads, to any WGP that's available. Using a cut-down vertex shader, which only exports position, the resulting triangles are then coarse-rasterised. Only after this has been done and the screen-space tiles covered by a triangle have been identified, is the full vertex shader evaluated for each triangle's vertices (to generate all relevant attributes).

To perform the full evaluation of the vertex shader, each triangle is sent to the shader engine that owns the screen space tile touched by the triangle. So the shader engine has to construct hardware threads for the vertices received and assign them to WGPs.

If a triangle touches more than one screen space tile then each shader engine will separately evaluate the full vertex shader, for the triangle's vertices.

Once each shader engine has evaluated the full vertex shader, the triangles can be finally assembled and fine-grain rasterised.
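As a toy illustration of the routing described above, here's a sketch mapping a coarse-rasterised triangle to the shader engines that own its screen-space tiles. The tile size and the interleaved ownership function are invented purely for illustration:

```python
# Toy routing of coarse-rasterised triangles to shader engines by
# screen-space tile ownership. Tile size and the ownership pattern
# are illustrative assumptions, not anything from AMD.
TILE = 32          # pixels per tile edge (assumed)
NUM_ENGINES = 6    # one shader engine per compute chiplet

def owning_engine(tx, ty):
    # Assumed interleaving so adjacent tiles map to different engines.
    return (tx + ty * 3) % NUM_ENGINES

def engines_for_triangle(verts):
    # verts: [(x, y), ...] screen positions from the position-only pass.
    xs = [x for x, _ in verts]
    ys = [y for _, y in verts]
    engines = set()
    # Walk the tiles under the triangle's bounding box (a real coarse
    # rasteriser would test tile/triangle overlap, not just the box).
    for ty in range(int(min(ys)) // TILE, int(max(ys)) // TILE + 1):
        for tx in range(int(min(xs)) // TILE, int(max(xs)) // TILE + 1):
            engines.add(owning_engine(tx, ty))
    return engines  # each listed engine runs the full vertex shader

# A triangle inside one tile goes to a single engine...
print(engines_for_triangle([(2, 2), (20, 4), (10, 25)]))
# ...while a large triangle is duplicated across several engines.
print(engines_for_triangle([(10, 10), (300, 20), (150, 200)]))
```

The second case shows the duplication cost: every engine whose tiles the triangle touches re-runs the full vertex shader for its vertices.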

As a result of the varying workloads of shader engines, fully-assembled triangles will be pixel shaded in an ordering that no longer corresponds with developer intent. This is because adjacent or overlapping triangles will have originally been position-only shaded by any shader engine, and only arrive at the final shader engine for pixel shading after a journey that takes an indeterminate amount of time relative to other relevant triangles.

I believe this is the problem OREO solves: it allows the GPU to pixel shade triangles in an arbitrary order while the result in the render target (and depth buffer) still agrees with developer intent.
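A minimal sketch of that idea, assuming a simple release-in-submission-order skid buffer (the real hardware mechanism is certainly more involved): work items complete in arbitrary order, but the buffer only exports a result once everything submitted before it has also been exported.

```python
# Minimal model of in-order export from out-of-order completion.
# Wave IDs are assigned in submission (API) order; completion order
# is arbitrary. This is an illustrative sketch, not AMD's design.
class SkidBuffer:
    def __init__(self):
        self.next_to_export = 0   # next wave ID allowed to export
        self.held = {}            # completed-but-held results, by wave ID
        self.exported = []

    def complete(self, wave_id, result):
        # Results arrive whenever a shader engine happens to finish.
        self.held[wave_id] = result
        # Drain everything now contiguous with the export pointer.
        while self.next_to_export in self.held:
            self.exported.append(self.held.pop(self.next_to_export))
            self.next_to_export += 1

buf = SkidBuffer()
for wave_id, result in [(2, "c"), (0, "a"), (3, "d"), (1, "b")]:
    buf.complete(wave_id, result)
print(buf.exported)  # ['a', 'b', 'c', 'd'] -- submission order restored
```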

All of this rests upon "next gen geometry" ("primitive shaders") which is something that has been confirmed for RDNA 3: the DirectX/OpenGL vertex processing pipeline is no longer executed in the set of shaders separated by fixed-function hardware that we've known for decades.

Naturally, this makes tessellation and geometry shading more complex, as both of these techniques generate vertices as output from shaders. AMD has solved that problem.

In theory, distributed final vertex shading takes us back to the old problem of multi-GPU rendering (alternate line, split frame, or screen-space tiled rendering): the vertex shader has to be run by multiple shader engines for some vertices, so there is an overhead to distributed final vertex shading when triangles span screen space tiles.

Once you've got a combination of:
  • next gen geometry shading
  • vertex-position-only shading
  • coarse grained rasterisation
  • multiple shader engines each aligned to an exclusive set of screen space tiles
  • final vertex shading
  • fine-grained rasterisation
  • opaque random export order
You then have, in my opinion, all the ingredients required to support a GPU that consists of multiple compute chiplets, each functioning as a shader engine, each aligned with a set of screen space tiles.
 
Bigger L0 Registers and L1 caches at least
I was thinking a larger L2, because of the supposedly smaller RDNA3 WGP vs RDNA2; L2 is outside the WGPs from memory (1MB per SE?), alongside a smaller or same-size L3. Having larger L0/L1 and still reducing WGP size would be impressive.
 
Design is great but it can't compete?
He didn't specify against which SKU. Probably meant full Ada102, and that is not so surprising.
N31 has only 20% more WGPs than N21. It has 2x the shaders per WGP, true, but who knows how much faster an RDNA3 WGP will be versus an RDNA2 WGP; certainly not twice as fast.
On the other hand, the 4090 Ti supposedly has 142 SMs, 69% more than the 3090 Ti; the clock speed increase from Ampere to Ada should also be higher than from RDNA2 to RDNA3, and separating the INT32 from the FP32 units should increase performance further.
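For what it's worth, that unit-count comparison as arithmetic (all counts are the rumoured figures discussed above, none confirmed):

```python
# Rumoured unit counts from the discussion above; none are confirmed.
n21_wgps, n31_wgps = 40, 48       # Navi 21 vs rumoured Navi 31
ga102_sms, ad102_sms = 84, 142    # 3090 Ti vs the supposed full Ada

print(f"N31 WGPs: {n31_wgps / n21_wgps - 1:+.0%}")   # +20%
print(f"Ada SMs:  {ad102_sms / ga102_sms - 1:+.0%}") # +69%
```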
 