AMD: RDNA 3 Speculation, Rumours and Discussion

Ampere hasn't muddled anything though. It's people who don't understand how it was different from Turing who muddled something.
There was a recent(ish) example of similarly muddling units that ended in a lawsuit, which was lost. Many had argued for years that it was just people not understanding the brilliance of Bulldozer sharing resources between cores... Though I cannot and will not attest to how much this is applicable to the situation with these SIMT lanes. :mrgreen:

Either way, I have made an edit to appreciate your dedication.
 
Ampere hasn't muddled anything though. It's people who don't understand how it was different from Turing who muddled something.

Yes people seem very confused by Turing and Ampere for some reason. For example:

The GA104 chip offers 6,144 FP32 ALUs of which half can also execute INT32 instructions (i.e. 3,072 INT32 ALUs). With Turing all shaders could still execute FP32 or INT32 instructions.


This is going to muddle the units for marketing a bit, pretty much like how some people perceive Ampere's double FP32 support. On paper each CU does gain 2x the "Stream Processors", now having 128 SPs. Depending on how AMD GPU marketing wants to draw the rectangles, it can either be "two dual-issue SIMD32s" or "four SIMD32s" inside each CU...

It’ll be the same number of peak FP32s either way so shouldn’t be as confusing as Ampere.
 
"[...] instruction scoreboard for the ALU pipeline that is implemented in software to reduce power consumption and area consumed by hardware in the GPU":


I'm guessing this is using the delay instructions that I listed from the code earlier, which use a format specifying two instruction IDs (InstId0 and InstId1) and skip codes:

Code:
    static const std::array<const char *, 12> InstIds = {
      "NO_DEP",        "VALU_DEP_1",    "VALU_DEP_2",
      "VALU_DEP_3",    "VALU_DEP_4",    "TRANS32_DEP_1",
      "TRANS32_DEP_2", "TRANS32_DEP_3", "FMA_ACCUM_CYCLE_1",
      "SALU_CYCLE_1",  "SALU_CYCLE_2",  "SALU_CYCLE_3"};
[...]
    static const std::array<const char *, 6> InstSkips = {
      "SAME", "NEXT", "SKIP_1", "SKIP_2", "SKIP_3", "SKIP_4"};

specifically those instruction IDs with "DEP" in their name.

The possible syntax of these instructions gets hairy because they want to "compress" the delay instructions (instruction cache space is valuable), so you get figure 3 with "Control Word 308":

  • DataDependency instID0=3, Skip=2, instID1=1
    in which instID0=3 declares that the instruction word that immediately follows the control word 308 (instruction-4 310) is dependent on the instruction that issued 3 cycles before instruction-4 310 (instruction-1 302), Skip=2 means that the next delay is not for the next instruction, but for the instruction after the next instruction (instruction-6 314), and instID1=1 declares that instruction-6 314 is dependent on the instruction that issued 1 cycle before instruction-6 314 (instruction-5 312). In response to receiving the control word 308, the ALU control module 216 adds delays before issuing each of instruction-4 310 and instruction-6 314.

which might read as:

Code:
delay VALU_DEP_3, SKIP_2, VALU_DEP_1

The text in the picture actually makes it easier to understand: "Delay instruction-4 until instruction-1 completes and delay instruction-6 until instruction-5 completes". So the "compression" works by using three operands: an implicit delay (from 1 to 4) for the immediately following instruction, coupled with a delay for a later instruction, apparently no more than four instructions later (SKIP_1 ... SKIP_4).
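
As a back-of-the-envelope illustration, those three operands fit in a handful of bits. Here's a minimal sketch in C++, assuming field widths of my own choosing (the patent doesn't give a layout), using the indices of the InstIds/InstSkips arrays quoted above:

Code:
#include <cstdint>

// Hypothetical packing of the three operands into one control word.
// Field widths are my guess: 4 bits cover the 12 instruction IDs and
// 3 bits cover the 6 skip codes. The patent gives no real layout.
struct DelayControlWord {
    uint16_t instId0 : 4; // index into InstIds, e.g. 3 = VALU_DEP_3
    uint16_t skip    : 3; // index into InstSkips, e.g. 3 = SKIP_2
    uint16_t instId1 : 4; // index into InstIds, e.g. 1 = VALU_DEP_1
};

// Encodes the example above: delay VALU_DEP_3, SKIP_2, VALU_DEP_1
constexpr DelayControlWord example{3, 3, 1};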

"SAME", I'm going to guess, defines a single instruction as dependent upon two prior instructions, e.g. instruction 4 is dependent upon instruction A and instruction B. I think this would specify that instruction A is VALU and instruction B is TRANS (or vice-versa). It wouldn't make sense to specify that two instructions from the same pipeline are antecedents for an instruction simply because the latter of those two instructions is sufficient to enforce the required delay.

Assuming that this delay instruction is always 3-operand, the "NO_DEP" instruction ID specifies that the second portion (the skip and the instruction ID) is "null", i.e. the second portion of the instruction can be ignored. So maybe that's how "NEXT" is used: it's just an explicit code word in a compression scheme that always has three code words, e.g.:

Code:
delay VALU_DEP_4, NEXT, NO_DEP

Still trying to understand why CYCLE is being used for some instruction IDs. Could it be due to the TRANS SIMD being multi-cycle, such that it can consume either SALU or FMA results with an issue-offset? I have a suspicion that RDNA (1 and 2) can only issue to TRANS on certain clock cycles, e.g. only on even-numbered clock cycles. So perhaps the issue limitations for RDNA 3 TRANS are more nuanced and offsets can be applied depending upon the dependency that applies to a TRANS instruction.

FMA_ACCUM_CYCLE_1 might be used solely when 64-bit FMA is being done? I assume that a dedicated 64-bit ALU (SIMD?) is part of each compute unit. But can TRANS operate on float64?

Bank conflicts for reads from LDS will still cause the hardware to encounter pipeline stalls that cannot be predicted, so the compiler can't handle situations where LDS addresses are not known in advance. So the GPU still needs hardware to monitor variable latencies and stall the relevant dependent pipeline.
 
FMA_ACCUM_CYCLE_1 might be used solely when 64-bit FMA is being done?
Might be a bit far fetched, but it reminds me of Bulldozer which supports -1 clk in latency with special bypass mode for consecutive FMA instructions (i.e., takes 5 clk instead of 6).
 
Might be a bit far fetched, but it reminds me of Bulldozer which supports -1 clk in latency with special bypass mode for consecutive FMA instructions (i.e., takes 5 clk instead of 6).
Where FMA instruction 1's result is consumed by FMA 2 as an operand, presumably as the addend?

I suppose you could chain like this:

Code:
fma r0, r1, r2, r3
delay FMA_ACCUM_CYCLE_1, SKIP_1, FMA_ACCUM_CYCLE_1
fma r0, r4, r5, r0
fma r0, r6, r7, r0
[etc]
 
Yeah. I would guess that “_dep_N” declares a dependency on the Nth instruction issued before the control word, while “_cycle_N” declares a dependency on a particular pipeline stage of the previous instruction?
 
Yeah. I would guess that “_dep_N” declares a dependency on the Nth instruction issued before the control word,
Figure 3 in the patent document makes this explicit, in my opinion.

while “_cycle_N” declares a dependency on a particular pipeline stage of the previous instruction?
SALU instructions are always a single cycle, aren't they?
 
Figure 3 in the patent document makes this explicit, in my opinion.


SALU instructions are always a single cycle, aren't they?
The scalar ALU pipeline itself is probably single cycle execution with 1 extra stage for writeback (or a dest cache like VRF), but the results might not be ready in some edge cases with special scalar registers. MI200/GCN ISA manual gives a rough idea of what some of these cases are (in the Manually Inserted NOP section), which are all currently managed by hardware scoreboarding on RDNA 1/2.
 
The scalar ALU pipeline itself is probably single cycle execution with 1 extra stage for writeback (or a dest cache like VRF), but the results might not be ready in some edge cases with special scalar registers. MI200/GCN ISA manual gives a rough idea of what some of these cases are (in the Manually Inserted NOP section), which are all currently managed by hardware scoreboarding on RDNA 1/2.
So it seems to me that a VALU instruction of four clocks can be delayed from 1 to 3 clocks by specifying SALU_CYCLE_x, and correspondingly an FMA_ACCUM_CYCLE_1 requires a 1-clock delay. I suppose the use of the CYCLE keyword implies that instruction issue occurs on the basis that an in-flight operand will be delivered in time for the referenced cycle, so the hardware knows that an operand is in flight, as opposed to DEP, which specifies that an instruction has completed and so nothing is in flight.
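
A toy model of that CYCLE-versus-DEP reading, in C++ for concreteness (the forwarding latency constant is my assumption, purely for illustration):

Code:
#include <string>

// Toy model: DEP waits for an instruction to complete, while CYCLE only
// delays until an in-flight SALU result becomes forwardable. The latency
// constant is an assumption for illustration, not a confirmed number.
std::string pickSaluDelay(int issueSlotsSinceProducer) {
    const int kAssumedSaluForwardLatency = 3; // assumption
    int remaining = kAssumedSaluForwardLatency - issueSlotsSinceProducer;
    if (remaining <= 0)
        return "NO_DEP"; // result already available, no delay needed
    return "SALU_CYCLE_" + std::to_string(remaining);
}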

I suppose we can dig further into the reasons why a NOP is issued by the compiler (for RDNA or CDNA) and assume that RDNA 3's compiler will never issue a NOP and will always use DELAY instead.

There seems to be a new concept of LDS Direct in RDNA 3:


Detect LDS direct WAR/WAW hazards and compute values for wait_vdst (va_vdst) parameter. Where appropriate this raises wait_vdst from the default 0 to allow concurrent issue of LDS direct with VALU execution.

Also detect LDS direct versus VMEM source VGPR hazards and insert vm_vsrc=0 waits using s_waitcnt_depctr.
So I'm wondering whether all previous cases of variable latency for LDS operations have been eliminated in RDNA 3.

I think with this, RDNA 3 would truly have no hardware dedicated to monitoring the validity of operands, per lane, in order for an instruction to be issued. It would seem this is the basis of a substantial saving in power consumption by each CU.

"strict_wqm" seems to mean "strict whole quad mode":


as opposed to "(strict) whole wavefront mode" for *_wwm, but that's to do with predication of lanes.
 
There seems to be a new concept of LDS Direct in RDNA 3:

So I'm wondering whether all previous cases of variable latency for LDS operations have been eliminated in RDNA 3.
LDS Direct Read is not new to RDNA3. It allows one dword (32-bit value) at the LDS offset held by the M0 register to be always available as a uniform/scalar operand to the VALU.
 
AMD Navi 3X GPU with even more cores than Navi 31

While Navi 31 was initially rumored to feature two GCDs (Graphics Compute Dies) and up to 16384 Stream Processors, those specs have now been updated to a single GCD and 12288 cores. However, according to Greymon55, the 16384 SP configuration may not be completely off the table just yet.

According to RedGamingTech, AMD could be developing a Radeon Pro (workstation) graphics card with dual GCD configuration, which might explain where this fourth Navi 3X GPU could find its use. The dual GCD configurations are now expected to debut with Navi 4X (RDNA4) architecture, rather than Navi 3X, the YouTuber points out.
 

Looks like VOPD supports only co-issuing the core f32 instructions (add/mul/fma/etc) plus a few utility instructions (see: VOPDXPseudos and VOPDYPseudos), or in other words 16x13 combinations.
No co-issuing for integer things and packed maths...

The interesting bit is that patents were indicating potentially supporting single-cycle wave64 execution as well. Does this mean only this same subset of 13 instructions will be executed single-cycle in wave64 mode? :unsure:

Feels like the design intent is more about making better use of idle slots/resources (well… while competing with vector memory IO) in existing data paths through some small bets.
 
AMD Navi 3X GPU with even more cores than Navi 31

While Navi 31 was initially rumored to feature two GCDs (Graphics Compute Dies) and up to 16384 Stream Processors, those specs have now been updated to a single GCD and 12288 cores. However, according to Greymon55, the 16384 SP configuration may not be completely off the table just yet.

One giant compute die leaves way, way too much of a price gap for 3 GPUs to fill for any company to be totally comfortable with. 192 CUs (in RDNA2 terms) might be able to hit 2.5x perf, but what does your middle tier look like? One GPU can't fill the entire $350-$1k range, which is the far higher volume range. It'd be a weird strategy for this to happen, and two compute dies really, really help with that whole gap-filling thing.

Also, no one should bother posting more tweets from Greymon, the emperor of pulling it out of their rear. Yes, there's a $20k HPC-sized die config that would need multiple stacks of HBM to feed it as a consumer GPU, sure.
 

Looks like VOPD supports only co-issuing the core f32 instructions (add/mul/fma/etc) plus a few utility instructions (see: VOPDXPseudos and VOPDYPseudos), or in other words 16x13 combinations.
No co-issuing for integer things and packed maths...

The interesting bit is that patents were indicating potentially supporting single-cycle wave64 execution as well. Does this mean only this same subset of 13 instructions will be executed single-cycle in wave64 mode? :unsure:
Are you suggesting that the second SIMD in Super SIMD is for 13 (or 16) instructions only? That would mean this second SIMD would be idle quite a lot.

Feels like the design intent is more about making better use of idle slots/resources (well… while competing with vector memory IO) in existing data paths through some small bets.
It seems vertex shaders (geometry in general) are typically issued in wave32 mode, as are compute shaders, while wave64 is primarily for pixel shaders.

So VOPD looks like it's probably not applicable to pixel shaders ("VOPD is a new encoding for dual-issue instructions for use in wave32"), reinforcing the idea that pixel shaders require wave64 mode merely to fully utilise the Super SIMD, doing single-cycle 64-work-item issue. But that assumes that both SIMDs in the Super SIMD support the full instruction set in dual-issue, non-VOPD, mode.
 
Are you suggesting that the second SIMD in Super SIMD is for 13 (or 16) instructions only? That would mean this second SIMD would be idle quite a lot.


It seems vertex shaders (geometry in general) are typically issued in wave32 mode, as are compute shaders, while wave64 is primarily for pixel shaders.

So VOPD looks like it's probably not applicable to pixel shaders ("VOPD is a new encoding for dual-issue instructions for use in wave32"), reinforcing the idea that pixel shaders require wave64 mode merely to fully utilise the Super SIMD, doing single-cycle 64-work-item issue. But that assumes that both SIMDs in the Super SIMD support the full instruction set in dual-issue, non-VOPD, mode.
I think there are two branches of possibilities…

The 2nd ALU pipeline does only the listed 13 instructions.
  • There will be Wave64 single-cycle execution, but only for these 13 instructions.
  • Wave64 remains double issuing over 2 cycles.
  • The second ALU pipeline is dynamically shared between the two SIMDs in a CU (!???!??!)
Both ALU pipelines are (mostly?) symmetrical.
  • For Wave64s, most VALU instructions will be single cycle execution.
  • For Wave32s, the hardware will try to pick up to 2 independent wavefronts to issue, if no VOPD is the oldest in buffer.
  • For VOPD, the available opcodes are deliberately limited to a selected subset.
    • Maybe they are trying to stick with a 64-bit encoding [1], by supporting only a subset empirically known to benefit from co-issuing?

[1] Edit: This is indeed the case — VOPD encoding is 64-bit, with only 4/5 bits for each of the 2 opcodes. There is also a 96-bit VOPD variant with a 32-bit immediate (shared?) for fmaak and fmamk.
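
For concreteness, here's a rough carve-up of such a 64-bit budget. The field order and widths are entirely my guesses; the only grounded parts are the 4/5-bit opcode fields:

Code:
#include <cstdint>

// Guessed layout for a 64-bit VOPD word, just to show the bit budget can
// work with 4/5-bit opcodes; the real field order and widths are unknown
// to me.
struct VopdWord64 {
    uint64_t opX   : 4; // X-side opcode (13 entries fit in 4 bits)
    uint64_t opY   : 5; // Y-side opcode
    uint64_t vdstX : 8; // destination VGPR, X side
    uint64_t vdstY : 8; // destination VGPR, Y side
    uint64_t src0X : 9; // 9 bits would allow VGPR/SGPR/inline constants
    uint64_t src1X : 8;
    uint64_t src0Y : 9;
    uint64_t src1Y : 8;
    // 5 bits left over for an encoding identifier, modifiers, etc.
};
static_assert(sizeof(VopdWord64) == 8, "fits in 64 bits");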
 
Plausible configs:

  • 16 WGP, 64-bit bus, 16MB LLC, monolithic. Small form factor/notebook, <=$199-ish, basically the standard RDNA3 iGPU in a standalone package.
  • 32 WGP compute die.
  • 96 WGP compute die.
  • I/O dies, 128/192/256/384-bit bus? and media engines/etc. Quite plausibly 6nm. Unsure of how many designs: more designs mean more design cost and inventory concerns, but more cost optimization per design offsets that, especially with memory sizes at roughly $50 per 4GB.
  • 1-2 compute dies per board with 1 I/O die. LLC comes in some sort of multiple of 64/96MB stacked chiplets, maxing out at 384MB for the highest end.

Scales from 12 to 60-something teraflops across the compute die versions. Maxes out at around 450 watts, air cooled. A 600+ watt, >=70 teraflop version is plausible next year, either as a limited liquid-cooled version like the 6900 LC or as air-cooled mass market if that new copper-clad cooling solution can be licensed and brought up to speed fast/well enough.
 

I found this amusing comment:

Code:
    // TODO: In wave64 mode, double the number of cycles for VALU and VMEM
    // instructions on the assumption that they will usually have to be issued
    // twice?

So they're unclear on how the hardware tracks antecedent instructions in the case of wave64. The implication is that wave64 is still two-cycle issue, like RDNA 1/2. There is a problem with "wave64" as a blanket term here, though. The compiler (at least for RDNA 1/2) can choose not to issue lo/hi halves of wave64 on alternating cycles; instead it can issue a sequence of lo halves followed by hi halves... It may be that this alternate compilation is disallowed for RDNA 3.
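
What the TODO describes would be a one-line change in a scheduling model; a sketch of it (mine, not the actual LLVM code):

Code:
// Charge VALU/VMEM instructions double the cycles in wave64 mode, on the
// assumption that they issue twice (lo half, then hi half).
int modeledCycles(int baseCycles, bool isWave64, bool isValuOrVmem) {
    return (isWave64 && isValuOrVmem) ? baseCycles * 2 : baseCycles;
}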

We can see that two dependencies can be raised for a single instruction (as I predicted, based on the "SAME" code):

Code:
    ; CHECK-LABEL: {{^}}valu_dep_1_same_trans32_dep_1:
    ; CHECK: %bb.0:
    ; CHECK-NEXT: v_exp_f32_e32 v0, v0
    ; CHECK-NEXT: v_add_nc_u32_e32 v1, v1, v1
    ; CHECK-NEXT: s_delay_alu instid0(TRANS32_DEP_1) | instid1(VALU_DEP_1)
    ; CHECK-NEXT: v_add_nc_u32_e32 v0, v0, v1

here TRANS and VALU are antecedents for the final line. The count is 1 for both, indicating that counting is per ALU type, not per the literal instruction stream.
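
So to emit the right _DEP_N, the compiler presumably walks backwards from the consumer with a separate counter per pipeline. A minimal sketch (the types and function are mine, for illustration):

Code:
#include <string>
#include <vector>

enum class Pipe { VALU, TRANS, SALU, OTHER };

// Count backwards from the consumer, counting only instructions in the
// producer's own pipeline. This reproduces TRANS32_DEP_1 and VALU_DEP_1
// both being "1" above, even though the two producers sit at different
// distances in the literal instruction stream.
std::string depCode(const std::vector<Pipe> &prior, size_t producerIdx) {
    Pipe p = prior[producerIdx];
    int n = 0;
    for (size_t i = prior.size(); i-- > producerIdx;)
        if (prior[i] == p)
            ++n;
    return (p == Pipe::TRANS ? "TRANS32_DEP_" : "VALU_DEP_") +
           std::to_string(n);
}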

So the cycle counts for an instruction appear to be:
  • SALU = 3
  • TRANS = 3
  • VALU = 4
but this code does not use SALU_CYCLE_2 or _3, so I'm not totally sure whether these are purely ALU cycle counts or related to data latencies too. I'm still puzzled why SALU has "CYCLE" but VALU and TRANS have "DEP". It may be, as suggested earlier by @pTmdfx, simply because SALU has data latencies; the example in the code hints as much:

Code:
# There's no need for SALU_CYCLE_2 here because the s_mov will have completed
# already.
[...]
    ; CHECK-LABEL: {{^}}salu_cycle_2:
    ; CHECK: %bb.0:
    ; CHECK-NEXT: s_mov_b32 s0, 0
    ; CHECK-NEXT: v_add_nc_u32_e32 v1, v1, v1
    ; CHECK-NEXT: v_add_nc_u32_e32 v0, s0, v0

The code does not show use of FMA_ACCUM_CYCLE_1, either. Nor does it show use of NO_DEP.

Finally, NEXT means the instruction following the successor of the delay instruction, such that the SKIP_ codes specify beyond that point:

Code:
; GCN:       ; %bb.0: ; %main_body
; GCN-NEXT:    s_mov_b32 s3, exec_lo
; GCN-NEXT:    s_wqm_b32 exec_lo, exec_lo
; GCN-NEXT:    s_mov_b32 m0, s2
; GCN-NEXT:    lds_param_load v1, attr0.x wait_vdst:15
; GCN-NEXT:    s_mov_b32 exec_lo, s3
; GCN-NEXT:    v_mov_b32_e32 v0, s0
; GCN-NEXT:    v_mov_b32_e32 v2, s1
; GCN-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GCN-NEXT:    v_interp_p10_f16_f32 v3, v1, v0, v1
; GCN-NEXT:    v_interp_p10_f16_f32 v0, v1, v0, v1 op_sel:[1,0,1,0] wait_exp:7
; GCN-NEXT:    v_interp_p2_f16_f32 v3, v1, v2, v3 wait_exp:7
; GCN-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GCN-NEXT:    v_interp_p2_f16_f32 v0, v1, v2, v0 op_sel:[1,0,0,0] wait_exp:7
; GCN-NEXT:    v_add_f16_e32 v0, v3, v0
; GCN-NEXT:    ; return to shader part epilog

so here we can see a SKIP_1 and a NEXT, specifying that:
  • v_interp_p2_f16_f32 should wait for the 2nd backwards instruction (SKIP_1 combined with VALU_DEP_2)
  • v_add_f16 should wait for the result of v_interp_p2_f16_f32 (NEXT combined with VALU_DEP_1)
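
Using the indices of the InstSkips array quoted earlier (SAME=0, NEXT=1, SKIP_1=2, ...), the target of the second delay seems to fall out neatly; a sketch of my reading:

Code:
// Which instruction the second delay targets, counted from the
// s_delay_alu itself (the first delay always targets instruction 1,
// the one immediately following).
// SAME -> 1 (same instruction as the first delay)
// NEXT -> 2, SKIP_1 -> 3, ..., SKIP_4 -> 6
int secondDelayTarget(int instSkipIndex) {
    return instSkipIndex + 1;
}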
It does seem like these delay instructions are going to add a lot of bloat to shader code. Historically AMD has taken great pains to point out that instruction-cache pressure is a problem for performance. Somewhere in my rummagings I've seen that the instruction cache now uses 128-byte lines, so, erm, perhaps the instruction cache is at least twice as large as before?
 
I think there are two branches of possibilities…
I'd like to add a third: the TRANS ALU seems to have very high throughput, seemingly 3 cycles, which implies it's 32 lanes wide, I reckon. Add a fourth cycle and it can become the second VALU.

So the Super SIMD consists of a plain VALU and a TRANS/VALU?

Good use of the delay instruction: it can cater for both TRANS and "Y" VALU :)
 