AMD: RDNA 3 Speculation, Rumours and Discussion

Status
Not open for further replies.
Yes, that other stuff has been an unknown to me. But it might be 3R2W on the basis that 3-operand reads in the VALU are rare, so just make those other instructions stall or read their operands over extra cycles. <snip>
Having re-read the GCN and RDNA whitepapers, I gather that the VRF is multi-banked (at least 4 banks, I reckon?) with a simple bank design (probably 1R1W), and that operand gathering/collecting logic has existed at least since RDNA 1. A research paper whose simulator is modelled after GCN/RDNA also seems to support the theory. This would also explain how they managed to have transcendental (8 lanes) and DPFP (2 lanes) as separate, narrower execution units since RDNA 1: the full 32-lane input operands are read out of the VRF all at once, held in the operand buffer, and spoon-fed to these narrower execution units.

In this case, if they do need extra VRF bandwidth, they have the option of either upping the number of banks or beefing up the bank design (e.g., +1 read port).
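To make the banking idea concrete, here is a toy Python model of how an operand collector's gather latency depends on bank conflicts. This is my own sketch: the bank count, the register-to-bank mapping, and the 1R1W ports are all assumptions, not documented behaviour.

```python
# Toy model of a multi-banked VRF feeding an operand collector.
# Assumptions (mine, not from AMD docs): 4 banks, 1 read port each,
# and a register's bank is simply (register index mod bank count).
from collections import Counter

NUM_BANKS = 4

def cycles_to_gather(source_regs):
    """Cycles the operand collector needs to read all source operands
    when each bank can service only one read per cycle."""
    per_bank = Counter(reg % NUM_BANKS for reg in source_regs)
    return max(per_bank.values())  # the worst-hit bank sets the latency

# An FMA reading v1, v2, v3 hits three different banks: no conflict.
print(cycles_to_gather([1, 2, 3]))  # 1 cycle
# v0, v4, v8 all map to bank 0, so the three reads serialize.
print(cycles_to_gather([0, 4, 8]))  # 3 cycles
```

With enough banks and a decent register allocator, most instructions would gather their operands in a single cycle, which is why simple 1R1W banks plus a collector can stand in for an expensive multi-ported file.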

It is still unclear to me what role VOPD serves, though, considering it seems to be for wave32 only too. A no-dependency cue for the operand gatherer? But it could deduce that by itself, couldn't it? Heh. One theory I can come up with is that the SIMD frontend still defaults to prioritizing issue from two different wave32s, so VOPD exists to force co-issue of (most) VALU instructions from the same wave, eh?
 
It is still unclear to me what role VOPD serves, though, considering it seems to be for wave32 only too.
A non-wave32-only variant might appear? Well, I doubt it.

A no-dependency cue for the operand gatherer? But it could deduce that by itself, couldn't it? Heh. One theory I can come up with is that the SIMD frontend still defaults to prioritizing issue from two different wave32s, so VOPD exists to force co-issue of (most) VALU instructions from the same wave, eh?
I believe RDNA will happily issue from one hardware thread as eagerly as possible over many consecutive cycles until forced to switch due to a latency-incurring instruction. RDNA's design priority is to minimise the wall-clock lifetime of any given hardware thread, so it only gives up on a thread when forced to. It doesn't need a "hint" about sticking to a single thread.

My theory is that "dual-issue" is merely a side effect, in a similar fashion to the way that packed-math has a side-effect of "dual-issue" on pairs of FP16 sub-ranges of VGPRs.

There are a lot of bitwise, boolean and add/mul operations that are sub-operations of other VALU instructions; see the CUBE and SAD instructions for examples. My theory is that it doesn't take much effort to form new pairs of instructions from these sub-operations, and to make these pairs dual-issued.

Really these would be co-issued, from the same hardware thread, so "dual-issue" is in my opinion likely to be a misnomer.

But the result could be a set of, say, 50 pairs of these instructions, each consuming 3 operands in total. Currently there are about 75 VOP3* instructions with 3 operands. Basically, we'd be looking for VOP1 and VOP2 instructions (as well as 1- and 2-operand VOP3 instructions) that boil down to co-issuable pairings.

For example, a VOPD pair could be FLOOR and MAX, where FLOOR takes one operand and MAX takes two. If that pairing clashes inside the pipeline, how about FLOOR and MUL as a co-issuable pair? It should be possible to come up with a table of valid pairings. Yes, the compiler suffers a whole new level of pain, but that seems to be par for the course these days.
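A minimal sketch of what such a pairing table could reduce to, assuming (purely for illustration) that co-issuability is limited by a 3-reads-per-cycle VRF budget. The opcode names and source counts here are made up for the example, not taken from any ISA document:

```python
# Hypothetical VOPD-style pairing check: a pair is co-issuable only if
# the combined source-operand count fits the assumed register-file
# read bandwidth of 3 operands per cycle.

SRC_COUNT = {"FLOOR": 1, "MAX": 2, "MUL": 2, "FMA": 3}
MAX_READS_PER_CYCLE = 3

def co_issuable(op_a, op_b):
    return SRC_COUNT[op_a] + SRC_COUNT[op_b] <= MAX_READS_PER_CYCLE

print(co_issuable("FLOOR", "MAX"))  # True:  1 + 2 reads fit the budget
print(co_issuable("MUL", "FMA"))    # False: 2 + 3 reads exceed it
```

Real hardware would add further constraints (destination-bank clashes, which pipe each opcode can use), but the compiler-side problem is the same: consult a table of legal pairings while scheduling.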

I'll spend some time with that paper you linked at some point. I have a feeling I've read it before.
 
OK, this discussion of recent patches mentions VOPD, and perhaps suggests what's going on around it, too:


I interpret this to mean that we're merely looking at fp16 instructions being arbitrarily paired as a single instruction, rather than the packed math requirement where both halves are the same instruction.

It's also notable in the LLVM patches that decoding of longer instructions:

⚙ D125316 [AMDGPU] gfx11 Decode wider instructions. NFC (llvm.org)

is coming, which would support "dual-issue".

VGPR indexing and LDS (and other memory mechanics) look like they're changing:

⚙ D125319 [AMDGPU] gfx11 BUF Instructions (llvm.org)

It seems that the register file would need to be reorganised to support this stuff, if we assume that 16-bit operands are first class citizens.

And then there's:


which is a nice twist and not necessarily on the topic of VOPD.

I would be tempted to interpret that, though, as fused SIMD16s to make a SIMD32. In my opinion "wave64" gets too much attention. RDNA didn't make wave32 intrinsic to the hardware for no good reason...
 
OK, this discussion of recent patches mentions VOPD, and perhaps suggests what's going on around it, too:

<snip>

And then there's:


which is a nice twist and not necessarily on the topic of VOPD.
The "Super-SIMD" patent did describe that:

* "VLIW2 includes two regular instructions in a larger instruction word."
* "At step 630, the instruction scheduler selects one VLIW2 instruction from the highest priority wave or two single instructions from two waves based on priority. "
* "In an implementation, super-SIMD 200 is provided instructions from a hardware instruction sequencer (not shown) in order to issue two VALU instructions from different waves when one wave cannot feed the ALU pipeline."

So regardless of the SIMD lane width, VOPD seems to fit the bill for such a VLIW2 bundle.

I would be tempted to interpret that, though, as fused SIMD16s to make a SIMD32. In my opinion "wave64" gets too much attention. RDNA didn't make wave32 intrinsic to the hardware for no good reason...
TBF, "wave32/64 intrinsic" behaviour does still exist in the form of DS_Permute, though AMD did remove 32-lane DPP with RDNA, leaving only the DPP8 and DPP16 modes.

Anyway, for dual-issue wave32 to happen with 16-lane SIMDs, you would need 4 SIMD16s. Then either it quad-issues with doubled VRF banks, or it uses some sort of upper/lower arrangement. Say, each CU has two clusters of dual-issue SIMD16 + VRF, where each cluster can work independently and has its own pool of waves, but the clusters can be joined for wave32/64 (at the cost of half the wave pool, with one cluster yielding control to the other).

... this does coincidentally mean 4 SIMDs per CU, even though each CU will stay 64 lanes. :oops:

If this is true, I can't quite imagine how the current 32-lane VMEM & LDS pipeline (which is lined up with 128B cache lines) would work, though. :???:
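Just to sanity-check the lane bookkeeping in the speculation above (all of these numbers are the post's assumptions, nothing confirmed):

```python
# Lane arithmetic for the speculative 4x SIMD16 layout.
simd_width   = 16
simds_per_cu = 4
lanes_per_cu = simd_width * simds_per_cu
print(lanes_per_cu)  # 64: the CU stays 64 lanes wide, as today

# A wave32 on a joined pair of dual-issue SIMD16s still retires its
# 32 work items per instruction in a single cycle:
wave_size = 32
cycles_per_instruction = wave_size // (simd_width * 2)
print(cycles_per_instruction)  # 1, matching today's wave32-on-SIMD32
```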
 
The key to making a pair (or more) of compute chiplets work is in how geometry workloads and results are distributed, collated and re-despatched.

Another option is to just have one big compute chiplet with all the other chiplets being memory controllers and L3 cache. Would be a reasonable first step.
 
The "Super-SIMD" patent
"VGPR file can provide 4Read-4Write (4R4W) in four cycles, but profiling data also shows that VGPR bandwidth is not fully utilized as the average number of reads per instruction is about two. Since an ALU pipeline can be multiple cycles deep and have a latency of few instructions, a need exists to more fully utilize VGPR bandwidth."

Hence my original idea: constrain the register file to 2R2W, make the ALUs run simpler instructions, and use macros that take 2 cycles as needed, i.e. only those instructions that fetch 3 operands from the VRF would take 2 cycles.
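As a sketch of the cost model implied here (my own toy maths, assuming operand fetch through the 2 read ports is the only bottleneck):

```python
# "2R2W + 2-cycle macro" idea: instructions needing at most 2 VRF
# source reads issue every cycle; 3-source instructions (e.g. an FMA
# with all operands coming from the VRF) take 2 cycles to fetch.

READ_PORTS = 2

def issue_cycles(num_vrf_sources):
    # Ceiling division: cycles to stream the sources through 2 ports.
    return -(-num_vrf_sources // READ_PORTS)

print(issue_cycles(2))  # 1 cycle  (ADD, MUL, ...)
print(issue_cycles(3))  # 2 cycles (3-operand VOP3*)
```

Since the patent's profiling figure says the average is about two reads per instruction, the common case would still issue at full rate; only the 3-operand tail pays the extra cycle.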

Banks/ports don't help with VRF read bandwidth for the VALUs, since it's quite likely that all operands come from the same bank. You have to do more, e.g. with an operand collector and with Do$ (destination operand cache).

As soon as you implement the operand collector and Do$ you have just reduced the amount of VRF bandwidth you are going to use... Ironic.

Figure 3 here:

https://www.amd.com/system/files/documents/rdna-whitepaper.pdf

certainly shows how Do$ would provide a benefit, as both Navi examples (wave32 and wave64) have "dependency stalls" when consuming VGPR v0.

It might even be valid to say that wave32 is hamstrung by not having a Do$. But, these dependency stalls should be hidden, in general, with another hardware thread.

... this does coincidentally mean 4 SIMDs per CU, even though each CU will stay 64 lanes. :oops:
It's amusing.

Worth remembering: the 2-issue might solely be for:
  • general math to the VALU
  • transcendental math to a transcendental-SIMD
 

I don't know if this leaker is trolling or whatever, but IF RDNA3 uses "VLIW2" (dual-issue FP32) as a base unit, having only 6144 units for Navi 31 (that's 12288 FP32 ops peak per cycle) makes absolutely no sense, given the process-node change and the move to MCM. It would only make sense if it referred to a single GCD, with 2 on the package, or maybe to Navi 32 (still unlikely in my view). Especially when the move of decoupling the second FP32 unit from INT in Ada was easily predictable.
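For what it's worth, the leak's arithmetic does at least hang together. Everything below is rumoured numbers plus a hypothetical clock, counting each FMA as 2 FLOPs:

```python
# Unpacking the leaked figures (rumour, nothing confirmed).
alus = 6144            # rumoured FP32 lanes for Navi 31
dual_issue = 2         # "VLIW2": two FP32 ops per lane per cycle
ops_per_cycle = alus * dual_issue
print(ops_per_cycle)   # 12288, matching the figure quoted above

# Peak FMA throughput at a hypothetical 3 GHz (FMA = 2 FLOPs):
clock_ghz = 3.0
tflops = ops_per_cycle * 2 * clock_ghz / 1000
print(tflops)          # 73.728 TFLOPS peak
```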
 

I don't know if this leaker is trolling or whatever, but IF RDNA3 uses "VLIW2" (dual-issue FP32) as a base unit, having only 6144 units for Navi 31 (that's 12288 FP32 ops peak per cycle) makes absolutely no sense, given the process-node change and the move to MCM. It would only make sense if it referred to a single GCD, with 2 on the package, or maybe to Navi 32 (still unlikely in my view). Especially when the move of decoupling the second FP32 unit from INT in Ada was easily predictable.

What do the options even mean? How does VLIW on or off change the total number of FP32 lanes?

Was there confirmation that the INT pipe was decoupled in Ada? That would require additional dispatch hardware.
 
The "VLIW2" option, in his opinion, seems to indicate that in RDNA3 it would mean co-issuing two FP32 instructions on the same ALU, i.e. having two linked ALUs (he is the leaker, not me, btw).
There were also discussions between leakers recently about how the second FP32 unit in Ada was decoupled from INT (said by Kopite and IIRC confirmed by Kepler and another leaker I can't remember at the moment).
 
There were also discussions between leakers recently about how the second FP32 unit in Ada was decoupled from INT
This rumor makes little sense though. What would be the point of that?
I feel like these "leakers" are trying to figure out the h/w internals of Lovelace going off the drawings provided for Hopper.
But these drawings rarely show the actual h/w arrangement of units inside the chip.
 
The "VLIW2" option, in his opinion, seems to indicate that in RDNA3 it would mean co-issuing two FP32 instructions on the same ALU, i.e. having two linked ALUs (he is the leaker, not me, btw).

Oh, his numbers are IPC, not total ALUs. 6144 IPC at 3 GHz would be a 70% increase over Navi 21, which is a decent bump. VLIW2 would be icing on the cake. Does sound too low for multiple GCDs, though.

There were also discussion between leakers recently about how the second FP32 in ADA was decoupled from INT (said by Kopite and IIRC confirmed by Kepler and another leaker I don't remember at the moment).

Ok, decoupled doesn’t necessarily mean INT and FP32 can co-issue though.
 
Oh, his numbers are IPC, not total ALUs. 6144 IPC at 3 GHz would be a 70% increase over Navi 21, which is a decent bump. VLIW2 would be icing on the cake. Does sound too low for multiple GCDs, though.

Too low considering the process-node change, rumoured TDP, use of MCM and the competition (144 SMs for the full Ada chip and increased clocks have been known for a long time). A monolithic chip would have achieved the same without all the packaging mess.

Ok, decoupled doesn’t necessarily mean INT and FP32 can co-issue though.

The leakers spoke specifically about co-issuing; that's the point. (Sorry, I'm trying to retrieve the thread but for some reason I'm unable to; in any case, everything's a rumour until release.)

Update: found this:

 
I looked at:

AMDGPU.td · llvm-github

to compare:

def FeatureGFX10

versus:

def FeatureGFX11

And found the following differences:

New features:
  • FeatureGFX10_AEncoding,
  • FeatureGFX10_BEncoding,
  • FeatureGFX10_3Insts,
  • FeatureGFX11Insts,
  • FeatureVOPD,
  • FeatureTrue16BitInsts
Removed features:
  • FeatureSMemRealTime,
  • FeatureSDWA,
  • FeatureSDWAOmod,
  • FeatureSDWAScalar,
  • FeatureSDWASdst,
  • FeatureSMemTimeInst,
  • FeatureImageInsts

FeatureTrue16BitInsts - my theory is that this is the basis of VOPD. This also makes SDWA redundant as a concept.
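For context on why true 16-bit instructions would make SDWA redundant: SDWA's job was to select sub-words of 32-bit VGPRs as operands before the ALU op. Here is a rough Python model of that word selection (simplified; real SDWA also has byte selects and destination preserve/clamp behaviour). If 16-bit halves become directly addressable registers, this selection step, and the SDWA encoding with it, is no longer needed:

```python
# Simplified model of SDWA word selection on a 32-bit VGPR.
def sdwa_sel(reg32, sel):
    """sel = 0 -> low 16 bits (WORD_0), 1 -> high 16 bits (WORD_1)."""
    return (reg32 >> (16 * sel)) & 0xFFFF

v0 = 0xBEEF1234
print(hex(sdwa_sel(v0, 0)))  # 0x1234 (WORD_0)
print(hex(sdwa_sel(v0, 1)))  # 0xbeef (WORD_1)
```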

EDIT: made a mistake with FeatureUnalignedDSAccess
 
Too low considering the process-node change, rumoured TDP, use of MCM and the competition (144 SMs for the full Ada chip and increased clocks have been known for a long time). A monolithic chip would have achieved the same without all the packaging mess.

I wouldn’t be surprised if Navi 31 is still monolithic.

The leakers spoke specifically about co-issuing; that's the point. (Sorry, I'm trying to retrieve the thread but for some reason I'm unable to; in any case, everything's a rumour until release.)

Update: found this:


There’s no mention of co-issue there. Ampere is also “true double FP32” when not issuing INT.

It’s amusing how these same folks don’t acknowledge the fact RDNA also runs INT and FP32 on the same hardware. It’s like INT consumes FP32 issue bandwidth on Ampere but it’s free on every other architecture.
 
There’s no mention of co-issue there. Ampere is also “true double FP32” when not issuing INT.

It’s amusing how these same folks don’t acknowledge the fact RDNA also runs INT and FP32 on the same hardware. It’s like INT consumes FP32 issue bandwidth on Ampere but it’s free on every other architecture.

Well, the tweet is pointing to a difference in how FP32 instructions are executed, and we all know that the second mixed INT/FP32 pipe in Ampere cannot execute an FP32 instruction and an INT instruction concurrently.
Meanwhile RDNA2 can without losing FP throughput, and apparently this is the case for Ada too. That's what is pointed out in that tweet; no need to blame Kepler for trying to simplify instead of giving lengthy explanations about bandwidth and register pressure.
 
My bad, I was remembering it wrongly; I got confused between scalar units and vector units. In any case, the point about Ada/Ampere differences is still valid.
 
RDNA has integer instructions on SALU and VALU. SALU instructions are shared by all work items in a work group.
 
RDNA has integer instructions on SALU and VALU. SALU instructions are shared by all work items in a work group.

Yes I’m obviously not talking about shared scalar INT instructions. Vector INTs run on the same hardware as FP32 on RDNA.

My bad, I was remembering it wrongly; I got confused between scalar units and vector units. In any case, the point about Ada/Ampere differences is still valid.

Yeah I’m not sold on the Ada speculation. I agree with Degustator that people are guessing based on the Hopper picture.
 