AMD: RDNA 3 Speculation, Rumours and Discussion

Status
Not open for further replies.
Correct, in the current market AMD mostly exists as a pricing lever in the eyes of the consumer population.
So they're not gonna actively compete on price or try to push major GPU volumes, etc.

Consumers are so dim that a 6700 XT can currently be had for $3xx on eBay. That's a fantastic deal: Metro EE @ 1440p max settings at over 60 fps. You're not getting that again for over a year at least.

Anyway, the arch. If the current specs are "real" (big if), we're looking at a mid-range part that's as fast as a 3090. That's probably $600 with a decent profit margin for what Nvidia wanted to charge $900 for; no wonder they canceled.
 
Now that you mention it: on October 12th I found a brand new 6700 XT for 479€ at a well-known PC retailer while looking for the new A770 16GB, and I was so tempted by the 6700 XT that I waited a few minutes before making a decision. In the end I went with the A770 16GB for several reasons, especially RT performance, video memory, and the fact that I wanted to support Intel this time around, having decided to buy the A770 after reading/watching many reviews.

On a different note, the expected launch date of the new RX 7000 GPUs seems to be December.

 
8GB Navi 33 at 1080p is probably going to be about as fast as a 3090 at 1080p. $400? $500?

Is that what you mean by mid range?
 
It bugs the hell out of me that the sets of available co-issuable OPs per VOPD-half are not even symmetric.
So I was glancing through the Super-SIMD patent application:


and I discovered something new (to me, at least). There are two aspects to this discovery:
  1. "side ALU"
  2. "full ALU"
A full ALU combines a "normal" ALU (SIMD-32 capable of FP32, called "core ALU" in the document) with a side ALU. The side ALU is described as being for "non-essential operations like conversion instructions" and does not have a multiplier. Additionally a side ALU can co-work with a normal ALU to finish complex operations like transcendental instructions.

Generally, a side ALU and core ALU have different functions and an instruction can be executed in either the side ALU or the core ALU. There are some instructions that can use the side ALU and core ALU working together—the side ALU and core ALU working together is a full ALU.

So here we have a new kind of ALU, the full ALU, which is formed by combining two ALUs to work together. (Reminiscent of G80's MUL which also worked with transcendentals).
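As a purely illustrative sketch of that taxonomy (all class and method names here are made up; nothing below comes from the patent or reflects real hardware interfaces), a "full ALU" could be pictured as the core/side pair finishing a multi-step operation together, e.g. a Newton-Raphson reciprocal refinement expressed as two FMAs:

```python
# Toy model of the patent's ALU taxonomy -- purely illustrative,
# hypothetical names, not a description of real hardware.

class CoreALU:
    """SIMD fp32 ALU with a multiplier ("core ALU" in the document)."""
    def fma(self, a, b, c):
        return a * b + c

class SideALU:
    """No multiplier; handles "non-essential" ops like conversions."""
    def cvt_i32_to_f32(self, x):
        return float(x)

class FullALU:
    """Core + side ALU co-working: the document's "full ALU"."""
    def __init__(self):
        self.core = CoreALU()
        self.side = SideALU()

    def rcp_refine(self, x, guess):
        # One Newton-Raphson step for 1/x -- the sort of multi-step
        # "complex operation" the pair could finish together:
        # guess' = guess * (2 - x * guess), written as two FMAs.
        err = self.core.fma(x, guess, -1.0)        # x*guess - 1
        return self.core.fma(guess, -err, guess)   # guess - guess*err

full = FullALU()
print(full.rcp_refine(4.0, 0.2))  # refines 0.2 toward 1/4 = 0.25
```

The point of the sketch is only the division of labour: the core ALU owns anything needing a multiplier, and "full" behaviour emerges from the pair, not from a third distinct unit.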

So I think the asymmetry in the VOPD-halves may be a side-effect of each half being different, with one half being a full ALU while the other half is merely a core ALU.

There isn't much talk about transcendental ALUs:

In certain implementations, ½, ¼, or ⅛ of N ALUs use transcendental ALUs (T-ALUs) with multiple cycle execution to save area and cost.

which sounds like Navi 1 and Navi 2. But then the water gets pretty muddy with the following three scenarios:

Several common implementations of super-SIMD blocks 200 can be utilized. These include the first ALU 220 and second ALU 230 both being a full ALU, ...

which I believe is not Navi 3, because of the asymmetry of VOPD.

... first ALU 220 being a full ALU and second ALU 230 being a core ALU or vice versa, ...

which I suppose is possible but there's no explicit mention of transcendentals.

... and coupling multiple super-SIMD blocks 200 in an alternating fashion across the super-SIMD blocks 200 utilizing one pair of core ALUs in a first block for first ALU 220 and second ALU 230, one set of side ALUs in a next block for first ALU 220 and second ALU 230, and one set of T-ALUs in a last block for first ALU 220 and second ALU 230.
which is effectively a 3-set of:
  • core
  • side
  • transcendental
ALUs. But this 3-set doesn't have the asymmetry that VOPD implies...

Overall, I think the description is trying to hint at a wide range of configurations centred upon the Super-SIMD concept, reaching down (at a guess) to low-cost, low-area variants suitable for APUs:

FIG. 4 illustrates a small compute unit 500 with two super-SIMDs 500a,b, a texture unit 530, a scheduler 510, and an LDS 520 connected with an L1 cache 540. The component parts of each super-SIMD 500a,b, can be as described above with respect to super-SIMDs of FIG. 1B and the specific example shown in FIG. 2 and super-SIMD of FIG. 3. In small compute unit 500, two super-SIMDs 500a,b replace the four single issue SIMDs. In CU 500, the ALU to texture ratio can be consistent with known compute units. Instruction per cycle (IPC) per wave can be improved and a reduced wave can be required for 32 KB VGPRs. CU 500 can also realize lower cost versions of SQ 510 and LDS 520.

That is a guess, though...
 
How plausible is it that AMD is going to use the neural-network tech that Xilinx has developed?
They will, but in CPUs, not GPUs. If they were going to put something in GPUs they'd put their own matrix cores in there, but they chose not to, at least in the case of RDNA 2.
 
Is there anything mentioned about the scalar ALU? To me, without having read this patent in full, this sounds like it would do away with the scalar ALU, its separate registers, caches, etc., and move that work to a wide SIMD like the primary vector ALU. In that way it could share the register space (maybe increased capacity) and finish transcendentals, tex coords and the like faster, freeing up the register space for use by the main ALU, maybe even helping it with some simpler tasks (add).
 
This document makes no mention of the scalar ALU.

The scalar ALU does stuff that's shared by all work items in a hardware thread. So ideally, for example, a loop counter is computed and tested by the scalar ALU.

The scalar ALU also computes predicates, the way individual work items are turned on/off for computation, usually with an IF statement. When the ELSE section of code needs to be executed, the predicate that controls individual work items has all of its bits flipped, so work items that were set to 1 become 0 and vice versa, and lanes that didn't execute the IF section now execute the ELSE section. If the predicate is all zeroes, then that section (IF or ELSE) is skipped entirely.
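A toy sketch of that predication mechanism (hypothetical names, a tiny 8-lane wave for readability; nothing here resembles real ISA encodings):

```python
# Toy model of SIMD predication: a 1-bit-per-lane EXEC mask gates
# which lanes write results. For IF/ELSE, the mask is computed once,
# the IF body runs under it, then every bit is flipped for the ELSE
# body. A section whose mask is all zeroes is skipped entirely.

WAVE_SIZE = 8  # real RDNA waves are 32 or 64 lanes

def run_if_else(values):
    exec_mask = [v > 0 for v in values]      # scalar-ALU-style predicate
    out = list(values)
    if any(exec_mask):                       # skip IF block if mask empty
        for lane, on in enumerate(exec_mask):
            if on:
                out[lane] = values[lane] * 2  # IF body: double the value
    exec_mask = [not b for b in exec_mask]   # flip all bits for ELSE
    if any(exec_mask):                       # skip ELSE block if mask empty
        for lane, on in enumerate(exec_mask):
            if on:
                out[lane] = -values[lane]     # ELSE body: negate the value
    return out

print(run_if_else([3, -1, 0, 5, -2, 4, 0, 1]))
# -> [6, 1, 0, 10, 2, 8, 0, 2]: positive lanes doubled, the rest negated
```

Every lane takes exactly one of the two paths, which is the whole point of keeping the predicate bookkeeping in a separate scalar unit.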

Removing the scalar ALU would, in my opinion, make the hardware worse.

Now, it could be argued that one lane of the SIMD can be "privileged" to function as the scalar ALU. This would reduce overall compute throughput, but then it can also be argued that one ALU is no longer spending a lot of time idle. In terms of wiring, the complexity of connecting the scalar register file to the ALU lanes (so that they can read computed results or be predicated on/off) is not going to be radically different from the privileged design. Finally, the scalar ALU has 128 registers, which at 32 bits per register is a tiny amount of die space, even when multiplied by the potential set of, say, 20 hardware threads per quarter-WGP (Navi uses four scalar ALUs per WGP, one per SIMD). In the privileged design, these registers (2560, say) need to be spread across the four banks of registers, making each bank a little bigger. Sort of like a set of registers for a 21st hardware thread.
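Quick arithmetic on the numbers above (128 registers at 32 bits, roughly 20 hardware threads per SIMD; figures taken from the post, not a hardware spec):

```python
# Back-of-envelope sizing of the scalar register file.
regs_per_thread = 128   # scalar registers per hardware thread
bytes_per_reg = 4       # 32 bits
threads = 20            # hardware threads per SIMD, say

total_regs = regs_per_thread * threads
total_bytes = total_regs * bytes_per_reg
print(total_regs)    # 2560 registers to fold into the vector banks
print(total_bytes)   # 10240 bytes, i.e. 10 KB -- tiny in die-area terms
```

So the "21st hardware thread" of registers costs on the order of 10 KB per SIMD, which backs up the claim that the SRF itself is a negligible amount of die space.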

In current Navi GPUs, stuff that's per-work-item is computed at full SIMD speed; there is never a scalar ALU bottleneck for transcendentals or texture coordinates. Sure, transcendentals currently are "slow". The RDNA whitepaper:


says that there's an 8-lane transcendental SIMD, taking four cycles to perform one instruction. Once an instruction has been issued to this SIMD, then on the following cycles instructions can be issued to the main fp32 SIMD, providing overlapped transcendental computation.
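Given those whitepaper numbers, the issue pattern works out as simple arithmetic (this is not a simulation of the real scheduler, just the rates implied above):

```python
# Issue-rate arithmetic for an 8-lane transcendental SIMD alongside a
# 32-lane fp32 SIMD, per the RDNA whitepaper figures quoted above.
wave_size = 32
fp32_lanes = 32
trans_lanes = 8

fp32_cycles = wave_size // fp32_lanes    # 1 cycle per wave32 fp32 op
trans_cycles = wave_size // trans_lanes  # 4 cycles per wave32 transcendental
print(trans_cycles)                      # 4

# While the transcendental unit grinds through its 4 cycles, the issue
# slots on the other cycles can feed the main fp32 SIMD:
overlapped_fp32_issues = trans_cycles - 1
print(overlapped_fp32_issues)            # 3 fp32 issues hidden per transcendental
```

In other words, transcendental throughput is a quarter of fp32 throughput, but the cost is largely hidden as long as there are fp32 instructions to interleave.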

Whether Super SIMD is capable of making transcendentals faster at lower cost than would have been the case in RDNA 1 and 2, I don't know. It would be nice if it did, I suppose. I don't know why the full ALU can be used to "finish complex operations like transcendental instructions". I can guess at scaling for the exponent in the floating point result (bit-shifting it). Some transcendentals, such as SIN and COS, require that the operand is re-computed to fit within a range of values, before the transcendental is computed. But that's before, not after...
 
Removing the scalar ALU would be a fundamental change to the CU programming model. No LLVM patches have indicated anything of the sort; rather, they point to it being an evolution over GFX10. There was also a patch for an SGPR-related bug workaround, if you need more definitive signs of "not dead".

Besides, the point of the scalar ALU, SRF and scalar DCache (K$) is to support wave-uniform operations & operands, notably things like control flow (incl. the 32/64-bit VALU EXEC/VCC masks) and texture/buffer descriptors, i.e. "simple" tasks that are fundamentally a waste to execute/replicate on a wide vector. You don't have to get rid of the scalar ALU pipeline to go wide or have a larger VRF. If anything, an LLVM patch indicated that some models will have a 50% larger VRF (+64KB, which is... 6 times larger than the 10KB SRF on RDNA 2).
 