AMD: RDNA 3 Speculation, Rumours and Discussion

Status
Not open for further replies.
Not directly related to RDNA 3, but a newly published patent describes operand fetching and caching on a GPU device in detail, and matches many clues in the RDNA whitepaper.

Given the asynchronous operand fetching especially, I would not be surprised if the ALU doubling is really happening inside a SIMD32 (1 -> 2 VALU instructions/clk) while the vector register file stays unwidened at four 1R1W banks. The primary architectural bet might be on running VALU instructions from two wavefronts at a time to maximize VRF port and operand network usage, with the expectation that most kernels do not need sustained availability of 3 VRF read ports (either because FMA is not the dominant opcode, and/or because result forwarding does the heavy lifting).
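A minimal sketch of the read-port arithmetic behind that bet (my own toy model, not anything from the patent), assuming four 1R1W banks feeding two co-issued VALU instructions:

```python
# Toy model of VRF read-port pressure under dual-issue from two
# wavefronts, assuming four 1R1W banks (4 reads/clk total) as
# speculated above. Inputs are read-operand counts per instruction.
READ_PORTS = 4  # four 1R1W banks, one read each per clock

def sustainable(reads_a, reads_b, forwarded=0):
    """Can two co-issued VALU ops be fed from the VRF every clock?
    `forwarded` = operands supplied by result forwarding instead."""
    return reads_a + reads_b - forwarded <= READ_PORTS

# Two 2-operand ops (e.g. v_add): 4 reads -> fits the four banks.
print(sustainable(2, 2))               # True
# Two FMAs (3 reads each): 6 reads -> over budget on its own...
print(sustainable(3, 3))               # False
# ...unless forwarding covers 2 of the 6 operands.
print(sustainable(3, 3, forwarded=2))  # True
```

Which is exactly the shape of the bet above: sustained triple-read pressure is assumed rare, and forwarding covers the FMA-heavy stretches.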

This might explain why VOPD ended up having a rather limited scope, i.e., probably more of a secondary bet that aims to help execution latency (ILP) in targeted scenarios rather than general throughput.

:unsure:
 
I reckon "inside" would be to help routing of operands, e.g. ALU lanes 3a and 3b are next to each other, both having access to bits 96-127 of the fetched register (where a fetched register is 32x 32-bit = 1024 bits).

The cost relates to instruction decode, since that's high-bandwidth data too (a new instruction every clock) that has to be communicated across the full width of the SIMD (or two SIMDs if it's not an "inside" configuration). Instructions for dual-issue (from a pair of hardware threads) will generally be two 32-bit instructions, but can be two 64-bit instructions, I suppose. The VOPD three-operand instructions (FMA with a constant as multiplier or addend) are a maximum of 96 bits, I believe, when two FMAs are dual-issued. Which does call into question whether two 64-bit instructions from a pair of hardware threads can actually be dual-issued. That is a worrying thought...
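Tallying those cases as worst-case raw instruction bits per clock (sizes as speculated above, not confirmed from any ISA document):

```python
# Worst-case raw instruction bits arriving per clock for the
# dual-issue cases discussed above (speculative sizes).
cases = {
    "two 32-bit ops (pair of threads)": 32 + 32,
    "two 64-bit ops (pair of threads)": 64 + 64,
    "VOPD dual-FMA (single 96-bit encoding)": 96,
}
for name, bits in cases.items():
    print(f"{name}: {bits} bits/clk")
# The paired 64-bit case is the widest at 128 bits/clk, which is
# why it is the worrying one for decode/distribution bandwidth.
```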

In truth, after instruction decode has occurred, far fewer bits need to be transmitted across the width of the SIMD, since a large portion of the decoded bits describes operand sources and the destination (as well as literals). So that at least helps with bandwidth (power).

So on that basis I probably shouldn't be worried about dual-issue of a pair of 64-bit instructions.

The other wrinkle here is whether the lanes of the transcendental unit are woven amongst the a/b lanes of the "inside SIMD-pair". e.g. every fourth a/b lane-pair also comes with a transcendental lane.

There's another wrinkle for double-precision, but I suspect at that point the very small instruction set and limited throughput make a woven configuration less compelling. There are patent documents on the subject of routing double-precision operands for double-precision ALUs, re-using single-precision paths as I understand it. I didn't read them closely though...
 
Looking forward to seeing the first chiplet GPU. Exciting times ahead.
If it's "only" 75TF, I will be pissed off at AMD. IMHO, if you are using chiplets and you're not going full blast to take the undisputed crown from Nvidia, it is a huge marketing mistake and a missed opportunity.
 
If it's "only" 75TF, I will be pissed off at AMD. IMHO, if you are using chiplets and you're not going full blast to take the undisputed crown from Nvidia, it is a huge marketing mistake and a missed opportunity.
it's not just about the raw performance but the price at which we obtain it!
 
Getting into a halo part dick swinging contest isn't necessary and not cost effective for AMD.
 
In the long term, AMD has been at sub-20% of the market. Spending resources on beating the top end (400-600W)? That doesn't sound reasonable.
 
it's not just about the raw performance but the price at which we obtain it!

Getting into a halo part dick swinging contest isn't necessary and not cost effective for AMD.

In the long term, AMD has been at sub-20% of the market. Spending resources on beating the top end (400-600W)? That doesn't sound reasonable.
Sorry guys but I disagree.
Having the undisputed crown brings massive brand awareness down the range from the halo SKU. A very well known benefit taught in all marketing schools.
 
There's another wrinkle for double-precision
I suspect that these narrower execution units are meant to stage their operands in full width (32 lanes) into their own buffers, and then block further issuing until they are done working on the operands iteratively.

This would keep instruction and operand issuing simpler, since all execution units would be presented as having a uniform 32-lane width and potentially a "busy, don't issue" signal, such that the multi-cycle execution is left as a variable black-box detail.

Costs some kilobits of SRAM for staging, but that is a drop in the dark-silicon ocean, and it can be clock-gated along with the execution unit it is linked to anyway.
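A toy sketch of that staging idea, with hypothetical lane counts (an 8-wide unit behind a uniform 32-lane interface; all numbers are my assumptions for illustration):

```python
# Toy model of the "stage full-width operands, then assert busy"
# idea above: a narrow execution unit accepts 32 lanes of operands
# at once into a staging buffer, then blocks further issue while it
# iterates over them in hardware-width passes.
class NarrowUnit:
    def __init__(self, hw_lanes=8, simd_width=32):
        self.hw_lanes = hw_lanes      # physical datapath width
        self.simd_width = simd_width  # width presented to the issue logic
        self.cycles_left = 0
        self.staged = None            # staging SRAM for 32 lanes of operands

    @property
    def busy(self):                   # the "busy, don't issue" signal
        return self.cycles_left > 0

    def issue(self, operands):        # full 32-lane operand set, staged at once
        assert not self.busy and len(operands) == self.simd_width
        self.staged = operands
        self.cycles_left = self.simd_width // self.hw_lanes

    def tick(self):                   # one iterative pass over hw_lanes
        if self.busy:
            self.cycles_left -= 1

u = NarrowUnit()
u.issue(list(range(32)))
ticks = 0
while u.busy:
    u.tick()
    ticks += 1
print(ticks)  # 4 passes for 32 lanes on an 8-wide unit
```

The point of the blackboxing is visible in the interface: the issue logic only ever sees `busy` and a 32-lane `issue()`, never the internal pass count.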
 
If it's "only" 75TF, I will be pissed off at AMD. IMHO, if you are using chiplets and you're not going full blast to take the undisputed crown from Nvidia, it is a huge marketing mistake and a missed opportunity.
Sorry guys but I disagree.
Having the undisputed crown brings massive brand awareness down the range from the halo SKU. A very well known benefit taught in all marketing schools.

The crown doesn’t necessarily depend on having the biggest raw flop number (see GCN). Spending transistors to use those flops more effectively is also important (see Infinity Cache).
 
Sorry guys but I disagree.
Having the undisputed crown brings massive brand awareness down the range from the halo SKU. A very well known benefit taught in all marketing schools.
Well, it worked out well for AMD in the CPU space. I don't think Ryzen processors necessarily took the halo/crown from Intel, but they effectively offered similar performance at significantly lower prices, which is really what vaulted them forward.
If they can repeat that here, obtaining near-4090 performance at significantly cheaper pricing would be a no-brainer, I think. That's like saying you can pay the 4080 price to get just below 4090 performance. That's a huge win; I would switch over in a heartbeat, because at that $799 price point I'd rather have more performance than less.
 
True, but Nvidia are very good at putting their flops to good use as well. It would be quite surprising to see AMD take the performance crown with a notably lower FP32 figure.

Not so sure about that. The 3090 Ti has 90% more flops but is only 20-40% faster than the 6900xt. With RT it goes up from there but there’s no way to tell how much of that is due to flops or just better RT hardware.
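For reference, the size of that flops gap depends heavily on which clock you assume; a back-of-envelope check using the vendors' official figures (peak FP32 = lanes x 2 ops for FMA x clock):

```python
# Back-of-envelope check of the peak-FP32 gap discussed above.
# Clocks are the vendors' official figures; sustained clocks vary,
# which is why quoted gaps differ between sources.
def tflops(lanes, ghz):
    return lanes * 2 * ghz / 1e3  # x2 ops/lane/clk for FMA

rtx3090ti      = tflops(10752, 1.86)   # ~40.0 TF at official boost
rx6900xt_game  = tflops(5120, 2.015)   # ~20.6 TF at AMD "game clock"
rx6900xt_boost = tflops(5120, 2.25)    # ~23.0 TF at boost clock

print(rtx3090ti / rx6900xt_game)   # ~1.94 -> the "90% more flops" figure
print(rtx3090ti / rx6900xt_boost)  # ~1.74 if you use boost clocks instead
```

Either way the flops gap dwarfs the 20-40% delivered-performance gap, which is the point being made above.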
 
Not so sure about that. The 3090 Ti has 90% more flops but is only 20-40% faster than the 6900xt. With RT it goes up from there but there’s no way to tell how much of that is due to flops or just better RT hardware.
Well, if you forget Ampere's massive compute advantage (dual FP32), massive rendering advantage (OptiX on RT cores), and massive AI/ML advantage (tensor cores), while it is made on a node nearly two generations behind RDNA2 and uses nearly the same number of transistors, yeah, I guess then you're right, it's similar... /s

edit: more seriously, does anybody here think RDNA2 is the better arch? Ampere does so much more than RDNA2 and uses its transistor budget so much more effectively, it's not even open for discussion
 
Not so sure about that. The 3090 Ti has 90% more flops but is only 20-40% faster than the 6900xt. With RT it goes up from there but there’s no way to tell how much of that is due to flops or just better RT hardware.
The 75 TFLOPS figure is based on doubling ALUs per WGP, which won't equate to doubled performance. We just saw this with Turing->Ampere where the doubled throughput equated to ~30% more performance. I suspect RDNA3 will be similar.
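A rough reconstruction of where a 75 TF figure could come from, assuming the rumoured (entirely unconfirmed) configuration of 48 WGPs with ALUs doubled to 256 lanes each:

```python
# Hypothetical arithmetic behind a "75 TF" RDNA3 figure, assuming
# rumoured specs (48 WGPs, ALUs doubled to 256 FP32 lanes per WGP).
wgps, lanes_per_wgp = 48, 256
lanes = wgps * lanes_per_wgp       # 12288 FP32 lanes
clock_ghz = 3.05                   # clock that would be needed for ~75 TF
tf = lanes * 2 * clock_ghz / 1e3   # x2 ops/lane/clk for FMA
print(round(tf, 1))  # ~75.0
```

So the 75 TF number already bakes in the ALU doubling, which is exactly why it should not be read as doubled game performance.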
 
The 75 TFLOPS figure is based on doubling ALUs per WGP, which won't equate to doubled performance. We just saw this with Turing->Ampere where the doubled throughput equated to ~30% more performance. I suspect RDNA3 will be similar.
While it sounds similar, there are enough practical differences to make such comparisons inherently flawed. The best we can say is "RDNA3 is unlikely to scale as well with TFLOPS increases as RDNA2 did in comparison to its predecessors". The scaling itself, though, is likely to be completely different from how it was between Turing and Ampere.
 
Well, if you forget Ampere's massive compute advantage (dual FP32), massive rendering advantage (OptiX on RT cores), and massive AI/ML advantage (tensor cores), while it is made on a node nearly two generations behind RDNA2 and uses nearly the same number of transistors, yeah, I guess then you're right, it's similar... /s

edit: more seriously, does anybody here think RDNA2 is the better arch? Ampere does so much more than RDNA2 and uses its transistor budget so much more effectively, it's not even open for discussion
Can you explain how 8nm is nearly 2 nodes behind 7nm? The massive FP32 advantage doesn't seem to do anything for games.
 