AMD: RDNA 3 Speculation, Rumours and Discussion

I'd like to add a third: TRANS ALU seems to be very high in its throughput, seemingly 3 cycles, which implies it's 32 lanes I reckon. Add a fourth cycle and it can become the second VALU.


:)
It could very well stick to quarter-rate execution, while the issuing logic auto-applies the 4x multiplier in hardware to the TRANS delay.

Still don't understand how e.g. 1/16 FP64 (if it remains so) will work though. Eh, perhaps the hardware isn't completely hands-off, and s_delay_alu will peek at the issuing history (which probably has to be maintained until the result is written to the destination cache anyway) to check instruction types and amplify the delay accordingly? This might explain why delay can be counted per instruction type.
 
Or, maybe the hardware tracks "completion of issue", such that long-latency instructions (let's say TRANS is very slow, and FP64 varies between 1/32 and 1/2 throughput depending upon chip, for the sake of argument) are tracked from the point at which the final work item is issued to the pipeline.

This would presume that no instruction type takes longer than either 3 or 4 cycles to produce a result for at least one work item.
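
To put rough numbers on that idea, here is purely my own toy model, not anything from the LLVM change or AMD documentation: count the stall from the cycle the last work item issues, plus a short, type-independent completion latency. The rate numbers (full-rate VALU, quarter-rate TRANS, the 1/16 FP64 example from above) are assumptions.

Code:
// Toy model only -- invented numbers, not documented RDNA 3 behaviour.
#include <cstdio>

enum class InstClass { VALU, Trans, F64 };

// Cycles needed to issue all 32 work items of a wave32 instruction.
int issueCycles(InstClass c) {
  switch (c) {
    case InstClass::VALU:  return 1;   // full rate
    case InstClass::Trans: return 4;   // quarter rate (assumed)
    case InstClass::F64:   return 16;  // 1/16 rate (assumed)
  }
  return 1;
}

// Assumed short, type-independent latency from the last work item issuing
// to its result being available (the "3 or 4 cycles" above).
constexpr int kCompletionLatency = 4;

// A dependent instruction would then wait this long after the producer issues.
int stallCycles(InstClass producer) {
  return issueCycles(producer) + kCompletionLatency;
}

int main() {
  std::printf("VALU  -> dependent: %d cycles\n", stallCycles(InstClass::VALU));
  std::printf("TRANS -> dependent: %d cycles\n", stallCycles(InstClass::Trans));
  std::printf("FP64  -> dependent: %d cycles\n", stallCycles(InstClass::F64));
}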

For what it's worth, the delay code seems "draft" in nature, seemingly missing cases (no mention of co-issue/VOPD!), so it looks like it will be a while before it provides us with as many clues as it theoretically could.
 
This VOPD-related revision looks juicy:


We form VOPD instructions in the GCNCreateVOPD pass by combining back-to-back component instructions. There are strict register constraints for creating a legal VOPD, namely that the matching operands (e.g. src0x and src0y, src1x and src1y) must be in different register banks. We add a PostRA scheduler mutation to put possible VOPD components back-to-back.

PostRA sounds like code that's specific to handling ray accelerator query results, i.e. hits and misses. I can't actually find any code that's specifically related to that. Work-in-progress code...

I get the sense that this is a well-known technique:

Code:
/// Adapts design from MacroFusion
/// Puts valid candidate instructions back-to-back so they can easily
/// be turned into VOPD instructions
/// Greedily pairs instruction candidates. O(n^2) algorithm.
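
A toy sketch of the greedy pairing those comments describe, just to make it concrete; this is not the actual GCNCreateVOPD pass, and Inst plus the legality callback are made-up stand-ins for MachineInstr and the GCNVOPDUtils checks:

Code:
// Toy model only -- not LLVM code.
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

struct Inst { unsigned Opcode = 0; };  // stand-in for MachineInstr

using LegalityCheck = std::function<bool(const Inst &, const Inst &)>;

// Greedily pair each unpaired candidate with the first later candidate it
// can legally combine with; a scheduler mutation can then place each pair
// back-to-back so the VOPD-forming pass can fuse them. O(n^2) as noted.
std::vector<std::pair<std::size_t, std::size_t>>
pairCandidates(const std::vector<Inst> &Cands, const LegalityCheck &CanFormVOPD) {
  std::vector<std::pair<std::size_t, std::size_t>> Pairs;
  std::vector<bool> Used(Cands.size(), false);
  for (std::size_t I = 0; I < Cands.size(); ++I) {
    if (Used[I])
      continue;
    for (std::size_t J = I + 1; J < Cands.size(); ++J) {
      if (Used[J] || !CanFormVOPD(Cands[I], Cands[J]))
        continue;
      Pairs.emplace_back(I, J);  // first legal partner wins, hence "greedy"
      Used[I] = Used[J] = true;
      break;
    }
  }
  return Pairs;
}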

The other thing that's notable about this code is that there is no concept of the destination operand cache (Do$) that we've talked about as intrinsic to Super SIMD. So the checks upon sources being from different register banks (GCNVOPDUtils.cpp) enforce those constraints and nothing more. Again, work-in-progress code... Or maybe Do$ and VOPD are just banned from interacting with each other in the hardware. Seems strange, because that's a scenario requiring high register read bandwidth.
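
My reading of the bank constraint is roughly the check below, where I'm assuming the source bank is just the low two bits of the VGPR number (i.e. four banks); treat that as an assumption based on the revision, not documented fact:

Code:
// Assumption: four VGPR source banks, selected by the low two bits of the
// register number. src0x/src0y (and src1x/src1y) must land in different
// banks so the X and Y halves can read their operands in the same cycle.
constexpr unsigned vgprBank(unsigned Reg) { return Reg & 3u; }

constexpr bool srcBanksCompatible(unsigned SrcX, unsigned SrcY) {
  return vgprBank(SrcX) != vgprBank(SrcY);
}

static_assert(srcBanksCompatible(0, 1), "v0 and v1 are in different banks");
static_assert(!srcBanksCompatible(4, 8), "v4 and v8 both map to bank 0");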

Also, it seems puzzling to me that this coding effort is happening now and not years ago when hardware design options were being evaluated. Naively, I'd expect AMD to have a hardware simulator which runs code according to various design options. That code can either be hand-written, or compiled. These efforts imply that if AMD does have a simulator, the options were evaluated with hand-written code... I'd say that's a pretty serious design-cycle gap, putting a very low ceiling on the complexity of design-option evaluation. OK, this is tricky stuff (compiler evolution seems glacial), but we are talking about the fundamental internal practices of a megacorp...
 
PostRA sounds like code that's specific to handling ray accelerator query results, i.e. hits and misses.
PostRA sounds more like post register allocation in this context.
 
Given:

Many processors include general purpose registers (GPRs) for storing temporary program data during execution of the program. The GPRs are arranged in a memory device, such as a register file, that is generally located within the processor for quick access. Because the GPRs are easily accessed by the processor, it is desirable to use a larger register file. Additionally, some programs request a certain number of GPRs and, in some cases, a system having fewer than the requested number of GPRs affects the system's ability to execute the program in a timely manner or, in some cases, without erroneous operation. Further, in some cases, memory devices that include more GPRs are more area efficient on a per-bit basis, as compared to memory devices that include fewer GPRs. However, power consumption of memory devices as part of read and write operations scales with the number of GPRs. As a result, accessing GPRs in a larger memory device consumes more power as compared to accessing GPRs in a smaller memory device.


The result is a register hierarchy which, unlike spilling registers to memory (memory-mapped locations that happen to be cached), does not use a cache to back the registers:

However, unlike a cache hierarchy, for example, in some embodiments, redundant data is not stored at slower memory devices and memory devices are not accessed in the hope that a GPR stores the requested data. [...] Further, in some embodiments, GPRs are directly addressed, as compared to caches, which are generally searched to find desired data because of how data moves between levels of a cache hierarchy. In embodiments where the GPRs have different designs, other advantages, such as faster read times or differing heat properties, are leveraged.

So a few registers that are accessed many times can be put in a "small, low power consumption" memory device, and the others can be assigned to whichever other hierarchy level suits the access patterns and/or the quantity of registers used by the shader.

There's also discussion of a remapping event, moving data from one level of the hierarchy to another.

I dare say I would expect the compiler to emit data associated with a shader to specify how the hierarchy is used. The question is then whether the hardware can take any decisions independently, and what measurements it would use to make those decisions.
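
Something like the toy assignment below is what I have in mind. None of it comes from the patent or AMD tooling; the level names, capacities and the idea of ranking by static access count are invented purely for illustration:

Code:
// Invented example: rank virtual registers by how often the compiler sees
// them accessed, then fill the small low-power device first, the medium
// file next, and leave the long tail in the large file.
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

enum class RegLevel { SmallLowPower, MediumFile, LargeFile };

struct VRegInfo {
  unsigned Vreg;            // virtual register id
  unsigned StaticAccesses;  // accesses counted at compile time
};

std::vector<std::pair<unsigned, RegLevel>>
assignLevels(std::vector<VRegInfo> Regs, std::size_t SmallCap, std::size_t MediumCap) {
  std::sort(Regs.begin(), Regs.end(),
            [](const VRegInfo &A, const VRegInfo &B) {
              return A.StaticAccesses > B.StaticAccesses;
            });
  std::vector<std::pair<unsigned, RegLevel>> Out;
  for (std::size_t I = 0; I < Regs.size(); ++I) {
    RegLevel Level = I < SmallCap               ? RegLevel::SmallLowPower
                     : I < SmallCap + MediumCap ? RegLevel::MediumFile
                                                : RegLevel::LargeFile;
    Out.emplace_back(Regs[I].Vreg, Level);
  }
  return Out;
}

A purely static scheme like this is the cheapest in hardware, which is why the remapping events mentioned above are the interesting part.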

If this is for RDNA 3 and RDNA 3 is supposed to be about saving power by removing hardware that measures, decides and controls fine-grained aspects of shader execution, then it would seem that all of this would be compiler driven. Yet more crazy complexity for the compiler team to conquer.

Again, the runtime problem: the compiler doesn't know what shaders are sharing a compute unit, so there's a lot of fuzziness in performance due to co-habiting shaders. One way to avoid that problem is to say that a compute unit is locked to process only one shader at a time, with any number of hardware threads being assigned, over time, for that shader. It might be said this is yet another way to simplify the CU (or WGP) hardware, reducing the count of shaders that occupy the instruction cache and the size of the set of hardware that tracks the state of a shader. The cost would then be a more coarse-grained execution behaviour as one shader's lifetime expires and another starts up - contrary to the way that async compute has been promoted (where shaders overlap, perhaps for extended periods and across many CUs).

The more CUs a GPU has, the less problematic this coarse-grained execution might seem. I'm not convinced that's good enough, but we shall see.
 

AMD seems to be incorporating ML acceleration into their architecture!

This is great news. We should see wider adoption of ML models in gaming now that AMD also has ML acceleration.
https://wccftech.com/amd-fsr-3-0-gfx11-rdna-3-gpus-hardware-acceleration-wmma-instructions/

AMD RDNA 3 ‘GFX11’ GPUs May Feature Hardware-Accelerated FSR 3.0 Tech Thanks To WMMA ‘Wave Matrix-Multiply-Accumulate’ Instructions

 
Have been saying for a long time that AMD would follow suit with HW RT and HW ML with RDNA3+. This was met with bitter responses.

Good to see AMD coming along.
 
This looks like non-dedicated hardware, for what it's worth, running on the same ALU as other instructions.

The formats are packed fp16, bf16 and int8/4 with 32 or 16 bit results.

It is exclusive to RDNA3+ though; the acceleration is allegedly being added to speed things up and to come closer to the competition.
 
It is exclusive to RDNA3+ though; the acceleration is allegedly being added to speed things up and to come closer to the competition.
Unless there is MM h/w ("tensor cores"), these are unlikely to provide much of a speed-up compared to the already available DP instructions (which XeSS will supposedly use).
They would make it easier to port/run AI s/w written for Nvidia h/w though, I imagine.
 
Unless there is MM h/w ("tensor cores"), these are unlikely to provide much of a speed-up compared to the already available DP instructions (which XeSS will supposedly use).
They would make it easier to port/run AI s/w written for Nvidia h/w though, I imagine.
Isn't the AI s/w at the driver level, receiving input based on Selene's training results? I don't know how far game devs get access to that part of the black box.
 
Isn't the AI s/w at the driver level, receiving input based on Selene's training results? I don't know how far game devs get access to that part of the black box.
Not sure what you mean but I imagine it will be a lot easier to port/translate CUDA code for tensor cores to whatever AMD will be using as s/w layer with such instructions.
 
Not sure what you mean but I imagine it will be a lot easier to port/translate CUDA code for tensor cores to whatever AMD will be using as s/w layer with such instructions.
Was thinking about the question of FSR and AI in a gaming scenario. Like Intel's XeSS solution, it will likely have its own NN source.
 
This looks like non-dedicated hardware, for what it's worth, running on the same ALU as other instructions.

The formats are packed fp16, bf16 and int8/4 with 32 or 16 bit results.
From ROCm 5.2:
rocWMMA provides a C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using them in block-wise operations that are distributed in parallel across GPU wavefronts. The API is a header library of GPU device code, meaning matrix core acceleration may be compiled directly into your kernel device code. This can benefit from compiler optimization in the generation of kernel assembly and does not incur additional overhead costs of linking to external runtime libraries or having to launch separate kernels.
Typical AMD fashion, running everything in their ALUs. As tensor/matrix cores become very important in the pro market, I think it's a mistake.
 
From ROCm 5.2:

Typical AMD fashion, running everything in their ALUs. As tensor/matrix cores become very important in the pro market, I think it's a mistake.
Why,

What makes a tensor core better?

And I mean actually be specific, because we all know x87 is totally better than SSE


For example, if they can forward from execution to execution without writing back to registers, why is a dedicated unit better?

We know AMD is quite focused on efficiency, so I expect something better than a dumb read and write to registers after every FMA/MUL/FMAC.
 
From ROCm 5.2:

Typical AMD fashion, running everything in their ALUs. As tensor/matrix cores become very important in the pro market, I think it's a mistake.
They have matrix cores where they see they're needed: HPC markets
 
So, we may end up having AI-accelerated super sampling without specific AI inference hardware? Who would have thought it :rolleyes:
Like I said in another thread, that was bound to happen as long as we don't increase the target beyond 4K, as the complexity should stay relatively constant while non-AI inference hardware improves and gets "good enough" to perform it.
 
We know AMD is quite focused on efficiency, so I expect something better than a dumb read and write to registers after every FMA/MUL/FMAC.
The revision being discussed shows that registers are being used explicitly for read/write:

Code:
v_wmma_f32_16x16x16_f16 v[16:23], v[0:7], v[8:15], v[16:23]

So here we can see the instruction accumulating into registers v[16:23], which form a sub-block of the result matrix, with v[0:7] and v[8:15] being sub-blocks of the two matrices being multiplied.
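
For reference, the rocWMMA fragment API quoted earlier maps onto roughly this shape of kernel. I'm going from memory of the public rocWMMA samples, so treat the exact type names, enums and template parameters as assumptions rather than a verified listing:

Code:
// Sketch of a single 16x16x16 fp16 -> fp32 tile multiply in the rocWMMA
// style quoted above; one wavefront owns one 16x16 output tile.
#include <rocwmma/rocwmma.hpp>

using rocwmma::float16_t;
using rocwmma::float32_t;

constexpr int M = 16, N = 16, K = 16;

__global__ void wmma_tile(const float16_t *a, const float16_t *b, float32_t *c,
                          int lda, int ldb, int ldc) {
  rocwmma::fragment<rocwmma::matrix_a, M, N, K, float16_t, rocwmma::row_major> fragA;
  rocwmma::fragment<rocwmma::matrix_b, M, N, K, float16_t, rocwmma::col_major> fragB;
  rocwmma::fragment<rocwmma::accumulator, M, N, K, float32_t> fragAcc;

  rocwmma::fill_fragment(fragAcc, 0.0f);
  rocwmma::load_matrix_sync(fragA, a, lda);
  rocwmma::load_matrix_sync(fragB, b, ldb);

  // On GFX11 this is where the compiler should emit v_wmma_f32_16x16x16_f16,
  // with the accumulator living in a block of VGPRs such as v[16:23].
  rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc);

  rocwmma::store_matrix_sync(c, fragAcc, ldc, rocwmma::mem_row_major);
}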

From: https://www.tomshardware.com/features/gpu-chiplet-era-interview-amd-sam-naffziger

We asked whether AMD would include some form of tensor core or matrix core in the architecture, similar to what both Nvidia and Intel are doing with their GPUs. He responded that the split between RDNA and CDNA means stuffing a bunch of specialized matrix cores into consumer graphics products really isn't necessary for the target market, plus the FP16 support that already exists in previous RDNA architectures should prove sufficient for inference-type workloads. We'll see if that proves correct going forward, but AMD seems content to leave the machine learning to its CDNA chips.
There's a strong implication that RDNA will continue to merely accelerate inference, and that the performance won't be particularly high.
 