AMD: RDNA 3 Speculation, Rumours and Discussion

Status
Not open for further replies.
So, we may end up having AI-accelerated super sampling without dedicated AI inference hardware? Who would have thought it :rolleyes:
Like I said in another thread, that was bound to happen as long as the target doesn't move beyond 4K: the workload should stay relatively constant in complexity while non-AI hardware keeps improving until it's "good enough" to run it.
We will have to wait and see whether something like that turns out to be feasible. The first attempt at this will be XeSS running on its DP4a path.
 
The revision being discussed shows that registers are being used explicitly for read/write:

Code:
v_wmma_f32_16x16x16_f16 v[16:23], v[0:7], v[8:15], v[16:23]

So here we can see the instruction accumulating into registers v[16:23], which hold a sub-block of the result matrix, with v[0:7] and v[8:15] holding sub-blocks of the two matrices being multiplied.
On a block level, yes. That's not necessarily the same as the register bandwidth an equivalent sequence of FMA instructions would use.
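As an aside, the accumulate semantics can be sketched numerically. This is a toy NumPy model of what a single v_wmma_f32_16x16x16_f16 computes per 16x16 tile; it models only the arithmetic, not the register-file or lane layout:

```python
import numpy as np

def wmma_f32_16x16x16_f16(acc, a, b):
    """In-place accumulate: acc += a @ b (f16 inputs, f32 accumulation)."""
    assert a.shape == b.shape == acc.shape == (16, 16)
    acc += a.astype(np.float32) @ b.astype(np.float32)
    return acc

a = np.ones((16, 16), dtype=np.float16)    # sub-block held in v[0:7]
b = np.ones((16, 16), dtype=np.float16)    # sub-block held in v[8:15]
acc = np.zeros((16, 16), dtype=np.float32) # accumulator, v[16:23]
wmma_f32_16x16x16_f16(acc, a, b)
# each output element is a 16-term dot product of ones, i.e. 16.0
```

The key point the instruction encoding makes explicit is that the accumulator tile is both read and written in place, exactly like `acc` here.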

I'm not sure many people are aware here, but Apple's M1 GPU does have simdgroup based matrix multiply_accumulate/load/store instructions (using 8x8 tiles for both F16 and F32), and while there are no extra FLOPS to accelerate those instructions it's far easier to get close to theoretical peak performance when using them.
 
AMD GPU chiplet distributed rendering patent:

So both AFR and SFR could be used depending on the graphics workload. Or the first GPU performs rendering while the other performs compute.

Doesn't AFR have a latency problem? For example, roughly 33 ms of frame latency at 60 fps with two GPUs. And distribution and sync overhead could be problematic for SFR.
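To put a number on that, a quick back-of-the-envelope, assuming the classic 2-GPU AFR setup where the GPUs alternate frames:

```python
# Back-of-the-envelope AFR latency, assuming 2 GPUs alternating frames.
# Each GPU gets n_gpus frame-times to finish its frame, so the pipeline
# depth (and hence the added render latency) is n_gpus frame-times.
fps = 60
n_gpus = 2
frame_time_ms = 1000 / fps                       # ~16.7 ms between displayed frames
afr_render_latency_ms = n_gpus * frame_time_ms   # ~33.3 ms from submit to display
print(round(afr_render_latency_ms, 1))
```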
 
So both AFR and SFR could be used depending on the graphics workload. Or the first GPU performs rendering while the other performs compute.

Doesn't AFR have a latency problem? For example, roughly 33 ms of frame latency at 60 fps with two GPUs. And distribution and sync overhead could be problematic for SFR.
Glanced over it but I'm not sure where you're getting "AFR" from the patent? It pretty clearly describes a system which either does all work on one ("main") chiplet or splits it (via binning) between several such chiplets.

Also I'd expect such a system to have a pretty high level of overhead. But hey, that's my general expectation from a >1 rendering chiplet design anyway.
 
Huh, multiple chiplets listed. I mean, that's perfectly expected for a patent; there's no guarantee we'd be seeing more than 1-2 for RDNA3. Also it seems like there's an entire phase for hardware geometry where the whole thing is bottlenecked by the "first" chiplet, even if they're trying to keep that phase minimal. My guess is that, since pixel/compute shaders are less serially dependent than visibility work (you do it in pre-defined waves anyway, so whatever), renderers that skip the hardware geometry engines, as UE5 is trying to do, might be more efficient on this arch.

I guess the question for multiple chiplets rests on the chiplet placement error rate and how well you can recover from a badly placed chip. If you can just fuse off a badly placed chiplet, it might not make any sense to go with "big" chiplets. Not only do they cost more, but if you place one badly you lose more. Whereas if you get an error on a small chiplet, "oh no, anyway", the rest of the package is good and you just bin down. Of course it's the opposite if one bad chiplet placement ruins the whole package. And plenty of games still use hardware geometry and would still be bottlenecked by that "first chiplet" work distribution scheme.
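That trade-off can be put into rough numbers. Both figures here are purely illustrative (a made-up 2% per-placement failure rate and a 4-chiplet package), but they show why everything hinges on whether a bad placement can be binned around:

```python
# Toy scrap-rate comparison for chiplet placement, with a made-up 2%
# per-placement failure rate. If a bad small chiplet can be fused off
# and the part binned down, the package is never a total loss; if one
# bad placement kills the package, more chiplets means more chances to fail.
p = 0.02      # hypothetical probability that a single placement fails
n_small = 4   # hypothetical small-chiplet count per package

scrap_one_big = p                           # one placement, one chance to fail
scrap_small_fatal = 1 - (1 - p) ** n_small  # any of 4 failures kills the package
print(round(scrap_one_big, 4), round(scrap_small_fatal, 4))
```

So under the "one bad placement ruins the package" assumption the 4-chiplet package scraps almost 4x as often; under the "bin down" assumption it scraps essentially never, and the big monolithic-ish chiplet is the riskier bet.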
 
Glanced over it but I'm not sure where you're getting "AFR" from the patent?

I think I misinterpreted this part as AFR.
Once all the coarse bins assigned to GPU chiplet 106-1 have been processed in the coarse bin rendering phase, GPU chiplet 106-1 is made available to receive instructions for rendering a next frame (i.e., a second pass) and begins processing the geometry of the next frame while the other GPU chiplets 106 are still rendering the coarse bins assigned to them during the first pass.
Thanks for pointing it out.
 
Given:

A typical application executing on a GPU relies on function inlining and static reservation of registers to a workgroup. Currently, GPUs expose registers to the machine instruction set architecture (ISA) using flat array semantics and statically reserve a specified number of physical registers before the wavefronts of a workgroup begin executing. This static array-based approach underutilizes physical registers and effectively requires in-line compilation, making it difficult to support many modern programming features. In some cases, the register demands of the wavefronts leave the compute units underutilized. Alternatively, applications which limit register use must often spill to memory leading to performance degradation and extra contention for memory bandwidth.


The patent describes measuring register thrashing, i.e. spilling data out of VGPRs to memory and later returning it to VGPRs, and controlling how many hardware threads are assigned to a compute unit based on the measured thrashing.

The document also talks about varying register allocations over time for hardware threads and being able to "steal" VGPRs from some hardware threads. Increased register allocations can be provided when function calls are made and so spilling/returning data isn't hurting the hardware thread that is executing.

Apart from anything else this should help with ray tracing performance, since ray tracing makes heavy use of function calls. The functions collected together for ray tracing have historically been compiled together into an uber-shader, requiring a static register allocation. Such a static allocation would be excessive for much of the lifetime of the hardware thread, meaning that far fewer hardware threads can be assigned to the compute unit. Such low occupancy then hurts the compute unit's ability to hide latencies, e.g. the latencies incurred in traversing a BVH.
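The occupancy argument can be sketched with a toy model. The 1536-VGPR register file and 16-wave cap below are RDNA-like figures used purely for illustration, not claims about RDNA3:

```python
# Toy occupancy model: hardware threads (waves) resident on a SIMD
# given a per-wave VGPR allocation and a fixed register-file budget.
def waves_resident(vgprs_per_wave, total_vgprs=1536, max_waves=16):
    return min(total_vgprs // vgprs_per_wave, max_waves)

# Static uber-shader allocation sized for the worst-case hit shader:
print(waves_resident(256))  # few waves -> poor latency hiding during traversal
# Dynamic allocation sized for the BVH-traversal phase only:
print(waves_resident(64))   # register file no longer the occupancy limiter
```

With the static worst-case allocation the SIMD is stuck at a handful of waves for the entire shader lifetime; with a phase-sized allocation the traversal phase runs at the wave cap and only the hit/miss calls pay the big-allocation cost.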

So for ray tracing it is preferable to have a minimal VGPR allocation while rays are traversing a BVH and then expanding the VGPR allocation once hit or miss functions need to be called.

There's still going to be a problem with the incoherence of rays bundled into a hardware thread, where some rays intersect a triangle "early" in BVH traversal while others have a few more traversal steps to perform. Additionally, some rays in a hardware thread want to run a hit function while others want to run a miss function. So the period of increased VGPR allocation will be spread over more time than if all the rays coherently followed the exact same BVH nodes and all ran the same function once traversal completed.
 
Looks like FMAC co-issue is so far supported only if <= 4 input VGPRs are used in total. The extra input operands must come either from an immediate or an SGPR. Having said that, making the VOPD bundling algorithm aware of bypass results cannot be ruled out as a WIP, I suppose.
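As a toy restatement of that constraint: count the VGPR source operands of the two halves (FMAC reads its destination as the accumulator, so an all-VGPR FMAC consumes three VGPR source slots). The 4-operand limit is as described above; the operand bookkeeping itself is just illustrative:

```python
# Toy check of the co-issue constraint: a pair of FMACs can only be
# bundled into one VOPD if the two halves read <= 4 VGPRs in total.
# Extra sources must come from SGPRs or immediates.
def vgpr_inputs(ops):
    return sum(1 for op in ops if op.startswith('v'))

def can_bundle(fmac_x, fmac_y, limit=4):
    return vgpr_inputs(fmac_x) + vgpr_inputs(fmac_y) <= limit

# all-VGPR pair: 3 + 3 = 6 VGPR inputs -> not bundleable
print(can_bundle(['v0', 'v1', 'v2'], ['v3', 'v4', 'v5']))
# second half sources an SGPR and an immediate: 3 + 1 = 4 -> bundleable
print(can_bundle(['v0', 'v1', 'v2'], ['v3', 's0', '1.0']))
```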
 
Yes, I think "WIP" creates such a massive set of caveats that we can only assess a trend and need to be cautious.

Looking at the sections of test code related to VOPD, it's remarkable how few "dual" instructions there are, and I find myself wondering why many candidates for transformation into "dual" instructions are not being transformed (the "mov" examples seem very deficient). It really looks like a slim improvement in throughput: 10% as a typical upper bound, with special cases reaching 20%?

I suppose we can look at VOPD as an opportunistic nice-to-have on top of dual-issue from a pair of hardware threads, within a Super SIMD.
 
I will take some time to digest your post. In the meantime I can suggest another possibility, since reading your post prompted it.

A dual-configuration SIMD: SIMD32 for 32-bit resultants OR SIMD64 for 16-bit resultants, which uses VOPD to issue a pair of instructions to use the doubled lanes.

[EDIT: this is a stupid post because it's not saying anything new - I've been alluding to VOPD for 16-bit resultants for ages, so this just expresses that in another way, and is pointless.]

A GPU driver patch seems to point to 4 SIMDs per CU instead of 2 as in RDNA2, as far as I can understand (plus a ton of geek stuff in the linked doc).

 
Some new 'confirmation' of sorts from videocardz regarding Navi 31 being 1x GCD, 6x MCDs, from some of AMD's latest AMDGPU commits:

 
Something isn't adding up here.
If you increase the interface by 50% to increase bandwidth doesn't that decrease the importance of a huge Infinity Cache?
Unless the stacked v-cache is meant for future designs that have 2 GCDs.

I've put on my silly season hat and am playing around with numbers in Excel a bit. Something like this seems interesting:
2x Navi31 +$2000
2x Navi32 +$1500
Navi31 & binned $1200 and $999
Navi32 & binned $799 and $649
Navi33 & binned $549 and $479 and $399

I foresee some issues with the naming scheme for these products, but they obviously made the right move last year by pricing stuff as high as possible.

Obviously pricing is going to heavily depend on the competition, so who launches first is going to set the tone.
 
If you increase the interface by 50% to increase bandwidth doesn't that decrease the importance of a huge Infinity Cache?

You're still talking orders of magnitude of difference between cache and memory.
 
Something isn't adding up here.
If you increase the interface by 50% to increase bandwidth doesn't that decrease the importance of a huge Infinity Cache?
They are increasing the compute side by a lot more than 50%. The non-cached memory bandwidth per flop is still going down a lot, even if the interface is getting 50% wider and 33% faster.
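For concreteness: the 50%-wider/33%-faster interface figures come from the posts above, while the 2.5x compute scaling below is a purely hypothetical placeholder, not a leak or a spec:

```python
# Illustrative bytes-per-flop trend. Raw memory bandwidth scales with
# interface width times per-pin speed; compute scaling is assumed.
bw_scale = 1.5 * 1.33           # ~2.0x raw memory bandwidth
flops_scale = 2.5               # hypothetical compute growth
bw_per_flop_ratio = bw_scale / flops_scale
print(round(bw_per_flop_ratio, 2))  # < 1.0 means bytes per flop still shrinks
```

Any compute multiplier above ~2x leaves the ratio below 1.0, which is why a big cache stays relevant even with the wider interface.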
 
They are increasing the compute side by a lot more than 50%. The non-cached memory bandwidth per flop is still going down a lot, even if the interface is getting 50% wider and 33% faster.
True. I guess I wasn't really paying much attention to that side of the equation, I was assuming a 50% bump to the interface and 50% bump to the Infinity Cache would have been solid for 4k.
I guess I got hung up on the Infinity Cache hit rate chart. When I was dissecting it, it seemed like 192MB would get up near a 75-80% hit rate at 4K, which matched Navi21's hit rate at 1440p/1080p.
It seemed like moving to 256/384MB cache wouldn't really do that much.
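That diminishing-returns intuition follows directly from the miss-traffic arithmetic (the hit rates here are the rough figures read off the chart, not measurements):

```python
# Off-chip memory traffic as a function of cache hit rate: only misses
# reach GDDR, so traffic scales with (1 - hit_rate).
def mem_traffic_fraction(hit_rate):
    return 1 - hit_rate

for hit in (0.75, 0.80):
    print(round(mem_traffic_fraction(hit), 2))
# Going from a 75% to an 80% hit rate only trims off-chip traffic from
# 25% to 20% of requests: piling on more cache buys less and less.
```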
 

Attachments

  • infinity cache hit rate.jpg · 496.5 KB