AMD: RDNA 3 Speculation, Rumours and Discussion

Hmmm...

384-bit bus, 192MB cache, twin 96CU GCDs, 2.3GHz clocks, 24GB 24Gbps memory, $1599.
384-bit bus, 192MB cache, 2× 80CU GCDs, 2.2GHz clock speed, 12GB 20Gbps memory, $999.
384-bit bus, 96MB cache, 1 96CU GCD, 3.2GHz clock speed, 12GB 18Gbps, $749
256-bit bus, 128MB cache, 1 60(64?)CU GCD, 3.4GHz clock speed, 16GB 20Gbps, $549
256-bit bus, 64MB cache, 1 52(56?)CU GCD, 2.8GHz clock speed, 8GB 16Gbps, $400
128-bit bus, 32MB cache, 32CU, 3GHz, 8GB 20Gbps, $300
128-bit bus, 32MB cache, 28CU, 2.5GHz, 8GB 16Gbps, $249.

?
Are we really gonna get more than one GCD in a product? I expected that even the top card would have one GCD and a bunch of cache chiplets.
 
Maybe this fits in here

Patent:
DUAL VECTOR ARITHMETIC LOGIC UNIT
So, I've been reading this more closely recently...

I commented before about the swizzling and not really understanding it. Well, it turns out that swizzling isn't in the claims, so it can only be taken as a possible embodiment.

There's two parts to the swizzling, as far as I can tell:
  1. A single 2048-bit register (64x 32-bit, for 64 work items in a "wave 64" VGPR allocation) is spread across two banks - each bank is 1024-bit wide
  2. The bank address used for a register that spans two banks varies by bank
I can't work out why 2. is done. Perhaps there's a hidden interaction with the operand cache, something to do with associativity?

Swizzling only applies when "wave 64" mode is effective. Otherwise "wave 32" deals in 1024-bit registers.

Either way, four banks can only provide 4 operands, best-case, per clock if there's only one read port per bank. So that's only 2 operands per instruction-half in a dual-issue.

Maybe there's six banks?...

Oddly, in claims paragraphs 13 and 14 there's a reference to a third wavefront. I can't work out what's going on there, and I can't help thinking it's a mistake rather than just bad English.

I find the three "chains" of claims hard to disentangle (1-7, 8-14, 15-20). I don't know why seemingly the same thing needs to be listed 3 times.
 
There's two parts to the swizzling, as far as I can tell:
  1. A single 2048-bit register (64x 32-bit, for 64 work items in a "wave 64" VGPR allocation) is spread across two banks - each bank is 1024-bit wide
  2. The bank address used for a register that spans two banks varies by bank
On (2), it could be that mixing lower and higher Wave64 register halves within the same bank can lead to higher sustained VGPR bandwidth based on simulation/modelling results. The swizzling appears to be limited within each pair of banks (0/1 and 2/3), and the operand cache has to be aware of the swizzling scheme so that it can swizzle the halves back into the expected Wave64 order when necessary (with respect to VCC/EXEC masks, read thread id, etc).

Otherwise, regardless of (2) being a thing or not, (1) makes perfect sense on a microarchitecture that is designed to enable single-cycle Wave64 execution, versus RDNA 1 and 2 probably hashing both halves of a virtual Wave64 register to the same VGPR bank (since they both do 2-cycle execution).
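
To make the pair-wise swizzling above concrete, here's a toy Python mapping from a logical Wave64 VGPR number to the banks holding its lo/hi halves, assuming 4 banks arranged as the two pairs 0/1 and 2/3. The address function itself is my guess (picked so that it reproduces the bank assignments in the fma example further down the thread), not anything stated in the patent:

# Speculative bank mapping for wave64 VGPR halves across 4 banks (two pairs: 0/1 and 2/3).
# The formula is an assumption, chosen to reproduce the worked example later in the thread.

def wave64_half_banks(vreg: int, swizzle: bool = True) -> dict:
    """Return which bank holds the lo and hi half of logical wave64 register vreg."""
    if swizzle:
        # consecutive register pairs (v4/v5, v6/v7, ...) share a bank pair, and the
        # lo half alternates between the two banks of that pair
        pair = ((vreg // 2) % 2) * 2
        lo = pair + (vreg % 2)
    else:
        # no swizzle: the lo half always lands in the even bank of the pair
        pair = (vreg % 2) * 2
        lo = pair
    hi = pair + (1 - (lo - pair))  # the other bank of the same pair
    return {"lo": lo, "hi": hi}

for v in range(4, 8):
    print(f"v{v}: swizzled={wave64_half_banks(v)}  plain={wave64_half_banks(v, swizzle=False)}")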

Either way, four banks can only provide 4 operands, best-case, per clock if there's only one read port per bank. So that's only 2 operands per instruction-half in a dual-issue.

Maybe there's six banks?...

Having an operand cache can expand operand bandwidth in two ways:

1. asynchronous operand fetching — the fetcher has visibility into the window of instructions to be issued, so that it can maximise VGPR read port usage by opportunistically prefetching operands within the window that are not in a dependency chain. Or to put it in another perspective, an instruction can "steal" VGPR read ports from earlier ones. This was somewhat described in a different patent.

2. result forwarding — it can load from the destination cache/buffer, so short dependency chains can avoid using VGPR read ports.

So 4 1R1W banks are still a possibility, even if we "know" that GFX11 will have 50% more physical registers. It is a question of a modulo before hitting the banks (i.e., more banks) or a modulo within each bank (i.e., more entries). Edit: I don't see allocation granularity or even ISA-level addressability as a strong indicator either way; we can really know only when the hardware drops and people get their hands on it... probably.
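
A toy model of those two effects, just to make the bandwidth argument concrete. The 4-bank count, the bank hash and the greedy port assignment are all assumptions on my part, not anything from the patent:

# Toy model of the two operand-cache benefits described above (purely illustrative):
# (1) later instructions opportunistically use read ports left idle by earlier ones,
# (2) operands sitting in the result/forwarding buffer cost no VGPR read port at all.

from collections import deque

NUM_BANKS = 4  # assumed 1R1W banks, so at most 4 VGPR reads per cycle

def collect_cycles(window, forwarded):
    """Greedy per-cycle read-port assignment for a window of pending instructions.

    window:    list of instructions, each a list of source VGPR numbers
    forwarded: set of VGPR numbers currently held in the forwarding buffer
    Returns the number of cycles needed to collect every remaining operand.
    """
    pending = deque(
        [src for src in insn if src not in forwarded]  # forwarded operands are free
        for insn in window
    )
    cycles = 0
    while any(pending):
        cycles += 1
        used_banks = set()
        for srcs in pending:            # older instructions pick first; younger ones
            for src in list(srcs):      # "steal" whichever ports are still idle
                bank = src % NUM_BANKS  # assumed bank hash
                if bank not in used_banks:
                    used_banks.add(bank)
                    srcs.remove(src)
    return cycles

# Two FMA-like ops; v1 was just produced, so it is serviced from the forwarding buffer.
print(collect_cycles(window=[[0, 1, 2], [4, 5, 6]], forwarded={1}))  # -> 2 cycles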
 
On (2), it could be that mixing lower and higher Wave64 register halves within the same bank can lead to higher sustained VGPR bandwidth based on simulation/modelling results. The swizzling appears to be limited within each pair of banks (0/1 and 2/3), and the operand cache has to be aware of the swizzling scheme so that it can swizzle the halves back into the expected Wave64 order when necessary (with respect to VCC/EXEC masks, read thread id, etc).
I don't see how the lo/hi mixing helps directly with sustained situations, but I've thought of some scenarios where swizzling might be relevant:
  1. LDS moves to/from VGPRs will presumably have a maximum bandwidth of 32 work items per clock, 1024 bits, the full width of a bank.
  2. Sometimes a whole half of a wave64 will be predicated off.
  3. A super SIMD supports both wave 32 and wave 64 simultaneously.
I still don't see why banks that are individually addressable care about the address used in the other bank of the pair.
 
I don't see how the lo/hi mixing helps directly with sustained situations, but I've thought of some scenarios where swizzling might be relevant:
  1. LDS moves to/from VGPRs will presumably have a maximum bandwidth of 32 work items per clock, 1024 bits, the full width of a bank.
  2. Sometimes a whole half of a wave64 will be predicated off.
  3. A super SIMD supports both wave 32 and wave 64 simultaneously.
I still don't see why banks that are individually addressable care about the address used in the other bank of the pair.
Wave64 skip-half mode would be my guess, especially since it is an existing RDNA 1/2 hardware feature which shaders assume to exist & have taken advantage of (in concert with cross-lane ops and threadgroup mem, etc). Swizzling would reduce the possibility of bank conflict stalls in this mode, especially given the illustrated granularity.

Every 2 address-adjacent logical Wave64 registers (= 4 address-adjacent physical 32-lane registers) form a swizzle pair, and the swizzle pairs are interleaved across the two pairs of VRF banks. For example, an fma v4, v5, v6 or fma v4, v5, v7 on a wave64 with the higher half disabled is conflict free, allowing it to execute at full rate as if it were a wave32:

* v4's lower-half is in Bank 0 (same w/o swizzling)
* v5's lower-half is in Bank 1 (versus Bank 2 w/o swizzling)
* v6's lower-half is in Bank 2 (versus Bank 0 w/o swizzling)
* v7's lower-half is in Bank 3 (versus Bank 2 w/o swizzling)

Without the register bank swizzling, it will be capped at 2 VGPR reads per cycle, since lower halves would all be bound to even banks, and likewise higher halves to the odd banks.
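
Running that example through the same speculative mapping as before (lo-half banks only, since the higher halves are predicated off):

# Check the fma example above: with only the lower halves live, do the three source
# reads hit distinct banks? The bank formula is the same assumption as earlier.

def lo_bank(vreg: int, swizzle: bool) -> int:
    if swizzle:
        return ((vreg // 2) % 2) * 2 + (vreg % 2)
    return (vreg % 2) * 2  # w/o swizzling, lo halves always sit in an even bank

for srcs in ([4, 5, 6], [4, 5, 7]):
    for swizzle in (True, False):
        banks = [lo_bank(v, swizzle) for v in srcs]
        conflict_free = len(set(banks)) == len(banks)
        print(f"fma v{srcs[0]}, v{srcs[1]}, v{srcs[2]}  swizzle={swizzle}  "
              f"lo-half banks={banks}  conflict_free={conflict_free}")

With swizzling the lo-half banks come out as 0/1/2 and 0/1/3; without it they collapse onto banks 0 and 2, which is the 2-reads-per-cycle cap described above.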
 
Wave64 skip-half mode would be my guess, especially since it is an existing RDNA 1/2 hardware feature which shaders assume to exist & have taken advantage of (in concert with cross-lane ops and threadgroup mem, etc). Swizzling would reduce the possibility of bank conflict stalls in this mode, especially given the illustrated granularity.

Every 2 address-adjacent logical Wave64 registers (= 4 address-adjacent physical 32-lane registers) form a swizzle pair, and the swizzle pairs are interleaved across the two pairs of VRF banks. For example, an fma v4, v5, v6 or fma v4, v5, v7 on a wave64 with the higher half disabled is conflict free, allowing it to execute at full rate as if it were a wave32:

* v4's lower-half is in Bank 0 (same w/o swizzling)
* v5's lower-half is in Bank 1 (versus Bank 2 w/o swizzling)
* v6's lower-half is in Bank 2 (versus Bank 0 w/o swizzling)
* v7's lower-half is in Bank 3 (versus Bank 2 w/o swizzling)

Without the register bank swizzling, it will be capped at 2 VGPR reads per cycle, since lower halves would all be bound to even banks, and likewise higher halves to the odd banks.
Won't putting the lo and hi halves of a wave 64 VGPR into the same bank produce the same result in your examples?
 

Hmm, wave32 with VOPD has rules about mixing odd and even bank IDs (effectively mapped from VGPR ID):
  • one dst register must be even and the other odd
  • operands must use different VGPR banks
  • Src0 operands must use different VGPR banks
  • Src1 operands must use different VGPR banks
  • Src2 operands must use different VGPR banks
  • srcX0 and srcY0 must use different VGPR banks
  • srcX1 and srcY1 must use different VGPR banks
  • srcX2 and srcY2 must use different VGPR banks
For all we know, there's only two register banks now! Teehee.

This further reduces the chance that the compiler can produce VOPD code, I reckon.
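
For what it's worth, a tiny sketch of how a compiler might test those pairing rules. The bank(v) = v mod 4 mapping is my assumption; only the rule list itself comes from the LLVM code, and I've only encoded the dst-parity and srcXn/srcYn constraints:

# Hypothetical VOPD pairing check based on the rules quoted above.
# bank(v) = v % 4 is an assumption about how VGPR IDs map to banks.

def bank(vgpr: int) -> int:
    return vgpr % 4

def vopd_pair_ok(x: dict, y: dict) -> bool:
    """x and y each have a 'dst' VGPR and a 'srcs' list of up to 3 source VGPRs."""
    # one destination must be even and the other odd
    if (x["dst"] % 2) == (y["dst"] % 2):
        return False
    # srcN of the X half and srcN of the Y half must use different VGPR banks
    for sx, sy in zip(x["srcs"], y["srcs"]):
        if bank(sx) == bank(sy):
            return False
    return True

# Example pairing attempt: destinations differ in parity and no srcN banks collide.
op_x = {"dst": 0, "srcs": [4, 8, 0]}
op_y = {"dst": 1, "srcs": [5, 9, 1]}
print(vopd_pair_ok(op_x, op_y))  # -> True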
 
Dang it! A month to launch and all we have are some information "leaks" from AMDGPU/LLVM commits.
It was 10 September 2020 when we got a board pic:


and this time, almost a month closer to the launch, nothing...
 

Hmm, wave32 with VOPD has rules about mixing odd and even bank IDs (effectively mapped from VGPR ID):
  • one dst register must be even and the other odd
  • operands must use different VGPR banks
  • Src0 operands must use different VGPR banks
  • Src1 operands must use different VGPR banks
  • Src2 operands must use different VGPR banks
  • srcX0 and srcY0 must use different VGPR banks
  • srcX1 and srcY1 must use different VGPR banks
  • srcX2 and srcY2 must use different VGPR banks
For all we know, there's only two register banks now! Teehee.

This further reduces the chance that the compiler can produce VOPD code, I reckon.
Well, the numbers of banks available for the DST, SRC0, SRC1, and SRC2 operands (where no conflict is allowed between the X and Y components of a VOPD instruction) are 2, 4, 4, and 2, respectively. The operand collection probably can't read arbitrarily from all register file banks (subsets of banks are grouped and there is a limited swizzle). Interestingly, the sum of the numbers of banks for all operands matches the register allocation granularity (12 regs).
 
Well, the numbers of banks available for the DST, SRC0, SRC1, and SRC2 operands (where no conflict is allowed between the X and Y components of a VOPD instruction) are 2, 4, 4, and 2, respectively. The operand collection probably can't read arbitrarily from all register file banks (subsets of banks are grouped and there is a limited swizzle). Interestingly, the sum of the numbers of banks for all operands matches the register allocation granularity (12 regs).
I'm glad you engaged your brain looking at the code, because my eyes were glazing over.

I don't think swizzle is what's going on with "wave 32" though; the swizzle was described as being for "wave 64" lo/hi halves.

Perhaps SRC2 would only be used for fma and would be the same operand as the destination, hence the count of 2 banks for SRC2 matching the count for DST.

Overall, though, I'm struggling to care. The high degree of complexity here is worse than I feared and, in my view, makes VOPD practically useless. VOPD will probably see its use principally in dual-issued MOV instructions or a MOV coupled with math. MOVs have been a drag on instruction throughput for well over a decade, so anything to defray that cost is welcome.
 

Unfortunately the PDF is not available at this website right now, so it's not possible to look at the diagrams. I haven't been able to find a site that holds this patent application with the diagrams.

The base active interposer die (AID) 404 (similar to the first base IC die 204 of FIG. 2) of the graphics processing stacked die chiplet 402 includes an inter-die interconnect structure 408 along at least a first edge of the base active interposer die 404 (commonly referred to as a “beachfront”). Additionally, the graphics processing stacked die chiplet 402 includes a plurality of shader engine dies (SEDs) 412 (similar to the virtual compute dies 212 of FIG. 2, but in various embodiments includes any appropriate parallel processing unit) formed over the active interposer die 404. Although illustrated as including two SEDs 412, those skilled in the art will recognize that any number of processing units may be positioned in the processing unit layer stacked above the active interposer die 404. In this configuration, a portion of a conventional graphics complex die (GCD) is pushed up to a second floor based on 3D die stacking methodologies by positioning the plurality of shader engine dies 412 in a layer on top of the active interposer die 404.


The coupling of multiple graphics processing stacked die chiplets (e.g., first graphics processing stacked die chiplet 402a to the second graphics processing stacked die chiplet 402b) together in a single package results in a device that effectively operates as a single large graphics complex die (GCD) but is constructed out of smaller, modular die components. In various embodiments, the graphics processor MCM 502 is communicably coupled to one or more external system memory modules 506 via the memory controller PHYs 414 of the graphics processing stacked die chiplets. Additionally, in some embodiments, the graphics processor MCM 502 also includes input/output (I/O) logic in a multimedia and I/O die (MID) 508 separate from the graphics processing stacked die chiplets 402.


In various embodiments, each shader engine die 612 includes a share (often an equal share) of the resources and graphics processing capabilities of a GPU but does not contain the entire graphics pipeline. In particular, a shader engine die 612 includes at least a portion of the graphics processing pipeline microarchitecture. For example, in some embodiments, the shader engine die 612 includes the shader system (not shown), pixel pipes (not shown), geometry logic (not shown), and the like. However, at least a portion of the graphics processing pipeline, such as a command processor 606, is positioned in the underlying base active interposer die 604. Additionally, in various embodiments, the base active interposer die 604 includes one or more levels of cache memory 610 and one or more memory controller PHYs 614 for communicating with external system memory (not shown), such as dynamic random access memory (DRAM) module. The memory controller (not shown) and memory controller PHYs 614 are, in other embodiments, provided on a separate die from the base active interposer die 604.

So we have:
  • active interposer die (AID)
  • shader engine die (SED)
  • multimedia and I/O die (MID)
  • graphics complex die (GCD)
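
Just to keep the terminology straight, a throwaway data-model sketch of the hierarchy those paragraphs describe. The names mirror the patent's terms; the fields and example counts are placeholders, not figures from the patent:

# Placeholder model of the chiplet hierarchy in the patent text (AID / SED / MID / GCD).
# Field names and example counts are illustrative only.

from dataclasses import dataclass
from typing import List

@dataclass
class ShaderEngineDie:        # SED: shader system, pixel pipes, geometry logic, ...
    name: str

@dataclass
class ActiveInterposerDie:    # AID: command processor, cache levels, memory controller PHYs
    seds: List[ShaderEngineDie]
    cache_levels: int
    memory_phys: int

@dataclass
class GraphicsProcessorMCM:   # package of stacked chiplets that behaves as one large GCD
    chiplets: List[ActiveInterposerDie]
    has_mid: bool = True      # separate multimedia and I/O die

# Two stacked chiplets with two SEDs each, as in the patent's two-SED illustration.
mcm = GraphicsProcessorMCM(
    chiplets=[
        ActiveInterposerDie(seds=[ShaderEngineDie("SED0"), ShaderEngineDie("SED1")],
                            cache_levels=1, memory_phys=2)
        for _ in range(2)
    ]
)
print(mcm)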
 