Would the pipes be independent controllers that can reference a shared microcode store, or is there some kind of execution element separate from the pipes that runs the microcode and interacts with the pipes?

Correct... each pipe has its own fixed-function hardware. The microcode mostly does PM4/AQL packet decoding.
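As a rough illustration of what that decoding involves, here's a minimal sketch (mine, not AMD's) of pulling apart a PM4 type-3 header before acting on it. The field layout follows the PKT3 convention from the open-source drivers as I remember it (type in bits 31:30, payload dword count minus one in 29:16, opcode in 15:8); the opcode value below is just a placeholder, not a real packet definition.

# Minimal sketch of PM4 type-3 header decoding, the kind of bookkeeping
# the pipe microcode does before handing work to fixed-function hardware.
# Field layout per the PKT3 convention as I recall it:
#   [31:30] packet type, [29:16] payload dwords - 1, [15:8] opcode.
PKT3_TYPE = 3

def decode_pkt3_header(header):
    pkt_type = (header >> 30) & 0x3
    count    = (header >> 16) & 0x3FFF   # payload dwords - 1
    opcode   = (header >> 8)  & 0xFF
    assert pkt_type == PKT3_TYPE, "not a type-3 packet"
    return opcode, count + 1

# Hand-built example header: placeholder opcode 0x2D with a 3-dword payload.
hdr = (PKT3_TYPE << 30) | ((3 - 1) << 16) | (0x2D << 8)
print(decode_pkt3_header(hdr))   # -> (45, 3)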
Calling back to the command processor lineage that was previously discussed, every era of GPU architecture covered in the RDNA whitepaper has had a command processor or its relatives working in the background.
Some items that I came across that may not have been discussed in this thread or provide more detail on earlier discussions:
Fixed-function resources appear to scale with shader arrays rather than shader engines. This shifts the question from the per-SE limit of CUs and RBEs to what the limit is per shader array. I've noted that GCN has always had some concept of shader arrays, with the option of one or two per shader engine. What the shader engine means architecturally isn't clear if it's no longer tied to the hardware resources it once was. The old GCN limits are not exceeded on a per-array basis at this time, and the number of shader arrays per chip has not gone past prior GCN limits either. Southern Islands mentions shader arrays in its hardware IDs, and Arcturus-related changes note that Vega GPUs have one or two shader arrays per SE.
In that vein, the L1 is per shader array and helps limit congestion on the L2 by being a single client with 4 links to the L2.
Question: Prior to this, I don't see how it would have been practical for GCN to have 64+:16 crossbars between the L2 and its various clients. Would the shader array or shader engine count in prior GCN chips have played a role in determining how many links the L2 had to deal with? How much did Navi change here?
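To put rough numbers on why I ask, a back-of-the-envelope crosspoint count, assuming per-CU L2 clients on a 64-CU GCN part versus four per-array graphics L1s with 4 L2 links each on Navi 10:

# Rough crossbar-size comparison, purely illustrative.
# Assumptions: a 64-CU GCN part with per-CU caches as L2 clients and
# 16 L2 slices, versus Navi 10 with 4 shader arrays, each with one
# graphics L1 exposing 4 links toward 16 L2 slices.
gcn_clients, gcn_slices = 64, 16          # per-CU L1s -> L2 slices
navi_clients = 4 * 4                      # 4 arrays x 4 L1-to-L2 links
navi_slices  = 16

print("GCN-style crosspoints :", gcn_clients, "x", gcn_slices,
      "=", gcn_clients * gcn_slices)      # 1024 client/slice pairings
print("Navi-style crosspoints:", navi_clients, "x", navi_slices,
      "=", navi_clients * navi_slices)    # 256 pairings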
A few items in the CU look like merged versions of the prior SIMD hardware. Each SIMD32 supports twice as many wavefront buffers as a GCN SIMD16, and the WGP supports as many workgroups as two GCN CUs.
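A quick tally under the assumption that GCN's 10 wave slots per SIMD16 become 20 per SIMD32:

# Wave-slot bookkeeping, assuming GCN's 10 wavefront slots per SIMD16
# and twice that (20) per RDNA SIMD32.
gcn_waves_per_cu   = 4 * 10                    # 4 SIMD16s x 10 slots = 40 waves/CU
rdna_waves_per_cu  = 2 * 20                    # 2 SIMD32s x 20 slots = 40 waves/CU
rdna_waves_per_wgp = 2 * rdna_waves_per_cu     # 80 waves per WGP

print(gcn_waves_per_cu, rdna_waves_per_cu, rdna_waves_per_wgp)   # 40 40 80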
Some earlier possibilities about the hardware that were discussed: export and messaging buses are shared between the two CUs, although I'm curious how different that is from earlier GCN--since there was arbitration for a single bus anyway.
The instruction cache seems to have the same line size and capacity, although in a sign of the times, comparing the GCN whitepaper to the RDNA one shows this same cache now supplies ~2-4 instructions per fetch versus ~8 back then.
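Rough arithmetic behind those numbers, assuming a 32-byte fetch per clock (my assumption) and typical encoding sizes of 4 bytes for GCN versus 8-12 bytes for RDNA:

# Back-of-the-envelope on instruction fetch; the 32-byte/clock fetch
# width and encoding sizes are assumptions, not quoted figures.
fetch_bytes = 32
gcn_insn    = 4            # GCN: 4-byte encodings dominate
rdna_insn   = (8, 12)      # RDNA: 8-byte encodings common, ~12 with a literal

print("GCN insns/fetch :", fetch_bytes // gcn_insn)                 # ~8
print("RDNA insns/fetch:", [fetch_bytes // n for n in rdna_insn])   # ~4 down to ~2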
GCN could issue up to 5 instructions per clock to the SIMD whose turn it was for issue, with the requirement that they be of different types and from different wavefronts.
RDNA doubles the number of SIMDs per CU that are actively issuing instructions, but each issues up to 4 instructions per clock of different types. It's not clearly stated that they must come from different wavefronts, but I didn't see a same-wavefront example.
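Counting issue opportunities over a 4-clock window under my reading of the two whitepapers, purely as a peak-rate comparison:

# Issue-slot comparison over a 4-clock window; this just counts
# opportunities under the stated rules, not real scheduling behavior.
clocks = 4

# GCN CU: 4 SIMD16s, each gets one issue turn per 4-clock rotation,
# with up to 5 instructions (different types, different waves) on its turn.
gcn_turns_per_simd  = clocks // 4            # 1 turn per SIMD per window
gcn_peak_per_simd   = gcn_turns_per_simd * 5

# RDNA CU: 2 SIMD32s, each issuing every clock, up to 4 instructions
# of different types per clock.
rdna_turns_per_simd = clocks                 # every clock
rdna_peak_per_simd  = rdna_turns_per_simd * 4

print("GCN  peak issues/SIMD per 4 clocks:", gcn_peak_per_simd)    # 5
print("RDNA peak issues/SIMD per 4 clocks:", rdna_peak_per_simd)   # 16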
The L1 is read-only, so I'm not sure at this point how many write paths there are to the L2, though this wasn't explicitly stated in prior GCN ISA guides either. It was clarified that the L1 can supply up to 4 accesses per clock.
Oldest-workgroup scheduling and clauses do seem to point to a desire to moderate GCN's potentially overly thrash-happy thread switching in certain situations.
128 byte cache lines do have some resemblance to some competing GPUs, although there may be differing levels of commitment to that granularity at different levels of cache.
Wave32 hardware and dropping the 4-cycle cadence have brought the SIMD and SFU model closer to some Nvidia implementations as well.
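The cadence difference falls out of simple lanes-over-width arithmetic:

# Cycles to push one wavefront through one vector instruction, assuming
# straightforward lanes-over-SIMD-width execution (ignoring pipelining).
def issue_cycles(wave_size, simd_width):
    return -(-wave_size // simd_width)   # ceiling division

print("GCN  wave64 on SIMD16:", issue_cycles(64, 16), "clocks")  # 4, the old cadence
print("RDNA wave32 on SIMD32:", issue_cycles(32, 32), "clock")   # 1
print("RDNA wave64 on SIMD32:", issue_cycles(64, 32), "clocks")  # 2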
RDNA has groups of 4 L2 slices linked to a single 64-bit memory controller--which for GDDR6 is 4 16-bit channels (another of the "did RDNA change something" items).
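Worked out for a 256-bit GDDR6 part like Navi 10 (the channel split is my reading of how GDDR6's 16-bit channels map onto a 64-bit controller):

# L2 slice / memory controller bookkeeping for a 256-bit GDDR6 part
# such as Navi 10; the channel mapping is my reading, not a quoted spec.
bus_width        = 256                       # bits
controller_width = 64                        # bits per memory controller
channel_width    = 16                        # bits per GDDR6 channel
slices_per_mc    = 4

controllers     = bus_width // controller_width        # 4 controllers
channels_per_mc = controller_width // channel_width    # 4 x 16-bit channels each
l2_slices       = controllers * slices_per_mc          # 16 L2 slices

print(controllers, channels_per_mc, l2_slices)   # 4 4 16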
The driver and LLVM code changes reference the DSBR and primitive shaders. There's GPU profiling showing primitive shaders running.

RDNA still rasterizes only 4 triangles per clock, though. And there is no mention of primitive shaders or the DSBR anywhere.
The RDNA ISA doc points out primitive shaders specifically--and not in an accidental way like a few references in the Vega ISA doc that AMD failed to erase.
The triangle references seem to be what AMD has done for the fixed-function pipeline irrespective of whether primitive shaders are running.
What we don't have a clear picture on is how consistently these are working, or how successful they are versus having them off. The DSBR generally had modest benefits, and if it's generally unchanged it might not be newsworthy. If primitive shaders are not considered fully baked, or are also of limited impact, it's possibly not newsworthy or may dredge up the memory of the failed execution with Vega.
The whitepaper goes into some detail about what the primitive units and rasterizers are responsible for, though that central geometry processor's specific duties aren't addressed.
The Vega whitepaper gave a theoretical max that probably represented the best case with a primitive shader, which Vega wound up seeing none of. The RDNA whitepaper may be going the other way and gives what the triangle setup hardware can do as a baseline. I'm a little unclear on whether it can cull two triangles and submit a triangle to the rasterizer in the same clock, or if it's a more complex mixture of culling and/or submitting.

In the whitepaper it is written that one prim unit can cull 2 primitives per clock. That means 8 primitives per clock for Navi.
I remember that AMD claimed 17 primitives per clock for Vega. What the hell went wrong?
One of the justifications for primitive shaders with Vega was avoiding losing a cycle per triangle that reached the fixed-function pipeline only to be culled. Enhancing the primitive pipeline somewhat weakens that argument, though it might not go as far as Nvidia's position, where task and mesh shaders tend to assume the triangle setup hardware is capable enough to do a fair amount of culling on its own.
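For comparison's sake, the peak rates under the readings above, taking 4 prim units at 2 culled primitives per clock, 4 rasterizers at 1 triangle per clock, and the Vega whitepaper's claim at face value:

# Peak primitive-rate comparison under the readings above; these are
# advertised best cases, not measured behavior.
navi_prim_units, cull_per_unit  = 4, 2
navi_rasterizers, tris_per_rast = 4, 1

navi_cull_rate   = navi_prim_units * cull_per_unit       # 8 culled prims/clock
navi_raster_rate = navi_rasterizers * tris_per_rast      # 4 visible tris/clock

vega_ngg_claim = 17        # Vega whitepaper's primitive-shader best case
vega_ff_rate   = 4         # Vega's fixed-function setup rate

print("Navi: cull", navi_cull_rate, "/ rasterize", navi_raster_rate, "per clock")
print("Vega: NGG claim", vega_ngg_claim, "vs fixed-function", vega_ff_rate, "per clock")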