AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Correct... each pipe has its own fixed-function hardware. The microcode mostly does PM4/AQL packet decoding.
Would the pipes be independent controllers that can reference a shared microcode store, or is there some kind of execution element separate from the pipes that runs the microcode and interacts with the pipes?


Calling back to the command processor lineage that was previously discussed, all the eras of GPU architecture in the RDNA whitepaper had a command processor or its relatives working in the background.

Some items that I came across that may not have been discussed in this thread or provide more detail on earlier discussions:

Fixed-function resources appear to scale with shader arrays rather than shader engines. This shifts the question from the per-SE limit of CUs and RBEs to what the limit is per shader array. I've noted that GCN has always had some concept of shader arrays, with the option of having one or two per shader engine. What the shader engine means architecturally isn't clear if it's no longer tied to the hardware resources it once was. The old GCN limits are not exceeded on a per-array basis at this time, and the number of shader arrays per chip has not gone beyond past GCN limits. Southern Islands mentions shader arrays in its hardware IDs, and Arcturus-related changes note that Vega GPUs have one or two shader arrays per SE.

In that vein, the L1 is per shader array and helps limit congestion on the L2 by being a single client with 4 links to the L2.
Question: Prior to this, I don't see how it would have been practical for GCN to have full 64+:16 crossbars between the L2 and its various clients. Would the shader array or shader engine count in prior GCN chips have played a role in determining how many links the L2 had to deal with? How much did Navi change?
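A back-of-envelope on that question (the client counts below are my assumptions for a large GCN part and Navi 10, not documented figures):

```python
# Rough L2-crossbar client count under my assumptions: a big GCN part with
# one vector L1 per CU talking to the L2 slices, versus Navi 10 with one
# graphics L1 per shader array, each presenting 4 links to the L2.
GCN_CU_L1_CLIENTS = 64           # e.g. a 64-CU part; other L2 clients ignored
NAVI10_ARRAYS = 4                # 2 shader engines x 2 shader arrays (assumed)
NAVI10_LINKS = NAVI10_ARRAYS * 4 # 4 links from each array's L1 to the L2
print(GCN_CU_L1_CLIENTS, "vs", NAVI10_LINKS)  # 64 vs 16 requestor ports
```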

A few items in the CU look like merged versions of the prior SIMD hardware. Each SIMD32 supports 2x the number of wavefront buffers as the SIMD16, and the WGP supports as many workgroups as two GCN CUs.
Some earlier possibilities about the hardware that were discussed: export and messaging buses are shared between the two CUs, although I'm curious how different that is from earlier GCN--since there was arbitration for a single bus anyway.
The instruction cache seems to have the same line size and capacity, although in a sign of the times comparing the GCN whitepaper to RDNA shows this same cache supplies ~2-4 instructions now versus the ~8 then.

GCN could issue up to 5 instructions per clock to the SIMD whose turn it was for issue, with the requirement that they'd be of different type and wavefront.
RDNA doubles the number of SIMDs per CU that are actively issuing instructions, but each issues up to 4 instructions per clock of different types. It's not clearly stated that they'd have to come from different wavefronts, but I didn't see a same-wavefront example.
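A toy model of those issue rules as I read the two ISA docs (my simplification, not AMD's actual arbitration logic; the instruction types and the relaxed wavefront rule for RDNA are assumptions where noted):

```python
# GCN: the SIMD whose turn it is may receive up to 5 instructions, each of a
# different type and from a different wavefront. RDNA: each SIMD issues up to
# 4 instructions of different types per clock; whether they must come from
# different wavefronts is not clearly stated, so it's a parameter here.
def pick_issue(ready, max_issue, distinct_waves):
    """ready: list of (wave_id, instr_type) candidates, oldest first."""
    issued, types, waves = [], set(), set()
    for wave, itype in ready:
        if itype in types or (distinct_waves and wave in waves):
            continue
        issued.append((wave, itype))
        types.add(itype)
        waves.add(wave)
        if len(issued) == max_issue:
            break
    return issued

ready = [(0, "valu"), (1, "salu"), (1, "vmem"), (2, "lds"), (3, "export")]
print(pick_issue(ready, 5, distinct_waves=True))   # GCN-style issue slot
print(pick_issue(ready, 4, distinct_waves=False))  # RDNA-style issue slot
```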

The L1 is read-only, so I'm not sure at this point how many write paths there are to the L2, though this wasn't explicitly stated in prior GCN ISA guides either. It was clarified that the L1 can supply up to 4 accesses per clock.

Oldest-workgroup scheduling and clauses do seem to point to a desire to moderate GCN's potentially overly thrash-happy thread switching in certain situations.

128 byte cache lines do have some resemblance to some competing GPUs, although there may be differing levels of commitment to that granularity at different levels of cache.
Wave32 hardware and dropping the 4-cycle cadence have brought the SIMD and SFU model closer to some Nvidia implementations as well.
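The cadence arithmetic behind that comparison, as a quick sketch (wave and SIMD widths are from the public docs):

```python
# A VALU instruction occupies the SIMD for wave_size / simd_width cycles.
def valu_cycles(wave_size, simd_width):
    return wave_size // simd_width

print(valu_cycles(64, 16))  # GCN wave64 on SIMD16 -> the 4-cycle cadence
print(valu_cycles(32, 32))  # RDNA wave32 on SIMD32 -> single-cycle issue
print(valu_cycles(64, 32))  # RDNA wave64 -> 2 cycles
```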

RDNA has groups of 4 L2 slices linked to a single 64-bit memory controller--which for GDDR6 is 4 16-bit channels (another of the "did RDNA change something" items).
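Worked channel math for that layout, using Navi 10's 256-bit bus as an assumed example:

```python
# Groups of 4 L2 slices per 64-bit memory controller, and GDDR6 splits each
# 64-bit controller into 4x 16-bit channels (per the whitepaper description).
BUS_BITS = 256                   # Navi 10, assumed as the worked example
controllers = BUS_BITS // 64     # -> 4 memory controllers
l2_slices = controllers * 4      # -> 16 L2 slices
gddr6_channels = controllers * 4 # -> 16 independent 16-bit channels
print(controllers, l2_slices, gddr6_channels)
```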

RDNA still rasterizes only 4 triangles per clock though. And there is no mention of Primitive Shaders or the DSBR anywhere.
The driver and LLVM code changes reference the DSBR and primitive shaders. There's GPU profiling showing primitive shaders running.
The RDNA ISA doc points out primitive shaders specifically--and not in an accidental way like a few references in the Vega ISA doc that AMD failed to erase.
The triangle references seem to be what AMD has done for the fixed-function pipeline irrespective of whether primitive shaders are running.

What we don't have a clear picture on is how consistently these are working, or how successful they are versus having them off. The DSBR generally had modest benefits, and if it's generally unchanged it might not be newsworthy. If primitive shaders are not considered fully baked, or are also of limited impact, it's possibly not newsworthy or may dredge up the memory of the failed execution with Vega.

The whitepaper goes into some detail of what the primitive units and rasterizers are responsible for, though that central geometry processor's specific duties aren't addressed.

In the whitepaper it's written that one prim unit can cull 2 primitives per clock. That means 8 primitives per clock for Navi.

I remember AMD claiming 17 primitives per clock for Vega. What the hell went wrong?
The Vega whitepaper gave a theoretical max that probably represented the best-case with a primitive shader, which Vega wound up seeing none of. The RDNA whitepaper may be going the other way and gives what the triangle setup hardware can do as a baseline. I'm a little unclear on whether it can cull two triangles and submit a triangle to the rasterizer in one clock, or if it's a more complex mixture of culling and/or submitting.

One of the justifications for primitive shaders with Vega was avoiding losing a cycle per triangle that reached the fixed-function pipeline only to be culled. Enhancing the primitive pipeline may somewhat reduce the force of this argument, though it might not go as far as Nvidia, whose task and mesh shaders tend to assume the triangle setup hardware is capable of doing a fair amount of culling on its own.
 
So the main differentiators from GCN5/Vega are:

-the increased cache bandwidth to ALUs (minimizing their idle state and working them harder)
-the massive increase in triangle culling rate.

RDNA still rasterizes only 4 triangles per clock though. And there is no mention of Primitive Shaders or the DSBR anywhere.

Isn't Turing/Pascal's rasterization rate 0.5 triangles per GPC per clock? How likely is this to be a bottleneck in
Thanks for sharing.

I don’t get the emphasis on “dual” compute unit. There are 4 32-wide SIMDs per CU each with their own scheduler, registers and workgroups. All 4 SIMDs share the LDS and caches.

So what exactly is dual meant to describe? Is it just that each pair of SIMDs in the CU share a TMU block and there are 2 such blocks?

I thought it was describing the optional wave sizes but someone more knowledgeable than me should def chime in.
 
My understanding of the scope of a CU/SM in recent architectures is that it’s the unit of hardware that owns the execution of a workgroup/block of threads and has its own pool of LDS/shared memory.

AMD is counting a dual compute unit as 2 CUs even though all 4 SIMDs appear to share the LDS. I must be missing something.
 
My understanding of the scope of a CU/SM in recent architectures is that it’s the unit of hardware that owns the execution of a workgroup/block of threads and has its own pool of LDS/shared memory.

AMD is counting a dual compute unit as 2 CUs even though all 4 SIMDs appear to share the LDS. I must be missing something.
From the RDNA ISA doc, the LDS itself is implemented as two 64 KB halves, one half considered local to one CU. The arrangement of the local half of the LDS matches a GCN CU's capacity and banking, so the big LDS listed for the WGP is two "classic" LDS arrays. The big difference is that there is some kind of link between the two, so that a wavefront in one CU can read from the more distant LDS in WGP mode--subject to potential performance penalties not otherwise specified.
From my reading of the ISA doc, the WGP mode allows for a workgroup's LDS allocation to be split across the two halves without changing the maximum allocation a workgroup can make.
At some point there may be some disclosure of why two GFX10 variants have an LLVM bug flag for LDS usage in workgroup mode, and how significant that is. For some reason one variant does not have that flag.

Aside from that link between the LDS halves, much of the CU layout that the RDNA whitepaper goes into looks a lot like two independent CUs.
There are some differences, like how the dual-CU supports twice as many workgroups as a single GCN CU, though I'm not clear if that means there's a shared scheduling component that tracks twice as many workgroups, or two CUs that each track half the total locally and can somehow query the status in the other CU. I'm also unclear whether WGP versus CU mode affects this ceiling from the point of view of a CU.
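As a bookkeeping sketch of that reading (the split policy and the per-workgroup cap are my assumptions from the ISA doc's wording, not a documented allocator):

```python
# The WGP's 128 KB LDS as two 64 KB halves, one local to each CU. CU mode
# keeps a workgroup's allocation in its local half; WGP mode may split the
# allocation across both halves, with the per-workgroup maximum unchanged.
LDS_HALF_BYTES = 64 * 1024
MAX_WG_ALLOC = 64 * 1024          # assumed unchanged from GCN

def place_allocation(size, mode):
    """Return a {half: bytes} placement for a workgroup's LDS allocation."""
    if size > MAX_WG_ALLOC:
        raise ValueError("exceeds the per-workgroup LDS limit")
    if mode == "CU":
        return {"local half": size}     # stays within one classic array
    near = size // 2                    # even split is my guess; the real
    return {"near half": near,          # placement policy isn't documented,
            "far half": size - near}    # and far-half access may be slower

print(place_allocation(48 * 1024, "CU"))
print(place_allocation(48 * 1024, "WGP"))
```

So WGP mode looks like it's about placement and sharing, not a bigger allocation per workgroup.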
 
They shipped it to hyperscalers and you don't look like one, so you're not supposed to see it.

Which one? I haven't seen any hyperscalers even mention they have a working cluster. Others like Summit are very open with their designs and use cases. Regardless, if only hyperscalers see it, that likely means it's in short supply.
 
What we don't have a clear picture on is how consistently these are working, or how successful they are versus having them off. The DSBR generally had modest benefits, and if it's generally unchanged it might not be newsworthy. If primitive shaders are not considered fully baked, or are also of limited impact, it's possibly not newsworthy or may dredge up the memory of the failed execution with Vega.

The whitepaper goes into some detail of what the primitive units and rasterizers are responsible for, though that central geometry processor's specific duties aren't addressed.


The Vega whitepaper gave a theoretical max that probably represented the best-case with a primitive shader, which Vega wound up seeing none of. The RDNA whitepaper may be going the other way and gives what the triangle setup hardware can do as a baseline. I'm a little unclear on whether it can cull two triangles and submit a triangle to the rasterizer in one clock, or if it's a more complex mixture of culling and/or submitting.

One of the justifications for primitive shaders with Vega was avoiding losing a cycle per triangle that reached the fixed-function pipeline only to be culled. Enhancing the primitive pipeline may somewhat reduce the force of this argument, though it might not go as far as Nvidia, whose task and mesh shaders tend to assume the triangle setup hardware is capable of doing a fair amount of culling on its own.

My understanding of primitive shaders is this:
It's like workload balancing. If not enough pixels reach the CUs because not enough triangles are culled and the CUs sit empty, why not give the CUs culling work?

To me it now looks like primitive shader culling is its own fixed shader stage.
 
My understanding of the scope of a CU/SM in recent architectures is that it’s the unit of hardware that owns the execution of a workgroup/block of threads and has its own pool of LDS/shared memory.

AMD is counting a dual compute unit as 2 CUs even though all 4 SIMDs appear to share the LDS. I must be missing something.

I don't think you are missing anything. CUs have had 64 ALUs since Cayman (VLIW4, 16-way SIMD), although all the GCN ones were organized as 4 16-way SIMDs with a separate scalar ALU, so arguably 65.

When we doubled the size of each SIMD we could either say "hey CU's are twice as big now" (which would be confusing) or we could talk about 2-CU blocks (which was felt to be a bit less confusing).

My understanding was that we still wanted to allow 4 SIMDs to collaborate via LDS in order to minimize impact on existing code, but didn't want to confuse customers by making CU's twice as big, so the remaining option was to keep a CU at 64 ALUs (66 now I guess) going from 4 SIMDs to 2, with LDS sharing between 2 CU's to maintain "4 SIMDs per LDS".

Not sure if that helps or just trowels on another layer of confusion :)
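For anyone keeping score, the ALU tally behind the naming (a quick sketch; the scalar-ALU counts are from the post above):

```python
# Vector-ALU bookkeeping across the layouts mentioned in this thread.
layouts = {
    "Cayman (VLIW4)": 16 * 4,  # 16-way SIMD x VLIW4 -> 64 ALUs
    "GCN CU":         4 * 16,  # 4x SIMD16 -> 64 (+1 scalar ALU, "65")
    "RDNA CU":        2 * 32,  # 2x SIMD32 -> 64 (+2 scalar ALUs, "66")
    "RDNA WGP":       4 * 32,  # the dual-CU: 4 SIMD32s sharing one LDS
}
for name, alus in layouts.items():
    print(f"{name}: {alus} vector ALUs")
```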
 
I don't think you are missing anything. CUs have had 64 ALUs since Cayman (VLIW4, 16-way SIMD), although all the GCN ones were organized as 4 16-way SIMDs with a separate scalar ALU, so arguably 65.

When we doubled the size of each SIMD we could either say "hey CU's are twice as big now" (which would be confusing) or we could talk about 2-CU blocks (which was felt to be a bit less confusing).

My understanding was that we still wanted to allow 4 SIMDs to collaborate via LDS in order to minimize impact on existing code, but didn't want to confuse customers by making CU's twice as big, so the remaining option was to keep a CU at 64 ALUs (66 now I guess) going from 4 SIMDs to 2, with LDS sharing between 2 CU's to maintain "4 SIMDs per LDS".

Not sure if that helps or just trowels on another layer of confusion :)

Yeah I definitely read it as CUs being twice as big on the first pass through the paper.

I am actually a bit more confused now :) In what way could GCN code be optimized for the number of SIMDs per LDS that wouldn’t map well to RDNA? Is it due to changes in LDS bandwidth / latency with fewer SIMDs?
 
Any ideas how compute performance could differ from GCN?
Reading the papers I only see potential improvements, but the few compute benchmarks show big differences in both directions.

Not sure if that helps or just trowels on another layer of confusion :)
My confusion: VGPRs are not twice as much now, yet in a way register pressure problems are magically gone now, no?

The whitepaper lacks info on what affects occupancy now, and if there are serious changes from GCN behavior here. (I assume no, but did not really understand the sections that address occupancy.)

Sort of a killer feature would be the option to double accessible LDS for certain workgroups, while running others that do not use any LDS on the other half of one WGP. If technically possible, it would be worth some work to make it happen!
 
Yeah I definitely read it as CUs being twice as big on the first pass through the paper.

I am actually a bit more confused now :) In what way could GCN code be optimized for the number of SIMDs per LDS that wouldn’t map well to RDNA? Is it due to changes in LDS bandwidth / latency with fewer SIMDs?

Sorry, I think I was a bit short of coffee on the last post... you get backwards compatibility anyway, but you aren't taking full advantage of the RDNA hardware because each workgroup is limited to a single CU. If you run in WGP mode you can spread a workgroup's waves across both CUs and the waves are still able to communicate via LDS etc...

Coming up with good names for things is harder than it looks :)
 
Sorry, I think I was a bit short of coffee on the last post... you get backwards compatibility anyway, but you aren't taking full advantage of the RDNA hardware because each workgroup is limited to a single CU. If you run in WGP mode you can spread a workgroup's waves across both CUs and the waves are still able to communicate via LDS etc...

Coming up with good names for things is harder than it looks :)

Get some children in a room and ask them to come up with names. Children are GREAT at coming up with random names for things. :)

Regards,
SB
 
Well Vega had squat in the end. Seems it was way too ambitious to exploit in the real world.

If Navi can do less on paper but really do it in the real world, it's progress imo.

So in other words, Vega PS was just a marketing gimmick based on unrealistic numbers..... ;) Anyway, it would be interesting to watch how RDNA2 will evolve in the future.
 
My confusion: VGPRs are not twice as much now, yet in a way register pressure problems are magically gone now, no?

It is not magic, it is an effect of the shorter execution latency. Each instruction result has a reservation in the register file. Since latency in RDNA is one quarter of GCN, register file entries used for temporary results effectively have one quarter the footprint (measured in bytes times cycles).

For code with lots of long-lived temporaries the register file is virtually quadrupled (upper bound, grain of salt, mileage may vary etc...)
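That bytes-times-cycles argument as toy arithmetic (the register size and lifetime here are illustrative assumptions, not measured data):

```python
# Footprint of a temporary in the register file, in byte-cycles.
def footprint_byte_cycles(reg_bytes, instrs_live, cycles_per_instr):
    return reg_bytes * instrs_live * cycles_per_instr

REG = 64 * 4   # one 32-bit VGPR across a 64-lane wave
LIVE = 10      # temporary stays live across 10 instructions (assumed)

gcn = footprint_byte_cycles(REG, LIVE, 4)   # 4-cycle GCN cadence
rdna = footprint_byte_cycles(REG, LIVE, 1)  # single-cycle RDNA issue
print(gcn / rdna)  # -> 4.0, matching the "one quarter the footprint" claim
```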

Cheers
 
I see, so the reduced instruction latency also reduces the required registers. I did not realize this before, thanks :)
 