I don't see a reason not to do exactly the same with the HW_ID bits. And there is plenty of space for just adding a different set of bits for newer versions.
I wouldn't see a need to do that until the internals expanded to the point that extra representation was needed externally.
Because AMD seems to value consistency across generations?
For a number of these context values, apparently yes. AMD went so far as to reserve bits inside the ID representation, which seems to indicate that there is value to them in having things positioned the way they are.
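For reference, this is roughly how I read the HW_ID packing in the GCN3 ISA docs (a sketch only; the field positions and widths are my reading and could shift between generations, and the unused bits are the kind of reserved space I mean):

```c
#include <stdio.h>
#include <stdint.h>

/* Rough sketch of the HW_ID context register layout as I read the GCN3 ISA
 * docs; positions/widths are my interpretation and may differ per ASIC. */
#define HWID_FIELD(reg, shift, width) (((reg) >> (shift)) & ((1u << (width)) - 1u))

static void decode_hw_id(uint32_t hw_id)
{
    printf("WAVE_ID  = %u\n", HWID_FIELD(hw_id,  0, 4));  /* wave slot within the SIMD  */
    printf("SIMD_ID  = %u\n", HWID_FIELD(hw_id,  4, 2));  /* SIMD within the CU         */
    printf("PIPE_ID  = %u\n", HWID_FIELD(hw_id,  6, 2));  /* pipe within the ME         */
    printf("CU_ID    = %u\n", HWID_FIELD(hw_id,  8, 4));  /* CU within the shader array */
    printf("SH_ID    = %u\n", HWID_FIELD(hw_id, 12, 1));  /* shader array within the SE */
    printf("SE_ID    = %u\n", HWID_FIELD(hw_id, 13, 2));  /* shader engine              */
    /* bit 15 unused: room to widen CU_ID/SE_ID without moving anything else */
    printf("TG_ID    = %u\n", HWID_FIELD(hw_id, 16, 4));  /* thread group               */
    printf("VM_ID    = %u\n", HWID_FIELD(hw_id, 20, 4));  /* virtual memory context     */
    printf("QUEUE_ID = %u\n", HWID_FIELD(hw_id, 24, 3));  /* queue within the pipe      */
    printf("STATE_ID = %u\n", HWID_FIELD(hw_id, 27, 3));
    printf("ME_ID    = %u\n", HWID_FIELD(hw_id, 30, 2));  /* microengine (GFX/ACE)      */
}

int main(void) { decode_hw_id(0x00001234u); return 0; }
```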
The main task of the needed rework is not changing a few bits in some registers. It's actually designing the work distribution and all the crossbars in the chip to the needed size.
The amount of work distribution is less variable, as wavefront allocation and launches are constrained to at most 1 wave per pipe capable of dispatch, and some limited number per shader engine (1-2?).
How the GPU tracks free resources to determine whether a wavefront can launch, allocates them, and in what order, looks like something that relies on some of these values, perhaps in a mix of hardware and microcode, since sources like the Xbox One SDK indicate the strides used for distributing work can be set by the developer.
Coordination among the command processors, the shader engines' fixed-function regions, and the CU arrays also frequently uses pathways independent of the vector memory path, such as the GDS and export buses.
The crossbars in GPUs have way more clients (just look at the connections between all the L1 caches [there are 96 separate ones!] and L2 tiles [everything has a 512bit wide connection btw.]).
There's the crossbar, or a cheaper approximation of one, between the write-through L1s and the L2. No communication between L1s, no coherence outside of the trivial handling by the L2, limited RAS, and a very weak memory model.
And isn't AMD on record that it will use some infinity fabric derived thingy to connect everything on the chip to the L2 tiles/memory controllers in Vega?
The most AMD has said is that the infinity fabric is implemented as a mesh in Vega, and that it has the same bandwidth as the DRAM.
Putting that between the CUs and the L2s would be a regression in the usual ratio of L1-L2 to DRAM bandwidth from prior architectures, and in raw terms inferior to Hawaii's L2. This is why I am having doubts about the infinity data fabric going into the traditionally encapsulated portion of the GPU. For Zen, the infinity fabric doesn't inject itself between the cores and their local L3 either.
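To put rough numbers on why I'd call that a regression (the per-tile L2 width and the Vega HBM2 figure are my assumptions; only the ratio matters here):

```c
#include <stdio.h>

/* Ballpark figures only; the 64B/clk per tile and the ~500 GB/s HBM2 number
 * are assumptions on my part, the point is the ratio, not the exact values. */
int main(void)
{
    double hawaii_l2_tiles  = 16.0;   /* one tile per 32-bit memory channel      */
    double l2_bytes_per_clk = 64.0;   /* assumed per-tile width                  */
    double core_clk_ghz     = 1.0;
    double hawaii_l2_bw     = hawaii_l2_tiles * l2_bytes_per_clk * core_clk_ghz; /* ~1 TB/s */
    double hawaii_dram_bw   = 320.0;  /* 512-bit GDDR5 at 5 Gbps                 */

    double vega_dram_bw     = 500.0;  /* assumed 2-stack HBM2 figure             */
    double vega_fabric_bw   = vega_dram_bw;  /* "same bandwidth as the DRAM"     */

    printf("Hawaii L2 : DRAM ratio ~ %.1f : 1\n", hawaii_l2_bw / hawaii_dram_bw);
    printf("Fabric at DRAM bandwidth (%.0f GB/s) vs Hawaii's L2 (%.0f GB/s)\n",
           vega_fabric_bw, hawaii_l2_bw);
    /* A fabric capped at DRAM bandwidth sitting between the CUs and the L2
     * would leave the CU-side L2 bandwidth below what Hawaii already had.  */
    return 0;
}
```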
But a crossbar (and the effort needed for it) is not agnostic to the number of attached clients, quite the contrary actually.
That wouldn't have anything to do with work distribution. All of that decision making is done by logic on either side; the fabric may be responsible for routing payloads, but it wouldn't know what they are or decide on the endpoints. It wouldn't know what mode the GPU or CU groups have decided on for kernel allocation, how two dispatchers arbitrate a collision, nor could it interpret the meaning of a wavefront signalling its final export.
The data payload movement isn't what I was considering, although I think most of that portion is actually handled by bus connections, more unidirectional flow, and internal hierarchies. Given the other limits, some of those might be streamlined down to have at most N clients, or would scale very poorly if they had more.
Also, the vector memory path prior to Vega wouldn't put a limit on the RBE count or rasterizer width per SE, since those aren't L2 clients.
If you scale a chip from 10 CUs (with 16 L1$ and 4 L2-tiles, i.e. you have a 16x4 port crossbar between them [as mentioned, every port is capable of 512bit per clock]) to 64 CUs (96 L1$, 32 L2 tiles, i.e. a 96x32 port crossbar),
I accept that the memory crossbar can scale significantly between chips, although some of the elements I was discussing would only see a generic memory port with 1 access per cycle or hook into separate data paths.
I presume the 32 L2 tile case is for Fury and its 4x8 channels? Bandwidth synthetics showed a distinct lack of bandwidth scaling over Hawaii until after accesses went outside the L2, for some reason. Also, I'm not entirely sure how many of the 96 clients can independently use the crossbar, and this may leave out some of the blocks that can also read from the L2.
Access arbitration as broad as 96x32 may be excessive, given the limits of the L2 and possibly other simplifying assumptions about distribution. It's possible that this could be a limiter in this pathway, or in access arbitration.
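To put a number on the scaling argument with the figures from the quote above (a naive flat-crossbar cost model; the real design is surely more hierarchical than this):

```c
#include <stdio.h>

/* Naive full-crossbar cost model: crosspoints = requesters * targets.
 * Real designs are almost certainly hierarchical, so treat this as an upper
 * bound on how the effort grows, not as the actual implementation. */
static unsigned crosspoints(unsigned l1_clients, unsigned l2_tiles)
{
    return l1_clients * l2_tiles;
}

int main(void)
{
    unsigned small = crosspoints(16, 4);   /* 10-CU part: 16 L1 clients, 4 L2 tiles  */
    unsigned large = crosspoints(96, 32);  /* 64-CU part: 96 L1 clients, 32 L2 tiles */

    printf("small: %u crosspoints, large: %u crosspoints (%.0fx)\n",
           small, large, (double)large / small);
    /* CU count grows 6.4x, but a flat crossbar grows 48x; that is the sense
     * in which the effort is not agnostic to the number of clients.        */
    return 0;
}
```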
Each vALU instruction has a certain fixed latency, usually identical to the throughput (4 cycles for a full rate instruction, 8 cycles for half rate, and so on; exceptions exist).
A lot of exceptions stem from VALU instructions that can source from the LDS, which for some reason don't make the list for a waitcnt. As documented in the ISA doc, VALU instructions cannot increment or decrement the counter anyway.
To give a specific example, the consumer parts for Hawaii (which is physically a half rate double precision chip) have a fuse blown (or potentially some µcode update done by the firmware) which sets this to 1/8 rate for the DP instructions. The chip effectively pauses the vALU for some cycles after each DP instruction.
Other chips have units which can do only 1/16 DP rate.
Perhaps if the GPU assembly were manually arranged or the compiler hacked, that pause could be tested to see whether it allows a shader to drop some of the required wait states for various vector ops, if there are effectively 6 vector issue cycles of delay added this way.
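For what it's worth, the back-of-envelope behind that 6-slot figure, assuming the usual 4-cycle issue cadence (the reading of the throttle as an issue-side pause is my speculation):

```c
#include <stdio.h>

/* Back-of-envelope for the consumer-Hawaii DP throttle; the cycle counts
 * follow from the quoted rates, the interpretation as a pause after each
 * DP instruction is speculative. */
int main(void)
{
    int wave_issue_cycles = 4;                    /* full-rate op: 64 lanes on a 16-wide SIMD */
    int hw_dp_cycles      = 2 * wave_issue_cycles;  /*  8: the silicon is 1/2-rate DP         */
    int fused_dp_cycles   = 8 * wave_issue_cycles;  /* 32: fused down to 1/8 rate             */

    int pause_cycles      = fused_dp_cycles - hw_dp_cycles;       /* 24 */
    int pause_issue_slots = pause_cycles / wave_issue_cycles;     /*  6 */

    printf("pause after each DP op: %d cycles = %d vector issue slots\n",
           pause_cycles, pause_issue_slots);
    return 0;
}
```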
This was part of my earlier speculation, where there are smaller execution loops in the various domains that manage their pipelines and allow them to coordinate despite varying latencies.
That allows for several smaller execution loops that modify the counters, which would then relieve the scalar pipeline of having to tightly coordinate with them. I think the scalar pipe itself is more closely linked to one of them, or is itself derived from a formerly unexposed sequencer in pre-GCN GPUs.
Something that awkwardly bridges multiple domains requires conservatively tracking multiple counters by waiting for a count of 0, as with flat addressing or scalar memory ops.
Why that might be interesting for vector ops that can have unexpected or variable latencies is that there apparently is, or was, a VALU count, one which hasn't been fully expunged. However, hard-wiring an implicit requirement that it be 0 could allow some of those variations or new vector functions to be handled transparently if there is similar logic in the SIMD block.
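A toy model of what I mean, using the counter names from the ISA docs but deliberately leaving the unexposed VALU count out (the behavior is heavily simplified):

```c
#include <stdio.h>

/* Toy model of the GCN dependency counters; very simplified, only meant to
 * show why out-of-order return paths force a conservative wait on 0. */
struct wait_cnts {
    int vmcnt;    /* outstanding vector memory ops        */
    int lgkmcnt;  /* outstanding LDS/GDS/constant/message */
    int expcnt;   /* outstanding exports/GDS writes       */
};

/* Flat ops touch both the vector memory path and LDS, so they bump both
 * counters; since either path may return first, the only safe wait for the
 * result is vmcnt==0 && lgkmcnt==0.                                        */
static void issue_flat_load(struct wait_cnts *c)  { c->vmcnt++; c->lgkmcnt++; }

/* Scalar memory reads can return out of order relative to each other, so a
 * wait for lgkmcnt > 0 says nothing about which load completed; again the
 * conservative wait is 0.                                                  */
static void issue_smem_load(struct wait_cnts *c)  { c->lgkmcnt++; }

static int can_read_flat_result(const struct wait_cnts *c)
{
    return c->vmcnt == 0 && c->lgkmcnt == 0;  /* i.e. s_waitcnt vmcnt(0) lgkmcnt(0) */
}

int main(void)
{
    struct wait_cnts c = {0, 0, 0};
    issue_flat_load(&c);
    issue_smem_load(&c);
    printf("safe to consume flat result: %s\n", can_read_flat_result(&c) ? "yes" : "no");
    return 0;
}
```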
It's very much about economizing the effort. But you can't just build a large GPU out of the design of a small CP, some compute pipe, a CU, an RBE, and an L2 tile with a connected 32bit memory channel.
The relationships between a number of these are rather constrained. CP to SE has a few combinations with no clear dependence on CU count. The SE front end is linked to the RBEs, not to CU count or to the command processor block. RBEs have previously not connected to the L2. Some of the wait counts mentioned earlier are about arbitrating access to a bus, rather than about an ever-changing interconnect.