You mean like changing a single line in some profiling tool?
That sounds like it would be one item, but I was also thinking of the microcode the GPU loads for internal load-balancing parameters, such as the CU reservation and high-priority queue functionality that was retroactively applied to some of the prior-generation GCN GPUs.
These bits just return exactly where the executing shader code is running (shader engine ID, shader array ID within that engine, CU ID, SIMD ID, threadgroup ID, wavebuffer ID, and the compute ring ID if applicable), nothing more. It's basically the equivalent of checking the local APIC ID with the CPUID instruction on an x86 multicore/multithreaded CPU.
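A minimal sketch of what reading those bits amounts to, assuming the HW_ID field layout as I read it from the GCN3 ISA doc (treat the exact offsets as an assumption); a shader would fetch the raw value with s_getreg_b32, and decoding it is just masking and shifting:

```c
#include <stdint.h>
#include <stdio.h>

/* Extract a bitfield: value >> shift, masked to width bits. */
#define HW_ID_FIELD(v, shift, width) (((v) >> (shift)) & ((1u << (width)) - 1u))

/* Decode a raw HW_ID value. Field positions follow my reading of the
 * GCN3 ISA manual and are an assumption, not a confirmed layout. */
static void decode_hw_id(uint32_t hw_id)
{
    printf("wave  %u\n", HW_ID_FIELD(hw_id,  0, 4)); /* WAVE_ID                        */
    printf("simd  %u\n", HW_ID_FIELD(hw_id,  4, 2)); /* SIMD_ID                        */
    printf("pipe  %u\n", HW_ID_FIELD(hw_id,  6, 2)); /* PIPE_ID                        */
    printf("cu    %u\n", HW_ID_FIELD(hw_id,  8, 4)); /* CU_ID                          */
    printf("sh    %u\n", HW_ID_FIELD(hw_id, 12, 1)); /* SH_ID: shader array            */
    printf("se    %u\n", HW_ID_FIELD(hw_id, 13, 2)); /* SE_ID: 2 bits, so at most 4 SEs */
    printf("tg    %u\n", HW_ID_FIELD(hw_id, 16, 4)); /* TG_ID                          */
    printf("queue %u\n", HW_ID_FIELD(hw_id, 24, 3)); /* QUEUE_ID (compute ring)        */
}

int main(void)
{
    decode_hw_id(0x01002345u); /* arbitrary example value */
    return 0;
}
```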
To clarify, I am not saying the external bits are determining the fate of the internals, but that they can be reflective of peculiarities below the surface, as evidenced by APIC IDs and Zen.
There's a desire for economy in the external encoding, so the ISA would tend to use as few bits as necessary for a given design, while having fewer bits than the internal usage requires would compromise the purpose of the instructions that return the external bits.
So when items deviate from the most straightforward use case, I wonder if there's a reason.
How the units are wired internally changes with every different chip size anyway.
That runs counter to some of the goals for the GPU architecture of parallel scaling by copy-pasting more of the same resource. I wouldn't want to re-roll a CU sequencer based on whether it is going into a chip with 1 or 4 shader engines, or whether a specific chip has 1:4 DP in consumer form and 1:2 in HPC. AMD marketed GCN's tunable DP rate, which would have been a trivial point if "any hardware can be redesigned to hit X rate" were the rule, unless it's something they explicitly made tunable without that amount of effort.
In a compromise against infinite scalability, I wouldn't want to bloat the pipeline or resource scoreboards with 32-bit identifiers for all those items either.
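As a rough worked example of that economy (the unit counts below are illustrative assumptions, not a description of any particular chip): packing per-level IDs for a wavefront takes on the order of a dozen bits, so flat 32-bit identifiers per level would be heavily oversized:

```c
#include <stdio.h>

/* Minimal bits needed to encode n distinct values, i.e. ceil(log2(n)). */
static unsigned bits_for(unsigned n)
{
    unsigned b = 0;
    while ((1u << b) < n)
        b++;
    return b;
}

int main(void)
{
    /* Illustrative counts only (assumed for the example): 4 shader engines,
     * 1 shader array per engine, 16 CUs per array, 4 SIMDs per CU,
     * 10 wave slots per SIMD. */
    unsigned total = bits_for(4) + bits_for(1) + bits_for(16)
                   + bits_for(4) + bits_for(10);
    printf("packed wave tag: %u bits vs. a 32-bit ID per level\n", total);
    return 0;
}
```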
It's unsubstantiated speculation on my part that there are elements of the architecture that were laid down earlier with an upper bound to how far they could go before changes had to propagate to other IP blocks. I have seen other commentary about Southern Islands in particular having a fixed-function portion closer to Northern Islands than the more flexible front ends sported by CI and later revisions.
I think then that GCN languished longer than expected and more clearly hit some of those limits with Fiji.
I don't know about Vega. The supposedly new CU architecture, new GFX level, and changes in interconnect and microcode make me think there is ample opportunity to raise the limits or choose a topology that can scale better, although Vega 10's starting point and Vega 20's apparent continuation don't push the specs outside of the old bounds.
Which is something completely different from how some ID bits are mapped. The hardware actually doesn't need to know or understand the mapping of these bits; it only has to mindlessly return the ID bits assigned to the executing hardware (and yes, they are at least partially configurable by the firmware) if a shader instruction wants to read them.
As you noted earlier, I cannot think of a reason why HW_ID has a bit for shader array assignment when there's only one. I do not know why roughly two bits of IB_STS concern themselves with outstanding vector ALU operations per wavefront, at least no reason at the external ISA level or above. It should have been trivial at some point to remove or repurpose them if they had no significance and no cost to change.
Yes. There were patches preparing the LLVM compiler for Vega which included such details.
I think I found what you are referencing. Is it the following from February?
https://github.com/llvm-mirror/llvm/commit/83c857cd3ae73d71958ccee8c43d55dc40ba3cc1
It's true that Vega adds much more. It was just an example that such remapping of bits can be done if one wants it (and x86 adds some CPUID stuff all the time, basically with every new chip).
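For the x86 side of that analogy, a minimal sketch using GCC/Clang's cpuid.h (x86-only, so it assumes that toolchain and target): the maximum supported standard leaf keeps growing across generations, and leaf 1 returns the initial APIC ID in EBX[31:24], the CPU-side counterpart of a shader reading HW_ID:

```c
#include <stdio.h>
#include <cpuid.h> /* GCC/Clang wrapper around the CPUID instruction (x86 only) */

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    /* Highest supported standard CPUID leaf; new generations keep
     * extending this enumeration space. */
    unsigned max_leaf = __get_cpuid_max(0, NULL);
    printf("max standard CPUID leaf: 0x%x\n", max_leaf);

    /* Leaf 1: initial APIC ID of the executing logical CPU in EBX[31:24]. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        printf("initial APIC ID: %u\n", (ebx >> 24) & 0xffu);

    return 0;
}
```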
For VMCNT, it does a bit more than remapping: it adds two more bits to vmcnt by utilizing non-adjacent bits. This consumes the remaining bits in the 16-bit immediate while also preserving the valu_cnt bits that the ISA docs don't really talk about or keep an accurate count of.
I suppose the split at the ISA level is for the purposes of software compatibility, though I have been curious as to what might be depending on valu_cnt, since officially it doesn't merit a mention for the instruction that should be using it (or, for that matter, an accurate bit count versus LGKM_CNT).
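Based on my reading of that LLVM commit, the packing works out roughly like this (a sketch under that assumption, not the authoritative encoding): pre-GFX9 the simm16 holds vmcnt in [3:0], expcnt in [6:4], and lgkmcnt in [11:8]; GFX9 then puts vmcnt[5:4] into [15:14], which is the non-adjacent-bits trick:

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the s_waitcnt simm16 packing as I read it from the GFX9
 * LLVM patches; treat the exact bit positions as an assumption.
 *   pre-GFX9: vmcnt[3:0] -> simm16[3:0], expcnt -> [6:4], lgkmcnt -> [11:8]
 *   GFX9:     additionally vmcnt[5:4] -> simm16[15:14]                     */
static uint16_t pack_waitcnt_gfx9(unsigned vmcnt, unsigned expcnt, unsigned lgkmcnt)
{
    uint16_t imm = 0;
    imm |= (vmcnt & 0xFu);               /* vmcnt low bits              */
    imm |= (expcnt & 0x7u) << 4;         /* expcnt                      */
    imm |= (lgkmcnt & 0xFu) << 8;        /* lgkmcnt                     */
    imm |= ((vmcnt >> 4) & 0x3u) << 14;  /* vmcnt high bits, GFX9 only  */
    return imm;
}

int main(void)
{
    /* A vmcnt of 35 needs the extra high bits; pre-GFX9 it would not fit. */
    printf("simm16 = 0x%04x\n", pack_waitcnt_gfx9(35, 0, 0));
    return 0;
}
```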
And in that case it is crucial for the core functioning of the hardware and not just for some runtime debug-information gathering. But all other GCN iterations (besides Polaris, maybe) also introduced new instructions, removed instructions, added features (and bits at some points), and changed the encoding of instructions.
It's the flexibility of the GCN representation and the idiosyncrasies of the various wait states that make me wonder if there's additional microcode or a program store below the level of the fetched instructions, perhaps like earlier microcoded processors or the small array of sequencer instructions used by early ARM. Fiddling with some of the bits used by that may mean fiddling with the ucode/program store, should there be room left to do so.
To sum it up, I fail to see a convincing argument for why there should be a hard limit on the number of shader engines. Adding a single bit to some IDs, modifying the work distribution or the interleaving pattern for the rasterizers and ROP partitions to accommodate more shader engines, and all the related stuff is totally doable and should pose no fundamental problem.
My scenario isn't that it is impossible, just that it is past a threshold where the investment became non-trivial and in AMD's eyes not worth the effort.
AMD left GCN in something of a holding pattern for several cycles, so some economies of effort were made.