AMD seems to value consistency for the reported values, and Vega does go through some amount of extra work to split VMCNT the way it does. It even takes care to leave room for the other fields--or doesn't want to take up part of the range for VALUCNT, for some reason.
I don't see a reason not to do exactly the same with the HW_ID bits. And there is plenty of space for just adding a different set of bits for newer versions. Only 7 hardware registers are assigned in that manual (1 to 7, 0 is reserved), but the ISA leaves room for 64. AMD could simply define HW_ID_ext as register #8 with a completely new set of bits. I still see no argument there.
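To make that concrete, here is a rough sketch of the s_getreg_b32 immediate as I read it in the GCN3 ISA manual: the register ID gets 6 bits, hence the room for 64 registers. The HW_ID_EXT ID is of course made up, purely to illustrate that a new register with a fresh bit layout could simply be slotted into the unused range.
```c
#include <stdint.h>

#define HWREG_ID_HW_ID      4u   /* assigned in the manual (only 1..7 are used) */
#define HWREG_ID_HW_ID_EXT  8u   /* hypothetical new register, first free ID    */

/* Build the 16-bit immediate for s_getreg_b32:
 * register ID in bits [5:0] (so 64 possible IDs),
 * bit offset in bits [10:6], (size - 1) in bits [15:11]. */
static inline uint16_t hwreg(unsigned id, unsigned offset, unsigned size)
{
    return (uint16_t)((id & 0x3f) | ((offset & 0x1f) << 6)
                      | (((size - 1) & 0x1f) << 11));
}
/* e.g. hwreg(HWREG_ID_HW_ID, 8, 4) would select the 4-bit CU_ID field. */
```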
And the extra work is mainly necessary for the software to glue the bits together. A fixed "swizzle" between some bits of a register just requires crossing the data lines. That effort is not worth mentioning.
If the internal distribution is only ever going to be sized to a certain limit, why should the CU be able to report any more than that?
Because AMD seems to value consistency across generations?
The formats seem to be sufficient to fit all the sizes AMD has put forward for generations, and apparently have headroom for two generations more.
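A quick back-of-the-envelope check of that headroom, assuming the GCN3 HW_ID field widths I remember from the manual (2-bit SE_ID, 1-bit SH_ID, 4-bit CU_ID, 2-bit SIMD_ID); the 64 CU / 256 SIMD figure is just the largest config shipped so far:
```c
#include <stdio.h>

int main(void)
{
    const int max_se   = 1 << 2;   /* 4 shader engines        */
    const int max_sh   = 1 << 1;   /* 2 shader arrays per SE  */
    const int max_cu   = 1 << 4;   /* 16 CUs per shader array */
    const int max_simd = 1 << 2;   /* 4 SIMDs per CU          */

    printf("reportable by the format: %d CUs, %d SIMDs\n",
           max_se * max_sh * max_cu,
           max_se * max_sh * max_cu * max_simd);   /* 128 CUs, 512 SIMDs */
    printf("largest shipped so far:   64 CUs, 256 SIMDs\n");
    return 0;
}
```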
If there is reworking necessary per size, Vega10 and Vega20 would apparently not need it.
The main task of the needed rework is not changing a few bits in some registers. It's actually designing the work distribution and all the crossbars in the chip to the needed size. The effort for the bits is irrelevant in comparison.
The context I got from the discussion on the Infinity Fabric's role in the SoC was that it was focused on the interconnect between the various separate portions of an APU, rather than internally. The internal crossbar used for AMD's CPUs has a long lineage, and that particular element has been cited as a scaling barrier, and the later APUs that tried to fit different IP onto it were described as being more chaotic. There has been a limited range in client counts for that crossbar since K8, although the coherent domain is a more demanding use case.
The crossbars in GPUs have way more clients (just look at the connections between all the L1 caches [there are 96 separate ones!] and the L2 tiles [everything has a 512-bit wide connection, btw.]). And isn't AMD on record that it will use some Infinity Fabric derived thingy to connect everything on the chip to the L2 tiles/memory controllers in Vega? AMD wants to go from the massive crossbar switches (which need an expensive redesign for any significant change to the number of clients) to a more easily scalable NoC configuration. That's quite an issue for the simplicity of scaling and one of the main efforts when going to larger GPUs.
I'm not sure that the fabric is going to cover the work distribution or arbitration facet I was focused on.
Not entirely, of course. But all of them have to be adapted to the size of the chip (and the work/the associated data has to somehow get to the CUs in order to be distributed, right?).
The data fabric does not intrude into the CCXs in Zen, and my interpretation of the limited description of Vega's implementation has it hooked into the memory controllers and L2, outside of the engines. Even without that, the interconnect is supposed to be agnostic to the particulars of the clients it is connecting.
But a crossbar (and the effort needed for it) is not agnostic to the number of attached clients, quite the contrary actually. The duplicated "building blocks", i.e. the individual CUs (or the groups sharing L1-sD$ and L1-I$), the RBEs and even the L2 tiles and memory controllers, are agnostic to the particulars of the connection between the parts (I mentioned that before). And it is exactly this connection (and of course also how to distribute stuff between the duplicated units) which has to be reworked for a differently sized chip. This is supposed to be improved upon with the Infinity Fabric.
The controllers in the various blocks and their hardware or microcode settings would do the work, and those are populated by a number of proprietary cores or dedicated sequencers.
If you scale a chip from 10 CUs (with 16 L1$ and 4 L2 tiles, i.e. you have a 16x4 port crossbar between them [as mentioned, every port is capable of 512 bit per clock]) to 64 CUs (96 L1$, 32 L2 tiles, i.e. a 96x32 port crossbar), there is no way the "controllers in the various blocks and their hardware or microcode settings would do the work". You have to put in significant effort to design that stuff and make it work.
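Just to put rough numbers on it (illustrative only: counting crosspoints and raw data wires at 512 bits per port, ignoring arbitration and control logic entirely):
```c
#include <stdio.h>

/* Crosspoint count of a full crossbar grows with (clients x targets);
 * each port is 512 bits wide per the figures above. */
static void xbar_cost(const char *name, int l1_ports, int l2_ports)
{
    long crosspoints = (long)l1_ports * l2_ports;
    printf("%s: %dx%d ports, %ld crosspoints, ~%ld data wires\n",
           name, l1_ports, l2_ports, crosspoints, crosspoints * 512);
}

int main(void)
{
    xbar_cost("small chip (10 CUs)", 16, 4);   /*   64 crosspoints */
    xbar_cost("large chip (64 CUs)", 96, 32);  /* 3072 crosspoints, ~48x */
    return 0;
}
```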
Couldn't the flexibility for salvage be used to provide downward scalability for the same hardware block? If the GPU can transparently disable and remap units, couldn't the same block interface with hardware where there are units "disabled" due to their not existing?
Within reason, of course. As the effort for large crossbars (that is what we have in GPUs up to now) goes up tremendously with the number of clients, there are strong incentives to put smaller versions in smaller chips. This stuff is definitely not untouched between differently sized chips.
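For illustration, the kind of remap the salvage question implies could look like the sketch below: to the work distribution, a fused-off unit and a non-existent one look the same. It says nothing about the physical crossbar, which still has to be sized for the units that are actually there. (This is purely my own sketch, not AMD's mechanism.)
```c
#include <stdint.h>

/* Map a logical CU index onto a physical one, skipping disabled units.
 * enable_mask has one bit per physical CU; returns -1 if fewer than
 * (logical + 1) CUs are enabled. */
static int logical_to_physical_cu(uint64_t enable_mask, int logical)
{
    for (int phys = 0; phys < 64; phys++) {
        if (enable_mask & (1ull << phys)) {
            if (logical-- == 0)
                return phys;
        }
    }
    return -1; /* logical index exceeds the number of enabled CUs */
}
```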
The command processors themselves are proprietary F32 cores, per some LinkedIn info and the PS4 hack. The graphics command processor is a set of multiple custom cores, which is an arrangement whose lineage goes back to the VLIW days. [..] The command front end versus back end seems to vary pretty significantly, going by the ratio of front-end to back-end for Kaveri, the consoles, and Hawaii. The flexibility or size of the available microcode store seems to have a significant effect on what features they can offer.
As this was obviously extendable over the range you mentioned (starting with 3/4 SIMD engines all the way up to 64 CUs with 256 SIMD engines in total, and from a single shader engine up to 4), I doubt that the number of shader engines is a hard wall on this front.
How do the CUs for the same implementation "know" what to do, even with a fuse blown, without either having the logic for the different rates or internal sequencing information on hand?
A fuse is basically a bit which can set certain things. And then it is simple. Each vALU instruction has a certain fixed latency, usually identical to the throughput (4 cycles for a full-rate instruction, 8 cycles for a half-rate one, and so on; exceptions exist). After decoding an instruction, the scheduler in the CU "knows" the throughput number for the instruction and pauses instruction issue on the respective port for the corresponding number of cycles. What happens if the decoder is "told" by a set bit (blown fuse) that the throughput is different from the one of the physically implemented hardware?
To give a specific example, the consumer parts for Hawaii (which is physically a half rate double precision chip) have a fuse blown (or potentially some µcode update done by the firmware) which sets this to 1/8 rate for the DP instructions. The chip effectively pauses the vALU for some cycles after each DP instruction.
Other chips have units which can only do 1/16 DP rate. The respective throughput number attached to the instructions in the decoder reflects that, of course. In other words: units with different DP rates are physically different, also within the same generation. But restricting the throughput downward is pretty easy.
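In pseudo-C, the mechanism described above could look roughly like this (illustrative only; the 4/8/32-cycle numbers are just the full-rate, half-rate, and 1/8-rate issue intervals from the Hawaii example):
```c
#include <stdbool.h>

enum instr_class { FULL_RATE, DP_INSTR };

/* Issue-pause cycles per instruction class. The fuse (or firmware
 * setting) only changes the number looked up at decode; the datapath
 * itself stays whatever was physically built.                       */
/* Hawaii FirePro: DP at 1/2 rate -> issue a DP op every 8 cycles.   */
/* Hawaii consumer (fuse blown): DP at 1/8 rate -> every 32 cycles.  */
static int issue_interval(enum instr_class c, bool dp_fuse_blown)
{
    switch (c) {
    case DP_INSTR:  return dp_fuse_blown ? 32 : 8;
    default:        return 4;            /* full-rate vALU op */
    }
}
```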
Or, if that is possible between two settings on one implementation, why not design a configurable sequencer with the additional rates and then reuse it, at least within the same generation?
What sequencer? The one sequencing instructions from the 10 instruction buffers within a CU? As I said, the CUs themselves are of course reused. If you follow the logic above, no real hardware change is needed to allow for different execution rates of specific instructions (basically a change to some kind of LUT with the throughput numbers is enough, probably at least partially hardwired).
If you talk about something in the command processor/work distribution, these parts don't know anything at all about execution rates.
Perhaps I am overestimating how much AMD is willing to economize effort. I see benefits in having blocks with a certain size or expandability defined once, and then sharing them across the generation and semicustom products with some of the basic questions and interfaces fixed. (edit: And for purposes of further saved effort or continuity, some of those might carry across gens)
It's very much about economizing the effort. But you can't just build a large GPU out of the design of a small CP, some compute pipe, a CU, an RBE, and an L2 tile with a connected 32-bit memory channel. All of these blocks have somewhat of a standardized interface to the outside (at least within a generation), so one can multiply these blocks, and AMD is indeed doing that. But this stuff needs to get interconnected and made to work together. That is the hard part of scaling to a different chip size. And while AMD wants to make it easier for themselves in future iterations, so far you have to redo quite a bit of it for each different chip.
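To put it another way (purely illustrative, nothing here is an actual AMD interface): the per-chip "recipe" is mostly counts of reused blocks, and the things you cannot express as a count are exactly the parts that get redone for each chip.
```c
/* The reused blocks show up as simple multipliers ...            */
struct gpu_config {
    int shader_engines;   /* reused SE front end, multiplied       */
    int cus_per_se;       /* reused CU block, multiplied           */
    int rbes;             /* reused RBE block, multiplied          */
    int l2_tiles;         /* reused L2 tile + 32-bit channel       */
};
/* ... while the L1<->L2 crossbar, the work distribution and all
 * the arbitration have to be sized to those counts and cannot be
 * captured by a struct like this. That is the rework per chip.   */
```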