Design a distributed memory controller so that each GPU in a multi-chip module could treat its nearby HBM2 stack as a separate memory bank in a multi-channel configuration. I believe something like this has been implemented in the UltraPath Interconnect protocol for LGA 3647 socket Xeon Gold/Platinum, as well as the Xeon Phi x200 processors (the socketed versions are not cancelled, unlike the PCIe 'accelerator' boards).
Beneficial uses of that functionality depend on optimizing for locality and avoiding excessive transfers between chips. There are even modes that sub-divide the LLC domains on-chip so that quadrants of the chip cache specific memory channels, which suggests that sub-optimal access patterns cannot always be hidden, although which mining algorithms would notice this specifically isn't clear. That class of hardware tends to be uncommon and doesn't seem to overlap much with the range of workloads that are snapping up so many discrete boards.
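To make the locality argument concrete, here's a toy model (Python) of effective bandwidth when some fraction of accesses has to cross to another chip's memory. The 256 GB/s local and 64 GB/s remote figures are assumptions for illustration, not the specs of any shipping part:

```python
# Toy model of effective memory bandwidth on a multi-chip/NUMA package.
# All numbers are illustrative assumptions, not measurements.

LOCAL_BW_GBPS = 256.0   # assumed per-chip HBM2 bandwidth
REMOTE_BW_GBPS = 64.0   # assumed inter-chip link bandwidth

def effective_bandwidth(local_fraction: float) -> float:
    """Harmonic blend: time per byte is a weighted sum of local and remote costs."""
    remote_fraction = 1.0 - local_fraction
    time_per_gb = local_fraction / LOCAL_BW_GBPS + remote_fraction / REMOTE_BW_GBPS
    return 1.0 / time_per_gb

for f in (1.0, 0.9, 0.5, 0.25):
    print(f"{f:>4.0%} local -> {effective_bandwidth(f):6.1f} GB/s effective")
```

Even 10% remote traffic drops the effective figure from 256 to roughly 197 GB/s in this model, which is why the locality optimization matters so much.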
Miners work within the limited number of PCIe lanes (36-48) available in current processors to control as many video cards as possible. AMD's Ryzen Threadripper, on the other hand, has 64 lanes and EPYC has 128, so they can theoretically support a large number of x8/x16 PCIe slots.
Even systems with far more limited PCIe don't display much sensitivity to PCIe bandwidth: lane counts per GPU can be much narrower than that, and miners accept plugging PCIe 3.0 cards into PCIe 2.0 slots. That suggests the expansion bus is far down the list of priorities for a mining-targeted product.
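A quick lane-budget sketch, assuming miners accept narrow links (down to x1 over risers, which is common practice) and taking 44 lanes as a representative Intel HEDT count:

```python
# Back-of-the-envelope lane budgeting for mining rigs. The interesting
# ratio is total lanes / lanes-per-GPU; the 44-lane figure is an assumed
# representative Intel HEDT count within the 36-48 range mentioned above.

PLATFORM_LANES = {
    "Intel HEDT (assumed)": 44,
    "Threadripper": 64,
    "EPYC": 128,
}

for platform, lanes in PLATFORM_LANES.items():
    for width in (16, 8, 1):
        print(f"{platform:>21}: {lanes // width:3d} GPUs at x{width}")
```

At x1 per card, even the Intel count supports far more GPUs than any motherboard has slots for, which again points to the bus not being the constraint.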
If your video card doesn't break, it keeps working, and you don't have to spend any money or effort to replace it.
What is the level of demand in this scenario? If it's like right now, where buyers are paying vastly above list price even for cards with measurable deficits versus competing options and retailers consistently have minimal stock, it comes down to whether a card can be profitable at all rather than whether it is better than competition that is likely unavailable.
If Ethereum mining is really memory-bandwidth limited, the only thing you can reasonably do is actually increase memory bandwidth.
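As a rough sanity check: Ethash reads 64 pseudo-random 128-byte pages from the DAG per hash, so memory bandwidth divided by that per-hash traffic gives a hard ceiling on hashrate. The efficiency factor and the board bandwidths in this sketch are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope Ethash hashrate if memory bandwidth is the only limit.
# Ethash reads 64 DAG pages of 128 bytes per hash = 8192 bytes of random access.

BYTES_PER_HASH = 64 * 128  # 8 KiB of DAG traffic per Ethash hash

def peak_hashrate_mhs(bandwidth_gbps: float, efficiency: float = 0.85) -> float:
    """MH/s given bandwidth in GB/s and an assumed achievable fraction of
    peak bandwidth for small random reads (the 0.85 is a guess)."""
    return bandwidth_gbps * 1e9 * efficiency / BYTES_PER_HASH / 1e6

# Assumed board bandwidths for illustration (GB/s):
for name, bw in [("256-bit GDDR5 @ 8 Gbps", 256), ("HBM2, two stacks", 484)]:
    print(f"{name}: ~{peak_hashrate_mhs(bw):.0f} MH/s upper bound")
```

The ~27 MH/s the model predicts for a 256 GB/s card lands close to what such cards actually achieve, which supports the bandwidth-limited reading.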
The GPU's power consumption influences the thermal/power budget available for overclocking memory, or how many additional cards can be added, at least around the apparent optimum for rigs targeting a workload like Ethereum in its current proof-of-work incarnation.
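A toy illustration of that rig-level tradeoff, with entirely hypothetical per-card numbers:

```python
# Sketch of the rig-level tradeoff: a fixed wall-power budget divided between
# per-card power and hashrate. All tuning points below are hypothetical.

RIG_BUDGET_W = 1200  # assumed usable PSU budget

# (label, assumed watts per card, assumed MH/s per card)
CARDS = [
    ("stock", 150, 28.0),
    ("undervolted, memory overclocked", 110, 30.0),
]

for label, watts, mhs in CARDS:
    n = RIG_BUDGET_W // watts
    print(f"{label}: {n} cards, {n * mhs:.0f} MH/s total, "
          f"{mhs / watts * 1000:.0f} kH/s per watt")
```

In this made-up example, shaving per-card power fits two extra cards into the same budget and raises total hashrate, which is the sense in which GPU power consumption matters even for a memory-bound workload.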
Multi-chip may increase the apparent bandwidth of the package, but it's not local in the same manner as a single chip. Ethash's access patterns are intended to exceed on-chip storage to frustrate ASICs, and they scale poorly across inter-chip transfers.
For that class of algorithm, I am unclear how significant the difference is between two independent GPUs running independent payloads in parallel versus trying to unify them over one DAG using a bus with non-zero power cost and sub-optimal access patterns.
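To put numbers on the unified-DAG case: if the DAG is striped across n chips and reads are uniformly random, (n-1)/n of them land in another chip's memory. Extending the earlier locality model, with the same hypothetical bandwidth figures plus an assumed energy cost per bit crossing the link:

```python
# Toy model: one DAG striped across n chips, uniformly random Ethash reads.
# A fraction (n-1)/n of reads are remote, throttled by the inter-chip link,
# which also burns energy per bit moved. All figures are assumptions.

LOCAL_BW = 256.0       # GB/s of HBM2 per chip (assumed)
LINK_BW = 64.0         # GB/s of inter-chip link per chip (assumed)
LINK_PJ_PER_BIT = 5.0  # assumed energy cost of moving a bit between chips

def model(n_chips: int):
    remote = (n_chips - 1) / n_chips
    time_per_gb = (1 - remote) / LOCAL_BW + remote / LINK_BW
    per_chip_bw = 1.0 / time_per_gb
    link_watts = per_chip_bw * remote * 8 * LINK_PJ_PER_BIT / 1000  # GB/s -> W
    return n_chips * per_chip_bw, n_chips * link_watts

for n in (1, 2, 4):
    agg_bw, watts = model(n)
    print(f"{n} chip(s): {agg_bw:6.1f} GB/s aggregate "
          f"(vs {n * LOCAL_BW:6.1f} independent), ~{watts:.1f} W on links")
```

In this model, two chips sharing one DAG deliver well under half the bandwidth of two independent GPUs running separate payloads, and pay a link-power tax on top, which is why unifying them looks unattractive.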
Ethereum's algorithm didn't seem to anticipate large numbers of discrete GPUs being run together, however.
Equihash was mentioned earlier, and it seems to have adjusted the balance of arithmetic and bandwidth requirements. From my early reading on the topic, that proof-of-work algorithm appears to try to correct that oversight by being more sensitive to memory capacity, which may not favor HBM if that is the next major ASIC-resistant target.
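A hypothetical illustration of why a capacity-bound algorithm could blunt HBM's advantage: if each solver instance pins a fixed working set, the gigabytes on the board cap parallelism before bandwidth does. Every number below is made up for the example:

```python
# Hypothetical capacity-sensitive proof of work: each solver instance pins
# a fixed working set, so board capacity, not bandwidth, caps parallelism.

WORKING_SET_GB = 1.0  # assumed per-instance footprint

# (label, capacity in GB, bandwidth in GB/s) -- illustrative only
CARDS = [("8 GB GDDR5", 8.0, 256.0), ("8 GB HBM2", 8.0, 484.0)]

for name, capacity_gb, bandwidth_gbps in CARDS:
    instances = int(capacity_gb // WORKING_SET_GB)
    print(f"{name} ({bandwidth_gbps:.0f} GB/s): {instances} concurrent instances")
```

Both cards support the same instance count despite the HBM2 part's near-2x bandwidth, which is the sense in which a capacity-bound algorithm may not favor HBM.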
This is similar: Nvidia's 'professional' products do not offer much additional value over consumer cards for a considerably higher price.
This is likely a profit-maximization move, although I think Nvidia is restricting purchase quantities for the cards in question.
I think that points to a continuum, with non-datacenter customers at a given price point and datacenter and mining customers operating at higher ones. Possibly the datacenter customers pay the most, but I don't think Nvidia would carve out the mining segment if it were indistinguishable, revenue-wise, from the individual use case. The rumors of more direct sales to miners might figure into it.