The Navi dot instructions increase throughput over a conventional FMA by acting like packed math whose partial products are then summed into an accumulator. Going by the patent, or by how Nvidia's tensor operations work, the throughput looks like a fraction of what full matrix ops would provide. The lane format, absent a matrix unit, means those dot operations only generate the results along a diagonal of that big matrix.
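As a rough illustration, here's my approximation of what one lane of a packed dot instruction like Navi's V_DOT4_I32_I8 computes; the instruction name is from the RDNA ISA, but the C model below is a sketch of its semantics, not a definitive reference:

```c
#include <stdint.h>

/* Per-lane model of a dot4-with-accumulate: four packed int8 pairs
 * are multiplied, their products summed, and the sum folded into a
 * 32-bit accumulator, all in one instruction slot. A conventional
 * FMA in the same slot would retire a single multiply-add. */
static int32_t dot4_i32_i8(uint32_t a, uint32_t b, int32_t acc)
{
    for (int i = 0; i < 4; i++) {
        int8_t ai = (int8_t)(a >> (8 * i));  /* extract byte i of a */
        int8_t bi = (int8_t)(b >> (8 * i));  /* extract byte i of b */
        acc += (int32_t)ai * (int32_t)bi;
    }
    return acc;
}
```

Each lane still produces only one scalar result per issue, so mapped onto a matrix multiply, a SIMD's worth of dot instructions covers one output element per lane rather than the full tile a dedicated matrix unit would fill.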
The AMD scheme is more consistent with Vega, as the vector unit is 16 wide, and the new hardware may line up with code referencing new instructions and registers for Arcturus. One other indication that this is different: a Navi dot instruction takes up a normal vector instruction slot, since it executes in the same block, whereas Arcturus and this matrix method would allow at least some normal vector traffic in parallel.
The scenario where the system spends half of every second in the slow pool requires something in the OS, an app, or a game resource placed in the slow section to demand 168 GB/s of bandwidth, i.e. half of that pool's 336 GB/s.
There is some impact because of the imbalance, but it scales with what percentage of the memory access mixture goes to the slow portion. If a game did that, it would likely be considered a mis-allocation. A background app would likely be inactive or prevented from sustaining anything like that access rate, and the OS gets by with a minority share of the tens of GB/s in normal PCs without issue.
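As a back-of-the-envelope check on those numbers (the per-channel figure of 56 GB/s and the six-channel slow pool are my assumptions, derived from the 336 GB/s figure):

```c
#include <stdio.h>

/* Assumed configuration: ten 56 GB/s GDDR6 channels, with the slow
 * region living on six of them (6 * 56 = 336 GB/s). */
#define SLOW_POOL_GBPS 336.0

/* Fraction of each second the slow pool's channels spend serving a
 * client that demands `gbps` from the slow region. */
static double busy_fraction(double gbps)
{
    return gbps / SLOW_POOL_GBPS;
}

int main(void)
{
    /* The half-a-second scenario: 168 GB/s is exactly half the pool. */
    printf("168 GB/s -> %.2f of each second\n", busy_fraction(168.0));
    /* A 1% OS share of 560 GB/s total traffic is far milder. */
    printf("5.6 GB/s -> %.3f of each second\n", busy_fraction(5.6));
    return 0;
}
```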
I can see the OS sporadically interacting with shared buffers for signalling purposes, or copying data from secured memory to a place where the game can use it, but that's on the order of things like networking or the less-than-10 GB/s disk IO.
If the GDDR6 chips were all the same capacity, there would still be a "pool" for the OS and apps, since their accesses wouldn't be going to the game. The individual controllers would see some percentage of accesses that the game cannot use. Let's say 1% goes to the OS, or 5.6 GB/s. The game then experiences a bandwidth bottleneck if it needs something like 555 GB/s in that given second. And if there's a set of code, sound data, or rarely accessed textures that the current game scene doesn't touch until the user hits a specific button or action, finally triggering that action mid-game blocks the other functions' accesses for some number of cycles.
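Putting numbers on that symmetric case (sketch only; the 560 GB/s total from ten 56 GB/s channels is my assumption):

```c
#include <stdio.h>

int main(void)
{
    double total_gbps = 10 * 56.0;      /* 560 GB/s across ten channels */
    double os_gbps = 0.01 * total_gbps; /* 1% to the OS -> 5.6 GB/s */
    /* Whatever the OS takes is simply unavailable to the game, so the
     * game bottlenecks once its demand nears the remainder. */
    printf("left for the game: %.1f GB/s\n", total_gbps - os_gbps);
    return 0;
}
```

That prints about 554.4 GB/s, which is where the "something like 555 GB/s" figure comes from.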
With the non-symmetric layout, the OS or slow-pool traffic pushes some of that percentage onto the channels attached to the 2 GB chips.
Going by the 1% scenario, the six controllers would need to find room for the 40% of the OS traffic that cannot be striped across the smaller chips (the four 1 GB chips hold none of the slow region), or 40% of 5.6 GB/s. The 336 GB/s pool would be burdened with an extra 2.24 GB/s.
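And the asymmetric case worked through the same way (again assuming the ten-channel split of six 2 GB and four 1 GB chips):

```c
#include <stdio.h>

int main(void)
{
    double os_gbps = 5.6;   /* 1% of 560 GB/s going to the OS */
    /* Four of the ten channels (the 1 GB chips) hold no slow-pool
     * data, so the 4/10 of the OS traffic that would have striped
     * onto them folds back onto the six 2 GB-chip channels. */
    double displaced = os_gbps * 4.0 / 10.0;
    printf("extra load on the 336 GB/s pool: %.2f GB/s\n", displaced);
    return 0;
}
```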
Unless something in the slow pool is demanding a significant fraction of the bandwidth, and I don't know what functionality other than the game's renderer needs bandwidth on that order of magnitude, I can see why Microsoft saw it as a worthwhile trade-off.
If a game put one of its most heavily used buffers in the slow pool, I think the assumption is that the developers would find a way to move it out of there or scale it back so it'd fit elsewhere.