So, a question on BCPack... Does "BCPack(ed)" data need to be decompressed before the GPU can use it, or is it a new native GPU format?
Per the Digital Foundry article on the Series X, the BCPack hardware is part of the decompression block paired with the SSD. The GPU would appear to be a consumer of the output of that block.
https://www.eurogamer.net/articles/digitalfoundry-2020-inside-xbox-series-x-full-specs
"Our second component is a high-speed hardware decompression block that can deliver over 6GB/s," reveals Andrew Goossen. "This is a dedicated silicon block that offloads decompression work from the CPU and is matched to the SSD so that decompression is never a bottleneck. The decompression hardware supports Zlib for general data and a new compression [system] called BCPack that is tailored to the GPU textures that typically comprise the vast majority of a game's package size."
The GPU's lossy texture compression formats (the BCn family) would seem to fit within the SSD compression systems both consoles have. Perhaps BCPack has implications for the data formats or compression settings within its payload, but I don't see a benefit in BCPack decompressing all the way to fully uncompressed texture data, since the GPU's routine accesses would choke on that.
The same goes for the PS5, whose decompression block should be a compression layer on top of the lossily-compressed textures the GPU handles natively.
Perhaps BCPack does more to compress these, or it has additional lossy formats?
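To put numbers on why decompressing past the BCn formats would hurt, here's a quick sketch using the standard BCn rates (BC1: 8 bytes per 4x4 block, BC7: 16 bytes per 4x4 block); the 4096x4096 texture size is just an example:

```python
# Footprint of a single 4096x4096 texture under GPU-native block
# compression vs. fully decompressed RGBA8.
W = H = 4096
formats = {
    "BC1   (0.5 B/texel)": 0.5,
    "BC7   (1.0 B/texel)": 1.0,
    "RGBA8 (4.0 B/texel)": 4.0,
}
for name, bytes_per_texel in formats.items():
    mib = W * H * bytes_per_texel / 2**20
    print(f"{name}: {mib:6.1f} MiB")

# BC1 -> 8 MiB, BC7 -> 16 MiB, RGBA8 -> 64 MiB. Expanding to raw data
# would multiply both the VRAM footprint and the bandwidth cost of every
# texture fetch by 4-8x, which is why the GPU samples BCn directly.
```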
They are following multiples of the PS4's CU count, though. Their BC method is probably tied to it for whatever reason. The other option was presumably 64 CUs, which was probably avoided due to cost.
It's also the case that a node transition gives a rough 2x increase in transistor budget, which seems to bias the outcome a bit.
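For concreteness, here's the napkin math behind the CU multiples. The first three clocks are the announced figures; the 64 CU entry is the hypothetical alternative, with a clock that's purely my assumption for illustration:

```python
# GCN/RDNA FLOPS formula: CUs * 64 lanes * 2 ops (FMA) * clock.
def tflops(cus, ghz):
    return cus * 64 * 2 * ghz / 1000

configs = [
    ("PS4      (18 CU @ 0.800 GHz)", 18, 0.800),
    ("PS4 Pro  (36 CU @ 0.911 GHz)", 36, 0.911),
    ("PS5      (36 CU @ 2.230 GHz)", 36, 2.230),
    ("64 CU alternative (assumed 1.6 GHz)", 64, 1.600),
]
for name, cus, ghz in configs:
    print(f"{name}: {tflops(cus, ghz):5.2f} TFLOPS")

# 18 -> 36 -> 36 CUs keeps the BC-friendly multiples, with clock speed
# doing the generational heavy lifting; the wider chip would get there
# at lower clocks but with more silicon.
```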
Thought experiment - assuming that the basic premise of the tweets was correct:
The likely places the Series X could fall behind are: SSD throughput/latency, GPU bandwidth, and GPU front-end (clock speed).
Can you expand on where the Series X falls behind on GPU bandwidth? Is that DRAM bandwidth, or bandwidth elsewhere?
The PS5's clock isn't going to make it win in total bandwidth for any of the per-CU caches.
Perhaps the L1, assuming the Series X GPU didn't adjust its size/bandwidth. Whether it would need to depends on whether the L2's slice count increased to mirror the wider memory bus. In RDNA, the graphics L1 is subdivided to match the number of L2 groups, since there are 4 slices per 64-bit memory controller, and the number of L1 subdivisions matches how many requests it can respond to per clock.
The Series X may have 5 L2 groups to go with its 320-bit bus, in which case the L1 might grow to 5 sections, and thus 5 requests per clock, which would keep it above the PS5.
However, if the Series X doesn't create a 5th L2 group, the L1 and L2 capabilities would be as wide per clock as the probable PS5 arrangement, and then clock speed could have an effect. (See the napkin math below.)
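Putting numbers on those scenarios, under this thread's premise that the graphics L1 returns one 128-byte RDNA cache line per subdivision per clock (the clocks are the announced peaks; everything else here is an assumption):

```python
# L1 bandwidth napkin math: sections * line size * clock.
LINE_BYTES = 128  # RDNA cache line

def l1_gbps(sections, ghz):
    return sections * LINE_BYTES * ghz  # bytes/clock * GHz = GB/s

print(f"PS5, 4 sections @ 2.230 GHz:   {l1_gbps(4, 2.230):7.1f} GB/s")
print(f"XSX, 5 sections @ 1.825 GHz:   {l1_gbps(5, 1.825):7.1f} GB/s")
print(f"XSX, unchanged 4 @ 1.825 GHz:  {l1_gbps(4, 1.825):7.1f} GB/s")
```

Under those assumptions the 5-section case edges out the PS5 (~1168 vs ~1142 GB/s), while an unchanged 4-section L1 falls noticeably behind (~934 GB/s), which is the clock-speed effect described above.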
One possible complication with adding another cache division like that is that the ROP caches are aligned in a specific manner, and some of the no-flush benefits Vega touted for making the ROPs L2 clients didn't hold if there was some kind of misalignment (maybe in an APU case?).
There is also some evidence from a prior product transition that not expanding the internal network to match DRAM can have an impact: benchmarks and memory-intensive tests indicated that Fury didn't always do much better than Hawaii despite its HBM interface, with signs that internal L2 bandwidth didn't scale as expected.
Pretty much. There's no reason for 36 CUs beyond that, and we know devs did target specific CU counts with their code. It seems bizarre that the GPU is so constrained, as on PC we're used to swapping GPUs with differing core counts and everything just working, and it's hard to imagine why devs would still be targeting hardware at so low a level that games can break on otherwise compatible parts. But if you think about it, there's some reason, even if an odd one, to go with 36 CUs, whereas there's no particular reason to go with a really hot, narrow chip. So BC seems the only justification.
That was the alleged reason why the Pro's BC mode only exposed 18 CUs to old software. It didn't stop the Pro from having 36 CUs.