Those HBM GPUs have power-of-two numbers of slices and channels, though, so sharing a slice is relatively easy.

The general rule is a slice per channel, though there's precedent for this not being the case. The Xbox One X is one example, as are the 4-stack HBM GPUs (Fury, Radeon VII, Arcturus), going by the driver values for the number of texture channel caches, which has been referenced elsewhere as representing the number of slices.
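As a quick illustration of why power-of-two counts make this cheap, here's a minimal sketch (the 256B stripe size, the 16-channel count, and the function names are all assumptions for illustration, not taken from any driver): the channel and slice index fall out of a bitmask on the address rather than a divide.

#include <stdint.h>

/* Assumed values for illustration: 256B stripes, 16 channels
 * (power of two), one L2 slice per channel. */
#define STRIPE_BYTES  256u
#define NUM_CHANNELS  16u

/* Power-of-two channel count: the index is just a bit field of the
 * address, no divider needed. */
static inline uint32_t channel_of(uint64_t addr)
{
    return (uint32_t)((addr / STRIPE_BYTES) & (NUM_CHANNELS - 1));
}

/* Slice per channel: the slice index is the same bit field. */
static inline uint32_t slice_of(uint64_t addr)
{
    return channel_of(addr);
}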
Having 20 slices on its own should be fine. The odd data point is the supposed leak of certain architectural values for the big RDNA2, which lists the count at 16.
16 slices to 20 channels would require, say, groups of 4 L2 slices bound to 5 channels (16 and 20 divided by their GCF of 4), each group using a local 4:5 crossbar so that memory blocks can be striped at the same granularity (256B?).
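Purely to illustrate that arithmetic, here's a rough sketch under assumed parameters (256B stripes, and a guessed round-robin slice selection within each group; none of this is confirmed): 20 channels split into 4 groups of 5, each group fronted by 4 L2 slices through a local 4:5 crossbar.

#include <stdint.h>

/* Leaked counts plus assumed parameters, for illustration only. */
#define STRIPE_BYTES     256u
#define NUM_CHANNELS     20u
#define NUM_SLICES       16u
#define GROUPS           4u                        /* GCF of 16 and 20 */
#define CHANNELS_PER_GRP (NUM_CHANNELS / GROUPS)   /* 5 */
#define SLICES_PER_GRP   (NUM_SLICES / GROUPS)     /* 4 */

/* Stripe the address space across all 20 channels. */
static inline uint32_t channel_of(uint64_t addr)
{
    return (uint32_t)((addr / STRIPE_BYTES) % NUM_CHANNELS);
}

/* Each group of 4 slices covers 5 channels through a local 4:5
 * crossbar.  How a stripe picks a slice within its group is a guess;
 * rotating on higher address bits is one option that lets every
 * channel reach all 4 slices of its group. */
static inline uint32_t slice_of(uint64_t addr)
{
    uint64_t stripe = addr / STRIPE_BYTES;
    uint32_t group  = (uint32_t)(stripe % NUM_CHANNELS) / CHANNELS_PER_GRP;
    uint32_t local  = (uint32_t)((stripe / NUM_CHANNELS) % SLICES_PER_GRP);
    return group * SLICES_PER_GRP + local;
}

The % 20 is the part that costs real hardware; it's the divider that the power-of-two cases above avoid.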
That's how I read it too. The whitepaper also said:

"Accesses from any of the L0 caches (instruction, scalar, or vector data) proceed to the graphics L1."

"a write to any line in the graphics L1 will invalidate that line and hit in the L2 or memory. There is an explicit bypass control mode so that shaders can avoid putting data in the graphics L1."

"Each shader array comprises 10-20 different agents that request data, but from the perspective of the L2 cache, only the graphics L1 is requesting data."

My interpretation is that the L1 cache controller evaluates requests and passes misses on to the L2. The various modes that bypass the L1 don't seem to bypass the controller; they just control whether the L1's storage will be used to service the request or whether it needs to invalidate data at the same time. Skipping the L1 means the cache itself isn't used, but the controlling logic would be using the same paths to get to the L2.
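To make that interpretation concrete, here's a minimal sketch (all names and the stub cache are mine, not AMD's) of a controller that sits on the path to the L2 in every mode: bypass only decides whether the L1 array is consulted, and writes still invalidate any matching L1 line.

#include <stdbool.h>
#include <stdint.h>

/* All names here are illustrative, not AMD's. */
typedef struct { uint64_t addr; bool is_write; bool bypass_l1; } Request;

/* Stubs standing in for the L1 array and the L2 interface. */
static bool l1_lookup(uint64_t addr)        { (void)addr; return false; }
static void l1_fill(uint64_t addr)          { (void)addr; }
static void l1_invalidate(uint64_t addr)    { (void)addr; }
static void forward_to_l2(const Request *r) { (void)r; }

/* Sketch of the graphics L1 controller: every request from the L0s
 * goes through it, whether or not the L1 storage itself is used. */
static void gl1_controller(const Request *r)
{
    if (r->is_write) {
        /* Writes invalidate any matching L1 line and go on to L2/memory. */
        l1_invalidate(r->addr);
        forward_to_l2(r);
        return;
    }

    if (r->bypass_l1) {
        /* Bypass mode: skip the L1 array, but the controller still
         * uses the same path to reach the L2. */
        forward_to_l2(r);
        return;
    }

    if (!l1_lookup(r->addr)) {
        forward_to_l2(r);   /* miss: fetch the line from the L2 */
        l1_fill(r->addr);   /* and keep a copy in the L1 array  */
    }
}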