AMD: Navi Speculation, Rumours and Discussion [2019-2020]

The general rule is a slice per channel, though there's precedent for this not being the case. The Xbox One X is one example, as are the 4-stack HBM GPUs (Fury, Radeon VII, Arcturus), going by the driver values for the number of texture channel caches, which were referenced elsewhere as representing the number of slices.

Having 20 slices on its own should be fine. The odd data point is the supposed leak of certain architectural values for the big RDNA2, which lists the count at 16.
Those HBM GPUs have power-of-two numbers in slices & channels though. So sharing a slice is relatively easy.

16 slices to 20 channels would require, say, groups of 4 L2 slices bound to 5 channels (20 / GCF), each of which uses a local 4:5 crossbar so memory blocks can be striped at the same granularity (256B?).
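To make that hypothetical concrete, here's a toy Python mapping for such a layout. The 256B stripe, the 4-slice/5-channel grouping, and the address hash are all my own assumptions, not anything from AMD:

```python
# Toy sketch of 16 L2 slices fronting 20 channels in four groups,
# each group binding 4 slices to 5 channels via a local 4:5 crossbar.
BLOCK = 256                        # striping granularity in bytes (the 256B guess above)
CHANNELS = 20                      # memory channels
SLICES = 16                        # L2 slices
GROUPS = 4                         # gcd(16, 20): four 4-slice / 5-channel groups
CH_PER_GROUP = CHANNELS // GROUPS  # 5
SL_PER_GROUP = SLICES // GROUPS    # 4

def route(addr):
    """Return (group, channel, l2_slice) for a byte address."""
    block = addr // BLOCK
    channel = block % CHANNELS            # stripe 256B blocks across all 20 channels
    group = channel // CH_PER_GROUP       # which 4-slice/5-channel group owns it
    # Local 4:5 crossbar: over successive passes through the 20 channels, each
    # of the group's 4 slices ends up talking to all 5 of its channels.
    slice_in_group = (block // CHANNELS) % SL_PER_GROUP
    return group, channel, group * SL_PER_GROUP + slice_in_group

for addr in range(0, 8 * BLOCK, BLOCK):
    print(hex(addr), route(addr))
```

The only hard requirement is that the home slice is a pure function of the address, so every requester resolves a given line to the same slice; the crossbar just lets any of the group's 4 slices reach any of its 5 channels.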

My interpretation is that the L1 cache controller evaluates requests and passes misses on to the L2. The various modes that bypass the L1 don't seem to bypass the controller; they just control whether the L1's storage will be used to service the request, or whether it needs to invalidate data at the same time. Skipping the L1 means the cache array itself isn't used, but the controlling logic would still use the same paths to get to the L2.
That's how I read it too. The whitepaper also said:

Accesses from any of the L0 caches (instruction, scalar, or vector data) proceed to the graphics L1.
a write to any line in the graphics L1 will invalidate that line and hit in the L2 or memory. There is an explicit bypass control mode so that shaders can avoid putting data in the graphics L1.
Each shader array comprises 10-20 different agents that request data, but from the perspective of the L2 cache, only the graphics L1 is requesting data.
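For what it's worth, here's how I'd model that flow in a few lines of Python. It's a toy with made-up class names, not RDNA hardware, just to show the bypass mode skipping allocation rather than skipping the controller:

```python
# Toy model of the reading above: L0 requests always pass through the graphics-L1
# controller; "bypass" only decides whether the L1 array holds the data.
class L2:
    def __init__(self):
        self.store = {}
    def read(self, tag):
        return self.store.get(tag, 0)
    def write(self, tag, data):
        self.store[tag] = data

class GraphicsL1:
    def __init__(self, l2):
        self.l2 = l2
        self.lines = {}                   # tag -> data; capacity/associativity ignored

    def read(self, tag, bypass=False):
        if not bypass and tag in self.lines:
            return self.lines[tag]        # L1 hit
        data = self.l2.read(tag)          # miss or bypass: same path down to the L2
        if not bypass:
            self.lines[tag] = data        # allocate only when not bypassing
        return data

    def write(self, tag, data):
        # Per the whitepaper excerpt: a write invalidates the L1 copy and
        # completes in the L2 (or memory).
        self.lines.pop(tag, None)
        self.l2.write(tag, data)

l1 = GraphicsL1(L2())
l1.write(0x40, 7)                    # goes to the L2, nothing kept in the L1
print(l1.read(0x40))                 # miss -> fetched from the L2, now cached
print(l1.read(0x40, bypass=True))    # served via the L2 without touching L1 contents
```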
 
Those HBM GPUs have power-of-two numbers in slices & channels though. So sharing a slice is relatively easy.

16 slices to 20 channels would require, say, groups of 4 L2 slices bound to 5 channels (20 / GCF), each of which uses a local 4:5 crossbar so memory blocks can be striped at the same granularity (256B?).

The Xbox One X is an example of having a power of two number of cache slices mapped to a non-power of two number of channels.
8 L2 slices were connected to 12 channels. There were four main controller clusters covering eight of the channels (a power of two), and the remaining 4 channels were split into pairs that were each shared by two clusters.
https://en.wikichip.org/wiki/microsoft/scorpio_engine
That decision could easily be a fluke, but a similar scheme could happen with the Series X. Four controller clusters each hosting 4 channels, then the remaining four could be split into two pairs that each straddle two main clusters.
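A quick Python sketch of that grouping, in case it helps. The helper and its parameters are my own construction; only the channel counts come from the wikichip page and the Series X guess above:

```python
# Sketch of the channel-to-cluster grouping described above (topology only).
def grouping(clusters, dedicated_per_cluster, shared_pairs):
    """Return {cluster: [channels]}: dedicated channels first, then the shared
    pair that straddles this cluster and its neighbour."""
    layout = {}
    ch = 0
    for c in range(clusters):
        layout[c] = list(range(ch, ch + dedicated_per_cluster))
        ch += dedicated_per_cluster
    # Remaining channels form pairs, each shared by two adjacent clusters.
    for p in range(shared_pairs):
        pair = [ch, ch + 1]
        ch += 2
        layout[2 * p] += pair          # straddles clusters 2p and 2p+1
        layout[2 * p + 1] += pair
    return layout

# Xbox One X (Scorpio): 4 clusters x 2 dedicated channels + 2 shared pairs = 12 channels.
print(grouping(4, 2, 2))
# Series X speculation above: 4 clusters x 4 dedicated channels + 2 shared pairs = 20 channels.
print(grouping(4, 4, 2))
```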

The data fabric has crossbar nodes, although the broad GPU bus makes the overall interconnect a mesh. This might simplify the sharing arrangement, since the network could naturally route packets along the mesh. The 2x wider fabric from Renoir may also leave more slack, since it can deliver roughly twice the bytes per clock that a single channel can.

If it were Vega there might be further reason to do this, due to possible L2 alignment issues with the power-of-two count of RBEs. Flushes could still occur if the L2 slices and ROP data did not align naturally. Perhaps Navi's L1 removes this concern, although I did see speculation that the 4-banked L1 would have a bank per set of L2 slices, and 20 slices would indicate either another bank or an expansion of the L1-L2 interconnect.
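Rough numbers for that last point. The 4-bank figure and the 4-slices-per-bank grouping are assumptions from the speculation above; this is just the divisibility check:

```python
# Does a 4-banked graphics L1 (one bank per group of L2 slices) still work at 20 slices?
def banks_needed(l2_slices, slices_per_bank=4, l1_banks=4):
    per_bank = -(-l2_slices // l1_banks)                        # ceil: slices each bank must front
    wider_port = per_bank > slices_per_bank                     # existing banks need a wider L2 link
    extra_banks = max(-(-l2_slices // slices_per_bank) - l1_banks, 0)  # or add banks instead
    return per_bank, wider_port, extra_banks

print(banks_needed(16))   # (4, False, 0): 4 banks x 4 slices, nothing changes
print(banks_needed(20))   # (5, True, 1): widen each bank's link, or add a fifth bank
```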
 
Oh man, the combination of solid arch, high unit counts and market timing should result in a good old rumble at the high end
Yeah.
I can't remember the last time AMD checked all of those boxes at the same time.
Arguably Hawaii.
This time they have the full stack (from a meagre APU all the way up to N21) with quick rollouts, so it smells like Evergreens to me.
(could be Raja)
Was him, very recently.
regretting the whole HBM on consumer cards idea?
You just need performance chops to back it up.
 
I believe Raja said, in an Intel video, that he learned HBM was costly, something like that. But I don't remember the word "regret".
 
Were there any leaks about AMD using HBM on consumer cards this gen?
There are some Sienna-related commits to the AMDGPU kernel driver clearly mentioning an HBM interface for that GPU.

Question is... will there be consumer cards based on the Sienna Cichlid GPU or not?
 