Not sure I get what you guys are talking about, but it seems to me the Xbox Series X does indeed have a bandwidth bottleneck if both pools of memory are accessed at the same time.
The problem lies in the memory configuration. The Xbox has 10 memory chips: four 1 GB and six 2 GB. To get two pools, one with 10 GB at 560 GB/s (320-bit) and one with 6 GB at 336 GB/s (192-bit), the layout must be the four 1 GB modules accessed at 32 bits each, plus the first 1 GB of each of the six 2 GB modules, also accessed at 32 bits.
With five 64-bit controllers, this gives you 320-bit access to all of these chips, each providing 56 GB/s, so 10 chips with 1 GB each equals 10 GB at 560 GB/s.
Now for the other pool, you need to access the extra 1 GB on the 2 GB modules. Since each is connected over a 32-bit bus, and there are six modules, that's a 192-bit bus... which equates to 6 GB at 336 GB/s.
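A quick sanity check of those numbers (assuming 14 Gbps GDDR6 pins and a 32-bit link per chip, as quoted for the Series X):

```python
# Per-chip bandwidth: 14 Gbps per pin * 32 pins / 8 bits per byte = 56 GB/s
pin_rate_gbps = 14
pins_per_chip = 32
per_chip_gbs = pin_rate_gbps * pins_per_chip / 8

fast_pool_gbs = 10 * per_chip_gbs  # all ten chips contribute: 560 GB/s
slow_pool_gbs = 6 * per_chip_gbs   # only the six 2 GB chips: 336 GB/s
print(per_chip_gbs, fast_pool_gbs, slow_pool_gbs)  # 56.0 560.0 336.0
```

This is just the peak-rate arithmetic; it says nothing yet about what happens when both pools want the shared chips at once.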
The big problem is that you are counting the same 32-bit channel on the 2 GB modules towards both pools. That's fine for quoting each pool's maximum bandwidth, but it doesn't work like that in reality, since it's the same bus in both cases. If you are using those 32 bits for one pool, you cannot be using the same 32-bit channel for the other.
So to access both pools at full 32 bits, the simple choice is to do it on alternate clock cycles. That is roughly the same as reducing the bus width to 16 bits for each pool and accessing both at the same time.
Since the 1 GB modules are free from this, they will still provide 224 GB/s in total. But the 2 GB modules will provide half, reducing the 10 GB pool's bandwidth to 392 GB/s, and the 6 GB pool's to 168 GB/s.
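Those worst-case figures come out of a simple model, assuming a strict 50/50 alternation on the six shared chips (my assumption here, not confirmed hardware behaviour):

```python
per_chip_gbs = 56  # GB/s per GDDR6 chip (14 Gbps on a 32-bit link)

# The six 2 GB chips serve both pools; under strict alternation each
# pool effectively sees half of each shared chip's bandwidth.
shared_half_gbs = per_chip_gbs / 2  # 28 GB/s per pool, per shared chip

# 10 GB pool: four dedicated 1 GB chips at full rate
# plus six shared chips at half rate.
fast_pool_gbs = 4 * per_chip_gbs + 6 * shared_half_gbs
# 6 GB pool: six shared chips at half rate only.
slow_pool_gbs = 6 * shared_half_gbs
print(fast_pool_gbs, slow_pool_gbs)  # 392.0 168.0
```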
I really don't know how this can be solved... Any ideas?
This is the first time I've ever heard of a 192-bit bus. You would draw lanes to both chips, but certainly not 16 bits to each chip. Why would you do that?
Microsoft's solution for the memory sub-system saw it deliver a curious 320-bit interface, with ten 14gbps GDDR6 modules on the mainboard - six 2GB and four 1GB chips. How this all splits out for the developer is fascinating.
"Memory performance is asymmetrical - it's not something we could have done with the PC," explains Andrew Goossen. "10 gigabytes of physical memory [runs at] 560GB/s. We call this GPU optimal memory. Six gigabytes [runs at] 336GB/s. We call this standard memory. GPU optimal and standard offer identical performance for CPU audio and file IO. The only hardware component that sees a difference is the GPU."
Nothing here indicates that the speeds of the memory change at all.
I thought this was straightforward:
Same pool of memory, different chip sizes. Slow vs. fast is really just a question of whether you're pulling from 10 chips or from 6.
There are 10 chips in total, each with 56 GB/s of bandwidth over its 32-bit link, for a 320-bit bus overall.
56 * 10 = 560 GB/s
Bandwidth is the total size of the pipe; in this case, it's roughly the total amount of data you can pull at once.
Of the 10 chips, 6 of them are 2 GB in size.
56 * 6 = 336 GB/s
If your data is on the 2nd half of the 2 GB chips, you will get 336 GB/s, because you only have 6 chips to pull that data from. I don't care how data is stored on the 2 GB chips; each one will always be able to return 32 bits of data per clock cycle. Whether it's 32 bits to each 1 GB half, or 16 bits to both halves with the data split, whatever the case, it's returning 32 bits through the memory controller every single time.
But you still have bus lanes open on the remaining 4 chips; just because you're accessing the back half of those 2 GB chips doesn't mean the other lanes are closed off.
So you can still pull 56 * 4 = 224 GB/s from the remaining 1 GB chips, and adding these together, you're back to 560 GB/s.
There is no averaging of memory
There is no split pool.
Your only downside is if you put _all_ of your data on the six 2 GB chips; then you're limited to a bandwidth of 336 GB/s, because you'll grab data from one half and, if you need data from the other half, you'll have to alternate. But that can be handled by priority, and it doesn't stop developers from fully utilizing all the lanes to achieve 560 GB/s.
Regardless of whether you are alternating or not, those 6 chips will constantly be giving out 336 GB/s.
And regardless of whether you are alternating on the 6x2 GB chips, you still have the 4x1 GB chips ready to go, giving you a total of 560 GB/s whenever all 10 chips are being utilized.
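The arithmetic behind this view can be checked the same way. The key assumption (mine, paraphrasing the argument above) is that each 2 GB chip moves its full 32 bits per cycle no matter which half of its address space a transfer targets:

```python
per_chip_gbs = 56  # GB/s per GDDR6 chip (14 Gbps on a 32-bit link)

# Alternation only decides *which* request a shared chip serves on a
# given cycle, not how many bits it moves, so each chip stays at 56 GB/s.
shared_total_gbs = 6 * per_chip_gbs     # six 2 GB chips: 336 GB/s
dedicated_total_gbs = 4 * per_chip_gbs  # four 1 GB chips: 224 GB/s

system_total_gbs = shared_total_gbs + dedicated_total_gbs
print(shared_total_gbs, dedicated_total_gbs, system_total_gbs)  # 336 224 560
```

On this reading, the 336 GB/s figure is a ceiling on the shared chips, not a penalty subtracted from the system total.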
This should not be treated like two separate RAM pools the way the 360 or Xbox One did it.
Because of the imbalance between CPU and GPU bandwidth demands, perhaps you'll just prioritize the GPU.
While I'm not sure exactly how the prioritization works (edit: they'll prioritize whatever gives the most performance), all GPU data goes into the 1 GB chips first; that's an easy one. Remaining GPU data goes into the first 1 GB of the 2 GB chips. CPU data sits on top of that, along with any GPGPU work that may need to be done.
TL;DR: I don't really see an issue here unless you've always got contention on those six 2 GB chips, and you'd probably have that anyway. No one would be making this argument if it were 10x2 GB chips, yet you'd have the same issue if you tried to pull all the data from the same chips. It would still be 560 GB/s; you'd just have 20 GB of memory. Would you use your current argument to say that 10x2 GB chips are bottlenecked because the data is split across 2 GB chips and now access needs to alternate, or needs some sort of custom controller that trades off bandwidth?