How to understand the 560 GB/s and 336 GB/s memory pools of Series X *spawn*

Status
Not open for further replies.
NOTE: Post corrected, changing some parts as doubts were cleared.

Can someone help me with a doubt, please?

I'm going to lay the thing out as I see it; you guys feel free to correct me if anything is wrong.

Although the use of the word pool is not appropriate, I will use it regardless since it helps a lot in typing.

The Xbox Series X has four 1 GB modules and six 2 GB modules. That's 10 modules.
Each module is connected via two 16-bit channels, for a total of five 64-bit controllers.
The four 1 GB modules have their 2x16-bit channels dedicated. They belong to the 10 GB fast RAM pool, and the four of them have a channel total of 128 bits, supplying up to 224 GB/s.

But the remaining six 2 GB modules divide their capacity. The first 1 GB of each adds to the 4 GB from the 1 GB modules to constitute the 10 GB fast pool. These six chips each have a 2x16-bit channel, so we have a total of 192 bits. Add these to the 128 bits from the four 1 GB modules, and we have the 10 GB, 320-bit pool, with 560 GB/s bandwidth.
Now, the second (upper) 1 GB of each of these 2 GB modules constitutes the 6 GB slower pool. These do not add to anything, so we have a total of a 192-bit channel, with 336 GB/s maximum bandwidth.
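For anyone who wants to check these numbers, here's a quick back-of-the-envelope in Python (assuming 14 Gbps GDDR6, which matches the 56 GB/s-per-module figure):

```python
# Sanity check of the layout described above.
# Assumption: 14 Gbps GDDR6 pins, which matches the 56 GB/s-per-module figure.

PIN_RATE_GBPS = 14        # Gbit/s per data pin
MODULE_BITS = 32          # each module has two 16-bit channels

def bandwidth_gbs(bus_bits):
    """Peak bandwidth in GB/s for a bus of the given width."""
    return bus_bits * PIN_RATE_GBPS / 8

print(bandwidth_gbs(10 * MODULE_BITS))  # 560.0 - the 320-bit fast pool
print(bandwidth_gbs(6 * MODULE_BITS))   # 336.0 - the 192-bit slow pool
print(bandwidth_gbs(4 * MODULE_BITS))   # 224.0 - the four 1 GB modules alone
```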

And so far so good!

Question is: the 2x16-bit channels we have on each of the 2 GB modules cannot be used at the same time on both pools. If both are dedicated to the upper 1 GB on all modules, we do have 336 GB/s on the 6 GB pool, but the 10 GB pool stops receiving any data from these 2 GB modules, since it has no channels dedicated to it. So the fast RAM pool decreases its bandwidth to 224 GB/s.
Both pools still add up to 560 GB/s, even though the bandwidth of each pool varies a lot.

But if we look at the upper 6 GB pool we see that we only have 3.5 GB free. The remaining memory is used by the OS.

Since the demand on this RAM will never be that big (CPU only), there is never a need to dedicate both 16-bit channels to the 6 GB slow pool.
So we can dedicate just one of these 16-bit channels to it, and access both pools at the same time.

In this case we have 392 GB/s on the fast RAM pool, and 168 GB/s on the slow RAM pool.
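A quick sketch of the arithmetic behind this split, under this post's assumption that each 2 GB module dedicates one of its two 16-bit channels to each pool:

```python
# Arithmetic behind the 392/168 split above. Assumption: each 2 GB module
# dedicates one of its two 16-bit channels to each pool, 14 Gbps GDDR6.

CHANNEL_GBS = 16 * 14 / 8   # 28.0 GB/s per 16-bit channel

# Fast pool: both channels on the four 1 GB modules (8 channels),
# plus one channel on each of the six 2 GB modules (6 channels).
fast = (8 + 6) * CHANNEL_GBS
# Slow pool: the remaining channel on each of the six 2 GB modules.
slow = 6 * CHANNEL_GBS

print(fast, slow)   # 392.0 168.0
```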

As I see it there are no other alternatives, unless we can access the upper part of the 2 GB modules separately, and not on all chips at once. Is this possible?

Accepting that it is not, and that access to the upper part of the 2 GB modules requires accessing all six at once, then 392 GB/s on the fast RAM pool and 168 GB/s on the slow RAM pool is what we can count on!

We can keep most accesses on the fast RAM, limiting access to the slow RAM, but even if that increases the bandwidth available on the fast RAM, as soon as there is an access to the slow RAM (and those accesses will need to exist), bandwidth will decrease.
This can lead to uncontrollable stutter in games. And if the CPU is put to work on the slow memory, the bandwidth will vary a lot, and counting on having more than 392 GB/s on the fast memory can lead to sudden performance decreases.

But if not all of the 168 GB/s is used, the rest is simply lost, not added to the fast pool.

Now... will the GPU get 392 GB/s?

Well... no! Not even that! Of course this will depend on the game, but GPU/CPU memory usage can be 70/30 in a more demanding game. And this means 3.5 GB will not be enough for the CPU, and it will need to use some memory in the fast pool, stealing extra bandwidth from the GPU.

Since the Xbox Series X has 52 CUs, the end result (492 GB/s) seems to be worse than the 448 GB/s available on the PS5.

Is there anything wrong with what I'm saying? I see a lot of people saying things like: "If the CPU uses 50 GB/s on the slow RAM, we still have 510 GB/s on the fast RAM for the GPU", but as I'm seeing it, this will not happen. As I see it, 68 GB/s is wasted, since those 3.5 GB will not output more than 100 GB/s, yet we will need to reserve 168 GB/s due to the usage of the six 16-bit channels. As such, if the CPU is working in the upper 6 GB slow memory, the global bandwidth will be 492 GB/s.
As I'm seeing it, only with a very limited usage of those 6 GB can we get more than 392 GB/s on the fast memory pool, effectively approaching the 560 GB/s.

Please feel free to correct me, giving your explanation.

Thank you!
 
@Metal_Spirit Yeah, you're making it too complicated and coming away with the wrong conclusions. I believe it was already explained in the DF article or our discussion about it.
 
Although the use of the word pool is not appropriate, I will use it regardless since it helps a lot in typing.
Pool is a term others have used. It's fine if you're discussing the memory space, or the overall capacity as a characteristic separate from the bus width.

The four 1 GB modules have their 2x16-bit channels dedicated. They belong to the 10 GB fast RAM pool, and the four of them have a channel total of 128 bits, supplying up to 224 GB/s.
The fast pool is a range of memory addresses that when accessed can have a theoretical peak of 560 GB/s bandwidth. Bandwidth depends on channel width and speed, which is independent of the capacity of the chips.
It would be best not to try assigning modules to a given pool. All the GDDR6 modules contain portions of the address range that makes up the fast pool. If they didn't, it would exclude their channels' bandwidth from being usable for the fast pool, and it wouldn't be the fast pool.
The slow pool is the range of storage addresses that are not distributed over all chips because of the capacity difference, and so only the chips that have addresses in that range can supply requests at those addresses.

Question is: the 2x16-bit channels we have on each of the 2 GB modules cannot be used on both pools at the same time.
Channels don't care what pool they are being asked to access from. It's only a question of whether the chip will make a request over them, and for the slow pool the chip will only generate requests on 6 out of 10 chips.
A channel could burst data from the fast pool in one transaction, and then burst data from the slow pool immediately afterwards. This isn't a special restriction for the pool setup, just a result of a channel only being able to transmit data for one burst at a time.
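A toy model of that point, purely illustrative (the one-slot-per-burst timing is made up): the channel just drains its queue in order, regardless of which pool each burst targets.

```python
# Toy model of the point above: a channel transmits one burst at a time
# and does not care which pool the burst's address belongs to. The
# one-slot-per-burst timing is made up for illustration.

from collections import deque

def serve(requests):
    """Drain a channel's request queue in order; returns (slot, pool) pairs."""
    timeline = []
    for slot, (pool, _burst_bytes) in enumerate(requests):
        timeline.append((slot, pool))
    return timeline

q = deque([("fast", 64), ("slow", 64), ("fast", 64)])
print(serve(q))   # [(0, 'fast'), (1, 'slow'), (2, 'fast')] - back to back, no switch penalty
```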

How much bandwidth can these 2.5 GB really output?
The system should be distributing memory spaces evenly across the channels. Any range of addresses long enough to stretch across multiple chips can satisfy as many parallel transactions as there are channels available.
Past the requirement that an address range be distributed across the chips at a granularity far smaller than a GB, there's no point in discussing a given amount of capacity only providing an X amount of bandwidth.
Requesting a stretch of some hundreds of KB from the channels that are associated with addresses in the slow pool will generate enough parallel accesses to give a rate of 448 GB/s. (edit: correction 336, juggled wrong number in calculations)
Requesting a single memory location in the fast pool will top out at 28GB/s.
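A minimal sketch of that point, assuming 14 Gbps GDDR6 and 16-bit channels: achievable bandwidth scales with how many channels the access pattern touches, not with the pool's capacity.

```python
# Sketch of the point above: achievable bandwidth scales with how many
# channels an access pattern touches, not with the pool's capacity.
# Assumption: 14 Gbps GDDR6, 16-bit channels (28 GB/s each).

CHANNEL_GBS = 16 * 14 / 8    # 28.0 GB/s per 16-bit channel
SLOW_POOL_CHANNELS = 6 * 2   # both channels on each of the six 2 GB modules
ALL_CHANNELS = 10 * 2        # every channel holds fast-pool addresses

def peak_gbs(channels_touched):
    """Peak rate for a request stream striped over this many channels."""
    return channels_touched * CHANNEL_GBS

print(peak_gbs(1))                   # 28.0  - a single memory location, one channel
print(peak_gbs(SLOW_POOL_CHANNELS))  # 336.0 - hundreds of KB striped over the slow pool
print(peak_gbs(ALL_CHANNELS))        # 560.0 - the same stretch in the fast pool
```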

As such, it seems to me the 6 GB pool will never use more than about 100 GB/s, since the OS has small bandwidth usage.
The OS probably doesn't need as much bandwidth, but since the game also has part of the pool you cannot apply that logic to the physical distribution of channels.

We can only dedicate one of these 16-bit channels, and access both pools at the same time.
That would limit the OS to a fraction of one GDDR6 chip's capacity. A channel is physically attached to half the storage in a GDDR6 module. It cannot service a request in another chip's address range.
 
Requesting a stretch of some hundreds of KB from the channels that are associated with addresses in the slow pool will generate enough parallel accesses to give a rate of 448 GB/s.
Requesting a single memory location in the fast pool will top out at 28GB/s.

Wouldn't that be 336 GB/s for the slow pool addresses?
 
That would limit the OS to a fraction of one GDDR6 chip's capacity. A channel is physically attached to half the storage in a GDDR6 module. It cannot service a request in another chip's address range.

If this is like you say, then we are fixing accesses at 392 GB/s on the fast memory and 168 GB/s on the slow memory.
There is no 560 GB/s on the fast pool and no 336 GB/s on the slow pool, just 560 GB/s across both!
 
The fast pool is a range of memory addresses that when accessed can have a theoretical peak of 560 GB/s bandwidth.

Yes... in this case the first 1 GB of all modules. 56 GB/s per module, over a 32-bit bus, makes the 560 GB/s.

Bandwidth depends on channel width and speed, which is independent of the capacity of the chips.

Fact!

It would be best not to try assigning modules to a given pool. All the GDDR6 modules contain portions of the address range that makes up the fast pool. If they didn't, it would exclude their channels' bandwidth from being usable for the fast pool, and it wouldn't be the fast pool.

Since the address ranges for the fast memory pool cover the first 1 GB only, the 1 GB modules are excluded from the 6 GB pool, since that's their entire capacity!

The slow pool is the range of storage addresses that are not distributed over all chips because of the capacity difference, and so only the chips that have addresses in that range can supply requests at those addresses.

Yes... the capacity over the 1st GB!

Channels don't care what pool they are being asked to access from. It's only a question of whether the chip will make a request over them, and for the slow pool the chip will only generate requests on 6 out of 10 chips.
A channel could burst data from the fast pool in one transaction, and then burst data from the slow pool immediately afterwards.

Never said otherwise. But when fetching data from one pool they cannot read at the same time from the other pool. So you cannot count the same channel toward both pools at once.

This isn't a special restriction for the pool setup, just a result of a channel only being able to transmit data for one burst at a time.

Fact!

The system should be distributing memory spaces evenly across the channels. Any range of addresses long enough to stretch across multiple chips can satisfy as many parallel transactions as there are channels available.

Past the requirement that an address range be distributed across the chips at a granularity far smaller than a GB, there's no point in discussing a given amount of capacity only providing an X amount of bandwidth.
Requesting a stretch of some hundreds of KB from the channels that are associated with addresses in the slow pool will generate enough parallel accesses to give a rate of 448 GB/s.
Requesting a single memory location in the fast pool will top out at 28GB/s.

Each 16-bit channel will top out at 28 GB/s. Sure!
I did not catch how you reached the 448 GB/s though! The slow pool cannot provide more than 336 GB/s.


That would limit the OS to a fraction of one GDDR6 chip's capacity. A channel is physically attached to half the storage in a GDDR6 module. It cannot service a request in another chip's address range.

Of course you cannot access another chip, I never questioned that! That it is physically attached? I never said otherwise!
BTW, I misunderstood this phrase when replying to iroboto!

But after reading your reply, I seem to have found nothing that contradicts what I wrote! Did I miss your point?
 
No. Wrong again.

Sorry... I misunderstood what was being said.
Indeed my reply made no sense and was wrong!
But as far as what I wrote in the first message goes, from what I understood of 3dilettante's answer, he didn't seem to contradict anything I said! Maybe my wording was not the most correct, but I wasn't saying anything different from what 3dilettante said.
 
Wouldn't that be 336 GB/s for the slow pool addresses?
Correct, I was running some other comparisons at the time of writing and input the wrong number. I added a note in my post.

If this is like you say, then we are fixing accesses at 392 GB/s on the fast memory and 168 GB/s on the slow memory.
There is no 560 GB/s on the fast pool and no 336 GB/s on the slow pool, just 560 GB/s across both!
There's 560 GB/s max for the whole memory subsystem.
Bandwidth is determined by physical channel width and speed, not memory pool.

The range of addresses for the fast pool is distributed across all the chips in the system. The slow pool is distributed across the additional capacity that 6 of the modules have over the other four.
Accesses to all locations can total up to 560 GB/s, in theory. Of that mix, up to 336 could be from accesses to the slow pool.
Both the OS and game can make use of the slow pool, and I'm not entirely sure that all of the fast pool is out of reach of the OS. Even if there's no kernel memory in the fast pool, there could be functions that transfer data into game memory that could be OS-related.
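That mix can be sketched as a simple constraint (figures assume 14 Gbps GDDR6; the function name is just for illustration, not from any real API):

```python
# Sketch of the constraint described above: any mix of traffic is bounded
# by 560 GB/s in total, with slow-pool traffic itself capped at 336 GB/s.
# The function name is illustrative, not from any real API.

TOTAL_GBS = 560.0      # all twenty 16-bit channels
SLOW_CAP_GBS = 336.0   # the twelve channels on the six 2 GB modules

def fast_pool_headroom_gbs(slow_traffic_gbs):
    """Peak fast-pool bandwidth left while the slow pool carries this much."""
    assert 0 <= slow_traffic_gbs <= SLOW_CAP_GBS
    return TOTAL_GBS - slow_traffic_gbs

print(fast_pool_headroom_gbs(0))     # 560.0 - no slow-pool traffic at all
print(fast_pool_headroom_gbs(168))   # 392.0 - the half-and-half case from earlier posts
print(fast_pool_headroom_gbs(336))   # 224.0 - slow pool saturated, only the 1 GB modules left
```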
 
Never said otherwise. But when fetching data from one pool they cannot read at the same time from the other pool. So you cannot count the same channel toward both pools at once.
That doesn't automatically translate into a loss of bandwidth for the game. The game has part of the slow memory pool as part of its allocation, so a game reading from the slow pool at max bandwidth and reading from the remaining modules would be getting peak bandwidth regardless.

I did not catch how you reached the 448 GB/s though! The slow pool cannot provide more than 336 GB/s.
That was a math error on my part, I added a correction.


Of course you cannot access another chip, I never questioned that! That it is physically attached? I never said otherwise!
That was part of the point of contention I had with the portion of your post concerning:
"How much bandwidth can these 2.5 GB really output? If the full 2GB module can output 56 GB/s, these 3.5 GB (regardless of being on 2 modules or a pieve on all modules) can only output 98 GB/s."
The last claim about 3.5GB of storage only producing 98 GB/s regardless of whether being on all modules (all 6? of the larger chips) would require that to happen.
 
@3dilettante Is it accurate the only downside of this "split" pool is loss of flexibility for the dev?

According to MS, CPU audio and file IO require no more than 336 GB/s. I wonder what non-graphical work would benefit from 560 GB/s. If there is little or none, then this "split" pool architecture might be the design for consoles going forward?

"Memory performance is asymmetrical - it's not something we could have done with the PC," explains Andrew Goossen "10 gigabytes of physical memory [runs at] 560GB/s. We call this GPU optimal memory. Six gigabytes [runs at] 336GB/s. We call this standard memory. GPU optimal and standard offer identical performance for CPU audio and file IO. The only hardware component that sees a difference in the GPU."
 
Correct, I was running some other comparisons at the time of writing and input the wrong number. I added a note in my post.


There's 560 GB/s max for the whole memory subsystem.
Bandwidth is determined by physical channel width and speed, not memory pool.

The range of addresses for the fast pool is distributed across all the chips in the system. The slow pool is distributed across the additional capacity that 6 of the modules have over the other four.
Accesses to all locations can total up to 560 GB/s, in theory. Of that mix, up to 336 could be from accesses to the slow pool.
Both the OS and game can make use of the slow pool, and I'm not entirely sure that all of the fast pool is out of reach of the OS. Even if there's no kernel memory in the fast pool, there could be functions that transfer data into game memory that could be OS-related.

I know... That was a mistake on my part
 
@3dilettante Is it accurate the only downside of this "split" pool is loss of flexibility for the dev?

According to MS, CPU audio and file IO require no more than 336 GB/s. I wonder what non-graphical work would benefit from 560 GB/s. If there is little or none, then this "split" pool architecture might be the design for consoles going forward?
The primary downside I can see at this point is that devs need to pay attention to where the memory resources for the most bandwidth-intensive functions are placed, and maybe some kind of capacity pressure if they have a renderer that needs more than 10 GB of data on-hand. However, I haven't seen indications that this is a pressing issue at this stage.
Unlike with the Xbox One's ESRAM, or the EDRAM for the Xbox 360 or PS2, the ratio of "fast" to total memory is extremely generous. Most memory is fast, the slow bandwidth is still generous, and 10 GB is a lot of memory to need for high-bandwidth functionality. Most of the time, a minority of memory is hit very many times, rather than tens of GB of memory being run through in a row.

As far as why CPU audio and file I/O don't need more than 336 GB/s, I interpreted the Goossen quote to mean that the CPU and IO blocks have infinity fabric interfaces that have bandwidth on the order of 32B at ~1.8 GHz, which means their bandwidths are lower than even the "slow" value of 336 GB/s. Only a client capable of generating more than 336 GB/s of traffic would know the difference, and per the interview that would be the GPU--which makes sense.
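Taking those figures at face value, the back-of-the-envelope looks like this (the 32 B and ~1.8 GHz numbers are the interpretation above, not confirmed specs):

```python
# Back-of-the-envelope for the interpretation above: a 32-byte interface
# at ~1.8 GHz cannot demand anywhere near 336 GB/s, so the CPU and IO
# blocks would never notice the fast/slow difference. Figures are the
# post's assumptions, not confirmed specs.

FABRIC_BYTES_PER_CYCLE = 32
FABRIC_GHZ = 1.8

client_gbs = FABRIC_BYTES_PER_CYCLE * FABRIC_GHZ
print(client_gbs)         # 57.6 - well under the 336 GB/s "slow" figure
print(client_gbs < 336)   # True
```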
 
Most of the time, a minority of memory is hit very many times, rather than tens of GB of memory being run through in a row.

RIP Xbox One.

For a mid gen upgrade, a SOC with 2+GB of HBM at 1024+GB/s and unified pool of 32+GB DDR5 8400 at 260+GB/s might give the best perf / dollar cost? This is assuming that the HBM can be added on substrate and not via a silicon interposer.
 
As far as why CPU audio and file I/O don't need more than 336 GB/s, I interpreted the Goossen quote to mean that the CPU and IO blocks have infinity fabric interfaces that have bandwidth on the order of 32B at ~1.8 GHz, which means their bandwidths are lower than even the "slow" value of 336 GB/s. Only a client capable of generating more than 336 GB/s of traffic would know the difference, and per the interview that would be the GPU--which makes sense.

I saw you put this forward before and it seems like a reasonable explanation for something I was confused by initially.
 