# How to understand the 560 GB/s and 336 GB/s memory pools of Series X *spawn*

Status
Not open for further replies.
I made the same mistake when I first considered Series X, but these are rates, not volumes. If in one second of game time you access the RAM for 0.2 seconds at 560 GB/s (reading and writing 112 GB of data!), then you have 0.8 s of that second left for the GPU to read data at 560 GB/s. Both get the full bandwidth. Of course, the GPU transfers a lower total quantity of data using the bus for 0.8 s versus using it for 1 s, but data consumed is meaningless in relation to transfer rates.
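The rate-versus-volume arithmetic above can be sketched in a few lines (a toy check of the post's own numbers, nothing more; the time split is the example's, not a measured workload):

```python
# Bandwidth is a rate; data moved is rate multiplied by bus time.
RATE = 560.0    # GB/s, full bus bandwidth (figure from the post)

cpu_time = 0.2  # seconds the CPU holds the bus in one second of game time
gpu_time = 0.8  # seconds left over for the GPU

cpu_data = RATE * cpu_time  # total data the CPU moves (about 112 GB)
gpu_data = RATE * gpu_time  # total data the GPU moves (about 448 GB)

# Both clients see the full 560 GB/s while they hold the bus;
# only the totals differ, because their time shares differ.
print(f"CPU moved {cpu_data:.0f} GB, GPU moved {gpu_data:.0f} GB")
```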

A rate turns into a volume (you just did it!), if the time factor of the rate and time being considered match.

But even if you did access 50 GB in 15% of your cycles on the slow memory at 336 GB/s, and 85% of your cycles at 560 GB/s, that would be 526.4 GB/s, not 560 GB/s.
See my reply to mrcorbo - the rate does not diminish.

A rate turns into a volume (you just did it!), if the time factor of the rate and time being considered match.
But the rate (volume over time) never changes. If you have a pipe that can deliver 10 litres a second of water, and you turn the tap a tiny bit and get 1 litre after a minute, the rate of the pipe is still 10 l/s. The bus speed is still 560 GB/s to the GPU no matter how much data you do or don't move.
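For what it's worth, the 526.4 GB/s figure from the quoted post is just a time-weighted average of the two peak rates (a quick sketch using the quote's 15%/85% split):

```python
# Time-weighted average bandwidth when access alternates between pools.
SLOW = 336.0  # GB/s, peak of the six-chip pool
FAST = 560.0  # GB/s, peak of the full ten-chip interleave

slow_share = 0.15  # fraction of cycles spent on the slow pool (from the quote)
fast_share = 0.85  # remaining fraction on the fast pool

avg = SLOW * slow_share + FAST * fast_share
print(f"{avg:.1f} GB/s")  # below the 560 GB/s peak
```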

For your own sanity and mental capacity, please stop relying on material and information that is known to be incorrect and inaccurate.

No. In your second example, if your access only generates 50 GB/s from the 2 GB chips you still have 510 GB/s to play with (assuming no CPU contention issues which would happen with a symmetrical memory configuration as well).

392 GB/s + 50 GB/s =/= 560 GB/s.

The only way that happens is if AMD is incredibly incompetent WRT their SOC design. And this would affect the PS5 SOC in exactly the same way.

No... You do not need to use all the bandwidth available. In a one-pool system, even though you may have 100 GB/s available, you might only use 50 GB/s.

If you are constantly accessing the 6 GB RAM, using all 3.5 GB of it, but only pulling 50 GB/s from a possible 160, you are indeed losing 110 GB/s.

What your claim boils down to is...
• There are 10 memory chips.
• You access 6 of those memory chips (the "slow" pool) and somehow restrict it to only 50 GB/s
• You somehow use 392 GB/s on the remaining 4 chips which theoretically should only have 224 GB/s of bandwidth.

I don't use 392 GB/s on 4 chips... There is no such thing as a 4 GB pool, so for me to access those 4 chips, I'm accessing a 10 GB pool, and that means I use 392 GB/s on 10 chips (besides, each chip can only supply 56 GB/s, so 4 of them would never give me 392 GB/s). I'm talking about simultaneous access to both pools, meaning a 224-bit channel to the 10 GB and a 96-bit channel to the 6 GB. 392 GB/s and 168 GB/s.
Both add up to 560. The question is: if you only use 50 GB/s on the 6 GB pool, you are losing 118 GB/s of bandwidth!
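The 392/168 split being argued over falls straight out of bus width; a small sketch (the 14 Gbps per-pin rate is from the thread, the helper function is mine):

```python
# GDDR6 peak bandwidth = bus width in bits / 8 * per-pin data rate in Gbps.
DATA_RATE = 14  # Gbps per pin (figure from the thread)

def bandwidth(bits: int) -> float:
    """Peak bandwidth in GB/s for a bus of the given width."""
    return bits / 8 * DATA_RATE

print(bandwidth(320))  # full 10-chip bus: 560.0 GB/s
print(bandwidth(224))  # 224-bit share:    392.0 GB/s
print(bandwidth(96))   # 96-bit share:     168.0 GB/s

# 392 + 168 = 560: splitting the bus partitions its bandwidth,
# it never adds anything on top of it.
```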

On PS5 you'd have the exact same situation using your example.
• There are 8 memory chips.
• You access say 4 of those memory chips and somehow restrict it to only 50 GB/s
• You only have 224 GB/s of bandwidth remaining to use.
None of what you are saying is making sense in either situation.

There is no such thing as a bandwidth restriction to 50 GB/s when more memory is available on the chips. Besides, data is stored equally amongst all chips. On a standard memory pool, if one CPU uses X bandwidth, the remainder is total minus X.
This is just not true if you use all the memory but not all the bandwidth! That would create a case of wasted bandwidth!
In this Xbox case, since you used all the 3.5 GB of memory but did not use all the bandwidth, you are in the case above: with 50 GB/s you would have wasted 110 GB/s of bandwidth.

What would happen in reality, ignoring potential CPU-GPU contention for memory:
• There are 10 memory chips.
• You access some portion of 6 memory chips at 50 GB/s.
• The SOC can still use the remaining 286 GB/s on those 6 chips for other requests.
• The remaining 4 chips can still provide up to 224 GB/s for other requests.
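The residual numbers in the list above check out arithmetically (a sketch using the example's 50 GB/s stream; the pool peaks are from the thread):

```python
# Bandwidth left on each group of chips after a 50 GB/s stream
# hits the six 2 GB chips (all figures from the posts above).
SLOW_PEAK = 336  # GB/s across the six 2 GB chips
FAST_ONLY = 224  # GB/s across the other four chips
stream = 50      # GB/s consumed by one client on the six chips

slow_left = SLOW_PEAK - stream
print(slow_left)               # 286 GB/s still free on those six chips
print(FAST_ONLY)               # 224 GB/s untouched on the other four
print(slow_left + FAST_ONLY)   # 510 GB/s of total headroom
```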

Bandwidth depends only on the channel. A 16-bit channel will carry 28 GB/s on a 14 Gbps GDDR6 module.
If this channel is accessing one pool, at that same moment its bandwidth is not accessible to the other pool.
The best you can do is average accesses. If globally you do not spend as many cycles on the slow RAM pool, you are giving more bandwidth to the fast pool. But the CPU requires as many memory access cycles as the GPU. That means 392 GB/s on the fast RAM and 168 GB/s on the slow RAM to access both memories at once.

There is no slow and fast pool of memory. Those are just constructs to try to allow people to understand in a very general way how things are happening.
• There are 10 chips of memory with 560 GB/s total bandwidth.
• Any combination of reads from memory can generate 560 GB/s of total bandwidth.
• It doesn't matter if reads are from the lower or upper memory addresses on the 2 GB chips. The total bandwidth regardless of what portion you access will come out to 560 GB/s.
Regards,
SB
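One way to picture SB's bullet points: with physical addresses striped across the chips, any reasonably sized read touches all ten of them. A toy model (the 256-byte stripe size is an assumption for illustration, not a documented figure):

```python
# Toy model of address interleaving across the ten GDDR6 chips.
NUM_CHIPS = 10
STRIPE = 256  # bytes per chip before moving to the next (assumed value)

def chip_for(addr: int) -> int:
    """Which chip a physical address lands on under simple striping."""
    return (addr // STRIPE) % NUM_CHIPS

# A 4 KB read starting anywhere touches every chip, so it can draw on
# the full 560 GB/s no matter where in the address map it sits.
touched = {chip_for(a) for a in range(0, 4096, STRIPE)}
print(sorted(touched))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```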

I know that
The question is, you will only have 560 GB/s if you only use one pool (the 10 GB one).
If you access both, to get the same 560 GB/s you have to read from both. But this divides the bandwidth. Although unequal usage of the two pools can average this to higher values, at any given moment you will have either 392 GB/s and 168 GB/s, or 224 GB/s and 336 GB/s. And with regular simultaneous access, 392/168 is what you get...
A total of 560 GB/s, assuming the usage you give to the 3.5 GB pulls 168 GB/s. But the CPU can use all of that memory, leaving the GPU with only the 392 GB/s pool.
And if the CPU uses all that memory but doesn't produce more than 50 GB/s of traffic, the GPU is limited to 392 GB/s just because of the way the memory is set up.

PS: From now on, I will just read your replies. I think I made my point, and further explanations would just be repeating myself. So, feel free to correct me. I will be reading and taking notes from what you say.

For your own sanity and mental capacity, please stop relying on material and information that is known to be incorrect and inaccurate.

Well... It all started with a question... So my sanity is not at stake.
But so far, you have failed to present me with convincing evidence.

But please feel free to explain how it works. I'm not making this a battle, much less a console battle. That's why I phrased it as a question, not a claim!

If you don't believe anything from Microsoft's technical details, or Digital Foundry's technical deep-dive articles, or 3dilettante's technical posts, or even others' posts, then this discussion has reached its natural end.

Highlight References:
https://forum.beyond3d.com/posts/2118524/
https://forum.beyond3d.com/posts/2118676/
https://forum.beyond3d.com/posts/2118551/
https://forum.beyond3d.com/posts/2118639/
https://forum.beyond3d.com/posts/2118627/
https://forum.beyond3d.com/posts/2118811/
https://www.eurogamer.net/articles/digitalfoundry-2020-inside-xbox-series-x-full-specs

I was going to ask this in the XSX memory thread but since it has now been closed I'll ask it here.

1. It was explained to me in the a.b.m. thread that GDDR6 is not dual-ported and that each channel addresses half of the chip. In searching for the information myself I came across this https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/8 where it states "Whereas both GDDR5 and GDDR5X used a single 32-bit channel per chip, GDDR6 instead uses a pair of 16-bit channels. This means that in a single memory core clock cycle (ed: not to be confused with the memory bus), 32 bytes will be fetched from each channel for a total of 64 bytes. This means that each GDDR6 memory chip can fetch twice as much data per clock as a GDDR5 chip, but it doesn’t have to be one contiguous chunk of memory. In essence, each GDDR6 memory chip can function like two chips." The "it doesn't have to be one contiguous chunk of memory" bit seems to imply that GDDR6 can operate in two modes. One mode would allow you to use the two channels to double output per clock cycle over GDDR5, while the other would allow you to read and then write to the same memory space, which nets you the same speed as GDDR5. If the second mode is possible, would this be better for CPU use?

2. As the SX uses RAM in two capacities (1 GB & 2 GB), and the chips themselves are then addressable in halves (512 MB & 1 GB), which size would you assign to the CPU and GPU? For e.g. the four 1 GB modules to the CPU, halved into 8 x 512 MB (matching the CPU core count), with five of the six 2 GB chips for the GPU and the remaining 1 GB module shared between the CPU & GPU? Or maybe mix and match the modules for both CPU and GPU (2 x 1 GB chips and 1 x 2 GB chip for the CPU and the other chips for the GPU) based on memory needs. My assumption would be that you would have a few workloads where the CPU may require more than 512 MB of memory per core, so maybe mixing and matching would be better?

3. If a read-and-write-to-one-contiguous-chunk-of-memory mode exists in GDDR6, would you then use the memory modules for the CPU this way and use the double-data-rate mode for the GPU? How would you configure the memory system to make best use of this?

Let me just add that I am not looking for "secret sauce". I will be getting both consoles (XSX first and PS5 later on, just as I have done this generation as well as the last). I am just interested in understanding why MS chose to configure their system as they did and not go the PS5 route with 8 x 2 GB chips.

I am just interested in understanding why MS chose to configure their system as they did and not go the PS5 route with 8 x 2 GB chips.

I'll let @3dilettante or someone more experienced than me tackle your first set of questions, while I can answer and direct on your last question.

If they went with only 8 chips, they'd be at 256 bits wide and only get 448 GB/s. They needed more bandwidth.

So they went with 10 chips at 320 bits wide to hit 560 GB/s, but then scaled back capacity on some of them to save on costs, which left part of the memory at 336 GB/s.

The other option was going with faster memory chips, but that's costly too, and you run into signal-routing issues on the memory traces as well.

If they had the budget, they'd run 10 same-size chips at 320 bits wide and have 20 GB at 560 GB/s.
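The bus-width options mentioned in this and the surrounding posts can be tabulated from the same width-times-rate arithmetic (a sketch assuming 14 Gbps chips throughout):

```python
# Peak bandwidth for each bus-width option at 14 Gbps GDDR6.
RATE = 14  # Gbps per pin

# chip count -> GB/s, with one 32-bit connection per chip
options = {chips: chips * 32 // 8 * RATE for chips in (8, 10, 12)}

for chips, gbps in options.items():
    print(f"{chips} chips, {chips * 32}-bit bus: {gbps} GB/s")
# 8 chips, 256-bit bus: 448 GB/s
# 10 chips, 320-bit bus: 560 GB/s
# 12 chips, 384-bit bus: 672 GB/s
```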

Pretty much what MrFox said: https://forum.beyond3d.com/threads/...y-pools-of-series-x-spawn.61681/#post-2118627

DigitalFoundry talking about signal integrity issues: https://www.eurogamer.net/articles/digitalfoundry-2020-inside-xbox-series-x-full-specs
DigitalFoundry said:
It sounds like a somewhat complex situation, especially when Microsoft itself has already delivered a more traditional, wider memory interface in Xbox One X - but the notion of working with much faster GDDR6 memory presented some challenges. "When we talked to the system team there were a lot of issues around the complexity of signal integrity and what-not," explains Goossen. "As you know, with the Xbox One X, we went with the 384[-bit interface] but at these incredible speeds - 14gbps with the GDDR6 - we've pushed as hard as we could and we felt that 320 was a good compromise in terms of achieving as high performance as we could while at the same time building the system that would actually work and we could actually ship."

If they had the budget they'd run 10 chips same size at 320 wide and have 20 GB at 560 GB/s.
On this note: running that setup would only provide diminishing returns over the existing one in terms of sustaining bandwidth. I'll be frank: I think it would barely improve things, outside of edge cases or poor programming.

Though an additional 4 GB of capacity would increase buffers dramatically. So where XSX falls short of PS5's SSD speed, with the additional memory it could simply hold more in RAM at any one time.

1. It was explained to me in the a.b.m. thread that GDDR6 is not dual-ported and that each channel addresses half of the chip. In searching for the information myself I came across this https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/8 where it states "Whereas both GDDR5 and GDDR5X used a single 32-bit channel per chip, GDDR6 instead uses a pair of 16-bit channels. This means that in a single memory core clock cycle (ed: not to be confused with the memory bus), 32 bytes will be fetched from each channel for a total of 64 bytes. This means that each GDDR6 memory chip can fetch twice as much data per clock as a GDDR5 chip, but it doesn’t have to be one contiguous chunk of memory. In essence, each GDDR6 memory chip can function like two chips." The "it doesn't have to be one contiguous chunk of memory" bit seems to imply that GDDR6 can operate in two modes. One mode would allow you to use the two channels to double output per clock cycle over GDDR5, while the other would allow you to read and then write to the same memory space, which nets you the same speed as GDDR5. If the second mode is possible, would this be better for CPU use?
I'll let more senior members handle this one. The pros and cons of striping, cache lines, paging costs, etc. are not my strength, but what you're asking probably falls into that area, and it has been answered in the other thread. How CPUs and GPUs want to use memory differs. In this case, we'll optimize for the GPU to ensure maximum bandwidth.

I am not sure if you can stripe things one way for some data and another way for other data, mixing and matching on the same chips. Best to ask another member here about that.

2. As the SX uses RAM in two capacities (1 GB & 2 GB), and the chips themselves are then addressable in halves (512 MB & 1 GB), which size would you assign to the CPU and GPU? For e.g. the four 1 GB modules to the CPU, halved into 8 x 512 MB (matching the CPU core count), with five of the six 2 GB chips for the GPU and the remaining 1 GB module shared between the CPU & GPU? Or maybe mix and match the modules for both CPU and GPU (2 x 1 GB chips and 1 x 2 GB chip for the CPU and the other chips for the GPU) based on memory needs. My assumption would be that you would have a few workloads where the CPU may require more than 512 MB of memory per core, so maybe mixing and matching would be better?
At first thought, you likely would not. The CPU and GPU need to be able to address the same memory locations; otherwise you'll run into a situation where the CPU can only place things into the 2 GB chips and the GPU would need to copy those items out to the smaller chips for bandwidth, or the CPU/GPU would not be able to perform any shared GPGPU tasks. So they must be able to address both. But certain file types, say specifically marked ones like audio, game code, stuff only the CPU will use, may be identified/marked to stripe over the six 2 GB chips as opposed to all chips. Or, the obvious second thought, have a special command that knows whether to stripe memory over the six 2 GB chips or over all of them, and just leave it to developers to decide what goes where. Likely the latter, thinking out loud. Developers will ultimately be the ones to determine which files require high bandwidth and which don't.

The sequence of decision was probably something along those lines:

1. Design goals are at least 12TF, at least 16GB, and xyz BOM.
2. 256bit not fast enough, faster bins are expensive
3. 384bit too expensive, PCB crosstalk issues
4. 320bit is perfect for speed and cost
5. Now we are over budget with 20GB, and 10GB is not nearly enough
6. Splitting capacities for 16GB meets all design goals.
7. Now it causes a data layout issue, with performance drops
8. Skimming the top layer for the OS plus 3.5GB of game data will mitigate this
9. Add recommendations for devs: use the 3.5GB mostly for CPU data
10. Yay! All design goals met!

I wonder if they planned for 20GB, but the recent memory-market instability caused them to adopt this solution to respect the target BOM without impacting performance. Or possibly it was a planned contingency based on memory prices, an open choice they could settle as late as possible.
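Step 6's mixed-capacity layout can be sketched numerically (chip counts and the 56 GB/s per-chip figure are from the thread; the code is just a bookkeeping illustration):

```python
# Bookkeeping for the Series X mix: six 2 GB chips plus four 1 GB chips,
# each chip on its own 32-bit channel worth 56 GB/s at 14 Gbps.
chips = [2] * 6 + [1] * 4   # capacities in GB
PER_CHIP_BW = 56            # GB/s per chip

total_capacity = sum(chips)          # 16 GB
total_bw = PER_CHIP_BW * len(chips)  # 560 GB/s

# The bottom 1 GB of every chip interleaves across all ten: the "fast" 10 GB.
fast_pool = 1 * len(chips)                              # 10 GB at 560 GB/s
# The top 1 GB exists only on the six 2 GB chips: the "slow" 6 GB.
slow_pool = sum(c - 1 for c in chips)                   # 6 GB
slow_bw = PER_CHIP_BW * sum(1 for c in chips if c > 1)  # 336 GB/s

print(total_capacity, total_bw, fast_pool, slow_pool, slow_bw)
```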

I was going to ask this in the XSX memory thread but since it has now been closed I'll ask it here.

1. It was explained to me in the a.b.m. thread that GDDR6 is not dual-ported and that each channel addresses half of the chip. In searching for the information myself I came across this https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/8 where it states "Whereas both GDDR5 and GDDR5X used a single 32-bit channel per chip, GDDR6 instead uses a pair of 16-bit channels. This means that in a single memory core clock cycle (ed: not to be confused with the memory bus), 32 bytes will be fetched from each channel for a total of 64 bytes. This means that each GDDR6 memory chip can fetch twice as much data per clock as a GDDR5 chip, but it doesn’t have to be one contiguous chunk of memory. In essence, each GDDR6 memory chip can function like two chips." The "it doesn't have to be one contiguous chunk of memory" bit seems to imply that GDDR6 can operate in two modes. One mode would allow you to use the two channels to double output per clock cycle over GDDR5, while the other would allow you to read and then write to the same memory space, which nets you the same speed as GDDR5. If the second mode is possible, would this be better for CPU use?
The basic description is that the channels are independent, and so work on one channel isn't restricted by what is going on in the other.
There is a pseudo-channel mode that makes GDDR6 act more like GDDR5X, where the two channels share part of their command and address pins. This makes the channels act more like a single bus, although with some flexibility in how columns are accessed in the single long row that has been activated.
The primary benefit that is touted is the reduction in pins needed to connect to a GDDR6 module. GDDR5 needed 61, two-channel GDDR6 needs 74, and pseudo-channel GDDR6 needs 66.
(Source: Micron's GDDR6 pdf).
The most flexible implementation would be to have 2-channel mode for the modules, since two channels could be given the same commands to emulate pseudo-channel mode, but a pseudo-channel implementation cannot operate its joined channels with more than minor differences in behavior between them. Since this would be a physical implementation choice at the board level, it would be a choice at design time rather than a mode to select during operation.

2. As the SX uses RAM in two capacities (1 GB & 2 GB), and the chips themselves are then addressable in halves (512 MB & 1 GB), which size would you assign to the CPU and GPU?
Both GPU and CPU have mostly equivalent access to all chips. Assigning a chip to one client or the other would mean the 560 GB/s number would be pretty misleading, and the GPU in particular would lose out in bandwidth with that situation.
There's also a potential problem if the CPU cannot have free access to all memory, since system functions, security, and booting up are going to have problems if whole swaths of the memory space are invisible.

3. If a read-and-write-to-one-contiguous-chunk-of-memory mode exists in GDDR6, would you then use the memory modules for the CPU this way and use the double-data-rate mode for the GPU? How would you configure the memory system to make best use of this?
There could be a decision to change how addresses are mapped across different chips, and there is some history of altering the mapping based on virtual memory page properties. APUs in the past did something like this for GPU memory, and the Xbox One's ESRAM was accessed like regular memory, with specific flags or properties on the virtual memory pages allocated on it.
Making a mode between independent and pseudo-channel GDDR6 is more of a physical change, so that decision would be permanent for the chips involved.

