I mean, those aren’t his words you quoted. Somehow you quoted Metal Spirit and put 3dilettante there.
I just realized that. I had to change up my language to be a bit more diplomatic. LOL
Completing request Y on channels not immediately needed for function X means that the overall system is seeing a performance benefit by having both X and Y make progress.
The unused channels during an access to the 6GB portion would deplete their queue.
This isn't always known as far as the memory channels and the controllers are concerned. The GPU path is aggressively reordered and does a lot of combining, which is not always obvious at the algorithm level. Compression can complicate things further by making the number of bus transactions for a set of fetches more variable.
Any GPU algorithm would statistically request equally across all channels to get an equivalent 560 GB/s. So having some requests resolved (on the channels that are still free) while the others stall from serving the 6GB portion would result in a proportional stall.
If the channels aren't being used, then it sounds like the workload is fine with less than 560 GB/s, which can frequently be true. Even when a game is supposedly bandwidth-bound, it's more that a subset of the frame time budget is bandwidth-bound and it's holding up further progress.
If the additional requests going to the 336 GB/s portion don't have an equivalent number of requests going to the remaining channels, wouldn't that still imbalance the queues? OTOH, if the data is spread unevenly to make the unused channels useful in that time frame, they are then not balanced equally when the 6GB portion is not used, and cannot do 560 GB/s in total; it would instead cause the opposite queues to starve.
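To make the queue-balance argument concrete, here's a toy sketch (my own illustration, not a model of the real memory controller, whose scheduling is far more involved): ten 32-bit channels, where a fast-region request can land on any channel but a slow-region request only lands on the six channels backed by the 2 GB chips, as the 336 GB/s figure suggests.

```python
# Toy illustration of per-channel load on a 10-channel bus where the 6 GB
# region only maps to 6 of the channels. Not how a real GDDR6 controller works.
import random

CHANNELS = 10        # 10 x 32-bit channels (320-bit bus)
SLOW_CHANNELS = 6    # the 6 GB region interleaves only across the six 2 GB chips

def channel_load(requests: int, slow_fraction: float, seed: int = 0) -> list[int]:
    """Count requests per channel for a given mix of slow-region traffic."""
    rng = random.Random(seed)
    load = [0] * CHANNELS
    for _ in range(requests):
        if rng.random() < slow_fraction:
            load[rng.randrange(SLOW_CHANNELS)] += 1   # slow-region access: channels 0-5 only
        else:
            load[rng.randrange(CHANNELS)] += 1        # fast-region access: any channel
    return load

if __name__ == "__main__":
    for frac in (0.0, 0.15, 0.5):
        print(f"slow-region share {frac:.0%}: {channel_load(100_000, frac)}")
```

With any slow-region traffic in the mix, channels 0-5 queue up more work than channels 6-9, which is exactly the imbalance being debated; whether it matters depends on whether other requests can fill the idle channels in the meantime.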
That quote seems to be focusing on a specific use case rather than texturing in general: a depth buffer that is then switched out and then read linearly later.
From what I remember from 2013, texture data isn't bandwidth-intensive. Otherwise, Xbox One textures would have been half the fidelity of PS4 textures, and that didn't seem to be the case last gen.
https://www.eurogamer.net/articles/digitalfoundry-the-complete-xbox-one-interview
"Doctor, it hurts when I do this."
"Don't do that".
Just like if games on PS5 use 90 GB/s of bandwidth on audio and 22 GB/s on decompression, they will only have 336 GB/s of memory bandwidth left for the GPU and CPU.
Developers won't do that.
Though a nice investigation, one shouldn't take one example of a launch title as indicative of 1) what games are doing now and 2) what they'll be doing next-gen. We might see a five-fold increase in audio, maybe, and a load of RT data that doesn't fit at all into the KZSF model. The moment CPU processing is no longer a bottleneck, the requirement for RAM use may increase dramatically, or maybe it'll go down with the GPU doing more compute work than before?
I've divided Killzone Shadow Fall's memory usage into two groups, one requiring fast access and one requiring slow access.
If you're following @3dilettante, who is providing very thorough answers to address Metal Spirit, the tl;dr is: there is only one possible scenario in which the asymmetric memory will cause an issue, and that's purely the developer not knowing or caring about how they optimize their memory and just dumping major critical items into the wrong area. All other scenarios are a non-issue whether the memory setup is symmetric or not.
3 pages and I still don't understand shit in a thread that should have helped me understand.
Besides, the CPU can very easily consume the full 3.5 GB available on the slow pool, but it will not generate 168 GB/s of bandwidth traffic, meaning many GB/s of bandwidth on this pool are just wasted.
I'll let the more technically knowledgeable address the rest, but your thinking here seems really strange. If you don't use every TF of compute in every cycle on the GPU, are the unused TFlops wasted? If you only have 13.5GB total and in that mix is 9GB of data that needs fast memory access and 4.5GB that doesn't, what actual difference does it make if you have more capacity of fast memory available than data that needs it?
I understand the conclusion but not how it works.
If you're following @3dilettante, who is providing very thorough answers to address Metal Spirit, the tl;dr is: there is only one possible scenario in which the asymmetric memory will cause an issue, and that's purely the developer not knowing or caring about how they optimize their memory and just dumping major critical items into the wrong area. All other scenarios are a non-issue whether the memory setup is symmetric or not.
I think Metal Spirit is hung up on slow and fast pools of memory. We've tried to explain to him before that it's a single physical pool, and that there is no such thing as fast or slow; it's just bits * clock speed.
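For reference, the bits * clock arithmetic behind the two published figures (this assumes the 14 Gbps GDDR6 data rate that 560 GB/s on a 320-bit bus implies):

```python
# Bandwidth = bus width x per-pin data rate; same chips, just a narrower slice of the bus.
GBPS_PER_PIN = 14  # GDDR6 data rate implied by the published 560 GB/s figure

def bandwidth_gb_per_s(bus_bits: int) -> float:
    return bus_bits * GBPS_PER_PIN / 8  # gigabits/s across the bus -> gigabytes/s

print(bandwidth_gb_per_s(320))  # all 10 chips (the 10 GB region): 560.0 GB/s
print(bandwidth_gb_per_s(192))  # only the six 2 GB chips (the 6 GB region): 336.0 GB/s
```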
We've also tried to explain that the amount of memory bandwidth provided by the 6 GB chips is very generous for CPU loads that would likely never get close to that amount.
We've also tried to explain that asymmetric memory chip setups have nothing to do with CPU and GPU contention over memory. And if the CPU ever monopolized that much data, irrespective of the setup being symmetric or asymmetric, you'd bog the memory down anyway. That's basically suggesting the CPU is a bigger consumer of bandwidth than the GPU.
He's then also explained that unused channels can be leveraged, letting the total system always fill out its potential instead of wasting accesses.
All in all, the Xbox should perform well with its memory setup.
I might not have been clear! I'm sorry for that!
Let me try to explain it in another way:
Imagine a single pool of 16 GB memory, with 560 GB/s. (2.5GB already used by the OS)
You access it with the CPU and use 3.5 GB of RAM, generating 50 GB/s of traffic.
The GPU will have 10 GB of RAM to use, and it can create 510 GB/s of traffic.
Now let's look at this case:
Accessing the 6 GB of slow RAM creates a 192-bit bus to this RAM. This gives you 168 GB/s of bandwidth to this memory! The remaining pool gets 392 GB/s, and both together give you the same 560 GB/s.
Now you use the full 3.5 GB, creating 50 GB/s of traffic.
How much will the GPU get?
392 GB/s... because there is no more memory on the other pool to generate extra bandwidth traffic.
Compared to the first case, unused bandwidth is wasted!
And this will not be for just a couple of cycles... We are talking about CPU usage of this RAM, so access will be intense.
Was I clear this time?
The slow pool is 336 GB/s, not 168 GB/s. As for the rest, there are really people more equipped to properly explain this than me, but the 560 GB/s isn't the fast pool + the slow pool. The 560GB/s is the bandwidth of the fast pool any time you access data from it.
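One way to picture why the two figures don't add up to 896: the bus serves one region or the other on any given set of cycles, so over a window of time the effective rate is a weighted average of 560 and 336, never their sum. A rough sketch of that idea (the traffic split is just an illustrative parameter, not a measured figure):

```python
FAST_GB_S = 560.0  # rate while the bus is serving the 10 GB region
SLOW_GB_S = 336.0  # rate while the bus is serving the 6 GB region

def effective_bandwidth(slow_time_share: float) -> float:
    """Average rate when some fraction of bus time goes to the 6 GB region."""
    return slow_time_share * SLOW_GB_S + (1.0 - slow_time_share) * FAST_GB_S

print(effective_bandwidth(0.00))  # 560.0 GB/s - all traffic hits the fast region
print(effective_bandwidth(0.15))  # ~526.4 GB/s - e.g. a 50 GB/s CPU demand served from the slow region
print(effective_bandwidth(1.00))  # 336.0 GB/s - all traffic hits the slow region
```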
Sure, if you don't use the CPU at all (I know what you meant: that RAM can be accessed by only one component at the same time). But I doubt it will happen. With all the power in those CPUs, and seeing how CPU-starved developers were during this gen, I highly doubt they wouldn't try to max the CPUs with 60 fps, physics, audio stuff, etc.
I made the same mistake when I first considered the Series X, but these are rates, not volumes. If in one second of game you access the RAM for 0.2 seconds at 560 GB/s (reading and writing 112 GB of data!), then you have 0.8 seconds left for the GPU to read data at 560 GB/s. Both get the full bandwidth. Of course the GPU transfers a lower total quantity of data using the bus for 0.8 s versus using it for the full second, but data consumed is meaningless in relation to transfer rates.
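The rates-versus-volumes point as plain arithmetic (same example numbers as above; the 0.2 s is just an illustrative split):

```python
PEAK_GB_S = 560.0  # rate whenever the fast region is being accessed
WINDOW_S = 1.0     # one second of game time

other_time = 0.2                       # suppose other clients hold the bus for 0.2 s
other_bytes = PEAK_GB_S * other_time   # 112 GB moved in that slice
gpu_time = WINDOW_S - other_time       # 0.8 s of bus time left over
gpu_bytes = PEAK_GB_S * gpu_time       # 448 GB the GPU could still move

print(other_bytes, gpu_bytes)  # each client sees the full 560 GB/s while it holds the bus
```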
Nope...
Nothing of the sort:
If you access only the 10 GB you have 560 GB/s; if you access only the 6 GB, you have 336 GB/s; if you access both at once, you have 392 GB/s on the fast pool and 168 GB/s on the slow pool; and if you average access to both pools, you will have 280 GB/s on the fast and 168 GB/s on the slow.
Yes, the SX has 2.5 GB reserved for system functions and we don't know how much the PS5 reserves for similar functionality, but it doesn't matter: the Xbox SX either has only 7.5 GB of interleaved memory operating at 560 GB/s for game utilisation before it has to start "lowering" the effective bandwidth of the memory below that of the PS5... or the SX has an averaged mixed memory bandwidth that is always below that of the baseline PS4.
All right, F it. I'll take a stab, and hopefully if I get anything wrong someone will step in and correct me. I think the confusion may come from conflating the use of GB/s as both a spec that represents a potential available resource and an actual measurement of use over a period of time.
If we say the CPU uses 50GB/s we are actually saying that the CPU has a demand for 50GB of data over a second. On the XBSX, 50GB can be delivered from the slow pool using around 15% of the cycles of the memory bus given that the max that can be transferred from that memory is 336GB over a second if it were to be used every cycle. That leaves 85% of the bus cycles available for everything else, including (and probably most frequently) GPU accesses to the fast memory. 85% of 560 is 476 which is still more than the full bandwidth of the PS5.
Now let's do the same for the PS5. 50 GB can be transferred using 11% of the cycles of a bus with a max bandwidth of 448 GB/s. That leaves 89% for the GPU and everything else. 89% of 448 GB/s is ~399 GB/s.
Get it now?
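Putting the two back-of-the-envelope numbers side by side (the 50 GB/s CPU demand is just the example figure used in the posts above):

```python
def leftover_gb_per_s(cpu_demand_gb_s: float, cpu_pool_gb_s: float, peak_gb_s: float) -> float:
    """Bandwidth left over after a given CPU demand is served from its pool."""
    cycles_used = cpu_demand_gb_s / cpu_pool_gb_s  # fraction of bus cycles the CPU demand needs
    return (1.0 - cycles_used) * peak_gb_s

# Series X: 50 GB/s served from the 336 GB/s region, remaining cycles at 560 GB/s
print(leftover_gb_per_s(50, 336, 560))  # ~476 GB/s
# PS5: 50 GB/s served from the single 448 GB/s pool
print(leftover_gb_per_s(50, 448, 448))  # ~398 GB/s
```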