How to understand the 560 GB/s and 336 GB/s memory pools of Series X

I've divided Killzone Shadow Fall's memory usage into two groups, one requiring fast access, and one requiring slow access.

slow said:
System Memory (1536MB Total)
  • Sound: 553MB
  • Havok Scratch: 350MB
  • Game Heap: 318MB
  • Various Assets/Entities: 143MB
  • Animation: 75MB
  • Executable/Stack: 74MB
  • LUA Script: 6MB
  • Particle Buffer: 6MB
  • AI Data: 6MB
  • Physics Meshes: 5MB
Shared Memory (CPU/GPU - 18MB)
  • Streaming Pool: 18MB
Video Memory (1893MB)
  • Non-Streaming Textures: 1321MB
  • Streaming Pool (1.6GB of streaming data): 572MB

Total: 3429MB

fast said:
Shared Memory (CPU/GPU - 110MB)
  • Display list (2x): 64MB
  • GPU Scratch: 32MB
  • CPU Scratch: 12MB
  • Queries/Labels: 2MB
Video Memory (1179MB)
  • Render Targets: 800MB
  • Meshes: 315MB
  • CUE Heap (49x): 32MB
  • ES-GS Buffer: 16MB
  • GS-VS Buffer: 16MB

Total: 1307MB

So let's say that a console game usually requires around 25-33% of its memory to be in the fast pool.

For a next-gen console, that could be a minimum of 8 GB HBM + 32 GB DDR5 (4 GB for the OS).

@3dilettante, do you think any changes or adjustments are needed to my division? Do you think a fast-to-slow split ratio of 1 to 2-3 makes sense?
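To make that concrete, here's a quick sketch (plain Python; the numbers are just the two totals above) of what fraction of the game's footprint lands in the fast group:

```python
# Fast/slow split from the Killzone Shadow Fall breakdown above (MB).
slow_total = 3429   # system memory + shared streaming pool + video textures
fast_total = 1307   # render targets, meshes, scratch buffers, display lists

game_total = slow_total + fast_total
fast_fraction = fast_total / game_total

print(f"total: {game_total} MB, fast: {fast_fraction:.1%}")
# total: 4736 MB, fast: 27.6% -- inside the 25-33% band suggested above
```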
 
And even then, only in the sense that you don't want to exceed 10 GB of fast data. Overflowing the slow partition doesn't really hurt anything, given you have a finite total amount of memory anyway.

Not as I see it. You have two pools, 6 GB and 10 GB; accessing one uses data channels that during that time cannot be used to access the other.
So, if you are accessing the slow memory you must use at least six 16-bit channels to reach the full pool, reducing bandwidth on the 10 GB of RAM by 168 GB/s.
So you will have 392 GB/s on the fast pool and 168 GB/s on the slow one, even if you access the slow memory just to "check the time of day".
Of course the average bandwidth will depend on the number of accesses, but regardless of the average, you cannot rely on more than 392 GB/s if you want your performance to be steady at all times.
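For reference, the arithmetic behind those numbers, as a sketch under the poster's assumptions (a 320-bit bus treated as twenty 16-bit GDDR6 channels at 28 GB/s each, six of which are tied up serving the slow pool):

```python
# Series X bus: 320-bit = 20 x 16-bit channels, 560 GB/s total (assumed).
channels = 20
per_channel = 560 / channels            # 28 GB/s per 16-bit channel

slow_channels = 6                       # the poster's assumption, disputed below
slow_bw = slow_channels * per_channel   # 168 GB/s
fast_bw = 560 - slow_bw                 # 392 GB/s left for the 10 GB pool

print(f"slow: {slow_bw:.0f} GB/s, fast remainder: {fast_bw:.0f} GB/s")
```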
 
So why not use the default setup then? Why bother with split pools at all?
 
Of course the average bandwidth will depend on the number of accesses, but regardless of the average, you cannot rely on more than 392 GB/s if you want your performance to be steady at all times.

I guess you cannot rely on developers not to be complete bozos.
 
Developers are going to optimize all memory access patterns to get as close to 560 GB/s as possible when required.

I know that... but if you have a drop due to accesses to the 6 GB, and you count on the extra bandwidth, your performance will suffer.
 
I guess you cannot rely on developers not to be complete bozos.

The bandwidth cut is something that cannot be avoided.
You can minimize it, but not remove it from the equation.
Even less so if the CPU is working in this memory.
I don't see any bozos, but I don't see any Houdinis either.
 
I know that... but if you have a drop due to accesses to the 6 GB, and you count on the extra bandwidth, your performance will suffer.

"Doctor, it hurts when I do this."
"Don't do that".

Just as, if games on PS5 use 90 GB/s of bandwidth on audio and 22 GB/s on decompression, they will only have 336 GB/s of memory bandwidth left for the GPU and CPU.
Developers won't do that.
 
So why not use the default setup then? Why bother with split pools at all?
Because, like any distributed storage system, the access rate would become the average across the entire available memory space. In this case it would be 6 GB at 336 GB/s and 10 GB at 560 GB/s, so a 476 GB/s average. That's the primary reason nobody makes such imbalanced buses.

If they had the BOM for 20 GB they would have a unified system. Now they are working to mitigate the negative impact of having unequal capacities. They need to split it to prevent the excess requests to some chips from lowering the average bandwidth. The best-known solution is to isolate the excess into a rarely accessed upper region: the 6 GB partition. It's a much better average than going random-access on a single partition, and it's worth the trouble of making that split.

You get 560 GB/s only if the address space is spread evenly. So if they want more, they either make all chips bigger equally, or, as in this case, they make the excess capacity of the few bigger chips a rarely accessed region.
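A quick check of that 476 GB/s figure, as a capacity-weighted average (a sketch of the averaging argument only, not a claim about real access patterns):

```python
# Capacity-weighted average if accesses were spread evenly over all 16 GB.
regions = [(6, 336), (10, 560)]   # (capacity in GB, bandwidth in GB/s)

total_gb = sum(gb for gb, _ in regions)
avg_bw = sum(gb * bw for gb, bw in regions) / total_gb

print(f"{avg_bw:.0f} GB/s")  # 476 GB/s
```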
 
I've divided Killzone Shadow Fall's memory usage into two groups, one requiring fast access, and one requiring slow access.



Total: 3429MB



Total: 1307MB

So let's say that a console game usually requires around 25-33% of its memory to be in the fast pool.

For a next-gen console, that could be a minimum of 8 GB HBM + 32 GB DDR5 (4 GB for the OS).

@3dilettante, do you think any changes or adjustments are needed to my division? Do you think a fast-to-slow split ratio of 1 to 2-3 makes sense?
Some of the big bandwidth consumers at that time would have been the render targets. I am not sure about excluding the video memory portion that contains non-streaming textures and the streaming texture pool. Textures might be more spiky in consumption or dependent on filtering behavior, but they're being accessed by the highly parallel CUs and their vector memory pipelines. The non-streaming textures sound like those textures that were too in-demand to risk waiting on disk reads, and streaming textures might have slow stream-in to the pool, after which some could be very heavily hit by GPU reads.
However, a good amount of the motivation for those pools' sizes would be the HDD bottleneck that the next gen is changing significantly, and things like sampler feedback may also make it possible to reduce the overall size. That said, with the performance of the GPUs and the ability to use resources in less time, whatever fraction of that category remains would be the more intensely used data.

Not as I see it. You have two pools, 6 GB and 10 GB; accessing one uses data channels that during that time cannot be used to access the other.
So, if you are accessing the slow memory you must use at least six 16-bit channels to reach the full pool, reducing bandwidth on the 10 GB of RAM by 168 GB/s.
I don't think that's necessary. From the point of view of the DRAM, the granularity can be as fine as one 16-bit channel for 16 bus clocks before picking a different address range.
There may be other constraints on the parallelism of the system, but accesses like cache-line fills can be satisfied in 2-4 bursts from a single channel. Why reserve any more channels than are needed for a given transaction? A CPU's 64-byte line likely uses one channel for a few bursts, or at most two channels if they're in lockstep. Six channels would pull in 192 bytes, which doesn't map to any basic action done by the system.
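A rough sketch of that granularity point, assuming GDDR6's 16-bit channels and a burst length of 16:

```python
# GDDR6 access granularity, assuming BL16 on a 16-bit channel.
channel_width_bits = 16
burst_length = 16
bytes_per_burst = channel_width_bits * burst_length // 8   # 32 bytes

cache_line = 64                                 # typical CPU cache line
bursts_needed = cache_line // bytes_per_burst   # 2 bursts on one channel

print(f"{bytes_per_burst} B per burst -> {bursts_needed} bursts per 64 B line")
# Six channels in lockstep would pull 6 * 32 = 192 B, which matches
# no basic transaction the system issues.
```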

That aside, I don't understand the idea that if data comes from "slow" memory that it means there's some innate "slow" quality to the data that hurts overall performance. It's just bandwidth. If the GPU or CPU are pulling over 100 GB/s from a portion of regular memory, it's because the game needed that much data. The developer would have dictated that.
If that slow functionality needs more bandwidth, why couldn't the developer map the allocation to the address range where it could get more?

If the problem is that a given function needs 168 GB/s, while a different function needs 560 GB/s at the exact same time, that's just life. You'd still have a problem in a 20GB system if there are two pieces of functionality whose total demand exceeds the physical capabilities of the memory bus. There's only a difference if the developer somehow cannot fit a high-use target in the 10 GB range, but I haven't seen an argument that games are in imminent danger of having 10GB of render targets or the like.

I know that... but if you have a drop due to accesses to the 6 GB, and you count on the extra bandwidth, your performance will suffer.
There's a separate debate to be had about contention between client types and the difficulty in getting max utilization due to DRAM sensitivity to access patterns. However, that sort of challenge is more about how accesses map to ranges with patterns in the KB or MB range, not 10GB or more.


Because, like any distributed storage system, the access rate would become the average across the entire available memory space. In this case it would be 6 GB at 336 GB/s and 10 GB at 560 GB/s, so a 476 GB/s average. That's the primary reason nobody makes such imbalanced buses.
2.5 GB of the 6 GB normal space is OS reserve, and the design of the platform should be trying to minimize its bandwidth impact in favor of what the game needs, unless the OS or some application takes over the foreground.
I think it's risky to correlate the capacity of different spaces to bandwidth utilization. It also creates the image of a random or purposefully even distribution of accesses to all addresses in the system, which isn't the case.
Some of the most dominant bandwidth consumers are streaming compute shaders or the ROPs writing and overwriting a finite set of render targets. That's a fraction of the overall memory capacity, but those ranges get hit many more times, and this is very predictable.
The trade-off here is that Microsoft thinks developers who can provide lists of buffer capacities and frame-time budgets can figure out whether something has significant stretches where it pulls close to or more than 24 bytes a cycle from DRAM. If so, the developer would be expected to move that memory target into an address range that can supply 40 bytes per cycle.
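Those per-cycle figures fall out of the 14 Gbps pin rate; a sketch, where "cycle" is taken to mean one data-rate cycle at 14 GHz:

```python
# Bytes per data-rate cycle at GDDR6's 14 Gbps per pin.
pin_rate_ghz = 14.0
slow_bw, fast_bw = 336.0, 560.0   # GB/s

print(slow_bw / pin_rate_ghz)   # 24.0 bytes/cycle (slow region)
print(fast_bw / pin_rate_ghz)   # 40.0 bytes/cycle (full 320-bit bus)
```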

If they had the BOM for 20 GB they would have a unified system. Now they are working to mitigate the negative impact of having unequal capacities. They need to split it to prevent the excess requests to some chips from lowering the average bandwidth.
If a client pulls 336 GB/s from the slow pool, and something else needs 560 GB/s but just gets the unused channels, that's very high utilization. Another combination is that the 560GB/s load gets everything and the 336 GB/s load gets nothing, still high utilization of the bus. The real problem is that the game is written to need 896 GB/s, and having balanced capacity chips wouldn't help with that.

The best-known solution is to isolate the excess into a rarely accessed upper region: the 6 GB partition. It's a much better average than going random-access on a single partition, and it's worth the trouble of making that split.
If by rarely accessed we mean something that can be accessed at up to 336 GB/s.

You get 560 GB/s only if the address space is spread evenly. So if they want more, they either make all chips bigger equally, or, as in this case, they make the excess capacity of the few bigger chips a rarely accessed region.
This happens if up to 10 GB of GPU data can fit in the GPU-optimized range. If it's not GPU-related, nobody will know the difference. If there's a heavy GPU target in slow memory, is it because there are 10 GB of high-demand GPU assets already in the optimal range, or did the dev misplace the render target?
 
If a client pulls 336 GB/s from the slow pool, and something else needs 560 GB/s but just gets the unused channels, that's very high utilization. Another combination is that the 560GB/s load gets everything and the 336 GB/s load gets nothing, still high utilization of the bus. The real problem is that the game is written to need 896 GB/s, and having balanced capacity chips wouldn't help with that.

If by rarely accessed we mean something that can be accessed at up to 336 GB/s.

This happens if up to 10 GB of GPU data can fit in the GPU-optimized range. If it's not GPU-related, nobody will know the difference. If there's a heavy GPU target in slow memory, is it because there are 10 GB of high-demand GPU assets already in the optimal range, or did the dev misplace the render target?
The unused channels during an access to the 6 GB portion would deplete their queues. Any GPU algorithm would statistically request equally across all channels to get an equivalent 560 GB/s. So having some requests resolved (on the channels that are still free) while the others stall from serving the 6 GB portion would result in a proportional stall.

If the additional requests going to the 336 GB/s portion don't have an equivalent number of requests going to the remaining channels, wouldn't that still imbalance the queues? OTOH, if the data is spread unevenly to make the unused channels useful in that time frame, they are then not balanced equally when the 6 GB portion is not in use, and cannot do 560 GB/s in total; it would instead starve the opposite queues.
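One way to see the queue argument is a deliberately crude toy model (my own sketch, not a DRAM simulator; it assumes the commonly cited topology of 20 x 16-bit channels, 12 of which also back the 6 GB region):

```python
# Toy model: stripe GPU work evenly over 20 channel queues, then add
# traffic that can only land on the 12 channels serving the 6 GB region.
CHANNELS = 20
SLOW_CHANNELS = range(12)   # channels that also hold the 6 GB region (assumed)

queues = [0] * CHANNELS

for _ in range(1000):            # evenly striped fast-region workload
    for ch in range(CHANNELS):
        queues[ch] += 1

for _ in range(200):             # extra slow-region traffic
    for ch in SLOW_CHANNELS:
        queues[ch] += 1

print(max(queues), min(queues))  # 1200 vs 1000: evenly striped work now
# waits on the deepest queues, stalling in proportion to the extra traffic.
```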
 
Please correct me if I'm wrong, but GDDR6 chips are dual-channel (dual-ported?) and allow for simultaneous data transfers (just like the ESRAM in the XOne?). The GPU only gets the 10 GB at the full 560 GB/s when both channels are used, and the CPU can only access one channel of the larger chips, with the other given over to the GPU.
But it makes me wonder: do all the operations of the GPU require the full 560 GB/s at all times? Wouldn't some operations require less, and if so, wouldn't we have a situation where one channel moves data out while the other moves fresh data in? Would that allow for greater efficiency? If so, couldn't MS just create a virtually addressable memory space where the GPU gets 560 GB/s, the CPU gets 280 GB/s, and the lower-bandwidth tasks get the other 280? After that, I'd think it comes down to the dev ordering their memory usage to get the most out of the system?
 
As far as why CPU audio and file I/O don't need more than 336 GB/s: I interpreted the Goossen quote to mean that the CPU and I/O blocks have Infinity Fabric interfaces with bandwidth on the order of 32 B at ~1.8 GHz, which means their bandwidths are lower than even the "slow" value of 336 GB/s. Only a client capable of generating more than 336 GB/s of traffic would notice the difference, and per the interview that would be the GPU, which makes sense.
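Back-of-the-envelope on that reading (assumed figures: a 32-byte port at ~1.8 GHz, per the interpretation above):

```python
# Rough ceiling of one such Infinity Fabric client interface (assumed).
port_bytes_per_clock = 32
fabric_clock_ghz = 1.8

print(port_bytes_per_clock * fabric_clock_ghz)  # 57.6 GB/s, far below 336 GB/s
```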
Not as I see it. You have two pools, 6 GB and 10 GB; accessing one uses data channels that during that time cannot be used to access the other.
So, if you are accessing the slow memory you must use at least six 16-bit channels to reach the full pool, reducing bandwidth on the 10 GB of RAM by 168 GB/s.
So you will have 392 GB/s on the fast pool and 168 GB/s on the slow one, even if you access the slow memory just to "check the time of day".
Of course the average bandwidth will depend on the number of accesses, but regardless of the average, you cannot rely on more than 392 GB/s if you want your performance to be steady at all times.

Aren't you accessing some combo of 32-bit interfaces? Just because you divide the memory in half doesn't mean you have to divide the interface in half too.

At least from what I read, the CPU and GPU can't access RAM simultaneously.
 
Please correct me if I'm wrong, but GDDR6 chips are dual-channel (dual-ported?) and allow for simultaneous data transfers (just like the ESRAM in the XOne?). The GPU only gets the 10 GB at the full 560 GB/s when both channels are used, and the CPU can only access one channel of the larger chips, with the other given over to the GPU.
But it makes me wonder: do all the operations of the GPU require the full 560 GB/s at all times? Wouldn't some operations require less, and if so, wouldn't we have a situation where one channel moves data out while the other moves fresh data in? Would that allow for greater efficiency? If so, couldn't MS just create a virtually addressable memory space where the GPU gets 560 GB/s, the CPU gets 280 GB/s, and the lower-bandwidth tasks get the other 280? After that, I'd think it comes down to the dev ordering their memory usage to get the most out of the system?
Each channel can only access half of the data; this is not dual-ported RAM. It's indistinguishable from having two separate chips, each with its own part of the data.
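A sketch of how that plays out with the assumed Series X layout (ten 32-bit chips, six of 2 GB and four of 1 GB; the 1 KB interleave granularity is made up for illustration): the first 10 GB stripes across all ten chips, while the upper 6 GB can only stripe across the six bigger ones.

```python
# Hypothetical address-to-chip mapping for the assumed layout.
GB = 1024 ** 3
STRIDE = 1024   # made-up interleave granularity

def chip_for(addr: int) -> int:
    if addr < 10 * GB:                          # fast region: 10 chips wide
        return (addr // STRIDE) % 10
    return ((addr - 10 * GB) // STRIDE) % 6     # slow region: 6 chips wide

print(chip_for(0), chip_for(STRIDE))                   # 0 1 (10-wide stripe)
print(chip_for(10 * GB), chip_for(10 * GB + STRIDE))   # 0 1 (6-wide stripe)
# Reads in the upper region can touch at most six chips' worth of bandwidth.
```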
 
Some of the big bandwidth consumers at that time would have been the render targets. I am not sure about excluding the video memory portion that contains non-streaming textures and the streaming texture pool. Textures might be more spiky in consumption or dependent on filtering behavior, but they're being accessed by the highly parallel CUs and their vector memory pipelines. The non-streaming textures sound like those textures that were too in-demand to risk waiting on disk reads, and streaming textures might have slow stream-in to the pool, after which some could be very heavily hit by GPU reads.
However, a good amount of the motivation for those pools' sizes would be the HDD bottleneck that the next gen is changing significantly, and things like sampler feedback may also make it possible to reduce the overall size. That said, with the performance of the GPUs and the ability to use resources in less time, whatever fraction of that category remains would be the more intensely used data.

What I remember from 2013 is that texture data isn't bandwidth-intensive. Otherwise, Xbox One textures would have been half the fidelity of PS4 textures, and that didn't seem to be the case last gen.

Andrew Goossen said:
Yeah, again I think we under-balanced and we had that great opportunity to change that balance late in the game. The DMA Move Engines also help the GPU significantly as well. For some scenarios there, imagine you've rendered to a depth buffer there in ESRAM. And now you're switching to another depth buffer. You may want to go and pull what is now a texture into DDR so that you can texture out of it later and you're not doing tons of reads from that texture so it actually makes more sense for it to be in DDR. You can use the Move Engines to move these things asynchronously in concert with the GPU so the GPU isn't spending any time on the move. You've got the DMA engine doing it. Now the GPU can go on and immediately work on the next render target rather than simply move bits around.

https://www.eurogamer.net/articles/digitalfoundry-the-complete-xbox-one-interview
 
I've divided Killzone Shadow Fall's memory usage into two groups, one requiring fast access, and one requiring slow access.



Total: 3429MB



Total: 1307MB

So let's say that a console game usually requires around 25-33% of its memory to be in the fast pool.

For a next-gen console, that could be a minimum of 8 GB HBM + 32 GB DDR5 (4 GB for the OS).

@3dilettante, do you think any changes or adjustments are needed to my division? Do you think a fast-to-slow split ratio of 1 to 2-3 makes sense?

So you are advocating carving out a partition for the GPU's VRAM in both the fast portion and the slow portion of RAM?
 