I've divided Killzone Shadow Fall's memory usage into two groups: one that requires fast access and one that can tolerate slower access.
Slow-access group total: 3429MB
Fast-access group total: 1307MB
So let's say that a console game typically needs around 25-33% of its memory to be in the fast pool.
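As a quick sanity check of that ratio (my assumption here is that the 1307MB group above is the fast-access one):

```python
# Quick check of the fast-pool share, assuming the 1307MB group is the
# fast-access one and the 3429MB group is the slow-access one.
fast_mb, slow_mb = 1307, 3429
print(round(100 * fast_mb / (fast_mb + slow_mb), 1))  # 27.6 -> inside the 25-33% range
```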
For a next-gen console, that could mean a minimum of 8GB HBM plus 32GB DDR5 (with 4GB reserved for the OS).
@3dilettante, what adjustments, if any, do you think my division needs? Does a 1:2-3 fast-to-slow memory split ratio make sense to you?
Some of the big bandwidth consumers at that time would have been the render targets. I am not sure about excluding the video memory portion that contains non-streaming textures and the streaming texture pool. Textures might be more spiky in consumption or dependent on filtering behavior, but they're being accessed by the highly parallel CUs and their vector memory pipelines. The non-streaming textures sound like those textures that were too in-demand to risk waiting on disk reads, and streaming textures might have slow stream-in to the pool, after which some could be very heavily hit by GPU reads.
However, a good amount of the motivation for those pools' sizes would have been the HDD bottleneck, which the next gen changes significantly, and things like sampler feedback may also make it possible to reduce the overall size. At the same time, given the performance of the new GPUs and their ability to consume resources in less time, whatever fraction of that category remains would be the more intensely used data.
Not as I see it. You have two pools, 6GB and 10GB; accessing one uses data channels that, during that time, cannot be used to access the other.
So if you are accessing the slow memory, you have to use at least six 16-bit channels to reach the full pool, reducing the bandwidth available to the 10GB region by 168 GB/s.
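For anyone checking the arithmetic behind that 168 GB/s figure, here is a sketch under this post's assumptions (14 Gbps GDDR6, six 16-bit channels tied up by the slow pool); it restates the claim being made, not how the memory controller actually schedules channels:

```python
# Sketch of the 168 GB/s claim, assuming 14 Gbps GDDR6 and that six 16-bit
# channels are tied up servicing the slow pool.
GBPS_PER_PIN = 14                                     # GDDR6 data rate per pin
CHANNEL_WIDTH_BITS = 16                               # one GDDR6 channel
channel_bw = GBPS_PER_PIN * CHANNEL_WIDTH_BITS / 8    # 28.0 GB/s per channel
slow_pool_channels = 6
print(channel_bw * slow_pool_channels)                # 168.0 GB/s not available to the 10GB range
```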
I don't think that's necessary. From the point of view of the DRAM, the granularity can be as fine as one 16-bit channel for 16 bus clocks before picking a different address range.
There may be other constraints on the parallelism of the system, but accesses like cache line fills can be satisfied in 2-4 bursts from a single channel. Why reserve any more channels than are needed for a given transaction? A CPU's 64-byte line likely uses one channel for a few bursts, or at most two channels if they're in lockstep. Six channels would pull in 192 bytes, which doesn't map to any basic action done by the system.
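To put rough numbers on that granularity point (assuming GDDR6's burst length of 16 on a 16-bit channel, i.e. 32 bytes per burst):

```python
# Back-of-the-envelope access granularity, assuming GDDR6: a 16-bit channel
# with burst length 16 (16 transfers of 2 bytes each = 32 bytes per burst).
CHANNEL_WIDTH_BYTES = 2
BURST_LENGTH = 16
bytes_per_burst = CHANNEL_WIDTH_BYTES * BURST_LENGTH   # 32 bytes
print(64 // bytes_per_burst)    # 2 bursts fill a 64-byte CPU cache line
print(6 * bytes_per_burst)      # 192 bytes if six channels fired for one burst each
```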
That aside, I don't understand the idea that if data comes from "slow" memory, there's some innate "slow" quality to it that hurts overall performance. It's just bandwidth. If the GPU or CPU is pulling over 100 GB/s from a portion of regular memory, it's because the game needed that much data. The developer would have dictated that.
If that slow functionality needs more bandwidth, why couldn't the developer map the allocation to the address range where it could get more?
If the problem is that a given function needs 168 GB/s, while a different function needs 560 GB/s at the exact same time, that's just life. You'd still have a problem in a 20GB system if there are two pieces of functionality whose total demand exceeds the physical capabilities of the memory bus. There's only a difference if the developer somehow cannot fit a high-use target in the 10 GB range, but I haven't seen an argument that games are in imminent danger of having 10GB of render targets or the like.
I know that... but if you take a bandwidth hit due to accesses to the 6GB pool, and you were counting on the extra bandwidth, your performance will suffer.
There's a separate debate to be had about contention between client types and the difficulty in getting max utilization due to DRAM sensitivity to access patterns. However, that sort of challenge is more about how accesses map to ranges with patterns in the KB or MB range, not 10GB or more.
Because, like any distributed storage system, the access rate becomes the average across the entire available memory space. In this case that's 6GB at 336 GB/s and 10GB at 560 GB/s... so a 476 GB/s average. That's the primary reason nobody makes such imbalanced buses.
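That 476 GB/s is a capacity-weighted average, which assumes accesses are spread evenly over the whole 16GB; a quick check:

```python
# Capacity-weighted average bandwidth, assuming accesses are spread evenly
# across the whole 16GB address space.
pools = [(6, 336), (10, 560)]            # (capacity in GB, bandwidth in GB/s)
total_gb = sum(gb for gb, _ in pools)
avg = sum(gb * bw for gb, bw in pools) / total_gb
print(avg)                               # 476.0 GB/s
```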
2.5GB of the 6GB standard space is OS reserve, whose bandwidth impact the platform's design should be trying to minimize in favor of what the game needs, unless the OS or some application takes over the foreground.
I think it's risky to correlate the capacity of different spaces to bandwidth utilization. It also creates the image of a random or purposefully even distribution of accesses to all addresses in the system, which isn't the case.
Some of the most dominant bandwidth consumers are streaming compute shaders or the ROPs writing and overwriting a finite set of render targets. That's a fraction of the overall memory capacity, but those ranges get hit many more times, and this is very predictable.
The trade-off here is that Microsoft thinks developers who can produce lists of buffer capacities and frame-time budgets can figure out whether something has significant stretches where it pulls close to or more than 24 bytes per cycle from DRAM. If so, the developer would be expected to move that memory target into an address range that can supply 40 bytes per cycle.
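Those per-cycle figures line up with the two pools' bandwidths if "cycle" here means the 14 GT/s effective GDDR6 data rate (my assumption):

```python
# Converting bytes-per-cycle to GB/s, assuming "cycle" means the 14 GT/s
# effective GDDR6 data rate.
TRANSFER_RATE = 14e9                     # transfers per second
for bytes_per_cycle in (24, 40):
    print(bytes_per_cycle, "B/cycle ->", bytes_per_cycle * TRANSFER_RATE / 1e9, "GB/s")
# 24 B/cycle -> 336 GB/s (slow 6GB range); 40 B/cycle -> 560 GB/s (fast 10GB range)
```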
If they had the BOM for 20GB, they would have a uniform setup. Instead, they are working to mitigate the negative impact of having chips of unequal capacity. They need to split the address space to prevent excess requests to the larger chips from lowering the average bandwidth.
If a client pulls 336 GB/s from the slow pool, and something else needs 560 GB/s but just gets the unused channels, that's very high utilization. Another combination is that the 560GB/s load gets everything and the 336 GB/s load gets nothing, still high utilization of the bus. The real problem is that the game is written to need 896 GB/s, and having balanced capacity chips wouldn't help with that.
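A rough version of that utilization math, assuming the slow 6GB sits on 6 of the 10 chips and the other 4 chips remain free for the second client:

```python
# Rough utilization for the scenario above, assuming the slow 6GB pool sits on
# 6 of the 10 chips and the remaining 4 chips serve the other client.
TOTAL_BW = 560
SLOW_POOL_BW = 336                        # six chips fully busy
leftover = TOTAL_BW - SLOW_POOL_BW
print(leftover)                           # 224 GB/s left for the 560 GB/s load
print(SLOW_POOL_BW + leftover)            # 560 GB/s: the bus is saturated either way
```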
The best-known solution is to isolate the excess into a rarely accessed upper region: the 6GB partition. That's a much better average than random access over a single undifferentiated partition, and it's worth the trouble of making the split.
If by "rarely accessed" we mean something that can still be accessed at up to 336 GB/s.
You get 560 GB/s only if the address space is spread evenly across all the chips. So if they want more capacity, they either make all chips bigger equally or, as in this case, they turn the excess on the few bigger chips into a rarely accessed region.
That works as long as up to 10GB of GPU data fits in the GPU-optimized range. If the data isn't GPU-related, nobody will notice. If a heavily used GPU target ends up in slow memory, is it because there are already 10GB of high-demand GPU assets in the optimal range, or did the dev misplace the render target?