But that doesn't explain (to me anyway!) why the BW drop was far higher than the bandwidth the CPU was actually using, and why that can't be fixed with a better memory controller. I would have expected (as did everyone else, because the BW drop came as a surprise) that while the CPU was accessing the RAM the GPU had to wait, but that the impact would be roughly 1:1 between CPU usage and lost bandwidth. What we saw on Liverpool was the RAM losing efficiency somehow, as if there were a switching penalty. I would hope AMD can fix that issue and get a near-1:1 impact on their console UMAs, so that 1 ms of full-rate RAM access for the CPU means only 1 ms less available for the GPU, with the rest of the frame time accessible at full rate.
This came up back with the launch of the current generation, with the ESRAM discussions and that PS4 contention slide.
There's a broad set of reasons, but a fundamental issue is that DRAM is not easy to get good utilization out of.
DRAM is optimized for density and cost, which means the speed of the internal DRAM arrays has improved only slowly.
There is a preference for hitting the same array, or running a very linear access pattern, so that page activation and access can be pipelined without showing up as lost cycles on the bus.
To save on cost and trace count, the bus is shared between reads and writes and has to be turned around whenever the access type changes.
It helps if the controller can build up a long list of pending accesses, which it can then re-order, combine, or chain so that they take advantage of pipelining in the DRAM, don't hop between banks or bank groups, and don't force bus turnarounds that block all transfers for tens of cycles.
The trade-off is that collecting a large number of accesses means accepting more latency for the individual accesses.
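To make that reorder/combine idea concrete, here's a toy Python model (the cycle costs, request layout, and scheduling policy are all made up for illustration; no real controller works exactly like this):

```python
# Toy model of a DRAM controller command queue (illustration only; the
# numbers and the policy are invented, not any real controller's behaviour).
#
# It compares servicing requests in arrival (FIFO) order against a simple
# reordering policy: group requests by bank and row, and within each group
# batch reads before writes, so fewer row activations and read/write
# turnarounds occur.

from collections import namedtuple

Req = namedtuple("Req", "bank row is_write")

# Made-up cycle costs, roughly shaped like real penalties.
CAS_CYCLES = 4          # pipelined access to an already-open row
ACTIVATE_CYCLES = 24    # closing one row and opening another
TURNAROUND_CYCLES = 12  # switching the bus between reads and writes


def service_cost(requests):
    """Total bus cycles to service `requests` in the given order."""
    open_row = {}          # bank -> currently open row
    last_was_write = None
    cycles = 0
    for r in requests:
        if last_was_write is not None and last_was_write != r.is_write:
            cycles += TURNAROUND_CYCLES
        if open_row.get(r.bank) != r.row:
            cycles += ACTIVATE_CYCLES
            open_row[r.bank] = r.row
        cycles += CAS_CYCLES
        last_was_write = r.is_write
    return cycles


def reorder(requests):
    """Group by (bank, row), then put reads before writes within each group."""
    return sorted(requests, key=lambda r: (r.bank, r.row, r.is_write))


if __name__ == "__main__":
    import random
    random.seed(1)
    # A mixed pattern: scattered reads and writes across banks and rows.
    pending = [Req(bank=random.randrange(4),
                   row=random.randrange(8),
                   is_write=random.random() < 0.3)
               for _ in range(64)]

    fifo = service_cost(pending)
    sched = service_cost(reorder(pending))
    print(f"FIFO order:      {fifo} cycles")
    print(f"Reordered queue: {sched} cycles "
          f"({100 * (fifo - sched) / fifo:.0f}% fewer)")
```

The saving comes entirely from fewer activations and turnarounds, and it only exists because the queue was allowed to get deep enough to find requests worth grouping, which is exactly where the extra latency for individual accesses comes from.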
GPUs are structured to tolerate long latency and to generate many accesses, and they accept a lot of reordering.
CPUs are latency-optimized and don't tolerate much reordering, and they can be running workloads that just don't make very good access patterns.
There's a balancing act in how long the controller can run a high-utilization pattern before it needs to disrupt it in favor of a latency-sensitive client; servicing that client can mean switching banks or bus modes for dozens of cycles, and then eating a similar penalty switching back.
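Roughly how that turns into a worse-than-1:1 bandwidth hit, in made-up numbers (none of these are measured Liverpool/PS4 figures, just the shape of the effect):

```python
# Back-of-envelope for why the GPU can lose more bandwidth than the CPU
# actually consumes. Every number below is assumed for illustration.

peak_bw = 176.0        # GB/s, nominal bandwidth of a PS4-like GDDR5 setup
gpu_eff_alone = 0.85   # assumed DRAM efficiency with pure GPU streaming
gpu_eff_mixed = 0.65   # assumed efficiency once CPU traffic keeps breaking
                       # up the streams (extra activations + turnarounds)
cpu_demand = 10.0      # GB/s the CPU actually reads/writes

gpu_alone = peak_bw * gpu_eff_alone
gpu_mixed = peak_bw * gpu_eff_mixed - cpu_demand

print(f"GPU alone        : {gpu_alone:.0f} GB/s usable")
print(f"GPU + 10 GB/s CPU: {gpu_mixed:.0f} GB/s usable")
print(f"CPU consumed {cpu_demand:.0f} GB/s, GPU lost {gpu_alone - gpu_mixed:.0f} GB/s")
```

The CPU's own traffic is the smaller part of the loss; the bigger part is the streaming efficiency the GPU gives up once its access patterns keep getting broken up.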
AMD has supposedly incorporated more intelligent controllers, and may have added more levels of priority. Zen has a much better memory subsystem than Jaguar and is more tolerant of latency, which could help. There are also cache subsystem changes and protocol differences that might reduce how many high-priority operations, like atomics, need to go all the way to memory.
On the other hand, Zen's higher performance also gives it a greater ability to put demands on the memory subsystem than the much weaker cores and the limited bandwidth of the coherent memory bus in the Jaguar SoCs ever could.
Utilization should be better, but I don't think the utilization loss can go to zero.
Has this been discussed?
I'm trying to wrap my head around this. Apparently the 'Velocity Architecture' will provide a 100GB pool of assets that a game can have instant access to.
I can't quite get my head around how this can be 'instant' - I'm thinking this 100GB must essentially live on the SSD and therefore be limited by the 4.8GB/s throughput.
Can anyone explain how this works please?
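Just doing rough arithmetic on those figures (taking the quoted 4.8GB/s at face value):

```python
# Rough arithmetic on the numbers above; these are just the quoted figures,
# nothing measured.
pool_gb = 100.0
ssd_gbs = 4.8            # quoted (compressed) SSD throughput
frame_ms = 16.7          # one frame at 60 fps

print(f"Whole 100 GB pool at {ssd_gbs} GB/s: {pool_gb / ssd_gbs:.0f} s")
print(f"Data movable within one frame     : {ssd_gbs * frame_ms:.0f} MB")  # GB/s * ms = MB
```

So whatever 'instant' means, it presumably isn't pulling that whole pool across in any instant timeframe, which is what confuses me.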
An SSG version of a gaming card could be a pricier way to get a storage subsystem into PCs with parameters similar to the customized console subsystems. It wouldn't require mass replacement of all the systems where the CPU doesn't have built-in compression and extra DMAC hardware and the motherboard lacks a PCIe 4.0 NVMe slot. There would need to be some transfers over the PCIe bus to the graphics card, but those could be limited to swapping a game's asset partition in and out rather than constant streaming.
It'd be a value-add for AMD's hardware, at least.