oldschoolnerd
Newcomer
I have a theory that things may not be as smooth for the PS4's performance as the big numbers suggest. Now, this is a theory and it involves a certain amount of conjecture, though I think the principles behind it are sound.
So my theory is based on one basic principle of hardware performance: contention. Whenever more than one thing contends for a given resource, they all suffer to some extent; the more contention, the more suffering. When there is RAM bandwidth contention, the overall utilisation of the bus drops, and the more contention, the more it drops. In the PS4 the primary consumers (clients) of the GDDR5 RAM are the 8 CPU cores and the GPU. If there is contention on the bus, it manifests as the clients being stalled for longer, because the time between a request for data being made and that data reaching the client increases.
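To make that principle concrete, here's a toy model (purely illustrative; the request timings and service cost are made up, and a real memory controller is far more sophisticated) of two clients sharing one bus. Each request occupies the bus for a fixed number of cycles, and a request that arrives while the bus is busy has to wait:

```python
# Toy model of bus contention. Numbers are invented for illustration only.

def average_wait(request_times, service_cycles):
    """Average cycles each request waits before the shared bus is free."""
    bus_free_at = 0
    total_wait = 0
    for t in sorted(request_times):
        start = max(t, bus_free_at)      # wait if the bus is still busy
        total_wait += start - t
        bus_free_at = start + service_cycles
    return total_wait / len(request_times)

# One client alone: requests are spaced out, so nobody ever waits.
solo = average_wait([0, 10, 20, 30], service_cycles=4)

# Two clients interleaved on the same bus: the same requests plus a contender
# arriving a couple of cycles later each time.
contended = average_wait([0, 2, 10, 12, 20, 22, 30, 32], service_cycles=4)

print(solo, contended)  # the contended case has a higher average wait
```

Even in this trivially simple model the average wait goes from zero to non-zero the moment a second client shares the bus; pile on more clients and bigger requests and the waits only grow.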
Now, we know that GPUs are more latency tolerant, mainly because they work on comparatively large quantities of data that can be read/written in larger chunks. That said, they are only so tolerant; they will need the data eventually, and without it their utilisation will drop. It's a different story for the CPU: if there is an L1/L2 cache miss, a read has to be done from system memory, and that is not latency tolerant. While the CPU core is waiting for that data, it is stalled.
So, with reference to these principles, what can be said about the PS4 architecture? We have two types of clients: the CPU cores (8 of them) and the shader cores (potentially 1152 of those). It is well known that the GPU is a massive consumer of bandwidth; in fact, you could say that the bandwidth available to the GPU is as important as, if not more important than, the number of CUs etc. That GPU will be soaking up all the bandwidth it can.
Overall this is not a normal architecture; it is quite a novel one. If you look at a PC, for example, the GPU is in the main getting its memory requests serviced from its own GDDR5 RAM. This leaves the CPU with pretty much unfettered access to its DDR3 RAM: the CPU cores only contend with one another, and since their reads and writes are so small they usually occupy the bus for only a cycle, the memory controller has a simple job arbitrating between the cores. We've seen this. It works well. But in the case of the PS4 and its much-touted "unified RAM" this is not the case. You have potentially 1152 other clients contending with the CPU for bandwidth. To my mind this says "massive contention". The CPU cores do not like contention; they need the lowest possible latency on their memory requests. The other thing that doesn't help is the slightly higher latency of GDDR5. No, it's not much (from what I have seen it's about 10 - 15%), but that is still 10 - 15% longer the CPU will be waiting for that piece of data.
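A quick back-of-envelope calculation puts a rough number on that latency penalty. The baseline miss cost below is hypothetical (a ~100 ns round trip is a commonly quoted ballpark for a main-memory miss, not a measured PS4 figure), and the 1.6 GHz clock is the widely reported figure for the PS4's Jaguar cores:

```python
# Back-of-envelope sizing of the GDDR5 latency penalty on a CPU cache miss.
# All numbers are assumptions for illustration, not measurements.

CPU_CLOCK_GHZ = 1.6        # widely reported PS4 CPU clock (assumed here)
DDR3_MISS_NS = 100.0       # hypothetical baseline miss penalty in nanoseconds

extra_stall_cycles = {}
for extra in (0.10, 0.15):                       # the 10-15% range above
    gddr5_miss_ns = DDR3_MISS_NS * (1 + extra)
    # extra nanoseconds * cycles-per-nanosecond = extra stall cycles
    extra_stall_cycles[extra] = (gddr5_miss_ns - DDR3_MISS_NS) * CPU_CLOCK_GHZ

for extra, cycles in sorted(extra_stall_cycles.items()):
    print(f"+{extra:.0%} latency -> ~{cycles:.0f} extra stall cycles per miss")
```

On those assumptions that's on the order of 16-24 extra stall cycles per miss, before any contention-induced queueing is even added on top.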
The Sony designers will have thought about this contention issue early on and attempted to mitigate it with some logic in the memory controller, perhaps prioritising CPU requests for data over GPU requests. However, this won't be as simple as arbitrating between contending CPU core requests, because the GPU will be requesting bigger chunks of data that take more than one cycle to service in their entirety. What is the memory controller meant to do? Say to the GPU client, "yep, you have some of the data, but I need to interrupt you while I hand the bus over to the waiting CPU core"? That sounds really complex to me, and certainly not great for the GPU, and not great for the overall utilisation of the bus either. Or does the memory controller say to the CPU client, "please wait while I give this GPU request some more cycles to complete", resulting in the CPU stalling for longer? At the end of the day, whatever they do, I don't think you can magic away all that contention; somewhere, something is going to have to give. I suggest that either the CPU or the GPU will be stalling much more than it would in a split-pool architecture, be that some DDR3 and some GDDR5 or whatever...
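The two arbitration options can be sketched as a toy model. To be clear, this is pure speculation about the shape of the trade-off, not a description of the actual PS4 memory controller; the burst length, arrival time, and service cost are invented:

```python
# Toy arbiter: one long GPU burst holds the bus, and a single CPU read
# arrives mid-burst. Compare two policies. All numbers are made up.

def cpu_wait_and_gpu_finish(burst_cycles, cpu_arrival, cpu_service, preempt):
    """Return (cpu_wait_cycles, gpu_finish_cycle) for one burst + one read."""
    if preempt:
        # Policy A: pause the GPU burst, service the CPU immediately,
        # then resume the remainder of the burst.
        cpu_wait = 0
        gpu_done = burst_cycles + cpu_service
    else:
        # Policy B: the CPU stalls until the whole burst drains.
        cpu_wait = burst_cycles - cpu_arrival
        gpu_done = burst_cycles
    return cpu_wait, gpu_done

# A 32-cycle GPU burst; a CPU read arrives at cycle 4 and needs 4 cycles.
print(cpu_wait_and_gpu_finish(32, 4, 4, preempt=True))   # (0, 36): GPU pays
print(cpu_wait_and_gpu_finish(32, 4, 4, preempt=False))  # (28, 32): CPU pays
```

Either way somebody eats the delay, which is the point: the arbitration policy only chooses who pays, not whether anyone does.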
To my mind the biggest objection to this argument is: "Couldn't the Sony designers see this happening? They are smart guys; why would they build a system that is gimped by memory bandwidth contention?" It's a good question... Maybe they didn't do the simulation correctly (unrealistic workload? tight budget, so they couldn't afford the best equipment?). Maybe they did and it was all working fine, then the spec changed from 4GB to 8GB at the last minute, and something that had been running just about OK at an edge condition popped over that edge.
"Why now? Surely they would have found out sooner." Really? We know that the PS4 dev kits were high-spec PCs, so when did the devs actually get a dev kit running the final SoC, married to an 8GB unified GDDR5 RAM pool?