Is PS4 hampered by its memory system?

I have a theory that PS4 performance may not be as smooth as the big numbers suggest. Now, this is a theory and it involves a certain amount of conjecture, though I think the principles behind it are sound.

So my theory is based on one basic principle of hardware performance: contention. Whenever more than one thing is contending for a given resource, both will suffer to some extent. The more contention, the more suffering. When there is RAM bandwidth contention, the overall utilisation of the bus drops; the more contention, the more it drops. In the PS4 the primary consumers (clients) of the GDDR5 RAM are the 8 CPU cores and the GPU. If there is contention on the bus, it manifests itself as clients being stalled for longer, because the time between a request for data being made and that data reaching the client increases.

Now, we know that GPUs are more latency tolerant, mainly because they work on comparatively large quantities of data that can be read/written in larger chunks. Having said that, they are only so tolerant; they will need the data eventually, and without it their utilisation will drop. It's a different story for the CPU: if there is an L1/L2 cache miss, a read has to be done from system memory, and that is not latency tolerant. While the CPU core is waiting for that data, it is stalled.
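A toy way to see the difference (my own microbenchmark sketch, nothing console-specific): a chain of dependent loads over a random permutation forces the core to sit through every single miss, which is the kind of access a CPU does all the time and which can't hide extra latency, whereas GPU-style streaming reads can be overlapped and prefetched.

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    const std::size_t n = 1u << 24;                 // 16M entries, far bigger than any L2
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});

    // Sattolo shuffle: a single big cycle, so every step lands on a fresh, unpredictable address.
    std::mt19937_64 rng{42};
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    auto t0 = std::chrono::steady_clock::now();
    std::size_t idx = 0;
    for (std::size_t i = 0; i < n; ++i) idx = next[idx];   // each load must wait for the previous one
    auto t1 = std::chrono::steady_clock::now();

    double nsPerLoad = std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
    std::printf("dependent loads: ~%.1f ns each (final idx %zu)\n", nsPerLoad, idx);
}
```

Every extra nanosecond of memory latency shows up directly in that per-load figure; a streaming loop over the same array would barely notice it.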

So, with reference to these principles, what can be said about the PS4 architecture? We have two types of clients: the CPU cores (8 of them) and the shader cores (potentially 1152 of those). It is well known that the GPU is a massive consumer of bandwidth; in fact, you could say the bandwidth available to the GPU is as important as, if not more important than, the number of CUs. That GPU will be soaking up all the bandwidth it can.

Overall this is not a normal architecture; it is quite a novel one. If you look at a PC, for example, the GPU mostly gets its memory requests serviced from its own GDDR5 RAM. That leaves the CPU with pretty much unfettered access to its DDR3 RAM: the CPU cores only contend with one another, and since their reads and writes are so small they usually occupy the bus for only a cycle, the memory controller has a simple job arbitrating between the cores. We've seen this. It works well. But in the case of the PS4 and its much touted "unified RAM" this is not the case. You have potentially 1152 other clients contending with the CPU for bandwidth. To my mind this says "massive contention". The CPU cores do not like contention; they need the lowest possible latency on their memory requests. The other thing that doesn't help is the slightly higher latency of GDDR5. No, it's not much - from what I have seen it's about 10-15% - but that is still 10-15% longer the CPU will be waiting for that piece of data.
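To put a rough number on that last point, here is a back-of-envelope calc using assumed figures (a ~60 ns DDR3 miss and a ~1.6 GHz Jaguar clock, neither of which is a confirmed PS4 number):

```cpp
#include <cstdio>

int main() {
    const double ddr3MissNs = 60.0;   // assumed DDR3 round-trip miss latency
    const double cpuGHz     = 1.6;    // assumed Jaguar clock
    for (double extra : {0.10, 0.15}) {
        double extraNs = ddr3MissNs * extra;
        std::printf("+%.0f%% latency -> +%.1f ns -> ~%.0f extra stall cycles per miss\n",
                    extra * 100.0, extraNs, extraNs * cpuGHz);
    }
}
```

Call it roughly ten to fifteen extra stall cycles per L2 miss under those assumptions - small per miss, but it adds up if misses are frequent.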

The Sony designers will have thought about this contention issue early on and attempted to mitigate it with some logic in the memory controller, maybe prioritising CPU requests for data over GPU requests. However, this won't be as simple as arbitrating between contending CPU core requests, because the GPU will be requesting bigger chunks of data that take more than one cycle to service in their entirety. What's the memory controller meant to do? Say to the GPU client, "yep, you have some of the data, but I need to interrupt you while I hand the bus over to the waiting CPU core"? That sounds really complex to me, and certainly not great for the GPU, nor for the overall utilisation of the bus. Or does the memory controller say to the CPU client, "please wait while I give this GPU request some more cycles to complete", resulting in the CPU stalling for longer? At the end of the day, whatever they do, I don't think you can magic away all that contention; somewhere, something is going to have to give. I suggest that either the CPU or the GPU will be stalling much more than it would in a split pool architecture, be that some DDR3 and some GDDR5 or whatever...
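Here's a deliberately crude, made-up arbiter simulation (not how any real memory controller works) just to show the trade-off being described: either the CPU beat waits for the GPU burst to drain, or the burst gets interrupted and stretched.

```cpp
#include <cstdio>
#include <queue>

// One bus, one beat per cycle. GPU traffic arrives as multi-beat bursts, CPU
// traffic as single beats. Compare "let the burst drain" vs "preempt for the CPU".
struct Req { int beats; long arrival; };

static void run(bool preemptForCpu) {
    std::queue<Req> gpuQ, cpuQ;
    long cpuWait = 0, cpuServed = 0;
    int burstLeft = 0;                                // beats left in the GPU burst currently on the bus

    for (long cycle = 0; cycle < 1000000; ++cycle) {
        if (cycle % 10 == 0) gpuQ.push({8, cycle});   // GPU: an 8-beat burst every 10 cycles
        if (cycle % 16 == 0) cpuQ.push({1, cycle});   // CPU: one beat every 16 cycles

        bool cpuWaiting = !cpuQ.empty();
        if (burstLeft > 0 && !(preemptForCpu && cpuWaiting)) {
            --burstLeft;                              // keep streaming the GPU burst
        } else if (cpuWaiting) {
            cpuWait += cycle - cpuQ.front().arrival;  // latency the CPU core actually saw
            cpuQ.pop(); ++cpuServed;
        } else if (burstLeft > 0) {
            --burstLeft;
        } else if (!gpuQ.empty()) {
            burstLeft = gpuQ.front().beats - 1;       // start a new burst (first beat this cycle)
            gpuQ.pop();
        }
    }
    std::printf("%-12s avg CPU wait %.2f cycles\n",
                preemptForCpu ? "preempt GPU:" : "drain GPU:",
                cpuServed ? double(cpuWait) / cpuServed : 0.0);
}

int main() {
    run(false);   // arbiter lets GPU bursts finish -> CPU requests queue up behind them
    run(true);    // arbiter interrupts bursts for CPU beats -> GPU bursts get stretched instead
}
```

Whichever policy you pick, the cost doesn't vanish; it just moves from one client to the other, which is the "something is going to have to give" point above.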

To my mind the biggest objection to this argument is: "Couldn't the Sony designers see this happening? They are smart guys; why would they build a system that is gimped by memory bandwidth contention?" It's a good question... maybe they didn't do the simulation correctly (unrealistic workload? tight budget, so they couldn't afford the best equipment?). Maybe they did and it was all working fine, then the specs changed from 4GB to 8GB at the last minute, and something that had been running just about OK on an edge condition popped over that edge.
"Why now? Surely they would have found out sooner." Really? We know that the PS4 dev kits were high-spec PCs; when did the devs actually get a dev kit running the final SoC, married to an 8GB unified GDDR5 RAM pool?
 
We know that the PS4 dev kits were high-spec PCs; when did the devs actually get a dev kit running the final SoC, married to an 8GB unified GDDR5 RAM pool?
Some time ago. All launch titles have been developed and tested on final hardware. All game show demos running on dev kits are doing so on final silicon.
 
Is this based on anything at all or, as you admitted, is it pure conjecture?

So far developers have been stating that the PS4 is both powerful and easy to develop on. From what I gather, they've been more willing to parade that console version over others.
Don't go there. This is a simple technical discussion about the impact of contention on asymmetric multiprocessor systems (CPU and GPU). It can and should be answered on a technical level.
 
I think what he is describing is either not an issue or a solved problem.
Whether the memory is GDDR5 or DDR3 is, I think, irrelevant; AMD and Intel have already solved the issue of the memory controller servicing different on-chip clients, be they CPU cores or an iGPU.
I don't know how they are doing the balancing, but they are successfully doing it.
 
If that were a problem, it would have shown up in any unified memory system where the CPU and GPU access the same pool: AMD APUs, the 360, the Xbox One. So far I'm willing to believe that bandwidth remains more important.

The key must be in the memory request scheduling (which, on such an APU, must be a work of art). If the GPU is more tolerant, it can have a longer queue so that its requests can be reordered and optimised, while the CPU gets a low-latency priority?
 
A quick summary would be:

Limited but highly available and reliable bandwidth is reserved for the CPU. In addition, the CPU can pass data to the GPU without hitting RAM at all, so without using any bandwidth whatsoever. It can also work against its L1 and L2 caches, which likewise doesn't touch GDDR5 bandwidth. Developers have quite detailed control over which chunks of memory they want to access in which way.
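As a hedged illustration of the "no copy, work out of cache" point (generic C++ with a made-up gpu_submit stub, not the PS4 API): under a unified, coherent address space the hand-off is just a pointer, so only the GPU's eventual reads touch the bus.

```cpp
#include <cstdio>
#include <cstring>
#include <vector>

struct Particle { float x, y, z, pad; };

// Made-up stand-in for whatever call the real runtime uses to hand work to the
// GPU -- here it just logs the pointer so the sketch compiles and runs.
static void gpu_submit(const void* data, std::size_t bytes) {
    std::printf("submitted %zu bytes at %p\n", bytes, data);
}

int main() {
    std::vector<Particle> particles(4096);
    const std::size_t bytes = particles.size() * sizeof(Particle);

    // Split-pool style: stage a copy into an "upload" buffer first. The copy
    // itself burns bus bandwidth before the GPU ever reads a byte.
    std::vector<Particle> staging(particles.size());
    std::memcpy(staging.data(), particles.data(), bytes);
    gpu_submit(staging.data(), bytes);

    // Unified-memory style: hand the GPU the pointer the CPU already wrote to.
    // The hand-off itself costs nothing, and small recent CPU writes may still
    // be sitting in L1/L2 rather than having touched GDDR5 at all.
    gpu_submit(particles.data(), bytes);
}
```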
 
OP: you should probably start with reading this:

http://www.eurogamer.net/articles/digitalfoundry-how-the-crew-was-ported-to-playstation-4

Then come back here with more specific questions.

OK, nice one. So the Onion bus takes care of the CPU, and Garlic the GPU. However, there is a diagram that shows the GPU getting 176GB/s and Onion getting 20GB/s, which I don't think can possibly be the case. They both have to share the one and only 176GB/s bus to the actual chips?

The other thing of note is that they say it all went horribly wrong until they worked out they needed to make the Onion and Garlic RAM allocations in different parts of the RAM. So it's unified in one sense, but devs still have to allocate RAM in the correct area or performance goes down the pan.
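Put as a sketch with made-up names (MemDomain and allocIn are mine, not anything from the SDK), the "right area" rule from the article boils down to each allocation declaring which client is going to hammer the buffer:

```cpp
#include <cstdio>
#include <cstdlib>

// The placement of a buffer decides which path services it: a cache-coherent,
// CPU-friendly route ("Onion"-like) or a high-bandwidth GPU route ("Garlic"-like).
enum class MemDomain { CpuCoherent, GpuFast };

static void* allocIn(MemDomain d, std::size_t bytes) {
    // Stand-in: a real runtime would carve these from different heaps/regions.
    std::printf("alloc %zu bytes in %s domain\n", bytes,
                d == MemDomain::CpuCoherent ? "CPU-coherent" : "GPU-fast");
    return std::malloc(bytes);
}

int main() {
    void* cpuScratch   = allocIn(MemDomain::CpuCoherent, 64 * 1024);     // CPU reads/writes it often
    void* vertexStream = allocIn(MemDomain::GpuFast, 8 * 1024 * 1024);   // GPU-heavy traffic only
    std::free(cpuScratch);
    std::free(vertexStream);
}
```

Get that wrong (say, CPU code regularly reading a buffer placed on the GPU-fast path) and you pay the kind of penalty the article describes.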
 
I signed up to this forum to ask that exact same question. Here are the answers I got.

Isn't the unified memory on the consoles something of a performance detractor in some cases?
Besides the GDDR5 splitting its bandwidth between the CPU and GPU, wouldn't there be some instances where either the CPU or GPU has to stall a memory read because the other is reading from memory? Same with writes: if one processing unit is writing to memory and the other needed to as well, it would have to stall and wait?

Since the Jaguar CPU has so little L2 cache, wouldn't there be many cache misses, creating countless main memory reads?

I specifically want to talk about memory bandwidth and read/write stalling, rather than the potential benefits we've all read and talked about HSA and hUMA providing in OTHER areas.

Some amount of interference would happen.
The GPU doesn't seem like it would be as severely affected. The numbers for the CPU block's bandwidth (<20 GB/s) are a very small fraction of the whole, so the CPU is physically unable to block the GPU from most of the bandwidth. The cache subsystem is designed to reduce the amount of time the CPU would get anywhere near that.
The ratio of GPU to CPU bandwidth is such that the bus should manage decently, even if the CPUs were actively trying to interfere with the GPU. The GPU is able to tolerate a significant amount of latency, so that doesn't seem like a likely problem.

The CPU might be a bigger loser here, depending on how effectively the memory controllers can balance the latency requirements of the CPUs versus the sheer volume of GPU accesses.
This wouldn't be unique to Orbis.
APUs in general have tended to have noticeably worse memory latency, with the earliest ones having pretty terrible numbers.
Certain numbers of an unspecified competitor I can't use in a versus comparison to Orbis show it's not a GDDR5 or single-pool problem.



I have been thinking about this for months - the stalling issue - but I guess I'm kinda pro-Sony because they are the underdog (and I don't want there to be an MS monopoly in the industry right now), so I didn't want to give Xbots the opportunity to gloat and celebrate. If you look at the PS4, polycount and draw distance are fine, but the texture quality in games continually looks a step down from what you can achieve on a 7850 @ 30fps on a PC and from what the Xbox can do.

There will be numerous cache misses on the Jaguar CPU for both consoles. But the Xbox has ESRAM and perhaps a superior memory controller setup (we don't know enough about Sony's PS4 yet). In the DF article the Xbox developers said the CPU and GPU can simultaneously access the DDR3. The PS4 may only be able to have a single processing unit access the GDDR5 at once. Whether that is a known downside of GDDR5, an unknown peculiarity of GDDR5 that nobody expected, or poor design of the PS4's memory controllers, I don't know.
 
If that were a problem, it would have shown up in any unified memory system where the CPU and GPU access the same pool: AMD APUs, the 360, the Xbox One. So far I'm willing to believe that bandwidth remains more important.

The key must be in the memory request scheduling (which, on such an APU, must be a work of art). If the GPU is more tolerant, it can have a longer queue so that its requests can be reordered and optimised, while the CPU gets a low-latency priority?

I take your point, but I don't think RAM access latencies have got massively smaller than they were when the 360 came out. However, the number of requests hitting the RAM must have jumped by a significant amount this gen. Hence the potential for more contention.

At the end of the day, whatever Onion and Garlic do, they are still going to be contending for that single 176GB/s bus, which can only do one thing at a time.
 
Is contention really a greater issue than it would have been with split pools? With split pools the CPU would have to copy data from RAM to VRAM, which means you have contention on the VRAM anyway. At worst this is a wash for the GPU accessing memory, no? On the CPU side, I can see how you might have a point if CPUs are more sensitive to latency and stalls. But you also don't have to copy data from RAM to VRAM any more, so how does that win for the CPU weigh against the potential for stalls?
 
I signed up to this forum to ask that exact same question. Here are the answers I got.

I have been thinking about this for months - the stalling issue - but I guess I'm kinda pro-Sony because they are the underdog (and I don't want there to be an MS monopoly in the industry right now), so I didn't want to give Xbots the opportunity to gloat and celebrate. If you look at the PS4, polycount and draw distance are fine, but the texture quality in games continually looks a step down from what you can achieve on a 7850 @ 30fps on a PC and from what the Xbox can do.

There will be numerous cache misses on the Jaguar CPU for both consoles. But the Xbox has ESRAM and perhaps a superior memory controller setup (we don't know enough about Sony's PS4 yet). In the DF article the Xbox developers said the CPU and GPU can simultaneously access the DDR3. The PS4 may only be able to have a single processing unit access the GDDR5 at once. Whether that is a known downside of GDDR5, an unknown peculiarity of GDDR5 that nobody expected, or poor design of the PS4's memory controllers, I don't know.

Sorry, I'm on a phone and quoting/bolding is difficult, but the guy who replied to you saying the CPU could be the loser in all this - that's the point I am making. Multi-core coding is difficult at best, and there are going to be logical dependencies between the threads. ANY increased latency on CPU requests could impact more than just the thread concerned. Multiply that by the number of threads and the potential for stalls (logical or physical) has to go up.
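A trivial sketch of that ripple effect (generic C++ threads, nothing console-specific): if the thread producing some data eats extra latency, every thread that logically depends on its output inherits the delay, however fast its own memory accesses are.

```cpp
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>

// Simulated "memory-bound" stage: the sleep stands in for time lost to stalls.
static int buildVisibilityList(int stallMs) {
    std::this_thread::sleep_for(std::chrono::milliseconds(stallMs));
    return 42;   // pretend result
}

int main() {
    for (int stallMs : {10, 15}) {   // 15 = the same work plus some extra memory latency
        auto t0 = std::chrono::steady_clock::now();

        // Thread A produces data that the next stage logically depends on.
        std::future<int> visibility = std::async(std::launch::async, buildVisibilityList, stallMs);

        // The dependent stage (here: main) cannot start until A delivers, so
        // A's extra stall time is paid again by everything waiting on it.
        int items = visibility.get();

        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("producer stalled %2d ms -> dependent work started after %lld ms (items=%d)\n",
                    stallMs, static_cast<long long>(ms), items);
    }
}
```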
 
Is contention really a greater issue than it would have been with split pools? With split pools the CPU would have to copy data from RAM to VRAM, which means you have contention on the VRAM anyway. At worst this is a wash for the GPU accessing memory, no? On the CPU side, I can see how you might have a point if CPUs are more sensitive to latency and stalls. But you also don't have to copy data from RAM to VRAM any more, so how does that win for the CPU weigh against the potential for stalls?
I see what you are saying, and some stuff does need to be copied between the pools, but most of the memory accesses performed by the GPU are to that VRAM. Just look at the NOT PS4 getting 150GB/s to the ESRAM and only 50GB/s to system RAM.
 
OP: you should probably start with reading this:

http://www.eurogamer.net/articles/digitalfoundry-how-the-crew-was-ported-to-playstation-4

Then come back here with more specific questions.

This article ( part 1 and part 2 ) has been mentioned before but not really discussed all that much:

http://www.vgleaks.com/playstation-4-includes-huma-technology/

The Eurogamer link also links back to an Extremetech article:
http://www.extremetech.com/gaming/1...avily-modified-radeon-supercharged-apu-design

and I was curious whether, based on what we know now, we can say which of the memory interconnect strategies laid out in:
http://www.extremetech.com/gaming/1...ily-modified-radeon-supercharged-apu-design/2
http://www.extremetech.com/gaming/1...ily-modified-radeon-supercharged-apu-design/3

is being used: either directly (strategy 1, most likely), a ring bus (strategy 2, least likely), or a crossbar (somewhere in between in likelihood).

I assume number 1, since the other ways are more expensive/complex. A crossbar would give you better bandwidth utilization but would cost a bunch more silicon, design time and complexity (more testing and unknowns to contend with).
 
I take your point, but I don't think RAM access latencies have got massively smaller than they were when the 360 came out. However, the number of requests hitting the RAM must have jumped by a significant amount this gen. Hence the potential for more contention.

- the GDDR5 bus in the PS4 has 176GB/s with 20GB/s of that contended. Depending on priorities, that would give "156GB/s to the GPU".
- the DDR3 bus in the XB1 has 68GB/s with 30GB/s of that contended. Depending on priorities, that would give "38GB/s to the GPU".

That's obviously vastly oversimplified (there's latency etc.), and I think the memory controllers in APUs are specifically altered to try to mitigate contention problems.

The XB1 appears designed specifically with the above problem in mind, and it seems the magic works. (AFAIR there's also 3GB/s coming from Kinect 2 and something like 4GB/s from the HDMI input? Those need to be moved around within the system.)

So, no, AFAIK both consoles work around the problem.
 
What we'll surely see over time is a model where developers make CPU code 'L2 cache friendly'.
On both consoles you have 8 CPU cores competing for memory, likely as 2x4 Jaguar blocks.
So devs will optimise and split their code/thread affinity to keep the 2MB L2 caches from thrashing (i.e. 4 renderer threads on cores 0-3 and other stuff on the other partition, for better L2 usage).
As a side effect, it should benefit the odd Bulldozer L2 cache management.
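For what it's worth, that partitioning looks something like this as a generic Linux/glibc affinity sketch (the consoles' own APIs differ, and the core split is just the hypothetical one from the post above):

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

// Linux/glibc only -- the console SDKs expose their own affinity calls.
// Pin the calling thread to a single core.
static void pinToCore(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    std::vector<std::thread> pool;
    for (int core = 0; core < 8; ++core) {
        pool.emplace_back([core] {
            pinToCore(core);
            // Hypothetical split: renderer-ish threads share cluster 0 (cores 0-3,
            // one 2MB L2) and everything else shares cluster 1 (cores 4-7), so the
            // two working sets stop evicting each other's cache lines.
            std::printf("%s thread pinned to core %d\n", core < 4 ? "render" : "game", core);
        });
    }
    for (auto& t : pool) t.join();
}
```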
 
- the GDDR5 bus in the PS4 has 176GB/s with 20GB/s of that contended. Depending on priorities, that would give "156GB/s to the GPU".
- the DDR3 bus in the XB1 has 68GB/s with 30GB/s of that contended. Depending on priorities, that would give "38GB/s to the GPU".

That's obviously vastly oversimplified (there's latency etc.), and I think the memory controllers in APUs are specifically altered to try to mitigate contention problems.

The XB1 appears designed specifically with the above problem in mind, and it seems the magic works. (AFAIR there's also 3GB/s coming from Kinect 2 and something like 4GB/s from the HDMI input? Those need to be moved around within the system.)

So, no, AFAIK both consoles work around the problem.

Bear in mind the CPU is far more sensitive to latency than the GPU, so any contention is going to have a bigger impact. It's the effect on the CPU that I'm interested in.
 
What we'll surely see over time is a model where developers make CPU code 'L2 cache friendly'.
On both consoles you have 8 CPU cores competing for memory, likely as 2x4 Jaguar blocks.
So devs will optimise and split their code/thread affinity to keep the 2MB L2 caches from thrashing (i.e. 4 renderer threads on cores 0-3 and other stuff on the other partition, for better L2 usage).
As a side effect, it should benefit the odd Bulldozer L2 cache management.

I tell you what, keeping your dataset in L2 because accessing system RAM is too latency-risky sounds like a right PITA. Either the code gets more complex or the dataset gets smaller. I'd rather not bother, given the choice!
 