Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

Why didn't Sony see this coming? Would it cost that much more in resources to divide the memory?
Could you clarify what Sony did not see coming?
The PS5's memory chips are symmetric, so that part of the conversation isn't relevant.
If you are talking about the matrix unit discussion, there isn't evidence that it applies to anything but the compute-only Arcturus.
 
You are ignoring why they were struggling to keep 2GHz and 3GHz. Remember they stress test their systems in extreme environments with extreme/unrealistic workloads.
Not ignoring, and not true. Cerny said they attempt to estimate the worst-case "game", not some theoretical or unrealistic possibility, and if they underestimate that, it could result not only in an extremely loud system but in potential overheating and shutdown. You are trying to read into his comments the notion that no application realistically hits that upper power band, which is simply not true, and selectively quoting scenarios such as a map screen or the use of AVX instructions as the only reasons for incurring high power draw.

I think the variable clock solution they have come up with is ingenious and an excellent idea, as it provides the best possible performance given their acoustic and power design targets in ALL scenarios. That doesn't change the fact, however, that those acoustic and power design targets ARE going to limit overall system performance as games push greater utilization of the hardware and the system reduces clock speeds to compensate, as designed. His expectation is simply that "most" games, whatever the definition of that is, aren't going to push that hard and as such will run at or near the max clocks often.
 
Total bandwidth is based on the physical channels and their bit rate, and GDDR6 modules have 2 16-bit channels each. The 10 GB needs to be spread across all the chips to reach the peak bandwidth figure.
Virtual addressing for x86 works on a 4KB granularity at a minimum, so there are lower-level details I'm not sure of about where the additional mapping is done. Caches past the L1 tend to be based on physical address, and they themselves might have striping functions in order to handle variable capacity, like the L3s in the ring-bus processors. It might delay the final determination of the responsible controller to when packets go out onto the fabric, which should have routing policies or routing tables for destinations.
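For what it's worth, the headline figures fall straight out of that channel math. A quick Python sketch, using the 14 Gbps per-pin rate implied by the quoted 560 GB/s:

# Each GDDR6 package exposes two 16-bit channels, i.e. a 32-bit interface per chip.
GBPS_PER_PIN  = 14      # per-pin data rate implied by the 560 GB/s figure
BITS_PER_CHIP = 32

def peak_bandwidth_gb_s(num_chips):
    """Peak bandwidth in GB/s for an access striped across num_chips packages."""
    return num_chips * BITS_PER_CHIP * GBPS_PER_PIN / 8

print(peak_bandwidth_gb_s(10))  # XSX fast region, all ten chips      -> 560.0
print(peak_bandwidth_gb_s(6))   # XSX slow region, only the 2GB chips -> 336.0
print(peak_bandwidth_gb_s(8))   # PS5, eight symmetric 2GB chips      -> 448.0

Drop any chip out of the stripe and the peak falls with it, which is why the 10 GB "fast" region has to span all ten packages.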

I think I just hadn't remembered correctly about how the memory bandwidth had been specified. It's all making sense now.
 
I think the misunderstanding here is that when Cerny talks about "the previous technique" he doesn't mean the fixed clock itself; he means the part about designing for unknowns, which fixed clocks indirectly force on you. The new technique doesn't have to design for unknowns, but it requires variable clocks. In turn, he adds that it allows higher peak clocks and a better average.

If they estimate that 1800MHz can cause peaks up to 180W, even if only 1% of the time, they have to add a margin just in case some game ever reaches 200W. They end up with a 200W design cost, a normal consumption of 150W, and only 1800MHz of performance.

If they cap the power at 200W and vary the clock to keep it at 200W, they end up with 2230MHz most of the time and 1800MHz less than 1% of the time. The average becomes ridiculously advantageous for the exact same design cost.
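To put rough numbers on that trade-off, here's a toy model in Python. The power-per-MHz figures are invented purely for illustration, they are not Sony data; the only point is that the same 200W budget buys a much higher average clock when the clock is allowed to follow the workload:

# Toy model: fixed worst-case clock vs. a power-capped variable clock.
POWER_BUDGET_W = 200.0
F_MAX_MHZ = 2230.0

# (fraction of runtime, watts drawn per MHz by that workload) -- made-up numbers
workloads = [
    (0.90, 200.0 / 2230.0),   # typical game code: fits the budget at max clock
    (0.09, 200.0 / 2100.0),   # heavier scenes: clock dips slightly
    (0.01, 200.0 / 1800.0),   # pathological map-screen / AVX style spike
]

# Fixed-clock design: one clock that must hold the budget even in the worst case.
fixed_clock = min(POWER_BUDGET_W / w_per_mhz for _, w_per_mhz in workloads)

# Variable-clock design: each workload runs as fast as the budget allows, capped at F_MAX.
avg_variable = sum(frac * min(F_MAX_MHZ, POWER_BUDGET_W / w_per_mhz)
                   for frac, w_per_mhz in workloads)

print(f"fixed clock:            {fixed_clock:.0f} MHz")   # 1800
print(f"variable clock average: {avg_variable:.0f} MHz")  # ~2214

Same 200W power delivery and cooling in both columns; the only difference is whether the rare spike gets to set the clock for everything else.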
 
There's no reason to think they are doing this. When you access an address in slow RAM (or the CPU/IO is doing the access), you do it 192 bits wide. When you access an address in fast RAM (and it's the GPU), it's done 320 bits wide. There's no reason why you'd have to alternate cycles. Which pool you were accessing would be dictated entirely by client (GPU, CPU or I/O) demand.

Now... try to access data on both at the same time and keep bandwidths.
Don't forget the 192 bits on the slow RAM are shared with the 320 bits on the fast RAM. It's not 320+192!
 
Now... try to access data on both at the same time and keep bandwidths.
Don't forget the 192 bits on the slow RAM are shared with the 320 bits on the fast RAM. It's not 320+192!

So on PS5 if both the CPU and GPU try to access memory at the same time, how much bandwidth does the GPU get?

Or are you talking about the GPU accessing memory in address ranges that split across both pools?
 
You don't. You're either doing a 320-bit access or a 192-bit access depending on which client is asking for that particular segment of data and which pool it exists in.
Maybe you can, if the system found a way to place some latency-critical GPU data in the memory that isn't connected to the same 6-channel bus the CPU can access, though that leaves you with a 4-channel, 128-bit bus.

I really don't know if modern memory controller units allow that level of granularity though, or if it makes practical sense.
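If something really did fence allocations off that way (purely hypothetical, nothing published says the memory system exposes such a static partition), the arithmetic would look like this:

# Hypothetical static partition of the XSX's ten chips (speculation, not a documented mode).
GB_S_PER_CHIP = 32 * 14 / 8   # 56 GB/s per 32-bit package at 14 Gbps

gpu_only_chips   = 4          # 1GB packages the CPU-visible 6GB never touches
cpu_shared_chips = 6          # 2GB packages whose upper half holds the "slow" 6GB

print(gpu_only_chips * GB_S_PER_CHIP)    # 224.0 GB/s, contention-free, for latency-critical GPU data
print(cpu_shared_chips * GB_S_PER_CHIP)  # 336.0 GB/s left on the chips the CPU also touches

So you'd trade away more than half of the GPU's peak on that data in exchange for never stalling behind the CPU, which is probably only worth it, if at all, for small latency-critical buffers.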
 
Sure I totally respect that.
So if I told the GPU that its memory address space is from 0000-16GB
And I told the CPU that its memory address space is from 10GB-16GB.

What would happen in this case? Because this is what it sounds like to me.

The real problem here is not the bandwidth decrease caused by the CPU or GPU... CPU and GPU usage will decrease bandwidth in any system.
The problem here is that to keep 560 GB/s steady you need to read from both pools, and in the correct proportions, because the more usage you give to the second pool, the more bandwidth on the first decreases.
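A simplified model of those proportions, assuming "fast" traffic is striped across all ten chips, "slow" traffic only touches the six 2GB chips, and each chip can serve transactions independently (all assumptions on my part):

# Simplified model: the six 2GB chips serve both address regions,
# the four 1GB chips only ever serve the fast 10GB region.
SHARED_CHIP_BW = 6 * 56.0   # 336 GB/s across the six 2GB chips
GPU_ONLY_BW    = 4 * 56.0   # 224 GB/s across the four 1GB chips

def region_bandwidth(slow_share):
    """slow_share = fraction of the shared chips' time spent serving the slow 6GB."""
    slow_bw = slow_share * SHARED_CHIP_BW
    fast_bw = GPU_ONLY_BW + (1.0 - slow_share) * SHARED_CHIP_BW
    return fast_bw, slow_bw

for share in (0.0, 0.5, 1.0):
    fast, slow = region_bandwidth(share)
    print(f"slow share {share:.0%}: fast {fast:.0f} + slow {slow:.0f} = {fast + slow:.0f} GB/s")
# slow share 0%:   fast 560 + slow 0   = 560 GB/s
# slow share 50%:  fast 392 + slow 168 = 560 GB/s
# slow share 100%: fast 224 + slow 336 = 560 GB/s

The total only holds at 560 GB/s if the four GPU-only chips never sit idle, which is exactly the scheduling burden being described.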
 
If they cap the power at 200W and vary the clock to keep it at 200W, they end up with 2230MHz most of the time and 1800MHz less than 1% of the time. The average becomes ridiculously advantageous for the exact same design cost.
Whatever the PS5 cooling situation is, the power supply only has to hit the power limit (or a tad better) by design of the console. Does that at least save much money on the power supply side?
 
The real problem here is not the bandwidth decrease caused by the CPU or GPU... CPU and GPU usage will decrease bandwidth in any system.
The problem here is that to keep 560 GB/s steady you need to read from both pools, and in the correct proportions, because the more usage you give to the second pool, the more bandwidth on the first decreases.

To get 560GB/s you would have to read exclusively from the fast pool at full load for an entire second. Where you are getting tripped up is that you are taking a metric of capacity to do work over a period of time and trying to apply it to moment-to-moment, cycle-to-cycle usage. If you were to precisely track the amount of data actually transferred over any given second of running a game, I'll bet it would be some lesser number than the theoretical max bandwidth. So whether that theoretical max goes up or down with any particular usage pattern is irrelevant. What's relevant is whether this particular setup delivers sufficient bandwidth to meet the needs of the system as it needs it.
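To put that in concrete terms (illustrative figures only, not measurements from either console):

# Theoretical peak vs. what a frame actually moves.
PEAK_GB_S  = 560.0
FRAME_RATE = 60.0

budget_per_frame_gb = PEAK_GB_S / FRAME_RATE   # ~9.3 GB of traffic available per 16.7 ms frame
actual_traffic_gb   = 5.0                      # hypothetical: what a heavy frame might really touch

utilization = actual_traffic_gb / budget_per_frame_gb
print(f"{budget_per_frame_gb:.1f} GB possible per frame, "
      f"{actual_traffic_gb} GB used -> {utilization:.0%} of peak")   # ~54% of peak

Whether the ceiling is 560 or something lower for a given access pattern matters far less than whether the traffic a frame actually generates fits comfortably underneath it.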
 
The real problem here is not the bandwidth decrease caused by the CPU or GPU... CPU and GPU usage will decrease bandwidth in any system.
The problem here is that to keep 560 GB/s steady you need to read from both pools, and in the correct proportions, because the more usage you give to the second pool, the more bandwidth on the first decreases.
There aren't 2 pools. There are some chips with 2GB and some with 1GB, but to get the speed you are reading from all chips. It makes no sense to think that you lose bandwidth from the fast memory when reading the slow memory. If you need to read the higher-addressed bits, it will be slow during that operation. If you are reading the lower bits, it will be fast. The GPU will be able to access all 10GB at full speed if the CPU isn't holding up the memory controller/bus, and if it needs the slower memory the average will obviously be slower, but that is where a game dev can potentially design their memory layout so this isn't a problem for the accesses that actually need bandwidth.
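Something like the following is what a single address space over mixed-density chips implies. The stripe size and chip ordering here are pure guesses on my part; only the 10GB/6GB split itself is public:

# Guess at how one address space maps onto the mixed-density chips.
GB = 1 << 30
STRIPE = 4096   # assumed interleave granularity (hypothetical)

def region_of(addr):
    if addr < 10 * GB:
        # Lower 10GB: striped across all ten chips -> 320-bit wide, 560 GB/s peak.
        return "fast", 320, addr // STRIPE % 10
    # Upper 6GB: lives in the top half of the six 2GB chips -> 192-bit wide, 336 GB/s peak.
    return "slow", 192, (addr - 10 * GB) // STRIPE % 6

print(region_of(512 * 1024 * 1024))  # ('fast', 320, 2)
print(region_of(12 * GB))            # ('slow', 192, 2)

Either way, the speed of an access is decided by where the data sits in the address map, which is why memory layout is the lever devs actually have.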
 
So on PS5 if both the CPU and GPU try to access memory at the same time, how much bandwidth does the GPU get?

Or are you talking about the GPU accessing memory in address ranges that split across both pools?

On both systems the GPU gets what remains after the CPU usage.
The question is that on Series X, for the GPU to make full use of the bandwidth that remains, it has to be constantly reading from both pools. Otherwise bandwidth usage will be sub-optimal.
 
On both systems the GPU gets what remains after the CPU usage.
The question is that on Series X, for the GPU to make full use of the bandwidth that remains, it has to be constantly reading from both pools. Otherwise bandwidth usage will be sub-optimal.

No. GPU only has to read from the 10GB pool to get the full bandwidth that's left. This is no different than the PS5. Why does everyone think it's any different?
 
If we use an extreme case, say the CPU takes 50GB/s, it unbalances the distribution across the controllers and costs an equivalent of 83GB/s out of the ideal maximum of 560GB/s, so the max drops to 527GB/s on average because of the stalls.

This could fluctuate between 560 GB/s + 0 GB/s, 392 GB/s + 168 GB/s or 224 GB/s + 336 GB/s, and a lot of combinations in between.

The problem here is that to keep 560 GB/s steady you need to read from both pools, and in the correct proportions, because the more usage you give to the second pool, the more bandwidth on the first decreases.

I think you'd run into these issues anyway whether it was 10x1GB chips, 10x2GB chips, PS5's setup, or XSX's current setup. Both of you are describing the challenge of memory contention between CPU and GPU, not a contention problem between the 2GB and 1GB chips. Memory contention is sub-optimal by default. Having to share memory between the CPU and GPU has its pros and cons.
Cons: you have contention.
Pros: you waste fewer cycles copying back and forth, you can access the same memory locations, and you can efficiently have two processors work on the same memory locations without needing a copy and burdening other buses. You guys were absolutely there when hUMA was the big thing to talk about for PS4, and XBO didn't have it because of split pools of memory. It didn't seem like an issue then, so why is it now?

The advantages outweigh the cons and that's why we do it.

The only difference here is that with PS5 you run into full memory contention, and we have a graph of how bandwidth drops significantly as the CPU uses more bandwidth on PS4. Which makes sense: while the CPU is getting the data it needs, it's not going to somehow interleave the requests and fill the gaps with GPU data. It's going to get its pull, then the GPU gets what it needs, and vice versa.

On XSX it's just very obvious what is happening: you know exactly which chips will be under contention and which memory won't be.

It's not about GB/s; that doesn't even make sense. You're never going to maintain 560 GB/s every single second anyway, you're just going to pull data as you request it. It's not like you can keep pulling and pushing 560GB of data back and forth on the system for a full second. That would mean your memory is moving a full 32 bits of data from every single chip on every single clock cycle for a full second, no breaks, no imperfect pulls.
If you can't saturate a CPU or GPU to 100%, there is no way your memory would behave like that either. You'll always grab an imperfect amount of data. Developers can make things better by sizing textures to exactly the amount memory would grab from each chip, of course, but there is only so much you can do.

You're going to have inefficiency somewhere.

On PS5 when the CPU needs its data, the GPU will be locked out until the CPU gets what it needs, except for the chips it can still pull from.

On XSX when the CPU needs its data, the GPU will be locked out until the CPU gets what it needs, except for the chips that are still available to pull from. In this case, there will always be 4 chips dedicated to the GPU that the CPU can't touch.
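As an aside, the 83 GB/s and 527 GB/s figures in the quote at the top follow from one specific assumption: that while the CPU occupies the six shared chips, the GPU cannot issue its full-width 560 GB/s striped accesses at all. Checking that arithmetic (the assumption is the quote's, not something either vendor has confirmed):

# Reproducing the quoted 50 GB/s CPU example under the "full-width stall" assumption.
CPU_DEMAND  = 50.0     # GB/s pulled by the CPU from the slow region
SHARED_PEAK = 336.0    # what the six shared chips can deliver
FULL_PEAK   = 560.0    # all ten chips striped together

busy_fraction   = CPU_DEMAND / SHARED_PEAK            # ~14.9% of shared-chip time
gpu_bandwidth   = FULL_PEAK * (1.0 - busy_fraction)   # ~476.7 GB/s left for the GPU
equivalent_cost = FULL_PEAK * busy_fraction           # ~83.3 GB/s "charged" against 560
total           = gpu_bandwidth + CPU_DEMAND          # ~526.7 GB/s system-wide

print(f"{equivalent_cost:.0f} GB/s equivalent cost, {total:.0f} GB/s total")  # 83, 527

Under the looser assumption that each chip can serve transactions independently, the GPU would instead keep the four dedicated chips plus whatever the shared chips have left (224 + 286 = 510 GB/s), which is largely why the two sides of this thread keep landing on different numbers.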
 
Whatever the PS5 cooling situation is, the power supply only has to hit the power limit (or a tad better) by design of the console. Does that at least save much money on the power supply side?
It all adds up, but probably more on the cooling cost than the PSU. There are almost always standardized designs from PSU manufacturers; they adapt well-tested designs that have already passed regulations worldwide and are dirt cheap to make, so there's very little difference in cost between, say, a 300W and a 350W unit.

The problem I see with cooling is that it can cross thresholds where different materials and assembly become required: going from just aluminum, to an aluminum + copper slug, then adding heatpipes, then more heatpipes, until a large vapor chamber is required.

So maybe Sony decided to stay with two or three heatpipes instead of a vapor chamber?
 
On PS5 when the CPU needs its data, the GPU will be locked out until the CPU gets what it needs, except for the chips it can still pull from.

On XSX when the CPU needs its data, the GPU will be locked out until the CPU gets what it needs, except for the chips that are still available to pull from. In this case, there will always be 4 chips dedicated to the GPU that the CPU can't touch.
On Xbox Series X there is also going to be the usual memory contention problem on top of the aforementioned reduced bandwidth when the CPU is accessing the memory.
 