Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

But each time you use the slower memory you diminish the total amount of bandwidth. As an example: if you spend half a second on the slow memory (336 GB/s) and half a second on the fast memory (560 GB/s), you end up with an average of 0.5 × 336 + 0.5 × 560 = 448 GB/s of bandwidth.

I understand that OS functionality which is often accessed can be put in the fast memory to get better bandwidth, and functionality that is rarely used in the slower memory.

EDIT: I was like you, asking why use fast memory for the OS, this is stupid, but in the end the answer is logical.

There is no reason to store the OS in fast RAM. MS explicitly states that the only component that sees 560 GB/s is the GPU. And even if audio or file I/O data ends up in local RAM, it only has access to 336 GB/s at most.

One thing has always been true about AMD APUs: if you are not the GPU, you don't get full access to the bandwidth offered by GDDR. You have unified memory but three separate memory pools (CPU cacheable, uncacheable, and local), each with their own max bandwidth because the granularity is different, so only GPU-related data is interleaved in a way that fully exploits GDDR.
 
As I have stated previously in this thread, the reason they had trouble reaching 2.0/3.0 GHz was that they used their old way of selecting clock speeds.
If they wanted a fixed power supply and cooling solution, they would need to clock the GPU+CPU much lower than what they could otherwise mostly run at, just in case.

Mark Cerny gave examples of what types of workloads were power-hungry for the GPU (the HZD map screen with a low triangle count) and CPU (AVX workloads). They wanted a solution that would run at the highest possible frequency during normal use, and then outlier/extra power-hungry instructions would decrease the frequency to keep within the power budget.
This way they don't have to increase the fan and PSU size because of an outlier scenario.
AMD SmartShift was only mentioned as being used in a specific scenario where the CPU was not using up its allotted power budget, and because of that it would be able to transfer the power to the GPU.

You can keep saying it but that doesn't help it make sense. The "old way" of selecting clock speeds was simply estimating for the worst-case game, ie. God of War, in Cerny's example. So, all that is being said is that when the next God of War type power usage game shows up, it would struggle to run the CPU and GPU at 3GHz and 2GHz respectively based on his same comments. But rather than holding all games to that lower bound, they are allowing the clocks to ramp up in lower power usage scenarios (ie. all the other games that don't push the system as hard). The lower the activity in the processor, the faster the clock. It doesn't change the fact that at the higher processor activity level, you hit the same clock speed limits as if you were using the "old" method of setting them.
 
If your data is in the 2nd half of the 2GB chips you will get 336 GB/s because you only have 6 chips to pull that data from. I don't care how data is stored on the 2GB chips; each will always be able to return 32 bits of data per clock cycle. Whether it's 32 bits to one GB, or 16 bits to both halves and the data is split, whatever the case, it's returning 32 bits through its controller every single time.

I think the missing information here that is causing some confusion is that no data of significant size (which won't fit the L2 at least) is ever going into one specific memory chip. Memory controllers distribute the data in a RAID0-like manner along all chips to guarantee maximum throughput. In SeriesX, the "fast data" gets distributed among the 10 chips and the "slow data" gets distributed into 6 chips.
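Here's a minimal sketch in C of what that RAID0-style striping could look like, assuming a simple linear interleave with a hypothetical 256-byte stride and a flat 10 GB fast region followed by the 6 GB slow region (the real controller hash and granularity aren't public):

```c
/* Minimal sketch of RAID0-style striping, assuming a simple linear
 * interleave with a hypothetical 256-byte stride. The real memory
 * controller mapping on Series X is not public. */
#include <stdio.h>
#include <stdint.h>

#define STRIDE        256u          /* assumed interleave granularity */
#define FAST_CHANNELS 10u           /* "fast" region stripes over all 10 chips */
#define SLOW_CHANNELS 6u            /* "slow" region stripes over the 6 x 2GB chips */
#define FAST_BYTES    (10ull << 30) /* 10 GB interleaved region */

/* Return which chip/channel a physical address would land on. */
static unsigned chip_for_address(uint64_t addr)
{
    if (addr < FAST_BYTES)          /* fast region: 10-way stripe over all chips */
        return (unsigned)((addr / STRIDE) % FAST_CHANNELS);
    /* slow region: stripe only over the six 2GB chips (indices 0-5 here) */
    uint64_t off = addr - FAST_BYTES;
    return (unsigned)((off / STRIDE) % SLOW_CHANNELS);
}

int main(void)
{
    /* Consecutive 256B blocks in the fast region walk across all 10 chips. */
    for (uint64_t a = 0; a < 10 * STRIDE; a += STRIDE)
        printf("fast addr %#10llx -> chip %u\n", (unsigned long long)a,
               chip_for_address(a));
    /* Blocks in the slow region only ever touch 6 chips. */
    for (uint64_t a = FAST_BYTES; a < FAST_BYTES + 6 * STRIDE; a += STRIDE)
        printf("slow addr %#10llx -> chip %u\n", (unsigned long long)a,
               chip_for_address(a));
    return 0;
}
```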




Looks like the PS4 is rated to work at up to 35°C. I would assume the PS5 is the same: https://www.playstation.com/en-za/c...ps4/sg/b-chassis/1tb/CUH-1116B_SG_EN_web.pdf/

Those temps can happen, for example, in LA. Some poor young soul who lives in a cheap place with no AC or is too poor to run AC.
Don't forget the average temp is probably higher than room ambient as it's in a poorly ventilated AV cabinet.
Those 35° probably refer to room temperature, meaning there's probably some headroom for those in hot rooms who still put their consoles inside cabinets and such.
Still, there are probably very few people living in 35° rooms. I couldn't handle over 30° room temperature for more than 10 minutes, it's way too hot!



There are many parts of the world that are hot with little to no air conditioning.
Cheap (and super air-polluting) AC is widespread in hot first-world countries even.
Who has money for a PS4 but no money for a cheap AC unit?
 


Those 35° probably refer to room temperature, meaning there's probably some headroom for those in hot rooms who still put their consoles inside cabinets and such.
Still, there are probably very few people living in 35° rooms. I couldn't handle over 30° room temperature for more than 10 minutes, it's way too hot!




Cheap (and super air-polluting) AC is widespread in hot first-world countries even.
Who has money for a PS4 but no money for a cheap AC unit?

Those cheap places that get hot have very poor insulation. The heat comes straight in and the cool air goes straight out. It's potentially hundreds of dollars a month for electricity to run AC. It's kind of unbelievable what poor construction quality can be like. Places like LA tend to get cooler at night. Some folks just don't run AC and use fans to get cooler air in for nighttime (easier to sleep).
 
I think the missing information here that is causing some confusion is that no data of significant size (which won't fit the L2 at least) is ever going into one specific memory chip. Memory controllers distribute the data in a RAID0-like manner along all chips to guarantee maximum throughput. In SeriesX, the "fast data" gets distributed among the 10 chips and the "slow data" gets distributed into 6 chips.
Sure I totally respect that.
So if I told the GPU that its memory address space is from 0000-16GB
And I told the CPU that its memory address space is from 10GB-16GB.

What would happen in this case? Because this is what it sounds like to me.
 
I think the missing information here that is causing some confusion is that no data of significant size (which won't fit the L2 at least) is ever going into one specific memory chip. Memory controllers distribute the data in a RAID0-like manner along all chips to guarantee maximum throughput. In SeriesX, the "fast data" gets distributed among the 10 chips and the "slow data" gets distributed into 6 chips.

...

So as far as memory addresses go, are you saying the first 2 GB chip wouldn't have sequential addresses 0x00000000 - 0x7FFFFFFF? Those addresses would be spread across multiple chips?
 
The PS5 GPU and CPU will have variable clocks depending on total APU power consumption.
The XBX main RAM will have variable bandwidth depending on CPU load and accesses to the main RAM.

The irony!
 
So as far as memory addresses go, are you saying the first 2 GB chip wouldn't have sequential addresses 0x00000000 - 0x7FFFFFFF? Those addresses would be spread across multiple chips?
At least for GPU-targeted memory, it should be interleaved. I recall one granularity given some time ago was changing channels after 128B.
CPU memory can have varying functions for assigning addresses. For example, if there is NUMA involvement there can be per-node or more general interleaving.
Memory systems sneak in multiple levels of indirection, even for physical addresses. Caches can have hash functions that can be adjusted to give different slices responsibility for a given physical address range, but the memory controllers themselves are given ranges and interleaving patterns either in hardware or when they are initialized.
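As a toy illustration of that kind of indirection (not the real hardware hash, which isn't documented), here's how a cache slice or channel could be picked by XOR-folding physical address bits above the line offset:

```c
/* Toy slice/channel selection by hashing physical address bits.
 * The slice count, line size, and hash are all made up for illustration. */
#include <stdio.h>
#include <stdint.h>

#define LINE_BITS   7u   /* 128B line -> low 7 bits ignored by the hash */
#define NUM_SLICES  8u   /* hypothetical slice/channel count */

static unsigned slice_for_address(uint64_t paddr)
{
    uint64_t line = paddr >> LINE_BITS;   /* drop the within-line offset */
    /* XOR-fold several address bit groups so strided access patterns
     * don't all hammer the same slice. */
    unsigned h = (unsigned)(line ^ (line >> 3) ^ (line >> 7));
    return h % NUM_SLICES;
}

int main(void)
{
    for (uint64_t a = 0; a < 16 * 128; a += 128)
        printf("paddr %#8llx -> slice %u\n", (unsigned long long)a,
               slice_for_address(a));
    return 0;
}
```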
 
At least for GPU-targeted memory, it should be interleaved. I recall one granularity given some time ago was changing channels after 128B.
CPU memory can have varying functions for assigning addresses. For example, if there is NUMA involvement there can be per-node or more general interleaving.
Memory systems sneak in multiple levels of indirection, even for physical addresses. Caches can have hash functions that can be adjusted to give different slices responsibility for a given physical address range, but the memory controllers themselves are given ranges and interleaving patterns either in hardware or when they are initialized.

Interesting. So in the case of Series X, if they're saying there is higher bandwidth to 10 GB, is that 10 GB spread across all of the chips, or would virtual addressing basically map it across the chips with the larger bus?
 
Interesting. So in the case of Series X, if they're saying there is higher bandwidth to 10 GB, is that 10 GB spread across all of the chips, or would virtual addressing basically map it across the chips with the larger bus?

The 10GB comprises 1 GB of the 2GB total on each of the 6 higher-capacity chips plus the entire 1GB of each of the 4 lower-capacity chips. 32-bit access * 10 total chips = 320-bit. The 6GB of slow memory comprises the remaining 1GB on each of the 6 higher-capacity chips. 6 * 32 = 192-bit.

Data would be spread across all of the chips in each pool to maximize bandwidth.
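Working through that arithmetic, assuming the per-chip 56 GB/s figure (which corresponds to 14 Gbps GDDR6 on a 32-bit interface):

```c
/* Pool capacities, bus widths, and bandwidths for the 6x2GB + 4x1GB layout,
 * assuming 14 Gbps GDDR6 with a 32-bit interface per chip (56 GB/s each). */
#include <stdio.h>

int main(void)
{
    const double gbps_per_pin  = 14.0;
    const int    bits_per_chip = 32;
    const double gb_s_per_chip = gbps_per_pin * bits_per_chip / 8.0;      /* 56 GB/s */

    const int big_chips = 6, small_chips = 4;

    /* Fast pool: 1 GB from each of the 10 chips, striped over a 320-bit bus. */
    int    fast_gb  = big_chips * 1 + small_chips * 1;                    /* 10 GB   */
    int    fast_bus = (big_chips + small_chips) * bits_per_chip;          /* 320-bit */
    double fast_bw  = (big_chips + small_chips) * gb_s_per_chip;          /* 560 GB/s */

    /* Slow pool: the remaining 1 GB on each 2GB chip, striped over 192 bits. */
    int    slow_gb  = big_chips * 1;                                      /* 6 GB    */
    int    slow_bus = big_chips * bits_per_chip;                          /* 192-bit */
    double slow_bw  = big_chips * gb_s_per_chip;                          /* 336 GB/s */

    printf("per chip:  %.0f GB/s\n", gb_s_per_chip);
    printf("fast pool: %d GB, %d-bit, %.0f GB/s\n", fast_gb, fast_bus, fast_bw);
    printf("slow pool: %d GB, %d-bit, %.0f GB/s\n", slow_gb, slow_bus, slow_bw);
    return 0;
}
```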
 
Sure I totally respect that.
So if I told the GPU that its memory address space is from 0000-16GB
And I told the CPU that its memory address space is from 10GB-16GB.

What would happen in this case? Because this is what it sounds like to me.
If we use an extreme case, say the CPU takes 50GB/s, it imbalances the controllers' distribution and costs an equivalent 83GB/s from the ideal maximum of 560GB/s, so the max drops to 527GB/s on average because of the stalls.

The reason is that some chips have fewer operations queued (some serve the GPU only, others serve both CPU and GPU). They serve the same proportion of requests as all the others because they serve the same percentage of the GPU memory space, but they don't have the burden of serving the CPU too, so they WILL stall in proportion to the additional CPU requests the others have to serve.

More reasonably, if it's 25GB/s, which I think is more of a normal 8-core bandwidth, it's 543GB/s average. If MS were just using the whole space randomly without any such partitioning, they'd end up with the average between 560 and 336. It becomes obvious that partitioning the upper part is the best solution and gets close enough to the ideal to be worth the trouble.
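Here's that averaging argument worked out, under the simplifying assumption that the four GPU-only channels stall in proportion to the CPU traffic the six shared channels absorb; it reproduces the ~527 and ~543 GB/s figures above:

```c
/* Sketch of the averaging argument: assume the 4 channels that only serve
 * the fast pool stall in proportion to the extra CPU traffic the 6 shared
 * channels must absorb. */
#include <stdio.h>

static double effective_total(double cpu_gb_s)
{
    const double fast_bw = 560.0;  /* 10-channel stripe */
    const double slow_bw = 336.0;  /* 6-channel stripe, where the CPU data lives */

    double busy_fraction = cpu_gb_s / slow_bw;       /* time the 6 channels spend on CPU */
    double gpu_bw = fast_bw * (1.0 - busy_fraction); /* GPU loses that fraction everywhere */
    return gpu_bw + cpu_gb_s;                        /* system total: GPU + CPU traffic */
}

int main(void)
{
    printf("CPU at 50 GB/s -> total %.0f GB/s\n", effective_total(50.0)); /* ~527 */
    printf("CPU at 25 GB/s -> total %.0f GB/s\n", effective_total(25.0)); /* ~543 */
    return 0;
}
```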

Interestingly, this problem applies equally to the idea of putting mismatched NAND capacities on the PS5's 12 controllers, but they could have done exactly the same thing as MS and put the OS partition on the upper half of the higher-capacity chips; by not accessing it much during gameplay, it would have mitigated the possible drop in performance.
 
This is the first I've ever heard of a 192 bit bus. You would draw lanes to both chips but certainly not 16 bits to each chip. Why would you do that?

Nothing here indicates that the speeds of the memory change at all.

I thought this was straightforward:
Same pool of memory, different chip sizes. Slow and fast is really more a matter of whether you pull from 10 chips or 6.

There are 10 chips in total on a 320-bit bus, and each chip provides 56 GB/s of bandwidth over its 32-bit interface.
56 * 10 = 560 GB/s
Bandwidth is the total size of the pipe, and in this case it will be about the total amount of pull you can grab at once.
Of the 10 chips, 6 of them are 2 GB in size.
56 * 6 = 336 GB/s

If your data is in the 2nd half of the 2GB chips you will get 336 GB/s because you only have 6 chips to pull that data from. I don't care how data is stored on the 2GB chips; each will always be able to return 32 bits of data per clock cycle. Whether it's 32 bits to one GB, or 16 bits to both halves and the data is split, whatever the case, it's returning 32 bits through its controller every single time.

But you still have 4 bus openings on the remaining 4 chips available; just because it's accessing the back half of those 2GB chips doesn't mean the other remaining lanes are closed off.

So you can still pull 56 *4 on the remaining 1 GB chips.
which is 224 GB/s

so adding these together, it is back to 560 GB/s.

There is no averaging of memory
There is no split pool.

Your only downside is if you put _all_ of your data on the 6x2GB chips, you're limited to a bandwidth of 336 GB/s, because you'll grab the data on one half and then, if you need data on the other half, you'll need to alternate. But that can be handled by priority, and it doesn't stop the developers from always fully utilizing all lanes to achieve the 560 GB/s.

Regardless of whether you are alternating or not, those 6 chips will constantly be giving out 336GB/s.
And regardless of whether you are alternating or not on the 6x2 GB chips, you still have the 4x1 GB chips waiting to go, giving you a total of 560 GB/s of bandwidth whenever all 10 chips are being utilized.


This should not be treated like 2 separate RAM pools like they did with the 360 or XBO.

Because of the imbalance in CPU to GPU bandwidth needs, perhaps you'll just prioritize the GPU.
While I'm not sure how priority works (edit: they'll prioritize whatever is going to give the most performance), GPU data goes into the 1GB chips first; this is an easy one. The remaining GPU data goes into the lower 1GB of the 2GB chips. The CPU data would sit on top of that, along with any sort of GPGPU work that may need to be done.

TLDR: I don't really see an issue here unless you've always got contention on those 6x2 GB chips, and you'd probably still have that anyway. No one would be making this argument if it was 10x2GB chips. You'd still have the same issue if you're trying to access the memory controller to pull data that's all on the same chips. It would still be 560GB/s, you'd just have 20 GB of memory. Would you use your current argument to say that 10x2GB chips are bottlenecked because the data is split over 2GB chips and now it needs to alternate, or needs some sort of custom controller to do this in which it has to trade off bandwidth?

As for the part where you quote me:

There are six 2 GB modules, each connected via a 32-bit lane. 32*6 = 192 bits (hence the 336 GB/s).

These will occupy 4 of the 64-bit controllers, and will use six of the eight available 32-bit lanes.

The remaining two lanes from these controllers are connected to two of the 1 GB modules (making 256 bits total). The remaining two 1 GB modules will use the fifth controller.

Now for the rest.

Yes... you will get 560 GB/s at all times, 224 + 336 as an example. But this is just mathematics. The fact is you can have a combined total of 560 GB/s, but access bandwidth will fluctuate on each of the memory parts depending on access.

This could fluctuate between 560 GB/s + 0 GB/s, 392 GB/s + 168 GB/s, or 224 GB/s + 336 GB/s, and a lot of combinations in between.

So yes... a total of 560 GB/s, but lots of bandwidth fluctuation on the parts. This puts a lot of constraints on the usage you can give to the 6 GB.

And this is why I was talking about access in alternating cycles. That way you would lock the bandwidths at 392 + 168, and know what to expect from each part.
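A quick sketch of that alternating-cycle idea, assuming strict alternation between the pools, which is where the 392 + 168 split comes from:

```c
/* Alternating-cycle illustration: on even cycles all 10 chips serve the
 * fast pool; on odd cycles the 6 larger chips serve the slow pool while
 * the 4 x 1GB chips keep serving the fast pool.  Averaged over time this
 * locks the split at 392 + 168 GB/s. */
#include <stdio.h>

int main(void)
{
    const double per_chip = 56.0;   /* GB/s per 32-bit GDDR6 chip */
    const int cycles = 2;           /* one even + one odd cycle is enough to average */
    double fast_sum = 0.0, slow_sum = 0.0;

    for (int c = 0; c < cycles; ++c) {
        if (c % 2 == 0) {
            fast_sum += 10 * per_chip;   /* even: all 10 chips on the fast pool */
        } else {
            fast_sum += 4 * per_chip;    /* odd: only the 4 x 1GB chips on fast */
            slow_sum += 6 * per_chip;    /*      the 6 x 2GB chips on slow      */
        }
    }
    printf("fast pool average: %.0f GB/s\n", fast_sum / cycles);              /* 392 */
    printf("slow pool average: %.0f GB/s\n", slow_sum / cycles);              /* 168 */
    printf("combined:          %.0f GB/s\n", (fast_sum + slow_sum) / cycles); /* 560 */
    return 0;
}
```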
 
You can keep saying it but that doesn't help it make sense. The "old way" of selecting clock speeds was simply estimating for the worst-case game, ie. God of War, in Cerny's example. So, all that is being said is that when the next God of War type power usage game shows up, it would struggle to run the CPU and GPU at 3GHz and 2GHz respectively based on his same comments. But rather than holding all games to that lower bound, they are allowing the clocks to ramp up in lower power usage scenarios (ie. all the other games that don't push the system as hard). The lower the activity in the processor, the faster the clock. It doesn't change the fact that at the higher processor activity level, you hit the same clock speed limits as if you were using the "old" method of setting them.

You are ignoring why they were struggling to keep 2GHz and 3GHz. Remember they stress test their systems in extreme environments with extreme/unrealistic workloads.
Those extra power-hungry workloads? Yeah, those are outlier workloads that disproportionately bog down the system.
In order to manage those outlier workloads, they test the system using them at peak theoretical capacity, and that was the scenario they were having trouble keeping the clocks above 2GHz and 3GHz.
And this was a problem because those workloads are seemingly not used extensively anyway. Likely no game is going to use those instructions at peak theoretical numbers, so why cater so strongly to that scenario?
IF the devs choose to use those workloads in the usual small numbers, it will not bog down performance all that much, just a few frequency points.
It is much preferable to do this than to lock the clocks at something way lower that ultimately only serves the purpose of dragging down performance of typical workloads.
 
Interestingly, this problem applies equally to the idea of putting mismatched NAND capacities on the PS5's 12 controllers, but they could have done exactly the same thing as MS and put the OS partition on the upper half of the higher-capacity chips; by not accessing it much during gameplay, it would have mitigated the possible drop in performance.

But would lead to unbalanced wear on the flash memory.
 
But would have unbalanced the wear on the flash.
Wear is proportional to capacity. Twice the flash size can do twice the write volume with wear levelling, so it would be no different than having 4 or 6 additional chips for the OS.
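A quick sanity check of that claim, with made-up die sizes and write volume: if writes are striped in proportion to each die's capacity, every die ends up with the same number of P/E cycles consumed, so mixed capacities don't wear unevenly.

```c
/* Wear-levelling check with hypothetical mixed die capacities: striping
 * writes in proportion to capacity gives every die the same P/E cycle count. */
#include <stdio.h>

int main(void)
{
    const double total_writes_gb = 100000.0;                 /* hypothetical lifetime host writes */
    const double caps_gb[] = { 64, 64, 64, 64, 128, 128 };   /* made-up die sizes */
    const int n = sizeof(caps_gb) / sizeof(caps_gb[0]);

    double total_cap = 0;
    for (int i = 0; i < n; ++i) total_cap += caps_gb[i];

    for (int i = 0; i < n; ++i) {
        double writes_to_die = total_writes_gb * (caps_gb[i] / total_cap);
        double pe_cycles     = writes_to_die / caps_gb[i];   /* full-die rewrites */
        printf("die %d (%3.0f GB): %8.1f GB written, %.1f P/E cycles\n",
               i, caps_gb[i], writes_to_die, pe_cycles);
    }
    return 0;
}
```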
 
The Navi operations increase throughput over a conventional FMA by acting like packed math and then allowing the results to be combined into an accumulator. Looking at the patent or how the tensor operations work for Nvidia, it looks like it would be a fraction of what the matrix ops would do. The lane format without a matrix unit would allow those dot operations to generate only the results along a diagonal of that big matrix.
The AMD scheme is more consistent with Vega, as the vector unit is 16-wide, and the new hardware may align with code referencing new instructions and registers for Arcturus. One other indication this is different is that the Navi dot instruction would take up a normal vector instruction slot since it happens in the same block. Arcturus and this matrix method would allow at least some normal vector traffic in parallel.


The scenario where the system is spending half a second in the slow pool requires something in the OS, an app, or a game resource put in the slow section needing 168 GB/s of bandwidth.
There is some impact because of the imbalance, but it scales by what percentage of the memory access mixture goes to the slow portion. If a game did that, it would likely be considered a mis-allocation. A background app would likely be inactive or prevented from having anything like that access rate, and the OS gets by with a minority share of the tens of GB/s in normal PCs without issue.
I can see the OS sporadically interacting with shared buffers for signalling purposes or copying data from secured memory to a place where the game can use it, but that's on the order of things like networking or the less-than-10 GB/s disk IO.


If the GDDR6 chips were all the same capacity, there would still be a "pool" for the OS and apps, since accesses for them wouldn't be going to the game. The individual controllers would see some percentage of accesses going to them that the game wouldn't be able to use. Let's say 1% goes to the OS, or 5.6GB/s. The game experiences a bandwidth bottleneck if it needs something like 555 GB/s in that given second. If there's a set of code, sound data, or rarely accessed textures that don't get used in the current game scene unless the user hits a specific button or action, finally hitting that action while the game is going on blocks the other functions' accesses for some number of cycles.
With the non-symmetric layout, the OS or slow pool pushes some of that percentage onto the channels associated with the 2GB chips.
Going by the 1% scenario, the six controllers would need to find room for the 40% of the OS traffic that cannot be striped across the smaller chips, or 40% of 5.6 GB/s. The 336 GB/s pool would be burdened with an extra 2.24GB/s.
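Working through that 1% example with the same assumptions: the OS traffic that a symmetric layout would stripe across all 10 chips instead lives entirely on the 6 larger chips, so they absorb the share the four 1GB chips would otherwise have taken.

```c
/* The 1% scenario: 40% of the OS traffic (the share the 4 smaller chips
 * can't take) lands on the channels of the 6 larger chips instead. */
#include <stdio.h>

int main(void)
{
    const double total_bw   = 560.0;                 /* GB/s, full 10-chip stripe */
    const double os_share   = 0.01;                  /* 1% of accesses go to the OS */
    const double os_traffic = total_bw * os_share;   /* 5.6 GB/s */

    const int chips_total = 10, chips_small = 4;
    double displaced_fraction = (double)chips_small / chips_total;  /* 40% */
    double extra_on_slow = os_traffic * displaced_fraction;         /* 2.24 GB/s */

    printf("OS traffic:                  %.2f GB/s\n", os_traffic);
    printf("extra load on 336 GB/s pool: %.2f GB/s\n", extra_on_slow);
    return 0;
}
```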

Unless something in the slow pool is demanding a significant fraction of the bandwidth, and I don't know what functionality other than the game's renderer that needs bandwidth on that order of magnitude, I can see why Microsoft saw it as a worthwhile trade-off.
If a game put one of its most heavily used buffers in the slow pool, I think the assumption is that the developers would find a way to move it out of there or scale it back so it'd fit elsewhere.

Why didn't Sony see this coming? Would it cost that many more resources to divide the memory?

There are many parts of the world that are hot with little to no air conditioning.

Like the part of the world where many have their devices under the TV in a cabinet with other devices warming the whole thing. I don't have it like that personally, but I know many that do. One has their Pro in a cabinet together with a Switch and a TV box; it gets very hot in there, and the Pro sounds like the jet it already is. Floor heating under the whole setup... And then, even in Sweden it can get close to 40 degrees Celsius in the summer. Last summer we had some days where it was close to 40. Summer 2018 was hot the whole summer.

Who has money for a PS4 but no money for a cheap AC unit?

AC units are not expensive, but operating them day and night is. And yes, you need to; you can't just turn it on to play. Maybe if you cool just one room. Let's hope AC units are never a requirement for any console, it has never been before so :p
 
Interesting. So in the case of Series X, if they're saying there is higher bandwidth to 10 GB, is that 10 GB spread across all of the chips, or would virtual addressing basically map it across the chips with the larger bus?
Total bandwidth is based on physical channels and their bit rate, and GDDR6 modules have 2 16-bit channels each. The 10 GB needs to be on all the chips to have the peak bandwidth amount.
Virtual addressing for x86 works on a 4KB granularity at a minimum, so there are lower-level details I'm not sure of about where the additional mapping is done. Caches past the L1 tend to be based on physical address, and they themselves might have striping functions in order to handle variable capacity, like the L3s in the ring-bus processors. It might delay the final determination of the responsible controller to when packets go out onto the fabric, which should have various routing policies or routing tables to destinations.
 