PlayStation 5 [PS5] [Release November 12, 2020]

I forgot, is it known if the Zen 2 floating point pipelines in Xbox Series X are full 256-bit, or are they, too, cut down to 128-bit?


Overall, I think for any future PS6 home hardware, Cerny and team probably already realize that everything will need to go significantly "wider" next time: a lot more GPU Compute Units (or whatever the equivalent is around 2025-2027), more bandwidth (north of 1 TB/sec), and possibly even dividing the PS5-successor silicon into 2 or 3 high-volume 3nm/2nm GAAFET/EUV chiplets (with only 1 GPU rasterizer chiplet, assuming putting in 2 would still cause huge developer headaches, plus whatever AMD will have in 2025+).

Assuming 10th gen consoles happen (~2027+), they would inherently have, one way or another, better dedicated HW for RT that's more ambitious and capable than Nvidia Turing and Ampere. Not to mention more dedicated, less "bolted on" than it seems with both PC RDNA2 and PS5 RDNA2 (and Xbox RDNA 2, which is more like full PC RDNA 2).

I mean, Microsoft will have to make similar choices, although Xbox Series X is already a somewhat "wider" design than PS5, which is narrow but clocks its GPU very high.
 
It depends on a lot of factors: cost, thermals, production deadlines, form factor. So if the pressure is there, I wouldn't be surprised if they make some compromises we wouldn't expect.

First, it's not certain what they have done. At first it looked like this wasn't halving the FPU throughput, but after a second and better shot, they have no idea. I'm following what Locuza did for the Xbox Series X and will do for the PS5. I suppose these are customized CPUs and GPUs: different choices, different tradeoffs. I like the transparency of MS more than how Sony chose to handle things, but the designs look good on both sides. We'll wait for two years of exclusive and multiplatform games and some GDC presentations, and maybe that will give us a better understanding of what Sony has done by 2022/2023.

EDIT:

The better PS5 Zen 2 shot.
 
This is definitely a downgrade

Depends how you see it; if one expected full-fat Zen 2 CPU cores, then yes. Even worse if the expectation was Zen 3 and full RDNA 2 features.

I forgot, is it known if the Zen 2 floating point pipelines in Xbox Series X are full 256-bit, or are they, too, cut down to 128-bit?

Cerny himself said so, but also that it does impact CPU performance quite a bit.

Overall, I think for any future PS6 home hardware, Cerny and team probably already realize that everything will need to go significantly "wider" next time.

Probably. This 36 CU count most likely had to do with BC. There's no other reason I can think of for going narrow.

As far as I can see, the FP reg file is cut in half and the FMA area is also reduced somewhat, so this might be indeed a 128-bit implementation of the original Zen 2 SIMD design.

Ok thanks.

maybe that will give us a better understanding of what Sony has done by 2022/2023.

Ouch.
 
As you need to do the processing for more and more objects, not using vectorized instructions starts to slow you down dramatically. If you need to check whether the player is colliding with 1 of thousands of possible collision entities for game code logic to happen, 256-bit math lets you stuff many more objects into a single calculation. So if the position value for each object is 16 bits, then you can calculate the collisions for 16 objects in one go. Provided the data sets you are working with are large, you're going to get an advantage using the AVX instructions over iterating the array normally.

This is my general understanding I've read about, but I don't know how often it's actually used.
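
To make the lane math concrete, here's a minimal sketch of the idea (my own illustration, not anything from a shipped engine), using 32-bit float positions so one 256-bit register covers 8 objects per pass rather than the 16 of the 16-bit example above; the function and array names are made up:

Code:
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical example: flag every object within 'radius' of the player.
 * Positions are stored as separate x/y float arrays (structure-of-arrays),
 * so one 256-bit load grabs 8 objects' worth of coordinates at once. */
void collide_avx(const float *xs, const float *ys, size_t n,
                 float px, float py, float radius, unsigned char *hit)
{
    __m256 vpx = _mm256_set1_ps(px);
    __m256 vpy = _mm256_set1_ps(py);
    __m256 vr2 = _mm256_set1_ps(radius * radius);

    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 dx = _mm256_sub_ps(_mm256_loadu_ps(xs + i), vpx);
        __m256 dy = _mm256_sub_ps(_mm256_loadu_ps(ys + i), vpy);
        __m256 d2 = _mm256_add_ps(_mm256_mul_ps(dx, dx),
                                  _mm256_mul_ps(dy, dy));
        /* One compare tests 8 squared distances against radius^2. */
        int mask = _mm256_movemask_ps(_mm256_cmp_ps(d2, vr2, _CMP_LE_OQ));
        for (int lane = 0; lane < 8; ++lane)
            hit[i + lane] = (unsigned char)((mask >> lane) & 1);
    }
    for (; i < n; ++i) {  /* scalar tail for the leftovers */
        float dx = xs[i] - px, dy = ys[i] - py;
        hit[i] = (dx * dx + dy * dy) <= radius * radius;
    }
}

If the FPU is narrower than 256-bit but the instructions are still exposed, code like this still runs; it just gets less throughput per clock.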

Clarifies a lot. I think I can relate some of this to a DSP and get the general idea behind the FADD, so it sounds like a pretty necessary element to keep, especially for more complex logic/physics/AI calculations (an area games could definitely benefit from more hardware support for, outside of offloading to the GPU).

Supposing the FADD has been cut, would it be possible to run that intended logic in software on the GPU, and is that a potential reason (outside of simply higher pixel fillrates and culling/rasterization rates) Sony could've gone with higher GPU clocks (to have some spare cycles for asynchronous compute such as 256-bit math calculations)?

I forgot, is it known if the Zen 2 floating point pipelines in Xbox Series X are full 256-bit, or are they, too, cut down to 128-bit?


Overall, I think for any future PS6 home hardware, Cerny and team probably already realize that everything will need to go significantly "wider" next time: a lot more GPU Compute Units (or whatever the equivalent is around 2025-2027), more bandwidth (north of 1 TB/sec), and possibly even dividing the PS5-successor silicon into 2 or 3 high-volume 3nm/2nm GAAFET/EUV chiplets (with only 1 GPU rasterizer chiplet, assuming putting in 2 would still cause huge developer headaches, plus whatever AMD will have in 2025+).

Assuming 10th gen consoles happen (~2027+), they would inherently have, one way or another, better dedicated HW for RT that's more ambitious and capable than Nvidia Turing and Ampere. Not to mention more dedicated, less "bolted on" than it seems with both PC RDNA2 and PS5 RDNA2 (and Xbox RDNA 2, which is more like full PC RDNA 2).

I mean, Microsoft will have to make similar choices, although Xbox Series X is already a somewhat "wider" design than PS5, which is narrow but clocks its GPU very high.

The only thing I'm still not sold on for a 10th-gen Sony console is the actual need to go wider. Personally I still think they will favor a narrower design and rely on a lot of hardware accelerators and maybe other changes (increasing the number of TMUs per CU, the number of shader cores per CU, bumping ROPs from 64 to 128) to reach acceptable performance while keeping the chip small, and thus keeping production costs down, because shifting to smaller and smaller nodes is bringing higher prices, not lower.

Of course some of that is my own wishing; I'm also hoping some tighter FPGA integration makes it into 10th-gen consoles from Sony & MS. But for Sony in particular, I think they'll try to stick with narrower designs and offload other tasks to hardware accelerators while making the CUs themselves bigger (PS5's are 62% larger than PS4's, for example).

Probably. This 36 CU count most likely had to do with BC. There's no other reason I can think of for going narrow.

There is one other reason, actually: cost. These nodes are getting smaller, but the cost of the real estate is going up, so a narrower design means you can save on costs. To offset the reduced silicon presence, Sony chose to increase the clocks. That's a tradeoff in and of itself, with its own pluses and minuses, as we're seeing.

And I think costs are still going to be the reason they stay with a narrower design, though "narrow" could change over the years. 40 CUs used to be considered big, then 60, now it's 80. RDNA 3 rumors are for 120 CU dual-chiplet designs on the flagship, so maybe 60 CUs could end up being the new "narrow" by the time of 10th gen, who knows.
 
Of course some of that is my own wishing; I'm also hoping some tighter FPGA integration makes it into 10th-gen consoles from Sony & MS. But for Sony in particular, I think they'll try to stick with narrower designs and offload other tasks to hardware accelerators while making the CUs themselves bigger (PS5's are 62% larger than PS4's, for example).

Just curious, what sorts of things did you envision an integrated FPGA being used for? I can't see game developers designing FPGA logic on a per-game basis for rendering acceleration.

I know people probably say this every gen, but how much higher do console TFLOPs need to go? Realistically I don't think anyone will ever be using 8K TVs, and as we go forward, reconstruction/upscaling tech like DLSS will only improve. So maybe the next gen of consoles is something like 20 TFLOPs but with much better ray tracing and the like. I think the 'add-ons' like ray tracing, better physics, and more interactable environments will be the sorts of things that define future generations, not necessarily an increase in pixel count.
 
Mark Cerny at 34:33

Mark Cerny said:
PlayStation 5 is especially challenging, because the CPU supports 256-bit native instructions that consume a lot of power. These are great here and there, but presumably only minimally used. Or are they? If we plan for major 256-bit instruction usage, we need to set the CPU clock substantially lower, or noticeably increase the size of the power supply and fan.


Then he explains they had to invent a new variable-clock system.
 
Just curious, what sorts of things did you envision an integrated FPGA being used for? I can't see game developers designing FPGA logic on a per-game basis for rendering acceleration.

I know people probably say this every gen, but how much higher do console TFLOPs need to go? Realistically I don't think anyone will ever be using 8K TVs, and as we go forward, reconstruction/upscaling tech like DLSS will only improve. So maybe the next gen of consoles is something like 20 TFLOPs but with much better ray tracing and the like. I think the 'add-ons' like ray tracing, better physics, and more interactable environments will be the sorts of things that define future generations, not necessarily an increase in pixel count.

Well, that's the thing: the game devs wouldn't be programming the FPGA components (I say "components" because I'm thinking more along the lines of the logic cells, LUT/BRAM etc. blocks, and the frontend/backend to handle the programming and targeted output, not so much a literal FPGA block just grafted onto the GPU), but rather presets of configurations that could be adjusted in a short cycle time and loaded from some type of small, fast, and preferably updatable block of storage on the GPU, maybe some type of MRAM cache.

I think with some FPGA logic integration and hardware acceleration, you can basically leverage more than by just piling on the TFLOPs. Personally I don't think 10th-gen systems are gonna go over 35-40 TF; with the type of memory that'll likely be around in decent quantities (I think they'll definitely need to go HBM-based by then, at least one of them will), you wouldn't want to push TF too much higher than that if you still want decent bandwidth-per-TF numbers.
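
For a rough sense of the bandwidth-per-TF point (my own back-of-the-envelope math, nothing official): PS5 pairs roughly 448 GB/s with 10.28 TF, which is about 43-44 GB/s per TF. Holding a similar ratio at 35-40 TF works out to roughly 1.5-1.75 TB/s, which is why HBM (or extremely wide and fast GDDR) starts to look necessary at that scale.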

10th-gen systems will need more than just power increases or even just faster storage to justify them, though, IMHO; I think we're on the same page with that. So I'm hoping VR & AR are standardized with that generation, instead of treated as peripheral bonuses in the ecosystem like they are currently (at least on PlayStation; VR/AR isn't even supported by Xbox at this time).

Mark Cerny at 34:33




Then he explains they had to invent a new variable-clock system.

I mean, this can be (and is) true, but it doesn't refute the X-ray scan, if that's why it's being brought up. I think iroboto, function, or tunafish mentioned there being "standard" hardware in the CPU that can handle FADD instructions with a 5-cycle latency, and specialized units in the CPU also able to handle FADD with a 2-cycle latency.

The point of interest seems to be that PS5's removed the specialized units that offer the lower latency, but it doesn't mean they lack native hardware support for 256-bit AVX instructions.
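
For anyone curious how those latency figures get eyeballed in the first place, here's a rough sketch of a dependent-chain microbenchmark (mine, purely illustrative); it reports TSC ticks per add, which only equals core cycles if the clock is pinned, so treat it as a way to see the idea rather than a verification of the 5-cycle or 2-cycle numbers above:

Code:
#include <immintrin.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc on GCC/Clang */

int main(void)
{
    /* Each add depends on the previous result, so iterations cannot
     * overlap: elapsed ticks / iterations approximates the add latency. */
    __m256 acc = _mm256_set1_ps(1.0f);
    const __m256 one = _mm256_set1_ps(1.0f);
    const long iters = 100000000L;

    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < iters; ++i)
        acc = _mm256_add_ps(acc, one);       /* serially dependent vaddps */
    unsigned long long t1 = __rdtsc();

    float out[8];
    _mm256_storeu_ps(out, acc);              /* keep the result live */
    printf("~%.2f TSC ticks per dependent 256-bit add (lane 0 = %f)\n",
           (double)(t1 - t0) / (double)iters, out[0]);
    return 0;
}

Running the same loop with scalar adds, or on a known-good desktop Zen 2, would give a baseline to compare a console APU against, if anyone ever gets to poke at one directly.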

EDIT: A bit of an aside, but I think it's safe to say Sony's work with the variable frequency system is the "result of fruitful collaboration" that's now being seen in the RDNA 2 GPUs, with those and Zen 3 CPUs being able to shift power budgets, plus SAM.
 
Didn't one of the AMD CPUs use two 128-bit units to execute a 256-bit instruction, back on Bulldozer?

Could this be how Sony is supporting it?
 
Am I the only one noticing the "butterfly" design of PS5's GPU (like the PS4 Pro) compared with the more monolithic style of the SX?
 
As far as I can see, the FP reg file is cut in half and the FMA area is also reduced somewhat, so this might be indeed a 128-bit implementation of the original Zen 2 SIMD design.
So what does that mean in terms of performance?
 
As far as I can see, the FP reg file is cut in half and the FMA area is also reduced somewhat, so this might be indeed a 128-bit implementation of the original Zen 2 SIMD design.
The area savings for shaving off that portion of the FPU seem minor in the grand scheme of things. Would the bright areas on either side of the CPU section be test silicon/pads, or could those areas be blank? There are some visible striations, but I don't recognize the patterns from other AMD silicon.
Would Sony have been that desperate for die area to pay for a rearchitecting of the FPU and new layout, or maybe this is something AMD had on offer, like a scrapped alternate version of the mobile core?
Another consideration is thermal density, since Microsoft cites the 256-bit FPU as being the thermal limiter of the Series X.
https://www.anandtech.com/show/16489/xbox-series-x-soc-power-thermal-and-yield-tradeoffs
"For Scarlett, it is actually the CPU that becomes the limiting factor. Using AMD’s high-performance x86 Zen 2 cores, rather than the low power Jaguar cores from the previous generation, combined with how gaming workloads have evolved in the 7 years since, means that when a gaming workload starts to ramp up, the dual 256-bit floating point units on the CPU is where the highest thermal density point happens."

Granted, the PS5 GPU probably ramps thermal density significantly more, and then there's the liquid metal TIM.

Depends how you see it; if one expected full-fat Zen 2 CPU cores, then yes. Even worse if the expectation was Zen 3 and full RDNA 2 features.
Individuals expecting Zen3 were setting themselves up for disappointment. I don't consider that a fair standard to measure the downgrade.

Probably. This 36CU most likely had to do with BC. Theres no other reason i can thinkoff going with narrow.
From the die shot, the GPU really dominates the die area already. The ratio of GPU to overall die area may need to be checked against the PS4 and PS4 Pro. This might be somewhere in the same range as the original PS4, while the GPU area for the PS4 Pro was even more lopsided.
36 would make sense as a minimum that they couldn't go below.


Mark Cerny at 34:33




Then he explains they had to invent a new variable-clock system.
The clocking method isn't particularly new, as far as AMD is concerned. The PS5 implements a less aggressive version of AMD's DVFS.
The claim that the CPU supports native 256-bit instructions leads to questions about what was done to the FPU.
The register file is split like the original 256-bit Zen 2 FPU, but the area and layout don't match very well. If the FPU were treated like two 64-bit halves, that might explain why the alleged register file section is also narrower.
The Bulldozer line did have a series of changes to the FPU, first by dropping one FP pipe, and then the Steamroller to Excavator transition included high-density libraries that saved quite a bit of area at the expense of top-line clocks. The area savings were notable for the FPU, but I don't think they were limited to just the FP portion and the register file didn't benefit that much.
The PS5's CPU cores look pretty standard outside of the FPU.



Am I the only one noticing the "butterfly" design of PS5's GPU (like the PS4 Pro) compared with the more monolithic style of the SX?
RDNA GPUs have gone with either layout, depending on unit counts and possibly considerations like making room for other silicon.
Using a two-sided arrangement like the PS4 Pro means that particular way of growing the GPU in a mid-gen refresh is ruled out.
 
The area savings for shaving off that portion of the FPU seem minor in the grand scheme of things. Would the bright areas on either side of the CPU section be test silicon/pads, or could those areas be blank? There are some visible striations, but I don't recognize the patterns from other AMD silicon.
Would Sony have been that desperate for die area to pay for a rearchitecting of the FPU and new layout, or maybe this is something AMD had on offer, like a scrapped alternate version of the mobile core?
Another consideration is thermal density, since Microsoft cites the 256-bit FPU as being the thermal limiter of the Series X.
https://www.anandtech.com/show/16489/xbox-series-x-soc-power-thermal-and-yield-tradeoffs
"For Scarlett, it is actually the CPU that becomes the limiting factor. Using AMD’s high-performance x86 Zen 2 cores, rather than the low power Jaguar cores from the previous generation, combined with how gaming workloads have evolved in the 7 years since, means that when a gaming workload starts to ramp up, the dual 256-bit floating point units on the CPU is where the highest thermal density point happens."

Granted, the PS5 GPU probably ramps thermal density significantly more, and then there's the liquid metal TIM.


Individuals expecting Zen3 were setting themselves up for disappointment. I don't consider that a fair standard to measure the downgrade.


From the die shot, the GPU really dominates the die area already. The ratio of GPU to overall die area may need to be checked against the PS4 and PS4 Pro. This might be somewhere in the same range as the original PS4, while the GPU area for the PS4 Pro was even more lopsided.
36 would make sense as a minimum that they couldn't go below.



The clocking method isn't particularly new, as far as AMD is concerned. The PS5 implements a less aggressive version of AMD's DVFS.
The claim that the CPU supports native 256-bit instructions leads to questions about what was done to the FPU.
The register file is split like the original 256-bit Zen 2 FPU, but the area and layout don't match very well. If the FPU were treated like two 64-bit halves, that might explain why the alleged register file section is also narrower.
The Bulldozer line did have a series of changes to the FPU, first by dropping one FP pipe, and then the Steamroller to Excavator transition included high-density libraries that saved quite a bit of area at the expense of top-line clocks. The area savings were notable for the FPU, but I don't think they were limited to just the FP portion and the register file didn't benefit that much.
The PS5's CPU cores look pretty standard outside of the FPU.




RDNA GPUs have gone with either layout, depending on unit counts and possibly considerations like making room for other silicon.
Using a two-sided arrangement like the PS4 Pro means that particular way of growing the GPU in a mid-gen refresh is ruled out.

My post was more in the general context of BC with PS4/PRO.
 
Did you already know of an AMD system with dynamic clocks based on a total instruction budget?

The earliest version of AMD's method for DVFS I can recall was patented around the Bulldozer introduction, and various iterations go through later CPUs and GPUs.

Compared to Bulldozer's contemporary from Intel, Sandy Bridge, AMD didn't go with turbo and frequency control using thermal sensor input as the primary signal. The space trade-offs and longer latency of a temperature sensor next to a logic block weren't acceptable to AMD.

Sandy Bridge instead implemented smaller temperature sensors that were designed to only measure within a more limited temperature range near the top end of the operating specs, which meant they could be smaller and more responsive.
AMD's initial claim was that it used activity counters in the hardware, whose results would depend on the instruction mix going through the core. The counters would be paired with a table of values for the approximate thermal impact of an event in that region of the chip at the present conditions of the silicon.
This allowed for more rapid detection of thermal spikes at lower die area cost, although it would depend on other factors like the accuracy of the silicon characterization to determine how close it was to calculating temperatures.
The characterization is necessarily conservative, but a conservative calculation that can accumulate dynamic data at the cycle level can potentially approximate better than thermal diodes that might not register change for multiple milliseconds.
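
A toy model of that counter-plus-table idea, purely illustrative: the event list, weights, and scaling below are invented, since the real characterization tables aren't public.

Code:
#include <stdio.h>

/* Invented event categories and per-event energy weights (arbitrary units),
 * standing in for the silicon characterization table described above. A
 * console could ship one shared table for every unit of the family. */
enum { EV_FMA256, EV_FMA128, EV_INT_ALU, EV_LOAD_STORE, EV_COUNT };

static const float energy_weight[EV_COUNT] = { 8.0f, 4.0f, 1.0f, 2.0f };

/* One control step: sum counter activity into an energy estimate, convert
 * it to power over the sampling interval, and shed a little frequency if
 * the estimate exceeds a fixed budget. Deterministic for a given workload. */
static float next_frequency_ghz(const unsigned long counts[EV_COUNT],
                                float interval_s, float freq_ghz,
                                float budget_w)
{
    float energy = 0.0f;                            /* arbitrary units */
    for (int e = 0; e < EV_COUNT; ++e)
        energy += (float)counts[e] * energy_weight[e];

    float est_power = energy * 1e-9f / interval_s;  /* made-up scale factor */
    return (est_power > budget_w) ? freq_ghz * 0.98f : freq_ghz;
}

int main(void)
{
    unsigned long counts[EV_COUNT] = { 4000000, 1000000, 9000000, 3000000 };
    printf("next clock: %.3f GHz\n",
           next_frequency_ghz(counts, 0.001f, 3.5f, 10.0f));
    return 0;
}

The property that seems to matter for the console case is that the inputs are deterministic counters plus a shared table, so every unit makes the same clock decision for the same workload, which lines up with the "ideal SOC" description below.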

Bugs or weak silicon characterization have dogged AMD at times. For example, Jaguar should have had turbo (one SKU had a very limited upclock when not on battery power), but its hop to TSMC for the prior gen may not have been part of the original plan. It wasn't until the Globalfoundries variants came out that the turbo AMD had announced for Jaguar was actually offered. AMD chips have a history of getting weaker clocking or less effective turbo in the first generation of a chip, with the later refresh usually having more effective clocks despite being generally the same silicon. (Jaguar, Bulldozer APUs, Ryzen 2, 7970, multiple Hawaii generations, etc.).

Later versions of the DVFS would include more dynamic monitoring of current and voltage behavior, with blocks of dummy ALUs and other units running representative operations to better approximate what the actual logic is doing.

If you take that foundation and make some adjustments for consistency over the whole family, it acts like what Sony claims. The lookup table for characterizing a chip's silicon doesn't need to be unique to a chip. It would be best for efficiency and top clocks if the values did match the chip, but as long as they aren't too aggressive the chip will function fine. As Sony indicated, there's an ideal SOC model that is given to all PS5 chips, which means their DVFS algorithms produce the same output based on the shared set of values. What's lost is the upper performance range.
For the CPU, it's a matter of dropping the boost clocks of an architecture able to go +4.5 GHz. For the GPU, it's dropping the upper turbo clocks first shown with Big Navi and going significantly narrower. There's some indication that many RDNA2 chips could go faster, but the overclocking settings max out before the chips do.
 
The earliest version of AMD's method for DVFS I can recall was patented around the Bulldozer introduction, and various iterations go through later CPUs and GPUs.

Compared to Bulldozer's contemporary from Intel, Sandy Bridge, AMD didn't go with turbo and frequency control using thermal sensor input as the primary signal. The space trade-offs and longer latency of a temperature sensor next to a logic block weren't acceptable to AMD.

Sandy Bridge instead implemented smaller temperature sensors that were designed to only measure within a more limited temperature range near the top end of the operating specs, which meant they could be smaller and more responsive.
AMD's initial claim was that it used activity counters in the hardware, whose results would depend on the instruction mix going through the core. The counters would be paired with a table of values for the approximate thermal impact of an event in that region of the chip at the present conditions of the silicon.
This allowed for more rapid detection of thermal spikes at lower die area cost, although it would depend on other factors like the accuracy of the silicon characterization to determine how close it was to calculating temperatures.
The characterization is necessarily conservative, but a conservative calculation that can accumulate dynamic data at the cycle level can potentially approximate better than thermal diodes that might not register change for multiple milliseconds.

Bugs or weak silicon characterization have dogged AMD at times. For example, Jaguar should have had turbo (one SKU had a very limited upclock when not on battery power), but its hop to TSMC for the prior gen may not have been part of the original plan. It wasn't until the Globalfoundries variants came out that the turbo AMD had announced for Jaguar was actually offered. AMD chips have a history of getting weaker clocking or less effective turbo in the first generation of a chip, with the later refresh usually having more effective clocks despite being generally the same silicon. (Jaguar, Bulldozer APUs, Ryzen 2, 7970, multiple Hawaii generations, etc.).

Later versions of the DVFS would include more dynamic monitoring of current and voltage behavior, with blocks of dummy ALUs and other units running representative operations to better approximate what the actual logic is doing.

If you take that foundation and make some adjustments for consistency over the whole family, it acts like what Sony claims. The lookup table for characterizing a chip's silicon doesn't need to be unique to a chip. It would be best for efficiency and top clocks if the values did match the chip, but as long as they aren't too aggressive the chip will function fine. As Sony indicated, there's an ideal SOC model that is given to all PS5 chips, which means their DVFS algorithms produce the same output based on the shared set of values. What's lost is the upper performance range.
For the CPU, it's a matter of dropping the boost clocks of an architecture able to go +4.5 GHz. For the GPU, it's dropping the upper turbo clocks first shown with Big Navi and going significantly narrower. There's some indication that many RDNA2 chips could go faster, but the overclocking settings max out before the chips do.
Everything is an evolution of something already existing. Cars are nothing new as they evolved from carts pulled by horses.
 
Everything is an evolution of something already existing. Cars are nothing new as they evolved from carts pulled by horses.
The specific link is the use of event counters paired with silicon characterization data to calculate power and temperature, rather than directly measuring them.
To get what the PS5 does, a universal table that all PS5 APUs follow would yield consistent behavior, as long as other variation-inducing measures like the upper boost clocks are removed.

The PS5's method can be done by AMD's existing DVFS, by dropping clocks, not using the upper clock range, and not using per-chip characteristic data--doing less than what AMD's standard solution can do.
 