Current Generation Hardware Speculation with a Technical Spin [post launch 2021] [XBSX, PS5]

The 10 GB is GPU-optimized, but not exclusive to the GPU. The 10 GB would be the range of addresses that can be strided equally over all the channels, with the remaining space being an extra 1 GB on only some of the chips, which Microsoft seems to have decided to handle a little differently due to the channel disparity.

Yeah, I know the 10 GB over the whole 320 bit bus is "GPU optimal" and not solely for the GPU. I should have made that clear, my fault entirely. Likewise the other 6GB can in theory be accessed by the GPU (and perhaps is, for some OS operations).
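
For anyone who wants the back-of-envelope numbers behind that: a minimal sketch below, assuming the configuration Microsoft has described publicly (ten 14 Gbps GDDR6 chips, each on its own 32-bit channel, six of them 2 GB and four of them 1 GB).

```cpp
#include <cstdio>

int main() {
    // Reported Series X memory configuration (public Microsoft figures):
    // ten GDDR6 chips at 14 Gbps/pin, one 32-bit channel per chip.
    const double gbps_per_pin = 14.0;
    const int bits_per_chip   = 32;
    const int total_chips     = 10;  // 320-bit bus in total
    const int big_chips       = 6;   // 2 GB parts; the other four are 1 GB

    // The first 1 GB of every chip can be interleaved across all ten channels:
    double fast_gbs = total_chips * bits_per_chip * gbps_per_pin / 8.0;
    // The remaining 1 GB on each of the six 2 GB chips spans only six channels:
    double slow_gbs = big_chips * bits_per_chip * gbps_per_pin / 8.0;

    printf("10 GB 'GPU optimal' region: %.0f GB/s\n", fast_gbs);  // 560
    printf(" 6 GB remaining region:     %.0f GB/s\n", slow_gbs);  // 336
}
```

Those two results match the 560 GB/s and 336 GB/s figures Microsoft quoted.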

Wikichip has a discussion about the variations on the TSMC 7nm process, in terms of standard cell libraries.
The number of fins and number of opportunities for routing metal through a cell can be adjusted to emphasize area or performance.
https://en.wikichip.org/wiki/7_nm_lithography_process
However, from elsewhere, it was indicated Zen 2 already went for high density:
https://fuse.wikichip.org/news/3320/7nm-boosted-zen-2-capabilities-but-doubled-the-challenges/
There's also the caveat that AMD doesn't always have to default to standard cells, but a fully custom FPU would be a much bigger exception than things like custom register files and a few key structures.

One thing that did come up in the review is that there's an apparent drop in FPU ports:

This is one possibility that I mentioned when the die shot came out:
https://forum.beyond3d.com/posts/2193153/
My question as to why they would go through this much trouble for what appears to be limited gains in area remains.

Whether it's truly a full halving of ports isn't clear to me.

A few operations like logical ones are tied to ports, and those weren't halved. Perhaps there was a reduction in register file/bypass ports and a reduction in functionality, while leaving a few basic functions on stubs of the original four ports?

edit: misread the heading and thought it was only FPU testing, the logical ops are likely integer domain

Another question is what else changed with division: that's only on one port in Zen 2, so a port diet alone wouldn't account for the drop there.
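
For what it's worth, this is roughly how that kind of instruction profiling works: time a long stream of independent operations and divide by the clock to get ops/cycle. A minimal sketch of the idea below; it's illustrative only, not the actual test from the review, and a real probe would pin the thread, fix the clock, and sweep many more instruction mixes (including division).

```cpp
#include <immintrin.h>
#include <chrono>
#include <cstdio>

int main() {
    const long iters = 100000000;
    const __m256 inc = _mm256_set1_ps(1e-7f);
    // Six independent accumulator chains: enough to cover the ~3-cycle
    // add latency on two 256-bit add pipes (2 pipes x 3 cycles in flight).
    __m256 a = _mm256_setzero_ps(), b = a, c = a, d = a, e = a, f = a;

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) {
        a = _mm256_add_ps(a, inc);
        b = _mm256_add_ps(b, inc);
        c = _mm256_add_ps(c, inc);
        d = _mm256_add_ps(d, inc);
        e = _mm256_add_ps(e, inc);
        f = _mm256_add_ps(f, inc);
    }
    auto t1 = std::chrono::steady_clock::now();

    // Keep the results live so the compiler can't drop the loop.
    __m256 s = _mm256_add_ps(_mm256_add_ps(a, b),
               _mm256_add_ps(_mm256_add_ps(c, d), _mm256_add_ps(e, f)));
    volatile float sink = _mm256_cvtss_f32(s);
    (void)sink;

    double secs = std::chrono::duration<double>(t1 - t0).count();
    printf("%.2f billion 256-bit adds/s\n", 6.0 * iters / secs / 1e9);
    // Divide ops/s by the measured core clock for ops/cycle: ~2 on a
    // full Zen 2 FPU, ~1 if the 256-bit add capacity were halved.
}
```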

Thanks, I'll try my best to understand all of this tomorrow when I'm a bit more with it. I really appreciate your efforts to share what you understand.

I've only skimmed, and some items are at a higher level than the implementation, but it does mention Zen2 using a model with over a thousand monitors.

My thought when (trying to) read it was that I'd underestimated the complexity and also flexibility of modern chip power management features. It did seem that within the existing Zen power management platform there was already a huge opportunity to implement the kind of power management that Cerny had talked about.
 
While it's underwhelming compared to even mid-range Zen 2 CPUs, it's still a substantial gap over the tablet CPU the 2013 console had. Ideally there would be DDR4/5 RAM for CPU tasks alongside GDDR for VRAM, but that wouldn't be very cost-effective for a 400/500 dollar box, I guess.
I'd be thinking more of the memory bandwidth, since it's shared between everything (CPU, GPU, audio chip, etc.).

Perhaps with the higher-density chips for DDR5 and faster speeds we could see it make a comeback for a PS6 or whatever the next Xbox is: 8 or 16 GB of DDR5, and then GDDR of whatever speed is available on its own bus, perhaps 8-16 GB or more on that side.
 
One thing that did come up in the review is that there's an apparent drop in FPU ports:

This is one possibility that I mentioned when the die shot came out:
https://forum.beyond3d.com/posts/2193153/
My question as to why they would go through this much trouble for what appears to be limited gains in area remains.

Yeah, I think this is definitely a brag moment for you. :D

Just wanted to come back to this to suggest it could again simply be about power. 50% / 33% less work being done should surely mean a somewhat corresponding drop in power used and heat generated in those areas. It might make Sony's boosting strategy less likely to see huge drops due to AVX operations.

There may even be hints in the die shots that this cut happened during development of the PS5 APU. I think that another one of Nemez's tweets perhaps shows this:


"The full featured Renoir CCXs would only be margin-of-error larger, they would probably fit without major issues or redesigns."

I think, quite possibly, that PS5 started out with full fat FPUs but moved to these skinnier units later, and the footprint is still there. PS5 was probably deep into development and tons of layout work had already been done at this time.

Let's say Sony were at the point of trying to balance performance, area and power with a given set of technologies. The cuts probably have nothing to do with area, and they're actually costing performance (in some areas), so that'd mean the gain was in the peak power the CPU could consume. And that could benefit maintaining boost clocks across the rest of the system.

TL : DR - Hypothesis: Sony started out with full fat 256-bit units, reduced them well into development to suit their power / frequency strategy, and the footprint of the original units remains.

Maybe this is what MS were having a pop at when they talked about having a "server class" Zen 2 implementation, and people were like "u wot?"
 
TL : DR - Hypothesis: Sony started out with full fat 256-bit units, reduced them well into development to suit their power / frequency strategy, and the footprint of the original units remains.

Maybe this is what MS were having a pop at when they talked about having a "server class" Zen 2 implementation, and people were like "u wot?"
Sounds plausible.
My guess would be that they cut out some of the 256-bit units because they are not often used but draw a lot of power when they are. This way the instructions still work but need more cycles, while at the same time not needing more power than the CPU's power envelope allows.
Not much is lost (because those instructions aren't used that often in games), but in return they get a more stable power envelope to clock higher, or to let the GPU have a bit more power.

The thing I find really odd is that decompression (in BC games) seems to work even a bit slower than on the PS4 Pro (with an SSD). The Xbox does not seem to have that problem. But maybe the cuts hit especially hard in those cases, even if it's irrelevant for general performance.
 
Yeah, I think this is definitely a brag moment for you. :D

Just wanted to come back to this to suggest it could again simply be about power. 50% / 33% less work being done should surely mean a somewhat corresponding drop in power used and heat generated in those areas. It might make Sony's boosting strategy less likely to see huge drops due to AVX operations.
...
I think it's exactly this. With four FPU ports, too much power used in a short time could cause a drop in frequency (which would impact the whole CPU). So I think the idea is to force developers to do the same job more slowly using two ports, ideally without dropping the frequency. As 3dilettante wrote, the very robust cooling should be enough to take care of heat density.
 
Sounds plausible.
My guess would be that they cut out some of the 256-bit units because they are not often used but draw a lot of power when they are. This way the instructions still work but need more cycles, while at the same time not needing more power than the CPU's power envelope allows.
Not much is lost (because those instructions aren't used that often in games), but in return they get a more stable power envelope to clock higher, or to let the GPU have a bit more power.

The thing I find really odd is that decompression (in BC games) seems to work even a bit slower than on the PS4 Pro (with an SSD). The Xbox does not seem to have that problem. But maybe the cuts hit especially hard in those cases, even if it's irrelevant for general performance.

Well, whatever they're doing, I agree it's got to be because of power. Zen 2 is one 256-bit unit per core, so I don't think they could have cut out any of the FPUs as such, but limiting the capability in some other way would still physically guarantee lower power demands. I like the port reduction idea because I don't think it would cause a complete redesign of the entire unit; it would be more like selectively removing duplicated elements. Plus you'd still be left with additional room for any small layout changes (I guess).

I hadn't picked up on some PS4 BC games having slower decompression on PS5. That's curious, but interesting. Could there be some kind of hardware decompression unit in PS4 that's been removed or bypassed in PS5 due to it being superseded? Some kind of single threaded CPU fallback on PS5?

I think it's exactly this. With four FPU ports, too much power used in a short time could cause a drop in frequency (which would impact the whole CPU). So I think the idea is to force developers to do the same job more slowly using two ports, ideally without dropping the frequency. As 3dilettante wrote, the very robust cooling should be enough to take care of heat density.

Yeah, and I don't think it'd necessarily be just to reduce / prevent CPU clock drops. Power not being used by the CPU is directed to the GPU to sustain high boost rates. Guaranteeing that a chunk of power could no longer be taken by the CPU under any circumstances would mean you can reliably deliver higher minimum and average clocks to the GPU, all while staying within the power and cooling capability you've been planning on.

With the move to a potentially less power-hungry FPU, perhaps the CPU no longer has problems maintaining 3 GHz under certain 256-bit loads. If it did under the old setup, before the current FPU, PS5 might now be in a better position to keep CPU clocks high or at max, whatever you throw at it.
 
Correct me if I'm wrong but does that mean if developers start making heavy use of AVX instructions in game engines then PS5 will potentially perform worse than XBSS/XBSX/PC?
 
Correct me if I'm wrong but does that mean if developers start making heavy use of AVX instructions in game engines then PS5 will potentially perform worse than XBSS/XBSX/PC?
If they started pushing FP256 instructions a lot, then yes it would, because the CPU cores would start throttling down heavily.
Realistically they won't, because Sony knows how often these instructions come up and that's why they probably used density-optimized transistors on those blocks.
 
Correct me if I'm wrong but does that mean if developers start making heavy use of AVX instructions in game engines then PS5 will potentially perform worse than XBSS/XBSX/PC?
Well, yes it would, but AVX is really only an edge case in games.
AVX instructions also consume much more power, which is another reason it would hurt PS5 performance (more power for the CPU means less for the GPU). But heavy use of AVX really isn't a thing in games so far.
 
Games that leverage an entity component system with a burst compiler, like Unity, support various instruction types up to AVX-512.

Even though the CPU will be impacted by it, it's going to be one hell of a game lol. If a game is pushing ECS to high loads, it will be a sight regardless of whether GPU performance takes a hit. Imagine a lot of active stuff moving on the screen at once. Graphics will need to be downgraded anyway.
 
Well, yes it would, but AVX is really only an edge case in games.
AVX instructions also consume much more power, which is another reason it would hurt PS5 performance (more power for the CPU means less for the GPU). But heavy use of AVX really isn't a thing in games so far.

So when games do use AVX extensively, what are the main functions for it? I think BFV MP somehow used it to some extent, which made OCing and temperatures somewhat more unstable/higher. DICE never officially stated BFV uses AVX instructions, though CP2077 certainly does make use of it, since there was a patch to fix issues regarding AVX, altering code so older CPUs lacking AVX could run the game. No idea what concessions were made, though.

https://www.dsogaming.com/mods/cyberpunk-2077-patch-1-3-avx-mod-fixes-the-game-on-older-cpus/

Found this
https://www.prowesscorp.com/what-is-intel-avx-512-and-why-does-it-matter/

''Intel AVX-512 can accelerate performance for workloads and use cases such as scientific simulations, financial analytics, artificial intelligence (AI)/deep learning, 3D modeling and analysis, image and audio/video processing, cryptography, and data compression.''

It seems that while AVX(-512) hasn't been used in games all that much, it sure can assist in certain tasks that seem applicable to future games.
 
So when games do use AVX extensively, what are the main functions for it? I think BFV MP somehow used it to some extent, which made OCing and temperatures somewhat more unstable/higher. DICE never officially stated BFV uses AVX instructions, though CP2077 certainly does make use of it, since there was a patch to fix issues regarding AVX, altering code so older CPUs lacking AVX could run the game. No idea what concessions were made, though.

https://www.dsogaming.com/mods/cyberpunk-2077-patch-1-3-avx-mod-fixes-the-game-on-older-cpus/

Found this
https://www.prowesscorp.com/what-is-intel-avx-512-and-why-does-it-matter/

''Intel AVX-512 can accelerate performance for workloads and use cases such as scientific simulations, financial analytics, artificial intelligence (AI)/deep learning, 3D modeling and analysis, image and audio/video processing, cryptography, and data compression.''

It seems that while AVX(-512) hasn't been used in games all that much, it sure can assist in certain tasks that seem applicable to future games.
AVX instructions are designed to do a lot of things in parallel. And the smaller the data size, the more elements you can cram into the SIMD unit.
Typically for games, it would be something like doing a collision check on a lot of objects at once: doors, moving NPCs, etc.
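
To make that concrete, here's a minimal sketch of that kind of batched test; hypothetical code, not from any actual engine. Four axis-aligned bounding boxes get checked against one query box in a single pass of 128-bit SSE, where scalar code would test them one at a time.

```cpp
#include <immintrin.h>
#include <cstdio>

// Structure-of-arrays layout: x/y extents of four AABBs side by side,
// one box per SIMD lane.
struct Boxes4 {
    __m128 minx, miny, maxx, maxy;
};

// Returns a 4-bit mask; bit i is set if box i overlaps the query box.
// One set of comparisons covers all four boxes at once.
int overlap4(const Boxes4& b,
             float qminx, float qminy, float qmaxx, float qmaxy) {
    __m128 ok = _mm_and_ps(
        _mm_and_ps(_mm_cmple_ps(b.minx, _mm_set1_ps(qmaxx)),
                   _mm_cmple_ps(_mm_set1_ps(qminx), b.maxx)),
        _mm_and_ps(_mm_cmple_ps(b.miny, _mm_set1_ps(qmaxy)),
                   _mm_cmple_ps(_mm_set1_ps(qminy), b.maxy)));
    return _mm_movemask_ps(ok);
}

int main() {
    // Four boxes along the x axis: [0,1], [2,3], [4,5], [6,7], all y in [0,1].
    // (_mm_set_ps lists lanes from box 3 down to box 0.)
    Boxes4 b = { _mm_set_ps(6, 4, 2, 0), _mm_set1_ps(0),
                 _mm_set_ps(7, 5, 3, 1), _mm_set1_ps(1) };
    // Query box x:[0.5, 2.5], y:[0, 1] touches boxes 0 and 1 -> mask 0b0011.
    printf("mask = %d\n", overlap4(b, 0.5f, 0.0f, 2.5f, 1.0f));
}
```

The same idea scales to 8-wide with AVX, which is where the port cuts discussed above would start to bite.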
 
Just wanted to come back to this to suggest it could again simply be about power. 50% / 33% less work being done should surely mean a somewhat corresponding drop in power used and heat generated in those areas. It might make Sony's boosting strategy less likely to see huge drops due to AVX operations.
Going by the instruction profiling, it's not just AVX. The FPU is half as effective at 128-bit and 256-bit code, hence the same performance drop in SSE operations.
Vector loads tend to stress AMD's boost speeds the most on the desktop, so power reduction would seem to be the motivator. However, that this needed such a significant re-plumbing of the FPU points to a very tight constraint, like the GPU leaving an unusually limited amount of power for the CPU section.
Microsoft didn't resort to this, and promises consistent clocks with the apparently standard Zen2 FPUs even with higher clock speeds.
If there's ever a salvage SKU for that, perhaps we can get similar profiling to see if it's really that consistent or other less drastic methods were used to limit power, like instruction issue throttling or duty-cycling of the hardware.

Why something like those measures wouldn't be good enough versus a thinned custom FPU is a point of curiosity for me.
Perhaps AMD's method isn't consistent enough for a fully-featured vector FPU for what Sony wanted for its model SOC, or that power ceiling is notably constrained even against another console APU.



There may even be hints in the die shots that this cut happened during development of the PS5 APU. I think that another one of Nemez's tweets perhaps shows this:


"The full featured Renoir CCXs would only be margin-of-error larger, they would probably fit without major issues or redesigns."

I think, quite possibly, that PS5 started out with full fat FPUs but moved to these skinnier units later, and the footprint is still there. PS5 was probably deep into development and tons of layout work had already been done at this time.
Maybe that's the case, since there may have been at least one notable revision in the PS5 validation hardware leak, with no clear indications as to what was changed.
Another possibility is that Sony may have only paid for a revamping of the FPU, and if AMD kept the rest of the core and CCX with the same layout, there's going to be spare space.

Let's say Sony were at the point of trying to balance performance, area and power with a given set of technologies. The cuts probably have nothing to do with area, and they're actually costing performance (in some areas), so that'd mean the gain was in the peak power the CPU could consume. And that could benefit maintaining boost clocks across the rest of the system.
I'm holding out for more instruction analysis at some point. The cuts are pretty significant even outside the 256-bit realm Cerny mentioned.

Sounds plausible.
My guess would be that they cut out some of the 256-bit units because they are not often used but draw a lot of power when they are.
The 50% loss in SSE points to removing whole ports and the ALUs on them. However, doing this would require rebalancing the units on the remaining ports, as I don't think you can cut one or two ports from the Zen2 FPU without needing to put some functionality on other ports that would be lost entirely, or would lose more than 50%.
The vector division benchmarking so much slower is a sign of potentially other hardware changes in the unit, since AMD's FPUs only have one port for that.

I think it's exactly this. With four FPU ports, too much power used in a short time could cause a drop in frequency (which would impact the whole CPU). So I think the idea is to force developers to do the same job more slowly using two ports, ideally without dropping the frequency. As 3dilettante wrote, the very robust cooling should be enough to take care of heat density.
Which leaves me to wonder how much more generous the Series X power budget is for its Zen2 FPUs, or if they did something else to constrain consumption. They're promising constant and higher clocks without a liquid metal TIM.

Well, whatever they're doing, I agree it's got to be because of power. Zen 2 is one 256-bit unit per core, so I don't think they could have cut out any of the FPUs as such, but limiting the capability in some other way would still physically guarantee lower power demands. I like the port reduction idea because I don't think it would cause a complete redesign of the entire unit; it would be more like selectively removing duplicated elements. Plus you'd still be left with additional room for any small layout changes (I guess).
I think Zen2 has more than one 256-bit unit. Depending on the instruction mix, it could go to 4 256-bit operations per clock. A 50% drop from that is still 2 256-bit operations per clock. The 50% drop in SSE points to losing whole units, and probably needing a re-balance of what's left.
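
As a quick sanity check on the peak rates, using the pipe layout quoted later in the thread (two 256-bit multiply-capable pipes and two 256-bit add pipes):

(2 MUL + 2 ADD) x 8 SP lanes = 32 FLOPs/cycle for the full Zen 2 FPU
(1 MUL + 1 ADD) x 8 SP lanes = 16 FLOPs/cycle if the ports were halved

which lines up with the roughly 50% drop the profiling shows.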

Could there be some kind of hardware decompression unit in PS4 that's been removed or bypassed in PS5 due to it being superseded? Some kind of single threaded CPU fallback on PS5?
The PS5 has a superset of the PS4's compression support. Perhaps a conservative emulation of the low-level functionality or APIs is going through extra steps, or the backwards compatibility leads to a thicker container or worse data layout than native?

Correct me if I'm wrong but does that mean if developers start making heavy use of AVX instructions in game engines then PS5 will potentially perform worse than XBSS/XBSX/PC?
The raw numbers for non-AVX are substantially worse than similar Zen2 CPUs, not going into other things like higher memory latency and smaller L3 cache. Zen 3 is another class entirely in terms of FP performance.
There are some indications of CPU-limited scenarios where there is sometimes a modest shortfall versus the Series X, but it's not something that shows up as consistently as the FPU numbers would indicate.
There are other bottlenecks that both consoles would have, but we may need to keep an eye out for later games that could push AVX or non-AVX vector throughput in a way that's more obvious than early titles.

''Intel AVX-512 can accelerate performance for workloads and use cases such as scientific simulations, financial analytics, artificial intelligence (AI)/deep learning, 3D modeling and analysis, image and audio/video processing, cryptography, and data compression.''

It seems that while AVX(512) hasnt been used in games all that much but it sure can assist in certain tasks that seem applicable in future modern games.
AVX 512 is unlikely to find much use in games because AMD flat-out doesn't support it and Intel does not consistently implement it in consumer hardware (or even its server hardware for that matter).

If they started pushing FP256 instructions a lot, then yes it would, because the CPU cores would start throttling down heavily.
Realistically they won't, because Sony knows how often these instructions come up and that's why they probably used density-optimized transistors on those blocks.
I'm not sure about the density-optimized transistor claim, or rather I'm not sure if there was an additional tier of high-density transistor beyond the HD process AMD utilized for Zen2 already.
The math shortfall in 256 and 128 bits points to wholesale removal of hardware, which saves in ALU area, wiring for fewer ports, and smaller register cells because they don't need as many bit lines due to the cut in ports.
 
Slightly disappointing that they gimped the CPU like that. I'm guessing this stems from the decision to clock those CUs as high as possible and use fewer of them compared to XSX as a cost-saving exercise.
 
Slightly disappointing that they gimped the CPU like that. I'm guessing this stems from the decision to clock those CUs as high as possible and use fewer of them compared to XSX as a cost-saving exercise.
But why? Do you see any impact on any game's performance? MS designed their machine with a dual purpose in mind: gaming and cloud services (a focus on compute and CPU FPU, but supposedly less CU efficiency). Sony designed their box as a purely gaming device with a focus on 120 Hz gaming, and for now they have succeeded at that.
 
Slightly disappointing that they gimped the CPU like that. I'm guessing this stems from the decision to clock those CUs as high as possible and use fewer of them compared to XSX as a cost-saving exercise.

You won't notice much of that during this cross-gen period anyway, as games do not use many of the more advanced features that new generations bring (be it AVX, ray tracing, mesh shading, etc.).
 
Going by the instruction profiling, it's not just AVX. The FPU is half as effective at 128-bit and 256-bit code, hence the same performance drop in SSE operations.
Vector loads tend to stress AMD's boost speeds the most on the desktop, so power reduction would seem to be the motivator. However, that this needed such a significant re-plumbing of the FPU points to a very tight constraint, like the GPU leaving an unusually limited amount of power for the CPU section.

Thanks for pointing this out. I'd got swept up in the AVX thing, but yeah, that does seem pretty important. SSE takes a hammering too, and as far as I'm aware that's widely used in game engines.

Microsoft didn't resort to this, and promises consistent clocks with the apparently standard Zen2 FPUs even with higher clock speeds.
If there's ever a salvage SKU for that, perhaps we can get similar profiling to see if it's really that consistent or other less drastic methods were used to limit power, like instruction issue throttling or duty-cycling of the hardware.

I've been trying to have a look for clues about this. As far as I can tell from what's out there, it's just regular Zen 2 ("server class" as MS puzzlingly said). At Hotchips MS simply said of the CPU:

"2x SIMD FP/ pipes/core: 2 MUL and 2 ADD AVX256 per clock -> 32x SPFP ops/clk"

Looking on that wikichip place, it says of Zen 2:

"This improvement doubles the peak throughput of AVX-256 instructions to four per cycle, or in other words, up to 32 FLOPs/cycle in single precision or up to 16 FLOPs/cycle in double precision."

Which would appear to be the same, unless I'm missing something. At Hotchips MS also reckoned "AVX256 gives 972 GFLOP over CPU" (quoting Anand's live notes on the presentation).

972 / 8 cores / 3.8 GHz = 31.97 FLOPs/cycle. Or basically the 32 FLOPs/cycle.

I think they'd have to be engaging in shenanigans if this wasn't basically true (in as much as any peak figures are) over a period of time.

Why something like those measures wouldn't be good enough versus a thinned custom FPU is a point of curiosity for me.
Perhaps AMD's method isn't consistent enough for a fully-featured vector FPU for what Sony wanted for its model SOC, or that power ceiling is notably constrained even against another console APU.

I've been thinking about the power sharing between GPU and CPU - iirc the balance can be adjusted every 2 ms (can't find the source now). Perhaps Sony found that even a relatively small use of vector instructions could create a window (of up to 2 ms) where the GPU was getting less power than it actually needed?

I'm holding out for more instruction analysis at some point. The cuts are pretty significant even outside the 256-bit realm Cerny mentioned.

This whole thing is getting even more interesting now, as the possible implications are quite widespread. I would have thought most games are using at least 128-bit vector operations to accelerate their physics engines.

I think Zen2 has more than one 256-bit unit. Depending on the instruction mix, it could go to 4 256-bit operations per clock. A 50% drop from that is still 2 256-bit operations per clock. The 50% drop in SSE points to losing whole units, and probably needing a re-balance of what's left.

Thanks for the correction, I should have checked before posting.
 