Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

I think the differences in architectures are being looked at from an incorrect perspective. I don't believe it is correct to say Series X has more compute units per shader array; it should be:

Series X has 33% fewer shader arrays per compute unit compared to PS5.

To make it more complete: Series X has 33% fewer shader arrays per compute unit compared to PS5, and those shader arrays are operating at an 18% lower frequency compared to PS5.

That might sound weird at first, but it is in line with what everybody outside of MS and its fans has been saying: that the 12TF number is not a real measurement of actual game performance. There are around 45% more compute units on Series X, though, which is why it is able to keep up with PS5 games as well as it does, only showing lower actual resolution and performance in some scenes.

To me this makes a lot more sense than 'MS has bad tools, developers don't know how to utilise 12TF yet' and so on, as has been heard on many forums by now.

Just my 2 cents. Or rather, my 49900 cents :D
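For anyone who wants to sanity-check the ratios above, here's a quick back-of-envelope in Python. The CU counts, shader-array counts and clocks are the publicly quoted figures (both consoles are generally reported with 4 shader arrays), and the 128 FLOP/CU/clock value is the standard RDNA FP32 rate; treat it all as approximate.

```python
# Back-of-envelope for the ratios above, using the publicly quoted numbers:
# 52 CUs / 4 shader arrays / 1.825 GHz (Series X) vs 36 CUs / 4 shader
# arrays / 2.23 GHz (PS5).

xsx = {"cus": 52, "shader_arrays": 4, "clock_ghz": 1.825}
ps5 = {"cus": 36, "shader_arrays": 4, "clock_ghz": 2.23}

def tflops(gpu, flops_per_cu_clk=128):
    # 64 FP32 lanes per CU x 2 ops (FMA) per clock = 128 FLOP/CU/clock
    return gpu["cus"] * flops_per_cu_clk * gpu["clock_ghz"] / 1000

def arrays_per_cu(gpu):
    return gpu["shader_arrays"] / gpu["cus"]

print("Shader arrays per CU, XSX vs PS5:", round(arrays_per_cu(xsx) / arrays_per_cu(ps5), 2))  # ~0.69
print("Clock ratio, XSX vs PS5:", round(xsx["clock_ghz"] / ps5["clock_ghz"], 2))               # ~0.82
print("CU count ratio, XSX vs PS5:", round(xsx["cus"] / ps5["cus"], 2))                        # ~1.44
print("TFLOPS:", round(tflops(xsx), 2), "vs", round(tflops(ps5), 2))                           # ~12.15 vs ~10.28
```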

Interesting turn of events, so it's actually the XSX that is punching above its weight ;)
 
I think the differences in architectures are being looked at from an incorrect perspective. I don't believe it is correct to say Series X has more compute units per shader array; it should be:

Series X has 33% fewer shader arrays per compute unit compared to PS5.

To make it more complete: Series X has 33% fewer shader arrays per compute unit compared to PS5, and those shader arrays are operating at an 18% lower frequency compared to PS5.

That might sound weird at first, but it is in line with what everybody outside of MS and its fans has been saying: that the 12TF number is not a real measurement of actual game performance. There are around 45% more compute units on Series X, though, which is why it is able to keep up with PS5 games as well as it does, only showing lower actual resolution and performance in some scenes.

To me this makes a lot more sense than 'MS has bad tools, developers don't know how to utilise 12TF yet' and so on, as has been heard on many forums by now.

Just my 2 cents. Or rather, my 49900 cents :D
I already talked about that. It's the "keeping the CUs busy" design Cerny talked about. It's the number of CUs per shader array: the lower, the better.
 
I already talked about that. It's the "keeping the CUs busy" design Cerny talked about. It's the number of CUs per shader array: the lower, the better.
I think we should wait a bit before we are absolutely sure who was right. I can very easily see XSX maximizing all the CUs it has; 52 CUs is nothing to serialize in a console.

To know exactly what is happening we would need to look at the balancing and load distribution of these cross-gen engines, which we don't have, so it is a bit futile to talk about the whys.
 
I already talked about that. It's the "keeping the CUs busy" design Cerny talked about. It's the number of CUs per shader array: the lower, the better.
I'm pretty sure the statement you made won't hold at the extreme end points. Even ignoring the power budget for a moment, you're going to have a hard time delivering data to fewer CUs clocked higher. Those clocks are just pissing away their potential waiting for data to be delivered by memory. And even assuming you managed to feed fewer CUs at a super high clock, you would run into cooling and power issues.
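A rough illustration of the "feeding the CUs" point, as a Python sketch. The bandwidth figures are the publicly quoted peak numbers (560 GB/s for the XSX fast pool, 448 GB/s for PS5, neither quoted earlier in this thread), and this ignores caches and real access patterns entirely, so it's only a peak-rate comparison:

```python
# Toy metric: with a fixed memory bandwidth, how many bytes per CU per clock
# does each setup get? Caches and access patterns are ignored; the bandwidth
# values are the publicly quoted peak figures, not measured numbers.

def bytes_per_cu_clock(bandwidth_gbs, cus, clock_ghz):
    return bandwidth_gbs / (cus * clock_ghz)

print("XSX:", round(bytes_per_cu_clock(560, 52, 1.825), 2), "bytes/CU/clock")  # ~5.9
print("PS5:", round(bytes_per_cu_clock(448, 36, 2.23), 2), "bytes/CU/clock")   # ~5.6
```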

The trend so far in graphics, for both AMD and Nvidia, is to move towards a huge amount of compute and ALU, from the 3070-3090 and the 6700 to the 6800 XT. None of them are short on compute units. It's more than possible that work is sized well for 36 CUs today, given the nature of the 4Pro, PS5 and X1X as we transition, but GPUs are now well into the 60+ CU count with much higher TF ranges.

But the likelihood that graphics will forever right-size its work for 36 CUs and focus purely on front-end fixed-function power, over taking advantage of the available ALU on all DX12U cards, is low.

With DX12U now being the baseline, developers can target DX12U devices, which all come packed with tons of ALU.

I may not be right on a lot of things, but one thing I can be sure of: the next generation of GPUs post RDNA 2 and post Ampere are on trend to only go even wider; they most certainly won't go narrower.
 
I cannot imagine how you can have a unified L3 with AMD's chiplet design. Seems like a massive customization of the Zen 2 CCX. I guess we'll know soon enough.

AMD said some of the next-generation mobile APUs in the U series will be based on an improved Zen 2, maybe the same CPU as the PS5 with a higher clock, if the rumour is true. For the higher-range APUs, the CPU will be Zen 3 based.
 
Is it really that hard to wait until we have the full picture?
Of course not. No one is setting anything in stone.
But you should consider that both consoles ending up with the same real-world performance is a very real possibility. And those who are considering only the scenarios where the Series X comes out on top may end up disappointed.


Was that in regard to the Series X?
Low amount of memory refers to the Series S, but "split" memory zones with distinct performances is something that both Series consoles have.


Not really the same, but remember the 970 having 3.5 GB + 0.5 GB? With drivers this bottleneck was mitigated.
Completely different. The last 512 MB on the 970 was pretty much useless, with a single 28 GB/s bus to the graphics chip and no L2 in between.
I can only guess that all the updated drivers actually did was prevent the GPU from ever using that memory, no matter what. The GTX 970 was, for almost all intents and purposes, a 3.5 GB card.
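To illustrate why avoiding that segment mattered, here's a crude Python sketch of blended bandwidth when some fraction of traffic lands in the slow segment. The 28 GB/s figure is from the post above; the ~196 GB/s for the main 3.5 GB segment is the commonly quoted figure for that card and is an assumption here; real behaviour (stalls, no L2 in front of the slow segment) was worse than this simple average.

```python
# Blended bandwidth if a fraction of accesses hit the slow 0.5 GB segment.
# 196 GB/s (fast segment) is the commonly quoted figure, 28 GB/s is from the
# post above; this is a toy average, not a model of the real stalls.

def blended_bandwidth(frac_slow, fast_gbs=196.0, slow_gbs=28.0):
    # Weight by time per byte, since transfer times are what add up.
    time_per_byte = (1 - frac_slow) / fast_gbs + frac_slow / slow_gbs
    return 1.0 / time_per_byte

for f in (0.0, 0.05, 0.125):
    print(f"{f:.0%} of traffic to slow segment -> ~{blended_bandwidth(f):.0f} GB/s")
```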


AMD said some of the next-generation mobile APUs in the U series will be based on an improved Zen 2; maybe this is the same CPU. For the higher-range APUs, the CPU will be Zen 3 based.
I think what they meant is just that some Ryzen 5000 APUs will be rebranded Renoir chips (now called Lucienne).
The Ryzen 5000U lineup will consist of Cezanne and Lucienne (rebranded Renoir) chips.
https://www.notebookcheck.net/AMD-s...ucienne-and-Zen-3-Cezanne-parts.498427.0.html
 
Low amount of memory refers to the Series S, but "split" memory zones with distinct performances is something that both Series consoles have.

Developers have no need to worry about "split" memory zones on Series S, since the OS reserves the 2 GB of slower memory for its own use, leaving only the faster 8 GB memory pool for games.
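A small sketch of that layout in Python, for clarity. The 8 GB / 2 GB split is from the post above; the per-pool speeds (224 GB/s fast, 56 GB/s slow) are the publicly quoted Series S figures and are an assumption here; the exact OS reservation may differ in practice.

```python
# Series S memory layout as described above: the slow pool is OS-reserved,
# so games only ever see the fast pool. Speeds are the publicly quoted figures.

series_s_memory = [
    {"pool": "fast", "size_gb": 8, "bandwidth_gbs": 224, "visible_to_games": True},
    {"pool": "slow", "size_gb": 2, "bandwidth_gbs": 56,  "visible_to_games": False},  # OS reserved
]

game_visible = [p for p in series_s_memory if p["visible_to_games"]]
print("Game-visible memory:", sum(p["size_gb"] for p in game_visible), "GB,",
      "all at", game_visible[0]["bandwidth_gbs"], "GB/s")
```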
 
But the likelihood that graphics will forever right-size its work for 36 CUs and focus purely on front-end fixed-function power, over taking advantage of the available ALU on all DX12U cards, is low.
The move towards ALU-based workloads has been, and will continue to be, the way the industry is moving. I don't see that changing.

The interesting thing for me is that even if the cross-gen workloads are optimised for fewer CUs (remember that these games also run on PC), they still seem to be under-utilizing and not scaling well on XSX.
Even though it may just be early launch-window issues, I'm still surprised.
How many CUs does the low-end Big Navi have?
 
The move towards ALU-based workloads has been, and will continue to be, the way the industry is moving. I don't see that changing.

The interesting thing for me is that even if the cross-gen workloads are optimised for fewer CUs (remember that these games also run on PC), they still seem to be under-utilizing and not scaling well on XSX.
Even though it may just be early launch-window issues, I'm still surprised.
How many CUs does the low-end Big Navi have?
According to recent AC: Valhalla benchmarks, it's not scaling that well on Ampere either.
There are games with a large focus on rasterization performance (outside of the CUs); RDNA 2 scales very well there, as AMD focused a lot on building in rasterization hardware and supporting it with Infinity Cache.
And that's something that's just not available to XSX. It's lacking the respective rasterization hardware (outside the CUs), and even if it had it, it wouldn't have the bandwidth to support it.
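To put a rough number on "the bandwidth to support it", here's a back-of-envelope in Python for raw colour-write traffic if the ROPs ran flat out. The 64 ROPs and 1.825 GHz are from this thread; the 560 GB/s pool figure and the 4 bytes/pixel (an uncompressed RGBA8 target) are assumptions; DCC and caches change this a lot in practice.

```python
# Raw colour-write traffic at peak fill rate vs available bandwidth.
# 64 ROPs and 1.825 GHz are from the thread; 560 GB/s and 4 bytes/pixel
# (uncompressed RGBA8, no DCC) are illustrative assumptions.

rops = 64
clock_ghz = 1.825
bytes_per_pixel = 4

peak_fill_gpix = rops * clock_ghz                     # ~116.8 Gpixels/s
write_traffic_gbs = peak_fill_gpix * bytes_per_pixel  # ~467 GB/s of colour writes alone

print(f"Peak fill: {peak_fill_gpix:.0f} Gpix/s -> ~{write_traffic_gbs:.0f} GB/s "
      f"of colour writes vs a 560 GB/s pool shared with everything else")
```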
 
I think we should wait a bit before we are absolutely sure who was right. I can very easily see XSX maximizing all the CUs it has; 52 CUs is nothing to serialize in a console.

To know exactly what is happening we would need to look at the balancing and load distribution of these cross-gen engines, which we don't have, so it is a bit futile to talk about the whys.

I think @iroboto's post a few pages back about the fixed clocks possibly being the issue is pretty much spot-on. Going back and looking at how analogous advantages the other systems had over each other didn't lead to certain performance gains (lower-latency memory, more ROPs, a bigger net of cache, etc.), then seeing the disparity in power consumption between PS5 and Series X in some multiplat games like DiRT 5 and Valhalla (I think), I think that shows a lot.

It also helped me better visualize the difference between the fixed-frequency and variable-frequency approaches; it's probably as simple as saying we're seeing the results of that play out in some of these multiplat games. When you think about it, PS5 is pretty much operating at or near its theoretical peak the majority of the time. I think the whole thing about power budgets needing to be managed will become that system's own issue as the gen goes on (devs will have to optimize their code for power efficiency to make sure they stay within the fixed power budget), but what we're probably seeing on Series X are the usual teething pains that come with trying to keep a system occupied with work at a fixed frequency to maximize the power budget. Some of the 3P games look like they're doing poorly at it, especially considering that last-gen games optimized for the system, like Gears 5, not only look and run better (quite a bit better, in fact), but also consistently consume more power, as expected, meaning they're probably maximizing, or close to maximizing, use of the power envelope by keeping the GPU busy with tasks and saturating it with an optimized distribution of workloads.

So I'm suspecting the assistance MS gives to 3P devs will be along those lines: polishing up tools and just letting devs have enough time to get the hang of the final GDK. But I never once considered it could've been the fixed clocks being a barrier because, y'know, every console in the past has gone with fixed clocks. One thing variable clocks seem immediately good for, though, is giving a "free boost" to a game that isn't tasking the GPU with a lot of work.

That can tie in with your point, however; these are still cross-gen 3P multiplats designed mainly with older consoles and architectures in mind. They just happen to get some seemingly automatic benefits on PS5 because: A) the tools are mostly similar to PS4's but expanded on, B) any PS5 version up-porting the PS4 code to the new system has the automatic benefit of a better baseline (compared to Xbox One), and C) variable frequency for these games (which aren't necessarily pushing the PS5 to its limits) means the system will just default to at or near 2.23 GHz on the GPU the vast majority of the time (that's basically a lot of free lunch).

It doesn't work like that on Microsoft's system, though; the games would need more optimization to keep the GPU occupied at the expected clock, or there'll be a lot of periods of sub-optimal resource utilization. Maybe the devs just haven't had the time to optimize like that for Series X, considering the situation with devkits and that they have to split focus across five Xbox consoles (One, One S, One X, Series S, Series X) but only three PlayStation systems (PS4, PS4 Pro, PS5). Maybe we can look at it as kind of an inverse of the BC situation on the consoles: there we often see Series X doing better because, if many of those games are CPU-bound, the system disables SMT and clocks the CPU higher. Plus, XBO games have a lower performance floor compared to PS4 titles, which means possibly lower-quality textures to stream in; combined with the faster CPU, that explains the load times and framerate boosts in BC titles favoring Series X over PS5 (on average).

If fixed clocks are more of a factor in the performance differences with non-BC 3P multiplats, though, then it might be a little while before we see Series X 3P games fully reach parity with (let alone surpass) PS5 games. If it were just tools needing to be updated, I could've seen 2-3 months post-launch at most. Now, though, it could be at least half a year, because there's going to be more learning coming into play for devs to maximize consistent saturation of the GPU for peak performance, particularly with games that aren't necessarily taxing on the GPU in the first place. Because the way it seems, if you've got a game saturating the PS5's GPU, it's filling 36 CUs with work at a clock of 2.23 GHz; if that same game then gets ported to Series X from the PS5 (if PS5 is the lead platform), it's probably only using 36 of the 52 CUs, and what's more, the CUs are clocked lower AND the clocks are fixed, so there's no free lunch. Whereas PS5 starts at the ceiling and drops during certain stress points exceeding the power budget, Series X starts at the floor and ramps up to its peak depending on the workload and how much power the workload needs.

Which is how all other consoles did it; it's just that PS5 is doing it the other way around, with more immediate benefits to be had (but, I suspect, its own difficulties in keeping things tamed as workloads for next-gen games become a lot more demanding, something Series X might have an easier time dealing with).
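As a purely conceptual toy in Python of the contrast being described (not how either console actually manages clocks or power): the variable-clock side starts at its frequency cap and shaves frequency only when a workload would blow the power budget, while the fixed-clock side never moves and light workloads simply leave the GPU partly idle. The cubic power scaling and the "load" numbers are illustrative assumptions.

```python
# Toy contrast of variable vs fixed clocks. "load" is the fraction of the
# power budget a workload would consume at peak clock; power ~ f^3 here.
# Entirely illustrative; neither console works exactly like this.

def variable_clock(load, f_max=2.23, power_budget=1.0):
    # Start at the cap; reduce frequency until the workload fits the budget.
    f = f_max
    while load * (f / f_max) ** 3 > power_budget and f > 0.1:
        f -= 0.01
    return round(f, 2)

def fixed_clock(load, f_fixed=1.825):
    # Frequency never moves; light workloads just under-occupy the GPU.
    return f_fixed

for load in (0.6, 0.9, 1.1):
    print(f"load {load}: variable -> {variable_clock(load)} GHz, fixed -> {fixed_clock(load)} GHz")
```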
 
I think @iroboto's post a few pages back about the fixed clocks possibly being the issue is pretty much spot-on. Going back and looking at how analogous advantages the other systems had over each other didn't lead to certain performance gains (lower-latency memory, more ROPs, a bigger net of cache, etc.), then seeing the disparity in power consumption between PS5 and Series X in some multiplat games like DiRT 5 and Valhalla (I think), I think that shows a lot.
It was a thought. But it's not necessarily right ;)
Ampere and RDNA 2 both run very high boost clocks, and in certain games we see RDNA 2 being able to pull way ahead. So clocking, or boost clocks, is not the whole story. AC Valhalla performs spectacularly badly on Ampere relative to the compute power it has available. That part isn't clear, and there is something RDNA 2 is doing much better in certain setups.

MS cited that the XSX should have a performance profile sitting around a 2080. It's a little below that right now, but the 2080 is also quite a bit below the 6800 and 6800 XT in quite a few benchmarks. So something happened with the way these games launched, and I don't know if it's geometry just being handled much better by RDNA 2, but it's worth investigating what is happening. NGG, as of the June SDK, was not ready; I don't know if it was ready by the time the console launched. That brief glimpse of the inner workings of the GDK is now getting quite old, and the information we borrow from there is becoming dated quickly.
 
I'm pretty sure the statement you made won't hold at the extreme end points. Even ignoring the power budget for a moment, you're going to have a hard time delivering data to fewer CUs clocked higher. Those clocks are just pissing away their potential waiting for data to be delivered by memory. And even assuming you managed to feed fewer CUs at a super high clock, you would run into cooling and power issues.
All GPUs are built to accommodate the latency of GDDR, which is why job scheduling and smart cache usage are so important. But to your general point, and as somebody who has done a lot of parallelisation work: it's always easier to manage fewer work queues for any resource than more. The more workers in contention for every shared resource of a work queue, the harder it is to keep them all busy; that's just how things are.
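A crude Python model of that point, just to make it concrete: if every worker has to grab work from one shared queue, the queue itself caps how many items can be handed out, so adding workers past that point only lowers per-worker utilisation. The t_fetch/t_work numbers are made up for illustration.

```python
# Toy model: each worker holds the shared queue for t_fetch per item and then
# does t_work of useful work. The queue can dispense at most 1/t_fetch items
# per unit time, so utilisation drops once workers exceed that supply.
# The timings are invented purely for illustration.

def effective_utilisation(workers, t_fetch=1.0, t_work=20.0):
    demand = workers / (t_fetch + t_work)   # items/sec the workers could consume
    supply = 1.0 / t_fetch                  # items/sec one queue can hand out
    served = min(demand, supply)
    return served * t_work / workers        # fraction of time each worker is busy

for w in (8, 16, 24, 32, 64):
    print(f"{w:2d} workers on one queue -> ~{effective_utilisation(w):.0%} busy")
```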
 
Even the ability to upclock the GPU by 5% could have a huge impact on XSX GPU perf.
BUT, does anyone know of a precedent for this?

e.g. a manufacturer increasing clocks on a product AFTER launch?
Seems like the realm of fantasy to me!
Nintendo did it with the Switch. There are games (MK11?) that run higher clocks in general (maybe just in handheld mode), and they also added a boost to clocks for some games during loading screens. The former has an impact on battery life; the latter may have an impact, but it might also equalize over game time, since you will spend less time loading and more time playing.

So a game supporting four Xbox hardware configurations and offering 120fps on two of them, which has a few missing plants and a minor LOD issue, is a "poor showing"? Come on..
The LOD issue makes the game instantly recognizable as having lesser visuals. Yeah, it's a poor showing. I also expect it to be fixed.

I think XSX's biggest issue is the memory set-up, not the TFlops.
Two developers complained about the "interleaved" memory publicly:
[screenshot of one developer's since-deleted comment]

And also the Crytek developer; both deleted their statements.
BK's comments are specific to the S, and I'm sure the lower amount of memory may be an issue, but the "split memory banks" are a non-issue. Applications only have access to 8 GB, so they will never touch the slower memory. On X? Maybe. But I don't see how a patch of slower memory is going to affect your resolution as long as your render targets are held in the full-speed memory. And if they are, most of what you would be storing in the slower regions would be assets, which, if they caused any sort of performance issue while being accessed, would cause the same issues regardless of resolution.
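As a sketch of the placement policy being described (render targets in the fast pool, bulk assets spilling to the slow pool), here's an illustrative Python snippet. The 10 GB / 560 GB/s and 6 GB / 336 GB/s figures are the publicly quoted Series X pools (the slow pool also holds the OS share) and are assumptions here; the "allocator" is purely hypothetical.

```python
# Hypothetical placement policy: bandwidth-hungry render targets go to the
# fast pool, bulk assets prefer the slow pool. Pool sizes/speeds are the
# publicly quoted Series X figures; this is not how the real OS allocates.

POOLS = {
    "fast": {"size_gb": 10.0, "bandwidth_gbs": 560, "free_gb": 10.0},
    "slow": {"size_gb": 6.0,  "bandwidth_gbs": 336, "free_gb": 6.0},   # includes the OS share
}

def place(name, size_gb, kind):
    order = ["fast", "slow"] if kind == "render_target" else ["slow", "fast"]
    for pool in order:
        if POOLS[pool]["free_gb"] >= size_gb:
            POOLS[pool]["free_gb"] -= size_gb
            return f"{name}: {size_gb} GB -> {pool} pool"
    return f"{name}: {size_gb} GB -> failed to place"

print(place("gbuffer + depth + post chain", 1.5, "render_target"))
print(place("streamed textures", 4.0, "asset"))
print(place("geometry + misc", 2.5, "asset"))
```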
 
It was a thought. But it's not necessarily right ;)
Ampere and RDNA 2 both run very high boost clocks, and in certain games we see RDNA 2 being able to pull way ahead. So clocking, or boost clocks, is not the whole story. AC Valhalla performs spectacularly badly on Ampere relative to the compute power it has available. That part isn't clear, and there is something RDNA 2 is doing much better in certain setups.

MS cited that the XSX should have a performance profile sitting around a 2080. It's a little below that right now, but the 2080 is also quite a bit below the 6800 and 6800 XT in quite a few benchmarks. So something happened with the way these games launched, and I don't know if it's geometry just being handled much better by RDNA 2, but it's worth investigating what is happening. NGG, as of the June SDK, was not ready; I don't know if it was ready by the time the console launched. That brief glimpse of the inner workings of the GDK is now getting quite old, and the information we borrow from there is becoming dated quickly.

I guess it's better to say, then, that it's a collection of factors inhibiting some 3P performance on Series X; best not to point to any one factor and call it a day. Some unfinished features in the GDK, combined with whatever additions in the RDNA 2 PC GPUs the consoles are likely missing (Infinity Cache), plus any lack of time/resources for dev teams to familiarize themselves with updated GDK tools/features, maybe being exacerbated by the fixed clock profile, may have been a better way of phrasing it. Though thinking a bit more on it, it may've been erroneous to fixate so much on the fixed clocks being the "gotcha!", since fixed clocks are nothing new for console devs; it's been that way for decades.

At the same time, there have been postings of that Series X Twitter leak in the thread and some other breakdowns by other posters. Seeing RDNA 1 for the frontend, and knowing that RDNA 1's frontend wasn't necessarily the best (RDNA 2's seems to be a lot better), still takes me by surprise, but there has to be more to it; I'd find it in bad taste if MS simply took the RDNA 1 frontend and did no custom optimizations to it.

Hopefully I didn't sound like I was pinning it all on fixed clocks; when I see others bring up points that seem to lead somewhere, there's a tendency on my behalf to overemphasize those points at first :p. Though I think it is a factor, it'd be much less of a factor if the GDK were stabilized and devs were given some time to breathe and familiarize themselves with a stable API environment. That seems to be much more the case with Sony, because their devkits were out the door earlier and their SDK is pretty much the PS4 feature suite with some PS5 tools integrated on top. Plus, they started getting that out before the current <cough> situation took over, screwing many things up.
 
Even the ability to upclock the GPU by 5% could have a huge impact on XSX GPU perf.
BUT, does anyone know of a precedent for this?

e.g. a manufacturer increasing clocks on a product AFTER launch?
Seems like the realm of fantasy to me!
The PSP, on the CPU side: 222 to 333 MHz.
 
Nintendo did it with the Switch. There are games (MK11?) that run higher clocks in general (maybe just in handheld mode)
Any non-defective Switch that can function in docked mode can be clocked higher when in handheld mode; it's not the out-of-spec overclock that is being suggested here.
Plus, even if the hardware can sustain it, add a little more heat and you can't predict what will happen in poorly ventilated spaces.
 
But this was managed from the start; it's not like at some point someone screamed "the DS is too powerful, we need to risk increasing the CPU frequency!"
I heard that's what happened; that Ken Kutaragi lost his shit in a meeting and started throwing chairs about because of the awesome power of the DS. :yep2:
 
I'm not seeing the connection to XSX. Or, better worded, I mean the literal connection.

So the fact that it's labelled as Navi 21 Lite and found in OSX drivers tells me that the product exists in AMD's lineup. While that could very well be what XSX is based upon, it does not imply that it is exactly as the driver states.
With regard to the driver, or even that product, they may have positioned it to be specifically compute heavy, reducing more on the front end to cater to that market's needs.

I don't see this as surefire proof that Navi 21 Lite is XSX and that therefore all these other claims now apply.
Yep, your thoughts here were pretty much my thoughts when I first saw the driver leaks last month; I didn't think much of them and brushed them off. However, back then we didn't have RDNA 2 details and block diagrams for Navi 21. We now have the details for XSX and the block diagram from Hot Chips, and both clearly have different rasterisation specifications and a change in pipeline.

Below is a slide confirming XSX triangle rasterisation rate, highlighted in red:

[Hot Chips slide showing the XSX triangle rasterisation rate]


This is 4 triangles per cycle for XSX:

4 x 1.825 GHz = 7.3 Gtri/s, or billion triangles per second.

Now, XSX has 4 Scan Converters in total across 4 Shader Arrays for rasterisation (from the driver leak and the triangle throughput above), and its maximum triangle throughput is 4 triangles per clock cycle. This is the same as RDNA 1 and Navi 10. You can see the Raster Units containing the scan converters below, 4 in total:
[Navi 10 block diagram showing the 4 Raster Units]


Navi 21 has 8 Shader Arrays and 8 Scan Converters for rasterisation (from the driver leak, twice as many as XSX for both), yet its maximum triangle throughput is still 4 triangles per clock cycle, and it still has 4 Raster Units, as below, where they span Shader Engines rather than Shader Arrays:

[Navi 21 block diagram showing 4 Raster Units spanning the Shader Engines]


For Navi 21, RDNA 2 has each of its Raster Units capable of rasterising triangles with coverage ranging from 1-32 fragments:
https://forum.beyond3d.com/posts/2176773/
XSX has Raster Units with the RDNA 1 capability of triangle coverage up to 16 fragments, with 4 Raster Units x 16 giving 64 fragments per cycle to match its 64 ROPs.

What RDNA 2 does is take the same 4 triangles but rasterise them with finer granularity for smaller triangles (using 2 Scan Converters per Raster Unit for coarse and fine rasterisation). This produces twice as many fragments from 4 triangles per cycle compared to XSX. RDNA 2 is clearly not the same as XSX where the Raster Units are concerned.
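Putting the comparison above into a quick Python snippet: both setups top out at 4 triangles per clock, but the per-Raster-Unit fragment coverage differs (16 vs up to 32, as described above). The XSX clock is the one quoted in this thread; Navi 21 clocks vary by SKU, so only per-clock numbers are compared for it.

```python
# Per-clock rasterisation comparison as laid out above: 4 tri/clk for both,
# 16 fragments per Raster Unit on XSX vs up to 32 on Navi 21.

xsx    = {"tri_per_clk": 4, "raster_units": 4, "frag_per_ru": 16, "clock_ghz": 1.825}
navi21 = {"tri_per_clk": 4, "raster_units": 4, "frag_per_ru": 32}

print("XSX:   ", xsx["tri_per_clk"], "tri/clk,",
      xsx["raster_units"] * xsx["frag_per_ru"], "fragments/clk",
      f"({xsx['tri_per_clk'] * xsx['clock_ghz']:.1f} Gtri/s at {xsx['clock_ghz']} GHz)")
print("Navi21:", navi21["tri_per_clk"], "tri/clk,",
      navi21["raster_units"] * navi21["frag_per_ru"], "fragments/clk")
```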

If that makes sense. Aside from one claim that seems disputed by RGT, I can't make much more commentary.
What RGT claim are you referring to?
I know of no method to declare what makes a CU RDNA 2 or RDNA 1. The likelihood that you can pull just the RT unit and not the whole CU with it is low. I get that we do armchair engineering here, but this is an extremely far stretch. MS wasn't even willing to shrink their processors further and thus upgraded to Zen 2 because it would be cheaper. The consoles are semi-custom, not full custom. They are allowed to mix and match hardware blocks as they require, but it's clear there are limitations. If you know the exact specifications you can share them, but I don't.
I don't think there is much of a difference between RDNA 1 and RDNA 2 CUs, unlike going from GCN to RDNA 1. From the driver leak, they still have the same wavefront granularity; it's just the maximum wavefronts per SIMD that has changed for RDNA 2. This suggests the front-end changes map to an optimal instruction mix for the CUs, rather than a straight upgrade. And Command Processor tweaks.

Regarding blocks and modifications, the SIMD and Scalar units are programmable, and they are separate blocks from the fixed-function TMUs and RAs. AMD, MS and TSMC engineers should be more than capable of modifying what is basically an electronic circuit, a complex one no doubt. There is nothing inherently impossible about this.

A claim like the front end being RDNA 1 is a weird one, given that mesh shaders are part of that front end. The GCP needs to be outfitted to support mesh shaders. The XSX also supports the NGG geometry pipeline as per the leaked documentation (which as of June was not ready), so once again, I'm not sure what would constitute it being RDNA 1 vs RDNA 2.
I will say that "RDNA 1 for the front-end and CUs" doesn't mean those complete stages are RDNA 1, rather a specific stage or component. So you will still have blocks like the Geometry Engine and Mesh Shader logic as RDNA 2, even though they are considered RDNA 1 for the front-end in the leak. This actually points to MS not putting as big a focus on fixed-function stages and more of a focus on newer features, which is where MS would like developers to focus their efforts.

All things considered, with RT defaulting to RDNA 2 because it doesn't exist for RDNA 1, and the Render Backends being RDNA 2 as well, the above sounds like a storm in a teacup, where most of the stages are upgraded. The biggest impact would be the inefficiency of rendering small triangles due to the older Scan Converters/Raster Units.
 