Digital Foundry Article Technical Discussion [2023]

I stopped reading your reply after this point because that's not what I said.
And I think XSX having one of the lowest bandwidth per CU ratio of any RDNA2 powered GPU is also a factor.

I'm not sure how else to interpret this. I don't typically give short responses; clarity is important to me. But if you're upset that I didn't understand you correctly, I suggest you write more than one-word responses.
 
Here is a summary:
PS5 went narrower and faster, whereas Series X went wider and slower. Cerny's reasoning was that it is easier to fill fewer CUs and utilise them better at higher clocks, whereas Series X went wider for more computational power.
Both approaches have their pros and cons. The Series X's CUs are probably less well utilised, which is why the PS5 manages good parity, but we will see how things look if games ever come to require more CUs.

This is a different subject to memory bandwidth though. PS5 went for narrower but faster because of what's happening inside the GPU, not for external BW reasons.

I wouldn't have thought there'd be any trouble filling more CUs. GPUs are inherently parallel and should just use the whole CU offering - how do you prevent a workload from actually being distributed across all 56 CUs on XBSX?

I'm not the best person to try and explain this as it's been a while since I read about it, I'm not that smart, and I'm drunk... but I think it's to do with how you allocate work to those CUs. You need to be able to create a job, allocate it to a CU, then when it's done, retire that job. The job might come from the 3D pipeline or from some kind of compute queue, as far as I can remember.

There's a limit, iirc, on how fast each shader engine can tell CUs to do stuff, acknowledge that they've done it, and then give them more stuff to do. Even if a scene is inherently able to support a wide level of parallelism, you need the front and back end to not bottleneck the compute units.

So basically, iirc, the Series X really suits long-running shaders that get tons of work done without overburdening some aspects of the shader engine. RT would appear to be such a task, based on what MS have said about the SX and its need for lots of BW to keep the CUs fed.
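
To make that concrete, here's a toy model of the front-end argument in the spirit of Little's law (waves in flight = launch rate x wave lifetime). Every number below is invented purely for illustration - real RDNA2 launch rates and per-CU wave capacities differ - so only the shape of the result matters.

def frontend_utilisation(launch_rate_waves_per_cycle, wave_duration_cycles, resident_wave_capacity):
    # Little's law: waves in flight = launch rate x average wave lifetime.
    in_flight = launch_rate_waves_per_cycle * wave_duration_cycles
    return min(1.0, in_flight / resident_wave_capacity)

# Pretend each CU can hold 32 resident waves and the front end launches 1 wave per cycle.
for cus in (36, 52):
    capacity = cus * 32
    for duration in (200, 2000):  # short vs long-running shader, in cycles
        print(cus, "CUs,", duration, "cycle shader ->", round(frontend_utilisation(1.0, duration, capacity), 2))

With long-running shaders both widths saturate; with short ones the wider machine sits emptier, which is the long-running-shader point above.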
 
I'm not sure how else to interpret this. I don't typically give short responses; clarity is important to me. But if you're upset that I didn't understand you correctly, I suggest you write more than one-word responses.

What I originally said was perfectly clear, but I'll help you along.

What I said......

And I think XSX having one of the lowest bandwidth per CU ratio of any RDNA2 powered GPU is also a factor

What you claim I said

XSX has the absolute worse bandwidth per CU metric of any GPU - and so the 5700XT should suffice

What you said is nowhere close to what I actually said.

And if you can't understand my comment, which was a pretty simple sentence, there's no point even attempting a discussion with you.
 
PS5 has ~20% more BW per CU but is also clocked *up to* 17% higher.

So what BW advantage does XSX have with its lower clocks, when PS5's higher bandwidth per CU nullifies it?
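
For reference, a quick back-of-the-envelope using the commonly quoted launch figures (448 GB/s, 36 CUs and up to 2.23 GHz for PS5; 560 GB/s on the fast 10 GB pool, 52 CUs and 1.825 GHz for XSX). The exact percentages obviously shift depending on which bandwidth figure you pick for XSX's split memory pool.

specs = {
    # (peak bandwidth GB/s, active CUs, GPU clock GHz) - commonly quoted figures
    "PS5": (448, 36, 2.23),
    "XSX": (560, 52, 1.825),  # 560 GB/s is the fast 10 GB pool only
}

for name, (bw, cus, clk) in specs.items():
    per_cu = bw / cus
    print(f"{name}: {per_cu:.1f} GB/s per CU, {per_cu / clk:.1f} GB/s per CU per GHz")

Per CU the PS5 comes out ahead; per CU per clock the two land fairly close together, which is roughly what the argument above is circling around.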

But, back to talking about DF videos.
It doesn't work like that. Higher GPU clocks mean that waits on memory requests burn more cycles.
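
A minimal illustration of that point (the 300 ns figure is a made-up round number, not a measured latency): the same wall-clock memory stall costs more cycles on the faster-clocked GPU.

mem_latency_ns = 300.0  # invented round figure, for illustration only

for name, clk_ghz in (("PS5", 2.23), ("XSX", 1.825)):
    cycles = mem_latency_ns * clk_ghz  # GHz = cycles per nanosecond
    print(f"{name}: a {mem_latency_ns:.0f} ns memory stall costs ~{cycles:.0f} GPU cycles")

Of course the higher-clocked part also gets through more cycles of work per nanosecond when it isn't stalled, so this is really about how much latency hiding the scheduler has to find, not raw throughput.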
 
It's been stated that Xbox doesn't take priority in development, which has affected its game performance. Case in point:


So, the PC with millions of configurations takes precedence over the X with its closed, fixed configuration? That makes it sound like the X hardware/software is seriously problematic if devs would rather tackle the PC first.

I guess by the time the PS5 Pro is available, Series X will fall further behind if "development priority" really is the cause of its issues.

So, the bigger question is... what does it mean for the future of Xbox hardware if PlayStation hardware takes priority next generation and in the generations after that?
 
So, the PC with millions of configurations takes precedence over the X with its closed, fixed configuration? That makes it sound like the X hardware/software is seriously problematic if devs would rather tackle the PC first.

I guess by the time the PS5 Pro is available, Series X will fall further behind if "development priority" really is the cause of its issues.

So, the bigger question is... what does it mean for the future of Xbox hardware if PlayStation hardware takes priority next generation and in the generations after that?
Someone here mentioned that John from DF said that Xbox certification takes longer.
 
I'm not the best person to try and explain this as it's been a while since I read about it, I'm not that smart, and I'm drunk... but I think it's to do with how you allocate work to those CUs. You need to be able to create a job, allocate it to a CU...
Woah, is that how it's done?? Devs have to manage CUs like CPU cores and load them up with work? I thought work was dispatched to the GPU and the schedulers allocated resources, distributed work around available CUs, interleaved work (async compute) etc. And that way, you can get the same PC code and run it on a 28 CU part or a 56 CU part and it'll distribute across available resources automatically. IIRC Naughty Dog used specific CU reservation on PS4 but then they could in closed hardware. Otherwise I thought you just throw work at the GPU and let it handle management, optimising the work you throw at it to make best use of its scheduling.

What level of GPU-resource-distribution management is there and what level is down to the developer's code?
 
Woah, is that how it's done?? Devs have to manage CUs like CPU cores and load them up with work? I thought work was dispatched to the GPU and the schedulers allocated resources, distributed work around available CUs, interleaved work (async compute) etc. And that way, you can get the same PC code and run it on a 28 CU part or a 56 CU part and it'll distribute across available resources automatically. IIRC Naughty Dog used specific CU reservation on PS4 but then they could in closed hardware. Otherwise I thought you just throw work at the GPU and let it handle management, optimising the work you throw at it to make best use of its scheduling.

What level of GPU-resource-distribution management is there and what level is down to the developer's code?
Developers have no control over individual CUs in DirectX.

For the 3D pipeline, the scheduler has full control.

Compute shaders will just spread across all the CUs, but you do have some control over how the kernel is dispatched, via thread group sizes and the number of groups launched. But maybe that is too CUDA-ish.
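
A little sketch of that last point, purely illustrative: the hardware spreads workgroups over whatever CUs exist, so the same dispatch lands on 36 or 52 CUs without the developer doing anything, but a dispatch that is too small simply leaves a wider GPU emptier. The figure of 4 concurrent workgroups per CU is an arbitrary assumption; real occupancy depends on registers, LDS and so on.

import math

def dispatch_fill(total_workgroups, cu_count, groups_per_cu=4):
    # The hardware scheduler, not the developer, spreads workgroups across CUs.
    slots = cu_count * groups_per_cu                 # concurrent workgroup slots
    rounds = math.ceil(total_workgroups / slots)     # "rounds" needed to drain the dispatch
    avg_fill = total_workgroups / (rounds * slots)   # how full the machine is on average
    return rounds, avg_fill

for cus in (36, 52):
    rounds, fill = dispatch_fill(60, cus)            # a small 60-workgroup dispatch
    print(f"{cus} CUs: {rounds} round(s), average fill {fill:.0%}")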
 
Right. So how is it hard to populate more CUs with work? Got more CUs? They do the same work in less time, or maybe the same time if clocked lower, depending of course on managing resources like cache. But again, I don't see an inherent problem with keeping more CUs occupied such that fewer, faster-clocked CUs would be the more economical choice. Is it really a matter of CU count relative to other aspects of the GPU, and was Cerny's particular case about... oh, hang on, I can watch the actual video...

...Okay, two points are raised.

1) Higher clocks affect all other aspects of the GPU. Yes.
2) When he says it's easier to occupy fewer CUs, he only mentions drawing triangles, not general compute work.

Quote: "When triangles are small, it's much harder to fill all those CUs with useful work."

Somewhat different to how Nesh phrased it:

Cerny's reasoning was that it is easier to fill fewer CUs and utilise them better at higher clocks, whereas Series X went wider for more computational power.
So what sort of penalty does XBSX have in using CUs to draw triangles in something like Nanite versus PS5's higher clocked CUs and why? Or is that difference a little overstated? PC benchmarks should provide a nice indicator of how clock and width scale across architectures and highlight any obvious narrow+fast advantages.
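
One concrete mechanism behind the "small triangles" quote, sketched with invented pixel and quad counts: pixel shading is launched in 2x2 quads, so a tiny triangle lights up far more lanes than it has visible pixels, and the fixed-function setup/raster rate per triangle doesn't scale with CU count either.

def quad_shading_efficiency(pixels_covered, quads_touched):
    # Pixel shaders run in 2x2 quads: every touched quad launches 4 lanes,
    # even if the triangle only covers one pixel of that quad.
    return pixels_covered / (quads_touched * 4)

print(quad_shading_efficiency(1, 1))        # 1-pixel triangle: 25% of launched lanes useful
print(quad_shading_efficiency(6, 4))        # thin sliver straddling 4 quads: ~38%
print(quad_shading_efficiency(4000, 1050))  # big triangle: ~95%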
 
So what sort of penalty does XBSX have in using CUs to draw triangles in something like Nanite versus PS5's higher clocked CUs and why? Or is that difference a little overstated? PC benchmarks should provide a nice indicator of how clock and width scale across architectures and highlight any obvious narrow+fast advantages.
Is "penalty" the right word? I think the wording of the discussion is problematic. I don't think Series X's approach is wrong. They balanced the machines differently, with different pros and cons, but overall both reach results that are very close.
 
Whatever the word, what loss in efficiency do more CUs at lower clocks actually suffer versus fewer CUs at higher clocks? Is there a very tangible difference behind Cerny's choice, or is it negligible in the real world? This isn't really about XBSX vs PS5 but about GPU design and where the best balance lies, yet AFAICS narrower-and-faster isn't a strategy any IHV is chasing. That might be because clock speed tops out, with power consumption skyrocketing beyond a certain point, so the only way to get more performance is to go wider, which in turn requires lower clocks.
 
Whatever the word, what loss in efficiency do more CUs at lower clocks actually suffer versus fewer CUs at higher clocks? Is there a very tangible difference behind Cerny's choice, or is it negligible in the real world? This isn't really about XBSX vs PS5 but about GPU design and where the best balance lies, yet AFAICS narrower-and-faster isn't a strategy any IHV is chasing. That might be because clock speed tops out, with power consumption skyrocketing beyond a certain point, so the only way to get more performance is to go wider, which in turn requires lower clocks.
This question was asked multiple times here and I don't think there is a general consensus on which approach is better. When we had this conversation here I quoted one of the AMD/Vulkan developers, who stated in one of his blogs that more CUs = better. That statement was then challenged by one of the posters quoting this article:
Vega 64 vs. Vega 56 Clock-for-Clock Shader Comparison (Gaming) | GamersNexus - Gaming PC Builds & Hardware Benchmarks

So having more CUs had no benefit at all in that test. So I sent an email to the AMD dev asking if he could comment on this and got this reply:

"Unfortunately I don't have time to investigate the linked article, but speaking in general, there are many reasons why a specific graphics workload may not scale in performance with the number of CUs. First, you would need to determine whether this shader is really a bottleneck. Maybe the application is CPU-bound or GPU spends most of the time waiting on some barriers or other synchronization. If the shader is really the bottleneck, then it can be limited by memory reads/writes or many other factors. Efficient utilization of cache memories is also important. There is also a question whether the draw call or compute dispatch has sufficient parameters to fill the entire GPU - in terms of workgroup size, number of workgroups launched etc. Occupancy can also be a limit and usually results from the number of vector registers used by a shader. Shaders are rarely limited by pure ALU floating-point computation throughput."

Perhaps somebody else has some new information on the subject, but I don't think we have any new evidence or data to decide which approach is better.
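
To put rough numbers on the occupancy point in that reply, here is the usual back-of-the-envelope. The register-file size and wave cap below are RDNA-ish ballpark figures used as assumptions, not exact values for either console.

def waves_per_simd(vgprs_per_wave, vgpr_file=1024, max_waves=16):
    # Occupancy = how many waves a SIMD can keep resident, capped by hardware
    # and by how many vector registers each wave of the shader needs.
    return min(max_waves, vgpr_file // max(vgprs_per_wave, 1))

for vgprs in (32, 64, 128, 256):
    print(f"{vgprs:>3} VGPRs per wave -> {waves_per_simd(vgprs)} resident waves per SIMD")

Fewer resident waves means fewer other threads to switch to while one is waiting on memory, which is the "limited by memory reads/writes" and latency-hiding angle from the quoted reply.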
 
Whatever word, what's the loss in efficiency more CUs at lower clocks actually experiences versus fewer CUs at higher clocks? Is there a very tangible difference behind Cerny's choice, or is it negligible in the real world? This isn't really about XBSX vs PS5 but GPU design and where the best balance lies, but AFAICS among IHVs, narrower and faster isn't a strategy any are chasing. that might be because faster peaks out at a certain width beyond which power consumption skyrockets, and the only way to get more performance then is wider requiring lower clocks.
It's all marketing.
There is no wide and slow vs fast and narrow design philosophy. All GPUs get wider and faster each generation, and performance has always been trending this way.

PS5 went with variable clocks, which has cost implications on yield: you pay more for chips that can all clock higher.

XSX was over the limit on costs, so they fixed the clock speed, which gets them better yields.

If XSX had variable clocks, no one would be discussing this narrative of fast and narrow vs wide and slow.
 
Woah, is that how it's done?? Devs have to manage CUs like CPU cores and load them up with work? I thought work was dispatched to the GPU and the schedulers allocated resources, distributed work around available CUs, interleaved work (async compute) etc. And that way, you can get the same PC code and run it on a 28 CU part or a 56 CU part and it'll distribute across available resources automatically. IIRC Naughty Dog used specific CU reservation on PS4 but then they could in closed hardware. Otherwise I thought you just throw work at the GPU and let it handle management, optimising the work you throw at it to make best use of its scheduling.

What level of GPU-resource-distribution management is there and what level is down to the developer's code?

As @iroboto says, it's not the developers who have direct control over the CUs, it's the GPU itself (I should have used a better choice of language). It's the GPU that manages that, but the workloads that developers give the GPU will affect utilisation - as you point out.

Whatever the word, what loss in efficiency do more CUs at lower clocks actually suffer versus fewer CUs at higher clocks? Is there a very tangible difference behind Cerny's choice, or is it negligible in the real world? This isn't really about XBSX vs PS5 but about GPU design and where the best balance lies, yet AFAICS narrower-and-faster isn't a strategy any IHV is chasing. That might be because clock speed tops out, with power consumption skyrocketing beyond a certain point, so the only way to get more performance is to go wider, which in turn requires lower clocks.

There's no inherent loss in efficiency at lower frequencies for a given GPU - if anything the reduction in memory access latency, measured in GPU cycles, might help a little bit.

What I think Cerny is alluding to is width relative to the stuff that supports the CUs. Both PS5 and SX have two shader engines, but the SX has more CUs per shader engine for more compute. PS5 has fewer CUs but runs at a higher frequency, so all else being equal its Graphics Command Processor and Asynchronous Compute Engines have fewer CUs to manage and so - I expect - will be a little better at keeping CUs busy (I expect they have a maximum rate at which they can dispatch and retire shaders). Likewise, if the rasteriser is the same between both, the PS5 will probably be better at issuing work for shitty small triangles through the 3D pipeline. I expect ROPs are also less likely to be a bottleneck on PS5, so long as you aren't main memory BW limited.

PC GPUs do use greater and greater width on high end models as you say, but they also scale up more than just ALUs. PS5 has 10 CUs per Shader Engine, as do RDNA1 Navi 10 and RDNA2 Navi 21 and 22 iirc. AMD actually dropped this to 8 per SE on RDNA 3 again iirc.

Series X has 14. I don't know if the GCP or ACEs or whatever were modified to account for this, but it does seem logical that under some circumstances it may be harder to keep those CUs as busy. You probably could have added a third shader engine, but this would have cost quite a bit of silicon and power and probably have had less benefit than increasing the width of the two shader engines.

It will be interesting to see how UE5 progresses on the SX. Micropolys there are software rasterised, which hopefully will make good use of all that compute.
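
A crude way to put a number on that "support hardware per CU" argument, under the heavy assumption that the per-shader-engine front end does the same fixed amount of issue work per clock on both machines (which we don't actually know, and which the post above already flags as uncertain). The 1.0 unit is a placeholder; only the ratio means anything.

def frontend_per_cu(shader_engines, clock_ghz, cu_count, issue_per_se_per_clock=1.0):
    # Placeholder units: fixed-function issue capacity per clock, spread over the CUs.
    return shader_engines * issue_per_se_per_clock * clock_ghz / cu_count

ps5 = frontend_per_cu(2, 2.23, 36)
xsx = frontend_per_cu(2, 1.825, 52)
print(f"relative front-end capacity per CU, PS5 vs XSX: {ps5 / xsx:.2f}x")

If Microsoft beefed up the command processor, rasterisers or ACEs to match the extra CUs, this ratio shrinks accordingly, so treat it as an upper bound on the effect rather than a measurement.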
 
There is no wide and slow vs fast and narrow design philosophy.
It's a pretty well-investigated tradeoff in the server/datacenter world where power is a significant running cost. For the same performance, wide and slow is dramatically more power efficient than narrow and fast, but costs more. So it poses a really interesting tradeoff between fixed Si cost vs. variable runtime cost, and from what I've seen DC customers are more than happy to pay for Si in order to offset power costs.

In the client space it's not power but thermals that's a bigger issue, so instead of fixed-vs-variable cost tradeoff it's a fixed(Si)-vs-fixed(thermal design) cost tradeoff.

Impact of clock speed on yield is weird. Wide and slow is slower, sure, but you also have more Si area to clock so the probability of running into issues goes up despite demanding less from the Si. It all depends on how close the two options are to the design targets. But I agree with you that based on the evidence it does look like the PS5 was up there and then decided to turn on the afterburners (possibly as a reactionary measure, though we'll never know). Variable clocks in a fixed-spec design are ugh. I nearly spit coffee on my keyboard when Cerny declared that he would always prefer narrow-and-fast (and variable!) to wide-and-slow -- although he did have a solid point about the front end. All that said, it all seems to have worked out for the PS5 in the end as a well-rounded design that devs can extract good performance out of. Same with the XBSX for the most part (sans non-technical dev priority issues), despite its uglier heterogeneous memory system.
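
The classic rule of thumb behind that datacenter tradeoff, with purely illustrative voltages: dynamic power scales roughly with units x V^2 x f, and holding a higher clock generally needs a higher voltage, so matching throughput by going wider and slower costs silicon but saves a lot of power.

def dynamic_power(units, clock, voltage):
    # CMOS rule of thumb: P ~ units * C * V^2 * f (capacitance constant dropped).
    return units * voltage**2 * clock

# Same nominal throughput (units * clock); voltages are invented for illustration.
wide_slow   = dynamic_power(units=2.0, clock=1.0, voltage=0.9)
narrow_fast = dynamic_power(units=1.0, clock=2.0, voltage=1.2)
print(f"wide+slow: {wide_slow:.2f}  narrow+fast: {narrow_fast:.2f}  (same throughput)")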
 