PlayStation 4 (codename Orbis) technical hardware investigation (news and rumours)

I'd say a better analogy is: you know what your friend's position is and the main things people in that position do. You ask him what he did at work and he says "mostly data entry". You know that normally those in his position do more, so why assume he only does data entry? Because that's all he told you when asked?

Therein lies the flaw in your reasoning - we have no rumours or other basis for believing that the PS4 audio hardware does more than the PS3's did (stream decoding) OR more than what Cerny has stated on more than one occasion when asked about its capabilities.

So sticking to the analogy, we don't know what our friend's position is, you're assuming he's the CEO when all he's said is he does data entry.

After all, if you asked someone what he did for work and he said "data entry", you would assume he's in a data entry role.
 
And that's precisely what it is - you will see less and less benefit from using the additional CUs for rendering.



That's just it, the point of diminishing returns for PS4 is the 14 CU mark (at least for the current AAA engines which AMD must have used to derive their numbers). Hence the 14+4 recommendation to devs.

Highly dependent on the scenario, incredibly disingenuous to generalise like that.
 
Actually change that. Not 16 queues for XBONE nor 64 queues for PS4.

12 CUs / 24 queues (XBONE): 16 compute + 8 graphics = 2 queues per CU
18 CUs / 72 queues (PS4): 64 compute + 8 graphics = 4 queues per CU

That should put the queue argument to rest, and it seems that it is no coincidence that they both line up perfectly for the amount of CU's they have.
No.

The PS4 has 64 compute queues (8x8) and two graphics queues, one of which is usable by the game (the second high priority/VSHELL one is apparently reserved for the OS). The diagram of the PS4 setup actually shows this very clearly (the queues are named "rings" in there; a ring buffer is a common implementation of a queue).

For the XB1 it is not that clear from the vgleaks docs; it's never explicitly mentioned. The single block for compute commands could be either a single ACE with a single queue, adhering to the GCN 1.0 design, or a single MEC (compute microengine) consisting of 4 compute pipes, each with 8 queues, as in the GCN 1.1 iteration (so 32 queues in total). The PS4 has two MECs (Kaveri will also have two), but Kabini and Bonaire, for instance, have just a single MEC, meaning basically 4 compute pipes (ACEs) and 32 queues in total. Bonaire with 14 CUs and 32 queues, or Kabini with just 2 CUs and 32 queues, also shows that there doesn't need to be some fixed ratio of queues to CUs.

If one looks at the PS4 diagram at vgleaks, one basically sees that all lines coming from the different pipes unite in the center and are then spread out to several pipes/wave controllers/wave dispatch units, with a lot of arbitration on the way, finally connecting to the shader arrays. This basically symbolizes a crossbar between all the pipes and the actual scheduling of wavefronts to the CUs. It means that each single pipe can use the complete set of CUs if necessary (if the dev wishes so), but that with some tweaking of the arbitration (priorities and/or static splits) you can also get multiple shaders to share the shader array or even the same CU.

In that light, it makes no sense to say that x queues are assigned to a CU or anything in that direction. It is completely flexible and basically under developer control (at least if Sony opens this up for them, which I would expect). On the PC side one has no direct control over this yet; it's managed by the driver (the device fission extension of OpenCL will probably allow at least static splits at some point).

In principle, you can fire up 12 compute tasks in 8 queues (4 queues get two tasks which have a dependency on each other; that is actually not strictly necessary, as one can also fire up a dependent task in another queue and it will wait until the data is ready) while at the same time the normal draw calls are handled by the graphics pipe.

Now a bit of conjecture follows. If you just assign lower priorities to the compute tasks, the graphics will probably not be influenced too much (but a slight hit is possible, a larger one if the compute tasks issue very long-running wavefronts). The compute stuff mainly gets scheduled when the draw calls don't fill up the CUs completely. But that may make the runtime of the compute kernels a bit unpredictable. To improve that, the dev could also decide to set a certain number of CUs aside for the compute tasks, let's say 3 (to avoid yet another round of the 14+4 discussion). Now only the 12 compute kernels (a maximum of 8 running in parallel because of the 4 dependent ones) have to fight for the resources of the 3 CUs (one can let the hardware figure this out alone, maybe helped by setting different priorities for the different compute tasks), and their execution can't be blocked by the fluctuating load of the graphics shaders, which have 15 CUs to themselves. This makes the runtimes much more predictable.

I hope this split can be changed pretty quickly during different phases of the rendering. As said already, some tasks tend to get by with a low number of CUs (like shadow map rendering, iirc). When such a task is running, one could allot more CUs to compute tasks, but during the post-processing stage (which one can perfectly distribute over as many CUs as there possibly can be) all CUs are devoted to it and no compute gets scheduled at all, for example.
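For what it's worth, the "static split" idea above is roughly what the OpenCL 1.2 sub-device (device fission) API exposes on the PC side. Here's a minimal host-side sketch; the 15/3 counts are just the example split from above, and driver support is an assumption, since most GPU drivers don't currently implement CL_DEVICE_PARTITION_BY_COUNTS for GPU devices:

```cpp
// Illustration only: ask the driver to partition a GPU's compute units into a
// 15-CU slice for bulk work and a 3-CU slice for latency-sensitive compute.
// Whether a given driver honours this is NOT guaranteed.
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    // Split the device's CUs by explicit counts: 15 + 3 (the example split above).
    const cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_BY_COUNTS,
        15, 3,
        CL_DEVICE_PARTITION_BY_COUNTS_LIST_END,
        0
    };
    cl_device_id sub[2];
    cl_uint numSub = 0;
    if (clCreateSubDevices(device, props, 2, sub, &numSub) != CL_SUCCESS) {
        std::printf("Driver does not support this partitioning.\n");
        return 1;
    }

    // Each sub-device gets its own command queue; kernels submitted to the
    // 3-CU slice can no longer be starved by work running on the other 15 CUs.
    cl_int err = CL_SUCCESS;
    cl_context ctx = clCreateContext(nullptr, numSub, sub, nullptr, nullptr, &err);
    cl_command_queue bulkQueue     = clCreateCommandQueue(ctx, sub[0], 0, &err);
    cl_command_queue reservedQueue = clCreateCommandQueue(ctx, sub[1], 0, &err);

    // ... enqueue latency-sensitive compute kernels on reservedQueue here,
    // everything else on bulkQueue ...

    clReleaseCommandQueue(reservedQueue);
    clReleaseCommandQueue(bulkQueue);
    clReleaseContext(ctx);
    return 0;
}
```

On a console, the equivalent knob would presumably be exposed more directly (priorities and static splits per queue, as described above) rather than through sub-devices.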
 
And yet in comparison to 7000 series cards it's not ALU-to-memory-bandwidth heavy at all. So from that perspective you can run another screen-space / post-process / etc. shader that you otherwise wouldn't have had the time for.

I'm glad someone mentioned this. The PS4 keeps getting described as "ALU heavy" when in fact it's just about the lightest ALU design in the GCN series (roughly even with the 7850 when comparing against memory bandwidth and ROP throughput). The big glaring exception is the XB1, which is by far the ALU-lightest GCN design available (relative to memory bandwidth) by a wide margin once you account for the ESRAM bandwidth (and ignoring tablet designs).

So in actuality, far from having too many CU's for rendering you could argue that if the rest of the GCN series is anything to go by, 18 may not be quite enough to make full use of the available bandwidth and ROP throughput. On the other hand, Cerny still argues (probably rightfully so) that you will see a greater return from using a percentage of your ALU resources on GPGPU work as opposed to pure rendering.

This doesn't mean there is some sudden drop off in return once you go past using 14 CU's. It merely means that you'll see a greater overall return from a sensible split between rendering and GPGPU than going all out on one or the other.
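To put some rough numbers on that, here's a back-of-the-envelope comparison using the commonly reported 2013 figures; treat the exact clocks and the ESRAM number as assumptions rather than confirmed specs:

```cpp
// Back-of-the-envelope ALU vs. bandwidth comparison (all figures are the
// commonly reported/rumoured ones, not official measurements).
#include <cstdio>

struct Gpu { const char* name; double cus; double mhz; double gbps; };

int main() {
    // GFLOPS = CUs * 64 lanes * 2 ops (FMA) * clock (GHz)
    const Gpu gpus[] = {
        {"PS4",      18, 800, 176.0},         // GDDR5
        {"HD 7850",  16, 860, 153.6},         // GDDR5
        {"XB1",      12, 853, 68.3 + 109.0},  // DDR3 plus a conservative ESRAM figure
    };
    for (const Gpu& g : gpus) {
        const double gflops = g.cus * 64 * 2 * g.mhz / 1000.0;
        std::printf("%-8s %5.0f GFLOPS / %5.1f GB/s  =>  %.1f FLOPs per byte of bandwidth\n",
                    g.name, gflops, g.gbps, gflops / g.gbps);
    }
    return 0;
}
// Roughly: PS4 ~10.5, HD 7850 ~11.5, XB1 ~7.4 -- i.e. the PS4 sits right next to
// the 7850 on ALU-per-bandwidth, and the XB1 (counting ESRAM) is the lightest.
```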
 
So in actuality, far from having too many CU's for rendering you could argue that if the rest of the GCN series is anything to go by, 18 may not be quite enough to make full use of the available bandwidth and ROP throughput. On the other hand, Cerny still argues (probably rightfully so) that you will see a greater return from using a percentage of your ALU resources on GPGPU work as opposed to pure rendering.

Then again, the CPU isn't exactly very powerful either, which complicates any comparison to a desktop GPU where most sites will benchmark with a significantly more powerful CPU. And even then, it isn't uncommon to be CPU limited at times.

So, within the GPU design itself, it may not be excessive, but when combined with a relatively low power CPU, it may be more difficult to get full use out of all 18 CUs. With that in consideration, it's likely a good idea in many cases to use GPU compute to reduce the load on your CPU, which then boosts your GPU usage numbers.
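As a concrete (entirely hypothetical) illustration of the kind of per-element CPU work that could be pushed onto spare CUs, here's a trivial particle update written both as a CPU loop and as the equivalent OpenCL-style compute kernel; the names and the workload are made up for the sake of the example, not taken from any actual engine:

```cpp
// Hypothetical example: the same per-particle update as CPU code and as a
// compute kernel, so the CPU time it used to cost can be spent elsewhere.
#include <cstddef>

struct Particle { float px, py, pz, vx, vy, vz; };

// CPU version: burns CPU time every frame, scaling with particle count.
void update_particles_cpu(Particle* p, std::size_t n, float dt) {
    for (std::size_t i = 0; i < n; ++i) {
        p[i].vy -= 9.81f * dt;   // gravity
        p[i].px += p[i].vx * dt;
        p[i].py += p[i].vy * dt;
        p[i].pz += p[i].vz * dt;
    }
}

// GPU version: one work-item per particle, enqueued on a compute queue so it
// runs alongside (or on CUs set aside from) the graphics workload.
static const char* kUpdateParticlesKernel = R"CLC(
typedef struct { float px, py, pz, vx, vy, vz; } Particle;
__kernel void update_particles(__global Particle* p, float dt) {
    size_t i = get_global_id(0);
    p[i].vy -= 9.81f * dt;
    p[i].px += p[i].vx * dt;
    p[i].py += p[i].vy * dt;
    p[i].pz += p[i].vz * dt;
}
)CLC";
```

Whether that actually wins depends on synchronisation costs and on how contended the CUs are, which is exactly the trade-off being discussed here.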

Regards,
SB
 
Then again, the CPU isn't exactly very powerful either, which complicates any comparison to a desktop GPU where most sites will benchmark with a significantly more powerful CPU. And even then, it isn't uncommon to be CPU limited at times.

So, within the GPU design itself, it may not be excessive, but when combined with a relatively low power CPU, it may be more difficult to get full use out of all 18 CUs. With that in consideration, it's likely a good idea in many cases to use GPU compute to reduce the load on your CPU, which then boosts your GPU usage numbers.

Regards,
SB


Well, again though, isn't all this relative? I can't imagine that the Cell was a performance monster in most real-world situations compared to top-of-the-line desktop CPUs in 2006. Obviously it was stellar in a few areas, even amazingly so.

Given all of the 'helper' modules, what does the CPU really need to do outside of AI and physics tasks (and couldn't those be offloaded as well?) that would really suck up its performance? There's audio as well, but I remember hearing that the biggest resource investment is in compression/decompression, which is handled by the onboard audio DSP.
 
And that's precisely what it is - you will see less and less benefit from using the additional CUs for rendering.



That's just it, the point of diminishing returns for PS4 is the 14 CU mark (at least for the current AAA engines which AMD must have used to derive their numbers). Hence the 14+4 recommendation to devs.



Wasn't the 14+4 thing debunked by DF themselves long ago?

Also, if the PS4 uses 400 GFLOPS for compute in a multiplatform game, doesn't that game require the same on the other platform to run equally?
 
Wasn't the 14+4 thing debunked by DF themselves long ago?

Also, if the PS4 uses 400 GFLOPS for compute in a multiplatform game, doesn't that game require the same on the other platform to run equally?

Depends on what you mean by equally; the additional compute could be used for advanced physics, for example. The version that lacks the additional physics isn't handicapped per se, it just lacks some of the polish afforded to the other version.

Besides, if cloud compute turns out to have any validity, perhaps some of that additional compute can be handled remotely.
 
Wasn't the 14+4 thing debunked by DF themselves long ago?

It was debunked by Mark Cerny back in April during his interview with Gamasutra:

Mark Cerny said:
"There are many, many ways to control how the resources within the GPU are allocated between graphics and compute. Of course, what you can do, and what most launch titles will do, is allocate all of the resources to graphics. And that’s perfectly fine, that's great. It's just that the vision is that by the middle of the console lifecycle, that there's a bit more going on with compute."

Also, if the PS4 uses 400 GFLOPS for compute in a multiplatform game, doesn't that game require the same on the other platform to run equally?
I think there are two likely outcomes for multi-platform games:
  • The additional six compute units in the PS4 will largely go unused, other than possibly manifesting as slightly higher frame rates / native rendering resolutions.
  • Critical game compute, such as collision detection and other mechanics affecting gameplay, will be balanced for both consoles. If the middleware or engine supports it, you may see additional non-gameplay stuff like extra particles in effects or more fragments in explosions; stuff that doesn't affect the game in any way.
 
Highly dependent on the scenario, incredibly disingenuous to generalise like that.

Well that's what AMD and Sony did since those are not my words ;)

No.

The PS4 has 64 compute queues (8x8) and two graphics queues, one of which is usable by the game (the second high priority/VSHELL one is apparently reserved for the OS).

VShell seems to be the internal name for the OS, or at least I've heard VShell used offhand as if it were the name.

Wasn't the 14+4 thing debunked by DF themselves long ago?
What's been debunked is speculation that the 14+4 thing is due to hardware differences between the CUs - all CUs are identical.

Also, if the PS4 uses 400 GFLOPS for compute in a multiplatform game, doesn't that game require the same on the other platform to run equally?
Yup, though ESRAM should give a significant boost to compute on XB1.
But you would still expect the XB1 game to be using say 9+3 if the PS4 version is 14+4
 
Well that's what AMD and Sony did since those are not my words ;)



VShell seems to be the internal name for the OS, or at least I've heard VShell used offhand as if it were the name.


What's been debunked is speculation that the 14+4 thing is due to hardware differences between the CUs - all CUs are identical.


Yup, though ESRAM should give a significant boost to compute on XB1.
But you would still expect the XB1 game to be using say 9+3 if the PS4 version is 14+4

Well then all modern GPUs AMD makes are horrible for games, I guess. Are you seriously trying to say that? As mentioned in a previous post, the PS4 GPU is lower on ALU/memory than a lot of the high-end desktop cards, so you're either saying that AMD doesn't know how to make their GPUs properly, or you are wrong.

Also, are you working off vgleaks or do you actually know?

Also, I fail to see how the PS4 can be so limited when it has proportionally the same or more resources for nigh on everything compared to the XBONE, about which no one has mentioned such a limitation.
 
Well then all modern GPUs AMD makes are horrible for games, I guess. Are you seriously trying to say that? As mentioned in a previous post, the PS4 GPU is lower on ALU/memory than a lot of the high-end desktop cards, so you're either saying that AMD doesn't know how to make their GPUs properly, or you are wrong.

Also, are you working off vgleaks or do you actually know?

Also, I fail to see how the PS4 can be so limited when it has proportionally the same or more resources for nigh on everything compared to the XBONE, about which no one has mentioned such a limitation.

It also has the exact same CPU yet, as we've been told, 40% more GPU flops... HMMMM

Could be weird internal limiters too besides the obvious biggies of CPU/bandwidth, such as cache/registers...

Also, I believe there's been some discussion recently to the effect that Southern Islands starts to scale poorly as you ramp up the CUs on PC, although I haven't personally followed that line of discussion too closely...

The 14+4 thing is pretty well known, and you can take Cerny's "not entirely round" statement as more or less a public confirmation.

I'm not so sure what it really means in practice; I'd be a bit surprised if the PS4 really starts suffering dramatically diminished returns when using more than 14 CUs for GFX. But it's possible.
 
It also has the exact same CPU yet, as we've been told, 40% more GPU flops... HMMMM

Could be weird internal limiters too besides the obvious biggies of CPU/bandwidth, such as cache/registers...

Also, I believe there's been some discussion recently to the effect that Southern Islands starts to scale poorly as you ramp up the CUs on PC, although I haven't personally followed that line of discussion too closely...

The 14+4 thing is pretty well known, and you can take Cerny's "not entirely round" statement as more or less a public confirmation.

I'm not so sure what it really means in practice; I'd be a bit surprised if the PS4 really starts suffering dramatically diminished returns when using more than 14 CUs for GFX. But it's possible.

Kicking off extra shaders to 50% more CUs shouldn't really burden the CPU too much IMO; whilst there will be a difference, it's not going to be so drastic that you can't use those CUs for graphics. You'd still need to kick off work for compute as well.

Cache and registers scale linearly with the number of CUs (at least L1 does), and the other GCN chips contain many more CUs with the same amount of L2 cache, so that shouldn't be an issue either.

I honestly doubt it starts to scale badly for no reason; there are desktop cards that have nearly twice the CUs but nowhere near twice the bandwidth. There might be reasons for this on the desktop, where you have to structure your code and game for a large number of platforms, but they would mostly disappear with a fixed platform.

The 14+4 thing has only ever been mentioned by vgleaks, with zero technical explanation, and until someone can provide one I see no reason to believe it was anything but a suggestion/scenario of what you can do with the hardware.
 
Maybe Aristotle said it best: “The whole is greater than the sum of its parts.”

Assumptions coming up.

The CPU has to feed the GPU and do compute jobs. Now, if for some reason the GPU is going idle because the CPU is unable to both feed the GPU and do its "own" compute jobs, then reassigning a couple of CUs might give you a better overall end product. I.e. instead of 18 CUs being idle 25% of the time, 15 are running at 100% with 3 offloading the CPU with GPGPU work. In this scenario, the CPU is able to feed the GPU compute and graphics work and still do its own compute work. It could be as simple as the difference between fluctuating FPS and rock-solid FPS, whether that's 60 or 30 vs 60-45 and 30-20.

Then in my book the GPGPU work is golden :D

This would work exactly the same for the X1; the only difference is they might not have 3 CUs to retask, but maybe 2 since they have fewer CUs in total, or they might have more due to ESRAM.
Then again, this is all based on what workloads you need to process. Does anybody think that the games coming out in the first 12 months will use 100% of the CPU/GPU?
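A quick bit of arithmetic on that scenario, using only the numbers assumed above (18 CUs, 25% idle, vs a 15+3 split); it's a toy model, not a measurement:

```cpp
// Toy arithmetic for the scenario above: GPU starved by the CPU 25% of the
// time vs. 15 fully fed graphics CUs plus 3 CUs doing GPGPU work for the CPU.
#include <cstdio>

int main() {
    const double total_cus = 18.0;
    const double idle_fraction = 0.25;   // assumption from the scenario above

    const double a_graphics = total_cus * (1.0 - idle_fraction);  // 13.5 CU-equivalents
    const double b_graphics = 15.0;                               // kept fully fed
    const double b_compute  = 3.0;                                // offloads the CPU

    std::printf("A: %.1f CU-equivalents of graphics, 0 of compute\n", a_graphics);
    std::printf("B: %.1f CU-equivalents of graphics, %.1f of compute\n",
                b_graphics, b_compute);
    return 0;
}
// If the CPU really is what's starving the GPU, the 15+3 split does more
// graphics work (15 > 13.5) AND frees up CPU time -- which is the whole point.
```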
 
Maybe Aristotle said it best: “The whole is greater than the sum of its parts.”

Assumptions coming up.

The CPU has to feed the GPU and do compute jobs. Now, if for some reason the GPU is going idle because the CPU is unable to both feed the GPU and do its "own" compute jobs, then reassigning a couple of CUs might give you a better overall end product. I.e. instead of 18 CUs being idle 25% of the time, 15 are running at 100% with 3 offloading the CPU with GPGPU work. In this scenario, the CPU is able to feed the GPU compute and graphics work and still do its own compute work. It could be as simple as the difference between fluctuating FPS and rock-solid FPS, whether that's 60 or 30 vs 60-45 and 30-20.

Then in my book the GPGPU work is golden :D

This would work exactly the same for the X1; the only difference is they might not have 3 CUs to retask, but maybe 2 since they have fewer CUs in total.
Then again, this is all based on what workloads you need to process. Does anybody think that the games coming out in the first 12 months will use 100% of the CPU/GPU?

My understanding was that nowadays feeding jobs to the GPU, even more so with an 8-core CPU, is a trivial task. This shouldn't really be a limitation and, either way, a lot of graphics work could easily run over every single CU in the GPU.
 
I suspect that in the end we will find that the overall "14+4 CU" or "hardware balanced at 14 CU" thing will depend on the CPU.

The underpowered CPU could probably act as a bottleneck for the system.
And for this reason the CPU & GPU relation will probably be balanced at 14 CUs for rendering tasks.
Could it depend on CPU clock speed? Onion & Onion+ low bandwidth? More technical stuff?

Maybe this is also the reason why MS, with a similar CPU, chose a 12 CU GPU.

I also have my personal theory: as I have said many times, I suspect that Sony wanted from the beginning to reserve 4 CUs in order to compensate for the weak CPU (because, as it seems, the PS4 lacks some of the X1's dedicated hardware that is intended to help the CPU).
 
My understanding was that nowadays feeding jobs to the GPU, even more so with an 8-core CPU, is a trivial task. This shouldn't really be a limitation and, either way, a lot of graphics work could easily run over every single CU in the GPU.

Well, it was an assumption, but if the CPU compute jobs basically amount to "sleep for 29 ms", there might not be enough time/resources left over to feed the GPU :D
Not that this will happen, but if some new über technique shows up that hogs the entire CPU for way too long and cannot be done with GPGPU, then we might end up in this situation. Yes, it's speculation, but not a wildly impossible scenario.
 
...do you really think AMD would have replicated the ACEs/front-end just for 4 CUs? No, it is clearly not possible, as it would mean it's impossible to rebalance work.

Do you really think that a GPU front-end will schedule compute tasks by cherry-picking the 4 CUs, instead of just looking for available resources and pushing work there, especially when more work items are needed to keep the CUs 100% busy?

The 14+4 thing is pure madness to me. You have your queues, your data flowing through them, and your shaders executed and scheduled with some dynamic priority.
 
Well, it was an assumption, but if the CPU compute jobs basically amount to "sleep for 29 ms", there might not be enough time/resources left over to feed the GPU :D
Not that this will happen, but if some new über technique shows up that hogs the entire CPU for way too long and cannot be done with GPGPU, then we might end up in this situation. Yes, it's speculation, but not a wildly impossible scenario.

I suspect that in the end we will find that the overall "14+4 CU" or "hardware balanced at 14 CU" thing will depend on the CPU.

The underpowered CPU could probably act as a bottleneck for the system.
And for this reason the CPU & GPU relation will probably be balanced at 14 CUs for rendering tasks.
Could it depend on CPU clock speed? Onion & Onion+ low bandwidth? More technical stuff?

Maybe this is also the reason why MS, with a similar CPU, chose a 12 CU GPU.

I also have my personal theory: as I have said many times, I suspect that Sony wanted from the beginning to reserve 4 CUs in order to compensate for the weak CPU (because, as it seems, the PS4 lacks some of the X1's dedicated hardware that is intended to help the CPU).

I think that you will both find, as will everyone else in this thread and all the developers working on both next-gen consoles, that how you partition the system and how you use its resources is highly dependent on what you are trying to achieve and how you are trying to achieve it. There is no blanket case you can state for all engines that says the PS4 will not be able to use all 18 of the CUs for graphics.
 
I suspect that in the end we will find that the overall "14+4 CU" or "hardware balanced at 14 CU" thing will depend on the CPU.

The underpowered CPU could probably act as a bottleneck for the system.
And for this reason the CPU & GPU relation will probably be balanced at 14 CUs for rendering tasks.
Could it depend on CPU clock speed? Onion & Onion+ low bandwidth? More technical stuff?

Maybe this is also the reason why MS, with a similar CPU, chose a 12 CU GPU.

I also have my personal theory: as I have said many times, I suspect that Sony wanted from the beginning to reserve 4 CUs in order to compensate for the weak CPU (because, as it seems, the PS4 lacks some of the X1's dedicated hardware that is intended to help the CPU).

It all makes sense now. Let me put on my goggles. The XB1 is perfectly balanced (DF regurgitated that, right?). The PS4 has extra CUs for compute to make up for its weak CPU and its lack of the dedicated hardware the XB1 has. It is all coming together now: 14+4, Garlic, Onion, it is all a cryptic code... we finally have the decoder. (Why didn't Sony just put in more CPU cores or dedicated DSPs? Certainly they are cheaper and smaller than CUs.)

I have an insane theory as to why MS chose 12 CUs: they don't have the memory bandwidth to feed more. They 'balanced' the system and shifted BOM resources to Kinect. Crazy, I know.

:D
 