PlayStation 4 (codename Orbis) technical hardware investigation (news and rumours)

Metal_Spirit · Jan 22, 2015

Shifty Geezer said:
There's an image from VGLeaks that shows clearly a second graphics command processor, but it's named as a system component for use by the OS.

I saw the diagram. And VGLeaks is, so far, a reputable source.
But there is no such thing as hardware reserved for the OS. If the limitation exists it is placed by software, and that can change!
Currently the second Xbox GCP is also for the OS only, and there are plans to open it up in 2016 (at least that was stated in that same thread).

Regardless, the use of something like SPURS implemented on the PS4 hardware seems to address the same kind of optimization.

Shifty Geezer · Jan 22, 2015

Metal_Spirit said:
I saw the diagram. And VGLeaks is, so far, a reputable source.

So why doubt the existence of it then? :???:

Metal_Spirit · Jan 22, 2015

Shifty Geezer said:
So why doubt the existence of it then?

Not doubting it. But since it is currently reserved for the OS, I wanted to talk about the SPURS-like optimization that will be working on the other ring, and how it compares to the use of a second GCP.

Shifty Geezer · Jan 22, 2015

No-one's even determined what exactly the second GCP accomplishes, so we can't much debate it!

I wanted to talk about the SPURS-like optimization that will be working on the other ring

The 'SPURS type optimization' is asynchronous compute via the ACEs, not GCP. If you think the second GCP will result in better ALU utilisation, compute achieves the same thing.

iroboto · Jan 22, 2015

I wish I knew more about PS4 to really talk about it. The only thing the VGLeaks diagram shows that we have yet to confirm on the XBO side is that little bubble that says "arbitration". I assume that if both GCPs are active on PS4, there is some arbitration over which GCP is given control of the resources.

bgroovy · Jan 22, 2015

The VGLeaks documents imply to me that the OS GCP always has priority, but during games it is typically idle or using a negligible amount of resources for notifications, etc. There isn't even a credible theory as to why having access to two command processors would offer any meaningful advantage to games, whereas the use for the OS is obvious, so it's not really worth so much speculation.

iroboto · Jan 22, 2015

bgroovy said:
The VGLeaks documents imply to me that the OS GCP always has priority, but during games it is typically idle or using a negligible amount of resources for notifications, etc. There isn't even a credible theory as to why having access to two command processors would offer any meaningful advantage to games, whereas the use for the OS is obvious, so it's not really worth so much speculation.

It's certainly a wait-and-see situation. We have one poster on the X1 thread claiming inside sources whispering of what it may become, which is fine, but it has not happened yet and may never happen, so at this moment the second GCP means nothing for performance.

Metal_Spirit · Jan 22, 2015

bgroovy said:
The VGLeaks documents imply to me that the OS GCP always has priority, but during games it is typically idle or using a negligible amount of resources for notifications, etc. There isn't even a credible theory as to why having access to two command processors would offer any meaningful advantage to games, whereas the use for the OS is obvious, so it's not really worth so much speculation.

Well, if the GCP is there, and if there are future advantages to using it in games as well, it can, and certainly will, be used!

But Christophe Riccio's explanation shows very well the advantages of using multiple command processors:

I'll repeat them!

"There is a lot of room for tasks parallelism in a GPU but the idea of submitting draws from multiple threads in parallel simply doesn’t make any sense from the GPU architectures at this point. Everything will need to be serialized at some point and if applications don’t do it, the driver will have to do it. This is true until GPU architectures add support for multiple command processors which is not unrealistic in the future. For example, having multiple command processors would allow rendering shadows at the same time as filling G-Buffers or shading the previous frame. Having such drastically different tasks live on the GPU at the same time could make a better usage of the GPU as both tasks will probably have different hardware bottleneck."

Since Xbox One only has 2 ACE units and no confirmed or known arbitration, the second GCP can be a great help in optimizing pipeline usage!

Unfortunately, I doubt very much that Microsoft can do without the second GCP for the OS, so availability for games, if it happens, will be limited, and maybe that explains Sony's decision to choose 8 ACE units and add the arbitration given by the 'SPURS type optimization'.
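
To make the "different hardware bottleneck" point in the Riccio quote concrete, here is a toy model, not a description of any real GPU. It assumes each task needs some amount of time on each hardware resource, that a task run alone takes as long as its busiest resource, and that two overlapped tasks share resources perfectly. The resource names and the numbers are made up for illustration.

```python
def serial_time(tasks):
    """Run tasks one after another: each takes as long as its own bottleneck."""
    return sum(max(task.values()) for task in tasks)


def overlapped_time(tasks):
    """Idealised overlap: total time is set by the most heavily loaded resource."""
    resources = set().union(*(task.keys() for task in tasks))
    return max(sum(task.get(r, 0) for task in tasks) for r in resources)


# Hypothetical per-resource costs (arbitrary time units).
shadows = {"raster": 8, "alu": 2}   # assumed to be fixed-function bound
shading = {"raster": 3, "alu": 8}   # assumed to be ALU bound

print(serial_time([shadows, shading]))      # 16
print(overlapped_time([shadows, shading]))  # 11
```

If both tasks were limited by the same resource, the overlapped figure would be no better than the serial one, which is why the quote stresses "drastically different tasks".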

Deleted member 11852 · Jan 22, 2015

iroboto said:
I wish I knew more about PS4 to really talk about it.

The only thing you need to know about PlayStation 4 is that Sony thought this was important enough to think up, design, have manufactured specially then placed inside where most people will never see it.

You can't rationalise with people who think like this. Don't even try to understand the PS4 :nope:

3dilettante · Jan 22, 2015

Metal_Spirit said:
Since Xbox One only has 2 ACE units and no confirmed or known arbitration, the second GCP can be a great help in optimizing pipeline usage!

ACEs by default are capable of arbitration and synchronization. It's a big part of what they were created for, and AMD has been harping on that compute train for years.

I'd question more how well the graphics pipeline can handle this, since it has to carry the domain-specific and fixed-function portions, and many of those design elements have changed very slowly and have exhibited some of the poorest behavior when it comes to modernizing the hardware, like virtualization and QoS.

Shifty Geezer · Jan 22, 2015

Metal_Spirit said:
...explains Sony's decision to choose 8 ACE units and add the arbitration given by the 'SPURS type optimization'.

To be clear, this 'SPURS type optimisation' is what GCN does anyway via ACEs. Sony just felt more and smaller jobs would (possibly) be the way things go, so they put in more ACEs, but it's nothing particularly SPURS or Sony AFAICS. As Cerny even admits, it's more a case of giving devs options and seeing where they go. 8 may be ridiculous overkill, and AMD's analysis and choice of 2 as standard could be all devs need and use.

3dilettante · Jan 22, 2015

If eight is that much of an overkill situation, it was for whatever reason shared with Hawaii and Kaveri. Hawaii has more GPU resources to allocate work to, but that's not a situation Kaveri has relative to Orbis.

There is some oddness when it comes to a high-level overview of Durango's ACEs, which have the enhanced number of queues per hardware pipeline, but it diverges from how CI ACEs are in groups of four.

Rurouni · Jan 23, 2015

Can one Kaveri ACE manage 8 queues like PS4/X1, or just 1? It has 8 ACEs, but I can't find any information about how many queues a Kaveri ACE can manage.

3dilettante · Jan 23, 2015

Rurouni said:
Can one Kaveri ACE manage 8 queues like PS4/X1, or just 1? It has 8 ACEs, but I can't find any information about how many queues a Kaveri ACE can manage.

Various websites cited 8 ACEs and some further described 8 queues per ACE. The following shows how the queues are distributed for enumeration:

http://cgit.freedesktop.org/~airlied/linux/tree/drivers/gpu/drm/radeon/cik.c?h=drm-next

/*
* KV: 2 MEC, 4 Pipes/MEC, 8 Queues/Pipe - 64 Queues total
* CI/KB: 1 MEC, 4 Pipes/MEC, 8 Queues/Pipe - 32 Queues total
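
The totals in that comment are just the product of the three counts. A minimal sketch of the arithmetic follows; the function name is made up for illustration and is not taken from the driver.

```python
def total_queues(mecs, pipes_per_mec, queues_per_pipe):
    """Total compute queues = MECs x pipes per MEC x queues per pipe."""
    return mecs * pipes_per_mec * queues_per_pipe


# Figures copied from the cik.c comment above.
kv_queues = total_queues(mecs=2, pipes_per_mec=4, queues_per_pipe=8)
ci_kb_queues = total_queues(mecs=1, pipes_per_mec=4, queues_per_pipe=8)

print(kv_queues)     # 64
print(ci_kb_queues)  # 32
```

Read this way, KV enumerates 2 x 4 = 8 pipes, which lines up with the "8 ACEs, 8 queues per ACE" description if each pipe is counted as one ACE.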

Rurouni · Jan 23, 2015

All the more reason for compute to be used more often, so the iGPU on my Kaveri doesn't become dead weight when I eventually buy a dGPU.
So the question would be whether the ACE count on PS4 is a bit too much (vs X1) or perhaps too little (vs Kaveri).

3dilettante · Jan 23, 2015

GPU-wise, Kaveri is in many ways inferior to both consoles. Kaveri has a similar compute front end loadout coordinating a very inferior amount of compute and vastly poorer bandwidth. As to whether or not it would become dead weight, Kaveri is not compliant with the release specification of HSA, which may put a damper on any long-term future if HSA is taken seriously.
Carrizo fixes certain shortcomings in Kaveri (possibly in the consoles, with some kludgy hacks that might fix things partially) that have received some harsh criticism in discussions about Kaveri's Linux kernel driver.

Kaveri has a number of strong similarities to Orbis, and given the rumored abandoned early version of a Steamroller-based PS4 APU, perhaps AMD decided to make some temporary use of the abandoned direction.

chris1515 · Jan 30, 2015

Shifty Geezer said:
To be clear, this 'SPURS type optimisation' is what GCN does anyway via ACEs. Sony just felt more and smaller jobs would (possibly) be the way things go, so they put in more ACEs, but it's nothing particularly SPURS or Sony AFAICS. As Cerny even admits, it's more a case of giving devs options and seeing where they go. 8 may be ridiculous overkill, and AMD's analysis and choice of 2 as standard could be all devs need and use.

I don't think AMD and Sony would be stupid enough to include 8 ACEs without any chance of using them (7 for games and 1 for the OS). If they are not useful, why not use only 2 ACEs and improve other parts of the GPU? Async compute is useful when the graphics pipeline doesn't use all the ALUs or is stalled, because the graphics pipeline is synchronous and some tasks wait for another task to complete.

My understanding, from reading devs like sebbbi on B3D and graphics slides from GDC and other conferences, is that the limit of async compute is that it can thrash the L2 cache used by graphics tasks, for example when async compute makes it necessary to flush L2 data that graphics work is using. The 'volatile' bit helps with that, if I understand correctly what Cerny said in the Gamasutra interview:

"Next, to support the case where you want to use the GPU L2 cache simultaneously for both graphics processing and asynchronous compute, we have added a bit in the tags of the cache lines, we call it the 'volatile' bit. You can then selectively mark all accesses by compute as 'volatile,' and when it's time for compute to read from system memory, it can invalidate, selectively, the lines it uses in the L2. When it comes time to write back the results, it can write back selectively the lines that it uses. This innovation allows compute to use the GPU L2 cache and perform the required operations without significantly impacting the graphics operations going on at the same time -- in other words, it radically reduces the overhead of running compute and graphics together on the GPU."

Maybe without all the overhead they can schedule more compute tasks and this is the reason for the choice of 8 ACEs, or maybe I didn't understand anything.

Another thing I understand is that compute shaders will replace some vertex and pixel work for graphics in the years to come because they are more efficient (60 to 70% of graphics will use compute). And compute can be useful for non-graphics tasks if they are parallelisable and work on big datasets, at least 64 data items at a time, if I remember the Ubisoft presentation about compute tasks correctly...
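
The "64 at a time" figure matches the GCN wavefront width of 64 work-items. A rough sketch of why small datasets waste lanes, assuming work is simply packed into 64-wide wavefronts and ignoring everything else about scheduling (the function names are illustrative only):

```python
import math

WAVEFRONT_SIZE = 64  # work-items per GCN wavefront


def wavefronts_needed(work_items):
    """Number of 64-wide wavefronts required to cover the work-items."""
    return math.ceil(work_items / WAVEFRONT_SIZE)


def lane_utilisation(work_items):
    """Fraction of launched lanes that actually carry a work-item."""
    if work_items <= 0:
        return 0.0
    return work_items / (wavefronts_needed(work_items) * WAVEFRONT_SIZE)


print(lane_utilisation(16))    # 0.25
print(lane_utilisation(64))    # 1.0
print(lane_utilisation(65))    # 0.5078125
print(lane_utilisation(4096))  # 1.0
```

A job with 16 items still occupies a whole wavefront, so three quarters of the lanes sit idle; large datasets amortise that away.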

iroboto · Jan 31, 2015

chris1515 said:
I don't think AMD and Sony would be stupid enough to include 8 ACEs without any chance of using them (7 for games and 1 for the OS). If they are not useful, why not use only 2 ACEs and improve other parts of the GPU? Async compute is useful when the graphics pipeline doesn't use all the ALUs or is stalled, because the graphics pipeline is synchronous and some tasks wait for another task to complete.

My understanding, from reading devs like sebbbi on B3D and graphics slides from GDC and other conferences, is that the limit of async compute is that it can thrash the L2 cache used by graphics tasks, for example when async compute makes it necessary to flush L2 data that graphics work is using. The 'volatile' bit helps with that, if I understand correctly what Cerny said in the Gamasutra interview:

Maybe without all the overhead they can schedule more compute tasks and this is the reason for the choice of 8 ACEs, or maybe I didn't understand anything.

Another thing I understand is that compute shaders will replace some vertex and pixel work for graphics in the years to come because it can be more efficient to do some parts of graphics with compute (60 to 70% of graphics tasks will use compute). And GPGPU can be useful for non-graphics tasks if they are parallelisable and work on big datasets, at least 64 data items at a time, if I remember the Ubisoft presentation about compute tasks correctly...

I agree with Shifty. You can find 8 ACEs on newer GPUs that have a lot more horsepower, but I don't think there are enough available resources on PS4 for all 8 ACEs to be working on heavy compute jobs. As Shifty suggests, the idea that there could be many small compute jobs would lend itself well to having a lot of ACE queues.

On the other end of the debate, Microsoft felt 2 ACEs were sufficient for their setup and mentioned in their interview that having too many concurrent threads could at times cause issues.

As for why add more ACEs and not spend the budget on the rest of the GPU, well, if you can spot an ACE on a chip diagram I'd be surprised. I think from a space, thermal, and wattage perspective they put in what they could for the PS4. Having additional ACEs was about getting more in, not about leaving things out.

chris1515 · Jan 31, 2015

iroboto said:
I agree with Shifty. You can find 8 ACEs on newer GPUs that have a lot more horsepower, but I don't think there are enough available resources on PS4 for all 8 ACEs to be working on heavy compute jobs. As Shifty suggests, the idea that there could be many small compute jobs would lend itself well to having a lot of ACE queues.

On the other end of the debate, Microsoft felt 2 ACEs were sufficient for their setup and mentioned in their interview that having too many concurrent threads could at times cause issues.

As for why add more ACEs and not spend the budget on the rest of the GPU, well, if you can spot an ACE on a chip diagram I'd be surprised. I think from a space, thermal, and wattage perspective they put in what they could for the PS4. Having additional ACEs was about getting more in, not about leaving things out.

I think the issue the Microsoft rep spoke about is thrashing the cache, and it is less of an issue with the Kaveri and PS4 architectures.

Kaveri has 8 ACEs and it is less powerful than the PS4 GPU.

chris1515 · Jan 31, 2015

http://dl.acm.org/citation.cfm?id=626749

Parallel processing systems with cache or local memory in the memory hierarchies are considered. These systems have a local cache memory in each processor and usually employ a write-invalidate protocol for the cache coherence. In such systems, a problem called 'cache or local memory thrashing' can arise in executions of parallel programs, when the data unnecessarily moves back and forth between the caches or local memories in different processors

An explanation of cache thrashing. It is not only a problem for GPUs with async compute but also for CPU threads sharing the same cache, as with hyperthreading.

With the 'volatile' bit for compute tasks, the invalidate and write-back operations only affect the compute data marked in the cache, not the graphics data.
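
A toy model of that idea, based only on the wording of the Cerny quote: each cache line carries a 'volatile' tag, and invalidate and write-back touch only tagged lines. This is a sketch under those assumptions, not how the PS4 L2 is actually implemented; the class and method names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Line:
    data: int
    volatile: bool = False  # assumed to be set for accesses made by compute
    dirty: bool = False     # modified in cache, not yet written to memory


class ToyL2:
    def __init__(self):
        self.lines = {}   # address -> Line
        self.memory = {}  # address -> value, stands in for system memory

    def write(self, addr, value, volatile=False):
        self.lines[addr] = Line(value, volatile, dirty=True)

    def writeback_volatile(self):
        """Write back only dirty lines tagged volatile; return their addresses."""
        written = []
        for addr, line in self.lines.items():
            if line.volatile and line.dirty:
                self.memory[addr] = line.data
                line.dirty = False
                written.append(addr)
        return written

    def invalidate_volatile(self):
        """Drop only lines tagged volatile (write back first to keep their data)."""
        dropped = [addr for addr, line in self.lines.items() if line.volatile]
        for addr in dropped:
            del self.lines[addr]
        return dropped


l2 = ToyL2()
l2.write(0x100, 1)                 # graphics data, untagged
l2.write(0x200, 2, volatile=True)  # compute data, tagged volatile
print(l2.writeback_volatile())     # [512]
print(l2.invalidate_volatile())    # [512]
print(0x100 in l2.lines)           # True: the graphics line was never touched
```

Without the tag, the only safe options would be to flush or invalidate the whole L2, which is the overhead the quote says this avoids.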
