The capabilities of the 4 special CUs in Orbis

So people were supposed to believe every rumor out of VGLeaks (MOST of which have proven true) except the one that implies the PS4 GPU might be compromised (or optimized) in some way?

"Buffing" it doesn't cause issues but increased costs, but "nerfing" it does. You're compromising a well-thought design.
Why on earth would AMD or Sony want to nerf the 4 CUs? I don't see a logical explanation for doing that.

Everything else out of VGLeaks at least made sense in some way. Making these 4 CUs sub-par doesn't.
 
"Buffing" it doesn't cause issues but increased costs, but "nerfing" it does. You're compromising a well-thought design.
Why on earth would AMD or Sony want to nerf the 4 CUs? I don't see a logical explanation for doing that.

Everything else out of VGleaks at least made sense in some way. Making these 4 CUs sub-par doesn't.

I think the rumor mentioned that they enhanced the 4 CUs somehow (extra scalar ALU). But whatever, if you can schedule all the CUs any way you want, then there is no real need to skew the CUs for any particular app. The PS4 OS doesn't really need to hog the GPU for any always-on tasks anyway. The dedicated decoders and secondary ARM chip seem sufficient.
 
It's all good. The challenge is to understand how things fit together. Even if the hardware is flexible, sometimes it is obscured by software layers for assorted reasons. The key thing here is: how does libGCM work? Is it as flexible as Cyan mentioned? If so, then the 18 unified CU setup is "perfect". There is no need to structure the CUs artificially.
You don't even need to schedule this yourself (and shouldn't). Just throw graphics and compute tasks at the array at the same time (these are handled by different queues in hardware) and the hardware will sort out the work distribution to the CUs on its own (and is quite flexible in that; GCN can simultaneously run different tasks on the same CU). One can help by prioritizing important tasks, though (GCN supports assigning priorities). Only if one has really good reasons should one think about interfering with the mechanisms the hardware provides, i.e. reserving a few CUs for a special task, as this decreases the flexibility and likely the overall performance.
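To make the "just throw both kinds of work at the array" point concrete, here is a minimal C++ sketch of a work distributor that pulls from separate graphics and compute queues, prefers higher-priority wavefronts, and drops them onto whichever CU has free slots, with no CU dedicated to either type. It is a toy model of the idea, not a description of real GCN hardware; the 18-CU and 40-wavefront figures match the rumored Orbis setup, everything else is invented.

```cpp
// Toy model: two hardware queues (graphics, compute) feed one CU array.
// Higher-priority work is picked first, and any CU can hold both kinds.
#include <cstdio>
#include <queue>
#include <vector>

struct Wave { int priority; bool isCompute; };

struct ByPriority {
    bool operator()(const Wave& a, const Wave& b) const { return a.priority < b.priority; }
};

int main() {
    const int kNumCUs = 18, kSlotsPerCU = 40;            // rumored Orbis-like numbers
    std::vector<int> freeSlots(kNumCUs, kSlotsPerCU);

    // Two independent queues feeding the same CU array.
    std::priority_queue<Wave, std::vector<Wave>, ByPriority> gfx, cmp;
    for (int i = 0; i < 300; ++i) gfx.push({1, false});
    for (int i = 0; i < 100; ++i) cmp.push({(i % 10 == 0) ? 5 : 1, true});

    // Always take the highest-priority wave of either kind and place it on the
    // least-loaded CU; no CU is reserved for graphics or for compute.
    int placedGfx = 0, placedCmp = 0;
    while (!gfx.empty() || !cmp.empty()) {
        bool takeCompute = gfx.empty() ||
                           (!cmp.empty() && cmp.top().priority > gfx.top().priority);
        Wave w = takeCompute ? cmp.top() : gfx.top();
        if (takeCompute) cmp.pop(); else gfx.pop();

        int best = 0;
        for (int cu = 1; cu < kNumCUs; ++cu)
            if (freeSlots[cu] > freeSlots[best]) best = cu;
        if (freeSlots[best] == 0) break;                 // array full: wait for completions
        --freeSlots[best];
        (w.isCompute ? placedCmp : placedGfx)++;
    }
    std::printf("placed %d graphics and %d compute wavefronts across %d CUs\n",
                placedGfx, placedCmp, kNumCUs);
    return 0;
}
```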
 
The ACEs will monitor for when CUs and other resources are no longer needed. Once a kernel completes, the ACE can evaluate what queued tasks can fit in the freed up pool and work through things based on priority and other factors like task age.

There's no elegant way to handle the possibility that the CUs are running long-lived kernels without preemption. This is why I speculated on the OS or runtime calling dibs on a certain number of wavefronts and local store ahead of time with some kind of placeholder shader or runtime trickery.

The game wouldn't really know unless it purposefully ran FP loops and saw the throughput numbers didn't match 18 CUs.
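As a rough model of that selection step (speculation about the mechanism, not the real ACE microcode), the sketch below takes the resources a completed kernel just handed back and walks the queue, launching whatever now fits, highest priority first and oldest first within the same priority. All resource budgets, kernel names, and numbers are invented for illustration.

```cpp
// Speculative model of an ACE evaluating the queue after a kernel completes:
// launch whatever fits in the freed pool, ordered by priority, then by age.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Kernel {
    const char* name;
    int vgprs, ldsBytes;    // resource demand of one workgroup
    int priority, age;      // age: how long it has been sitting in the queue
};

// Resources just handed back by a completed kernel (sizes are made up).
struct Pool { int freeVgprs = 2048; int freeLdsBytes = 64 * 1024; };

bool fits(const Kernel& k, const Pool& p) {
    return k.vgprs <= p.freeVgprs && k.ldsBytes <= p.freeLdsBytes;
}

int main() {
    Pool freed;
    std::vector<Kernel> queued = {
        {"postfx",    1024, 32 * 1024, 1, 10},
        {"audio",      256,  8 * 1024, 5,  2},   // high priority, younger
        {"particles",  512, 16 * 1024, 1, 30},   // same priority as postfx, older
    };

    // "Priority and other factors like task age": priority first, then age.
    std::sort(queued.begin(), queued.end(), [](const Kernel& a, const Kernel& b) {
        return a.priority != b.priority ? a.priority > b.priority : a.age > b.age;
    });

    for (const Kernel& k : queued) {
        if (!fits(k, freed)) continue;           // allocations can't resize: all or nothing
        freed.freeVgprs    -= k.vgprs;
        freed.freeLdsBytes -= k.ldsBytes;
        std::printf("launch %-9s (prio %d, age %d)\n", k.name, k.priority, k.age);
    }
    return 0;
}
```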
 
Actually, if one could instruct the command processor to schedule just up to a maximum of, let's say, 36 wavefronts per CU instead of 40 and reserve a few kB of register file and local storage space, one could always squeeze a few additional wavefronts onto a CU if the need arises (likely given a high priority so they always get scheduled when their operands are ready). Other than the latency-hiding capabilities of the CUs being slightly diminished, the throughput seen by the game wouldn't change at all. So one doesn't need to preempt long-running wavefronts on a CU if one can still schedule another one which then takes the fast lane because of its high priority.
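A tiny simulation makes the trade-off visible: one SIMD issues one instruction per cycle from whichever resident wave has its operands ready, and each wave alternates one ALU instruction with a memory wait. The 10-waves-per-SIMD (40 per CU) limit is standard GCN; the 36-wave cap and the 8-cycle latency are just illustrative numbers.

```cpp
// Latency-hiding toy for one SIMD: each resident wave issues one instruction,
// then waits `memLatency` cycles, and repeats. The SIMD issues one instruction
// per cycle from the first ready wave it finds.
#include <cstdio>
#include <vector>

double utilization(int waves, int memLatency, int cycles = 100000) {
    std::vector<int> readyAt(waves, 0);   // cycle at which each wave can issue again
    int issued = 0;
    for (int c = 0; c < cycles; ++c) {
        for (int w = 0; w < waves; ++w) {
            if (readyAt[w] <= c) {               // operands ready: issue this wave
                ++issued;
                readyAt[w] = c + 1 + memLatency; // 1 issue cycle + memory wait
                break;                           // one instruction per SIMD per cycle
            }
        }
    }
    return double(issued) / cycles;
}

int main() {
    const int memLatency = 8;   // made-up average latency in cycles
    std::printf("10 waves/SIMD (40 per CU): %.2f issue rate\n", utilization(10, memLatency));
    std::printf(" 9 waves/SIMD (36 per CU): %.2f issue rate\n", utilization(9, memLatency));
    std::printf(" 4 waves/SIMD            : %.2f issue rate\n", utilization(4, memLatency));
    return 0;
}
```

With this made-up 8-cycle latency, both 9 and 10 resident waves keep the SIMD fully busy while 4 cannot, which is the sense in which the cap costs only latency-hiding margin rather than throughput.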
 
If that facility is present, that could be a way. I haven't run across that particular capability, but I haven't analyzed the ISA docs to that depth and Orbis may have customized that part.

My idea was more conservative in that a larger amount of storage on fewer CUs was reserved, and that would be more noticeable as occupancy restrictions are more likely to kick in.
The amount would be larger to provide some certainty that system functions have a generous supply just in case.

The SI document mentioned the capability for the GPU to monitor for when a memory address is accessed. A constantly waiting placeholder shader could be used to bypass queueing delays as well. Write a pointer and relevant arguments to a monitored address for that shader, and it can branch and run even if the queues are backed up.
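A CPU-side analogy of that placeholder idea, with a thread and an atomic standing in for the resident wavefront and the monitored address, so this is purely illustrative and not GPU code: the worker is already running, so new work written to the mailbox starts immediately without passing through any queue.

```cpp
// CPU analogy of a "constantly waiting placeholder shader": an already-resident
// worker watches a mailbox and branches into whatever work shows up there.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct Job { void (*fn)(int); int arg; };

std::atomic<Job*> mailbox{nullptr};   // stands in for the monitored memory address
std::atomic<bool> stopFlag{false};

void placeholderWorker() {
    while (!stopFlag.load(std::memory_order_acquire)) {
        Job* job = mailbox.exchange(nullptr, std::memory_order_acq_rel);
        if (job) { job->fn(job->arg); delete job; }   // branch straight into the work
        // A real shader would use a wait-on-address facility instead of spinning.
    }
}

void systemTask(int x) { std::printf("system task ran with arg %d\n", x); }

int main() {
    std::thread worker(placeholderWorker);

    // "Write a pointer and relevant arguments to a monitored address":
    mailbox.store(new Job{systemTask, 42}, std::memory_order_release);

    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    stopFlag.store(true, std::memory_order_release);
    worker.join();
    return 0;
}
```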
 
Aye, software reservation? ^_^

I can't think of a task at the moment that would require Sony to reserve 1-4 CUs in a sustained fashion. Perhaps when the OS steals resources for Gaikai LAN streaming while my wife is watching video on the PS4? In that case, the video decoder should be more involved, and we can bear a little setup time for Gaikai game streaming.

Unless Orbis could run 2 or more games at the same time.
 
It's not a matter of reserving the whole CU, just enough to capture any worst-case storage or register needs for whatever functions Sony wants to run through them. This can be sufficient to cause the CUs to be off-limits to kernels that require significant local data share or register allocations, and instruction issue would be reduced as well during the times the reserved functions are running.

Resource allocations can't resize, so there's no stealing of resources. Either there are resources available for a kernel's full register and storage needs, or it can't be allocated until some other kernel completes at some point in the future.
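A short sketch of that all-or-nothing behaviour, with made-up sizes: a workgroup either gets its full LDS and register allocation on a CU or it doesn't launch there, so a worst-case system reservation quietly makes that CU off-limits to any game kernel with a large LDS footprint.

```cpp
// All-or-nothing allocation: a worst-case reservation on one CU (sizes made up)
// leaves it unable to host a game workgroup with a large LDS demand.
#include <cstdio>

struct CU {
    int freeLds  = 64 * 1024;   // bytes of local data share per CU (GCN)
    int freeVgpr = 4 * 256;     // 256 VGPRs per SIMD x 4 SIMDs, counted per lane
};

// Allocations cannot shrink or resize, so this is a strict fit test.
bool tryAllocate(CU& cu, int ldsBytes, int vgprs) {
    if (ldsBytes > cu.freeLds || vgprs > cu.freeVgpr) return false;
    cu.freeLds  -= ldsBytes;
    cu.freeVgpr -= vgprs;
    return true;
}

int main() {
    CU reservedCU, normalCU;
    // Hypothetical worst-case reservation for system functions on one CU.
    tryAllocate(reservedCU, 32 * 1024, 256);

    // A game workgroup that wants 40 KB of LDS:
    std::printf("reserved CU: %s\n", tryAllocate(reservedCU, 40 * 1024, 128)
                ? "launched" : "does not fit, must wait or go elsewhere");
    std::printf("normal CU:   %s\n", tryAllocate(normalCU, 40 * 1024, 128)
                ? "launched" : "does not fit");
    return 0;
}
```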
 
I see. If it's not too often, they can also send an OS notification event to the app/game and have the app make an OS call to do what's necessary at a convenient time.

But yes, perhaps the OS will have some "protocol" to ensure it can use/hijack the GPU where needed.

... assuming the developers can use all the CUs any way they want.
 