Does PS4 have excess graphics power intended for compute? *spawn*

Being CPU limited is a different area of analysis from whether fixed-function graphics loads experience diminishing returns at 14 CUs.
Compute kernels have their own dependence on inputs from the CPU, so the "extra" CUs would still experience problems if the CPU is limiting the GPU in a general fashion.

The CUs are part of the GPU, and shuffling allocations of a throttled whole doesn't change things much.
It is also not that difficult to flip things around by upping some graphical feature that significantly expands the burden on the GPU or a specific portion of it with little effect on the CPU.

Can you give us a few ideas of which fixed-function elements of the GPU might be holding back CU effectiveness? And whether or not they scale up in bigger GPUs, or would pose an even greater limitation in those cases?
 
Can you give us a few ideas of which fixed-function elements of the GPU might be holding back CU effectiveness? And whether or not they scale up in bigger GPUs, or would pose an even greater limitation in those cases?

The fixed-function portions are decoupled from the CUs, and usually go through some sort of specialized bus or memory to interface with them.
From the CU array's point of view, they are just another client or a resource to negotiate with, and how much they can do with that finite capacity is a limit. The low-level details of how much they can send/receive is not discussed much.

One known value is the number of ACEs versus command processors. Each front end can negotiate for one new wavefront a cycle, so at least in that regard there's a difference in raw numbers. If there weren't prioritization the compute side could negotiate at a higher rate than the command processor. The command processor could be a limit, although it doesn't seem to be that much of one in practice ahead of all the other bottlenecks.
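As a toy illustration of that raw-numbers difference (the eight-ACE count is the publicly stated PS4 figure; everything else here is a made-up model, not how the hardware actually arbitrates):

```cpp
#include <cstdio>

int main() {
    // Each front end can request at most one new wavefront per cycle,
    // so the aggregate request rate scales with the number of front ends.
    const int graphics_cps = 1;    // graphics command processor
    const int aces         = 8;    // asynchronous compute engines (PS4 figure)
    const int cycles       = 1000;

    // Upper bounds on wavefront launch requests over 'cycles', ignoring
    // prioritization, CU occupancy, and everything downstream.
    std::printf("graphics request cap: %d wavefronts\n", graphics_cps * cycles);
    std::printf("compute  request cap: %d wavefronts\n", aces * cycles);
    return 0;
}
```

The only point is the one above: absent prioritization, the compute front ends can in aggregate ask for wavefront slots faster than the single graphics command processor can.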

The ROPs can still be a limit, hence why there are preferred buffer formats or tiling considerations. There's an export bus from the CUs to the ROPs, so there is contention there and in the buffering stages. A wavefront will stall at the export phase if the ROPs are committed or the bus is being used.
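To put a rough number on the "preferred buffer formats" point, a trivial bit of arithmetic (the resolution and formats are picked purely for illustration, and compression, blending reads, depth, and overdraw are all ignored): a wider colour target doubles the bytes the export bus and ROPs have to move for every full-screen pass.

```cpp
#include <cstdio>

int main() {
    const double pixels  = 1920.0 * 1080.0;        // one 1080p full-screen pass
    const double mib     = 1024.0 * 1024.0;
    const double rgba16f = pixels * 8.0 / mib;     // 64-bit colour target
    const double rgba8   = pixels * 4.0 / mib;     // 32-bit colour target
    std::printf("RGBA16F: %.1f MiB written per pass\n", rgba16f); // ~15.8 MiB
    std::printf("RGBA8:   %.1f MiB written per pass\n", rgba8);   // ~7.9 MiB
    return 0;
}
```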

The global data share is heavily relied upon for wavefront launch, and conflicts there were profiled as one reason why the graphics portion couldn't always utilize the ALUs in earlier architectures.
The caveat there is that this was done for pre-GCN GPUs and I've not seen a similar article since.
The counter-caveat is that AMD hasn't done that much to change that portion of the GPU or its glass jaws, so it's quite possible similar problems persist.
However, in that case, the GDS is used by the compute front ends as well for wavefront launch, so problems may hit both sides of the divide.

Primitive setup and rasterizer output are a limit, and AMD has not done as much as the competition to improve them for generations.
Tessellation or geometry handling seems to be a sore point. Rendered triangle output during tessellation for AMD GPUs going all the way up to Hawaii is a pretty serious point of non-scaling.
It's probably not a coincidence that Cerny noted as an example of the PS4's architectural features the use of compute shaders to do a better job at culling geometry.
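For a sense of what that compute-based culling amounts to, here's a minimal sketch written as plain C++ that mirrors what a per-triangle compute shader thread would do; the structures, the tests chosen, and the winding convention are illustrative assumptions, not any particular engine's (or Sony's) implementation:

```cpp
#include <array>
#include <cstdio>
#include <vector>

struct Vec4 { float x, y, z, w; };        // clip-space position
using Triangle = std::array<Vec4, 3>;

// Returns true if the triangle can be thrown away before it ever reaches
// the fixed-function setup/rasterization stages.
bool CullTriangle(const Triangle& t) {
    // Trivial frustum rejection: all three vertices outside the same clip plane.
    auto outside = [&](auto pred) { return pred(t[0]) && pred(t[1]) && pred(t[2]); };
    if (outside([](const Vec4& v) { return v.x < -v.w; })) return true;
    if (outside([](const Vec4& v) { return v.x >  v.w; })) return true;
    if (outside([](const Vec4& v) { return v.y < -v.w; })) return true;
    if (outside([](const Vec4& v) { return v.y >  v.w; })) return true;

    // Backface test: signed area of the projected triangle (assumes a
    // counter-clockwise front-face winding, purely as a convention here).
    float area = t[0].x * (t[1].y * t[2].w - t[2].y * t[1].w)
               + t[1].x * (t[2].y * t[0].w - t[0].y * t[2].w)
               + t[2].x * (t[0].y * t[1].w - t[1].y * t[0].w);
    return area <= 0.0f;  // reject clockwise (backfacing) and degenerate triangles
}

// CPU stand-in for a compute dispatch: one "thread" per triangle, compacting
// the survivors into an index list that the real draw would consume.
std::vector<unsigned> CullPass(const std::vector<Triangle>& tris) {
    std::vector<unsigned> survivors;
    for (unsigned i = 0; i < tris.size(); ++i)
        if (!CullTriangle(tris[i]))
            survivors.push_back(i);
    return survivors;
}

int main() {
    Triangle visible   = {{ {-0.5f, -0.5f, 0.f, 1.f}, {0.5f, -0.5f, 0.f, 1.f}, {0.f, 0.5f, 0.f, 1.f} }};
    Triangle offscreen = {{ { 2.0f,  2.0f, 0.f, 1.f}, {3.0f,  2.0f, 0.f, 1.f}, {2.5f, 3.0f, 0.f, 1.f} }};
    std::vector<Triangle> tris = { visible, offscreen };
    std::printf("%zu of %zu triangles survive culling\n", CullPass(tris).size(), tris.size());
    return 0;
}
```

Triangles rejected here never reach primitive setup or the rasterizer, which is exactly the fixed-function portion that scales poorly.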

There may be miscellaneous things like constant updates and other fixed-function requirements. Units there may have requirements for what memory they can and cannot address, and the command processor and other such units tend to sit on a common and lesser link to the L2 or can't hit the cache at all.
The ROPs don't touch the L2, but they have direct links to the memory controllers and their own caches.
 
Cheers, tons of great material for this thread in there. It'll probably take me another 5 or 6 re-reads before I vaguely understand it all though ;)
 
Aren't cache sizes and register file sizes (presumably we're talking about memory that isn't per CU here) the same on both the 18 CU PS4 and the 44 CU 290X?
No.

CUs don't work that way. You can't partition some off to do some work. You just send work to the GPU and it spreads the workload across available CUs.
In general you should let the GPU do its thing, but GPUs are quite programmable. ;)
 
Maybe because you chimed in on the 14+4 after the initial "14+4 confirmed by Japanese third party" claim.
The fact that you appeared to argue there's something material in the 14+4, beyond it being just a recommendation/example of how devs might balance their workload, also didn't help.

Anyway, enough of digging up history...
We should all be clear by now that the correct balance will certainly differ from project to project, so picking anything wacky like 6+12, 8+10, 10+8, 12+6, 14+4, 16+2, or 17+1 is perfectly fine if that's exactly what the project requires.

There is nothing in that post about hardware-separated CUs or different CUs at all. I have explained my point of view in many posts before this one. The entire discussion started because that specific developer talked about the 14+4 set-up, which was Sony's suggested set-up for using the PS4 GPU (some users did not even agree with me that 14+4 is a suggested set-up). I started the discussion to see other people's opinions on the matter (actually, I hoped they would participate in this discussion, and fortunately they did), which led to this topic.

-I never tried to say that PS4 CUs are different from each other.
-I never tried to say that PS4 CUs are hardware-separated on the chip for different purposes.
-I never tried to say that 14+4 suggested set-up is a forced set-up for developers.

But I did try to say that there should be some reason Sony suggested such a set-up to developers for using their hardware (as I said, some users didn't/don't agree with me on this matter), even though every developer can use the ALUs as they want (graphics rendering, compute, or a mix of them).

I'm not saying that my point of view was entirely correct, but I never said what you're attributing to me. Maybe my poor English is the reason for the confusion, I don't know.
 
It is also not that difficult to flip things around by upping some graphical feature that significantly expands the burden on the GPU or a specific portion of it with little effect on the CPU.
Thanks a lot. This seems logical.

The fixed-function portions are decoupled from the CUs, and usually go through some sort of specialized bus or memory to interface with them.

...
So command processors, ROPs, rasterizers, GDS, and of course bandwidth. But this applies to all GPUs in general; I remember how often we argued in this forum about how flagship GPUs tend to give lower fps than expected despite a doubling of hardware resources, and ALUs in particular. In effect we were arguing the law of diminishing returns, probably Amdahl's law as well.

So the catch here is that, at least on consoles, developers have to optimize their game code to balance the load across fixed-function and general-purpose hardware, and not prioritize one over the other. I guess this should take precedence over any compute/graphics code divide, no?
 
There is nothing in that post about hardware-separated CUs or different CUs at all. I have explained my point of view in many posts before this one. The entire discussion started because that specific developer talked about the 14+4 set-up, which was Sony's suggested set-up for using the PS4 GPU (some users did not even agree with me that 14+4 is a suggested set-up). I started the discussion to see other people's opinions on the matter (actually, I hoped they would participate in this discussion, and fortunately they did), which led to this topic.

-I never tried to say that PS4 CUs are different from each other.
-I never tried to say that PS4 CUs are hardware-separated on the chip for different purposes.
-I never tried to say that 14+4 suggested set-up is a forced set-up for developers.

But I did try to say that there should be some reason Sony suggested such a set-up to developers for using their hardware (as I said, some users didn't/don't agree with me on this matter), even though every developer can use the ALUs as they want (graphics rendering, compute, or a mix of them).

I'm not saying that my point of view was entirely correct, but I never said what you're attributing to me. Maybe my poor English is the reason for the confusion, I don't know.

OK, point taken.

I think the 14+4 is just a guess at how they might "balance" the workload in a hypothetical project that they think may be mainstream a few years down the line, and reading too much into it isn't going to be too helpful.
 
So command processors, ROBs, Rasterizers, GDS, and of course bandwidth. But this applies to all GPUs in general, I remember how often we argued in this forum about how flag ship GPUs tend to give lower fps than expected despite doubling of hardware resources and ALUs in particular. In effect we were arguing the law of diminishing returns, probably Amdahl's law as well.
The most general formulation of Amdahl's Law states that runtime becomes dominated by the portion of the system that is not improved, and the GPU's graphics front end and geometry handling are areas some GPU architectures have been slower to improve.
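Written out, with made-up numbers purely for illustration: if a fraction 1-p of frame time sits in the un-improved front end and only the fraction p scales with the extra CUs, the speedup from widening the shader array by a factor s is

```latex
S = \frac{1}{(1 - p) + \frac{p}{s}}
\qquad\text{e.g. } p = 0.8,\ s = 2
\;\Rightarrow\; S = \frac{1}{0.2 + 0.4} \approx 1.67
```

i.e. doubling the ALU complement buys only ~1.67x if a fifth of the frame is stuck in setup/geometry, which is the "lower fps than expected despite doubling" pattern in a nutshell.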

So the catch here is that -at least on consoles-, developers has to optimize their game code to balance load on fixed/general function hardware, and not prioritize one over the other. I guess this should take precedence over any compute/graphics code divide, no?
Also, the console makers can't rely on sometimes unreasonably high resolutions to hide the parts of the pipeline they cannot scale.
AMD GPUs across generations have relied on settings that push the balance on things like bandwidth, memory capacity, or resolution-linked facets like pixel shader throughput that raise the amount of pixel work done per primitive or draw call submission.
More reasonable resolutions or tessellation tended to bring the supposedly weaker GPUs with better drivers and geometry front ends to the fore.
I'm assuming most of the driver-related bottlenecking is of massively reduced significance for the PS4's extrapolated future workloads, but the other parts remain.

This may be the root of some of Sony's projections.
With an assumed 720-1080p resolution range and projected ALU load per pixel, Sony could be betting that something in the setup pipeline or front end will leave some fraction of the CU complement underutilized.
This would be an educated guess as to an overall trend (6-8 years is a long time, and the design decision-making started several years prior to that), one which could turn out to be wrong after some time, and one for which outliers could exist in either direction.
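As a back-of-envelope version of that bet (the 18 CU x 800 MHz peak-rate arithmetic is standard; the 1080p30 target and the idea that the front end can't keep that fed are the assumptions being discussed here):

```latex
\begin{aligned}
18 \times 64 \times 2 \times 0.8\,\mathrm{GHz} &\approx 1.84\ \mathrm{TFLOPS\ peak} \\
\frac{1.84 \times 10^{12}}{1920 \times 1080 \times 30} &\approx 3.0 \times 10^{4}\ \mathrm{FLOPs\ per\ pixel\ per\ frame}
\end{aligned}
```

If the setup pipeline and front end can't generate enough pixel and vertex work to approach that peak, the leftover ALU time is exactly the slack the 14+4 example (and the compute emphasis generally) is pointing at.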
 
In general you should let the GPU do its thing, but GPUs are quite programmable. ;)

D'oh! In the "Radeon Southern Islands Acceleration" document there's no mention of anything that allows you to apply a 'CU' mask in the PM4 packets.

Nice to know that it is at least theoretically possible...
 
So, if it's possible for devs to partition CUs for dedicated jobs, would that actually be beneficial? What would the impact be on the memory subsystem, etc.? Could it be a net win, or would it just introduce overhead over letting the GPU handle the workloads?
 
So, if it's possible for devs to partition CUs for dedicated jobs, would that actually be beneficial?

I don't think so. What would be beneficial is having compute job queues that are prioritized (as well as non-prioritized) over 3D jobs, so that devs can put e.g. small audio tasks in the prioritized queue, and standard, non-latency-sensitive jobs in the default one.
That would elegantly solve the audio issue pointed out by devs without worsening the scheduler's job (imho).
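A toy host-side model of that split, purely for illustration (the real arbitration happens in the GPU front end, and none of these names correspond to the PS4 SDK or any actual driver API):

```cpp
#include <cstdio>
#include <queue>
#include <string>
#include <vector>

// Toy model of prioritized compute queues.
struct Job { std::string kernel; int priority; };   // higher = more urgent
struct CompareJob {
    bool operator()(const Job& a, const Job& b) const { return a.priority < b.priority; }
};

int main() {
    std::priority_queue<Job, std::vector<Job>, CompareJob> frontEnd;

    // Small, latency-sensitive work (e.g. an audio mix kernel) is tagged
    // high priority; bulk, frame-tolerant work goes in at default priority.
    frontEnd.push({"particle_update", 0});
    frontEnd.push({"audio_mix",       1});
    frontEnd.push({"light_culling",   0});

    // At each arbitration step the front end picks the most urgent pending job.
    while (!frontEnd.empty()) {
        std::printf("launch %s\n", frontEnd.top().kernel.c_str());
        frontEnd.pop();
    }
    return 0;
}
```

The caveat in the reply below still applies, though: picking the most urgent pending job is not the same as being able to preempt CUs that are already busy.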
 
So, if it's possible for devs to partition CUs for dedicated jobs, would that actually be beneficial. What would the impact be on the memory subsystem etc? Could it be a net win or just introduce overhead over letting the GPU handle the workloads?

Aside from the L1 caches of the CUs in question, the rest of the memory subsystem might see some changes in miss rates, though that would depend more on the locality of the memory behavior for the wavefronts in question. The GPU's scheduling would already mix things up quite a bit, and the small L2 is a common resource that isn't going to know much of the difference.
It could go either way. Reserved CUs that are quiescent or highly local in behavior would thrash the L2 less than if the GPU is switching things in and out, while thrash-happy loads running on the reserved CUs will interfere with other workloads on a consistent basis.
Since the general suggestion is that a minority of CUs be doing this, it would be a minor effect.

The reserved CUs' L1s could see a drop in miss rates, if there are highly local long-lived kernels allowed to set up residence there instead of being switched out regularly.

Reserving CUs would generally cause a loss of peak utilization, as the GPU at large would not have those CUs readily available for wavefronts belonging to kernels that aren't in the reserved workload.



I don't think so. What would be beneficial is having compute job queues that are prioritized (as well as non-prioritized) over 3D jobs, so that devs can put e.g. small audio tasks in the prioritized queue, and standard, non-latency-sensitive jobs in the default one.
That would elegantly solve the audio issue pointed out by devs without worsening the scheduler's job (imho).

It still requires that the resources be available. The highest-priority commands in the queues can't make already occupied CUs drop what they're doing, and CUs can run for however long they want.
There seems to be some limited context-switching mentioned in various spots, but it looks like it could be limited to when individual wavefronts in a multi-wavefront kernel complete, allowing the kernel to switch out once its active context data is gone from the CU array.
 
It still requires that the resources be available. The highest-priority commands in the queues can't make already occupied CUs drop what they're doing, and CUs can run for however long they want.

...Just avoid writing huge and lengthy shaders, moving to a finer-grained flavour. It will favour the scheduler, which will be able to keep more things in flight and make better use of queue priorities.
Controlling the CUs with a mask might make sense if you can limit them to a single DCT, but then you would have the occupancy problem of underutilized shaders, I think.
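A hedged sketch of that finer-grained split (DispatchKernel is a hypothetical stand-in, not a real API; the slice count is arbitrary): the same amount of work is issued as shorter pieces, so the front end gets arbitration points where a higher-priority queue can win.

```cpp
#include <cstdio>

// Illustration only: "DispatchKernel" stands in for whatever call actually
// submits a compute kernel; the point is the shape of the split.
void DispatchKernel(const char* name, int firstItem, int itemCount) {
    std::printf("dispatch %-15s items [%d, %d)\n", name, firstItem, firstItem + itemCount);
}

int main() {
    const int totalItems = 1 << 20;

    // Coarse: one huge kernel holds its CUs until every item is done,
    // leaving no gap for a higher-priority job to get on the machine.
    DispatchKernel("simulate_all", 0, totalItems);

    // Finer-grained: the same work in shorter slices; between slices the
    // front end gets a chance to launch something more urgent.
    const int slice = totalItems / 16;
    for (int start = 0; start < totalItems; start += slice)
        DispatchKernel("simulate_slice", start, slice);

    return 0;
}
```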
 
Also, the console makers can't rely on sometimes unreasonably high resolutions to hide the parts of the pipeline they cannot scale.
Some would argue the PS4 is doing just that; oftentimes it scales resolution up from 900p to 1080p without performance loss, but that could also be developers making a cautious estimate of the hardware resources.

AMD GPUs across generations have relied on settings that push the balance on things like bandwidth, memory capacity, or resolution-linked facets like pixel shader throughput that raise the amount of pixel work done per primitive or draw call submission.
Yeah, I remember the 8xAA and early 1440p benchmark debates back when 1080p was just starting to become mainstream.

one which could turn out to be wrong after some time, and one for which outliers could exist in either direction.
Guess time will tell then. As always, I can't thank you enough.
 