Does PS4 have excess graphics power intended for compute? *spawn*

Compute is as dependent on these internal resources as graphics, so if the internals are optimised for 14 CUs, the other 4 would be equally restricted whether working on graphics or compute (and they can't be isolated for those workloads either, so you'd have an 18 CU part incapable of running full tilt).

Is there any reason to think these aren't the same in PS4? AMD would have designed the chip to actually be functional, providing the necessary cache BWs etc. that the CUs need to operate. If they didn't, the CUs would be equally as weak for compute as they are for graphics. In terms of internal hardware 'balance', PS4's GPU is designed for 18 CUs (until we hear differently, which would be shocking as it means AMD provided a gimped part, or Sony messed the design up and gimped it themselves).
 
Compute is as dependent on these internal resources as graphics, so if the internals are optimised for 14 CUs, the other 4 would be equally restricted whether working on graphics or compute (and they can't be isolated for those workloads either, so you'd have an 18 CU part incapable of running full tilt).

Is there any reason to think these aren't the same in PS4? AMD would have designed the chip to actually be functional, providing the necessary cache BWs etc. that the CUs need to operate. If they didn't, the CUs would be equally as weak for compute as they are for graphics. In terms of internal hardware 'balance', PS4's GPU is designed for 18 CUs (until we hear differently, which would be shocking as it means AMD provided a gimped part, or Sony messed the design up and gimped it themselves).

That was an example to illustrate what I was thinking about in my last few posts. The question is: if Sony designed the PS4 with other AMD GPUs in mind, why did they say the GPU is balanced for 14 CUs (assuming that slide is official; I'd like to know your opinion on its credibility)? There was no need to say something like that at a meeting for developers or internal staff. Otherwise there is no need to investigate this matter further.
 
I doubt there's anything like that going on, at least not to a big degree. You can't design a modern console as 30 or 60 fps*, as what appears on screen is down entirely to software.

Of course you can. Accepting that developers are at liberty to balance graphical fidelity against frame rate, games are graphically built from hundreds/thousands/millions of smaller primitives (pixels, vertices, triangles, shaders etc.), and the way they combine into a finished scene isn't some unknown quantity - there are thousands of games you can profile. Those primitives have a CPU management/setup cost, a memory overhead, a GPU computation cost and a VRAM bandwidth cost. Games will differ in how they balance the use of these, but you can extrapolate lows and highs of model complexity, overall polygons drawn, and the amount of shading done relative to resolution and frame rate.

Look at last gen. Was PS3 a 720p60 console? Or a 720p30? Or a 1080p30? Or 1080p60-ish? Because there are games of all those types on there.

Don't get hung up on my use of '60fps 1080' to describe a hardware target; I used it only to simplify what was already becoming a long post. A performance target could be expressed in any number of ways: objectively, e.g. V polygons comprising W textures of X kilobytes applying Y shaders utilising Z cycles, or subjectively, e.g. Crysis at 1080p at 60fps.

I don't know if there is a standardised objective measure of graphical performance.

AMD offer a decently balanced part based on their GPU options with some feedback from Sony about what tweaks they'd like (32 ROPs being one obvious decision. We have Sebbbi telling us that's basically overkill and compute is more important going forwards, so that seems imbalanced).

This is what sebbbi said:
sebbbi said:
ROPs don't matter that much anymore in modern graphics rendering. On any modern GPU, you should use compute shaders for anything else than triangle rasterization. Compute shaders write directly to memory (instead of using ROPs). I would estimate that a future engine would use at least 2/3 of the GPU time running compute shaders (for graphics rendering).

So apart from raw triangle rasterization, compute shaders are taking load off the traditional ROP hardware, and future engines may lighten that load further. That's one view. Sony's view, as per the Cerny quote above, seems contrary to that: that more non-graphics tasks will be using more compute time. If this pans out, will that leave more or less reliance on ROP hardware? Compute isn't completely negating the need for many ROP units, as the R9 290X comes with 64. What about tessellation? You're flooding the system with triangles. ROPs.
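As a rough sense-check of how much bus traffic the PS4's 32 ROPs could generate at peak, here's a back-of-the-envelope sketch (assuming the 800 MHz GPU clock and plain 32-bit colour writes with no blending or MSAA; illustrative arithmetic only, not a claim about any real workload):

Code:
# Hypothetical peak fill rate vs. memory bandwidth estimate (illustrative)
gpu_clock_hz = 800e6   # PS4 GPU clock
rops         = 32      # colour ROPs
bytes_per_px = 4       # 32-bit render target, write only, no blending

fill_rate_px_s = rops * gpu_clock_hz              # peak pixels written per second
rop_bw_gbs     = fill_rate_px_s * bytes_per_px / 1e9

print(f"Peak fill rate : {fill_rate_px_s / 1e9:.1f} Gpixels/s")
print(f"ROP write BW   : {rop_bw_gbs:.1f} GB/s of the ~176 GB/s total")

So at peak, colour writes alone could account for well over half the bus; whether 32 ROPs is overkill then depends on how often the shaders in front of them are simple enough to actually run at that rate.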

If Sony made their choice based on what devs were trying to accomplish on DX9 hardware and what they may be trying to target in future, they'd have gone about the task poorly.

There's been no paradigm shift in graphics technologies that invalidates profiling PlayStation 3 games. What developers want to do on a platform without DirectX11 getting in the way is as important as, or more important than, what game devs are fighting with in PC space. Many PlayStation 4 developers are going to approach the hardware like a supercharged PlayStation 3, and there's nothing wrong with that. A supercharged PlayStation 3 running The Last of Us with more polygons and more shaders at a higher resolution could have put many PC games to shame. It's not like the jump from PlayStation 2 to PlayStation 3, where many new graphics techniques, like pixel shaders, were introduced.

That's not to say you should ignore technological trends occurring in PC space either - but that's what AMD bring to the table.
 
That was an example to illustrate what I was thinking about in my last few posts. The question is: if Sony designed the PS4 with other AMD GPUs in mind, why did they say the GPU is balanced for 14 CUs (assuming that slide is official; I'd like to know your opinion on its credibility)? There was no need to say something like that at a meeting for developers or internal staff. Otherwise there is no need to investigate this matter further.

My guess was that they initially thought that 14 CUs would be enough for what they were initially aiming for (thus their balance), and then they tacked on another 4, jumping from the 7790 realm to the 7850 realm, just to be sure they wouldn't be underestimating what the load would be like in a few years when GPGPU might be much more relevant.
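For reference, a sketch of the raw ALU numbers behind that jump, assuming the standard GCN arrangement (64 ALUs per CU, 2 FLOPs per ALU per clock via FMA) and the PS4's 800 MHz clock:

Code:
# GCN single-precision throughput for 14 vs 18 CUs (illustrative arithmetic)
alus_per_cu       = 64
flops_per_alu_clk = 2        # fused multiply-add counts as two FLOPs
clock_hz          = 800e6    # PS4 GPU clock

for cus in (14, 18):
    tflops = cus * alus_per_cu * flops_per_alu_clk * clock_hz / 1e12
    print(f"{cus} CUs -> {tflops:.2f} TFLOPS")

Which is where the oft-quoted 1.84 TFLOPS figure comes from.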
 
If you believe the original leaks, the PS4 was supposed to have a 192GB/sec bus (using 6.0 Gbps GDDR5 instead of the 5.5 Gbps it has now): http://www.vgleaks.com/playstation-4-architecture-evolution-over-time/

That would imply to me either that the original BW was overkill so they lowered it, or that they couldn't use the higher-BW GDDR chips for whatever reason and had to lower the system BW. With assumption one, everything is still balanced (graphics compute to bandwidth). With assumption two, there's excess compute relative to bandwidth, which perhaps allows the CPU to use the coherent buses to make better use of the under-utilised compute units.
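The bandwidth delta being discussed is easy to reconstruct from the leak, assuming the 256-bit GDDR5 interface the PS4 shipped with:

Code:
# GDDR5 bandwidth at 6.0 vs 5.5 Gbps per pin on a 256-bit bus (illustrative)
bus_width_bits = 256

for gbps_per_pin in (6.0, 5.5):
    gb_per_s = bus_width_bits * gbps_per_pin / 8   # bits -> bytes
    print(f"{gbps_per_pin} Gbps x {bus_width_bits}-bit bus -> {gb_per_s:.0f} GB/s")

So the move from 6.0 to 5.5 Gbps chips is the 192 -> 176 GB/s change, i.e. roughly the 16 GB/s difference that comes up again later in the thread.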
 
If you believe the original leaks, the PS4 was supposed to have a 192GB/sec bus (using 6.0 Gbps GDDR5 instead of the 5.5 Gbps it has now): http://www.vgleaks.com/playstation-4-architecture-evolution-over-time/

That would imply to me either that the original BW was overkill so they lowered it, or that they couldn't use the higher-BW GDDR chips for whatever reason and had to lower the system BW. With assumption one, everything is still balanced (graphics compute to bandwidth). With assumption two, there's excess compute relative to bandwidth, which perhaps allows the CPU to use the coherent buses to make better use of the under-utilised compute units.

I don't think balance was the only reason. There are also power consumption, price and the heat to be dissipated by the cooling system to put into the equation.

The PS4 in fact already consumes 150W with Killzone SF. If they could have added more bandwidth "freely", they would have done so, but more here is not free, and GDDR5 at 5.5 Gbps is certainly less expensive and less power hungry than at 6 Gbps.
 
Compute isn't completely negating the need for many ROP units, as the R9 290X comes with 64. What about tessellation? You're flooding the system with triangles. ROPs.
A new PC GPU needs to run the current games as fast as possible, because all the reviewers will use the currently available games as benchmarks. No PC GPU sells because it might be a good fit for the future. It will not sell if it gets trounced in current game benchmarks by the competition. When 290X was released, high end PC gaming was mostly about playing last generation (Xbox 360 and PS3) ports with high frame rate (60 fps+) and high resolution 1080p / 1440p / 1600p with some extra PC specific effects. Last generation games have simple shaders, and simple shaders benefit from a massive amount of ROPs (because simple shaders are not ALU bound). The extra PC specific effects are usually post processing. These extra effects don't gain performance from extra ROPs, but that doesn't matter much since the majority of the frame gets faster.

Tessellation doesn't use ROPs, it mainly stresses the fixed function primitive units and vertex/domain/hull shaders (= ALU and BW). GCN 1.1 does 2 primitives per clock (290X does four). That's the main limitation of tessellation. NVIDIA cards are better in tessellation than AMD cards, because NVIDIA cards can push more primitives per clock (because of distributed geometry engines).
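To put those primitive rates into absolute numbers (assuming ~800 MHz for a 2-primitive-per-clock console-class GCN part and ~1 GHz for the 290X; rough figures only):

Code:
# Peak triangle setup rates implied by primitives-per-clock (illustrative)
parts = {
    "2 prims/clk @ 800 MHz (PS4-class)": (2, 800e6),
    "4 prims/clk @ 1 GHz (290X-class)":  (4, 1000e6),
}
for name, (prims_per_clk, clock_hz) in parts.items():
    print(f"{name}: {prims_per_clk * clock_hz / 1e9:.1f} G primitives/s peak")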

Tessellating down to tiny triangles is not smart, because it decreases pixel shader quad efficiency, and that means the shader needs to run more times than necessary. However, this increase in pixel shader cost increases all the pixel based costs equally (not just the ROP cost), so the shader doesn't get any more ROP bound.

ROPs also make many things easier to solve than compute shader based solutions do. I haven't yet seen any studio using compute based particle gathering (single pass, super BW effective) instead of filling hundreds of alpha planes on top of each other with ROPs. But things will change in the future. We will see solutions like this for problems that are currently brute forced with ROPs.
 
This topic is not going to gain any traction or serve any purpose until those asking questions can really clarify them. I'm thinking that the real question here is, "what did Cerny mean by 'balance'?" and it's that single talking point that's spawned this rather odd view of PS4's graphics processor.

Cerny went so far as to say the 14+4 slide was not part of their evangelization, which may mean that one slide was not meant to be universally applicable, or that the slide could have misstated some kind of design rule of thumb.

One conservative interpretation is that the CU count is sized such that the PS4 was projected to have a high probability of having sufficiently large numbers of wavefront issue cycles free when running hypothetical present and future workloads in a target time frame.

There's a bunch of caveats:
1) It's heavily dependent on statistical analysis of workloads and their extrapolations to their future needs.
Sony could have projected a set of representative use cases for the fixed-function portions of the graphics pipeline, saw what number of CUs it could probably get away with without impacting their performance too much, and then opted to add a few more.

The GPU hardware is very good at allocating CUs, and it's not a heroic feat to burden all 18 CUs with graphics work. The GDC slide alone shows major parts of a frame where all 18 get used.
However, at least some of the time and with modest effort, some cycle time can be freed up.
Since this is forward-looking, only time will tell how accurate their predictions will be.

2) The GPU's granularity for injecting compute kernels into the overall workload is still somewhat coarse. Other analyses of AMD's architecture and the little slivers of unallocated compute in the GDC slide point to this. In such a case, it helps to have extra margin so that there is enough contiguous spare capacity, instead of it being fragmented so heavily across the whole frame that the overhead of splitting a kernel over many wavefronts and over long periods of time begins to dominate.


AMD offer a decently balanced part based on their GPU options with some feedback from Sony about what tweaks they'd like (32 ROPs being one obvious decision. We have Sebbbi telling us that's basically overkill and compute is more important going forwards, so that seems imbalanced).

This might depend on Sony's long-term goals and projected workloads, which may not be the same as a developer of any specific engine would be using.
There may be internal quirks to the architecture where doubling up on ROPs can help reduce periods of contention when CUs need to use the export bus to send data to the ROPs. That can buy small slivers of time where the CUs are able to export pixels faster or where they aren't waiting on other CUs to do the same.
Sony's desire to push a 3D headset might figure into that. The 1000 foot view has the ALU and pixel output side growing more in relation to geometry and the graphics front end, and the GDC graph somewhat aligns with it since the time frame is the all-important 16ms.
For a single camera and 33ms, it might not seem as balanced as when the architecture has to fight for far fewer ms while in stereoscopic mode.
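The time budgets behind that comparison are easy to make concrete (a rough sketch; the per-eye figure assumes both views are rendered back to back within one refresh):

Code:
# Frame-time budgets for mono 30 fps vs stereo 60 fps (illustrative)
print(f"30 fps, single view : {1000 / 30:.1f} ms per frame")
print(f"60 fps, single view : {1000 / 60:.1f} ms per frame")
print(f"60 fps, two views   : {1000 / 60 / 2:.1f} ms per eye if rendered sequentially")

That is the crude version; in practice some work can be shared between the two views.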


Probably the only balancing consideration that came into it was what CPU cores to have. Sony could have gone Piledriver, maybe held off for Steamroller, but that'd have taken up more die space meaning less room for CUs or a higher cost.
They might have dodged a bullet there. Given Kaveri's incomplete launch lineup, the PS4 might not have launched yet, and it probably would have been dependent on Globalfoundries' specialized 28nm APU-focused process--restricting production to a single fabrication partner (that AMD at one point paid hundreds of millions of dollars not to use).
Steamroller has an order of magnitude more process-specific custom macros built into it compared to Jaguar, so the chances of seeing a Steamroller PS4 coming from anywhere but GF seem small.


For example, normally the bandwidth of the L1, LDS and registers (in every CU) is tailored to the total number of CUs in the GPU (considering the graph). What happens if the bandwidth of the L1, LDS and registers (for each CU) and the L2 on a GPU is tailored for a lower number of CUs than that GPU has?
Registers, L1, and LDS bandwidth are fixed. Any graph of them is going to scale with the number of CUs and their clock.
L2 bandwidth will scale with the number of memory channels and the GPU clock.
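As a rough sketch of what "fixed per CU, scales with CU count and clock" means in numbers, using commonly cited GCN figures (64 bytes/clock of L1 and 128 bytes/clock of LDS per CU; treat these as assumptions, not PS4-specific disclosures):

Code:
# Aggregate on-chip bandwidth scaling with CU count (illustrative GCN figures)
clock_hz          = 800e6   # assumed GPU clock
l1_bytes_per_clk  = 64      # per CU (assumed)
lds_bytes_per_clk = 128     # per CU (assumed)

for cus in (14, 18):
    l1_gbs  = cus * l1_bytes_per_clk  * clock_hz / 1e9
    lds_gbs = cus * lds_bytes_per_clk * clock_hz / 1e9
    print(f"{cus} CUs: ~{l1_gbs:.0f} GB/s aggregate L1, ~{lds_gbs:.0f} GB/s aggregate LDS")

The point being that these figures are a property of each CU multiplied by the clock and the CU count; there is nothing to "tailor" down to 14 CUs.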

It's also helpful to use graphs with many data points that are normalized to the same number of CUs--with the added point that normalizing design bandwidths to some arbitrary number of CUs is not particularly meaningful.


I don't think balance was the only reason. There are also power consumption, price and the heat to be dissipated by the cooling system to put into the equation.
Another point that was raised was that clamshell mode impinges on the top clock a given bin can reach. Clamshell makes GDDR5 devices share various signal lines, which is not as clean electrically.

DRAM capacity doesn't really have much effect on power consumption, and clamshell mode means the power-hungry interface isn't growing.
 
I don't think balance was the only reason. There are also power consumption, price and the heat to be dissipated by the cooling system to put into the equation.

The PS4 in fact already consumes 150W with Killzone SF. If they could have added more bandwidth "freely", they would have done so, but more here is not free, and GDDR5 at 5.5 Gbps is certainly less expensive and less power hungry than at 6 Gbps.

The point I was trying to make is that originally you had a system design with 1.84Tflops of compute and 192GB/sec BW. For whatever reason, the BW was reduced, but the compute capability stayed the same.

Now, either you have excess compute relative to the new BW or you had excess BW before. If you have excess compute now, it makes sense to me to utilize that compute on data that can stay cache resident through the coherency buses and not go off chip through the external bus.
 
The point I was trying to make is that originally you had a system design with 1.84Tflops of compute and 192GB/sec BW. For whatever reason, the BW was reduced, but the compute capability stayed the same.
Possibly, this is because Sony went with 8GB of RAM.
Which is more valuable, 4 GB of capacity or 16GB/s of bandwidth?

Now, either you have excess compute relative to the new BW or you had excess BW before. If you have excess compute now, it makes sense to me to utilize that compute on data that can stay cache resident through the coherency buses and not go off chip through the external bus.

If you are using the coherent bus between the GPU and CPU, there is limited residency advantage.
GPU output over Onion+ works by not caching anything in the GPU caches, and writing over both forms of Onion invalidates shared lines in the CPU caches and writes back to memory.

There might be some bandwidth savings when the GPU reads directly from CPU caches, but that path exists more for the sake of correctness than for bandwidth savings. The latency of GPU compute and the limited capacity of CPU caches are such that it would be difficult to keep things tightly linked enough that the tiny cache can supply the massive bandwidth draw of the GPU without making one or the other stall a lot.

The process is generally more hands-off and usually requires a round trip through main memory.
 
Now, either you have excess compute relative to the new BW or you had excess BW before. If you have excess compute now, it makes sense to me to utilize that compute on data that can stay cache resident through the coherency buses and not go off chip through the external bus.
I don't think there's any such thing as 'excess' when it comes to hardware specs, within limits. It's only an excess if you can't use it. So 50 TB/s of main RAM BW would be an excess as there's no way any game engine implementation would need that much, but anything under 200 - 250 GB/s is probably usable. As software can solve a problem in numerous ways, it'll be written to the system. If there's more processing power, devs will rely on that. If there's more RAM and less processing power, devs will precompute and use lookups. If there's more BW, they'll use overdraw, and if there's more graphics processing power, they'll use more complex shaders.

The idea of 'excess' and 'balance' only works if you measure the specs against a preconception of what the engine should be doing. Well, every generation devs have had to reinvent the wheel as the HW's been completely different and not been compatible with the old ways of doing things. It's worth noting that the PC doesn't represent a good reference because it's having to deal with fairly arbitrary software implementations that scale to fairly arbitrary HW targets. A game can find itself on a system that doesn't suit it well (CPU heavy, GPU heavy, BW heavy). On console, the developers can know exactly what the HW offers and build to that, balancing the workloads to what's available. On PC, IHVs look at what games are doing and may try to balance to that (eg. look at legacy code and why it runs slow on x86, and fill x86 up with tricks to speed up weak code). On console, the console companies can provide any design, from simple PC like architecture to massive throughput overdraw designs to really beefy parallel SIMD CPUs to split pools to unified pools to eDRAM caches, and the final games will be balanced to that HW design after a bit of experience.
 
assuming that slide is official; I'd like to know your opinion on its credibility

Well, I think the issue is one of investing "credibility" in things that have the most context. What is on a slide, without the context of the presentation itself, may have less credibility than, say, someone answering a question with a bit more context involved. Similarly, a sentence quoted in an article without a lot more context, and with some unknowns in terms of translation, technical or otherwise, might have less "credibility" as well.

This whole balance issue seems to be one of framing, not a real metric to work with. Balance was framed as a strength of one platform, and so it can easily be misframed when its context is really just a matter of opinion.

Now what does
"The point is the hardware is intentionally not 100 percent round. It has a little bit more ALU in it than it would if you were thinking strictly about graphics."

mean? Well, that is another question. More ALUs per CU, or only on 4 CUs, or ... I mean, any major changes to the way the CUs have been laid down compared to other CUs may have an effect on yields (how much I don't know), as opposed to what is expected if one is using "off the shelf" CUs, for lack of a better term.
 
Now what does...mean? Well, that is another question.
It doesn't really matter what it means because it's untrue. :p Compute is fully programmable meaning devs can use it for anything they want. If they want to use it on new, never-before-thought-of rendering techniques, they can, and suddenly all that ALU is being used for graphics. It's daft to think of flexible computing resources as destined for a particular job as I mention above. Devs can and will commandeer the resources for their own ends, whether that's processing graphics on the CPU, or gameplay on the GPU, or vice versa.
 
290X was released, high end PC gaming was mostly about playing last generation (Xbox 360 and PS3) ports with high frame rate (60 fps+) and high resolution 1080p / 1440p / 1600p with some extra PC specific effects. Last generation games have simple shaders, and simple shaders benefit from a massive amount of ROPs (because simple shaders are not ALU bound). The extra PC specific effects are usually post processing. These extra effects don't gain performance from extra ROPs, but that doesn't matter much since the majority of the frame gets faster.

Saying that though, the 290X, despite having double the ROPs of the PS4, still has a higher ALU:ROP ratio than that console (although a little less than the XB1). So if ALU holds relatively more importance than ROPs for future game engines, then it could be argued that the 290X is a pretty forward looking design.

I've no idea how that fits in with memory bandwidth, though, since the 290X obviously has a lower bandwidth:ROP/ALU ratio than the PS4 (and much lower than the XB1).
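For concreteness, here are the ALU:ROP ratios being compared, using the commonly quoted shader and ROP counts; the bandwidth side is only sketched for two of the parts because the XB1's ESRAM makes its total hard to reduce to one number:

Code:
# ALU:ROP ratios and bandwidth per unit of compute (publicly quoted specs, rough)
gpus = {
    # name: (shader ALUs, ROPs)
    "R9 290X": (2816, 64),
    "PS4":     (1152, 32),
    "XB1":     ( 768, 16),
}
for name, (alus, rops) in gpus.items():
    print(f"{name:8s} ALU:ROP = {alus / rops:.0f}")

# XB1 left out of the bandwidth comparison because ESRAM muddies the total
print(f"R9 290X : {320 / 5.63:.0f} GB/s per TFLOP")   # ~320 GB/s, ~5.63 TFLOPS
print(f"PS4     : {176 / 1.84:.0f} GB/s per TFLOP")   # ~176 GB/s, ~1.84 TFLOPS

Which lines up with the post above: a higher ALU:ROP ratio than PS4, a bit below XB1, and noticeably less bandwidth per unit of compute than PS4.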

Perhaps you could give us your thoughts on the relative import of bandwidth vs ALU for future game engines?
 
Perhaps you could give us your thoughts on the relative import of bandwidth vs ALU for future game engines?
Console developers will always adapt to the available hardware in the long run. In the console world, there isn't unbalanced hardware, only unbalanced software. Blaming the hardware gains you nothing. That time should instead be spent in optimizing your code. Console hardware never changes, but you can change your code to run better on that hardware.
 
Console developers will always adapt to the available hardware in the long run. In the console world, there isn't unbalanced hardware, only unbalanced software. Blaming the hardware gains you nothing. That time should instead be spent in optimizing your code. Console hardware never changes, but you can change your code to run better on that hardware.

Thread closed ;)
 
Now what does

mean? Well, that is another question. More ALUs per CU, or only on 4 CUs, or ... I mean, any major changes to the way the CUs have been laid down compared to other CUs may have an effect on yields (how much I don't know), as opposed to what is expected if one is using "off the shelf" CUs, for lack of a better term.

The ALU throughput numbers and design description have made it clear that there is no additional ALU capacity beyond what is standard for an array of 18 active CUs.

The straightforward interpretation is that there are more CUs than would be strictly necessary to get most of the performance out of the fixed function pipeline for some averaged set of graphics loads that don't use compute shaders.
This would not be true for every possible situation, but would be true on average over the majority of the cases they profiled.

The inverse of this interpretation is that there is some rough line where they reasoned you could get what counts as "good enough" purely fixed-function utilization without noticeable degradation in performance, and they went a certain number of CUs beyond it.

It's not guaranteed at this point that their estimates will turn out to be where the ideal balancing point might be, and there will likely be examples where games can go either way, or possibly certain stretches of time within games where the inflection point changes. The GPU should be flexible enough to allocate CUs in the event that a graphics load manages to hit pure graphics functionality in such a manner that more than 14 CUs would be needed to avoid a performance hit, or where the fixed-function pipeline is so lightly burdened that fewer would be necessary.
 
The ALU throughput numbers and design description have made it clear that there is no additional ALU capacity beyond what is standard for an array of 18 active CUs.

The straightforward interpretation is that there are more CUs than would be strictly necessary to get most of the performance out of the fixed function pipeline for some averaged set of graphics loads that don't use compute shaders.
This would not be true for every possible situation, but would be true on average over the majority of the cases they profiled.

Totally agree. I'm all about this answer and all of the other similar answers that have come about when 14+4 rears its head :yes:
 
A new PC GPU needs to run the current games as fast as possible, because all the reviewers will use the currently available games as benchmarks. No PC GPU sells because it might be a good fit for the future. It will not sell if it gets trounced in current game benchmarks by the competition. When 290X was released, high end PC gaming was mostly about playing last generation (Xbox 360 and PS3) ports with high frame rate (60 fps+) and high resolution 1080p / 1440p / 1600p with some extra PC specific effects. Last generation games have simple shaders, and simple shaders benefit from a massive amount of ROPs (because simple shaders are not ALU bound). The extra PC specific effects are usually post processing. These extra effects don't gain performance from extra ROPs, but that doesn't matter much since the majority of the frame gets faster.

Tessellation doesn't use ROPs, it mainly stresses the fixed function primitive units and vertex/domain/hull shaders (= ALU and BW). GCN 1.1 does 2 primitives per clock (290X does four). That's the main limitation of tessellation. NVIDIA cards are better in tessellation than AMD cards, because NVIDIA cards can push more primitives per clock (because of distributed geometry engines).

Tessellating down to tiny triangles is not smart, because it decreases pixel shader quad efficiency, and that means the shader needs to run more times than necessary. However, this increase in pixel shader cost increases all the pixel based costs equally (not just the ROP cost), so the shader doesn't get any more ROP bound.

ROPs also make many things easier to solve than compute shader based solutions do. I haven't yet seen any studio using compute based particle gathering (single pass, super BW effective) instead of filling hundreds of alpha planes on top of each other with ROPs. But things will change in the future. We will see solutions like this for problems that are currently brute forced with ROPs.

Maybe Sucker Punch; they use GPGPU for particles. And the particles are native res, maybe because it is more BW effective. And that is very good, because quarter res particles are highly visible... :)
 