GDC paper on compute-based Cloth Physics including CPU performance

No wonder the XBone settled for a less capable GPU with fewer shaders. Additional shaders would have been sitting idle, starved for BW.
Of course. You would never allocate a bunch of die space to 6T-SRAM if you thought that your planned huge external pool would keep a bigger processor well-fed. That would be silly.
 
They only use 5 SPUs because 5 SPUs is all you have available for games!
I'm sure it's 6 available for games. The PS3's Cell is fabled for its 8 SPUs, but 1 is disabled for redundancy, so there are only ever 7 active SPUs in shipping machines, and 1 of those is reserved for the OS.
 
Running some numbers

PS4 and Xbox One are both basically the same generation of GCN. Latency on X1 main memory is lower, but latency is not significant in this benchmark (as stated in the presentation). It doesn't appear that esram is used - or perhaps could even be used - to accelerate this workload. Looking at main memory:

PS4: 176 GB/s; 1.84 TF; 95.65 GB/TF
X1: 68 GB/s; 1.31 TF; 51.91 GB/TF

X1 GB/TF / PS4 GB/TF = 51.91 / 95.65 = 0.54

In other words, the X1 has 0.54 times the main memory bandwidth per flop of the PS4 if you're working outside of that little pool of esram. So in a main-memory-BW-limited compute benchmark - given the identical architectures - we should expect the PS4 to outperform the X1 by far more than its usual margin, even beyond the theoretical TF difference. By up to 1.84 times in terms of performance/TF, in fact. So how does it actually compare? Is this bad?
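Quick sanity check of the bandwidth-per-flop numbers, in case anyone wants to rerun them (a throwaway Python snippet using only the peak figures quoted above, nothing from the actual benchmark):

```python
# Main memory bandwidth per flop, from the peak figures quoted above.
ps4_bw, ps4_tf = 176.0, 1.84   # GB/s, TFLOPS
x1_bw,  x1_tf  = 68.0, 1.31    # GB/s (DDR3 only, no esram), TFLOPS

ps4_gb_per_tf = ps4_bw / ps4_tf   # ~95.65 GB/TF
x1_gb_per_tf  = x1_bw / x1_tf     # ~51.91 GB/TF

print(f"PS4: {ps4_gb_per_tf:.2f} GB/TF")
print(f"X1:  {x1_gb_per_tf:.2f} GB/TF")
print(f"X1/PS4 ratio: {x1_gb_per_tf / ps4_gb_per_tf:.2f}")  # ~0.54
```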

Dancer benchmark, 5 ms GPU compute time:

PS4: 1600 dancers / (1.84 TF/s × 5 ms) = 1600 dancers / 9.2 GF = 173.91 dancers/GF
X1: 830 dancers / (1.31 TF/s × 5 ms) = 830 dancers / 6.55 GF = 126.72 dancers/GF

126.72 / 173.91 = 0.73

Yes. This is bad. X1 is only running at 73% of the efficiency of PS4 in this test, meaning PS4 performs an enormous 37% faster per flop, as measured by dancers per gigaflop.*
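Same thing for the dancer numbers (again just a throwaway Python snippet; 1600 and 830 dancers at 5 ms are the figures from the presentation as quoted above):

```python
# Dancers per gigaflop in a 5 ms GPU compute budget.
budget_s = 0.005                  # 5 ms of GPU time

ps4_gf = 1840.0 * budget_s        # peak GFLOPs available in 5 ms (~9.2)
x1_gf  = 1310.0 * budget_s        # ~6.55

ps4_eff = 1600 / ps4_gf           # ~173.9 dancers/GF
x1_eff  = 830 / x1_gf             # ~126.7 dancers/GF

print(f"PS4: {ps4_eff:.2f} dancers/GF")
print(f"X1:  {x1_eff:.2f} dancers/GF")
print(f"X1 relative efficiency: {x1_eff / ps4_eff:.2f}")  # ~0.73
```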

Conclusion: without esram X1 GPU be like :-x

*
lol. Dancers per gigaflop.
 
I'm sure it's 6 available for games. The PS3's Cell is fabled for its 8 SPUs, but 1 is disabled for redundancy, so there are only ever 7 active SPUs in shipping machines, and 1 of those is reserved for the OS.

I think that, at least originally, games had to hand back another of the SPUs if the OS wanted it. I've seen references in developer slides to using 5-6 SPUs for tasks...?
 
Cell clearly was a physics beast, and it was fabled for its high efficiency, which was in large part thanks to its memory architecture, the SPU Local Stores in particular. That too would correlate with this test being strongly influenced by memory bandwidth, with the ESRAM in this case not being able to help much (though in other cases it could). Super Stardust still stands out to me as a nice example of a game showing it off, though the well-optimised Havok in Motorstorm was a good example too (and some of the effects in Uncharted as well).

How surprising and refreshing to see such a benchmark in the open though - that so rarely happens.

And glad to see that compute is the real deal on the new consoles. Should make for some interesting stuff (like Resogun, of course, but I'm sure we'll see more).
 
Maybe I'm getting it wrong, but aren't these performance measurements recorded with the entire system doing nothing but the cloth stuff? That'd hardly be representative of real-life cases, where there'd also be a complete game with all its components running on the same resources. The additional bottlenecks and such could alter the real-world results significantly.

For example, how much main memory bandwidth would be left after game code, rendering and such?
Or, how many characters would / could a game work with in the end, and so how much actual performance advantage would remain?
 
Maybe I'm getting it wrong, but aren't these performance measurements recorded with the entire system doing nothing but the cloth stuff? That'd hardly be representative of real-life cases, where there'd also be a complete game with all its components running on the same resources. The additional bottlenecks and such could alter the real-world results significantly.

For example, how much main memory bandwidth would be left after game code, rendering and such?
Or, how many characters would / could a game work with in the end, and so how much actual performance advantage would remain?

They give this example: 2 ms for 624 characters on PS4 with cloth simulation. If they target 30 fps, that's ~33 ms per frame, which leaves about 31 ms for doing other stuff on the GPU.

And it will be used in Assassin's Creed Unity, Far Cry 4, The Division...

With 8 ACEs, the Onion+ bus, volatile bits and probably other GPGPU optimizations in the PS4, the result is not so surprising...
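A quick check of that frame-budget maths (a small Python sketch using only the 2 ms / 624 characters figure quoted above):

```python
# GPU time left per frame after the cloth pass, at 30 fps.
frame_ms = 1000.0 / 30.0   # ~33.3 ms per frame
cloth_ms = 2.0             # quoted PS4 cost for 624 characters

print(f"frame budget: {frame_ms:.1f} ms")
print(f"left for everything else: {frame_ms - cloth_ms:.1f} ms")  # ~31.3 ms
```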
 
They give this example: 2 ms for 624 characters on PS4 with cloth simulation. If they target 30 fps, that's ~33 ms per frame, which leaves about 31 ms for doing other stuff on the GPU.

And it will be used in Assassin's Creed Unity, Far Cry 4, The Division...

With 8 ACEs, the Onion+ bus, volatile bits and probably other GPGPU optimizations in the PS4, the result is not so surprising...

Question: is the same cloth movement simulated on all 624 dancers, or a different movement per cloth?
 
From this Naughty Dog presentation on Uncharted 2 development, slide 142 states they use all six SPUs.

Yeah, and IIRC ND use a jobs-based system to allow them to allocate work to all cores, and so do DICE, who IIRC talked about using 5-6 SPUs for some tasks. This should be able to handle folding work back onto the other SPUs if time was taken away from one of the SPUs.

One of the (very early) comments I'm 98% sure I remember was that you couldn't guarantee having one of the SPUs all the time, so it wasn't suitable for performance-critical work. That would seem to fit with a hard-coded thread system.

It'd be interesting to know if I'm remembering correctly. It might have been 1 SPU = hypervisor only, 1 SPU could have some time taken to do stuff like the guide or background downloads or something, but was normally fully available.
 
It'd be interesting to know if I'm remembering correctly. It might have been 1 SPU = hypervisor only, 1 SPU could have some time taken to do stuff like the guide or background downloads or something, but was normally fully available.
Where's Joker when you need him? Perhaps there was a reservation that disappeared later along with the diminishing OS RAM footprint, or perhaps there was always a reservation but it wasn't onerous, so as long as you didn't stress the 6th SPU it wasn't an issue.

JOKER?
 
They give this example: 2 ms for 624 characters on PS4 with cloth simulation. If they target 30 fps, that's ~33 ms per frame, which leaves about 31 ms for doing other stuff on the GPU.

And it will be used in Assassin's Creed Unity, Far Cry 4, The Division...

With 8 ACEs, the Onion+ bus, volatile bits and probably other GPGPU optimizations in the PS4, the result is not so surprising...

Any ideas as to how the 8 Asynchronous Compute Engines and volatile bits, or the Onion+ HSA bus, are leading to these 'unsurprising' results, given that they don't appear to be using asynchronous compute or HSA-unique features?

Or are you just attributing the results to these things because you've heard about them, and because they're PS4 specific (at least in the console space)?
 
Where's Joker when you need him? Perhaps there was a reservation that disappeared later along with the diminishing OS RAM footprint, or perhaps there was always a reservation but it wasn't onerous, so as long as you didn't stress the 6th SPU it wasn't an issue.

Maybe it disappeared. Or maybe it wasn't an issue once developers moved to more flexible jobs based systems where it would always be a significant win to use it. Or perhaps I went mad trying to calculate dancers per gigaflop.


I think you have to say it backwards while looking into a mirror ...
 
Would any of this asynchronous GPU compute work on the Wii U? I know VLIW5 was not considered very good at this type of work, but if it's simply making use of GPU downtime, then any additional work done is beneficial. For example, even if the Wii U GPU could only complete 25 dancers in the allotted time, if that didn't hurt graphics rendering performance at all, then that would be less work the CPU would have to do.
 
Any ideas as to how the 8 Asynchronous Compute Engines and volatile bits, or the Onion+ HSA bus, are leading to these 'unsurprising' results, given that they don't appear to be using asynchronous compute or HSA-unique features?

Or are you just attributing the results to these things because you've heard about them, and because they're PS4 specific (at least in the console space)?

I'm searching for a reason for this result, because the unoptimized compute shader was bandwidth bound, and after they used some compression and the LDS they gained nearly 100% in performance, and they optimized for better CU efficiency on the two consoles and PC...

If it is not bandwidth, it is probably something else...

And they didn't give any details about the Xbox One version. We have much more detail about the PS4 version. No details about ESRAM or main RAM usage...

Only performance numbers...
 
Or are you just attributing the results to these things because you've heard about them, and because they're PS4 specific (at least in the console space)?

Is Onion+ lacking from the XBO? That's surprising considering it's also in Kaveri. I'd have assumed the three were virtually identical aside from the unit counts.
 
Is Onion+ lacking from the XBO? That's surprising considering it's also in Kaveri. I'd have assumed the three were virtually identical aside from the unit counts.

No Onion+, no volatile bits in the Xbox One GPU. The PS4's GPGPU architecture is the same as Kaveri's, not the Xbox One GPU's...
 