Of course. You would never allocate a bunch of die space to 6T-SRAM if you thought that your planned huge external pool would keep a bigger processor well-fed. That would be silly.
I'm sure it's 6 available for games. The PS3's Cell was fabbed with 8 SPUs, with 1 SPU there for yield redundancy, so there are only ever 7 active SPUs in shipping machines, and 1 of those is reserved for the OS.
PS4 and Xbox One are both basically the same generation of GCN. Latency on X1 main memory is lower, but latency is not significant in this benchmark (as stated in the presentation). It doesn't appear that ESRAM is used - or perhaps could even be used - to accelerate this workload. Looking at main memory:
In other words, X1 has 0.54 times the main memory bandwidth per flop of the PS4 if you're working outside of that little pool of ESRAM. So in a main-memory-bandwidth-limited compute benchmark - given the identical architectures - we should expect the PS4 to outperform the X1 by far more than its usual margin, and even far beyond the theoretical TF difference. By up to 1.84 times in terms of performance/TF, in fact. So how does it actually compare? Is this bad?
Yes. This is bad. X1 is only running at 73% of the efficiency of PS4 in this test, meaning PS4 performs an enormous 37% faster per flop, as measured by dancers per gigaflop.*
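The arithmetic behind those figures can be checked quickly. This sketch assumes the commonly quoted launch specs (not stated in the presentation itself): 68 GB/s DDR3 at 1.31 TF for the X1's main memory, and 176 GB/s GDDR5 at 1.84 TF for the PS4.

```python
# Bandwidth-per-flop comparison, using the commonly quoted launch
# specs (assumptions, not figures from the presentation):
x1_bw, x1_tf = 68.0, 1.31      # X1 main memory GB/s, TFLOPS
ps4_bw, ps4_tf = 176.0, 1.84   # PS4 main memory GB/s, TFLOPS

x1_bw_per_tf = x1_bw / x1_tf       # ~51.9 GB/s per TFLOP
ps4_bw_per_tf = ps4_bw / ps4_tf    # ~95.7 GB/s per TFLOP

ratio = x1_bw_per_tf / ps4_bw_per_tf
print(round(ratio, 2))       # 0.54 - X1's main memory BW per flop vs PS4's
print(round(1 / ratio, 2))   # 1.84 - the potential PS4 advantage per TF

# The measured result: X1 at 73% of PS4's per-flop efficiency,
# i.e. PS4 roughly 37% faster per flop (1 / 0.73).
print(round(1 / 0.73, 2))    # 1.37
```

So the observed 37% per-flop gap is large, but still well short of the 84% that a purely main-memory-bandwidth-bound workload could show.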
> I'm sure it's 6 available for games. The PS3's Cell was fabbed with 8 SPUs, with 1 SPU there for yield redundancy, so there are only ever 7 active SPUs in shipping machines, and 1 of those is reserved for the OS.
I think that at least originally games had to hand back another of the SPUs if the OS wanted it. I've seen references in developer slides to using 5 - 6 SPUs for tasks...?
Cell clearly was a physics beast, and it was fabled for its high efficiency, which was in large part thanks to its memory architecture, the SPU Local Stores in particular. That too would correlate with this test being strongly influenced by memory bandwidth, with the ESRAM in this case not being able to help much (though in other cases it could). Super Stardust still stands out to me as a nice example of a game showing it off, though the well-optimised Havok in Motorstorm was a good example too (and some of the effects in Uncharted as well).
How surprising and refreshing to see such a benchmark in the open though - that so rarely happens.
And glad to see that compute is the real deal on the new consoles. Should make for some interesting stuff (like Resogun, of course, but I'm sure we'll see more).
Maybe I'm getting this wrong, but aren't these performance measurements recorded with the entire system doing nothing but the cloth simulation? That'd hardly be representative of real-life cases, where a complete game with all its components would also be running on the same resources. The additional bottlenecks and such could alter the real-world results significantly.
For example, how much main memory bandwidth would be left after game code, rendering and such?
Or, how many characters would / could a game work with in the end, and so how much actual performance advantage would remain?
> I think that at least originally games had to hand back another of the SPUs if the OS wanted it. I've seen references in developer slides to using 5 - 6 SPUs for tasks...?
> Maybe I'm getting this wrong, but aren't these performance measurements recorded with the entire system doing nothing but the cloth simulation? That'd hardly be representative of real-life cases, where a complete game with all its components would also be running on the same resources. The additional bottlenecks and such could alter the real-world results significantly.
> For example, how much main memory bandwidth would be left after game code, rendering and such?
> Or, how many characters would / could a game work with in the end, and so how much actual performance advantage would remain?
They give this example: 2 ms for 624 characters on PS4 with cloth simulation. If they want 30 fps, that's a 33 ms frame, which leaves 31 ms for doing other stuff with the GPU.
And it will be used in Assassin's Creed Unity, Far Cry 4, The Division...
With 8 ACEs, the Onion+ bus, volatile bits and probably other GPGPU optimizations in the PS4, the result is not so surprising...
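The frame-budget arithmetic above, worked through explicitly (the per-character figure assumes linear scaling, which the presentation doesn't confirm):

```python
# Frame-budget arithmetic for the quoted figures: 2 ms of GPU time
# for 624 cloth-simulated characters on PS4, targeting 30 fps.
frame_ms = 1000.0 / 30      # 30 fps -> ~33.3 ms per frame
cloth_ms = 2.0              # reported cost of the cloth pass

remaining_ms = frame_ms - cloth_ms
print(round(remaining_ms, 1))   # 31.3 ms left for rendering and other GPU work

# Cost per character, assuming it scaled linearly (an assumption):
per_char_us = cloth_ms / 624 * 1000
print(round(per_char_us, 2))    # ~3.21 microseconds per character
```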
> They give this example: 2 ms for 624 characters on PS4 with cloth simulation. If they want 30 fps, that's a 33 ms frame, which leaves 31 ms for doing other stuff with the GPU.
> And it will be used in Assassin's Creed Unity, Far Cry 4, The Division...
> With 8 ACEs, the Onion+ bus, volatile bits and probably other GPGPU optimizations in the PS4, the result is not so surprising...
Yeah, and IIRC ND use a jobs-based system that lets them allocate work across all cores, and so do DICE, who IIRC talked about using 5 - 6 SPUs for some tasks. This should be able to handle folding work back onto the other SPUs if time was taken away from one of them.
One of the (very early) comments I'm 98% sure I remember was that one of the SPUs you couldn't guarantee having all the time, so it wasn't suitable for performance-critical work. That would seem to fit with a hard-coded thread system.
It'd be interesting to know if I'm remembering correctly. It might have been 1 SPU = hypervisor only, 1 SPU could have some time taken to do stuff like the guide or background downloads or something, but was normally fully available.
> It'd be interesting to know if I'm remembering correctly. It might have been 1 SPU = hypervisor only, 1 SPU could have some time taken to do stuff like the guide or background downloads or something, but was normally fully available.
Where's Joker when you need him? Perhaps there was a reservation that disappeared later along with the diminishing OS RAM footprint, or perhaps there was always a reservation but it wasn't onerous, so as long as you didn't stress the 6th SPU it wasn't an issue.
> They give this example: 2 ms for 624 characters on PS4 with cloth simulation. If they want 30 fps, that's a 33 ms frame, which leaves 31 ms for doing other stuff with the GPU.
> And it will be used in Assassin's Creed Unity, Far Cry 4, The Division...
> With 8 ACEs, the Onion+ bus, volatile bits and probably other GPGPU optimizations in the PS4, the result is not so surprising...
Any ideas as to how the 8 Asynchronous Compute Engines and volatile bits, or the Onion+ HSA bus, are leading to these 'unsurprising' results, given that they don't appear to be using asynchronous compute or HSA-unique features?
Or are you just attributing the results to these things because you've heard about them, and because they're PS4-specific (at least in the console space)?
> Where's Joker when you need him? Perhaps there was a reservation that disappeared later along with the diminishing OS RAM footprint, or perhaps there was always a reservation but it wasn't onerous, so as long as you didn't stress the 6th SPU it wasn't an issue.
Maybe it disappeared. Or maybe it wasn't an issue once developers moved to more flexible jobs based systems where it would always be a significant win to use it. Or perhaps I went mad trying to calculate dancers per gigaflop.
Would any of this asynchronous GPU compute work on the Wii U? I know VLIW5 was not considered very good at this type of work, but if it's simply making use of GPU down time, then any additional work done is beneficial. For example, even if the Wii U GPU could only complete 25 dancers in the allotted time, if that didn't hurt graphics rendering performance at all, then that would be less work the CPU would have to do.
> Any ideas as to how the 8 Asynchronous Compute Engines and volatile bits, or the Onion+ HSA bus, are leading to these 'unsurprising' results, given that they don't appear to be using asynchronous compute or HSA-unique features?
> Or are you just attributing the results to these things because you've heard about them, and because they're PS4-specific (at least in the console space)?
I'm searching for the reason behind this result, because the unoptimized compute shader was bandwidth-bound, and after they used some compression and the LDS they gained nearly 100% in performance, and they optimized for better CU efficiency on the two consoles and PC...
If it is not bandwidth, it is probably something else...
And they didn't give any details about the Xbox One version. We have many more details about the PS4 version. No details about ESRAM or main RAM usage...
> Or are you just attributing the results to these things because you've heard about them, and because they're PS4-specific (at least in the console space)?
Is Onion+ lacking from the XBO? That's surprising, considering it's also in Kaveri. I'd have assumed the 3 were virtually identical aside from the unit counts.