GDC paper on compute-based Cloth Physics including CPU performance

The optimised shader was still bandwidth limited on the vastly more bandwidth-constrained X1. Possibly it still was to some extent on the PS4 as well - just look at the enormous speedup from packing the vertex and normal data!
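
The slides don't give the packed formats, so purely as a sketch of the kind of vertex/normal packing being described (half-precision positions plus a 10:10:10 normal, with CUDA standing in for the talk's DirectCompute shaders - the layout and names here are assumptions, not the paper's):

```
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cmath>
#include <cstdint>

// Hypothetical packed layout: fp16 position (plus a pad half) and a 10:10:10
// normal, 12 bytes per vertex instead of 24 for fp32 position + fp32 normal.
struct PackedVertex {
    __half   x, y, z, pad;    // 8 bytes
    uint32_t packedNormal;    // 4 bytes
};

// Map a component in [-1, 1] to a 10-bit unsigned field.
__host__ __device__ inline uint32_t packUnorm10(float v) {
    float c = fminf(fmaxf(v * 0.5f + 0.5f, 0.0f), 1.0f);
    return (uint32_t)(c * 1023.0f + 0.5f);
}

__host__ __device__ inline uint32_t packNormal(float nx, float ny, float nz) {
    return packUnorm10(nx) | (packUnorm10(ny) << 10) | (packUnorm10(nz) << 20);
}

__device__ inline float unpackUnorm10(uint32_t bits) {
    return (bits & 0x3FFu) / 1023.0f * 2.0f - 1.0f;
}

// Each thread now pulls 12 bytes per vertex instead of 24. If the pass is
// bandwidth bound, halving the bytes read per vertex is roughly where the
// speedup comes from.
__global__ void readPackedVertices(const PackedVertex* __restrict__ verts,
                                   float3* pos, float3* nrm, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;

    PackedVertex v = verts[i];
    pos[i] = make_float3(__half2float(v.x), __half2float(v.y), __half2float(v.z));
    nrm[i] = make_float3(unpackUnorm10(v.packedNormal),
                         unpackUnorm10(v.packedNormal >> 10),
                         unpackUnorm10(v.packedNormal >> 20));
}
```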

My point is that neither I nor you know where the compute shader's memory is allocated on Xbox One, yet you have decided that it is in main memory.

They never talk about any memory allocation for Xbox One in the presentation; when they give intermediate results for the optimisations they only give generic CPU and GPU numbers. They go into a little more detail about the PS4 version...

Maybe they decided to use main RAM after all, but I don't see anything about it in the document...

It is not written anywhere in the presentation...
 
And for clarification, the optimized shader is the one after all the optimizations: data compression (packing of data), use of LDS and so on... It is that compute shader which gives the final performance numbers...

And leaving PS4 and Xbox One aside, maybe the unoptimized compute shader is bandwidth bound even on PC GPUs with more bandwidth...

They say that on all machines compute shaders are mostly bandwidth bound.
 
150%???? The ESRAM theoretical bandwidth is
[..]
If the performance of one system is limited by bandwidth, it is the PS4


Theoretical bandwidth is only useful for advertising. Real numbers and real workloads matter.

PS4 offers a much more convenient way to work than ESRAM w/DMA controller.

Also, multiplatform engines are a different beast than 1st party titles.
Managing resources the way MS wants with ESRAM requires additional changes in the backend, and likely in the front-end, to accommodate it.
It is probably much simpler for game devs to just cache buffers in it, at least for now...
 
A general rule from slide 53 is that compute shading is bandwidth bound most of the time... For the cloth simulation they pack the data, use LDS, and so on... The optimizations reduce the pressure on VRAM for PC GPUs, on main RAM or ESRAM for Xbox One, and on GDDR5 for PS4...
 
And the more bandwidth-bound system is going to suffer the most from bandwidth-constrained workloads.

And once again, please explain how, given the information in slides 58 and 103, 'esram' fits with running all 800+ dancers in a single, straight run through 5 ms of GPU time. And remember, the move engines have 25.6 GB/s of BW to move data in and out of esram. That's lower than DDR3 BW.

They aren't talking about DMAing in a few megabytes while working on other stuff, then "bursting" on that workload (possible use for async anyone?), then switching back to other duties while they set up the next workload. This is a 5 ms straight run to get as many dancers updated as they can.
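
(For scale, using only the figures above: 25.6 GB/s sustained for the whole 5 ms window is 25.6 GB/s × 0.005 s ≈ 0.13 GB, i.e. roughly 128 MB is the absolute ceiling on what the move engines could shuttle into or out of esram during that run, and that assumes the window is spent copying rather than simulating.)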

Offer your hypothesis for how esram could be offering significant gains for this particular test.
 

I forgot about the 5 ms at the beginning of the presentation. Maybe the problem is a bandwidth one...

Edit: I agree with you, there is a 99% chance the problem is main RAM bandwidth.
 
Running compute on an IP level older than Cayman would require a context switch of the GPU, basically wiping or writing back a good portion of the chip's context and reinitializing it in a compute mode for a little while, then doing another flush and reinitialization to graphics.
The latencies for that operation are brutal.
Pre-Cayman hardware wasn't that bad. Launch of compute work was synchronous with the graphics pipeline, but it didn't require a flush or writing context data out to memory. The Evergreen generation was more capable than r7xx though.
 
Correct me if I'm wrong!
The code seems to be DirectX 11, and that means that none of the low-level advantages of the console APIs in terms of CPU usage are used in the CPU test. Right?
 
On slide 103 they give general advice to use LDS when a compute shader accesses the same data multiple times or accesses it randomly. Maybe some compute shaders will work better from main RAM...
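
To illustrate that slide-103 advice, here is a made-up CUDA sketch with shared memory playing the role of LDS and a toy 1-D neighbour pattern (the talk's own shaders are DirectCompute and aren't shown, so the access pattern below is an assumption):

```
#include <cuda_runtime.h>

// Illustrative only: a 1-D constraint/relaxation pass where each thread needs
// its own particle plus its two neighbours, so every position is read by up
// to three different threads. Staging the block's particles in shared memory
// (the CUDA analogue of LDS / groupshared) means each position is fetched
// from DRAM once per thread group and re-read from on-chip memory after that.
__global__ void relaxPositions(const float3* __restrict__ posIn,
                               float3* posOut, int count) {
    extern __shared__ float3 tile[];   // blockDim.x + 2 entries (one halo slot per side)
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x;

    // One global load per thread (clamped at the ends; assumes count > 0).
    tile[lid + 1] = posIn[min(gid, count - 1)];
    if (lid == 0)
        tile[0] = (gid > 0) ? posIn[gid - 1] : tile[1];
    if (lid == blockDim.x - 1)
        tile[lid + 2] = (gid + 1 < count) ? posIn[gid + 1] : tile[lid + 1];
    __syncthreads();

    if (gid >= count) return;

    // Toy relaxation: nudge the particle toward the midpoint of its neighbours.
    float3 l = tile[lid], c = tile[lid + 1], r = tile[lid + 2];
    posOut[gid] = make_float3(c.x + 0.25f * (l.x + r.x - 2.0f * c.x),
                              c.y + 0.25f * (l.y + r.y - 2.0f * c.y),
                              c.z + 0.25f * (l.z + r.z - 2.0f * c.z));
}

// Launch sketch: dynamic shared memory sized for the tile plus its halo, e.g.
// relaxPositions<<<blocks, threads, (threads + 2) * sizeof(float3)>>>(dIn, dOut, n);
```

The point is simply that each position gets fetched from DRAM once per thread group and is re-read from on-chip memory after that, which is what makes the repeated or random accesses cheap.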
 
I guess diminishing returns are a big thing now.

Looking at these GDC slides, you can see that a lot of work (and therefore money) went into allowing for better - or at least larger scale - simulation of cloth. That's really cool, but the work, flops and even power (watts) required for what will likely be a relatively minor perceived improvement to the game's overall visual impact - one that will likely have no effect on gameplay mechanics - is pretty telling.

Now I'm sure they've learned a lot about compute on GCN that they can apply to solving or accelerating other problems, but still it's pretty easy to start thinking that big perceived jumps like Virtua Fighter, Mario 64, Gran Turismo, Shenmue, MGS2 or Halo 1 might simply be beyond the rate of technology improvements and budget increases. At least, without something new really shaking things up.
 