GDC paper on compute-based Cloth Physics including CPU performance

Is this asynchronous compute or regular compute? Or is the former implied?

It isn't mentioned in the slides.

Take a look at some GCN benchmarks, paying particular attention to the 290X with its 8 ACEs and ~5.6 Tflops vs the 7970 / 280X with its 2 ACEs and ~3.9 Tflops.

http://www.tomshardware.com/reviews/radeon-r9-290x-hawaii-review,3650-34.html

http://www.anandtech.com/show/7457/the-radeon-r9-290x-review/18

Not even close to a 100% performance increase.

The Xbox One is bottlenecked in this case, badly, and it has to be by main memory bandwidth.

Edit: this might actually be a good fit for asynchronous compute on the Xbox One!
 
Yes, it doesn't seem like any of this is relevant to asynchronous compute. The same goes for those benchmarks, though, so the efficacy of the added ACEs won't show up there. The added ACEs will shine when compute is jammed in alongside traditional GPU workloads.

One of the slides mentioned creating a huge compute shader in order to avoid too many CPU dispatch requests. I remember reading somewhere that that is a big no-no, in that scheduling long-running compute processes causes headaches for scheduling the rest of the normal tasks.
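
To make the trade-off concrete, here's a minimal CUDA sketch (a PC analogy, not their actual console code; all names and sizes are made up) of folding many cloth instances into a single dispatch instead of issuing one dispatch per cloth. Fewer CPU submissions, but one long-occupying launch:

Code:
// Hypothetical sketch: one launch covering many cloth instances,
// instead of one launch per instance. CUDA used purely to illustrate
// the dispatch-overhead vs long-running-launch trade-off.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void updateClothBatched(float3* positions, int vertsPerCloth, int numCloths)
{
    int cloth = blockIdx.y;                          // one row of blocks per cloth instance
    int v = blockIdx.x * blockDim.x + threadIdx.x;   // vertex within that cloth
    if (cloth >= numCloths || v >= vertsPerCloth) return;

    int idx = cloth * vertsPerCloth + v;
    positions[idx].y -= 0.001f;                      // stand-in for the real simulation step
}

int main()
{
    const int numCloths = 64, vertsPerCloth = 4096;  // invented sizes
    float3* d_pos = nullptr;
    cudaMalloc(&d_pos, sizeof(float3) * numCloths * vertsPerCloth);
    cudaMemset(d_pos, 0, sizeof(float3) * numCloths * vertsPerCloth);

    // Option A: one CPU-issued launch per cloth -> numCloths submissions.
    // Option B (what the slides seem to describe): a single fat launch.
    dim3 block(256);
    dim3 grid((vertsPerCloth + block.x - 1) / block.x, numCloths);
    updateClothBatched<<<grid, block>>>(d_pos, vertsPerCloth, numCloths);
    cudaDeviceSynchronize();

    printf("done: %d cloths in one dispatch\n", numCloths);
    cudaFree(d_pos);
    return 0;
}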
 
That first slide looks like it might be oversimplifying things. With compute, the general CPU power can be brought up to the same levels as the GPU? How accurate is this?
 
Does asynchronous compute work on top of synchronous compute? I was under the impression that it was for running work alongside the traditional rendering pipeline.

In other words, how are you proposing that asynchronous compute is adding 100% performance on top of compute shaders? Additionally, where are you seeing that they're actually using asynchronous compute in this benchmark? I can't see a single reference to it.

Is this asynchronous compute or regular compute? Or is the former implied?

Aren't the dancers being rendered by the traditional GPU pipeline?



Cell == 230 GFlops. Liverpool GPU == 1840 Gflops. Liverpool GPU == Cell * 8. Number of dancers = 16x Cell. So Cell ends up being less efficient than the GPU in this case. I guess that shows what compute is capable of these days!

That's just granularity. PS4 can extract more unused performance when things are busy. It shouldn't be generating a higher utilisation in a benchmark test where the GPU is focussed on the one task.

Seems that the PS4 is able to get a lot out of its 1843.2 Gflops GPGPU.
 
It isn't mentioned in the slides.

Take a look at some GCN benchmarks, paying particular attention to the 290X with its 8 ACEs and ~5.6 Tflops vs the 7970 / 280X with its 2 ACEs and ~3.9 Tflops.

http://www.tomshardware.com/reviews/radeon-r9-290x-hawaii-review,3650-34.html

http://www.anandtech.com/show/7457/the-radeon-r9-290x-review/18

Not even close to a 100% performance increase.

The Xbox One is bottlenecked in this case, badly, and it has to be by main memory bandwidth.

Edit: this might actually be a good fit for asynchronous compute on the Xbox One!

You lost me on the edit part. I take it you're asserting that the GPU would be busy doing something in ESRAM, and when it has a breather moment while waiting on whatever it's waiting for, it will do async GPGPU out of DRAM? And so it's a good fit because there's no reason to DMA between the two?
 
That first slide looks like it might be oversimplifying things. With compute, the general CPU power can be brought up to the same levels as the GPU? How accurate is this?

You didn't read the graph right. The first post has a graph showing the compute power in Gflops. It shows the CPU, and then the GPU.

Offloading the compute onto the GPU would provide 23x and 15x more performance than just doing it traditionally on the CPU (provided it's a good fit for massively parallel processing).
 
Yes, it doesn't seem like any of this is relevant to asynchronous compute. The same goes for those benchmarks, though, so the efficacy of the added ACEs won't show up there. The added ACEs will shine when compute is jammed in alongside traditional GPU workloads.

One of the slides mentioned creating a huge compute shader in order to avoid too many CPU dispatch requests. I remember reading somewhere that that is a big no-no, in that scheduling long-running compute processes causes headaches for scheduling the rest of the normal tasks.

Yeah, I wonder if it'll bung up your budget. They made it this way to max the number of cloth physics dancers that could actually be going concurrently, but I don't know if this would be ideal with a real game going on. I guess it [having a very long shader] would be terrible as 'async' for sure, but could be OK if the game properly budgets for a synchronous version of it.

Chance of this being used in AC:Unity? I wonder if compute shaders doing things like this are eating up GPU time. If so, I like what the future holds ;)
 
I think I know why there is a big difference between the number of Xbox One & PS4 GPU dancers. Maybe it's because about 600 Gflops is being used for the rendering, leaving the Xbox One with about 700 Gflops for compute & the PS4 GPU with about 1.2 Tflops for compute.
 
Aren't the dancers being rendered by the traditional GPU pipeline?

Probably, but it's common to switch between the two. E.g. compute then render.

They're talking about 5ms of time for the compute. Leaves lots of time for render.

Remember that this is a benchmark of their compute code and not how fast they can render dancers.

You lost me on the edit part. I take it you're asserting that the GPU would be busy doing something in ESRAM, and when it has a breather moment while waiting on whatever it's waiting for, it will do async GPGPU out of DRAM? And so it's a good fit because there's no reason to DMA between the two?

Bingo!

Imagine doing a few ms worth of post processing (not uncommon) using the traditional pipeline where you're operating entirely on depth and frame buffers contained within the esram. Lots of CU time is being wasted as you're limited by the ~130 GB/s (or whatever) that you're getting from the esram. The CPU is working away on the upcoming updates and isn't using even half of the available main memory BW.

Seem like a good time to do your next frame's cloth simulation using async compute, perhaps?

Or how about if you're in a stage of the frame creation where you're ROP limited, perhaps during shadow map updates? Put the shadow maps you're building in esram and get working on something using async compute?

Given the quirks of the Xbox One, I don't see why it couldn't gain an awful lot from asynchronous compute ...
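
For what it's worth, here's a rough PC analogy of that kind of overlap using CUDA streams (invented kernel names, nothing console-specific): a bandwidth-bound pass and an ALU-heavy cloth pass are submitted on independent streams so the hardware is free to fill the bubbles of one with the other, which is roughly the effect async compute is after:

Code:
#include <cuda_runtime.h>

// Bandwidth-bound: mostly reads/writes a big frame-buffer-sized surface.
__global__ void postProcess(float* frame, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) frame[i] = frame[i] * 0.9f + 0.1f;
}

// ALU-heavy: iterates on a small working set, barely touches memory.
__global__ void clothStep(float* verts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = verts[i];
    for (int k = 0; k < 256; ++k)        // stand-in for constraint iterations
        x = x * 0.999f + 0.001f;
    verts[i] = x;
}

int main()
{
    const int frameN = 1 << 22, clothN = 1 << 16;   // invented sizes
    float *frame, *verts;
    cudaMalloc(&frame, frameN * sizeof(float));
    cudaMalloc(&verts, clothN * sizeof(float));

    cudaStream_t gfx, compute;
    cudaStreamCreate(&gfx);
    cudaStreamCreate(&compute);

    // Issued on independent streams: the scheduler may run the ALU-heavy
    // cloth work in the "bubbles" left by the bandwidth-bound pass.
    postProcess<<<frameN / 256, 256, 0, gfx>>>(frame, frameN);
    clothStep<<<clothN / 256, 256, 0, compute>>>(verts, clothN);

    cudaDeviceSynchronize();
    cudaStreamDestroy(gfx);
    cudaStreamDestroy(compute);
    cudaFree(frame);
    cudaFree(verts);
    return 0;
}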
 
Makes sense. From my fiddling with CUDA, yeah, the results are always meant to be copied back to system RAM after they sit around in VRAM. So I would never put GPGPU work in ESRAM, because the results ultimately need to come back to system RAM, and in the case of the consoles the results need to be returned to unified RAM.

Not to say you couldn't put it into ESRAM, I just don't see much point to it. Once the data is reduced you will set up your next GPGPU pass with the reduced set. Unless there was a way to store the reduced results in ESRAM to be run again, i.e. a multi-staged setup; then you would be getting huge efficiencies. But at the same time you have so little space to work with.

I'm not even sure if it could be done. But if it could, that would make for some interesting things.
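
For the staged idea, the usual CUDA pattern already keeps the intermediate results on the GPU between passes and only copies the tiny final result back to system RAM. A generic sum-reduction sketch (made-up names, nothing console-specific):

Code:
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One reduction pass: each block sums 256 elements into one partial result.
__global__ void reducePass(const float* in, float* out, int n)
{
    __shared__ float s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = s[0];   // partial result stays in VRAM
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);

    float *bufA, *bufB;
    cudaMalloc(&bufA, n * sizeof(float));
    cudaMalloc(&bufB, n * sizeof(float));
    cudaMemcpy(bufA, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Repeatedly reduce on the device; intermediates never leave VRAM.
    int count = n;
    float* in = bufA;
    float* out = bufB;
    while (count > 1) {
        int blocks = (count + 255) / 256;
        reducePass<<<blocks, 256>>>(in, out, count);
        count = blocks;
        float* tmp = in; in = out; out = tmp;       // ping-pong buffers
    }

    float result = 0.0f;
    cudaMemcpy(&result, in, sizeof(float), cudaMemcpyDeviceToHost);  // only the final value comes back
    printf("sum = %f (expected %d)\n", result, n);

    cudaFree(bufA);
    cudaFree(bufB);
    return 0;
}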
 
Yeah, I wonder if it'll bung up your budget. They made it this way to max the number of cloth physics dancers that could actually be going concurrently, but I don't know if this would be ideal with a real game going on. I guess it [having a very long shader] would be terrible as 'async' for sure, but could be OK if the game properly budgets for a synchronous version of it.

Chance of this being used in AC:Unity? I wonder if compute shaders doing things like this are eating up GPU time. If so, I like what the future holds ;)

It seems like a safe bet, yeah. Just have a look at the main character's cape: it flows more realistically this time. And not only on him, but on the people in the crowds too. I've no doubt the reports of the horrible frame rates in recent demos are due to this and not their super advanced AI :LOL:

I think I know why there is a big difference between the number of Xbox One & PS4 GPU dancers. Maybe it's because about 600 Gflops is being used for the rendering, leaving the Xbox One with about 700 Gflops for compute & the PS4 GPU with about 1.2 Tflops for compute.

The presentation isn't very clear, but I think they just reworked the cloth physics and made it run on the GPU. There's no indication that these are in-game benches at all.
 
It seems like a safe bet, yeah. Just have a look at the main character's cape: it flows more realistically this time. And not only on him, but on the people in the crowds too. I've no doubt the reports of the horrible frame rates in recent demos are due to this and not their super advanced AI :LOL:


Lol, the shroud is indeed lifting slowly. As a puzzle piece, this one seems to fit well with the explanation of why resolution is being held back, more so than CPU budget.
 
I think I know why there is a big difference between the number of Xbox One & PS4 GPU dancers. Maybe it's because about 600 Gflops is being used for the rendering, leaving the Xbox One with about 700 Gflops for compute & the PS4 GPU with about 1.2 Tflops for compute.

Take a look at the scene: where are those hypothetical 600 Gflops going?
They are not selling a rendering engine, they are selling a cloth simulation tool; they want to keep the rendering as light as possible to get the best possible results.

Case in point: how many games did you see on PS360 that came close to having 30 or 90 characters on screen with that level of physics simulation?
 
Because if you're bandwidth limited on data that is largely a single read and a single write (or a copy in and a copy out) you're going to be limited by how fast you can DMA into and out of esram anyway (i.e. by the speed of the DDR3).

It doesn't seem to be the case of a single read and a single write, though. More like:

Input: Lots of vertex positions.
Shader: Tons of checks for collisions and other stuff to get the simulation done (many reads/writes).
Output: The same vertices from the input, but with their positions updated.

(At least that is what it seemed to me.) Though, if that is the case, I have no idea why the Xbone was so handicapped by BW...

Edit: On second thought, the shader phase happens in the local cache of the CUs, so I dunno XD
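
Roughly the shape I mean, as a CUDA sketch (made-up names, with a trivial ground-plane check standing in for the real collision/constraint work): positions in, the simulation mostly in registers, the same vertices written back out with updated positions:

Code:
#include <cuda_runtime.h>

// Rough shape of the pass described above: read positions in, do the
// simulation work mostly in registers, write positions out.
__global__ void clothUpdate(const float3* posIn, const float3* posPrev,
                            float3* posOut, int numVerts, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numVerts) return;

    float3 p = posIn[i];
    float3 prev = posPrev[i];

    // Verlet-style integration under gravity (constraint solving omitted).
    float3 vel = make_float3(p.x - prev.x, p.y - prev.y, p.z - prev.z);
    p.x += vel.x;
    p.y += vel.y - 9.81f * dt * dt;
    p.z += vel.z;

    // Trivial "collision" check against a ground plane at y = 0.
    if (p.y < 0.0f) p.y = 0.0f;

    posOut[i] = p;   // same vertices as the input, positions updated
}

int main()
{
    const int numVerts = 1 << 16;   // invented size
    float3 *in, *prev, *out;
    cudaMalloc(&in, numVerts * sizeof(float3));
    cudaMalloc(&prev, numVerts * sizeof(float3));
    cudaMalloc(&out, numVerts * sizeof(float3));
    cudaMemset(in, 0, numVerts * sizeof(float3));
    cudaMemset(prev, 0, numVerts * sizeof(float3));

    clothUpdate<<<(numVerts + 255) / 256, 256>>>(in, prev, out, numVerts, 1.0f / 60.0f);
    cudaDeviceSynchronize();

    cudaFree(in); cudaFree(prev); cudaFree(out);
    return 0;
}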
 
Cell == 230 GFlops. Liverpool GPU == 1840 Gflops. Liverpool GPU == Cell * 8. Number of dancers = 16x Cell. So Cell ends up being less efficient than the GPU in this case. I guess that shows what compute is capable of these days!

That's just granularity. PS4 can extract more unused performance when things are busy. It shouldn't be generating a higher utilisation in a benchmark test where the GPU is focussed on the one task.

I think people take these things far too literally.
We don't have all the information here.
This is a paper to show what you can do with the GPUs in these consoles and how they did it.
We don't know how much time they spent optimizing any of these systems.
And the end result was also clear -> GPU wins.

They also only used 5 SPUs for their PS3 code, which equals 128 GFLOPS.
 
I think people take these things far too literally.
We don't have all the information here.
This is a paper to show what you can do with the GPUs in these consoles and how they did it.
We don't know how much time they spent optimizing any of these systems.
And the end result was also clear -> GPU wins.

They also only used 5 SPUs for their PS3 code, which equals 128 GFLOPS.

What you mention is true! And I wonder what optimizations were made to the code for the PS4 CPU.

We can see they made the translation from HLSL to PSSL, and we can see optimizations for the GFX hardware.
But what about the CPU code? Was it written to use, and optimized for, the Onion and Garlic buses? Or is the memory access just generic?

There are a lot of questions in the air. Does anyone have access to the video of the presentation? Does it show the questions asked by the audience at the end?
 
Xbox CPU is ~9% faster, but also has ~15% lower latency access to main memory. If you're hitting main memory a lot that probably makes a difference too.

That's what I thought too -- the difference in the CPU results is explained in part by the difference in memory type, though the XBone's higher clock probably plays a part, too.

Still, using GDDR5 seems to give the PS4's GPU nearly twice the speed.

No wonder the XBone settled for a less capable GPU with fewer shaders. Additional shaders would have been sitting idle starved for BW.
 