GDC paper on compute-based Cloth Physics including CPU performance

And that's a good thing or a bad thing?

It depends on what you want to do with them. From what we know, the Onion+ bus should allow the CPU to extract GPGPU results without annoying cache flushes, thus improving the ability to do GPGPU without constantly burdening your rendering pipeline jobs.

In this case, it looks like it might not be relevant.
We have no public info about Durango, but I'd suppose you get the same effect as long as you work within the ESRAM.
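To picture what that buys you, here's a rough analogy in CUDA, using zero-copy mapped host memory as a stand-in for a coherent bus: the GPU writes a small result straight to host-visible memory and the CPU picks it up after a sync, with no explicit copy step. The kernel and buffer are made up purely for illustration; this is not how either console's API actually looks.

```
// Rough analogy only: CUDA zero-copy (mapped pinned) memory standing in for a
// coherent CPU/GPU bus. The GPU writes a small GPGPU result straight to
// host-visible memory; the CPU reads it after a sync, with no cudaMemcpy.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void smallResultKernel(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i * 0.5f;   // stand-in for a small compute result (e.g. collision flags)
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);    // allow mapped host allocations

    const int n = 1024;                       // "small in next-gen terms"
    float* hostVisible = nullptr;
    float* devAlias    = nullptr;

    // Pinned, mapped allocation visible to both CPU and GPU.
    cudaHostAlloc((void**)&hostVisible, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&devAlias, hostVisible, 0);

    smallResultKernel<<<(n + 255) / 256, 256>>>(devAlias, n);
    cudaDeviceSynchronize();                  // result is now CPU-readable, no copy step

    printf("result[10] = %f\n", hostVisible[10]);
    cudaFreeHost(hostVisible);
    return 0;
}
```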
 
Sony was pretty clear about the modifications they wanted to GCN for the PS4

The three "major modifications" Sony did to the architecture to support this vision are as follows, in Cerny's words:

  • "First, we added another bus to the GPU that allows it to read directly from system memory or write directly to system memory, bypassing its own L1 and L2 caches. As a result, if the data that's being passed back and forth between CPU and GPU is small, you don't have issues with synchronization between them anymore. And by small, I just mean small in next-gen terms. We can pass almost 20 gigabytes a second down that bus. That's not very small in today’s terms -- it’s larger than the PCIe on most PCs!

  • "Next, to support the case where you want to use the GPU L2 cache simultaneously for both graphics processing and asynchronous compute, we have added a bit in the tags of the cache lines, we call it the 'volatile' bit. You can then selectively mark all accesses by compute as 'volatile,' and when it's time for compute to read from system memory, it can invalidate, selectively, the lines it uses in the L2. When it comes time to write back the results, it can write back selectively the lines that it uses. This innovation allows compute to use the GPU L2 cache and perform the required operations without significantly impacting the graphics operations going on at the same time -- in other words, it radically reduces the overhead of running compute and graphics together on the GPU."

  • Thirdly, said Cerny, "The original AMD GCN architecture allowed for one source of graphics commands, and two sources of compute commands. For PS4, we’ve worked with AMD to increase the limit to 64 sources of compute commands -- the idea is if you have some asynchronous compute you want to perform, you put commands in one of these 64 queues, and then there are multiple levels of arbitration in the hardware to determine what runs, how it runs, and when it runs, alongside the graphics that's in the system."
When we'll see the fruits of this is anyone's guess, but I think 2nd and 3rd generation 1st party titles would be a good bet. 3rd party is cloudier.

Anyway, consoles are going to be where the shift to compute assist in games happens, and perhaps that will trickle down to PC, but the architectural differences and APIs will probably be rocky to say the least. Regardless, it's going to be exciting to watch over the next 2-4 years.
 
Sony was pretty clear about the modifications they wanted to GCN for the PS4

When we'll see the fruits of this is anyone's guess, but I think 2nd and 3rd generation 1st party titles would be a good bet. 3rd party is cloudier.

Anyway, consoles are going to be where the shift to compute assist in games happens, and perhaps that will trickle down to PC, but the architectural differences and APIs will probably be rocky to say the least. Regardless, it's going to be exciting to watch over the next 2-4 years.

1 is definitely in Kaveri; I'm not sure about 2, but I suspect so; and 3 is standard fare for all GCN 1.1 and above GPUs.
 
1) Xbone has a 30 GB/s coherent bus. I don't see how it could help here.

2) There's no indication this is using async compute. And yes, GCN 1.1 has the volatile bit too.

3) As above. This is one compute shader; I'm not sure how people are expecting all those ACEs to add 40% performance per flop. Look at some GCN 1 vs GCN 1.1 compute benchmarks.

No point looking for that special taste of secret sauce when there's a whopping great 150%+ bandwidth difference staring us in the face, imo.
 
Would any of this asynchronous GPU compute work on Wii U? I know VLIW5 wasn't considered very good at this type of work, but if it's simply making use of GPU downtime, then any additional work done is beneficial. For example, even if the Wii U GPU could only complete 25 dancers in the allotted time, if that didn't hurt graphics rendering performance at all, then that would be less work the CPU would have to do.

If it's a GPU that exposes the compute capability of an unmodified VLIW5 GPU, it's quite possibly a serious performance regression.

Cayman's VLIW4 architecture was the first one announced to have asynchronous dispatch, which is what gave the option for the GPU to run more than one kernel at a time and the ability for more than one CPU thread to send commands to its own compute kernel.
This seems like a fledgling or partially obscured implementation of what became the explicitly exposed ACEs in GCN.
Aside from the launch hype around the feature, I'm not sure how well it was actually exposed.

Running compute on an IP level older than Cayman would require a context switch of the GPU, basically wiping or writing back a good portion of the chip's context and reinitializing it in a compute mode for a little while, then doing another flush and reinitialization to graphics.
The latencies for that operation are brutal.

It's for similar reasons that the early recommendations for Nvidia's GPU PhysX product were to have two separate boards, one for graphics and the other for physics. The hardware available at the time, like VLIW5, was not able to run multiple kernels, and the system would spend so much time bringing up and tearing down device contexts that it was a giant performance negative.

This is on top of other documented problems on AMD's older hardware like the very poor cache subsystem, very bad VLIW compute code generation, and very rigid clause-based execution model.
 
Didn't Nintendo themselves talk about Compute on Wii U at some point?
 
3rd party is cloudier.

Given that both consoles and PCs can all do compute, I think 3rd parties won't mind looking into it, since much of the work (but certainly not all of it) could be shared between the consoles and, in theory, PCs, though the latter isn't as big a win. Ubisoft's AC CPU headroom issues may speed things along on this front.
 
1) Xbone has a 30 GB/s coherent bus. I don't see how it could help here.

2) There's no indication this is using async compute. And yes, GCN 1.1 has the volatile bit too.

3) As above. This is one compute shader; I'm not sure how people are expecting all those ACEs to add 40% performance per flop. Look at some GCN 1 vs GCN 1.1 compute benchmarks.

No point looking for that special taste of secret sauce when there's a whopping great 150%+ bandwidth difference staring us in the face, imo.

150%? The ESRAM's theoretical bandwidth is 218 GB/s, more than the PS4's bandwidth. The compute shader was bandwidth bound and they decided to work with data in main RAM. That's not logical... They had a choice between 68 GB/s shared with the CPU and other DMA accesses, and 218 GB/s; it's an easy choice when the limit is bandwidth...

If either system's performance is limited by bandwidth, it's the PS4.
 
If they use the ESRAM, maybe the limiting factor is copying the data to main RAM for CPU access... Without implementation details, anything is possible...

But I highly doubt they use main RAM for the compute shader and store the vertices there...
 
The high-level description of the algorithm does a number of things that may make the ESRAM less compelling. The way the workload is subdivided, the emphasis on coalescing memory accesses, and the heuristic for LDS usage all show that a lot of the work the ESRAM would otherwise do is being handled at the CU level.

This covers for GCN's very coarse wavefront granularity and it makes use of the CU memory pools, which in aggregate bandwidth vastly outmatch either the ESRAM or external DRAM.
The emphasis is on presenting a pretty linear access pattern on a decently sized amount of data.
The ESRAM doesn't pay dividends unless multiple accesses hit the same data, and it might be less compelling if the total amount of data is large enough that a real-world scenario would have the data being spilled out before it can be significantly reused.

This does seem to be something the PS4 can handle well. The slides don't say enough about the Xbox One to tell whether there would be an alternate version that could do significantly better on that platform.
This might be a scenario where they have one algorithm that does very well on both consoles, versus working to implement an algorithm that might perform very well on one console but then having to develop this algorithm anyway for the other.
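To make the "work is being done at the CU level" point concrete, here's a generic sketch of that pattern in CUDA, with shared memory standing in for LDS. It's a toy kernel of my own, not the cloth kernel from the GDC slides: each block pulls its tile from external memory once with coalesced loads, reuses it on-chip many times, and writes each result back exactly once, which is why an intermediate pool like the ESRAM has little left to add.

```
// Generic sketch (not the actual GDC cloth kernel) of a CU-level strategy:
// one coalesced read of a tile from external memory, heavy reuse out of
// on-chip shared memory (the CUDA analogue of LDS), one coalesced write back.
#include <cuda_runtime.h>

constexpr int TILE        = 256;   // vertices per block, illustrative number
constexpr int RELAX_ITERS = 16;    // relaxation passes over the tile

// Assumes the vertex count is a multiple of TILE and blockDim.x == TILE, for brevity.
__global__ void relaxTile(const float4* __restrict__ inPos,
                          float4* __restrict__ outPos)
{
    __shared__ float4 tile[TILE];            // on-chip storage, reused every pass

    int gid = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = inPos[gid];          // one coalesced load from DRAM per vertex
    __syncthreads();

    float4 p = tile[threadIdx.x];
    for (int it = 0; it < RELAX_ITERS; ++it) {
        // All of the repeated traffic stays on-chip: the neighbour is fetched
        // from shared memory, never from ESRAM or DRAM.
        int left = (threadIdx.x == 0) ? 0 : threadIdx.x - 1;
        float4 q = tile[left];
        p.x = 0.5f * (p.x + q.x);            // toy "constraint", not real cloth math
        p.y = 0.5f * (p.y + q.y);
        p.z = 0.5f * (p.z + q.z);
        __syncthreads();                     // finish reads before overwriting the tile
        tile[threadIdx.x] = p;
        __syncthreads();                     // make updates visible to the next pass
    }

    outPos[gid] = p;                         // one coalesced store back to DRAM
}
```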
 
Yet the multiplatform games between them are telling a whole other story.

I'm speaking about this case, not every game... And only about the unoptimized version with no use of LDS for storing the data...
 
The high-level description of the algorithm does a number of things that may make the ESRAM less compelling. The way the workload is subdivided, the emphasis on coalescing memory accesses, and the heuristic for LDS usage all show that a lot of the work the ESRAM would otherwise do is being handled at the CU level.

This covers for GCN's very coarse wavefront granularity and it makes use of the CU memory pools, which in aggregate bandwidth vastly outmatch either the ESRAM or external DRAM.
The emphasis is on presenting a pretty linear access pattern on a decently sized amount of data.
The ESRAM doesn't pay dividends unless multiple accesses hit the same data, and it might be less compelling if the total amount of data is large enough that a real-world scenario would have the data being spilled out before it can be significantly reused.

This does seem to be something the PS4 can handle well. The slides don't say enough about the Xbox One to tell whether there would be an alternate version that could do significantly better on that platform.
This might be a scenario where they have one algorithm that does very well on both consoles, versus working to implement an algorithm that might perform very well on one console but then having to develop this algorithm anyway for the other.

That's why I said the unoptimized compute shader was bandwidth bound (on PS4, Xbox One, and probably the best cards on PC). I don't think that's the case with the optimized version. I don't know where the bottleneck is on Xbox One for the optimized version of the compute shader, but it's probably not main memory or ESRAM bandwidth; same thing for PS4 with GDDR5...

If you read the slides, they advise doing a quick proof of concept and iterating to choose the best solution for GPGPU; maybe they didn't find a better solution...
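In that quick-proof-of-concept spirit, one cheap sanity check for "is this shader bandwidth bound?" is to compare its arithmetic intensity (flops per byte moved) against the machine balance (peak flops per byte of bandwidth). Every number below is a placeholder I picked for illustration, not a figure from the paper.

```
// Back-of-envelope "am I bandwidth bound?" check. All numbers are illustrative
// placeholders, not figures taken from the GDC paper.
#include <cstdio>

int main()
{
    // Naive (unpacked) per-vertex traffic: read position, previous position and
    // normal, write position and normal, all as float4 (16 bytes each).
    const double bytesPerVertex = 5 * 16.0;   // 80 bytes
    const double flopsPerVertex = 200.0;      // assumed constraint/integration math

    const double arithmeticIntensity = flopsPerVertex / bytesPerVertex; // flops per byte

    // Rough console-class machine balance: peak flops over peak bandwidth.
    const double peakGflops = 1800.0;         // placeholder
    const double peakGBps   = 176.0;          // placeholder
    const double machineBalance = peakGflops / peakGBps;

    printf("arithmetic intensity: %.2f flops/byte\n", arithmeticIntensity);
    printf("machine balance:      %.2f flops/byte\n", machineBalance);
    printf("%s\n", arithmeticIntensity < machineBalance
                       ? "=> bandwidth bound: pack the data or reuse it in LDS"
                       : "=> compute bound: more bandwidth won't help much");
    return 0;
}
```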
 
Didn't Nintendo themselves talk about Compute on Wii U at some point?
I think it's more likely the Wii U hardware was meant to release earlier than it did, perhaps as a Wii HD on a par with 360/PS3-level graphics, which would explain the outdated GPU being based on an R700 design. But because the Wii was still selling like gangbusters at that stage, they shelved it for another 2-3 years.

Nintendo themselves, though, did nothing to disabuse people of the idea that it was intended to compete on a hardware level with the new generation of consoles, or at least to sit somewhere in between the two generations.
 
150%? The ESRAM's theoretical bandwidth is 218 GB/s, more than the PS4's bandwidth. The compute shader was bandwidth bound and they decided to work with data in main RAM. That's not logical... They had a choice between 68 GB/s shared with the CPU and other DMA accesses, and 218 GB/s; it's an easy choice when the limit is bandwidth...

If either system's performance is limited by bandwidth, it's the PS4.

Look at the slides: it doesn't look like they're using the ESRAM. Data that's being accessed multiple times is going into the faster LDS instead. There's not much point copying stuff into ESRAM only to read it once, and then separately writing to ESRAM only to have to copy it out again. How is that supposed to make this faster?

The move engines top out at about 25 GB/s, btw. ESRAM isn't magic. And this is a compute shader intended to be used across three platforms.
 
The high-level description of the algorithm does a number of things that may make the ESRAM less compelling. The way the workload is subdivided, the emphasis on coalescing memory accesses, and the heuristic for LDS usage all show that a lot of the work the ESRAM would otherwise do is being handled at the CU level.

This covers for GCN's very coarse wavefront granularity and it makes use of the CU memory pools, which in aggregate bandwidth vastly outmatch either the ESRAM or external DRAM.
The emphasis is on presenting a pretty linear access pattern on a decently sized amount of data.
The ESRAM doesn't pay dividends unless multiple accesses hit the same data, and it might be less compelling if the total amount of data is large enough that a real-world scenario would have the data being spilled out before it can be significantly reused.

This does seem to be something the PS4 can handle well. The slides don't say enough about the Xbox One to tell whether there would be an alternate version that could do significantly better on that platform.
This might be a scenario where they have one algorithm that does very well on both consoles, versus working to implement an algorithm that might perform very well on one console but then having to develop this algorithm anyway for the other.

Basically this! Except you actually know what you're talking about... :(
 
That's why I said the unoptimized compute shader was bandwidth bound (on PS4, Xbox One, and probably the best cards on PC). I don't think that's the case with the optimized version. I don't know where the bottleneck is on Xbox One for the optimized version of the compute shader, but it's probably not main memory or ESRAM bandwidth; same thing for PS4 with GDDR5...

The optimised shader was still bandwidth limited on the vastly more bandwidth-constrained X1. Possibly it still was to some extent on the PS4 also; just look at the enormous speedup from packing the vertex and normal data!
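For anyone wondering what that packing amounts to, a common form of the trick (my illustration of the general technique, not necessarily the paper's exact layout) is to squeeze each unit normal into one 32-bit 10:10:10 word instead of three floats, cutting that part of the per-vertex traffic to a third.

```
// Sketch of the usual normal-packing trick: a unit normal stored as 10:10:10
// in one 32-bit word instead of three floats (12 bytes -> 4 bytes). My own
// illustration of the general technique, not the paper's actual layout.
#include <cstdint>
#include <cstdio>

__host__ __device__ inline uint32_t packUnitNormal(float x, float y, float z)
{
    // Map each component from [-1, 1] to a 10-bit integer in [0, 1023].
    uint32_t xi = (uint32_t)((x * 0.5f + 0.5f) * 1023.0f + 0.5f);
    uint32_t yi = (uint32_t)((y * 0.5f + 0.5f) * 1023.0f + 0.5f);
    uint32_t zi = (uint32_t)((z * 0.5f + 0.5f) * 1023.0f + 0.5f);
    return xi | (yi << 10) | (zi << 20);
}

__host__ __device__ inline void unpackUnitNormal(uint32_t p, float& x, float& y, float& z)
{
    x = ((p      ) & 1023u) / 1023.0f * 2.0f - 1.0f;
    y = ((p >> 10) & 1023u) / 1023.0f * 2.0f - 1.0f;
    z = ((p >> 20) & 1023u) / 1023.0f * 2.0f - 1.0f;
}

int main()
{
    float nx = 0.267f, ny = 0.535f, nz = 0.802f;   // roughly unit length
    uint32_t packed = packUnitNormal(nx, ny, nz);
    float rx, ry, rz;
    unpackUnitNormal(packed, rx, ry, rz);
    printf("original (%.3f %.3f %.3f) -> unpacked (%.3f %.3f %.3f)\n",
           nx, ny, nz, rx, ry, rz);
    return 0;
}
```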
 