Asynchronous Compute: what are the benefits?

...
Obviously, if you use GPGPU to offload some simulation work (such as physics or ocean simulation) from the CPU to the GPU, you need to copy that data back to the CPU in order to update the CPU-side data structures.

Except in some cases where you can use the shared memory available for both the CPU and GPU on PS4, right? (with a 20GB/s maximum bandwidth)

First, we added another bus to the GPU that allows it to read directly from system memory or write directly to system memory, bypassing its own L1 and L2 caches. As a result, if the data that's being passed back and forth between CPU and GPU is small, you don't have issues with synchronization between them anymore. And by small, I just mean small in next-gen terms. We can pass almost 20 gigabytes a second down that bus. That's not very small in today’s terms -- it’s larger than the PCIe on most PCs!
 
Except in some cases where you can use the shared memory available for both the CPU and GPU on PS4, right? (with a 20GB/s maximum bandwidth)

The quote you were responding to was still in the context of a PC with a dedicated graphics card, I believe, and was a continuation of the idea presented in this quote from an earlier post:

Asynchronous compute will be mostly used for rendering-related tasks. It will improve the graphics quality and the frame rate. Some games will use asynchronous compute for non-graphics tasks. However, as we have already seen, compute shaders have been most successfully used for graphics processing. Asynchronous compute will make compute shaders even more useful for rendering purposes, meaning that there will not be many free GPU cycles to spare for other purposes. This is especially true on PC, since the data transfer from CPU memory to GPU memory and back is expensive, and has high latency. Rendering-related compute work doesn't need to be transferred to CPU memory at all.
 
Except in some cases where you can use the shared memory available for both the CPU and GPU on PS4, right?
Yes, you have unified memory on consoles. However, if you design your engine around fast CPU<->GPU communication using unified memory, it becomes very hard to port to PC. You have at least one additional frame of GPU roundtrip latency on PC (and multiple frames with SLI/Crossfire). This is partly because of the separate CPU and GPU physical memories, and partly because of the API abstractions (no direct way to control GPU memory and data transfers, and it's hard to ensure enough work for various GPUs when working at near lockstep). DirectX 12 (and Vulkan) will certainly help, but they still cannot remove the need to move data between the CPU and GPU memories.

The CPU and the GPU run asynchronously. A PC game will never be able to have as low latency as a console game, because the GPU performance is unknown. You want to ensure that there's always enough work in the GPU's command queues (faster GPUs empty their queues faster). To prevent GPU idling (= empty queues), you push more data to the queues, meaning a longer average wait until each command gets out. Asynchronous compute has priority mechanisms to fight this issue, but only time will tell how much these mechanisms can lower the GPU roundtrip latency on PC. For gameplay code, the worst-case latency is of course the most important one (large, fluctuating input lag is the worst). It remains to be seen whether ALL the relevant Intel, Nvidia and AMD GPUs provide low enough latency for high-priority asynchronous compute. If the latency is not predictable across all the manufacturers, then I expect cross-platform games to continue using CPU SIMD (SSE/AVX) to do their gameplay-related data crunching.
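
For concreteness, explicit APIs expose this as a separate compute queue with a priority hint. A minimal D3D12 sketch, assuming an already created ID3D12Device and omitting error handling:

```cpp
// Minimal sketch (illustrative): a compute-only queue created alongside the
// direct/graphics queue, with a priority hint so small GPGPU jobs can be
// scheduled ahead of lower-priority work.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

ComPtr<ID3D12CommandQueue> CreateHighPriorityComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;   // compute-only queue
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH; // scheduling hint, not a guarantee
    desc.Flags    = D3D12_COMMAND_QUEUE_FLAG_NONE;

    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue)); // error handling omitted
    return queue;
}
```

Whether that priority hint translates into predictably low latency across Intel, Nvidia and AMD hardware is exactly the open question raised above.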

Insomniac (Sunset Overdrive developer) had a GDC presentation about CPU SIMD:
https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/

On page 4, Andreas explains why they continue using CPU SIMD instead of GPU compute for gameplay-related things. Latency is the key.
 
Yes, you have unified memory on consoles. However, if you design your engine around fast CPU<->GPU communication using unified memory, it becomes very hard to port to PC. You have at least one additional frame of GPU roundtrip latency on PC (and multiple frames with SLI/Crossfire). This is partly because of the separate CPU and GPU physical memories, and partly because of the API abstractions (no direct way to control GPU memory and data transfers, and it's hard to ensure enough work for various GPUs when working at near lockstep). DirectX 12 (and Vulkan) will certainly help, but they still cannot remove the need to move data between the CPU and GPU memories.

The CPU and the GPU run asynchronously. A PC game will never be able to have as low latency as a console game, because the GPU performance is unknown. You want to ensure that there's always enough work in the GPU's command queues (faster GPUs empty their queues faster). To prevent GPU idling (= empty queues), you push more data to the queues, meaning a longer average wait until each command gets out. Asynchronous compute has priority mechanisms to fight this issue, but only time will tell how much these mechanisms can lower the GPU roundtrip latency on PC. For gameplay code, the worst-case latency is of course the most important one (large, fluctuating input lag is the worst). It remains to be seen whether ALL the relevant Intel, Nvidia and AMD GPUs provide low enough latency for high-priority asynchronous compute. If the latency is not predictable across all the manufacturers, then I expect cross-platform games to continue using CPU SIMD (SSE/AVX) to do their gameplay-related data crunching.

Insomniac (Sunset Overdrive developer) had a GDC presentation about CPU SIMD:
https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/

On page 4, Andreas explains why they continue using CPU SIMD instead of GPU compute for gameplay-related things. Latency is the key.

I think people sometimes forget that exclusive PS4 devs don't have the same constraints as multiplatform devs who also have to work with PC.

edit: the PS4 version of the Bullet physics API keeps gameplay physics on the CPU.
 
Yes, you have unified memory on consoles. However, if you design your engine around fast CPU<->GPU communication using unified memory, it becomes very hard to port to PC. You have at least one additional frame of GPU roundtrip latency on PC (and multiple frames with SLI/Crossfire). This is partly because of the separate CPU and GPU physical memories, and partly because of the API abstractions (no direct way to control GPU memory and data transfers, and it's hard to ensure enough work for various GPUs when working at near lockstep). DirectX 12 (and Vulkan) will certainly help, but they still cannot remove the need to move data between the CPU and GPU memories.

If you don't mind sharing (and of course if your NDA allows you to), in this situation would you choose NOT to design your engine around fast CPU<->GPU communication because you want to support PC (thus handicapping the engine on consoles)?!

Just interested in how you would tackle this situation :)
 
If you don't mind sharing (and of course if your NDA allows you to), in this situation would you choose NOT to design your engine around fast CPU<->GPU communication because you want to support PC (thus handicapping the engine on consoles)?!

Just interested in how you would tackle this situation :)

Is it really handicapping the engine on consoles? sebbbi already described how it's often (usually?) possible to fill the GPU with rendering-based async compute jobs which don't need the fast CPU<->GPU interconnect.

And the slide deck he posted explained (very well, I think) why CPU SIMD should be used for the low-latency jobs where possible.

What I found particularly interesting about that slide deck is that, on the one hand, AVX performance on AMD CPUs is crippled to the point that Insomniac don't even bother with it on the PS4 and use SSE4.2 instead, while on the PC, despite some machines offering AVX2 capability (potentially 4x SSE performance), fragmentation in the market would generally limit them to somewhere between SSE2 and SSE4.1, depending on how high the hardware target is.

That is, unless some kind of abstraction software is used that can automatically switch between SSE and AVX (and AVX2?) as described in the deck.
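
That kind of abstraction usually amounts to compiling each kernel for several instruction sets and picking one at runtime. A rough sketch of such SSE/AVX dispatch using GCC/Clang builtins (the function names are illustrative):

```cpp
// Rough sketch of runtime SIMD dispatch. Each variant is compiled for its own
// instruction set via a target attribute, and the widest set the CPU actually
// supports is picked at runtime.
#include <immintrin.h>

__attribute__((target("avx")))
static void add_arrays_avx(const float* a, const float* b, float* out, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8) {                        // 8 floats per 256-bit op
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];            // scalar tail
}

static void add_arrays_sse(const float* a, const float* b, float* out, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {                        // 4 floats per 128-bit op
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];
}

// Dispatch per call here for brevity; in practice you'd cache a function pointer
// (or a jump table) once at startup.
void add_arrays(const float* a, const float* b, float* out, int n)
{
    if (__builtin_cpu_supports("avx"))
        add_arrays_avx(a, b, out, n);
    else
        add_arrays_sse(a, b, out, n);
}
```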
 
So basically, the kind of GPGPU benefits Ubisoft presented in their slides aren't possible in their games, because the PS4's setup relies on a specific advantage that PC/XB1 can't match :/

That sucks; it means a majority of devs won't utilize such a thing even though it has some potential. Thanks for the info as usual, sebbbi.

On the other hand, maybe CC2 will be able to manage some kind of approximation; they have been looking into this very thing for implementation in their games.
 
So basically, the kind of GPGPU benefits Ubisoft presented in their slides aren't possible in their games, because the PS4's setup relies on a specific advantage that PC/XB1 can't match :/

That sucks; it means a majority of devs won't utilize such a thing even though it has some potential. Thanks for the info as usual, sebbbi.

On the other hand, maybe CC2 will be able to manage some kind of approximation; they have been looking into this very thing for implementation in their games.

No, the Xbox One is fine for this too. The problem is more on the PC side, with the PCI Express bus linking the GPU to main RAM, and the console level of API control not being available on PC.
 
True, but I was more referring to XB1's comparative lack of compute resources... it's going to be a hard sell to put compute into multiplat games when the effect is limited on a platform you have to get working as close as possible to the same level. Is it even worth it at that point to use compute?

From DF's observations, FF15 uses Infamous SS's type of GPGPU compute effects for particles in everything from the summons' dispersal to the damage effect of ordinary slashes. In that case, XB1 is lagging behind. Do they just cut that utilization back or something?
 
True, but I was more referring to XB1's comparative lack of compute resources... it's going to be a hard sell to put compute into multiplat games when the effect is limited on a platform you have to get working as close as possible to the same level. Is it even worth it at that point to use compute?
If they're on PC, they still have an even more limited baseline to set the cut-off at.

XB1 is lagging behind. Do they just cut that utilization back or something?
Some already reduce resolution.
 
So basically, the kind of GPGPU benefits Ubisoft presented in their slides aren't possible in their games, because the PS4's setup relies on a specific advantage that PC/XB1 can't match :/

That sucks; it means a majority of devs won't utilize such a thing even though it has some potential. Thanks for the info as usual, sebbbi.

On the other hand, maybe CC2 will be able to manage some kind of approximation; they have been looking into this very thing for implementation in their games.
My understanding from sebbbi is that he prefers/would like to see game code not run on the GPU at all: all graphics-related work runs on the GPU, and game code stays on the CPU.

If you heavily tune your engine to purposely split work between the CPU and GPU, it would be hard to port to PC due to the round-trip latency between the CPU and GPU. If you leverage async compute for graphics, however, that round-trip latency doesn't exist: the data goes to the GPU and stays there.

There are tools on the CPU side that could be explored without having to use the GPU to perform the calculations, as sebbbi has mentioned. Since what you're attempting is asynchronous compute, you run into the issue that if the data doesn't return in time, the CPU is stalled waiting for the GPU to return results. And you also take up GPU cycles that could have been used for graphics.

He does make a good case in this statement: gameplay programmers should be leveraging and optimizing their CPU and memory as much as you'd have to do on the GPU side.
 
True, but I was more referring to XB1's comparative lack of compute resources... it's going to be a hard sell to put compute into multiplat games when the effect is limited on a platform you have to get working as close as possible to the same level. Is it even worth it at that point to use compute?

From DF's observations, FF15 uses Infamous SS's type of GPGPU compute effects for particles in everything from the summons' dispersal to the damage effect of ordinary slashes. In that case, XB1 is lagging behind. Do they just cut that utilization back or something?

iroboto already said it more or less, but it's worth pointing out that, as sebbbi said earlier, tons of games on both consoles and PC already use GPU compute (for graphics work) and have done so since DX11 became standard. So there's no fear about it being used on consoles, since it already is.

When DX12 lands, the PC will also widely support async compute (I say widely, as it already does through Mantle on AMD GPUs). The question I'm still not sure about, though, is whether a game can be developed to use synchronous compute and automatically use async compute if/when it's available in the hardware (and vice versa), or whether the game needs to be specifically coded to make use of one or the other. Because as far as I'm aware, no Intel GPUs support async compute, so that could greatly hinder its take-up, at least in the PC space, if it requires full hardware support and has no fallback option. Judging from sebbbi's enthusiasm for this though, I'm assuming that wouldn't pose too much of a barrier.
 
I see, thank you for the information and clarification.
I didn't actually answer the second part of your question, or really the first, which was whether or not the shared memory bus would be used. The answer is yes, it would be used in multiplats. In both scenarios, even where compute is headed only to the GPU, that bus can be leveraged. And in the scenario where a round trip needs to be made, it can be leveraged with much less latency than a PC would see, and somewhat less latency than the XO would see.

But in the scenario where you are specifically designing an engine around that round-trip performance, it could only be leveraged on PS4, as your code would eventually become dependent on the lower latency of that shared memory space. I think that was sebbbi's stance with regards to the difficulty of porting it out.
 
When DX12 lands, the PC will also widely support async compute (I say widely, as it already does through Mantle on AMD GPUs). The question I'm still not sure about, though, is whether a game can be developed to use synchronous compute and automatically use async compute if/when it's available in the hardware (and vice versa), or whether the game needs to be specifically coded to make use of one or the other. Because as far as I'm aware, no Intel GPUs support async compute, so that could greatly hinder its take-up

I'm going to have to respond in terms of Xbox hardware, but I believe this is the difference between a high-priority compute queue and a low-priority one. My understanding is that compute shaders/DirectCompute are high priority while async compute is low priority. If this slide is to be followed, then we see that asynchronous compute is intended for multiple small jobs that render faster and fit into the gaps (the CPU overhead of DX12 is small, and parallel rendering enables a solution that didn't previously exist on DX11 due to CPU overhead, as documented in the Ubisoft presentation). Instead, they wrote very long shader code with sync points to complete its job; not a very good use of resources, but good for determining the maximum capabilities of the hardware.

I believe Intel Skylake will be DX12 ready and therefore support asynchronous compute.
[attached slide image]
 
No, if you design it correctly, it will die very fast when synchronizing GPU <-> CPU.

I don't understand where the synchronisation comes in? These are CPU tasks in the first place that you're moving to the GPU on the consoles without a latency hit because of the shared memory. But on the PC you're just leaving them on the CPU in the first place so why would you need a low latency sync to the GPU? How is it any different to how games have been splitting tasks between the CPU and GPU (with a slow interconnect) for years?
 
Well, not everything sim-side needs to make its way to the GPU for drawing, and even for the things that do, many are latency-tolerant by the point they are queued up for drawing.

Don't think there's a Windows build for GPUs yet, and even when there is, serial legacy x86 apps will run pretty badly on a GPU-based x86 emulator! :eek:

The PC form factor needs to evolve beyond 65-95 W APUs before discrete products go away in gaming PCs.
 
When DX12 lands, the PC will also widely support async compute (I say widely, as it already does through Mantle on AMD GPUs). The question I'm still not sure about, though, is whether a game can be developed to use synchronous compute and automatically use async compute if/when it's available in the hardware (and vice versa), or whether the game needs to be specifically coded to make use of one or the other.

Well, that depends quite a bit on what you're using async compute for. Currently the dominant use of async compute is for optimizing graphics-related tasks that can be run in parallel with other graphics tasks. So for instance you might update your particle simulation using async compute jobs that are kicked off at the beginning of your frame, and while that's going on your primary graphics pipe is processing draw calls for a depth prepass. For a situation like this, where async compute is just an optimization and the results are still consumed by the GPU, it's pretty trivial to just kick off your compute job on your main graphics pipe instead. All you really need to do is just make sure that it gets submitted before whatever graphics tasks consume the results of the compute job. It won't run as optimally as if you had async compute, but things will still basically work without any major problems.
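
In D3D12 terms, that overlap boils down to two command queues and a fence. A rough sketch, with all names being illustrative and the command lists assumed to be recorded elsewhere:

```cpp
#include <d3d12.h>

// Async path: the particle update (recorded on a COMPUTE-type command list) runs
// on the compute queue and overlaps graphics work such as a depth prepass
// submitted to gfxQueue before this call.
void SubmitParticleUpdateAsync(ID3D12CommandQueue* gfxQueue,
                               ID3D12CommandQueue* computeQueue,
                               ID3D12CommandList*  particleUpdateCmds, // COMPUTE-type list
                               ID3D12CommandList*  sceneDrawCmds,      // DIRECT-type list
                               ID3D12Fence* fence, UINT64& fenceValue)
{
    // Kick the particle update on the compute queue so it runs alongside graphics.
    computeQueue->ExecuteCommandLists(1, &particleUpdateCmds);
    computeQueue->Signal(fence, ++fenceValue);

    // GPU-side wait: the graphics queue only stalls when it reaches the draws
    // that actually consume the particle results.
    gfxQueue->Wait(fence, fenceValue);
    gfxQueue->ExecuteCommandLists(1, &sceneDrawCmds);
}
```

The fallback is just the degenerate case: record or submit the same dispatch on the graphics queue ahead of the consuming draws and drop the fence, losing the overlap but keeping the ordering correct.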

Where things get tricky is if you're using async compute for low-latency, non-graphics tasks. The typical game setup goes like this: frame 1 starts on the CPU, by updating the state of all in-game entities. Once this is done, the CPU then builds GPU command buffers to draw the entities at their current state. At this point the CPU is done with frame 1, and so it submits command buffers to the GPU so that it can render frame 1. While the GPU is cranking away on frame 1, the CPU moves on to frame 2 and repeats the process. The consequence of this setup is that if the entity update phase wants to do some quick compute jobs on the GPU, it might have to wait all the way until the end of the frame for the GPU to finish processing the previous frame before it can actually submit something and have the GPU start executing it. On the PC it might even require more than 1 frame of waiting, since by default the driver will buffer up 2-3 frames worth of command buffers before submitting them. Async compute offers a nice way around this problem, since it essentially lets you say "Hey GPU, I know you're doing other stuff right now, but go ahead and execute these compute jobs whenever you have some spare time" (or right now, if you set the priority high enough). This, together with low-latency readback of results into CPU-accessible memory, essentially opens the door for low-latency GPGPU jobs. If you're using async compute to realize this, you can't really just fall back to synchronous compute unless the system that kicked off the task is capable of tolerating multiple frames of latency. For such cases I would imagine that you would need to have an optimized CPU-only path that you could use instead.
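
The low-latency readback half of that could look roughly like this in D3D12 (a sketch only; the names and the assumption that the compute job copies its results into a READBACK-heap buffer are illustrative):

```cpp
#include <windows.h>
#include <d3d12.h>
#include <cstring>

// Small, latency-sensitive GPGPU job with CPU readback. The compute job is
// assumed to write its results into readbackBuffer, a buffer created in a
// D3D12_HEAP_TYPE_READBACK heap.
void RunJobAndReadBack(ID3D12CommandQueue* computeQueue,
                       ID3D12CommandList*  jobCmds,
                       ID3D12Fence* fence, UINT64& fenceValue,
                       ID3D12Resource* readbackBuffer,
                       void* dst, size_t byteCount)
{
    computeQueue->ExecuteCommandLists(1, &jobCmds);
    computeQueue->Signal(fence, ++fenceValue);

    // Block this CPU thread until just this job is done, not the whole frame.
    HANDLE evt = CreateEvent(nullptr, FALSE, FALSE, nullptr);
    fence->SetEventOnCompletion(fenceValue, evt);
    WaitForSingleObject(evt, INFINITE);
    CloseHandle(evt);

    // The results are now visible in CPU-accessible memory.
    void* mapped = nullptr;
    D3D12_RANGE readRange{ 0, byteCount };
    readbackBuffer->Map(0, &readRange, &mapped);
    std::memcpy(dst, mapped, byteCount);
    D3D12_RANGE writtenRange{ 0, 0 };   // nothing written back through the map
    readbackBuffer->Unmap(0, &writtenRange);
}
```

Whether the fence signals almost immediately or only after the GPU has chewed through a couple of buffered frames is exactly the driver/priority question discussed earlier in the thread.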
 