Asynchronous Compute: what are the benefits?

Well, that depends quite a bit on what you're using async compute for...

Thanks, that was an excellent explanation. So it does seem to reinforce my previous thinking that for graphics related tasks (and potentially -highly- latency tolerant non-graphics tasks, e.g. PhysX style physics), once DX12 lands, the PC should be able to handle async compute just fine with no special penalties, or in Intel's case fall back to synchronous compute. But for latency sensitive CPU tasks, using async compute on the PC is basically a non-starter.

And so in that instance the developer has a choice: either build a dedicated CPU-only path for those tasks on the PC while continuing to use async compute on the consoles, or build one CPU-centric path for all machines and use async compute only for graphics tasks (and perhaps latency tolerant GPGPU tasks).
 
But you may need it for some "real next-gen" graphics. Take, for example, The Tomorrow Children: they do 3 draw calls per voxel in the scene! I think the PC would die very, very fast there. And they do it not because they want to cripple something, but because it gets them real-time GI with a fully dynamic environment (destruction and such), and it looks gorgeous.

If we're talking purely in terms of DX11 then I don't disagree.
 
 
AMD Details Asynchronous Shaders In DirectX 12, Promises Performance Gains

Asynchronous Shaders are something that should have arrived a long time ago, because all we've been doing is throwing more power at the problem, rather than using that power more efficiently.

Graphics cards will support some of these features. For example, there are the so-called "Asynchronous Shaders," which are a different way of handling task queues than was possible in older graphics APIs, and potentially a much more efficient one.

In DirectX 11, there are two primary ways of synchronous task scheduling: multi-threaded graphics and multi-threaded graphics with pre-emption and prioritization, each with their advantages and disadvantages.

Before we continue, we must clarify a couple of terms. The GPU's shaders draw the image, compute the game physics, do the post-processing and more, and they do this by being assigned various tasks. These tasks are delivered through the command stream, which is the main sequence of tasks that the shaders need to execute. The command stream is generated by merging individual command queues, which consist of multiple tasks and break spaces.


These empty parts in the queue exist because tasks in a single queue aren't generated one right after another in multi-threaded graphics; tasks in one queue are sometimes only generated after tasks in another queue. Due to these break spaces, a single queue cannot utilize the shaders to their full potential.

Generally speaking, there are three command queues: the graphics queue, the compute queue, and the copy queue.
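For illustration, here is a minimal C++ sketch of how DirectX 12 exposes those three queue types explicitly (D3D12 calls the graphics queue a "direct" queue). The helper function name, and the assumption that a device already exists, are mine rather than from the article:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create one queue of each of the three types the article describes.
// Assumes 'device' was created elsewhere (sketch only).
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& graphicsQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue,
                  ComPtr<ID3D12CommandQueue>& copyQueue)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};

    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;   // graphics (can also do compute/copy)
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&graphicsQueue));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute (and copy)
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COPY;     // copy only
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&copyQueue));
}
```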

The simplest way to describe how synchronous multi-threaded graphics works is that the command queues are merged by switching between one another on time intervals: one queue feeds the main command stream briefly, then the next queue takes its turn, and so on. Therefore, the gaps mentioned above remain in the final command stream, meaning that the GPU will never run at 100 percent actual load. In addition, if an urgent task comes along, it must merge into the command stream and wait for the rest of the queued work to finish executing. Another way of thinking of this is multiple sources at a traffic light merging into a single lane.
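By contrast with the DX12 path, the DX11 multi-threaded graphics path referred to here maps roughly to deferred contexts: multiple threads can record command lists, but everything still executes serially through the single immediate context. A hedged sketch of that pattern (function names are illustrative):

```cpp
#include <d3d11.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Each worker thread records its own command list on a deferred context.
ComPtr<ID3D11CommandList> RecordOnWorkerThread(ID3D11Device* device)
{
    ComPtr<ID3D11DeviceContext> deferred;
    device->CreateDeferredContext(0, &deferred);

    // ... record draw/dispatch calls on 'deferred' here ...

    ComPtr<ID3D11CommandList> cmdList;
    deferred->FinishCommandList(FALSE, &cmdList);
    return cmdList;
}

// Every recorded list still funnels through the one immediate context,
// one after another: the "single lane" merge described above.
void Submit(ID3D11DeviceContext* immediate, ID3D11CommandList* cmdList)
{
    immediate->ExecuteCommandList(cmdList, FALSE);
}
```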

In DirectX 12, however, a new merging method called Asynchronous Shaders is available, which is basically asynchronous multi-threaded graphics with pre-emption and prioritization. What happens here is that the ACEs (Asynchronous Compute Engines) on AMD's GCN-based GPUs interleave the tasks, filling the gaps in one queue with tasks from another, kind of like merging onto a highway where nobody moves aside for you. Despite that, the hardware can still move the main command stream aside to let priority tasks pass when necessary. It probably goes without saying that this leads to a performance gain.
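In code, the DX12 version of that highway merge looks something like the sketch below: independent compute work goes to its own queue, a fence expresses the one real dependency, and on hardware with ACEs the two queues can genuinely overlap. The pass names, and the assumption that the queues, command lists and fence already exist, are mine:

```cpp
#include <d3d12.h>

// Submit one frame's work across two queues (hypothetical pass names).
void SubmitFrame(ID3D12CommandQueue* gfxQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12GraphicsCommandList* shadowPass,   // graphics work, independent of compute
                 ID3D12GraphicsCommandList* lightingPass, // graphics work that reads compute output
                 ID3D12GraphicsCommandList* asyncWork,    // e.g. SSAO or particle compute
                 ID3D12Fence* fence,
                 UINT64& fenceValue)
{
    // Kick compute on its own queue. On hardware with async compute
    // (GCN's ACEs), this fills shader-array gaps left by the graphics work.
    ID3D12CommandList* c[] = { asyncWork };
    computeQueue->ExecuteCommandLists(1, c);
    computeQueue->Signal(fence, ++fenceValue);

    // Independent graphics work runs concurrently on the direct queue.
    ID3D12CommandList* g[] = { shadowPass };
    gfxQueue->ExecuteCommandLists(1, g);

    // GPU-side wait: work submitted to the graphics queue after this
    // point won't start until the compute queue has signaled the fence.
    gfxQueue->Wait(fence, fenceValue);

    ID3D12CommandList* g2[] = { lightingPass };
    gfxQueue->ExecuteCommandLists(1, g2);
}
```

On GPUs without async compute support the driver can still consume this submission pattern; it just ends up serializing the two queues, which is the "fall back to synchronous" case mentioned earlier in the thread.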

On AMD's GCN GPUs, each ACE can handle up to eight queues, and each ACE can address its own fair share of shaders. The most basic GPUs have just two ACEs, while more elaborate GPUs carry eight. AMD says the company is working closely with Microsoft to ensure the best support possible. During the briefing, the spokesman mentioned that he had seen no such information regarding support from its competitors, but we know that Nvidia is always very "hush hush" about unannounced products. It should be noted, however, that Asynchronous Shaders aren't exclusive to DirectX 12; they will also be a part of the new Vulkan API as well as LiquidVR, and they already exist in AMD's Mantle.

http://www.tomshardware.com/news/amd-dx12-asynchronous-shaders-gcn,28844.html
 
Very interesting. Microsoft may be working with them to support this in DX12, but with regards to the consoles it seems the PS4 hardware is in a much better position to take advantage of this, with its eight ACEs versus two on Xbox One. I wonder if Sony has this technique in their API already?

It's already being used in some PS4 games. It's also funny that AMD used the cars-in-traffic analogy to explain Asynchronous Shaders, like I did two years ago.

[Image: list of PS4 games already using asynchronous compute]


 
Really interesting article from AnandTech on async here:

http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading

No new info on AMD's capabilities, but some really interesting bits on Nvidia. They basically confirm that all Nvidia GPUs prior to Maxwell 2 are incapable of async compute. This is big news IMO and has big implications for how much this might be used on the PC over the next few years. Bear in mind that no Intel GPU supports it at this stage either.

Given that we're seeing an average of 15-20% performance improvement when async is used, this could to some extent rearrange the performance stack between AMD and Nvidia for everything below Maxwell 2 in any games that use it. Of course, how relevant that will be by the time we see the first DX12 games using async compute is open to question. I'm guessing there's a reasonable chance we'll have Pascal on the market by then.
 
I'd expect to see this on the PC. Since the PS4 and Xbox One support it, it will be used. With the Xbox One moving to DX12 in the future, I'd guess those features will be available in multiplatform titles on PC.
 
[Image: table of front-end/queue counts for asynchronous compute across AMD and Nvidia GPUs]

I have a question: are AMD and NV compute engines roughly equal? Does a 780 have 4x the compute power of an R290?
 
No, because the 780 can't use the graphics and compute queues simultaneously, unlike the R290; i.e. it doesn't support async compute. So even GCN 1.0 GPUs are superior in that regard.
 
The table lists counts for the front end, which is not a measure of compute power. The PS4 has as many compute queues as Hawaii and Kaveri, so there is no real way to compare on that basis. However, the article is muddled about what it is describing in the text versus the table. The text description for Nvidia discusses queues, not front-end processors. If there are 32 processors managing queues, that is one thing; if the count is queues, the GCN 1.1 and later counts are low by a factor of 8.
 
I have a question: are AMD and NV compute engines roughly equal? Does a 780 have 4x the compute power of an R290?
Multiple compute queues are a bit like hyperthreading. Going from 1 to 2 threads (independent instruction streams) helps the most. Modern Intel CPUs (Haswell, Broadwell, etc.) have 2-way hyperthreading. Some IBM and Sun CPUs have 4-way and 8-way hyperthreading, but that doesn't give them much additional advantage. The same is true of GPU asynchronous compute.
 