Asynchronous Compute: what are the benefits?

A little late in sharing this, but I think I've finally figured out where that 14+4 CU line of thinking came from.

Reading very thoroughly through the leaked Xbox SDK, the section on async compute states that, by default, only 4 CUs are set aside for processing a job. You must alter the parameter yourself to use additional CUs. I'm unsure whether you can use fewer CUs.

This leads to an interesting factoid: Xbox One can dispatch at most 3 async jobs simultaneously, and PS4 at most 4. That holds only if you cannot use fewer than 4 CUs per job, of course.
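
As a rough sketch of that arithmetic: assuming every async job reserves exactly 4 CUs, and taking 12 usable CUs for Xbox One and 18 for PS4, the number of simultaneous jobs is a plain integer division. The function name `max_concurrent_jobs` is made up for illustration and does not come from any SDK.

```python
def max_concurrent_jobs(total_cus: int, cus_per_job: int = 4) -> tuple[int, int]:
    """Return (jobs, leftover_cus) when every job reserves cus_per_job CUs.

    cus_per_job=4 mirrors the default reservation described in the post;
    it is an assumption here, not a value read from real hardware.
    """
    if cus_per_job <= 0:
        raise ValueError("cus_per_job must be positive")
    # divmod gives the whole number of jobs and the CUs that no job can claim
    return divmod(total_cus, cus_per_job)


for name, cus in (("Xbox One", 12), ("PS4", 18)):
    jobs, leftover = max_concurrent_jobs(cus)
    print(f"{name}: {jobs} jobs, {leftover} CUs left over")
```

Under these assumptions Xbox One fits 3 jobs with nothing left over, while PS4 fits 4 jobs and leaves 2 CUs unreserved.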
 
So does that mean that async compute is a tech that could be "patched" into current engines/games, or is it something that needs to be there from the start?
 
I'm having a hard time following what you are trying to say. How did you make the leap from the XB1 SDK to something about the PS4? How do the 8 compute pipelines with 8 queues each in the PS4 GPU translate to a discrete number of CUs for compute?
 
When you submit an async compute job into the system, per the Xbox SDK, the number of CUs used for a job is 4 by default, unless you change it to a larger number. This is what is written in the SDK.

The async controllers wait for availability to insert work into the CUs. But each job (in the Xbox case) will block off at least 4 CUs for the task.

If the two GPUs are similar in this respect, the default reservation on PS4 would also be 4 CUs, which coincidentally matches all the hubbub about 14+4 a long time ago.
 
12 CUs (Xbox One) = 4×3
18 CUs (PS4) = 14 + 4 = 3×4 + 4 + 2

You'd better not use 4 as the default CU reservation on PS4... ;)
 
So does that mean that async compute is a tech that could be "patched" into current engines/games, or is it something that needs to be there from the start?

There's no reason it couldn't be patched in, although patches are not really the right time to be doing that sort of thing unless there's a huge, previously unnoticed performance issue on ship.
 
When you submit an async compute job into the system, per the Xbox SDK, the number of CUs used for a job is 4 by default, unless you change it to a larger number. This is what is written in the SDK.

The async controllers wait for availability to insert work into the CUs. But each job (in the Xbox case) will block off at least 4 CUs for the task.

If the two GPUs are similar in this respect, the default reservation on PS4 would also be 4 CUs, which coincidentally matches all the hubbub about 14+4 a long time ago.

I am guessing that's because in GCN, CUs are grouped in 4s: each group shares an instruction cache and a scalar data cache.
 
Instead of 8 ACEs, there are 4 ACEs in Fury!

[Image: AMD Radeon R9 Nano Fiji GPU block diagram, Hot Chips]
 
Is that why it doesn't see as much improvement in the DX12 Ashes demo as the 290X/390X? That's weird.
 
The compute scheduler architecture is different. The only thing we know is that they have 2 HWS units, another sort of scheduler that is "smarter" than an ACE.
 
I found this:

All newer GCN 1.2 cards have this configuration. There are 4 core ACEs. The two HWS units can do the same work as 4 ACEs, which is why AMD refers to 8 ACEs in some presentations. The HWS units are just smarter and can support more interesting workloads, but AMD doesn't talk about these right now. I think it has something to do with the HSA QoS feature. Essentially, the GCN 1.2 design is not just an efficient multitasking system, but also good for multi-user environments.

Most GPUs are not designed to run more than one program, because these systems are not optimized for latency. They can execute multiple GPGPU programs, but running a game while a GPGPU program is running won't give you good results. This is why HSA has a graphics preemption feature. These GCN 1.2 GPUs can prioritize all graphics tasks to provide low-latency output. QoS goes one level further: it can run two games, or a game and a GPGPU app, simultaneously for two different users, and the performance/experience will be really good with these HWS units.

http://forums.anandtech.com/showpost.php?p=37656793&postcount=204
 
Is that why it doesn't see as much improvement in the DX12 Ashes demo as the 290X/390X? That's weird.

ACEs don't increase performance. Their count just determines how many async threads you can hold. Each async thread grabs 4 CUs, as written above, so the number of CUs divided by 4 is the number of concurrent threads the GPU can operate on.

In this case, that is 16 threads. Each ACE holds 8 threads.
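
Taking the post above literally, the two limits it describes could be sketched as below. This assumes a 64 CU Fiji part (implied by the figure of 16, not stated in the thread), the 4 ACEs mentioned earlier, 4 CUs per thread, and 8 threads per ACE. The function and parameter names are hypothetical.

```python
def async_thread_limits(total_cus: int, num_aces: int,
                        cus_per_thread: int = 4,
                        threads_per_ace: int = 8) -> tuple[int, int]:
    """Return (threads_running, threads_held) under the post's model.

    threads_running: threads that can occupy CUs at once (total CUs / 4).
    threads_held:    threads the ACEs can keep queued (ACE count * 8).
    Both defaults are the poster's numbers, treated here as assumptions.
    """
    threads_running = total_cus // cus_per_thread
    threads_held = num_aces * threads_per_ace
    return threads_running, threads_held


running, held = async_thread_limits(total_cus=64, num_aces=4)
print(f"running at once: {running}, held by ACEs: {held}")
```

With these inputs the CU count, not the ACE count, is the tighter limit (16 versus 32), which is consistent with the claim that more ACEs do not add performance by themselves.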
 
It is more about efficiency. The only known game using more than 2 queues is The Tomorrow Children, and it only uses 3 queues (ACEs), far from the 7 available (one is reserved for the OS).
 