Uh, that's an ugly topic. What about NVIDIA apparently using pre-emption rather than truly parallel execution, and having to wait for a draw call to finish before switching context? At least that's the case for "asynchronous timewarp", which is just another case of asynchronous shaders / compute. (It's in their VR documents, linked here too, I'm sure.)
Oh, and no, Maxwell v2 is actually capable of "parallel" execution. The hardware doesn't profit from it much, though, since there are only small "gaps" in shader utilization either way. So in the end it's still just sequential execution for most workloads, though if you did manage to stall the pipeline in some way by constructing an unfortunate workload, you could still profit from it. In general, you are only saving on idle time, as long as there is always at least one queue which contains any work at all. Unlike GCN, where you actually NEED that additional workload to get full utilization.
Please don't use the word "asynchronous" to mean "concurrent". The vast majority of asynchronous interfaces are implemented sequentially. Asynchronous shaders specify the interface; they don't specify the implementation. If you are interested in measuring whether asynchronous shaders can be executed concurrently, in order to improve efficiency, please use "concurrent" rather than "asynchronous".

Original topic:
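The interface-vs-implementation distinction can be sketched in plain Python, with threads standing in for GPU queues (all names here are made up for illustration, not any real API): both functions below expose the same "submit work and return immediately" asynchronous interface, but only one of them actually executes the tasks concurrently.

```python
import threading
import time

def submit_sequential(tasks):
    """Asynchronous interface, sequential implementation: submission
    returns immediately, but a single worker drains the tasks one
    at a time. Returns (thread, results) so callers can join."""
    results = []
    def worker():
        for t in tasks:
            results.append(t())
    th = threading.Thread(target=worker)
    th.start()
    return th, results

def submit_concurrent(tasks):
    """Same asynchronous interface, concurrent implementation:
    one worker per task, so independent tasks overlap in time."""
    results = []
    lock = threading.Lock()
    def run(t):
        r = t()
        with lock:
            results.append(r)
    threads = [threading.Thread(target=run, args=(t,)) for t in tasks]
    for th in threads:
        th.start()
    return threads, results
```

A caller cannot tell the two apart from the interface alone; only a timing measurement (do two long tasks take one task's wall time, or two?) reveals whether the implementation is concurrent.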
What performance gains are to be expected from the use of async shaders and other DX12 features? Async shaders are basically additional "tasks" you can launch on your GPU to increase utilization, and thereby efficiency as well as performance, even further. Unlike in the classic model, these tasks are not performed one after another, but truly in parallel.
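The "filling gaps" argument can be made concrete with a toy frame-time model (entirely my own simplification, not a description of any real scheduler): graphics leaves some fraction of each time slot idle, and overlapped async compute can soak up that idle fraction before spilling into serial time at the end of the frame.

```python
def frame_time(graphics_slots, compute_work, overlap):
    """Toy model: graphics_slots lists per-slot ALU occupancy in [0, 1];
    compute_work is total async-compute work in whole-slot units.
    With overlap, compute first fills the idle fraction of each graphics
    slot; any leftover compute runs serially afterwards. Returns total
    frame time in slots."""
    left = float(compute_work)
    if overlap:
        for g in graphics_slots:
            left -= min(1.0 - g, left)
    return len(graphics_slots) + left
```

Under this model, a GPU with 70% graphics occupancy hides 3 slots of compute entirely (10 slots instead of 13), while one already at 95% occupancy gains almost nothing (12.5 instead of 13) — which is the Maxwell-vs-GCN point made above.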
http://docs.nvidia.com/cuda/samples/6_Advanced/simpleHyperQ/doc/HyperQ.pdf

Hyper-Q enables multiple CPU threads or processes to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and slashing CPU idle times. This feature increases the total number of "connections" between the host and GPU by allowing 32 simultaneous, hardware-managed connections, compared to the single connection available with GPUs without Hyper-Q (e.g. Fermi GPUs).
For this test, with 1 thread per kernel, it would devolve to 1 kernel launch per cycle per ACE -- if each launch were 1 cycle. At the speeds in question it would be difficult to see a difference.

I wasn't trying to suggest that 64 kernel launches are actually assembled into a single work-group to run as a single hardware thread (though I did theorise that this is possible).
I think you guys misunderstand the nature of the ACEs. Each of them manages a queue. I would expect an ACE to be able to launch multiple kernels in parallel, which appears to be what we're seeing in each line where the timings in square brackets feature many results which are identical, e.g.:
Code:
64. 5.59ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39]
65. 10.59ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 10.37]
66. 10.98ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 10.37 10.78]
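If that reading is right, total latency should step up only at multiples of 64 concurrent kernels. A sketch of that expectation (batch size 64 and the 5.39 ms per-kernel time are taken from the output above; treating them as exact constants is my simplification):

```python
import math

def predicted_latency_ms(n_kernels, batch=64, kernel_ms=5.39):
    """Toy model for the step pattern above: if up to `batch` kernels
    really run concurrently and each takes kernel_ms, total latency
    only increases at batch boundaries."""
    return math.ceil(n_kernels / batch) * kernel_ms
```

The model predicts 5.39 ms for kernels 1 through 64, then a jump to ~10.78 ms at kernel 65 — matching the jump from 5.59 ms to 10.59 ms seen between lines 64 and 65 of the output.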
Basically, even though Maxwell 2 may be as strong as a 390X in compute, if it can't also do graphics rendering concurrently it's pretty useless (since that's one of the most important tasks your GPU has).

Compute is not useless by any means. Many DirectX 11 games used compute shaders; async is just a bonus. There are lots of games out there that use compute shaders for lighting (Battlefield 3 being the first big AAA title). Compute-shader-based lighting performs very well even on AMD TeraScale VLIW GPUs (such as the Radeon 5000 and 6000 series) and NVIDIA Fermi (GeForce GTX 400 series). Compute shaders are used in rendering because they allow writing less brute-force algorithms that save ALU and bandwidth compared to pixel-shader equivalents.
Hyper-Q is for running multiple compute dispatches simultaneously, not for running compute + graphics.
Basically, even though Maxwell 2 may be as strong as a 390X in compute, if it can't also do graphics rendering concurrently it's pretty useless (since that's one of the most important tasks your GPU has). Currently, it appears that Maxwell 2 and GCN 1.2 (Fury X) don't handle concurrent graphics+compute well, and we're trying to figure out why, but supposedly it'll be fixed in a driver update.

Actually, the Fury X is handling concurrent compute kernels quite well. I wouldn't know why someone would assume otherwise.
Yes, the latency is comparably high in relation to what Nvidia had to offer. But the hardware still managed to dispatch up to 128 calls (kernels originally scheduled with only 1 thread each!) in parallel, using only a single queue of a single ACE — almost at the limit of what had been possible based on the raw instruction throughput.

Raw instruction throughput, or rather raw front-end dispatch capability, would exhaust the total number of dispatches in the test in microseconds.
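The "microseconds" claim is simple back-of-envelope arithmetic. Assuming round numbers of my own choosing (a 1 GHz front end issuing one dispatch per cycle — neither figure is from the test), draining every dispatch takes well under a microsecond, so the measured millisecond-scale latencies must come from execution, not from front-end issue:

```python
def dispatch_drain_time_us(n_dispatches, clock_ghz=1.0, cycles_per_dispatch=1):
    """Back-of-envelope: time for the front end alone to issue
    n_dispatches, given an assumed clock and issue rate. A GHz clock
    runs clock_ghz cycles per nanosecond, so the result is in
    microseconds after dividing by 1000."""
    cycles = n_dispatches * cycles_per_dispatch
    return cycles / (clock_ghz * 1e3)
```

At these assumptions, 128 dispatches drain in 0.128 µs — four orders of magnitude below the ~5 ms per-kernel timings in the measurements above.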
I do not think we can be sure that several of these cases aren't already being hit thanks to the driver's and GPU's attempts to maximize utilization, particularly since so many of the dispatches have no dependencies on anything else:
- The number of software queues equals the number of hardware queues
- The number of software queues equals the number of ACEs (AMD only)
- Each compute call dispatches at least a FULL wavefront
We'd have to start creating dependence chains and synchronization points to make sure the optimizations for a trivially parallel case are blocked.
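Why a dependence chain defeats the trivially-parallel case can be shown with a minimal list-scheduling sketch (my own toy model — unit-time tasks and an idealized `width`-wide machine, nothing GPU-specific):

```python
def schedule_length(n_tasks, deps, width):
    """Greedy list scheduling: n_tasks unit-time tasks, deps maps a
    task index to the set of tasks that must finish first, and up to
    `width` ready tasks run per step. Returns the number of steps."""
    done = set()
    steps = 0
    while len(done) < n_tasks:
        # a task is ready once all of its prerequisites have completed
        ready = [t for t in range(n_tasks)
                 if t not in done and deps.get(t, set()) <= done]
        for t in ready[:width]:
            done.add(t)
        steps += 1
    return steps
```

With no dependencies, 64 tasks on a 64-wide machine finish in one step — the scheduler is free to batch everything, which is exactly the optimization we can't rule out. Chain the tasks together and the same machine needs 64 steps, so the measurement would finally distinguish concurrent execution from clever batching.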
A clearer asynchronous test is to insert controlled stalls into one work type or the other and see if the other side can make progress.
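A sketch of that probe in plain Python, with threads standing in for GPU queues (an analogy only — a real version would stall a graphics queue with a fence or a long shader): one "queue" is deliberately blocked while the other runs, and the result is how much work the non-stalled side completed during the stall.

```python
import threading
import time

def concurrent_progress_probe(spin_s=0.2):
    """Stall one 'queue' and check whether the other makes progress in
    the meantime. Returns the number of work items the non-stalled side
    completed while the stalled side was blocked; any value > 0 means
    the two sides really progressed concurrently."""
    progressed = 0
    release = threading.Event()

    def stalled_queue():
        release.wait()          # controlled stall: blocks until released

    def busy_queue():
        nonlocal progressed
        deadline = time.time() + spin_s
        while time.time() < deadline:
            progressed += 1     # one 'work item' finished during the stall

    a = threading.Thread(target=stalled_queue)
    b = threading.Thread(target=busy_queue)
    a.start()
    b.start()
    b.join()                    # busy side finishes all its work first...
    release.set()               # ...and only then is the stall released
    a.join()
    return progressed
```

A sequential implementation would report zero progress here, since it could not run the busy side until the stalled side finished — which is exactly the property the stall test is meant to expose.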