This is the most logical interpretation. An async task would not exhibit a time to completion that's near the sum of the two tasks run separately. That indicates NO async mode is functional and it's defaulting to serial operations.
Since I'm good at car analogies, let's do that.
2 Cars are on the road, let's call them Car 1 (Compute) and Car 2 (Graphics). Both cars are trying to go from A -> B.
The time it takes for Car 1 to travel the journey is 1 hour. The time it takes for Car 2 to travel the journey is 2 hours.
The question is, how long does it take for both Cars to reach destination B?
1. Both Cars can travel on the road together, simultaneously, starting at the same time: 2 hours.
2. Only ONE Car can be on the road at once, so Car 1 goes first (order doesn't matter), finishes, then Car 2 starts. Thus, both Cars reach their destination in: 3 hours.
Minor variations aside, that should be the expected behavior, correct? #1 would therefore be Async Mode, and #2 is not.
That would be the case for the results we're getting with MDolenc's tests, yes. (BTW, that's a "highway lanes" analogy, not a cars analogy.)
Basically, if (graphics+compute) time = (graphics time) + (compute time), then at least with this code the hardware isn't running Async Compute.
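To make that check concrete, here's a throwaway sketch of the heuristic (my own numbers, names and tolerance, nothing from the actual test code):

```cpp
#include <algorithm>
#include <cstdio>

// Crude classifier: given standalone graphics and compute timings plus the
// combined run, decide which timing model the hardware is following.
// The 10% tolerance is arbitrary, purely for illustration.
const char* ClassifyOverlap(double graphicsMs, double computeMs, double combinedMs)
{
    const double serial  = graphicsMs + computeMs;          // one car on the road at a time
    const double overlap = std::max(graphicsMs, computeMs); // both cars driving at once
    const double tol = 0.10;

    if (combinedMs <= overlap * (1.0 + tol)) return "overlapping (Async Compute)";
    if (combinedMs >= serial  * (1.0 - tol)) return "serialized (no Async Compute)";
    return "partial overlap";
}

int main()
{
    // Car analogy: 1 hour compute, 2 hours graphics -> 2 hours if async, 3 hours if serial.
    std::printf("%s\n", ClassifyOverlap(2.0, 1.0, 2.0)); // overlapping
    std::printf("%s\n", ClassifyOverlap(2.0, 1.0, 3.0)); // serialized
    return 0;
}
```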
And that's what we're seeing with both Kepler + Maxwell 1 (which do not support Async Compute by nVidia's own spec) and Maxwell 2.
As far as I can see, there are 3 very odd things with the results so far:
1 - Maxwell 2 isn't doing Async Compute in this test. Pretty much all results are showing that.
Razor1 pointed to someone with two Titan Xs who seemed to be able to do Async, but it looks like the driver is just cleverly sending the render to one card and the compute to the other (which for PhysX is actually something you could toggle in the driver since G80, so the capability has been there for many years). Of course, if you're using two Maxwell cards in SLI in the typical Alternate Frame Rendering mode, this "feature" will be useless because both cards are rendering. The same thing will happen for a VR implementation where each card renders one eye.
2 - Forcing "no Async" in the test (single command queue) is making nVidia chips serialize everything. This means that the last test with rendering + 512 kernels will take Render-time + 512 x (Compute-time of 1 kernel). That's why the test times end up ballooning, which eventually crashes the display driver. (See the sketch after this list for what single-queue vs. two-queue submission looks like.)
3 - Forcing "no Async" is making GCN 1.1 chips do some very weird stuff (perhaps the driver is recognizing a pattern and skipping some calculations, as suggested before?). GCN 1.0, like Tahiti in the 7950, is behaving like it "should": (compute[n] + render) time = compute[n] time + render time.
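For context on what "forcing no Async" means here, a minimal D3D12 sketch of the two submission models, assuming the usual setup (this is not MDolenc's actual test code): async mode puts compute work on a separate COMPUTE-type queue, while the forced-off path pushes everything through a single DIRECT queue.

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Illustrative only: how a D3D12 app typically exposes (or withholds)
// async compute. Whether the GPU actually overlaps the work is up to
// the hardware/driver -- which is exactly what the test is measuring.
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& directQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue,
                  bool asyncMode)
{
    D3D12_COMMAND_QUEUE_DESC direct = {};
    direct.Type = D3D12_COMMAND_LIST_TYPE_DIRECT; // graphics + compute + copy
    device->CreateCommandQueue(&direct, IID_PPV_ARGS(&directQueue));

    if (asyncMode) {
        // Separate compute queue: the driver *may* run these command lists
        // concurrently with the graphics work on the direct queue.
        D3D12_COMMAND_QUEUE_DESC compute = {};
        compute.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
        device->CreateCommandQueue(&compute, IID_PPV_ARGS(&computeQueue));
    } else {
        // "No async": everything goes through the single direct queue, so
        // graphics and compute are at best interleaved, never overlapped.
        computeQueue = directQueue;
    }
}
```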
"Graphics + compute: 238.70ms (7.03G pixels/s)"
That's not what your performance log shows... That's the time for 512 kernels in pure compute mode.