I wasn't trying to suggest that 64 kernel launches are actually assembled into a single work-group to run as a single hardware thread (though I did theorise that this is possible).
I think you guys misunderstand the nature of the ACEs. Each of them manages a queue. I would expect an ACE to be able to launch multiple kernels in parallel, which appears to be what we're seeing on each line where the bracketed timings contain many identical results, e.g.:
Code:
64. 5.59ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39]
65. 10.59ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 10.37]
66. 10.98ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 10.37 10.78]
For this test, with 1 thread per kernel, it would devolve to 1 kernel launch per cycle per ACE--if each launch took only 1 cycle. At the speeds in question it would be difficult to see a difference: at a clock around 1GHz, 64 back-to-back launches would cost on the order of tens of nanoseconds, invisible next to the 5.39ms timings above.
One thing I was wondering about, which I hoped would be teased out if we could vary the length of the inner loop or modify the start and end sizes of the batches, was whether something could be derived by looking at the list rotated 90 degrees.
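To make that concrete, here's a rough Python sketch of the rotation I mean. The line format is just read off the output pasted above; the filename and function names are mine, not anything from the actual test program:
Code:
import re

# Parse a benchmark line of the form "64. 5.59ms [5.39 5.39 ...]"
# into the list of per-dispatch times (format assumed from the
# output pasted above).
def parse_row(line):
    m = re.match(r'\s*(\d+)\.\s+([\d.]+)ms\s+\[([^\]]*)\]', line)
    return [float(t) for t in m.group(3).split()]

def rotate(rows):
    """Rotate the ragged list of per-dispatch timings 90 degrees:
    column k collects the k-th dispatch time from every batch that
    has at least k+1 dispatches."""
    width = max(len(r) for r in rows)
    return [[r[k] for r in rows if len(r) > k] for k in range(width)]

# 'fury_unrolled.log' is a placeholder for whichever run is pasted.
rows = [parse_row(line) for line in open('fury_unrolled.log') if '[' in line]
cols = rotate(rows)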
The unrolled loop's shortening of execution times does something similar.
There is a pattern to the stair steps that is not affected by execution time. It's not consistent across all GCN GPUs, but it seems somewhat stable within individual examples and across members of the same family.
I'm picking through test runs by eye, so this isn't rigorous, but one interpretation is that the dispatches form a square: horizontally within each batch and vertically across batches.
The recent Fiji test shows blocks of similar times that are almost 30x30 before the times to the right step up to the next value.
Viewing rows 30 through 60 next, that second set of times forms its own rough square. There are 1-2 times in a few rows that land past the threshold, so there's an edge to the heuristic somewhere in there.
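One crude way to put a number on the squares, reusing rows from the parsing sketch above: split each batch's times wherever a value steps up past some tolerance, and look at the block widths. The 5% tolerance is an arbitrary choice of mine; a ~30x30 square should show up as a first-block width near 30 holding steady for ~30 consecutive batches:
Code:
def block_widths(times, tol=0.05):
    """Split one batch's dispatch times wherever the value steps up
    by more than tol relative to the previous dispatch, and return
    the width of each resulting block."""
    widths, start = [], 0
    for i in range(1, len(times)):
        if times[i] > times[i - 1] * (1 + tol):
            widths.append(i - start)
            start = i
    widths.append(len(times) - start)
    return widths

# One line per batch: block widths should trace out the squares.
for batch, times in enumerate(rows, start=1):
    print(batch, block_widths(times))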
Tahiti-derived GPUs seem to have something of a rough 64x64 pattern.
There are breaks in the pattern in a few rows, which might depend on whether we're looking at a full or salvage die. The 7950 and the non-X 280 have a row or two near the end of their stride that are slower than the next.
Sea Islands hovers around 34x34; both the 290 and the 7790 show this with and without unrolling, although I'm focusing primarily on the 30/60/90/130 range at present and haven't gone through the numbers outside it.
Addendum:
There are other interpretations of the data. Since this is being pipelined, and there are no absolute timings, there are other ways to draw the boundaries, and even other ways to make things fit within the same set of boundaries.
edit:
Interestingly, further down the unrolled Fury results, there is a range (batches 91-122) where 32 batches have dispatch strides of roughly 30 that share a similar time. It is followed by a run of 28 batches with a stride of 30.
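For what it's worth, runs like that can be pulled out mechanically with the block_widths sketch above: take each batch's first block width as its stride, then group consecutive batches. (I'm grouping on exact stride here; since the strides only roughly match, some bucketing would be needed in practice.)
Code:
from itertools import groupby

# The "stride" of a batch is the width of its first block of
# similar times; consecutive batches with the same stride form
# runs like the 32-batch and 28-batch ones described above.
strides = [block_widths(times)[0] for times in rows]
for stride, run in groupby(strides):
    print(f'stride {stride}: {len(list(run))} consecutive batches')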