If AMD hardware is anything like Nvidia (and I rather think it is), you can absolutely have multiple blocks resident on the same CU. I use this to allow wave-sized blocks, which avoids the need for syncthreads in some kernels.
This is definitely true. You can have as many thread groups executing simultaneously on a single CU as you have resources available. The resources in this case are waves (40, i.e. 2560 threads), vector registers (256 KB) and LDS (64 KB). Scalar register count can also be a limitation, but in practice never is. If a group's thread count is not divisible by the wave size (64), it is rounded up to the next full wave. This gives a maximum of 40 thread groups per CU (assuming single-wave groups).
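The limits above can be sketched as a small occupancy calculation. This is an illustrative helper, not a real driver API; the constants are the per-CU figures from this post, and the example register/LDS figures in the call are made up.

```python
# Per-CU resource totals from the post: 40 waves of 64 threads
# (2560 threads), 256 KB of vector registers, 64 KB of LDS.
WAVE_SIZE = 64
MAX_WAVES_PER_CU = 40
VGPR_BYTES_PER_CU = 256 * 1024
LDS_BYTES_PER_CU = 64 * 1024

def groups_per_cu(group_threads, vgpr_bytes_per_thread, lds_bytes_per_group):
    """Max resident thread groups per CU under each resource limit (sketch)."""
    # Thread counts round up to a whole number of waves.
    waves_per_group = -(-group_threads // WAVE_SIZE)  # ceiling division
    limit_waves = MAX_WAVES_PER_CU // waves_per_group
    limit_vgpr = VGPR_BYTES_PER_CU // (waves_per_group * WAVE_SIZE * vgpr_bytes_per_thread)
    limit_lds = (LDS_BYTES_PER_CU // lds_bytes_per_group
                 if lds_bytes_per_group else limit_waves)
    return min(limit_waves, limit_vgpr, limit_lds)

# A single-wave group using 24 VGPRs (96 bytes) per thread and no LDS
# hits the 40-group wave limit described above:
print(groups_per_cu(64, 96, 0))  # -> 40
```

Note how a 100-thread group costs two full waves here, so at most 20 such groups fit, exactly the rounding-up behavior described above.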
DirectX programmers can't count on a fixed wave/warp size, since Nvidia = 32, AMD = 64 and Intel = {8,16,32}. A good compiler should remove syncthreads calls in kernels whose thread group size is no larger than the GPU's native wave/warp size.
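The legality condition for that optimization can be stated in a few lines. This is an illustrative sketch of the check, not an actual compiler pass; for Intel's variable wave size, the conservative choice is its minimum.

```python
# A barrier between threads of a group is redundant when the whole
# group fits in one wave/warp, since a wave executes in lockstep.
# Wave sizes per vendor follow the post; Intel's is {8,16,32}, so we
# must assume the smallest to stay correct.
MIN_WAVE_SIZE = {"nvidia": 32, "amd": 64, "intel": 8}

def can_elide_barrier(vendor, group_size):
    """True if every thread of the group is guaranteed to share one wave."""
    return group_size <= MIN_WAVE_SIZE[vendor]

print(can_elide_barrier("amd", 64))     # -> True: one GCN wave
print(can_elide_barrier("nvidia", 64))  # -> False: spans two warps
```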
Allowing blocks from different kernels on one CU at the same time is a much stickier issue. One has to remember that if this happens, you'll end up with a lot of contention over the instruction cache, which makes scheduling very tricky. Do you favor homogeneous loads on each CU, saving instruction cache space (remembering that the compiler may be looking at the size of that cache to decide how far it can unroll loops and so forth)? Or do you favor heterogeneous loads, which can help when one task's footprint is conveniently sized to "fill in the gaps", or when you can offset a kernel that bottlenecks on some per-CU resource, like local memory bandwidth or texture cache?
Heterogeneous loads are a big performance win in many cases on GCN. There are many real-world examples available (look at recent GDC/SIGGRAPH presentations).
Common examples are:
- One kernel is sampler bound and the other isn't. Example: parallax mapping (N trilinear/aniso taps for root finding), anisotropic filtering in general (multiple textures), bicubic kernels (3x3 bilinear taps per pixel), blur filters (N bilinear taps), etc. A pure compute task (no tex filter instructions) can fill the execution gaps nicely.
- One kernel uses lots of LDS (thread group shared memory) and this limits its occupancy. A second kernel with no LDS usage increases occupancy.
- One kernel is heavily memory (and L1 cache) bound, while the other is mostly math crunching (ALU instructions using registers and/or LDS).
- The tail of another kernel. Work finishes at wave granularity (after the last barrier). Waiting for the last wave to finish means that processing cycles are lost on each CU (every time the kernel changes).
- Resource bottlenecks (waves/registers/LDS). One kernel allocates some CU resources heavily while other resources are left unused. It is better to schedule thread groups from multiple kernels (different shaders) to utilize the CU resources better.
- Uneven split of resources. There are cases where a single kernel doesn't evenly divide all CU resources (waves/registers/LDS) between thread groups (see my post above). The remainder is left unused. Better to schedule multiple kernels (with different resource requirements) to use all the resources.
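The LDS-bound case above can be put in numbers. This is a hedged sketch using the per-CU totals from the earlier post (40 waves, 64 KB LDS); the two kernels and their resource figures are made up for illustration.

```python
MAX_WAVES = 40
LDS_BYTES = 64 * 1024

def groups_alone(waves_per_group, lds_per_group):
    """Resident groups for a single kernel, limited by waves and LDS."""
    by_waves = MAX_WAVES // waves_per_group
    by_lds = LDS_BYTES // lds_per_group if lds_per_group else by_waves
    return min(by_waves, by_lds)

# Kernel A: single-wave groups with 24 KB LDS each -> LDS limits it
# to 2 resident groups, leaving 38 of 40 waves idle.
groups_a = groups_alone(1, 24 * 1024)
waves_left = MAX_WAVES - groups_a * 1

# Kernel B: single-wave groups, no LDS -> its groups can fill every
# wave that A leaves unused.
groups_b = waves_left // 1
print(groups_a, groups_b)  # -> 2 38
```

Run alone, kernel A wastes 95% of the CU's wave slots; co-scheduled, the pair uses all 40.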
The most common heterogeneous case is executing vertex and pixel shaders on the same CU. Vertex + hull + domain shaders is another example. GCN is not limited to running multiple compute shaders concurrently on the same CU: you can have a mix of different graphics kernels and compute kernels.