It isn't primarily "limited" to control flow, but the nature of the programming model prevents it from being used for much more than control flow. I wouldn't be surprised if the compilers already take trivial cases to the scalar path, say a constant pointer with a constant offset. It also stores uniform data right now, like texture/buffer descriptors and kernel argument pointers. It is just that you can't possibly go beyond this without human intervention, IMO. Whether it is worth resolving at runtime is debatable too, considering that uniform code paths should be quite identifiable by developers.
GCN compilers already do optimizations for coherent loads & math. The most common case is of course a load from a constant address: the compiler emits a scalar load and stores the value in a scalar register.
This article introduces a simple algorithm for automatic scalar (coherent) code extraction:
http://hwacha.org/papers/scalarization-cgo2013.pdf
All memory loads & ALU instructions based solely on compile time constants, root constants or SV_GroupID (group >= wave) can be trivially converted to scalar loads & instructions (operating on scalar registers). This can be propagated: all inputs are coherent -> all outputs are coherent. This is also why constant buffer loads can often be optimized into scalar loads (and stored in the scalar cache). Indexing an array in a constant buffer with anything calculated from SV_ThreadID is obviously the exception, and the compiler generates vector loads and vector ALU from code like that.
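The propagation rule above (all inputs coherent -> output coherent) can be sketched in a few lines of Python. This is only an illustration, not how any real GCN compiler is implemented: the tuple-based instruction format and the names find_uniform, const, root_const, group_id and thread_id are made up for the example, and it only handles straight-line code.

```python
# Hypothetical mini-IR: each instruction is (dest, op, [sources]).
# Inputs assumed to be identical for every lane of a wave (illustrative names).
UNIFORM_INPUTS = {"const", "root_const", "group_id"}


def find_uniform(instructions, uniform_inputs=UNIFORM_INPUTS):
    """Return the set of values that are the same for every lane in a wave."""
    uniform = set(uniform_inputs)
    for dest, _op, sources in instructions:
        # An output is coherent only if every one of its inputs is coherent.
        if all(src in uniform for src in sources):
            uniform.add(dest)
    return uniform


program = [
    ("a", "mul", ["group_id", "const"]),  # uniform * uniform -> scalar candidate
    ("b", "load", ["a"]),                 # uniform address -> scalar load candidate
    ("c", "add", ["b", "thread_id"]),     # depends on the thread ID -> vector
    ("d", "load", ["c"]),                 # divergent address -> vector load
]

uniform = find_uniform(program)
scalar_candidates = sorted(uniform - UNIFORM_INPUTS)
```

Here a and b end up as scalar candidates, while c and d stay on the vector path because they depend on thread_id.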
Unfortunately the GCN scalar unit doesn't have a floating point instruction set. Scalar propagation cannot continue over float instructions. When the first float instruction is met, the compiler must broadcast the scalar to a vector register and emit SIMD vector math. I have been hoping for a long time that a future GCN would add a full float instruction set to the scalar unit. This would allow a fully automated scalarization process.
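To make the float limitation concrete, here is a rough Python sketch of where a chain of uniform operations would fall off the scalar path. It is a simplified model under stated assumptions: the op names, the FLOAT_OPS set and the assign_units function are hypothetical, and it ignores the option of copying a uniform value back from a vector register to a scalar one.

```python
# Hypothetical op names; anything in FLOAT_OPS is assumed to be vector-only.
FLOAT_OPS = {"add_f", "mul_f"}


def assign_units(instructions, uniform_inputs):
    """Map each destination to 'scalar' or 'vector' for (dest, op, sources) tuples."""
    in_sgpr = set(uniform_inputs)  # values currently held in scalar registers
    placement = {}
    for dest, op, sources in instructions:
        all_scalar_inputs = all(src in in_sgpr for src in sources)
        if all_scalar_inputs and op not in FLOAT_OPS:
            placement[dest] = "scalar"
            in_sgpr.add(dest)
        else:
            # First float op (or any vector input): the value lives in a
            # vector register from here on, even if it is still uniform.
            placement[dest] = "vector"
    return placement


program = [
    ("offset", "mul_i", ["group_id", "const"]),
    ("value", "load", ["offset"]),
    ("scaled", "mul_f", ["value", "const"]),   # float math ends the scalar chain
    ("biased", "add_f", ["scaled", "const"]),
    ("masked", "and", ["biased", "const"]),    # integer op, but input is in a VGPR
]

placement = assign_units(program, {"group_id", "const"})
```

In this model offset and value stay scalar, and everything from the first float multiply onwards is vector work, even though every value in the chain is still wave-uniform.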
The compiler could also be better at scalar extraction. Right now it does it only for some simple known cases. For example, it can't produce scalar code if I divide SV_ThreadID by 64 (or 128, 192, 256...) and do a load based on that address. The result is the same for every thread in a wave, so the compiler could emit a scalar load to a scalar register (and propagate the coherent status further -> more scalar loads & math).
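A quick way to convince yourself of the divide-by-64 case is to brute force it. The Python sketch below assumes 64-wide waves and that threads are packed into waves in linear thread ID order (64 consecutive IDs per wave); is_wave_uniform is a made-up helper name.

```python
WAVE_SIZE = 64  # assumed GCN wave width


def is_wave_uniform(per_thread_values, wave_size=WAVE_SIZE):
    """True if every wave sees a single value across all of its lanes."""
    for start in range(0, len(per_thread_values), wave_size):
        wave = per_thread_values[start:start + wave_size]
        if len(set(wave)) != 1:
            return False
    return True


thread_ids = list(range(256))  # one 256-thread group = 4 waves

div_64 = [t // 64 for t in thread_ids]
div_128 = [t // 128 for t in thread_ids]
div_192 = [t // 192 for t in thread_ids]
div_32 = [t // 32 for t in thread_ids]

results = {
    "id / 64": is_wave_uniform(div_64),
    "id / 128": is_wave_uniform(div_128),
    "id / 192": is_wave_uniform(div_192),
    "id / 32": is_wave_uniform(div_32),
    "id": is_wave_uniform(thread_ids),
}
```

Under those assumptions, dividing by any multiple of 64 gives a wave-uniform value, while dividing by 32 (or using the raw ID) does not.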
If the group is 64 threads (for example an 8x8 tile in screen space), then SV_GroupID is perfect for scalarization (1:1 mapping to waves). All per-tile operations (culling, etc.) could be offloaded to the scalar unit. But of course this would require float instruction support in the scalar unit.
For groups larger than that it becomes slightly harder. If the group X size = 64, then SV_GroupThreadID.y is wave coherent, and everything based on it could be automatically turned into scalar code. However, this would require the programmer to specify awkward group sizes, such as 64x4. This, for example, mimics a 16x16 group with 8x8 subgroups. You of course need to do custom math to remap threads to actual screen pixels (luckily GCN already has combined single cycle shift+mask instructions). I haven't tested whether the AMD GCN shader compiler does scalarization automatically based on SV_GroupThreadID.y (if the group width is a multiple of 64). I could try it on some integer heavy compute shader. Obviously it wouldn't help with float math heavy shaders.
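The remapping math for the 64x4 case could look something like the Python sketch below. It is one possible layout, not something taken from a real shader: remap_64x4_to_16x16 is a hypothetical name, and it assumes each row of 64 threads (one value of the group thread Y) is exactly one wave and should cover one 8x8 subtile of a 16x16 tile.

```python
def remap_64x4_to_16x16(gx, gy):
    """Map a thread of a 64x4 group to a pixel inside a 16x16 tile.

    gx: group thread X in 0..63 (lane within the wave)
    gy: group thread Y in 0..3  (wave index, selects the 8x8 subtile)
    """
    # Subtile origin depends only on gy, so it is wave-uniform
    # and a candidate for scalar code.
    sub_x = (gy & 1) << 3
    sub_y = (gy >> 1) << 3
    # Position inside the 8x8 subtile is per lane: plain shift + mask.
    local_x = gx & 7
    local_y = gx >> 3
    return sub_x | local_x, sub_y | local_y


# Every pixel produced by the 256 threads of one group.
pixels = [remap_64x4_to_16x16(gx, gy) for gy in range(4) for gx in range(64)]

# Pixels covered by wave 1 (gy == 1) only.
wave1 = [remap_64x4_to_16x16(gx, 1) for gx in range(64)]
```

With this layout the subtile origin is computed purely from the Y component, so all the per-subtile work is based on a wave-coherent value, and only the final pixel position needs per-lane math.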