Why would switching active threads require moving data between lanes? Absent cross-lane activity, register access can start with the base register ID for the wavefront+whatever ID the code thinks it is using.
Hm. To me it looks like the GCN register file is like a, say, 23-bit address space (8MB), of which you can only address a window of 7 bits (128 VGPRs). So every SIMD in the CUs has a base address, and all register access is relative to that base address, much like "ebp", just for registers instead of memory. Let's call it "tctx" (thread context). Now, when you want to deactivate stalling SIMD threads in a CU, you remember the program counter, store it somewhere and reset the program with a different base address. All the state is in the register file; there are no flags or other processor state which could get lost, like with OoO and so on.
No data is moved, but the lanes between SIMDs/CU are rewired to give access to different windows of the register file. The real register file wiring is much larger than just 7 bits, which means you can create an instruction which temporarily alters the access network's base address, such that a "mov v0, tctx[23].v56" would fetch its source from that part of the register file which holds the 56th VGPR of the 23rd thread. That would be the generic idea.
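To make the base-address idea concrete, here is a minimal Python sketch of a flat register store where a context switch only changes a base index. Everything in it (RegisterFile, WINDOW_BITS, the tctx argument) is a made-up name for illustration, and the 7-bit window is just the figure assumed above, not something taken from the hardware documentation.

```python
# Toy model of "base address + register index" addressing.
# All names and sizes here are illustrative assumptions, not real GCN details.

WINDOW_BITS = 7                   # window size assumed in the post (128 VGPRs)
WINDOW_SIZE = 1 << WINDOW_BITS


class RegisterFile:
    def __init__(self, num_contexts):
        # One flat store; each thread context owns one window of it.
        self.regs = [0] * (num_contexts * WINDOW_SIZE)
        self.tctx = 0             # currently active thread context

    def addr(self, vreg, tctx=None):
        # Flat address = base of the context's window + register index.
        if not 0 <= vreg < WINDOW_SIZE:
            raise IndexError("register index outside the addressable window")
        ctx = self.tctx if tctx is None else tctx
        return ctx * WINDOW_SIZE + vreg

    def read(self, vreg, tctx=None):
        return self.regs[self.addr(vreg, tctx)]

    def write(self, vreg, value, tctx=None):
        self.regs[self.addr(vreg, tctx)] = value

    def switch(self, new_tctx):
        # "Context switch": no data moves, only the base changes.
        self.tctx = new_tctx


rf = RegisterFile(num_contexts=32)
rf.write(56, 1234, tctx=23)          # v56 of thread context 23
rf.switch(5)                         # now running as context 5
rf.write(0, rf.read(56, tctx=23))    # mov v0, tctx[23].v56
```

After the last line, v0 of context 5 holds the value from context 23's v56, while v0 of context 23 is untouched.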
If you don't need to address every other register explicitly, but you want to have registers that are actual swizzles to the sibling threads, then you see that you only need to do things like "(threadid + 1) % threadgroup". That gives you the result of your next circular neighbour; subtract your own value and you've got derivatives. Multiply by 4 and you get the next SIMD, multiply by 64 and you get the next CU, and so on. These operations are all arithmetic modifications of a base address in a global VGPR address space. The cross-SIMD "address" modifications that are possible are written down in the GCN documentation. The command is executed in the same cycle, in parallel, on all threads with the same modifier. The modifier is designed such that it's impossible to have a read/write conflict; it's also impossible to have read/read conflicts.
But I'm not a low-level silicon guy, it might sound more like how a GCN emulator could implement the instruction efficiently.
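In that emulator spirit, the neighbour arithmetic can be sketched in a few lines of Python. The function names are hypothetical and the values are invented; the point is only that "(threadid + 1) % threadgroup" gives every thread a distinct source, so no two threads read the same slot.

```python
# Emulator-style sketch of circular neighbour access; names are hypothetical.

def neighbour(thread_id, group_size, offset=1):
    # Index of the circular neighbour: (threadid + offset) % threadgroup.
    return (thread_id + offset) % group_size


def circular_difference(values):
    # Each thread reads its next neighbour's value and subtracts its own.
    n = len(values)
    return [values[neighbour(i, n)] - values[i] for i in range(n)]


values = [10, 13, 20, 26]                 # one value per thread (made up)
sources = [neighbour(i, len(values)) for i in range(len(values))]
diffs = circular_difference(values)       # [3, 7, 6, -16]
```

Since sources is a permutation of the thread IDs, all threads can apply the same modifier in the same step without two of them hitting the same source.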
The originally introduced swizzling methods were categorized as LDS instructions that didn't write to the LDS storage banks.
There are some operations that also require more than a simple rotation, including mirroring and broadcasting of specific lanes to later rows.
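As a rough illustration of why these go beyond a base-address offset, here is a Python sketch of the three pattern types as plain list operations. The helper names are hypothetical, and the 16-lane row width and 64-lane wavefront are assumptions of the sketch; the actual set of supported patterns is whatever the ISA documentation lists.

```python
# Sketch of rotation, mirroring and broadcast over a list of lane values.
# Helper names and the row width are assumptions for illustration only.

ROW = 16   # assumed number of lanes per row


def rotate(lanes, n):
    # Lane i reads from lane (i + n) modulo the lane count.
    return [lanes[(i + n) % len(lanes)] for i in range(len(lanes))]


def mirror_rows(lanes, row=ROW):
    # Within each row, lane k reads from lane (row - 1 - k) of the same row.
    return [lanes[(i // row) * row + (row - 1 - i % row)]
            for i in range(len(lanes))]


def broadcast_to_later_rows(lanes, src_lane, row=ROW):
    # Every lane in a row after the source lane's row receives the source
    # value; lanes in the source row and earlier rows keep their own value.
    first = (src_lane // row + 1) * row
    return [lanes[src_lane] if i >= first else lanes[i]
            for i in range(len(lanes))]


lanes = list(range(64))          # lane i holds the value i
rotated = rotate(lanes, 1)
mirrored = mirror_rows(lanes)
bcast = broadcast_to_later_rows(lanes, 15)
```

The rotation is a single offset, but the mirror needs a per-row reversal and the broadcast makes many lanes read one source, which is not a permutation at all.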
Having an LDS-like network at the SIMD level could make the LDS network redundant.
Not really, I think. You have just a handful of registers (per thread); the LDS is huge in comparison.