So let me get this straight: is AMD moving toward an NVIDIA-like architecture (easy to program and mostly hardware scheduling)?
Or is it moving toward an Intel Larrabee-like architecture, which had a software scheduler, the only difference being that instead of using 16-wide vector Pentium processors, AMD will design its own 16-wide vector hardware?
Or is it a combination of both: mainly a hardware scheduler (NVIDIA's way) and multiple 16-wide vector units (Larrabee's way)?
Neither, really. Or something in between, or something else entirely; it depends on what you are looking at.
On a very high level it looks a bit like a Cray X1 on a single chip. Four vector processing units (SIMD engines now, SSPs in the X1) form a basically self-contained unit (CU or MSP) integrating scalar and vector capabilities. But that's where the similarities end.
GCN inherits the physical width (16 elements) of the vector ALUs that almost all GPUs (and Larrabee) use now. The logical width stays at 64, though, the value used by AMD for quite some time. But instead of using one VLIW instruction to issue 4 operations for a single vector (wavefront) as on Cayman, it uses 4 instructions from 4 different wavefronts to fill those vector ALUs. That somewhat resembles a hypothetical doubled GF100 SM with 4 instead of only two vec16 ALUs. The scheduling works differently from all previously known GPUs, though.
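To make that contrast concrete, here is a toy Python sketch (purely my own illustration, nothing from AMD's docs): Cayman fills its four slots with independent operations the compiler found within one wavefront (ILP), while GCN fills its four vec16 ALUs with one operation each from four different wavefronts (TLP).

def cayman_issue(vliw_ops):
    # One VLIW instruction from ONE wavefront supplies up to 4 slots;
    # the compiler must find the instruction-level parallelism.
    return [(0, op) for op in vliw_ops]

def gcn_issue(wavefronts):
    # Four vec16 ALUs, each fed one instruction from a DIFFERENT wavefront;
    # the parallelism comes from having enough wavefronts in flight.
    return [(wf, ops[0]) for wf, ops in enumerate(wavefronts)]

print(cayman_issue(["mul", "add", "mad", "mov"]))
# [(0, 'mul'), (0, 'add'), (0, 'mad'), (0, 'mov')]
print(gcn_issue([["mul"], ["add"], ["mad"], ["mov"]]))
# [(0, 'mul'), (1, 'add'), (2, 'mad'), (3, 'mov')]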
A GF100 SM has several issue ports (2x vec16 ALU, SFU, L/S [local and global memory]), where each of the two single-issue schedulers can issue one instruction for a vector/warp every second (hot) clock cycle (some exceptions apply because of resource contention). Because of the long pipeline (18 cycles, or 9 vectors, deep) a sophisticated scoreboarding scheme exists to track dependencies between the instructions of a warp. For each warp in flight, a window of 4 or 5 instructions is checked for dependencies, and an instruction can potentially be issued before another, independent instruction for the same warp completes.
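The principle of such a window check looks roughly like this (a minimal sketch of the idea, only modeling read-after-write hazards; the real scoreboard is certainly more involved than this):

def issuable(window, in_flight_writes):
    # Pick the first windowed instruction whose source registers are not
    # still being written by an in-flight instruction (RAW hazard check).
    for dest, srcs in window:
        if not any(s in in_flight_writes for s in srcs):
            return (dest, srcs)
    return None  # stall: every windowed instruction waits on a pending result

window = [("r3", ("r1", "r2")),   # r1 still in flight -> must wait
          ("r5", ("r4", "r0"))]   # independent -> may issue out of order
print(issuable(window, in_flight_writes={"r1"}))  # ('r5', ('r4', 'r0'))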
R600 through R900 used a far simpler scheduling system. The compiler arranged the instructions in groups (clauses) which were guaranteed to be independent. Control flow or memory instructions opened separate clauses. Each CU/SIMD engine had two thread sequencers, which simply alternated in supplying the instructions of two wavefronts. Each instruction issued over 4 cycles (a 64-element wavefront on vec16 ALUs), so with two alternating wavefronts the same wavefront issued every 8 cycles, exactly matching the pipeline length of 8 cycles (2 vectors). That means no checking whatsoever had to be done within a clause: for the next instruction 8 cycles later, all dependencies were guaranteed to be resolved. Dependencies were only checked at clause granularity by the global "dispatch processor", making fine-grained control flow slow (changing clauses took about 40 cycles, i.e. clauses with fewer than 10 instructions lower the performance).
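A quick timeline shows why those numbers work out (an illustrative sketch using only the figures above):

PIPELINE = 8       # cycles until an ALU result is written back
ISSUE_CYCLES = 4   # one instruction of a 64-wide wavefront on a vec16 ALU

for cycle in range(0, 24, ISSUE_CYCLES):
    wf = (cycle // ISSUE_CYCLES) % 2   # the two sequencers simply alternate
    # The same wavefront issues again 2 * ISSUE_CYCLES = 8 cycles later,
    # exactly when the previous result arrives -> no dependency check needed.
    print(f"cycle {cycle:2d}: wavefront {'AB'[wf]} issues, "
          f"result ready at cycle {cycle + PIPELINE}")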
GCN does something different. It tries to retain much of the simplicity of the R600 approach while adding flexibility and performance. It basically has 4 schedulers within a CU, which work in a round-robin fashion (a bit like the alternating thread sequencers in R600). Those schedulers issue to a set of ports which are mostly shared within the CU (scalar unit, branch unit, export/GDS, vector memory, local memory) but partly private to each scheduler (vector ALU; each scheduler can issue only to its own vec16 ALU). The shared ports can accept a new instruction each cycle, the private ones only every 4 cycles, matching the round-robin issue.
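In other words, something like this (again just a schematic of my reading of it, not the actual hardware logic):

SHARED = ["scalar", "branch", "export/GDS", "vector_mem", "local_mem"]

for cycle in range(8):
    s = cycle % 4  # round robin over the four schedulers
    # Each scheduler gets a turn every 4th cycle -- exactly the rate at
    # which its private vec16 ALU can accept a new (4-cycle) instruction.
    print(f"cycle {cycle}: scheduler {s} issues to vALU{s} (private) "
          f"plus up to {len(SHARED)} shared ports")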
Up to 5 instructions per cycle can be issued at maximum. Each scheduler selects up to 5 instructions (if that many are available) of 5 different types from 5 different wavefronts (no dependency checking within a wavefront). Memory dependencies are handled by compiler-inserted barrier-type instructions encoding the number of allowed outstanding memory accesses (which are of course also counted by the hardware). These barriers disable instruction issue for the wavefront until the dependency is resolved and are consumed within the scheduler itself.
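The counting mechanism is simple enough to sketch (a toy model of the idea; GCN's ISA exposes this as the s_waitcnt instruction if I'm not mistaken, but the code below is only schematic, not real semantics):

class Wavefront:
    def __init__(self):
        self.outstanding = 0       # memory accesses still in flight

    def issue_load(self):
        self.outstanding += 1      # hardware increments per access...

    def load_returns(self):
        self.outstanding -= 1      # ...and decrements on completion

    def may_issue(self, allowed):
        # Compiler-inserted barrier: block issue for this wavefront until
        # at most `allowed` memory accesses remain outstanding.
        return self.outstanding <= allowed

wf = Wavefront()
wf.issue_load(); wf.issue_load()
print(wf.may_issue(allowed=0))   # False: issue stays disabled
wf.load_returns(); wf.load_returns()
print(wf.may_issue(allowed=0))   # True: dependency resolved, barrier consumed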
While the GCN approach lacks some of the flexibility of the NVIDIA scheduler, it makes up for that with its large number of issue ports, which allow it to handle control flow and "scalar stuff" (values identical in all elements of the vector) basically in parallel to the vector ALUs, increasing utilization while keeping the operation relatively simple.
Btw., the main difference between Larrabee and GPUs (besides the scheduling of warps/wavefronts/vectors, and that Larrabee has a full dual-issue x86 core as the scalar unit per vec16 ALU) is that Larrabee has a permute network between the register file and the vector ALU lanes. GPUs basically use their local memory for that purpose: as each vector lane has its own slice of the register file, no such permutations are directly possible there. While that decreases flexibility, it saves quite a bit of power in the register file.
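To illustrate what that means in practice (a simplified sketch, not real ISA code for either chip): Larrabee can shuffle lanes straight out of the register file, while a GPU takes a round trip through local memory to get the same result.

lanes = list(range(16))                       # one value per vector lane
pattern = [(i + 1) % 16 for i in range(16)]   # e.g. rotate all lanes by one

# Larrabee-style: the permute network shuffles register lanes directly.
permuted = [lanes[src] for src in pattern]

# GPU-style: store every lane to local memory, read back with new indices.
local_mem = {}
for i, v in enumerate(lanes):
    local_mem[i] = v
gpu_permuted = [local_mem[src] for src in pattern]

assert permuted == gpu_permuted   # same result, very different hardware path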