Hmmm, I don't follow. How does scheduling benefit from the relationship between SIMD count and instruction latency?
You can use a simple round robin scheme for the schedulers (as implied in the presentation) and you are guaranteed to be able to issue a dependent instruction without tracking dependencies between arithmetic instructions. It's the same reasoning as for the two interleaved wavefronts used from R600 through Cayman.
It should just follow the normal mantra of having enough wavefronts available to hide instruction latency. There shouldn't be any hardcoded assumptions at play - that's why they got rid of clauses.
AMD currently hides the instruction latencies with the execution of two interleaved wavefronts. They don't need to track dependencies between individual instructions at all. They only need to track dependencies between clauses (involving memory clauses; arithmetic clauses don't have dependencies on each other and can simply be executed in order).
They get rid of the clauses because control flow (which opens up a new clause) was inefficient (generally all code with a lot of short clauses was). Clauses simplify checking for dependencies (at least one kind of them). The arithmetic dependencies were completely hidden by the interleaved wavefront issue.
The number of SIMDs is determined by the number of instruction schedulers/dispatchers, the width of the SIMD and the wavefront size.
# SIMDs (x) = instruction dispatch rate * wavefront size / SIMD width
Since the SIMDs are pipelined, you should ALWAYS be able to issue an instruction from a new wavefront to a given SIMD every x clocks, regardless of instruction latency.
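A quick back-of-the-envelope check of that relation with the numbers from the slides (the variable names are mine):

```python
# number of SIMDs = instruction dispatch rate * wavefront size / SIMD width
dispatch_rate = 1    # vector instructions issued per clock by the scheduler
wavefront_size = 64  # work-items per wavefront
simd_width = 16      # lanes per vector ALU

num_simds = dispatch_rate * wavefront_size // simd_width
# clocks a SIMD is occupied per instruction (64 work-items through 16 lanes)
cycles_per_instruction = wavefront_size // simd_width

print(num_simds)              # 4
print(cycles_per_instruction) # 4
```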
But if you increase the number of SIMDs to, let's say, 6 and you use the same scheduling scheme as suggested (round robin), it's not going to be any faster than putting in just 4 SIMDs.
Wavefront size stays at 64 as stated and the width of the vector ALUs is 16, so every 4 clocks a new instruction can be accepted. If you have more than 4 SIMDs, you can't serve them round robin anymore, because the scheduler comes by each SIMD only every 6 cycles for instance (when you have 6 SIMDs). There is basically no easy way to increase that number arbitrarily; you have to double your scheduling width (twice the number of instructions per cycle) to accommodate 8 SIMDs. Something in between doesn't make much sense.
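The round-robin mismatch can be made concrete with a small sketch (function name and structure are my own, assuming one vector issue per clock handed out in strict rotation):

```python
def alu_utilization(num_simds, busy_cycles=4):
    """Fraction of peak ALU throughput under strict round-robin issue.

    A 16-wide SIMD chewing on a 64-wide wavefront can accept a new
    instruction every 4 clocks, but the scheduler only revisits each
    SIMD every num_simds clocks.
    """
    revisit_interval = max(num_simds, busy_cycles)
    return busy_cycles / revisit_interval

print(alu_utilization(4))  # 1.0  -> round robin matches the 4-clock cadence
print(alu_utilization(6))  # ~0.67 -> each SIMD idles 2 of every 6 clocks
```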
Let's go back to the 4 SIMDs seen on the slides. The instruction latency comes into play when you think about following the line of keeping things simple. If it is 8 cycles for instance, you need to issue the next instruction for that SIMD from another wavefront. Welcome back to the interleaved wavefront scheme (now just for more than two). Only for a 4 cycle latency do you get the "vector back-to-back wavefront issue" mentioned on the slides and get rid of this interleaving as promised on one of the slides (but one may argue about it, I agree). Everything between 5 and 8 cycles requires at least 2 wavefronts/SIMD, 9-12 cycles will require 3, and so on. While just marking a wavefront as "busy", i.e. suppressing the issue of an instruction for the next (one or two) times, sounds fairly easy, it will nevertheless complicate things a bit (it will have to be checked when considering instructions for issue) and will also raise the minimum number of wavefronts necessary to hide the arithmetic latencies. => Bad
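Put as a formula (my own naming, assuming an issue slot on each SIMD every 4 clocks): a dependent instruction of the same wavefront can only go back-to-back if the latency fits in that 4-clock window, otherwise you need ceil(latency / 4) wavefronts per SIMD.

```python
import math

def wavefronts_per_simd(latency, issue_interval=4):
    """Wavefronts needed per SIMD to keep it fed despite ALU latency."""
    return math.ceil(latency / issue_interval)

for lat in (4, 5, 8, 12):
    print(lat, wavefronts_per_simd(lat))
# 4 -> 1 (back-to-back wavefront issue), 5..8 -> 2, 9..12 -> 3
```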
An old SIMD engine could attain peak rates with only 2 wavefronts (128 data elements). With a 4 cycle latency, a CU could get to peak rates with 4 wavefronts (256 data elements). With an 8 cycle latency (everything in between makes no difference and is only a possible reason for limiting the clock speed) you need at least 8 wavefronts (512 data elements), and for 12 cycles you are at 12 wavefronts (768 data elements). For comparison, a GF100 SM needs 576 data elements (18 warps, the instruction latency) in the worst case (absolutely no ILP present in the instruction stream), down to 192 data elements (6 warps) with ILP=4 (it does not scale beyond that, which is obviously the size of the instruction window the Fermi schedulers look at). So especially for lightly threaded problems, there is a strong incentive to keep the latencies down so one doesn't fall behind Fermi (AMD's CUs probably don't care about ILP at all, as this would again mean one would need to track dependencies between arithmetic instructions). And as the effective size of the register file available to one thread is quite a bit lower in the new architecture (because each SIMD has a separate one; before, the instructions in VLIW slot z for instance could access elements from all other banks), that's another incentive to keep the number of threads somewhat in check, i.e. the latencies down.
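The data-element counts on the CU side can be recomputed from the same assumptions (4 SIMDs, 64-wide wavefronts, an issue slot every 4 clocks; the helper name is mine):

```python
import math

def min_data_elements(latency, num_simds=4, wavefront_size=64, issue_interval=4):
    """Minimum resident data elements for peak arithmetic rate on a CU."""
    wavefronts = num_simds * math.ceil(latency / issue_interval)
    return wavefronts * wavefront_size

print(min_data_elements(4))   # 256 ( 4 wavefronts)
print(min_data_elements(8))   # 512 ( 8 wavefronts)
print(min_data_elements(12))  # 768 (12 wavefronts)
```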
The alternative is Fermi: in comparison, much more scheduling effort to hide the long latencies, not only for memory accesses but also for plain arithmetic instructions. I would guess AMD tries to save something on the latter part, especially as they claimed that the die real estate of a CU will not rise excessively compared to Cayman's SIMD engines (it would have to rise either way with the scalar units, double the LDS, and double the L1, which is now R/W).