There's no reason why the lack of a hotclock would result in more schedulers. You just need to double the ALU width from Vec8 to Vec16 - in fact, that's already what it looks like to the scheduler.
It is already at vec16 in Fermi. NVIDIA would need to go to physical vec32 ALUs.
Logically, the vector ALUs are basically vec32 anyway, as this is the warp size and one instruction is always executed for 32 elements (the issue just takes two cycles).
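To make the two-cycle issue concrete, here is a toy sketch (my own illustration, the names and interface are made up) of how one logically vec32 warp instruction occupies a 16-lane physical ALU for two cycles, so the scheduler effectively sees one vec32 issue slot every other cycle:

```python
# Toy sketch, not any vendor's implementation: a 32-wide warp instruction
# executed on 16 physical lanes takes two back-to-back cycles.
WARP_SIZE = 32
PHYS_LANES = 16  # Fermi-style vec16 ALU

def issue(warp_operands):
    """Execute one warp instruction; returns (results, cycles_taken)."""
    assert len(warp_operands) == WARP_SIZE
    results, cycles = [], 0
    for half in range(0, WARP_SIZE, PHYS_LANES):  # lanes 0-15, then 16-31
        results.extend(x + 1 for x in warp_operands[half:half + PHYS_LANES])
        cycles += 1
    return results, cycles

res, cycles = issue(list(range(32)))
print(cycles)  # -> 2: the instruction occupies the vec16 ALU for two cycles
```

With physical vec32 ALUs the loop body would run once per instruction, which is exactly why the scheduler count doesn't have to grow with the ALU width.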
But yes, this allows doubling the ALU count (the normal generational increase then comes on top of that) without a disproportionate increase in the number of schedulers (I've written about that possibility in the 3DC forum). In fact, if they do that (going to vec32 ALUs), it's a safe bet they will lose the hotclock (or the schedulers will run at that clock, too).
So would you agree that NVIDIA's proposed register file cache would solve most of these issues for real-world workloads? While in a way it's obviously orthogonal to what we're discussing and there might (or might not) still be real advantages to having a more AMD-like scheduler/RF, this would make it significantly less important for several reasons.
If you think about it, a register file cache makes less sense with a single register file shared by several vecALUs: either the cache would need to be bigger, or the scheduler would need to apply some bias so that instructions from the same warp are preferentially issued to one particular vecALU. The reasoning is simple: if one instruction from a warp gets issued to one of the vecALUs and the next instruction to another one, the regfile cache is basically useless. That means instructions from one warp have to be issued to the same vector ALU anyway. But if you do that, you can just bind the vecALU to one scheduler and embed the regfile into it (a small regfile in each vecALU lane), i.e. replace the regfile cache in each ALU lane with the regfile itself.
Somehow this regfile cache is neither fish nor fowl, in my opinion. It may bring some improvements, but to me it looks a bit like stopping halfway.
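To illustrate the point with a toy model (all numbers and names are mine, not from any real design): the regfile cache mostly captures producer-to-consumer reuse, so I model each vecALU as caching just its most recently written result. If the scheduler alternates a warp's instructions between two vecALUs, the value the next instruction needs always sits in the other ALU's cache:

```python
# Toy model of a per-vecALU register file cache (heavily simplified:
# each cache holds only the most recently written result).
def hit_rate(bias_same_alu, n=10_000):
    caches = {0: set(), 1: set()}  # per-vecALU register file cache
    alu, hits, prev_dest = 0, 0, None
    for i in range(n):
        if not bias_same_alu:
            alu = i % 2            # worst case: warp ping-pongs between ALUs
        if prev_dest is not None and prev_dest in caches[alu]:
            hits += 1              # operand found in this ALU's cache
        prev_dest = ("r", i)       # this instruction's result...
        caches[alu] = {prev_dest}  # ...replaces the cached value
    return hits / (n - 1)

print(hit_rate(True))   # -> 1.0: same-ALU issue, the cache always hits
print(hit_rate(False))  # -> 0.0: alternating issue makes the cache useless
```

The real trade-off is of course softer than 1.0 vs. 0.0, but the direction is the same: the cache only pays off together with a same-ALU issue bias, at which point a static binding plus an embedded regfile does the job.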
If you want a more GCN-like scheduler, the really big question to ask is this: how do you issue multiple loads in parallel as early as possible while needing their results as late as possible? A register scoreboard provides an elegant 'perfect' solution to the problem (obviously limited in practice by how many instructions you can keep waiting). AMD's VLIW clauses are less effective and less elegant.
I think with the typical code run on GPUs, it is currently often easier to just run a few more threads to cover long latencies. Fermi's scoreboarding isn't buying them much in this respect (especially as data dependencies are basically known at compile time). The main advantage, in my opinion, is that they have a single entity managing all dependencies, while in the case of GCN there are several spots, each handling only one particular area. That makes the scoreboarding conceptually simpler, but also more expensive to implement (even with the small window of only 4 instructions in flight per warp).
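The register scoreboard idea above can be sketched in a few lines (a heavy simplification, obviously; the register names and the interface are invented for illustration): loads mark their destination registers busy, and an instruction may issue only once none of its sources are busy. That lets several loads go out back-to-back while the dependent ALU work waits exactly as long as necessary:

```python
# Minimal register-scoreboard sketch (my simplification, not Fermi's
# actual implementation).
class Scoreboard:
    def __init__(self):
        self.busy = set()              # registers with a pending writer

    def can_issue(self, srcs):
        return not (set(srcs) & self.busy)

    def issue_load(self, dest):
        self.busy.add(dest)            # result arrives some cycles later

    def complete_load(self, dest):
        self.busy.discard(dest)

sb = Scoreboard()
sb.issue_load("r1")                    # two independent loads issued early,
sb.issue_load("r2")                    # in parallel, no clauses needed
print(sb.can_issue(["r1", "r2"]))      # -> False: the consumer must wait
sb.complete_load("r1")
sb.complete_load("r2")
print(sb.can_issue(["r1", "r2"]))      # -> True: e.g. add r3, r1, r2 can go
```

In hardware the expensive part is exactly what the toy hides: checking every source of every waiting instruction against the busy set each cycle, which is why the window per warp stays so small.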
GCN uses a very smart trick you very nicely described in
one of your posts back in July. I think it's a very good solution - not theoretically as 'perfect' as a register scoreboard but pretty close and clearly less expensive in hardware. NVIDIA could theoretically do something like this as well but I honestly don't see it happening, especially not for Kepler.
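For readers who missed that post: as far as I understand it, the trick is GCN's waitcnt counters. Instead of a per-register scoreboard, the hardware keeps a small counter of outstanding memory operations per wavefront, and the compiler inserts an explicit wait for "at most N still outstanding"; since results return in issue order, the compiler knows which register each completion fills. A toy model (the method names mimic the ISA mnemonics, the rest is invented):

```python
# Toy model of GCN's counter-based dependency tracking (s_waitcnt/vmcnt);
# a simplification of the ISA mechanism, not a hardware description.
class WaitcntModel:
    def __init__(self):
        self.outstanding = 0       # vmcnt: loads issued but not yet returned

    def issue_load(self):
        self.outstanding += 1

    def load_returns(self):        # memory results return in issue order
        self.outstanding -= 1

    def s_waitcnt(self, n):
        """True once at most n loads are still in flight."""
        return self.outstanding <= n

w = WaitcntModel()
w.issue_load(); w.issue_load()     # two loads issued early, in parallel
print(w.s_waitcnt(1))              # -> False: consumer of the 1st load waits
w.load_returns()
print(w.s_waitcnt(1))              # -> True: the first result has arrived,
                                   #    dependent code runs while load 2 is
                                   #    still in flight
```

One counter per wavefront instead of a CAM over the whole register file: not 'perfect' (a wait on the first load also bounds everything issued before it), but very cheap.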
I don't see it either. In the 3DC forum post linked above, I already clarified that a Kepler SM may show an apparent similarity to GCN in some aspects, but that it will be merely superficial. It will probably work entirely differently in the main aspects. After all, my statement about the GCN clone (I posted "I'm almost inclined to think that Kepler will look somewhat similar to GCN!" with a runaway smiley [emphasis added]) was just a bit of bait.