You're right - that was assuming an architecture similar to Fermi's, where the biggest differences are made for DP purposes and, gaming-wise, it would stay more or less the same - with unit counts scaled up thanks to the process shrink, of course.
But it's quite clear that when someone speculates about "1024 shaders", they don't mean nV will suddenly have, say, 1024 VLIW4 shaders similar to those in Cayman or something; they mean it will continue on the same scalar route as before, where a single "CUDA core" stays quite similar to Fermi's.
It doesn't make a whole lot of difference in today's GPU world whether you call ALUs vector units or "scalar" (which they aren't anyway in the strict CPU sense). IHVs are always looking to take as much advantage as possible of low-hanging fruit, and I don't see why NV wouldn't consider efficiency-oriented changes, unless someone considers NV's ALUs "perfect".
The only major change I've seen suggested so far is removing the hotclock, which to my understanding would allow nVidia to build smaller chips (with the same CUDA core counts), but at the same time the units would be slower too.
So you just remove the hotclock, increase the SPs (and by extension the SMs) by an equivalent amount, and the job is done? I wish things were that simple; while it's not impossible, you might not want to go that route so as not to hit a couple of brick walls.
Let's do some speculative math:
Tesla M2050: 515 GFLOPs DP @ 238W = 2.16 GFLOPs/W
2.16 GFLOPs/W * 3x = 6.49 GFLOPs/W
225W * 6.49 GFLOPs/W = 1460 GFLOPs DP
In other words, with SP at twice the DP rate, a Kepler Tesla could reach 2921 GFLOPs SP.
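For anyone who wants to play with the inputs, the same back-of-the-envelope math can be sketched in a few lines of Python. The 3x perf/W factor and the 225W board power are the speculative inputs from above, the 2:1 SP:DP ratio is an assumption carried over from Fermi Tesla boards, and all the names are just made up for illustration.

```python
# Speculative inputs taken from the figures above; nothing here is a confirmed spec.
M2050_DP_GFLOPS = 515.0      # Tesla M2050 peak DP throughput
M2050_BOARD_W = 238.0        # Tesla M2050 board power
PERF_PER_WATT_GAIN = 3.0     # assumed DP perf/W improvement for Kepler
KEPLER_BOARD_W = 225.0       # assumed board power of a Kepler Tesla
SP_TO_DP_RATIO = 2.0         # assumption: SP rate is twice the DP rate, as on Fermi Tesla


def projected_gflops(base_gflops, base_watts, gain, target_watts):
    """Scale a baseline GFLOPs/W figure by `gain` and apply it to a new power budget."""
    gflops_per_watt = base_gflops / base_watts
    return gflops_per_watt * gain * target_watts


dp_gflops = projected_gflops(M2050_DP_GFLOPS, M2050_BOARD_W,
                             PERF_PER_WATT_GAIN, KEPLER_BOARD_W)
sp_gflops = dp_gflops * SP_TO_DP_RATIO

print(f"Fermi baseline:  {M2050_DP_GFLOPS / M2050_BOARD_W:.2f} GFLOPs/W DP")
print(f"Projected DP:    {dp_gflops:.0f} GFLOPs")
print(f"Projected SP:    {sp_gflops:.0f} GFLOPs")
```

Without the intermediate rounding the DP figure lands a fraction of a GFLOP above the 1460 quoted above, which changes nothing for the argument.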
Since Tesla boards are typically clocked lower than high-end desktop SKUs, it wouldn't be surprising in the above case if the high-end GPU reached 3.5 TFLOPs. To reach that rate with 1024 SPs you'd need a frequency of roughly 1.75GHz, which means not dropping the hotclock but increasing it. To reach those hypothetical 3.5 TFLOPs at a frequency of 1GHz instead, you'd need 1792 SPs, and worse, the lower the core frequency, the higher the SP count obviously gets.
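The clock vs. SP count trade-off can be sketched like this, assuming each SP retires one FMA (2 FLOPs) per clock; the function names are hypothetical. Strictly speaking, 3.5 TFLOPs on 1024 SPs works out to about 1.71GHz, and the 1792 figure is what you get when you carry the rounded 1.75GHz over to 1GHz.

```python
FLOPS_PER_SP_PER_CLOCK = 2  # assumption: one FMA per SP per clock


def clock_ghz_for(target_gflops, sp_count):
    """Clock (GHz) needed to hit target_gflops with a given SP count."""
    return target_gflops / (sp_count * FLOPS_PER_SP_PER_CLOCK)


def sps_for(target_gflops, clock_ghz):
    """SP count needed to hit target_gflops at a given clock (GHz)."""
    return target_gflops / (clock_ghz * FLOPS_PER_SP_PER_CLOCK)


TARGET_GFLOPS = 3500.0  # the hypothetical 3.5 TFLOPs SP figure

exact_clock = clock_ghz_for(TARGET_GFLOPS, 1024)  # a bit above 1.7 GHz
sps_at_1ghz = sps_for(TARGET_GFLOPS, 1.0)         # from exactly 3.5 TFLOPs
sps_from_rounded = 1024 * 1.75 / 1.0              # from the rounded 1.75 GHz

print(f"1024 SPs need {exact_clock:.2f} GHz for 3.5 TFLOPs")
print(f"At 1 GHz: {sps_at_1ghz:.0f} SPs (exact), {sps_from_rounded:.0f} SPs (from 1.75 GHz)")

# The lower the core clock, the higher the SP count.
for clock in (1.0, 0.9, 0.8, 0.7):
    print(f"{clock:.1f} GHz -> {sps_for(TARGET_GFLOPS, clock):.0f} SPs")
```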
And that's not even the entire story for such a puzzle: how many GPCs, or in other words raster/trisetup units? How many SMs or clusters per GPC? How many stream processors per SM/cluster exactly? Etc. etc. Because if they were to keep the current Fermi SM scheme of 2*16 SPs/SM, they'd need 56 SMs for a theoretical count of 1792 SPs, which sounds, at least to me, like a damn long stretch for GPUs even of the 28nm generation.
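To put a number on the SM question, here is a minimal sketch of how the SM count falls out of the SP count and the SM width. The 32 SPs/SM case is the 2*16 Fermi scheme mentioned above and 48 SPs/SM is what GF104 uses; the 64 SPs/SM case is purely a hypothetical width for comparison, not anything known about Kepler.

```python
import math


def sms_needed(sp_count, sps_per_sm):
    """Number of whole SMs needed to house sp_count stream processors."""
    return math.ceil(sp_count / sps_per_sm)


TOTAL_SPS = 1792  # the theoretical SP count for 1 GHz from above

# 32 = Fermi's 2*16 scheme, 48 = GF104-style SM, 64 = hypothetical wider SM
for sps_per_sm in (32, 48, 64):
    print(f"{sps_per_sm} SPs/SM -> {sms_needed(TOTAL_SPS, sps_per_sm)} SMs")
```

The wider the SM, the fewer of them you need, but that is exactly the kind of layout decision a plain "remove the hotclock" story skips over.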
By the way, if NV chose a target frequency for Kepler that isn't below 1GHz like Tahiti's but quite a bit higher, they might have done some sort of "half-manual" layout for units like TMUs/ROPs, since those aren't necessarily all that tolerant of very high frequencies; otherwise GPUs would have been clocking beyond 1.5GHz for a long time now.
I'd love to stand corrected, but the whole thing doesn't sound all that simple to me, and it obviously can't just amount to getting rid of the hotclock and job done.