From the realworldtech chart, the doubling of compute was not accompanied by a doubling of warps per SMX. So it looks like there are actually more registers available per warp than on GF104, but fewer warps relative to the compute throughput and less L1 capacity per warp.
I think the numbers in that chart about register file size per work item may be a bit misleading in a lot of cases. After all, the numbers of work-groups or work-items per core are maximum values the hardware can support for very lightweight threads. Often this isn't that important, hence Kepler made compromises there by supporting fewer work-groups (relative to its compute capabilities), which skews the numbers.
To look at it from the other side, one can check how many work-items the register file is able to support in the case of "heavy" threads, i.e. a case where each thread needs for instance 64 registers (iirc that is the maximum for nV; AMD GPUs support up to 128 regs per thread).
For GF100 and GF104 this works out to 512 work-items or 16 work-groups, for GK104 it is 1024 work-items or 32 work-groups (I assume GK110 will be different), and for GCN/Tahiti it is 1024 work-items or 16 work-groups.
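To make the arithmetic explicit, here is a quick Python sketch of how I arrive at those numbers. The register file sizes per SM/SMX/CU (32768 32-bit registers for GF100/GF104, 65536 for GK104 and for a Tahiti CU) are my assumption of the sizes behind the chart, not something stated in it:

```python
# Back-of-the-envelope check of the "heavy thread" occupancy numbers
# above: 64 registers per thread, register file sizes per core assumed
# (GF100/GF104: 32768 regs, GK104: 65536, GCN/Tahiti CU: 65536).

chips = {
    # name: (registers per SM/SMX/CU, work-group (warp/wavefront) width)
    "GF100/GF104": (32768, 32),
    "GK104":       (65536, 32),
    "GCN/Tahiti":  (65536, 64),
}

REGS_PER_THREAD = 64  # the "heavy" case discussed above

for name, (regfile, width) in chips.items():
    work_items = regfile // REGS_PER_THREAD
    work_groups = work_items // width
    print(f"{name}: {work_items} work-items = {work_groups} work-groups")
```

Running it reproduces the 512/16, 1024/32, and 1024/16 figures, the difference for GCN coming purely from the 64-wide wavefronts.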
But now one has to take the issue rate into account. For GF100 it is a single instruction for two work-groups per (base clock) cycle, which means instructions from up to 8 work-groups per scheduler can overlap in flight (the scheduler needs 8 cycles to issue one instruction from each of them), and those are what is available for latency hiding. This is in fact not enough to hide the arithmetic latencies (10 base clock cycles), let alone any memory latencies. GF104 is slightly worse: as its issue rate can be higher, one runs more often into the situation where a memory access is pending and no arithmetic instructions are left for issue to do something useful.
The same is true for GK104, where up to 8 instructions from 4 warps can be issued per cycle. That means there are again only 8 work-groups each scheduler can choose from, and it is quite likely to run out of ready wavefronts. I have no idea how the arithmetic latencies compare to Fermi's; they probably changed.
GCN issue rates are slightly harder to compare. Generally a CU can schedule only 1 vector ALU (vALU) instruction per cycle (plus 1 scalar + 0.5 local memory access + 0.25 vector memory access + 0.25 export + 1 branch + 1 internal instruction per cycle); in fact each of the 4 schedulers can issue one vALU instruction every four cycles. That means there are 16 work-groups available for latency hiding (which need at least 16 cycles to schedule one instruction from each), significantly more than Fermi and also Kepler have at their disposal.
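The same comparison as a sketch, covering all three cases above. The work-group counts are the heavy-thread numbers from before; the per-scheduler issue intervals are my reading of the pipelines (GK104's dual-issue ports ignored for simplicity), so treat the exact figures with some care:

```python
# How many heavy-thread work-groups each scheduler can rotate through,
# and how many cycles one full rotation takes (one instruction issued
# from each work-group in the scheduler's queue).

archs = {
    # name: (work-groups in flight per core, schedulers,
    #        cycles between issues from the same work-group's slot)
    "GF100":  (16, 2, 1),  # 1 instruction per scheduler per base clock
    "GK104":  (32, 4, 1),  # ignoring the second issue port per scheduler
    "GCN CU": (16, 4, 4),  # 1 vALU instruction per SIMD every 4 cycles
}

for name, (groups, scheds, interval) in archs.items():
    per_sched = groups // scheds
    rotation = per_sched * interval
    print(f"{name}: {per_sched} work-groups/scheduler, "
          f"{rotation} cycles per full rotation")
```

The point of the exercise: GF100 and GK104 both end up with 8 work-groups per scheduler and an 8-cycle rotation, shorter than the ~10 base clock cycles of Fermi's arithmetic latency, while a GCN CU rotates through its 16 work-groups over 16 cycles and so has much more slack for hiding latencies.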