Btw., does anyone else have trouble reconciling the die shot with the official description of 6x32 ALUs, 32 L/S units, and 32 SFUs? Each "SMX" appears to have 8 physically separate register files aligned along the vector ALU lanes. Even if half of the register banks sit to the left of a vALU and the other half to the right, each SMX still consists of 4 identical, replicated subunits. That would somehow fit the 4 schedulers, but what is in each of them?
The only way (that I can think of right now) to distribute the units is for each dual-issue scheduler to deliver its instructions to a set of 3 vec16 ALUs, 8 L/S units and 8 SFUs. That basically means one SMX would be a package of four GF104-style SMs (somewhat reminiscent of G80/GT200) in which the hot clock and one scheduler got lost (and the local memory, TMUs and some other stuff are shared). The scheduler can issue two instructions from one thread each cycle and alternates each cycle between "even" and "odd" threads (the same would then be true for the register access; maybe that's why one can identify 8 vector register files in each SMX, with even and odd threads having separate register files). Or maybe a better picture: a scheduler issues up to 4 instructions from two threads every two clock cycles. Or the scheduler issues a single instruction from each of two threads every cycle (and the vecALU that received the instruction is blocked in the next cycle, because a vALU with 16 lanes can accept an instruction for a 32-element warp only every second cycle). The last version would basically work like the two single-issue schedulers in a GF100/110 SM, except that the schedulers run at the same clock as the ALUs and can therefore supply more of them.
Does anyone have a clever idea how this really works?
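To sanity-check the guess above, here is a small back-of-the-envelope sketch in Python. It only multiplies out the speculated partition (4 schedulers, each feeding 3 vec16 ALUs, 8 L/S units and 8 SFUs) and compares it with the official per-SMX numbers. All names are made up for illustration, and the partition itself is my assumption, not something NVIDIA has confirmed.

```python
# Hypothetical per-scheduler partition of one SMX (speculation from the post).
SCHEDULERS_PER_SMX = 4
VEC16_ALUS_PER_SCHEDULER = 3
LANES_PER_VALU = 16
LS_UNITS_PER_SCHEDULER = 8
SFUS_PER_SCHEDULER = 8
WARP_SIZE = 32


def smx_totals():
    """Multiply the assumed per-scheduler units out to one whole SMX."""
    return {
        "alu_lanes": SCHEDULERS_PER_SMX * VEC16_ALUS_PER_SCHEDULER * LANES_PER_VALU,
        "ls_units": SCHEDULERS_PER_SMX * LS_UNITS_PER_SCHEDULER,
        "sfus": SCHEDULERS_PER_SMX * SFUS_PER_SCHEDULER,
    }


def cycles_per_warp_instruction(lanes=LANES_PER_VALU, warp=WARP_SIZE):
    """Cycles a vALU with `lanes` lanes is busy with one warp instruction."""
    return warp // lanes


def valu_issue_rate_needed():
    """Warp instructions per cycle a scheduler must issue to keep its vALUs busy."""
    return VEC16_ALUS_PER_SCHEDULER / cycles_per_warp_instruction()


totals = smx_totals()
print(totals)                         # compare with the official 6x32 / 32 / 32
print(cycles_per_warp_instruction())  # cycles one vec16 ALU is blocked per warp
print(valu_issue_rate_needed())       # must not exceed 2 for a dual-issue scheduler
```

Under these assumptions the totals match the official figures (192 = 6x32 ALU lanes, 32 L/S, 32 SFUs), and a dual-issue scheduler has enough issue bandwidth for its three vec16 ALUs. That does not prove the layout, it only shows the numbers are consistent.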
PS:
Unless those slides contain a mistake similar to the one in the Fermi presentation, the total register space is the same as on GF100/GF110 (2 MB), which is really tiny compared to Tahiti (8 MB). I have a hard time believing that number, considering how similar the ALU counts of GK104 and Tahiti are. I would expect double the value given on that slide (4 MB), i.e. 512 kB per SMX or 128 kB per scheduler.
The local memory/L1 is also quite small (still 64 kB), considering how many threads/workgroups on one SMX have to share it.
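The register numbers in the PS can be broken down the same way. The sketch below assumes 8 SMX per GK104 (implied by 4 MB total = 512 kB per SMX) and 4 schedulers per SMX; the function name is hypothetical, and the 4 MB case is only my expectation, not a published figure.

```python
SMX_PER_CHIP = 8        # assumption: implied by 4 MB total / 512 kB per SMX
SCHEDULERS_PER_SMX = 4  # from the post


def register_breakdown(total_mib):
    """Split a chip-wide register file size (MiB) into kB per SMX and per scheduler."""
    total_kib = total_mib * 1024
    per_smx = total_kib // SMX_PER_CHIP
    per_scheduler = per_smx // SCHEDULERS_PER_SMX
    return per_smx, per_scheduler


slide_value = register_breakdown(2)  # the number given on the slide
expected = register_breakdown(4)     # the doubled value I would expect
tahiti_ratio = 8 / 2                 # Tahiti (8 MB) vs. the slide value (2 MB)

print(slide_value, expected, tahiti_ratio)
```

So the slide value would mean 256 kB per SMX and only 64 kB per scheduler, while the doubled value gives the 512 kB and 128 kB mentioned above.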