I guess you could have something like 4 GPCs, 4 SMs per GPC, 32-wide SMs, and 6 TMUs per SM. That keeps the ALU-to-TMU ratio close to 6:1, as in the GF104 design. Not that I'm convinced or anything, but why would the warp width have to be divisible by the TMU count, given that texturing is decoupled from computation? Also, I assume the OP means 1536 threads per SM, which matches GF100 exactly. 1536 threads for the whole chip cannot be right, and 1536 warps for the whole chip would give 96 threads per ALU (assuming the warp width is still 32, which I think is a safe bet, because changing it is liable to hose the performance of a lot of CUDA code). That sounds high.
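For what it's worth, a quick back-of-the-envelope check of those numbers (all the per-SM figures are just the hypothetical layout above, nothing confirmed):

# Hypothetical layout: 4 GPCs x 4 SMs per GPC, 32-wide SMs, 6 TMUs per SM
GPCS = 4
SMS_PER_GPC = 4
ALUS_PER_SM = 32           # "32-wide" SM
TMUS_PER_SM = 6
WARP_WIDTH = 32            # assuming warp width stays 32

total_alus = GPCS * SMS_PER_GPC * ALUS_PER_SM        # 16 SMs * 32 = 512 ALUs
alu_tmu_ratio = ALUS_PER_SM / TMUS_PER_SM            # ~5.3:1, near GF104's 48:8 = 6:1

# Reading 1: 1536 threads per SM (same as GF100)
threads_per_alu_per_sm = 1536 / ALUS_PER_SM          # 48 threads per ALU

# Reading 2: 1536 warps for the whole chip
threads_per_alu_chip = 1536 * WARP_WIDTH / total_alus  # 49152 / 512 = 96 threads per ALU

print(alu_tmu_ratio, threads_per_alu_per_sm, threads_per_alu_chip)

So the 1536-threads-per-SM reading gives a perfectly ordinary 48 threads per ALU, while the 1536-warps-per-chip reading gives the 96 figure, which is why it sounds high.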