Tridan created some fascinating alternative SMX/SMM diagrams at hardware.fr in the GTX 750 Ti review.
I'm really happy with their detail, since it helps me understand a lot about actual throughput and limitations. They cleared up a giant misconception I had about Kepler's SMX design: I had always envisioned all four schedulers feeding through a massive crossbar, with any one of the 8 dispatchers able to multiplex its registers into any of the lanes of SPs or SFUs or LD/ST. Tridan's diagram shows that it's much simpler and more limited than that: each scheduler "owns" a set of SPs, and each pair of schedulers shares one extra set of SPs between them. (Tridan, again THANKS for the diagram! I keep studying it!)
Looking at the GK104 SMX layout, though, I have some questions I can't resolve.
I am guessing that each of the 4 register files can issue three registers (3 columns of 32, really) per clock. Those get distributed to the SPs as needed, with the operand collector handling the routing, including a buffer that lets it accumulate registers over 2 clocks. So with a kernel running nothing but FP32 adds, each of the four schedulers would continually issue two registers to its "own" set of SPs. The extra register would be accumulated by the operand collector, and every other clock it could be sent over to the "shared" set of SPs. The partner scheduler would do the same on the alternate clocks, so it all works out elegantly, keeping every SP fed every clock. This would give a throughput of 192 FP32 ops per clock.
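(If I wanted to sanity-check that 192 number on real silicon, this is roughly the microbenchmark I'd write. It's my own sketch, not anything from the review, and it converts time to clocks using the nominal clock from the device properties, so treat the output as approximate.)

// fp32_add_throughput.cu -- my own rough sketch, not from the review.
// Times a kernel that should be almost nothing but independent FP32 adds,
// then converts the runtime into adds-per-clock-per-SMX using the nominal
// clock, so the result is only approximate (boost clocks, loop overhead, etc).
#include <cstdio>
#include <cuda_runtime.h>

#define ITERS 4096         // loop trips per thread
#define ADDS_PER_ITER 8    // one add per accumulator per trip

__global__ void fp32_add_kernel(float *out, float x)
{
    // Eight independent accumulators so every clock has non-dependent adds
    // available to issue; seeding from threadIdx keeps the compiler from
    // folding everything into a constant.
    float a0 = threadIdx.x, a1 = a0 + 1.f, a2 = a0 + 2.f, a3 = a0 + 3.f;
    float a4 = a0 + 4.f,    a5 = a0 + 5.f, a6 = a0 + 6.f, a7 = a0 + 7.f;
    #pragma unroll 16      // keep loop overhead small next to the FADDs
    for (int i = 0; i < ITERS; ++i) {
        a0 += x; a1 += x; a2 += x; a3 += x;
        a4 += x; a5 += x; a6 += x; a7 += x;
    }
    // Fold everything into one store so nothing is dead-code eliminated.
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7;
}

int main()
{
    const int blocks = 1024, threads = 256;   // plenty of warps per SMX
    float *out;
    cudaMalloc(&out, blocks * threads * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    fp32_add_kernel<<<blocks, threads>>>(out, 1.0f);   // warm-up
    cudaEventRecord(t0);
    fp32_add_kernel<<<blocks, threads>>>(out, 1.0f);   // timed run
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    double adds   = (double)blocks * threads * ITERS * ADDS_PER_ITER;
    double clocks = (ms * 1e-3) * (prop.clockRate * 1e3);   // clockRate is in kHz
    printf("~%.1f FP32 adds per clock per SMX (expecting something near 192 on GK104)\n",
           adds / clocks / prop.multiProcessorCount);

    cudaFree(out);
    return 0;
}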
NVidia claims that FP32 FMAs also have a throughput of 192, but nobody has ever been able to craft code that actually performs that well. The explanation is simple: with only 3 register columns per clock per scheduler, there's not enough bandwidth to feed THREE arguments into every SP each clock, only two. So an FP32 add or FP32 mul has a throughput of 192, but an FP32 FMA has a throughput of 128. Tridan's diagram makes it clear why.
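Writing that operand-bandwidth arithmetic out explicitly, under my assumption of 3 register columns per scheduler per clock (these counts are my guesses from the diagram, not anything NVidia has confirmed):

// Back-of-the-envelope operand bandwidth, assuming 3 register columns
// (of 32 registers each) per scheduler per clock -- my guess, not a spec.
#include <cstdio>

int main()
{
    const int schedulers        = 4;                                    // per SMX
    const int columns_per_clock = 3;                                    // guessed
    const int lanes_per_column  = 32;
    const int operands_per_clk  = columns_per_clock * lanes_per_column; // 96 per scheduler

    printf("FP32 add (2 operands): %d ops/clock per SMX\n",
           schedulers * operands_per_clk / 2);                          // 4 * 96 / 2 = 192
    printf("FP32 FMA (3 operands): %d ops/clock per SMX\n",
           schedulers * operands_per_clk / 3);                          // 4 * 96 / 3 = 128
    return 0;
}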
We can see NVidia's official table of operation throughputs in the programming guide. Looking at the compute capability 3.0 column, many of the throughputs have a logical explanation from the SMX figure; sin/cos/log have a throughput of 32 because there are 32 SFU units, for example.
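One caveat worth adding (my understanding, not something from the review): that 32/clock row is about the fast intrinsics that actually hit the SFUs, while the precise sinf()/cosf()/logf() expand into long software sequences and won't get near that rate. Something like this is what the table row is really describing:

// My understanding of which source-level ops the 32/clock SFU row covers:
// the fast intrinsics (__sinf, __cosf, __log2f, rsqrtf, ...) compile to the
// hardware MUFU path, while precise sinf()/cosf()/logf() become long
// software sequences that never approach 32/clock.
__global__ void sfu_demo(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(in[i]) + __log2f(in[i] + 2.0f);   // SFU fast paths
}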
But the real puzzle from these throughput values is why many operations, like integer adds, have a throughput of 160 operations/clock. This confuses me so much! What would cause this design to have a throughput of 160, as opposed to 192 or perhaps 128? It's not register bandwidth (you can do FP32 adds at 192). It's not that one set of 32 SPs is "special" and limited, because every scheduler only connects to two sets of SPs; if one set were special, you'd get a throughput of 128, not 160, since there are two pairs of schedulers and you'd therefore lose two sets.
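In case anyone wants to poke at this on hardware, here's an integer version of the add kernel from my FP32 sketch above (reusing the same ITERS and timing harness). Again it's my own code, and the compiler likes rewriting integer add chains, so the SASS would need checking before trusting the number it produces:

// Integer-add version of the kernel from the FP32 sketch above; drop it
// into the same timing harness with the same ITERS/ADDS_PER_ITER bookkeeping.
// Worth dumping the SASS (cuobjdump -sass) to confirm these stay plain
// IADDs and don't get rewritten into something else by the compiler.
__global__ void s32_add_kernel(int *out, int x)
{
    int a0 = threadIdx.x, a1 = a0 + 1, a2 = a0 + 2, a3 = a0 + 3;
    int a4 = a0 + 4,      a5 = a0 + 5, a6 = a0 + 6, a7 = a0 + 7;
    #pragma unroll 16
    for (int i = 0; i < ITERS; ++i) {
        a0 += x; a1 += x; a2 += x; a3 += x;
        a4 += x; a5 += x; a6 += x; a7 += x;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7;
}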
So my rambling question is: what architectural limit would give this weird 160 ops/clock throughput, rather than 192 or 128? Or maybe the obvious answer is that NVidia's table is wrong?