So if I understand correctly from their public claims:
- 32-wide warps being scheduled on a 16-wide ALU... (G80 and Rogue say hi!)
- This allows them to decode 2 instructions in the time the main FP32 pipeline executes 1, so they can run some other instructions "for free" (G80 and Rogue say hi again!)
- Register file *might* be over-specced: it *looks* like it's still 32-wide despite the ALU being 16-wide, which would put far fewer restrictions on those 2 instructions than there were on G80
- Thanks to the above, you can run FP32 and INT32 in parallel - and maybe FP32 and FP64 in parallel? Or FP32 and Tensor Cores in parallel?
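To make the FP32 + INT32 co-issue point concrete, here's a toy CUDA kernel (my own sketch, nothing from NVIDIA's material): the INT32 index arithmetic and the FP32 FMAs are independent of each other, which is exactly the kind of mix that separate INT32 and FP32 pipelines could overlap, assuming the scheduler really can dual-issue them:

```cuda
// Toy example (hypothetical, not from any NVIDIA material): the INT32 index
// arithmetic and the FP32 FMA are independent, so a scheduler that can co-issue
// to separate INT32 and FP32 pipelines could overlap them.
// Buffers are assumed to hold at least n * stride elements.
__global__ void saxpy_strided(int n, int stride, float a,
                              const float *x, const float *y, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;        // INT32 pipeline work
    for (int i = tid; i < n; i += gridDim.x * blockDim.x) { // grid-stride loop
        int idx = i * stride;                               // INT32 MUL + ADD
        out[idx] = a * x[idx] + y[idx];                     // FP32 FMA
    }
}
```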
Alternatively, maybe the register file isn't 2x the ALU width, and they rely on their "register cache" (and/or extra register banks) to feed multiple pipelines in parallel?
The one thing I'm most surprised by is that they have *full-speed* INT32; presumably that means full-speed INT32 MULs, not just INT32 ADDs? If so, that's quite expensive... more expensive than the Vec2 INT16/Vec4 INT8 they had on GP102/GP104. I wonder why: I can't think of any workload that needs it, and the only benefit I can think of is simplifying the scheduler a bit. Are they reusing the INT32 units in some clever way, e.g. sharing them with FP64? There are 'interesting' unusual ways you could share some of the INT logic with other pipelines (rather than just over-speccing the FP32 mantissa multiplier and clock gating the extra bits when not doing INT32), but that wouldn't allow fully general co-issue with all the other pipelines, which is what they're implying.
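As a crude back-of-the-envelope for "expensive" (my numbers, using partial-product count as a stand-in for multiplier area): the FP32 mantissa path needs a 24x24 multiplier, i.e. 24 x 24 = 576 partial products, while a full 32x32 INT32 multiplier is 32 x 32 = 1024, roughly 1.8x the array. That ignores the exponent/normalization hardware FP32 needs and INT32 doesn't, so it's only a rough proxy, but full-rate INT32 MUL clearly isn't free.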
Also, for their tensor core performance numbers, they're comparing to "FP16 input with FP32 compute" on Pascal; I'm going to guess that effectively uses the FP32 pipeline rather than the FP16 pipeline, so the 9x figure isn't quite as mind-blowing (but still impressive). They could have gotten a *lot* of performance simply by supporting an FP16 version of the Vec2 INT16 dot-product instruction they had on GP102/GP104 (INT16 accumulating to INT32 is good for inference, but not always good enough for training).
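For reference, this is the kind of instruction I mean: GP102/GP104 (sm_61) expose the 4-way INT8 dot product as the __dp4a intrinsic (plus 2-way __dp2a_lo/__dp2a_hi variants, which IIRC mix INT16 and INT8 operands). The FP16 analogue below is purely hypothetical, something they could have added but didn't; I've just emulated what it would compute using FP32 math:

```cuda
#include <cuda_fp16.h>

// Real Pascal (sm_61) intrinsic: 4-way INT8 dot product accumulating into INT32.
// __dp4a(a, b, c) treats a and b as four packed signed 8-bit values and returns
// a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w + c.
__device__ int int8_dot4(int packed_a, int packed_b, int acc)
{
    return __dp4a(packed_a, packed_b, acc);
}

// Hypothetical FP16 analogue (does NOT exist on Pascal): a 2-way half-precision
// dot product accumulating into FP32. Emulated here with FP32 FMAs purely to
// show the semantics I have in mind.
__device__ float fp16_dot2_hypothetical(__half2 a, __half2 b, float acc)
{
    float2 af = __half22float2(a);
    float2 bf = __half22float2(b);
    return fmaf(af.x, bf.x, fmaf(af.y, bf.y, acc));
}
```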
I'm also curious about the effective parallelism required to make use of the tensor cores. Each one is effectively a 4x4x4 matrix multiply, but according to their blog that's per-thread, so across a warp it becomes a 16x16x16 matrix multiply (based on a 32-wide warp I'd expect 16x16x8; not sure at what level the extra 2x happens). That's a *massive* amount of parallelism required for a single instruction, which is fine for convolutional networks, but it sounds like it might not work as well for e.g. recurrent networks, in which case you'd want to stick to the CUDA cores? The ideal scenario would be the scheduler being able to efficiently use the tensor cores and the CUDA cores in parallel for different warps on the same scheduler.
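Spelling out the counting behind that parenthetical (my arithmetic, not from their blog): a 16x16x16 matrix multiply is 16 x 16 x 16 = 4096 multiply-accumulates, while 32 threads each doing a 4x4x4 multiply only accounts for 32 x 64 = 2048 MACs, which is a 16x16x8 multiply's worth. So the quoted per-warp 16x16x16 op needs exactly 2x what the per-thread 4x4x4 description implies, and it's not obvious where that factor of 2 lives (two passes through the unit?).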
EDIT: Actually, if it's 16x16x16, that sounds like they might be running 4x4x4 matrix multiplies sequentially, so the tensor cores might be exposed behind descheduling data fences, with a long latency before results come back. If so, it seems likely that the FP32/FP16 CUDA cores and the Tensor Cores can work in parallel (but for workloads where you can use the Tensor Cores, it probably makes more sense to use only them, since they should be more power efficient).
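To illustrate what I mean by "descheduling data fences": purely my own hypothetical sketch, every type and function below is made up and stubbed out just so the shape compiles. The idea is that the warp issues the tensor op, keeps doing independent CUDA-core work, and only deschedules at the fence where the result is actually needed:

```cuda
#include <cuda_fp16.h>

// Entirely hypothetical programming-model sketch, NOT a real API. The "tensor op"
// is stubbed as a struct plus empty functions purely to show the
// issue -> overlap -> fence pattern.
struct tensor_mma_token { const __half *a, *b; };            // hypothetical handle

__device__ tensor_mma_token tensor_mma_16x16x16_issue(const __half *a, const __half *b)
{
    // Hypothetical: would kick off a warp-wide 16x16x16 MMA on the tensor core.
    return tensor_mma_token{a, b};
}

__device__ void tensor_mma_wait(tensor_mma_token t, float *c_out)
{
    // Hypothetical data fence: would deschedule the warp until the result is ready.
    (void)t; (void)c_out;
}

__global__ void overlap_sketch(const __half *a, const __half *b,
                               float *c, float *scratch)
{
    tensor_mma_token t = tensor_mma_16x16x16_issue(a, b);   // issue the tensor op

    // Independent FP32 work the CUDA cores could execute while the tensor op is
    // in flight, since nothing here depends on its result.
    scratch[threadIdx.x] = fmaf(scratch[threadIdx.x], 2.0f, 1.0f);

    tensor_mma_wait(t, c);                                   // fence: result needed past here
}
```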