You can extract more than 128 MULs/clock from the SPA under both D3D and OpenGL, provided you're mindful of the driver and of how you issue your shader. Of course, you shouldn't have to be mindful of any of that, and we've not seen it hit the peak at all. Also, if I'm remembering rightly, the CUDA specs list a peak FLOP rating rather than any explicit instruction throughput, and it should be possible to extract the MUL in CUDA too, even if not with the current build and supporting driver.
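
To make the difference between a peak FLOP rating and per-instruction throughput concrete, here's a rough back-of-the-envelope sketch. The 128 units come from the figure above; the 1.35 GHz shader clock and the idea that each unit can issue one MAD plus one extra MUL per clock are assumptions for illustration only, and `peak_gflops` is a made-up helper name, not anything from the CUDA docs.

```python
def peak_gflops(units, clock_ghz, flops_per_unit_per_clock):
    """Peak GFLOPS = units * clock (GHz) * flops issued per unit per clock."""
    return units * clock_ghz * flops_per_unit_per_clock


UNITS = 128        # scalar units, matching the 128 MULs/clock baseline above
CLOCK_GHZ = 1.35   # ASSUMPTION: illustrative shader clock, not from the post

MAD_FLOPS = 2      # a multiply-add counts as two floating-point operations
MUL_FLOPS = 1      # ASSUMPTION: one extra MUL co-issued per unit per clock

# Rating if only the MAD is counted.
mad_only = peak_gflops(UNITS, CLOCK_GHZ, MAD_FLOPS)

# Rating if the extra MUL is counted alongside the MAD.
mad_plus_mul = peak_gflops(UNITS, CLOCK_GHZ, MAD_FLOPS + MUL_FLOPS)

print(f"MAD only:  {mad_only:.1f} GFLOPS")
print(f"MAD + MUL: {mad_plus_mul:.1f} GFLOPS")
```

Under those assumptions the extra MUL accounts for a third of the headline number, which is why a spec that only quotes peak FLOPs tells you little about whether that MUL is actually reachable from your code.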