Time to crush the half-rate 32-bit integer nonsense. According to the CUDA 3.0 Programming Guide, Section 5.4.1, Table 5.1, 32-bit integer multiplication has the same throughput as floating-point multiplication on compute capability 2.0 devices, i.e. Fermi. The half-rate claim is dead wrong.
As for 24-bit integer multiplication, the reason it's slower is probably that there's no dedicated instruction for it on 2.0 hardware (after all, if you have full-speed 32-bit, there's no real reason to waste opcode space on 24-bit). That means it has to be done in software, using bitmasking to ensure correctness. This adds at least one additional instruction, which makes 24-bit slower on 2.0 hardware.
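To make the masking argument concrete, here is a rough host-side C++ sketch of what a software 24-bit multiply has to do. It is only a model, loosely based on the documented behavior of the unsigned intrinsic `__umul24` (use the low 24 bits of each operand, keep the low 32 bits of the product). The names `umul24_emulated`, `mul32`, `kLow24` and `verify_examples` are made up for illustration, and this is not a claim about the instruction sequence the compiler actually emits on Fermi.

```cpp
#include <cassert>
#include <cstdint>

// Mask selecting the low 24 bits of a 32-bit word (illustrative name).
constexpr std::uint32_t kLow24 = 0x00FFFFFFu;

// Hypothetical model of an unsigned 24-bit multiply done in software:
// only the low 24 bits of each operand take part, and the low 32 bits
// of the product are returned (unsigned arithmetic wraps modulo 2^32).
constexpr std::uint32_t umul24_emulated(std::uint32_t a, std::uint32_t b) {
    // The two masks are the extra work compared with a plain multiply.
    return (a & kLow24) * (b & kLow24);
}

// Plain 32-bit multiply for comparison: a single operation, no masking.
constexpr std::uint32_t mul32(std::uint32_t a, std::uint32_t b) {
    return a * b;
}

// Small runtime check of the behavior described in the comments above.
inline void verify_examples() {
    // Operands that fit in 24 bits: both versions agree.
    assert(umul24_emulated(1000u, 2000u) == mul32(1000u, 2000u));
    // Bit 24 is set in the first operand: the 24-bit version ignores it.
    assert(umul24_emulated(0x01000005u, 3u) == 15u);
    assert(mul32(0x01000005u, 3u) == 0x0300000Fu);
}
```

The point is just the operation count: the emulated version needs the multiply plus the masking, while the 32-bit version is the multiply alone. If 32-bit multiply is already full rate, the 24-bit path can only tie or lose.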