It would be interesting if some of the real game developers here gave some insight into this, but I thought the era of coding even small sections of code in ASM was long gone.
To tap into the potential FLOPS of a current CPU you still need to write explicit SIMD code (e.g. SSE).
You would not want to do this in assembler but with intrinsics, which are very close to assembler:
they only abstract away the registers, while still allowing normal C++ variables, flow control, etc.
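
To give an idea of what that looks like, here is a minimal sketch of a multiply-add over float arrays using SSE intrinsics. The function name `mul_add` and its signature are made up for illustration, and it assumes an x86/x86-64 compiler that ships `<xmmintrin.h>`. Each `_mm_*` call corresponds closely to one SSE instruction, but the `__m128` values are ordinary C++ variables and the compiler decides which XMM registers they live in.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86/x86-64 only)

// Hypothetical helper: out[i] = a[i] * b[i] + c[i] for n floats.
void mul_add(const float* a, const float* b, const float* c,
             float* out, std::size_t n) {
    std::size_t i = 0;
    // Main loop: process four floats per iteration.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        __m128 r  = _mm_add_ps(_mm_mul_ps(va, vb), vc);  // 4 multiplies, 4 adds
        _mm_storeu_ps(out + i, r);        // unaligned store of 4 floats
    }
    // Scalar tail for the remaining 0 to 3 elements.
    for (; i < n; ++i) {
        out[i] = a[i] * b[i] + c[i];
    }
}
```

The unaligned load/store variants are used here so the sketch works on any pointer; if you can guarantee 16-byte alignment, `_mm_load_ps` and `_mm_store_ps` are the aligned counterparts.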