Ran it on an SPU (after fixing some bugs first), and it is actually slower than the original. It's longer by 4 instructions, and it introduces the following instructions into the pipeline: 2x ceqi (compare equal) and 2x selb (select bits).
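For anyone not familiar with that pair, here is a rough scalar sketch in plain C of what a ceqi followed by a selb computes. This is not the code I ran and not SPU intrinsics; `ceqi_emul` and `selb_emul` are made-up names that only mimic one 32-bit lane of the instructions, assuming ceqi yields an all-ones mask on equality and selb takes bits from its second operand wherever the mask bit is set.

```c
#include <stdint.h>

/* Hypothetical stand-in for one 32-bit lane of ceqi:
   returns 0xFFFFFFFF if a equals the immediate, otherwise 0. */
static uint32_t ceqi_emul(int32_t a, int32_t imm) {
    /* (a == imm) is 0 or 1; negating as unsigned gives 0 or all ones. */
    return (uint32_t)0 - (uint32_t)(a == imm);
}

/* Hypothetical stand-in for one 32-bit lane of selb:
   each result bit comes from b where the mask bit is 1, from a where it is 0. */
static uint32_t selb_emul(uint32_t a, uint32_t b, uint32_t mask) {
    return (a & ~mask) | (b & mask);
}
```

So `selb_emul(x, y, ceqi_emul(v, 5))` is a branch-free way of writing `v == 5 ? y : x`. It trades a branch for a compare plus a select, which is where the extra instructions come from.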
Heh. Good thing I didn't say it should be faster