I work with Joker and have had over 20 years of experience in the games industry (from the 8-bit home computers all the way up to the PS3/360).
His comments on the PS3 are fairly close to the truth. The fact that we have the SPU cores to help make up for the inadequacies of the vertex pipeline on the PS3 doesn't change the fact that, unaided, the Xenos has much more raw vertex processing power available to it.
Bear in mind that triangle setup and the post-vertex-transform cache are relatively slow (even when you optimize your data for them). Unless you offset that by utilizing more complex pixel shaders (and the case he's talking about uses fairly simple ones), you're going to be bottlenecked by the vertex pipe. I have, in fact, rewritten some of our more complex 360 shaders to move some of the burden from the vertex pipe to the pixel pipe (which would seem to be backward thinking!).
Utilizing one of the SPUs to precull your geometry means passing less data up the pipe to the RSX, and therefore fewer vertices for it to process (which can only be a good thing).
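To make the precull idea concrete, here's a scalar C++ sketch of a CPU-side backface cull that walks an index list and emits a reduced one. On a real SPU job this would be vectorized and double-buffered through DMA; the function and type names here are my own invention, not from any actual engine:

```cpp
#include <vector>
#include <cstddef>

// Minimal vector math for the sketch.
struct Vec3 { float x, y, z; };
static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Keep only triangles whose face normal points toward the camera
// (assumes counter-clockwise winding for front faces). The reduced
// index list is what you'd hand to the GPU instead of the full one.
std::vector<unsigned short> precullBackfaces(
    const std::vector<Vec3>& verts,
    const std::vector<unsigned short>& indices,
    Vec3 cameraPos)
{
    std::vector<unsigned short> out;
    out.reserve(indices.size());
    for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
        Vec3 a = verts[indices[i]];
        Vec3 b = verts[indices[i + 1]];
        Vec3 c = verts[indices[i + 2]];
        Vec3 n = cross(sub(b, a), sub(c, a));   // face normal
        if (dot(n, sub(cameraPos, a)) > 0.0f) { // facing the camera: keep it
            out.push_back(indices[i]);
            out.push_back(indices[i + 1]);
            out.push_back(indices[i + 2]);
        }
    }
    return out;
}
```

For a closed mesh this rejects roughly half the triangles before the GPU ever sees them, which is exactly the saving being described.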
Another thing to note is that if you design this precull code properly, and assuming you have the space to store the results, you can effectively pretransform all your geometry too, reducing the instruction count in your vertex shaders even further.
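The pretransform part amounts to baking the world (or bone) transform into the vertex data once per frame, so the vertex shader is left with just a single view-projection multiply. A rough sketch, with names of my own choosing:

```cpp
#include <vector>
#include <cstddef>

struct Vec3 { float x, y, z; };

// Row-major 3x4 affine transform (rotation + translation), the kind of
// thing an engine might store per object or per bone.
struct Mat34 { float m[12]; };

static Vec3 transformPoint(const Mat34& t, Vec3 v) {
    return {
        t.m[0] * v.x + t.m[1] * v.y + t.m[2]  * v.z + t.m[3],
        t.m[4] * v.x + t.m[5] * v.y + t.m[6]  * v.z + t.m[7],
        t.m[8] * v.x + t.m[9] * v.y + t.m[10] * v.z + t.m[11],
    };
}

// Pretransform every vertex into world space on the CPU/SPU side. The
// vertex shader then skips the world multiply entirely, trimming its
// instruction count.
void pretransform(const Mat34& world,
                  const std::vector<Vec3>& in,
                  std::vector<Vec3>& out)
{
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = transformPoint(world, in[i]);
}
```

The cost is the extra storage for the transformed copy, which is why this only pays off if you have the space to keep it around.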
One example of this would be a cloth simulation, where you have to pre-skin your data so that (among other things) the collision objects are in the correct orientation. Why, at this stage, shouldn't we fold the pretransform *and* the rejection into the cloth transform code? If you have a ~5000-poly cloth sim model and you precull all the back faces (so you're not sending half the triangles up just to be rejected by the RSX), then cull all faces not currently inside the view frustum, the index list sent to the RSX can shrink anywhere from 45% to nearly 100%.
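The frustum-rejection half of that can be sketched the same way as the backface pass: a conservative test that drops a triangle only when all three of its vertices sit outside the same frustum plane. It can keep a few triangles that are actually invisible, but it never rejects a visible one, which is what you want from a precull. Names and the plane convention here are my own assumptions:

```cpp
#include <vector>
#include <cstddef>

struct Vec3 { float x, y, z; };
// Plane stored so that n.x*x + n.y*y + n.z*z + d >= 0 is the inside half-space.
struct Plane { Vec3 n; float d; };

static bool outside(const Plane& p, Vec3 v) {
    return p.n.x * v.x + p.n.y * v.y + p.n.z * v.z + p.d < 0.0f;
}

// Conservative triangle-vs-frustum rejection: a triangle is dropped only
// if all three vertices are outside one of the six frustum planes.
std::vector<unsigned short> precullFrustum(
    const std::vector<Vec3>& verts,
    const std::vector<unsigned short>& indices,
    const Plane frustum[6])
{
    std::vector<unsigned short> out;
    out.reserve(indices.size());
    for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
        bool rejected = false;
        for (int p = 0; p < 6 && !rejected; ++p) {
            rejected = outside(frustum[p], verts[indices[i]])
                    && outside(frustum[p], verts[indices[i + 1]])
                    && outside(frustum[p], verts[indices[i + 2]]);
        }
        if (!rejected)
            for (int k = 0; k < 3; ++k) out.push_back(indices[i + k]);
    }
    return out;
}
```

Run after the backface pass, this is where the remaining reduction comes from: with the cloth mostly off-screen, nearly the entire index list disappears before the RSX is involved.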