Chalnot, I am sorry we are misunderstanding each other, I was not comparing EE's VUs to the Vertex Shaders.
I was treating them as a super-set of them, a set that also includes PPP functionality.
As far as our discussion goes, for the tool designers (compilers, libraries, etc.) optimizing for the VUs is not incredibly different from optimizing for a PPP + Vertex Shaders.
The VUs might be called slow, but even DirectX 8 Vertex Shaders take 4 cycles to do a basic Transform with perspective divide, and the VUs take only 7 (VU1 can be sped up to 5 using the additional FDIV present in the EFU).
That is not incredibly slow.
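To make it concrete, this is the per-vertex work those cycle counts refer to: four dot products for the 4x4 matrix transform, then a divide by w. A minimal sketch in Python (the function name and layout are my own, purely for illustration):

```python
# Sketch of the per-vertex work being counted above: a 4x4 matrix
# transform followed by a perspective divide. On a VU the four dot
# products map to MUL/MADD-style ops and the divide to FDIV (or to
# the EFU's extra divider on VU1).
def transform_vertex(m, v):
    # m: 4x4 row-major matrix, v: (x, y, z, w) vertex
    x, y, z, w = (sum(m[r][c] * v[c] for c in range(4)) for r in range(4))
    # perspective divide: project into normalized device coordinates
    return (x / w, y / w, z / w)
```

Every vertex goes through exactly this sequence, which is why the matrix transform and the divide dominate the cycle budget on both architectures.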
If you had to write code for the PPP and the Vertex Shaders in an ASM-level language for both, rather than your neat HLSL, you would feel pain similar to what VU coders do: it might be tough, but it is possible.
Better tools are coming to help you code for those Vector Units, and I do not think the technology on the software side has stopped at the point it is now either.
The worst thing for PlayStation 2 programmers seems to be the efficiency of the R5900i's memory accesses, due to the small L1 cache and the lack of an L2 cache.
If you had two VU1s in parallel and the bandwidth to feed them both, I do not think developers would have tons of problems splitting the T&L work between them.
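The reason the split is easy is that once vertices reach the T&L stage they can be processed independently, so the vertex list can simply be partitioned. A minimal sketch of the idea, with hypothetical names and the units simulated in sequence:

```python
# Hedged sketch of splitting T&L work across parallel units: each
# vertex is independent at this stage, so the list is cut into one
# batch per unit and the transformed results are concatenated back
# in order. Here the "units" run one after another for illustration.
def split_tnl(vertices, transform, units=2):
    chunk = (len(vertices) + units - 1) // units  # ceiling division
    batches = [vertices[i:i + chunk] for i in range(0, len(vertices), chunk)]
    # each batch would be uploaded to its own VU1; the per-vertex
    # transform never needs to see any other batch
    return [transform(v) for batch in batches for v in batch]
```

Since no batch depends on another, the only real cost of the split is feeding both units with data, which is why I stressed bandwidth above.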
VU0 was not primarily intended for T&L jobs; that was VU1's work, and that is the way it has been used: a lot of early titles started using VU0 in micro-mode as if it were an SH-4, but then shifted the T&L code onto VU1, and a lot of them did not push VU0 much for a good while.
I agree with your point regarding the independence of each Vertex from the others allowing for easier parallelism, but I would hardly say the EE was built for General Purpose Processing rather than for multi-media number-crunching.
When PPPs come to the realm of PC GPUs you will have a similar situation, unless you leave the host CPU to do all that work for the VS and we keep the current scenario.
Whether we have those kinds of Vector Units on the CPU's chip or on the GPU's chip makes no real difference: one day PC GPUs will seek the same degree of programmability in the Geometry processing part of the pipeline; they are already looking into that.
In the end, there is one major difference that will always make a GPU more efficient, and that's simply that each vertex passed to the GPU is assumed to be independent of all other vertices. This independence makes parallelism almost trivial. The more general-purpose nature of the PS2 means that the hardware and software designers can't assume such independence, making it much more challenging to make use of all units.
Parallelism can still be exploited once Vertex data has passed beyond the reach of a PPP and has to be transformed and lit.
Also, we can still tile the screen, clip triangles and process them in tiles, assigning one or more tiles to each Vector Processor.
We can also exploit parallelism at the surface/object level: working around those problems can be done.
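To illustrate the tiling route: bin each triangle into the screen tiles its bounding box overlaps, then hand each tile's triangle list to a vector processor. This is only a sketch of the partitioning scheme; all names are hypothetical, and tiles are dealt out round-robin for simplicity:

```python
# Hedged sketch of screen tiling for parallel triangle processing:
# triangles are bucketed into a grid of square tiles by their
# screen-space bounding box, then the tiles are distributed across
# n_procs vector processors round-robin.
def bucket_triangles(triangles, screen_w, screen_h, tile, n_procs):
    cols = (screen_w + tile - 1) // tile
    rows = (screen_h + tile - 1) // tile
    buckets = {(tx, ty): [] for ty in range(rows) for tx in range(cols)}
    for tri in triangles:
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # clamp the bounding box to the tile grid
        x0, x1 = int(min(xs)) // tile, min(int(max(xs)) // tile, cols - 1)
        y0, y1 = int(min(ys)) // tile, min(int(max(ys)) // tile, rows - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                buckets[(tx, ty)].append(tri)
    # assign whole tiles to processors round-robin
    work = [[] for _ in range(n_procs)]
    for i, key in enumerate(sorted(buckets)):
        work[i % n_procs].extend(buckets[key])
    return work
```

A triangle spanning several tiles lands in each of them, so some clipping or duplicate work is the price paid for keeping the tiles fully independent of one another.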