Remove the PPU; it's too slow. Replace it with an OOOE core.
SPUs should be able to execute code that doesn't sit in the local store (yep, they need a proper I$). That would automatically increase the amount of data one can keep in the LS, and it would remove the ridiculous issue of debug code not fitting in the LS (which should never have been a problem in the first place).
The per-SPU DMA engine needs to be improved so that it can support async gather/scatter and atomic ops.
Add TMUs, and make SIMD vectors 8 or 16 wide, with automatic instruction replay to easily support larger vector widths when necessary.
Update the SPU ISA (it's so limited) and add HW multithreading to better hide the latencies of more complex instructions (a couple of HW threads per SPU would be just fine).
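
To make the "automatic instruction replay" idea a bit more concrete, here is a rough software sketch in C of what it could look like: one logical wide vector add is issued once and the narrower physical unit is simply run over it several times. Everything here is an assumption for illustration only (the 4-lane native width, the names `NATIVE_LANES`, `native_add` and `replayed_add`); it says nothing about how real hardware would implement it.

```c
#include <stddef.h>

/* Assumed physical SIMD width (hypothetical, for illustration only). */
#define NATIVE_LANES 4

/* One pass of the physical unit: adds NATIVE_LANES floats. */
static void native_add(const float *a, const float *b, float *out)
{
    for (size_t i = 0; i < NATIVE_LANES; ++i)
        out[i] = a[i] + b[i];
}

/*
 * Logical add over `width` lanes, where `width` is assumed to be a
 * multiple of NATIVE_LANES. The same operation is "replayed" once per
 * native-width slice. Returns the number of passes that were needed.
 */
static size_t replayed_add(const float *a, const float *b, float *out,
                           size_t width)
{
    size_t passes = 0;
    for (size_t lane = 0; lane < width; lane += NATIVE_LANES) {
        native_add(a + lane, b + lane, out + lane);
        ++passes;
    }
    return passes;
}
```

With this sketch, calling `replayed_add` on 16-element arrays takes four passes of the 4-lane unit and an 8-wide vector takes two, so the same code covers both widths without the programmer unrolling anything by hand.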