So what tool, one that has been proven in shipping titles, is there to take a modest subsystem (e.g. 3000-5000 lines of code, randomly accessing a modest amount of memory, say data structures of about 2 MB) and turn it into production-quality SPU code with the associated LS/DMA management?
Software cache?
Seriously, fetching a cacheline through DMA costs about the same as an L2 miss. I think there are even implementations available on DevNet.
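If you want a picture of the idea, here's a minimal sketch of a direct-mapped software cache, assuming the standard spu_mfcio.h MFC intrinsics; the layout and names are mine for illustration, not any particular DevNet implementation:

/* Illustrative direct-mapped software cache for the SPU. mfc_get and
   the tag-wait intrinsics are the standard spu_mfcio.h ones; the cache
   layout and names are made up for this example. */
#include <spu_mfcio.h>
#include <stdint.h>

#define LINE_SIZE 128                    /* one 128-byte line per fetch */
#define NUM_LINES 64                     /* 8 KB of local store */
#define LINE_MASK (LINE_SIZE - 1)
#define DMA_TAG   0

static uint8_t  cache_data[NUM_LINES][LINE_SIZE] __attribute__((aligned(128)));
static uint64_t cache_tag[NUM_LINES];    /* effective address of cached line;
                                            0 doubles as "empty" here, real
                                            code would track validity */

/* Return a local store pointer to the byte at main memory address ea,
   DMAing in the surrounding 128-byte line on a miss. */
static void *sw_cache_lookup(uint64_t ea)
{
    uint64_t line_ea = ea & ~(uint64_t)LINE_MASK;
    unsigned slot    = (unsigned)(line_ea / LINE_SIZE) % NUM_LINES;

    if (cache_tag[slot] != line_ea) {
        mfc_get(cache_data[slot], line_ea, LINE_SIZE, DMA_TAG, 0, 0);
        mfc_write_tag_mask(1 << DMA_TAG);
        mfc_read_tag_status_all();       /* stall: this is your "L2 miss" */
        cache_tag[slot] = line_ea;
    }
    return &cache_data[slot][ea & LINE_MASK];
}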
Some teams even code their PPU stuff so that it works on buffers, allowing them to move the code between PPU and SPU by replacing memcpys with DMAs.
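Roughly like this; SPU_BUILD and the function names are hypothetical, but the MFC calls are the standard spu_mfcio.h ones:

/* Sketch of the buffer-based style: the kernel only ever sees local
   buffers, so moving it between PPU and SPU is a matter of swapping
   this one wrapper. SPU_BUILD and the names are hypothetical. */
#include <stdint.h>
#include <string.h>

#ifdef SPU_BUILD
#include <spu_mfcio.h>
static void fetch(void *local, uint64_t ea, uint32_t size, uint32_t tag)
{
    mfc_get(local, ea, size, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();           /* block until the DMA lands */
}
#else
static void fetch(void *local, uint64_t ea, uint32_t size, uint32_t tag)
{
    (void)tag;                           /* no DMA tags on the PPU */
    memcpy(local, (const void *)(uintptr_t)ea, size);
}
#endif

/* The actual work never touches main memory directly. */
static void scale_positions(float *pos, unsigned count, float s)
{
    for (unsigned i = 0; i < count; ++i)
        pos[i] *= s;
}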
But I think the issue is one of perception (and sorry, Assen, for singling you out like that, but I think it's a common one). You cannot expect to take a piece of code with less-than-ideal performance characteristics and have it turned into high-performance SPU code automagically.
But then again, do you really need to? The SPUs are fast, and there are many of them. So even if your less-than-ideal-for-PPU code ends up being less-than-ideal-for-SPU code, so what? It's off your PPU, and you probably weren't using that SPU anyway.
Now if you really want optimized code, then optimizing on the SPU is a lot easier than on the PPU. Actually, with the tools at hand, it's hilarious how easy it is to get close to peak performance. Heck, I have performance-critical assembly code that literally utilizes the SPE at 100%. And I'm not some sort of superhuman assembly ninja.
From my own experience, the main issue is actually getting started with SPU development. It will take you a month or two to get all the concepts in your head and to be comfortable with the ideas involved. When you're neck-deep in development, no senior programmer has that much free time, so the SPUs stay unexplored. This is actually a serious problem, because if someone on the team thinks "Hey, maybe this could go onto an SPU!", there will be no local expert to help him set it up. So the barrier to entry remains. (*)
(As I understand it, that's the reasoning behind Insomniac's SPU Shader approach.)
Once you actually do work with the SPUs, and once you actually do have the "Data first!" mindset, it's not that hard. Certainly not 5x harder than single-threaded. It isn't really harder than writing good multi-threaded code, in my experience.
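To make the "Data first!" point concrete, here's the classic before/after, per-object data versus packed streams (purely illustrative):

/* "Data first": per-object layout vs. packed streams. Illustrative only. */

/* Object-at-a-time: updating these on an SPU means lots of small,
   scattered DMAs, one per particle. */
struct particle {
    float pos_x, pos_y, pos_z;
    float vel_x, vel_y, vel_z;
    /* ...plus whatever cold data the gameplay code hung on it... */
};

/* Stream layout: each field is a packed array, so a whole batch comes
   over in a handful of big DMAs and the loop vectorizes trivially. */
struct particle_batch {
    float *pos_x, *pos_y, *pos_z;
    float *vel_x, *vel_y, *vel_z;
    unsigned count;
};

static void integrate(struct particle_batch *b, float dt)
{
    for (unsigned i = 0; i < b->count; ++i) {
        b->pos_x[i] += b->vel_x[i] * dt;
        b->pos_y[i] += b->vel_y[i] * dt;
        b->pos_z[i] += b->vel_z[i] * dt;
    }
}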
Sure, the profilers and debuggers have gotten better since the initial "horrible tools" phase, but the core difficulty of moving code to an array of SPUs without access to shared RAM is still there, and there's no replacement for the old-fashioned blood, sweat and tears involved.
Absolutely true. If you are as drowned in work as most people are, then re-engineering an existing system is hard to justify. But maybe you don't need to, and hopefully the next design will be better for manycore.
Personally, I postponed all SPU work for basically a year on Sacred2, simply because I had no time at all to even think about it. Then my first SPU program was a painful mess while I struggled with the programming model (protip: do not spend days minimizing local store access; local store is about as fast as an L1 cache, it's the DMAs that cost you. Duh.).
By the time I had finished my second program, I was hooked.
(* Well, you can always ask on DevNet.)
Disclaimer: I work for Sony and I write SPU code all day. And I like it.