2.) it "requires" lots of micro-management from programmers. Not that cache-based architectures allow you to completely forget about things such as cache size, data and code size, and access patterns, etc... you gain quite a bit of performance by paying attention to those elements, but they are more forgiving in that respect.
I think we need to be a bit more precise when using that "forgiving" word. I can feel my blood starting to boil here.
In my opinion and experience, cache-based UMA multiprocessors are NOT forgiving.
An easy programming model to port to, yes.
Easy to get high utilization of computational resources, hell no.
SMP UMA systems are great if you want to speed up multiple instances of legacy code, as in classical business server applications, or (less so) if you want some improvement when porting an application and doing some low-hanging-fruit picking in terms of threading.
But there are bottlenecks in (access to) shared resources, there are contention problems, and there is the problem that the programming model doesn't help you much in dealing with these issues; indeed, the goal is rather to abstract the underlying complexities away.
Coming from scientific computing, what I really liked about the BE was that it helps make the computational behaviour deterministic. There are separate memory pools belonging to each SPE that won't be stomped on by other processors or threads, there are robust mechanisms for transferring data between processors, and there are separate communication pathways for main memory and the coprocessors... neatly partitioned, and relatively predictable. It comes from my world. Yes, you have to structure your problem to fit the hardware to get good utilization, but if you do, predicting the results is relatively straightforward.
Compare this to, say, the Xbox 360 setup, where you can have six hardware threads all sharing the same cache, and if these threads evict each other's cached data (unpredictably, unless you lock cache lines by hand, and poof, there goes that ease of porting), it generates bus traffic over the same bus that handles not only main memory access and cache traffic but also all communication between CPU and GPU. And that main memory is also accessed by the GPU, which has its own needs and ideas in terms of memory access. This may be a straightforward platform to port to, but getting high utilization and ensuring consistent, predictable response in different situations is another matter entirely.
My experience with SMP UMA systems has been that if you want high utilization out of them, "forgiving" is simply not an appropriate adjective.
They are, and pardon my clear language, a fucking horrible mess, lacking not only the underlying hardware organization but often also the band-aid software tools needed to analyze and help control the overall dataflow of the system. Coarse-grained parallelism over two or possibly four processors, OK. Maybe.
Beyond that, you are deep into blood-vessel-bursting territory. Again, this is for performance-critical work; horses for courses applies here as everywhere else. But if that is what you're doing... "forgiving"? No, not really.
(* Slowly unclenches jaws *)