Nerve-Damage said:
From what I understand about the SPE local storage memory... it provides direct (programmable) access to the memory and fewer penalties (if any) for memory misses (i.e. the kind seen with cache misses).
No, all it does is move the main memory accesses around and force the programmer to manually schedule them.
On a cached architecture, when you try to access something not in the cache, the CPU stalls until it's loaded from memory. On a hardware-threaded CPU like Xenon, the other thread kicks in and runs while the first thread is stalled.
When the cache loads something from main memory, it loads the stuff around it as well; this is called a cache line. That reduces future stalls, because most of the time well written code tends to access things in an orderly fashion, nearby the thing that caused the miss. This is called the principle of locality. If the access isn't orderly or cache friendly, the programmer can rearrange his data structures or algorithms to match the cache lines more closely, or take manual control and issue a prefetch ahead of the time the data is needed to try to avoid a stall, though these techniques are not always possible. On some CPUs like Xenon, it's also possible to reserve cache lines and treat them like a local memory.
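Just to illustrate the manual prefetch idea, here's a rough sketch using GCC's __builtin_prefetch. The function, the array, and the prefetch distance of 8 elements are made up for the example; the prefetch is only a hint, and the right distance depends on the loop body and the memory latency of the target:

```c
#include <stddef.h>

/* Hypothetical processing loop: ask for a cache line a few elements
 * ahead so it's (hopefully) already loaded by the time we touch it. */
void scale_all(float *data, size_t count, float factor)
{
    const size_t ahead = 8;  /* prefetch distance, tuned per platform */
    for (size_t i = 0; i < count; ++i) {
        if (i + ahead < count)
            __builtin_prefetch(&data[i + ahead]);  /* hint only, never faults */
        data[i] *= factor;
    }
}
```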
On the SPE, if you need something in main memory, then you must request a DMA to transfer it to local memory before you can use it. This is essentially the same as a cache miss. DMAs take a really long time, so the programmer needs to structure his SPE code so he can do something else while the DMA is happening.
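For a concrete picture, here's a rough double-buffering sketch along those lines, using the mfc_get DMA intrinsics from the Cell SDK's spu_mfcio.h: fetch the next chunk while processing the current one, and only wait on the tag for the buffer you're about to use. The chunk size, the process_chunk function, and the effective-address math are placeholders for the example, not anything from a real codebase:

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384  /* bytes per DMA; one transfer maxes out at 16 KB */

static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process_chunk(char *data, unsigned size);  /* placeholder */

void process_stream(uint64_t ea, unsigned nchunks)
{
    unsigned cur = 0;

    /* Kick off the first transfer (tag = buffer index). */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (unsigned i = 0; i < nchunks; ++i) {
        unsigned next = cur ^ 1;

        /* Start fetching the next chunk while we work on the current one. */
        if (i + 1 < nchunks)
            mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);

        /* Wait only for the current buffer's tag, then process it. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);

        cur = next;
    }
}
```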
The point is, if you have 1000 MB of input data to process, then regardless of whether you have a cache or a local memory, you have to spend time moving 1000 MB of data into the cache or local memory and time writing the results back out.
All having a local memory does is force the programmer to think about when to schedule the main memory accesses, rather than having the cache implicitly schedule them on his behalf. Even on the cached architecture, though, the programmer still has to consider where to put his prefetches and how to construct his algorithms and lay out his data structures so they're cache friendly.
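One common example of that kind of data-structure rearrangement is going from an array of structs to a struct of arrays, so a pass that only touches one or two fields streams through contiguous cache lines instead of dragging every field in. The particle fields below are invented purely for illustration:

```c
#include <stddef.h>

/* Array-of-structs: updating only the positions still pulls every
 * particle's velocity and mass through the cache. */
struct ParticleAoS { float x, y, z, vx, vy, vz, mass; };

/* Struct-of-arrays: each field is contiguous, so a position-only pass
 * touches only the cache lines it actually needs. */
struct ParticlesSoA {
    float *x, *y, *z;
    float *vx, *vy, *vz;
    float *mass;
};

void advance_x(struct ParticlesSoA *p, float dt, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        p->x[i] += p->vx[i] * dt;  /* streams two contiguous arrays */
}
```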