Cell LS and whether it's a dead end
I think, with the SPU LS, they were primarily trying to solve 2 problems:
1) Energy efficiency
2) Scalability
There are other side benefits compared to a cache-based architecture, like die-area savings, predictable timing (a less complex pipeline), and higher sustainable IPC thanks to the lack of memory stalls (unless you have random-access data structures ;P), but I think these 2 were the most important.
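To illustrate that parenthetical: chasing pointers from an SPU means a blocking DMA per node, so the "no memory stalls" property evaporates for random-access structures. A minimal sketch, assuming the standard spu_mfcio.h MFC intrinsics; the node layout and sum_list are hypothetical, and nodes are assumed to be 16-byte aligned in main memory:

#include <spu_mfcio.h>
#include <stdint.h>

/* Node layout shared with the PPU side (hypothetical); padded to 16 bytes
 * so each node is a single aligned quadword DMA. */
typedef struct {
    uint64_t next_ea;  /* effective address of the next node, 0 terminates */
    int32_t  value;
    int32_t  pad;
} node_t;

static node_t node __attribute__((aligned(16)));

int sum_list(uint64_t ea)
{
    int sum = 0;
    while (ea != 0) {
        mfc_get(&node, ea, sizeof(node_t), 0, 0, 0); /* fetch one node */
        mfc_write_tag_mask(1u << 0);
        mfc_read_tag_status_all();  /* full round-trip stall on every hop */
        sum += node.value;
        ea = node.next_ea;
    }
    return sum;
}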
I don't think they missed their mark on those two goals at all, but I think they underestimated the overhead involved in writing software for the chip. In my experience, there are several reasons why developing for Cell (on the PS3) is cumbersome compared to competitor platforms (the Xbox 360), but one really stands out for me:
No instruction cache (the SPU does have a very small I$-like instruction buffer, but it fetches from the LS rather than main memory).
Code and data have to share the LS. This is a load-balancing issue and a potential debugging headache. With an I$ that could fetch from main memory, we wouldn't have to worry about whether the code fits, whether a code update will suddenly screw up your finely tuned data layout, having to strip debug symbols to avoid bloating the code ELF, not being able to run debug builds because they're simply too big, accidentally overwriting your code with data, or jumping through hoops to get persistent breakpoints (because the code is fresh with every upload). I think this is the number one problem with working on the SPUs, and a proper I$ is something I would really like to see in Cell v2.
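For what it's worth, the kind of bookkeeping this forces on you looks something like the sketch below. _end is the usual GNU linker symbol for the end of the loaded image, and the stack reservation is a per-task guess on my part, not anything the hardware enforces:

#include <stdint.h>

extern char _end[];  /* linker-provided: end of text + data + bss in LS */

#define LS_SIZE   (256u * 1024u) /* total SPU local store */
#define STACK_RSV (16u * 1024u)  /* rough stack reservation; tune per task (assumption) */

/* How much LS is left for DMA buffers once the ELF image and a stack
 * reservation are accounted for. The SPU stack grows down from the top
 * of the LS, so everything between _end and the reservation is up for grabs. */
static inline uint32_t ls_bytes_free(void)
{
    return LS_SIZE - STACK_RSV - (uint32_t)(uintptr_t)_end;
}

Every code change moves _end, which is exactly why a fat debug build can silently eat the space your data layout was counting on.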
Managing data is comparatively much easier, and I think it's unrealistic to expect them to completely change their memory system design. It's a very intricate part of the chip (or any chip) and one of their fundamental design points. It will be interesting to see how performance scales on something like Cell iv32 vs Larrabee, though, and how their different approaches play out in real applications.
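By "comparatively easier" I mean the data-side patterns are well understood by now; double buffering, for instance, is only a few lines. A minimal sketch, again assuming the spu_mfcio.h intrinsics; process and the chunk size are placeholders, and total is assumed to be a multiple of CHUNK for brevity:

#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384u  /* 16 KB per buffer; DMA sizes must be a multiple of 16 */

/* Two LS buffers, 128-byte aligned for efficient DMA. */
static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(char *data, uint32_t n);  /* placeholder for real work */

static void wait_tag(unsigned int tag)
{
    mfc_write_tag_mask(1u << tag);
    mfc_read_tag_status_all();
}

/* Stream 'total' bytes from main memory at effective address 'ea',
 * overlapping the DMA of the next chunk with processing of the current one. */
void stream_process(uint64_t ea, uint32_t total)
{
    unsigned int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);           /* kick off first transfer */
    for (uint32_t off = CHUNK; off < total; off += CHUNK) {
        unsigned int nxt = cur ^ 1;
        mfc_get(buf[nxt], ea + off, CHUNK, nxt, 0, 0); /* prefetch next chunk */
        wait_tag(cur);                                 /* wait for current chunk */
        process(buf[cur], CHUNK);
        cur = nxt;
    }
    wait_tag(cur);
    process(buf[cur], CHUNK);                          /* last chunk */
}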
Whether it's a dead end, I don't know. It's one of those platforms that are theoretically very good if you A) spend substantial time optimizing for it, or B) have good tools to support the more complicated hardware. Right now, the tools just aren't there, and I'm not convinced they will be any time soon.