I do see a drive on the software side: the computing power needed for multi-media ( video playback, music, 3D gaming, etc... ) is growing at a pace which makes a power efficient solution desirable.
What CELL would mean to a PDA could be, in the worst case scenario ( programs that are single threaded and use all scalar operations even if compiled for CELL ), a few more transistors on the CPU.
Not all of those transistors will be constantly switching on and off and thus consuming power.
The patent I posted a picture from describes what basically happens in such a scenario: all the unused APUs would go to sleep, and the APU(s) running the program's code would be in power saving mode as well: only one of the FXUs/FPUs would be active while processing scalar instructions.
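To put rough numbers on that, here is a toy model ( my own sketch; the 8 APUs per PE and the 4 FPUs + 4 FXUs per APU layout are from the patent, the rest is my reading of it ):

```python
# Toy model of the power saving behaviour described above ( my own sketch ).
# The patent describes a PE as 1 PU + 8 APUs, each APU with 4 FPUs and 4 FXUs.

APUS_PER_PE = 8
FPUS_PER_APU = 4
FXUS_PER_APU = 4

def switching_units(busy_apus, scalar_only):
    """Execution units actually switching; unused APUs sleep entirely."""
    if scalar_only:
        # power saving mode: reading "only one of the FXUs/FPUs" as one
        # FPU and one FXU kept awake per busy APU
        per_busy_apu = 2
    else:
        per_busy_apu = FPUS_PER_APU + FXUS_PER_APU
    return busy_apus * per_busy_apu

# Worst case: a single threaded, all scalar program running on 1 APU
print(switching_units(busy_apus=1, scalar_only=True))             # -> 2
# Fully loaded PE: all 8 APUs crunching SIMD code
print(switching_units(busy_apus=APUS_PER_PE, scalar_only=False))  # -> 64
```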
Imagine a CELL chip with a single PE: 1 PU and 2 APUs, plus the Pixel Rendering part ( Pixel Engine, Image Cache and CRTC ), all running at, let's say, 500 MHz.
Say we have like 4-8 MB of e-DRAM and some off-chip DRAM.
That is not impossible at 90 nm, but come the 65 nm SOI node ( which could be used for these PDA chips as well if the yields are good enough ), allocating the transistor budget for such a set-up will not be a problem.
In this sense I can see the ideas behind the "CPU cost will not be the determinant factor" argument.
But even with that kind of transistor budget, the issue of maximizing the use of those transistors remains: setting up big high level software layers to enable neat and versatile data sharing and networking between these different devices would mean spending more transistors than what a CELL solution would need to do the same thing.
In such a worst case scenario ( as I was describing a few lines above ), which would not be the norm, this would be equivalent to running just the PU and 1 APU's FXU/FPU.
Still, that would yield a maximum peak of 1 GFLOPS or 1 GOPS, which for a PDA would not be THAT bad.
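That 1 GFLOPS figure falls straight out of the clock speed, assuming ( my assumption, the patent does not spell this out for the PDA case ) that the one active FPU retires a fused multiply-add, i.e. 2 FLOPs, per cycle; the 1 GOPS figure implies the same counting on the FXU side:

```python
CLOCK_HZ = 500e6      # the hypothetical PDA clock from above
OPS_PER_CYCLE = 2     # assumes one fused multiply-add ( 2 FLOPs ) per cycle

peak_gflops = CLOCK_HZ * OPS_PER_CYCLE / 1e9
print(peak_gflops)    # -> 1.0, the worst case peak quoted above
```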
With conventional ARM, SH or MIPS architectures you can certainly match the worst case performance of CELL ( the PDA solution ) with fewer transistors, but that is perfectly fine; even numbers like 1 GFLOPS and 1 GOPS would be fine in the right context: such a worst case scenario would mean users who only use a Word Processor or browse the web or send e-mails ( notice the use of "or" ), and that kind of processing power is more than what those users would need.
The challenge lies in 3D graphics, multi-tasking and multi-media applications in general: those play to CELL's strengths.
CELL was designed to do "ok" on applications that are serial in nature and do not offer much chance to extract power from a parallel processor configuration, and to do "very well" on very bandwidth and processing intensive applications that tend to offer more extractable parallelism.
CELL was also designed to offer power saving features for the former kind of applications: down to putting the individual unused FPUs and FXUs into sleep mode.
CELL is not truly a revolution that brings a totally new concept: while it has some innovative ideas, its base is in tons of research projects from the past 20-30 years that tried to shake the world and bring new computing paradigms.
What the CELL designers decided, starting the architecture from scratch, was to take this immense amount of past research and put it to good use now that the implementation technology finally allows those concepts to be tried in economically feasible projects.
E-DRAM is not a novel idea; highly parallel execution resources, modularity and scalability, distributed processing, etc... are not new concepts either.
Put the right ideas together with the necessary fairy dust to aggregate them, and you would get something very interesting.
The new technology finally allows all those ingenious ideas that have been sitting in labs for 10-20 years to be put to work: as I said, choose the right ones, add some spark of genius and you will probably accomplish something ( sorry if I repeat some concepts over and over ).
This is what happens when you sit down, look at the manufacturing processes you expect in the next 5-7 years ( leaving yourself headroom in case the timeline gets pushed back, or hopefully forward ) and design something new around the physical limitations you will have in 5-7 years, not the ones you have currently.
You will see the good behind this once IPF, in its markets, spreads its wings: a carefully designed ISA will have its advantages in the long run.
The problem is scaling the existing architectures to match that kind of potential: they were not designed to work on these massive ( parallel ) workloads.
Pentium 4, ARM7-11, MIPS32-64, SH-4 and SH-5 are all the result of architectures and design methodologies optimized to run the common case fast, in a period where the common case did not involve processing tons of parallel data streams.
The Pentium 4 EE symbolizes this: ridiculously fast for what 80% of the users that own PCs now would need ( e-mails, web browsing, Word Processing and Excel spreadsheets ), but not fast enough to drive 3D games without the help of expensive GPUs with powerful and highly parallel processing resources.
As they mention in the APU patent I linked previously ( if you do not have the link I will re-post it ), a big issue is designing a power efficient architecture that can deal with this new, next generation computing problem: it needs to be very fast at processing several parallel data streams, and it needs to provide decent performance for people who just do web browsing, Word Processing, etc...
That approach was good for x86 when the world was beginning to see the birth of current multimedia applications, and it was successful in scaling x86's performance to meet the demands from the content providers: right now, though, those architectures have already been left behind.
Imagine I could have a CELL CPU in a Desktop PC running JIT compiled x86 legacy code at 2-3 GFLOPS or GOPS of effective performance ( well below current Pentium 4 standards, but who needs much more for web browsing and word processing? ), yet able to reach 150-200 out of its 256-512 GFLOPS peak in complex 3D games, etc..., all for a similar price compared to a Pentium 4 EE ( by 2005 it should hit $400; it is starting later this year at $740 in 1,000-chip quantities, which means that you, the single consumer, will pay slightly more for your order, depending on who you are buying the chip from ).
Judging by the power they are trying to pack into PlayStation 3 for $299, a $400 CELL CPU ( imagining some sort of profit being made per chip ) could be a decent processor.
Even if it only did 50 GFLOPS in PC games, that would still be more than 4x the power of the Pentium 4 EE.
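As a rough sanity check on that ratio ( with my own numbers: a 3.2 GHz Pentium 4 EE and a best case of 4 single precision FLOPs per cycle through SSE ):

```python
P4EE_CLOCK_GHZ = 3.2       # Pentium 4 EE launch clock
SSE_FLOPS_PER_CYCLE = 4    # best case single precision throughput, my estimate

p4ee_peak_gflops = P4EE_CLOCK_GHZ * SSE_FLOPS_PER_CYCLE   # -> 12.8
print(50 / p4ee_peak_gflops)   # -> ~3.9, in the ballpark of the 4x above
```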
It could run single instances of Word and Excel like a Pentium III 733 MHz for all I care ( even though, if you ran several instances of those applications or tons of tasks in the background, you would probably get a nice speed-up ), but when you went to play in the areas consumers are starting to buy their Hardware for ( 3D games, high quality video [how much performance do you have left when you run the Hi-Def version of Terminator 2 on your PC? Not much, even on 3.0 GHz computers] and sound processing ), you would see this architecture leaving the common x86 in the dust.
I wanted to link this patent as well, it might be interesting to the discussions we are having:
http://appft1.uspto.gov/netacgi/nph...)&OS=an/"sony+computer"&RS=AN/"sony+computer"
A multi-processing computer architecture and a method of operating the same are provided. The multi-processing architecture provides a main processor and multiple sub-processors cascaded together to efficiently execute loop operations. The main processor executes operations outside of a loop and controls the loop. The multiple sub-processors are operably interconnected, and are each assigned by the main processor to a given loop iteration. Each sub-processor is operable to receive one or more sub-instructions sequentially, operate on each sub-instruction and propagate the sub-instruction to a subsequent sub-processor.
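If I read that abstract right, the control flow looks roughly like this ( a toy sketch with names I made up, just to illustrate the cascade, not the patent's actual mechanism ):

```python
# Toy sketch of the cascaded loop idea from the abstract: the main processor
# handles everything outside the loop and assigns iterations; each
# sub-instruction enters the chain once and propagates from sub-processor
# to sub-processor.

class SubProcessor:
    def __init__(self, index):
        self.index = index
        self.iteration = None   # loop iteration assigned by the main processor

    def run(self, sub_instruction, data, results):
        if self.iteration is not None:
            results[self.iteration] = sub_instruction(data, self.iteration)

def main_processor_loop(iterations, sub_instruction, data, n_subs=4):
    chain = [SubProcessor(i) for i in range(n_subs)]
    results = [None] * iterations
    for base in range(0, iterations, n_subs):
        # the main processor assigns one iteration per sub-processor
        for sub in chain:
            i = base + sub.index
            sub.iteration = i if i < iterations else None
        # the sub-instruction then propagates down the chain of sub-processors
        for sub in chain:
            sub.run(sub_instruction, data, results)
    return results

# usage: each cascaded sub-processor squares "its" element of the array
print(main_processor_loop(6, lambda d, i: d[i] * d[i], [1, 2, 3, 4, 5, 6]))
# -> [1, 4, 9, 16, 25, 36]
```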