Simon and Dio, please do not abandon the thread... discussions can be informative.
Simon F said:
Dio said:
And I know that traditionally, multiprocessing architectures are harder to program efficiently than single-processor architectures.
Having worked on a multi-cpu system (~32 cpus) in my previous job and having studied the damned things in my Honours year, I'd just like to say that Dio's comment is an understatement.
Well, at least you do not have to worry about cache coherency at the APU level: there is no cache.
The 128 KB of LS is effectively the main RAM of each APU: the e-DRAM and the external RAM are lower steps in the memory hierarchy.
All the code that an APU executes has to be in the LS.
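Just to make that concrete, here is a little C sketch of what working out of a 128 KB LS tends to look like. The apu_dma_get/apu_dma_put names are placeholders I made up ( modelled with plain memcpy here ), not the real transfer API, but the stage-in / compute / stage-out shape is what the patent describes:

[code]
#include <stddef.h>
#include <string.h>

#define LS_CHUNK 4096                  /* has to fit, with code and stack, in 128 KB of LS */

static float ls_buffer[LS_CHUNK];      /* stands in for Local Storage, the APU's "main RAM" */
static float external_ram[1 << 20];    /* stands in for e-DRAM / external RAM */

/* Hypothetical transfer primitives, modelled here with memcpy; the real
   toolchain will expose its own ( likely asynchronous ) DMA calls. */
static void apu_dma_get(void *ls_dst, const float *ext_src, size_t n)
{
    memcpy(ls_dst, ext_src, n * sizeof(float));
}
static void apu_dma_put(float *ext_dst, const void *ls_src, size_t n)
{
    memcpy(ext_dst, ls_src, n * sizeof(float));
}

/* Process one chunk of a large array that lives outside the LS:
   stage it in, compute entirely out of the LS, stage the result back out. */
void process_chunk(size_t ext_offset)
{
    apu_dma_get(ls_buffer, &external_ram[ext_offset], LS_CHUNK);
    for (int i = 0; i < LS_CHUNK; i++)
        ls_buffer[i] *= 2.0f;
    apu_dma_put(&external_ram[ext_offset], ls_buffer, LS_CHUNK);
}
[/code]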
A traditional SMP approach, think of your quad-processor Xeon for example, has 4 CPU cores that are all essentially trying to fetch data and instructions from the same main RAM. Each processor in this SMP system has a cache, and the problem ( no news here ) is keeping those caches coherent: we do not want one CPU to work on the data in its cache while another CPU loads that memory location from main RAM before the modified cache blocks are written back; likewise, one CPU might update its cache and, let's say, main memory as well, while another CPU core already has that memory block in its cache and keeps working on the stale copy instead of fetching the updated block from memory.
From both a system designer's point of view and a programmer's point of view, this can be annoying.
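This has nothing to do with CELL specifically, but a plain pthreads toy shows the programmer's side of the annoyance: without explicit synchronization, two cores happily read-modify-write their own cached copies of the same counter and updates get lost.

[code]
#include <pthread.h>
#include <stdio.h>

/* One counter in main RAM, cached by whichever core runs each thread. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        /* Without the lock, each core may read-modify-write its own cached
           copy of counter and one update silently overwrites the other. */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* 2000000 only because we paid for the lock */
    return 0;
}
[/code]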
In most cases I think we can compare a CELL CPU with a distributed computing environment over a relatively large network: each PE is a separate and independent super-node in this system.
Each node, in most cases, could be thought of as a LAN with a host APU attached to it ( the APU has its own RAM ).
Yes, we have a common resource ( e-DRAM ) that all APUs do access, but if you look at the patent you will see that for all intents and purposes ( unless we start being funny with APU's masks using the PUs ) each APU has its protected portion of the resource.
A memory sandbox structure is implemented in hardware: each APU has its own portion of e-DRAM which is tagged with that APU's ID, and access to that resource is granted only to the APU whose ID matches the one the memory space was tagged with ( unless you play with the APU masks as I said: trusted code run by the PUs can change the APU masks appropriately so that one APU can access the memory space reserved for another APU, but it is up to the programmers whether they want to deal with that or not ).
APUs do not necessarily share data with each other, as that would be more complicated to deal with ( it is not particularly relaxing to think about 32 APUs all accessing a common, non-subdivided memory space ).
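If it helps, here is a rough software model of what that hardware check amounts to; the struct fields and the exact mask semantics are my own guesses from the patent's description, not its actual register layout:

[code]
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model of a memory sandbox: an e-DRAM region tagged with the
   ID ( and a mask ) of the APU allowed to touch it. */
struct sandbox {
    uint64_t base;      /* start of the e-DRAM region            */
    uint64_t size;      /* length of the region                  */
    uint8_t  apu_id;    /* ID the region was tagged with         */
    uint8_t  apu_mask;  /* mask that trusted PU code may adjust  */
};

/* An access hits the sandbox when the address falls inside it; it is allowed
   when the requesting APU's masked ID matches the tag.  Widening the mask is
   how trusted PU code lets one APU into another APU's region. */
static bool access_allowed(const struct sandbox *sb,
                           uint64_t addr, uint8_t requester_id)
{
    if (addr < sb->base || addr >= sb->base + sb->size)
        return false;                             /* not this region at all */
    return (requester_id & sb->apu_mask) == (sb->apu_id & sb->apu_mask);
}
[/code]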
Treating each APU as its own thread/process and implementing message passing between APUs ( this implementation is part of the trusted code: OS + game code that orchestrate the APUs [a sort of "intelligent" thread scheduler and resource organizer that either the programmer or middleware software would provide] ) should be the preferred approach. Of course it will still not be easy, that I understand, but the more structured approach might help.
Directing an orchestra is not a trivial job, but that is what will be required in such a scenario.
In each PE the PU runs the OS code, performs I/O of data and runs part of the game code.
The PU will then assign tasks to 1 or more APUs and will instruct them how to communicate and inter-operate ( what to do with partial results, etc... ). If you imagine sharing T&L across several APUs, or even different PEs, you will see that for things such as collision detection we do need to put some effort into the PU portion of our code, as it is the PU's responsibility to make sure everything is performed according to the original plan and in the original sequence.
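Something like the following ( names and message format invented for illustration, this is not real PU code ) is the kind of orchestration I mean: the PU slices a T&L batch across APUs and then waits on all of them before collision detection is allowed to run.

[code]
#include <stdint.h>
#include <stdio.h>

/* Hypothetical task record the PU would hand to an APU; the real mechanism
   ( apulets, mailboxes, whatever the final toolchain exposes ) will differ,
   but the PU-side job is the same: decide who does what, in which order,
   and where partial results go. */
struct apu_task {
    uint32_t opcode;        /* e.g. TRANSFORM, LIGHT, COLLIDE           */
    uint64_t input_addr;    /* where the input lives in e-DRAM          */
    uint64_t output_addr;   /* where the APU should put partial results */
    uint32_t count;         /* number of vertices in this slice         */
};

/* Stubs standing in for the real PU-to-APU communication primitives. */
static void pu_send_task(int apu, const struct apu_task *t)
{
    printf("APU %d: opcode %u, %u vertices\n",
           apu, (unsigned)t->opcode, (unsigned)t->count);
}
static void pu_wait_done(int apu) { (void)apu; }

/* Split a T&L batch across n APUs, then wait for all of them so that
   collision detection only ever sees fully transformed geometry, i.e. the
   original sequence is preserved ( remainder handling omitted ). */
void transform_batch(uint64_t verts, uint64_t out, uint32_t count, int n_apus)
{
    uint32_t per_apu = count / n_apus;
    for (int i = 0; i < n_apus; i++) {
        struct apu_task t = {
            .opcode      = 1,  /* TRANSFORM */
            .input_addr  = verts + (uint64_t)i * per_apu * 32,
            .output_addr = out  + (uint64_t)i * per_apu * 32,
            .count       = per_apu,
        };
        pu_send_task(i, &t);
    }
    for (int i = 0; i < n_apus; i++)
        pu_wait_done(i);       /* barrier before collision detection runs */
}
[/code]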
About launching PlayStation 2 one year later, with 1 more year of R&D and using 1 or 2 generations of manufacturing processes after the 250 nm node, I do not see how that would have worsened the whole picture economically ( aside from the greater competition ).
If you launch in 2001, your next generation is not, generally speaking, going to launch before 2007; if you launch in 2000, your next generation is not going to launch before 2006.
In the case of SCE and Toshiba, instead of launching using 250 nm and going down to 90 nm, you could have launched using 180 nm and gone down to 65 nm.