There is a great possibility that the PS4 will use the "Cell 3" chip with 34 cores. It is based on the same architecture as the current Cell. Therefore, the same knowledge and tools would most likely carry over from the PS3. That would make developing MUCH easier for them for the next generation. Too bad that means developers will be able to max the console out much earlier in its lifecycle.
Also, the PS3 was reported as breaking even or making a profit on the cost since January this year. The PS4 will come out when they need it to in order to prevent MS from taking the next-gen market (be it 2010, 2011, or 2012).
http://appft1.uspto.gov/netacgi/nph...AN/"International+Business+Machines"+AND+simd
United States Patent Application 20080126745
Kind Code A1
Mejdrich; Eric Oliver ; et al. May 29, 2008
Operand Multiplexor Control Modifier Instruction in a Fine Grain Multithreaded Vector Microprocessor
Look at that patent: now imagine 3 clusters connected by either a shared ring bus or, better, a cross-bar switch. Each cluster would have 3 VTEs + 1 BTE (private L1 caches) + Mailbox + I/O Logic + Shared Cache (each VTE having two SIMD units attached to a shared register file). Add to that one small legacy cluster, with an optimized/re-engineered PPE + SPU in isolation mode to run the Hypervisor and the OS, and the main memory controller (whatever Rambus has at that time).
Each VTE, as explained in that and other related patents, would be cache based (hardware pre-fetcher supporting software hints, no manual DMA management any longer, and no Local Store memory) with a single 32x128-bit register file (a small thread context, allowing switching between lots of threads thanks to fast context switching, plus acceleration for commonly used synchronization primitives). I'd see 256-bit registers being used, Larrabee style (edit: even though Larrabee's registers are 512 bits wide), as that would keep real-world FP performance much closer to the chip's theoretical peak.
Performance does drop when you execute two scalar instructions or one scalar and one vector instruction... basically when you do not execute work that causes the hardware of all 8 processing lanes (4 lanes for each SIMD unit in each VTE) to be used together to calculate the final result (over a certain number of cycles...).
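To put rough numbers on that drop, here is a minimal sketch in Python. It is only a toy model: it assumes a scalar instruction keeps exactly one lane of its SIMD unit busy, which is my simplification and not something the patent states. The names `lane_utilization`, `LANES_PER_SIMD` and `SIMD_UNITS` are made up for illustration.

```python
# Toy model of lane usage in one VTE: 2 SIMD units, 4 lanes each (8 lanes total).
# ASSUMPTION: a scalar instruction occupies 1 lane, a vector instruction all 4.
LANES_PER_SIMD = 4
SIMD_UNITS = 2


def lane_utilization(issued):
    """Fraction of the VTE's 8 lanes doing useful work in one issue slot.

    `issued` holds one entry per SIMD unit, each either "vector" or "scalar".
    """
    if len(issued) != SIMD_UNITS:
        raise ValueError("expected one instruction kind per SIMD unit")
    busy = sum(LANES_PER_SIMD if kind == "vector" else 1 for kind in issued)
    return busy / (LANES_PER_SIMD * SIMD_UNITS)


for pair in (["vector", "vector"], ["scalar", "vector"], ["scalar", "scalar"]):
    print(pair, lane_utilization(pair))
```

Under that assumption, two vector instructions fill all 8 lanes, a scalar plus a vector fills 5 of 8, and two scalars fill only 2 of 8.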
Taking the idea of using these vector throughput units as the main target for application writers, I'd leave the BTE to run all the book-keeping for them, and be the kind of "offload/special functions accelerator" that some people tried to use SPUs as.
The BTE could be 4-way multi-threaded (SMT) to manage work for each VTE in the cluster as well as its own, and could be under direct OS control (it would also be the unit performing I/O functionality, like handling stdio functions such as fopen, etc., for each VTE in the cluster).
The BTE might have a single FPU, but no Vector Extensions (mainly integer processing focused).
What the programmer sees would be a simple array of homogeneous processing units (the VTE's) with the heterogeneous architecture of the chip abstracted away by the OS and various libraries.
Size of the shared cache in each cluster: between 1 MB and 2 MB (less than 8 MB of SRAM cache in the chip, counting the PPE's L2 and the 256 KB of LS of the lonely SPUv1).
Clock frequency at 45 nm (introductory manufacturing process unless 32 nm is already in high volume): 4 GHz.
FP performance: let's see...
1 VTE: 16 FP ops/cycle.
One cluster: 3xVTE -> 3 VTEs * 16 FP ops/cycle per VTE = 48 FP ops/cycle.
Three clusters: 3xcluster -> 3 clusters * 48 FP ops/cycle per cluster = 144 FP ops/cycle.
Peak SP FP performance: 144 FP ops/cycle * 4 GHz = 576 GFLOPS.
Depending on various factors the clock speed could be quite a bit higher, or maybe there could be space for one more cluster and you could take away the "legacy cluster" as I put it. In this case (4 clusters at 4 GHz) you'd have a peak SP FP performance of: 768 GFLOPS.
Or... you could keep the "legacy cluster" and add 2 of the new clusters as long as the chip size does not get unreasonable.
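The arithmetic above is easy to replay for other configurations. A small sketch in Python, where `peak_gflops` and its parameter names are just mine, and the defaults are the figures from this post (3 VTEs per cluster, 16 FP ops/cycle per VTE, 4 GHz). The 5-cluster line is my reading of "add 2 of the new clusters" on top of the original 3.

```python
def peak_gflops(clusters, vtes_per_cluster=3, ops_per_vte_cycle=16, clock_ghz=4.0):
    """Theoretical peak single-precision GFLOPS for the speculated layout.

    All defaults are the guesses made in this post, not published specs.
    """
    ops_per_cycle = clusters * vtes_per_cluster * ops_per_vte_cycle
    return ops_per_cycle * clock_ghz


# 3 clusters = base design, 4 = legacy cluster swapped for a new one,
# 5 = legacy cluster kept plus two extra new clusters (my interpretation).
for n in (3, 4, 5):
    print(f"{n} clusters: {peak_gflops(n):.0f} GFLOPS")
```

These are peak numbers only; they say nothing about what real code would sustain.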
I think they will be more than happy with having more than 2x the performance of the CELLv1 chip inside PS3... do not expect miracles from PS4... the era of aiming for the sky and then letting a quick series of manufacturing process shrinks bring the chip production down to sane cost levels has already stopped being practical with PS3 (notice the fact that at 90 nm the CELL chip is smaller than the early figures for PS2's EE processor back when PS2 was near its launch).
A DP FP optimized chip would not be in consideration for PS4 but, as happened with CELL, could be developed as an evolution of the SP design (like optimizing the DP FP rate in each VTE).