Is there something that CELL can still do better than modern CPUs/GPUs?

A 2 PPE/16 SPE Cell variant, a GF106, and 2 GB of XDR2 would be a huge improvement, and I honestly don't think Sony needs to go any further than that (though 2 GB of RAM could be really tight for devs). Development is getting way too expensive, and diminishing graphics returns just aren't worth the price (but let's at least make 1080p plus some form of AA the standard!). The focus really needs to be on the orchestration of games now. Something like a visible, true volumetric fluid simulation is obviously going to be hard on the CPU, and hard on the GPU to render too, so it will still take quite a bit of rendering power, but surfaces are clearly running into a visual wall that monitor resolutions can't keep up with. Devs would be happy with familiar hardware and fewer tight restrictions on how memory is used (hence unified).

Sony would be incredibly stupid to go over $300 again, so that's a no-no. They need to wait and be patient for process nodes to come down in cost and become feasible. This ties into the idea that Sony should focus on a smaller system with less of a footprint, less noise, and lower power usage. I don't think consoles need to send much of a visual message beyond being nice, sleek, and functional, and workable with varying room styles or other entertainment components. The PS3 designs (lol, PS3 grill) don't seem to fit that philosophy in my eyes. Oh, and Sony has no need to expand beyond the control inputs currently available: DualShock 3, PS Move, and the camera. Making current products "upwards compatible" is a necessity.

I've heard folks make similar statements recently with regard to Sony's options in designing a PS4, though I also think there is another HUGE consideration Sony has to make when designing their next console, especially given the userbase split between themselves and their competitors this generation.

Sony needs to provide not only a modern GPU in the PS4 which can compete with whatever MS will do, but also a CPU able to match whatever the XboxNext has in the box.

Sony did well with the CELL this generation in terms of their CPU part. CELL saved their asses big time after RSX proved lacking compared to Xenos in the 360. Multiplatform development skewed greatly towards the 360 at the beginning of this gen, and until Sony's first parties had learnt how to squeeze performance out of the CELL to make up for the RSX's deficiencies, it was extremely hard for Sony to persuade gamers to purchase their console over an Xbox 360 that was $100 cheaper.

Sony was lucky that MS didn't also license SPU tech for the 360 (given that they were able to get a PPU-based CPU), as that would have finished the PS3 this gen. Next gen, Sony cannot design their console in a vacuum, and therefore needs to provide GPU and CPU performance sufficient to match what MS will do.

Let's say that MS goes with a modern 6-core AMD CPU: will a 2 PPU (vanilla, i.e. not an updated architecture), 14-24 SPU CELL be able to outperform the AMD part in the more generalised CPU tasks that are relevant for gaming? The SPUs are great for specialised tasks and even some graphics work, but will they alone be sufficient for all the CPU workloads expected in next-gen games?

I personally like the idea of having a few IBM POWER7 cores attached to a group of updated SPUs; however, the feasibility of such a design, whilst still having to devote a sizeable amount of the budget to the GPU, is something Sony will have to look into thoroughly.
 
I've had thoughts of a successor that implements more PPU cores, like POWER7. I would probably like to see a higher number of POWER7 PPU cores in a Cell successor, like 4 or so, even if it's paired with just 16 SPUs. I'm sure devs would love to have more general-purpose capability so they aren't forced onto the SPUs for work the SPUs aren't best suited for, freeing them up for tasks that do suit them. That one PPU in the Cell seems very lonely, and I think there should have been two. But I can't argue with the success of the Cell in terms of what it can do...

Basically I think the core technology for the general-purpose cores and SPUs pretty much exists and is ready; it's just a matter of designing a new chip that properly combines the features at a reasonable price and power consumption. As for Microsoft, I have a feeling they may again go with a more "basic" or general-purpose CPU, but throw in a screamer of a GPU that is not only highly capable graphically but fully usable for GPGPU computing as well. The DX11 and DirectCompute specs give MS a head start in pushing this, but I have a feeling MS will open its heart to OpenCL for the Xbox 720, with ATi very likely again providing the Xbox 360 successor's GPU and maybe the CPU. We all know how much ATi loves OpenCL ;) I would still recommend that MS stay with IBM for the CPU. I think backwards compatibility will be crucial to them this time around, so they'll want their relations with IBM and ATi to stay good, unlike how things ended with production of the original Xbox.
 
Too many simple POWER cores would just shift the multi-threading problem somewhere else; what I think PS4 would need is 1 or 2 CPU cores at most that are VERY fast at single-threaded workloads. If you can fit 4 of them, fine... if you can only have 2 of them, with or without SMT, fine too...
Having 4+ multi-threading-oriented cores (with low single-threaded performance) is not what would make PS4 easier to program for than PS3.
 
Panajev... I agree with your prognosis and would be eager to know what your ideal CELL-based CPU for PS4 would look like?
 
Untrue... certainly not in any way "greatly" at all... These days multiplatform development as a whole produces parity across both platforms, to all intents and purposes. It's hyperbolic to say anything otherwise.

Besides, that's irrelevant to the discussion at hand and beside the main point I was trying to make.

Panajev... I agree with your prognosis and would be eager to know what your ideal CELL-based CPU for PS4 would look like?

I'd rather state what is more realistic to expect if IBM convinces SCE to keep CELL for PS4.

Let's take the following into consideration.

1.) CELL as a separate line does not exist in IBM's roadmap anymore. Only bits and pieces of it are merged into the POWER line of CPU's. PPU's are not the bits that IMHO are worth keeping for IBM, SPU's are. FlexIO and XDR memory controllers are RAMBUS's responsibility (and I think they can do a good job integrating XDR2 technology and an updated FlexIO bus into whatever IBM offers Sony).

2.) IBM has already pushed SPU's into the DP FP territory and is willing to push for better and better DP FP performance if they can although the difference between SP and DP floating point performance has already shrunk a lot.

3.) It is unclear what kind of major updates to the SPU ISA we can expect out of IBM and whether or not they make sense for SCE.

4.) Any POWER core should be able to take the PPU's place; one of the few requirements SCE would have, for BC reasons, should be VMX support, and IBM seems keen to keep VMX even in their new POWER7 CPU's. The only change I'd like to see in the VMX unit would be a way to transfer data from the GPR's to the VMX registers and back without potentially going off-chip (the LHS issue).

5.) We should not expect any crazy custom-for-PS4 element added to PS4's CPU. PS4's CPU should be quite standard, made of standard blocks provided by IBM... 1-2 POWER cores + SPE's are within what SCE can ask of IBM without IBM asking for a ton of R&D money from SCE.

POWER7 might have elements that go beyond what's needed for a console CPU, so I do not know what core they will try to sell to SCE. Wide, OoOE, 1+ MB L2 cache per core, enhanced VMX unit (fast, low penalty, VMX-to/from-GPR's transfers)... 2 of such cores + 8 active SPU's + shared L3 cache... all running at 4+ GHz.

Even a 4 GHz clock with such a strong central general-purpose portion and 1 additional active SPU would provide a neat CPU-side performance bump for PS4 (remember that each SPU would also receive an 800 MHz clock boost). A bigger RAM pool and a DX11-class GPU should do the rest to make PS4 really competitive :).
 

Again, you're throwing things off topic by continuing to hammer on that point... which is irrelevant.

Panajev2001a, cheers for your insight and in-depth analysis of Sony's realistic options. My question would follow (and this is really my own personal wishful thinking): how many more SPUs do you think Sony could wrangle out of IBM before they'd have to break the bank?
:devilish:
 
I'd rather state what is more realistic to expect if IBM convinces SCE to keep CELL for PS4.

Let's take the following into consideration.

1.) CELL as a separate line does not exist in IBM's roadmap anymore. Only bits and pieces of it are merged into the POWER line of CPU's. PPU's are not the bits that IMHO are worth keeping for IBM, SPU's are. FlexIO and XDR memory controllers are RAMBUS's responsibility (and I think they can do a good job integrating XDR2 technology and an updated FlexIO bus into whatever IBM offers Sony).

2.) IBM has already pushed SPU's into the DP FP territory and is willing to push for better and better DP FP performance if they can although the difference between SP and DP floating point performance has already shrunk a lot.

3.) It is unclear what kind of major updates to the SPU ISA we can expect out of IBM and whether or not they make sense for SCE.

4.) Any POWER core should be able to take the PPU's place; one of the few requirements SCE would have, for BC reasons, should be VMX support, and IBM seems keen to keep VMX even in their new POWER7 CPU's. The only change I'd like to see in the VMX unit would be a way to transfer data from the GPR's to the VMX registers and back without potentially going off-chip (the LHS issue).

5.) We should not expect any crazy custom-for-PS4 element added to PS4's CPU. PS4's CPU should be quite standard, made of standard blocks provided by IBM... 1-2 POWER cores + SPE's are within what SCE can ask of IBM without IBM asking for a ton of R&D money from SCE.

POWER7 might have elements that go beyond what's needed for a console CPU, so I do not know what core they will try to sell to SCE. Wide, OoOE, 1+ MB L2 cache per core, enhanced VMX unit (fast, low penalty, VMX-to/from-GPR's transfers)... 2 of such cores + 8 active SPU's + shared L3 cache... all running at 4+ GHz.

Even a 4 GHz clock with such a strong central general-purpose portion and 1 additional active SPU would provide a neat CPU-side performance bump for PS4 (remember that each SPU would also receive an 800 MHz clock boost). A bigger RAM pool and a DX11-class GPU should do the rest to make PS4 really competitive :).


1) A new PPU is obvious; even IBM engineers admit that the current one is broken. IMO Rambus controls its own fate here: they need the PS4 more than the other way around, and it would make a nice showcase, especially with a UMA config.

2) IMO dual-threaded SPU's would provide the best bang for the buck and shouldn't be that hard to do (and have probably already been spec'ed). You'd want a small bump in LS, maybe to 384 KB, to keep both threads fed. I'd bump the SPU count up to 12, with 1 disabled for yield, 1 for security, and 1 for PSN code; that leaves you 9 for game code, or 18 threads (tallied at the end of this post). Add a small clock increase (say 3.8 GHz) and that's a huge bump over the current 6 threads, and it would cost you only 100-120M transistors.

3) I'm sure Sony already has an updated ISA want list, it's just a question of what's doable (for the price).

4 and 5) POWER7 is already set up as a modular core for easy reconfiguration. The 2 execution units providing legacy mainframe support in the current config are obvious candidates to omit, and the 4 DP units are overkill for a PS4 (2 should suffice), so there are 4 out of 12 units that could be chopped, roughly the same number of transistors as you added to update the SPU's, and you still get 2 additional threads per core (4 instead of 2), plus a possible clock bump.

That leaves you with the last 3 items/questions: eDRAM, GPU, and a unified chip. I guess whether to go unified or not would drive your other decisions; I would definitely do it if feasible.
For eDRAM, you'd want/need 2-4 MB for the PPU, and it'd be sweet to have 10 MB of L3 cache for the SPU's; add in 16 MB for shared MRT's on a unified chip and that gives you 30 MB of turbo charge being fed by ultra-fast XDR2 mem.
Looking at Fermi's design, it almost screams Cell integration. 3 Fermi GPCs with 3 SMs each, coupled with the 22 Cell threads and all tied together with the 30 MB of shared eDRAM on an integrated chip, would give you excellent overall performance and flexibility.

That's actually a fairly moderate update (in terms of cost) which at 28nm should fit on a chip smaller than a GF100, but it would be a significant upgrade in performance.
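For reference, the quick tally behind the thread count in point 2, in C, treating every figure above as a hypothetical proposal rather than anything official:

[code]
#include <stdio.h>

int main(void) {
    /* Hypothetical SPU budget from point 2 above -- nothing official. */
    int spus_on_die     = 12;   /* physical SPUs on the die                */
    int spu_yield       = 1;    /* disabled to improve manufacturing yield */
    int spu_security    = 1;    /* reserved for the security layer         */
    int spu_psn         = 1;    /* reserved for PSN / OS code              */
    int threads_per_spu = 2;    /* the dual-threaded SPU idea              */
    int ls_per_spu_kb   = 384;  /* bumped from 256 KB to feed both threads */

    int game_spus    = spus_on_die - spu_yield - spu_security - spu_psn;
    int game_threads = game_spus * threads_per_spu;

    printf("SPUs left for game code : %d\n", game_spus);      /* 9  */
    printf("Hardware threads        : %d\n", game_threads);   /* 18 */
    printf("Total local store       : %d KB\n", spus_on_die * ls_per_spu_kb);
    return 0;
}
[/code]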
 
Is there any way at all that you could design a configuration where the SPUs' dedicated memory could double as an eDRAM-type memory? Say you had 16 SPUs; that would give 4 MB of SPU-type RAM. If you could address it as one pool, that could be a great help for, say, PS2 emulation. Then again, if XDR2 can do 200 GB/s, what do you need eDRAM for?

So in that case I would dream of 4 GB of XDR2 on a 512-bit bus that can be accessed by the GPU parts and 32 SPUs alike and in parallel, where the CPU/GPU would be designed to cooperate to the point where it's pretty much an integrated design. It would leave room for a great deal of flexibility.

I guess the target should be 120 fps at 1920x1080, or in other words full-resolution stereo display. The price should be $399 max.

I'm just dreaming, though, and don't have an inkling of whether or not this is even remotely possible. But considering the time between the PS3's and the PS4's releases, any part that doesn't scale by a factor of 8 is a disappointment, though acceptable if the launch unit's cost can be kept at $399 or lower.

I would make the HDD a core component, but look strongly into including a fast HDD, maybe using a few in parallel, or find a clever way to use (multiple) solid state drives with a flash drive for virtual memory, or the reverse; I have no idea what works better at this point.

I would also consider making the BD drive optional, and only release one if/when it is fast enough to run games straight from the disc with minimal (Flash or whatever) caching. You want games to be able to run from the HDD alone anyway, so there is no gain from mandating a BD drive at this point other than it being valuable as a BD player for people who want one. By the time the PS4 comes out, not everyone will need it to be a BD player, so if it is possible to configure away some options and offer a DD-only PS4 at, say, $100 less than the other model, it may well be worth it.
 
XDR2 on a 512-bit bus would be insane console-wise because, as far as I can tell, they are only building up to x32 chips, so you would need 16 of them. Assuming 4 Gbit density per chip, that would be 8 GB of RAM, and since it's 25.6-51.2 GB/s per chip, that works out to 409.6-819.2 GB/s of bandwidth, assuming my calculations are correct. It would probably be expensive as hell, but it would make eDRAM pretty much pointless. That's also assuming no further advancement on RAMBUS's front between now and then.
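The back-of-the-envelope, with the per-chip bandwidth and the 4 Gbit density as stated assumptions:

[code]
#include <stdio.h>

int main(void) {
    /* Speculative XDR2 figures quoted above -- assumptions, not specs. */
    int    bus_width_bits   = 512;
    int    chip_width_bits  = 32;    /* widest XDR2 device assumed (x32) */
    double chip_gbs_lo      = 25.6;  /* GB/s per chip, low end           */
    double chip_gbs_hi      = 51.2;  /* GB/s per chip, high end          */
    double chip_density_gbit = 4.0;  /* assumed 4 Gbit per chip          */

    int    chips          = bus_width_bits / chip_width_bits;     /* 16   */
    double capacity_gbyte = chips * chip_density_gbit / 8.0;      /* 8 GB */

    printf("Chips needed    : %d\n", chips);
    printf("Total capacity  : %.1f GB\n", capacity_gbyte);
    printf("Bandwidth range : %.1f - %.1f GB/s\n",
           chips * chip_gbs_lo, chips * chip_gbs_hi);   /* 409.6 - 819.2 */
    return 0;
}
[/code]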
 
2) IMO dual threaded SPU's would provide the best bang for the buck and shouldn't be that hard to do.
Multithreading only makes gains when your processor is sat around twiddling its thumbs waiting for data, and that sort of code is a poor fit for SPUs and needs a redesign. SPUs are number crunchers that should be set a linear task to churn away at, like pixel shaders. If you wouldn't multithread a pixel shader, you shouldn't multithread an SPU! ;) The current dual-issue design is an excellent compromise (IMO, as someone who doesn't have to write for it!) between complexity, which we want to keep down, and efficiency, which we want to max out.

Is there any way at all that you could design a configuration where the SPU dedicated memory could double as an EDRAM type memory?
Probably not. LS is all about low latency; the choice of 256 KB was a compromise between capacity and latency. If the memory config could be made to appear as LS in its current implementation, but also as a wider-level cache RAM addressable from any SPU, it might have benefits, though I think that would more be a case of compensating for less-than-ideal code than improving the performance of the design. An overall pool of cache or eDRAM may be better, with the streaming of LS data working as normal but with a fast, lower-latency scratch-pad for working on larger datasets, saving system BW.

Say that you had 16 SPUs, that would give 4MB of SPU type ram. If you could address it as one pool that could be a great help for, say, PS2 emulation. Though then again, if XDR2 can do 200GB/s what do you need EDRAM for?
True! Also remember BW isn't the major bottleneck with PS2; the free context switches are more of a killer. It's a very alien architecture, and just adding eDRAM because the GS had it won't address that.

So in that case I would dream of 4GB of XDR2 on a 512bit bus that can be accessed by GPU parts and 32 SPUs alike and in paralel, where the CPU/GPU would be optimally designed to cooperate to the point where it's pretty much an integrated design. It would leave room to a great deal of flexibility.
If the eDRAM were not part of the SPEs' LS but connected to it, you could have the eDRAM sit between GPU and Cell with all parts able to read/write. That'd be an extremely fast and flexible rendering platform that would allow for a simpler GPU, leaving Cell to do more complex shader work. The minimum size would need to be two 1080p framebuffers at whatever AA level. At FP16, I make that 16 MB per frame per AA sample. 32 MB of eDRAM would get you two framebuffers, no AA; 2x AA would need 64 MB. That's not looking likely! You'd have to go with tiles if you want MSAA. XB360 shows us we don't want to be locked into a particular rendering paradigm, so tiling should maybe be a software issue?
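A quick sanity check on those sizes, assuming FP16 RGBA (8 bytes per pixel):

[code]
#include <stdio.h>

int main(void) {
    int    w = 1920, h = 1080;
    int    bytes_per_pixel = 8;           /* FP16 RGBA = 4 channels x 2 bytes */
    double mib = 1024.0 * 1024.0;

    double frame_mb = (double)w * h * bytes_per_pixel / mib;  /* ~15.8 MB */

    printf("One 1080p FP16 buffer : %.1f MB\n", frame_mb);        /* ~16 MB  */
    printf("Two buffers, no AA    : %.1f MB\n", 2 * frame_mb);    /* ~32 MB  */
    printf("Two buffers, 2x AA    : %.1f MB\n", 2 * 2 * frame_mb);/* ~64 MB  */
    return 0;
}
[/code]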
 
XDR2 on a 512-bit bus would be insane console-wise because, as far as I can tell, they are only building up to x32 chips, so you would need 16 of them. Assuming 4 Gbit density per chip, that would be 8 GB of RAM, and since it's 25.6-51.2 GB/s per chip, that works out to 409.6-819.2 GB/s of bandwidth, assuming my calculations are correct. It would probably be expensive as hell, but it would make eDRAM pretty much pointless. That's also assuming no further advancement on RAMBUS's front between now and then.

At those speeds, you'd be able to feed SPUs effortlessly also!

Though I guess if 409.6-819.2 GB/s goes with a 512-bit bus, then PS4 could probably live with a 256-bit bus. My target would be 8x the combined bandwidth of the XDR and GDDR in PS3, which comes to roughly 48 GB/s (accessed in parallel). If you can get upwards of 300 GB/s on a 256-bit bus, that could be enough, though if you can do 512-bit and 8 GB that would still be awesome, of course. Price will probably dictate 256-bit and 4 GB then, though; even production in 2013/2014 may not make more than that affordable, but if you start working now on getting it cheap enough by then, who knows?
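Spelling that target out (the PS3 per-pool figures are the commonly quoted ones, so treat them as assumptions):

[code]
#include <stdio.h>

int main(void) {
    /* PS3 baseline: commonly quoted figures, used here as assumptions. */
    double xdr_gbs       = 25.6;   /* Cell's XDR pool  */
    double gddr3_gbs     = 22.4;   /* RSX's GDDR3 pool */
    double target_factor = 8.0;

    double ps3_combined = xdr_gbs + gddr3_gbs;            /* ~48 GB/s  */
    double ps4_target   = ps3_combined * target_factor;   /* ~384 GB/s */

    printf("PS3 combined bandwidth : %.1f GB/s\n", ps3_combined);
    printf("8x target for PS4      : %.1f GB/s\n", ps4_target);
    return 0;
}
[/code]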

The question that then remains is how viable it would be to add just the GPU blocks you need in this design, rather than a full GPU, and which blocks those would be. Texture units are pretty much a given, but how many shaders, etc.?
 
DLAA

A Star Wars: The Force Unleashed II programmer has this information on their custom anti-aliasing algorithm. He is comparing the unified-shader Xenos GPU with the Cell processor using 5 SPUs.

"Typical execution times of DLAA in our current implementation are:

* "Xbox 360: 2.2 +/- 0.2ms @ 1280x720
* "PlayStation3 (5 SPUs): 1.6 +/- 0.3ms @ 1280x720"

A GPU with 48 unified shaders at 500 MHz and GDDR3 takes almost 40% longer for the same algorithm than 5 SPUs at 3.2 GHz with XDR.

In this application, one SPU has roughly double the per-clock throughput of one unified shader unit.
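Rough arithmetic behind that per-clock claim, using only the times and unit counts quoted above (back-of-the-envelope, nothing more):

[code]
#include <stdio.h>

int main(void) {
    /* Figures quoted above: 48 ALUs @ 0.5 GHz vs. 5 SPUs @ 3.2 GHz. */
    double xenos_ms = 2.2, xenos_units = 48.0, xenos_ghz = 0.5;
    double spu_ms   = 1.6, spu_units   = 5.0,  spu_ghz   = 3.2;

    /* Relative "cost" of the job in unit-cycles: time x units x clock. */
    double xenos_cost = xenos_ms * xenos_units * xenos_ghz;  /* 52.8 */
    double spu_cost   = spu_ms   * spu_units   * spu_ghz;    /* 25.6 */

    printf("Xenos unit-cycles : %.1f\n", xenos_cost);
    printf("SPU unit-cycles   : %.1f\n", spu_cost);
    printf("Per-unit, per-clock advantage for the SPU: %.2fx\n",
           xenos_cost / spu_cost);                            /* ~2.06x */
    return 0;
}
[/code]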
 

don't forget that the PS3 version has slightly higher quality (on long edges).
 
Where did AlStrong's post go?

What's DLAA? [size=-2]Is it edible?[/size] Best if we can get a link to a tech overview. ^_^

Are they really using all 48 units on the GPU to calculate this?
 

Is the Xenos programmable enough to allow developers to use only some of the shader cores?

Also, without knowing much about the algorithm employed, what about this kind of computation would cause it to be slower on a GPU than on the CELL?

It's very surprising to me...
 
Well, they use the GPU for Kinect. I suppose sometimes, it can be hijacked to perform other tasks.

So, how they take that measurement will make a difference, although you can say the same thing for the SPUs. More importantly, some basic understanding of DLAA would help to interpret the numbers.
 
Is the Xenos programmable enough to allow developers to use only some of the shader cores?

Also, without knowing much about the algorithm employed, what about this kind of computation would cause it to be slower on a GPU than on the CELL?

It's very surprising to me...
Well, it is divided into three blocks of 16 ALUs which can switch between vertex and pixel shading. It should be possible to dedicate one shader pool to the problem independently of the others.
 
Either they go with a real tiny upgrade (maybe two better-than-PPU CPUs) ensuring perfect BC with the Cell v1, in which case they still invest heavily in a modern GPU.
Perfect BC is nice, but "good enough" is probably where they are aiming. I think you are right, though, that emulating Cell on anything with a different architecture may be a very hard and costly task.

Those people may come up with a better answer, but as it is, it sucks at pretty much everything; even tied to tex units it would suck, as SPUs can't hide texturing latency. Larrabee may have sucked, but as it is the Cell simply can't handle rendering with anything close to acceptable performance.
Basically you would have to multi-thread the SPUs or include more fixed/specialized hardware that would hide the latencies from the SPUs.
In one of his posts nAo states that adding multithreading to SPUs may cost you an arm and a leg, as each thread should have its own 256 KB of local store (so an SPU supporting four hardware threads would need 1 MB of LS).
SPUs with a "strange" GPU is something I would like to see. Something like what V3 is talking about.
GPUs hide latency through long pipelines; the SPUs hide latency through a different strategy, using double and triple buffering with load-ahead. It may be hard to translate GPU shader code to that strategy, but I think it is achievable.
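For anyone not familiar with that strategy, here is a bare-bones sketch of the double-buffered load-ahead loop in plain C; dma_get()/dma_wait() and process() are made-up stand-ins for illustration, not the real MFC intrinsics from the Cell SDK:

[code]
#include <stddef.h>
#include <stdint.h>

#define CHUNK 16384                        /* 16 KB per buffer in local store */

/* Hypothetical stand-ins for tag-based, asynchronous MFC-style DMA. */
void dma_get(void *ls_dst, uint64_t ea_src, size_t size, int tag);
void dma_wait(int tag);
void process(float *data, size_t n);       /* the actual number crunching     */

/* Assumes total_bytes is a non-zero multiple of CHUNK. */
void stream_job(uint64_t ea, size_t total_bytes) {
    static float buf[2][CHUNK / sizeof(float)];   /* two local-store buffers  */
    int cur = 0;

    dma_get(buf[cur], ea, CHUNK, cur);            /* prime the first buffer   */
    for (size_t off = CHUNK; off < total_bytes; off += CHUNK) {
        int nxt = cur ^ 1;
        dma_get(buf[nxt], ea + off, CHUNK, nxt);  /* fetch next chunk async   */
        dma_wait(cur);                            /* wait for current chunk   */
        process(buf[cur], CHUNK / sizeof(float)); /* crunch while DMA runs    */
        cur = nxt;
    }
    dma_wait(cur);                                /* drain the last buffer    */
    process(buf[cur], CHUNK / sizeof(float));
}
[/code]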

I also remember nAo's ideas about multithreading the SPUs to hide the latency of the local store, which is slower than a normal first-level cache, but I don't think it will ever happen, since that latency can be hidden if the instructions are ordered so that the pipeline is not stalled by load instructions, though it may require manual coding to achieve. nAo didn't like the idea of having to hand-optimise the code to get full performance out of the SPU.
 
I don't think RSX emulation is as simple as you imply here. Remember that the RSX is a video chip which actually drives video output, for instance, and its internal architecture and programming model are nothing like Cell.
Who said it would be simple? :D However, Sony is quite serious about BC and seems to invest quite a lot in emulation technology. I remember the patent describing how sequences of code could be identified when repeated, and the emulator would cache the translated code. If they go for any kind of GPU emulation, I don't expect any kind of one-to-one instruction emulation, but instead some kind of JIT instruction translator identifying sequences of shader instructions and translating them into efficient native SPU instruction sequences.
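Roughly the kind of thing I mean, sketched in C; block_hash(), translate_block() and everything else here are invented placeholders for illustration, not anything from an actual Sony patent or emulator:

[code]
#include <stddef.h>
#include <stdint.h>

/* Very rough sketch of the "cache the translated code" idea:
   hash a guest code block, translate it once, reuse the result. */

typedef void (*host_code_fn)(void *state);

typedef struct {
    uint64_t     key;    /* hash of the guest (e.g. RSX shader) code block */
    host_code_fn code;   /* pointer to the translated native routine       */
} cache_entry;

#define CACHE_SLOTS 4096
static cache_entry cache[CACHE_SLOTS];

/* Hypothetical helpers, declared only to show the flow. */
uint64_t     block_hash(const uint32_t *guest, size_t words);
host_code_fn translate_block(const uint32_t *guest, size_t words);

host_code_fn lookup_or_translate(const uint32_t *guest, size_t words) {
    uint64_t key = block_hash(guest, words);
    cache_entry *e = &cache[key % CACHE_SLOTS];

    if (e->code == NULL || e->key != key) {  /* miss: translate once, reuse later */
        e->key  = key;
        e->code = translate_block(guest, words);
    }
    return e->code;
}
[/code]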

I expect Sony to create a combined Cell/RSX chip at some point for cost reduction, but there'd be no profit in them trying to design their own integrated GPU functionality just to get rid of the RSX licensing costs, or whatever.
The profit in my scenario would lie in reducing the number of state-of-the-art chips Sony would have to develop and maintain (cost-reduce and produce). Getting rid of the RSX licensing cost would just be the icing on the cake.

On a side note, I am a bit surprised that so many are predicting the PS4 will contain separate CPU and GPU chips. I think the signs of integration are pretty strong on many fronts, and now that CUDA is gaining more ground, I think it is written on the wall. But maybe someone could fill me in on what I am missing: what is the big benefit of sticking with the separate CPU/GPU architecture used today?
 