Is there something that Cell can still do better than modern CPUs/GPUs?

Cell has been handling graphical tasks since soon after birth. ^_^
I speak about a Cell that would handle most if not all rendering tasks; that's quite different.
I don't know if Cell will become LRB. They sound different. They can certainly take design points from both. How hot was LRB?

Yep, that's the post I remember.
They are indeed almost completely different.
Larrabee is many CPUs with wide SIMD units, so a homogeneous, coherent memory space, heavily multi-threaded to face whatever latency is thrown at it; each core can handle itself.

The Cell is one CPU with some "not full blown" vector processors, no flat memory space, and no real means to hide latencies. As vector processors the SPUs are not really autonomous, they are not wide by any means, and their ISA is less complete than that of Larrabee's SIMD units.

Larrabee seems to be at an advantage, but there is a cost, most likely in die size and power consumption. But we have no figures to discuss as Intel put everybody under NDA.

Overall nAo's take sounds neat (while closer to Larrabee imho).
From the Cell I think I would keep the focus on performance per watt. I would go further by aiming at lower clock speeds. I'm not in a position to lecture the great guys that worked on Larrabee, but they should not have focused on competing with high-end GPUs. They should have aimed at the low end (IGP) and seen Larrabee cores as accelerators for their standard GPUs.

From Cell I would also keep the heterogeneous nature and play the different parts to their strengths. If it were to handle graphics I would tie it to some "cut" GPU, as Fafalada put it a "Deferred-Rendering accelerator (super fast at filling attribute buffers and not much else)".

The geek inside me really hopes to see a new architecture from IBM, Cell or not, and different from the fusion chips that are likely to spawn in the upcoming years. But for me, investing that many resources in a "non GPU" part only makes sense (in a console at least) if it handles a significant part of the workload the system will face, which is graphics.
 
Yes, and they are Turing complete++, can do arbitrary flow control, scalar integer, SIMD integer, SIMD FP computations, use any integer computation result as an address for loads or branches, and can reach every bit of memory and IO in the system through their DMA channels. Basically, SPEs are the exact opposite of what aaronspink portrays them as.
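To make that concrete, here's a rough SPU-side sketch (Cell SDK style, not from any shipping code; the buffer size and the idea of passing effective addresses through argp/envp are just for illustration) of what "reach any memory through DMA plus arbitrary flow control" looks like in practice:

```c
/* Rough SPU-side sketch: pull data in over the DMA channel, do SIMD work on it,
 * branch on a scalar result, and push the results back to main memory. */
#include <spu_intrinsics.h>
#include <spu_mfcio.h>

#define TAG 1

static vector float buf[256] __attribute__((aligned(128)));  /* 4 KB in local store */

int main(unsigned long long spuid, unsigned long long ea_in, unsigned long long ea_out)
{
    /* DMA 4 KB from the effective address the PPU handed us (via argp) into local store. */
    mfc_get(buf, ea_in, sizeof(buf), TAG, 0, 0);
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();                 /* wait for the transfer to complete */

    /* SIMD FP work: scale every element, count how many first lanes were negative. */
    unsigned int negatives = 0;
    vector float scale = spu_splats(2.0f);
    for (int i = 0; i < 256; ++i) {
        if (spu_extract(buf[i], 0) < 0.0f)     /* scalar test feeding an ordinary branch */
            ++negatives;
        buf[i] = spu_mul(buf[i], scale);
    }

    /* Write the results back out to main memory (envp used as the output address here). */
    mfc_put(buf, ea_out, sizeof(buf), TAG, 0, 0);
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();

    return negatives;                          /* exit code is visible to the PPU in the stop info */
}
```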
 
Sorry, but I don't understand your argument.
Just to make sure that we don't talk about different things:

- I have some software to deal with
- Now I make it parallel, using for instance MPI
- I take several different architectures to test how the software scales

My opinion is that the scaling results you get depend most strongly on the algorithms used in the software and on your skill when making it parallel, and only secondarily on the architecture you are using, as long as, of course, the same parallelization strategy can be applied to all considered architectures.
So in short: bad algorithm or bad parallelization, and the software does not scale even on the best machines available...
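Just to make concrete what I mean by "testing how the software scales", something along these lines is enough (a trivial MPI strong-scaling harness; compute_my_chunk() is only a stand-in for whatever the real code does):

```c
/* Minimal MPI strong-scaling test: run the same total problem on 1, 2, 4, ...
 * ranks and compare wall-clock times. compute_my_chunk() is a placeholder
 * for the real per-rank work. */
#include <mpi.h>
#include <stdio.h>

static double compute_my_chunk(int rank, int nranks, long total_work)
{
    /* Placeholder: each rank grinds through its share of the work. */
    long share = total_work / nranks;
    double acc = 0.0;
    for (long i = 0; i < share; ++i)
        acc += (double)((i + rank) % 7);
    return acc;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const long total_work = 100000000L;        /* fixed total problem size */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    double local = compute_my_chunk(rank, nranks, total_work);
    double global;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d ranks: %.3f s (checksum %.1f)\n", nranks, t1 - t0, global);

    MPI_Finalize();
    return 0;
}
```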
Please read my (previously edited) post with the enclosed link. I believe the clarification you need is there.


This is difficult to answer, because I am not sure how many processors we are talking about here. For instance: it is no problem to scale nearly linearly on a standard Xeon or Opteron cluster up to 100+ processors. If we are talking only about one CPU with an increased number of cores, there is not much available, and the maximum I have tested is a quad core, with scaling on those four cores depending heavily on the algorithm used.

But then, I have never tested shared-memory parallelization and I am not sure if you are talking about this?
I believe that one is covered in the Ars Technica article in my edited post. It seems to be something not often run into, due to too many bottlenecks in CPU design.
 
I speak about a Cell that would handle most if not all rendering tasks; that's quite different.

That is indeed an interesting point. How good are the SPUs at emulating the shader instructions of the RSX? What kind of graphical tasks benefit most from dedicated hardware?

We know that the SPUs are damn good at vertex pre-processing, and at different types of high-quality post-processing: MLAA, etc. Sony seems to encourage delegating graphics tasks to the Cell in their PhyreEngine, and I wonder if that is engineering work (investment) that will carry over to the PS4.
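As a rough illustration of why vertex work maps so well onto them: a transform is just a handful of fused multiply-adds per vertex over data you stream through the local store. A toy sketch with SPU intrinsics (column-major matrix assumed, nothing PhyreEngine-specific):

```c
/* Toy SPU vertex transform: out = M * v, with M stored as four column
 * vectors in local store. This is exactly the kind of regular, branch-free
 * SIMD work the SPUs chew through. */
#include <spu_intrinsics.h>

void transform_vertices(const vector float col[4],       /* matrix columns */
                        const vector float *in,          /* packed (x,y,z,w) vertices */
                        vector float *out,
                        int count)
{
    for (int i = 0; i < count; ++i) {
        vector float v = in[i];
        vector float r;
        r = spu_mul (col[0], spu_splats(spu_extract(v, 0)));     /*   x * column 0 */
        r = spu_madd(col[1], spu_splats(spu_extract(v, 1)), r);  /* + y * column 1 */
        r = spu_madd(col[2], spu_splats(spu_extract(v, 2)), r);  /* + z * column 2 */
        r = spu_madd(col[3], spu_splats(spu_extract(v, 3)), r);  /* + w * column 3 */
        out[i] = r;
    }
}
```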

We also know that the PS3 was a damn expensive development project which is expected to be amortized over a long life cycle. The PS3 is a very powerful architecture in terms of raw performance numbers, but it requires fine tuning and optimisation of the software to transform the performance numbers to visible results. Five years into the life cycle we are starting to see some pretty good results from the first party developers.

The PS2 is still selling pretty well 10 years after its release; sales have declined in Europe and NA, but it has recently been released in some markets in South America, so we can expect Sony to keep manufacturing PS2s for another 3-5 years.
Given the visuals the PS3 is outputting, I expect that it will serve the same purpose as the PS2 over a similar life span, i.e. I expect it to be manufactured until at least 2020.

If I were Sony I would try to build a PS4 that shared as many components as possible with the PS3. That would be a way to get high volumes for the components right from the start of the PS4 life cycle, keep them high long after the end of the PS3 life cycle, and also reduce the number of components that have to go through costly die shrinks.

Next fall I expect a PS3 using a 30-34 nm Cell produced on a bulk CMOS process. I don't have any hard evidence of this, but Toshiba has announced their intention to move Cell to a bulk CMOS process, and they have been working in an alliance with IBM on a 32 nm CMOS process for several years, so it seems pretty likely.

Given the normal cost-reduction strategies of consoles, the next step in 2013 would be to merge the CPU and GPU (on, let's say, a 26-28 nm process), just like in the PS2 at 90 nm and recently in the 360 at 45 nm. What if Sony instead made a Cell V2 on a 26-28 nm process with 32 SPEs and 2 PPEs (just an example of a possible V2 configuration) that is capable of emulating Cell V1 and the RSX? Put one of these Cell V2s in a new cost-reduced PS3 and put two of them in a PS4 paired with 2 GB of higher-specced XDR DRAM.

That would be a pretty powerful machine, and the beauty of it would be the synergy with the PS3 both in terms of manufacturing (better yields, shared die-shrink costs) and in terms of software development (tons of optimized routines available from the start). The actual configuration of Cell V2 will probably look quite different and include some dedicated graphics components, as indicated in the recent IBM interview.

It would be interesting if some more knowledgeable person could pinpoint what graphics tasks the current Cell sucks at. What things could be fixed with some minor additions to the SPEs' instruction set, and what needs dedicated processing units?
 
Yes, and they are Turing complete++, can do arbitrary flow control, scalar integer, SIMD integer, SIMD FP computations, use any integer computation result as an address for loads or branches, and can reach every bit of memory and IO in the system through their DMA channels. Basically, SPEs are the exact opposite of what aaronspink portrays them as.
So I've a question (a real one, not being ironic or anything): why does the PPU seem to be involved in anything complex? (The OS still runs on the PPU even if it only takes some percent of its time, and the SPURS queue scheme relies on the PPU too if my memory serves right.)

May be "autonomous" was the wrong word or not taken in its literal/technical meaning, what I mean is that on a Larrabee I believe that (while useless) you could run multitple instances of a single or multiple games or running multiple Os. Could the SPUs achieve that alone? Could the SPUs boot the system without the PPU? etc. I meant autonomous in that way (whether I'm right or wrong as I said I'm willing to learn more on the matter).
 
I don't know if Cell will become LRB. They sound different.

Well, LRB was meant to be an x86 GPU. Sony needed an SPU-based GPU. I think Sony wanted to continue their EE+GS style of workload. I mean, they did research with the GS Cube after all, and the PSP sort of went that way too. I am pretty sure Sony wanted a GS successor to be paired with Cell, and it wasn't the NV-based RSX.

In the original patent Sony did envision an SPU-based GPU; they called it the Visualiser if I remember right. They must have done some work on it. Maybe it wasn't ready for the PS3, but for all we know they continued work on it and it may show up in the PS4, with multiple Cells and Visualisers together. It may or may not be as powerful as AMD's offering, but maybe, just like EE+GS, it might be sufficient.
 
So I've a question (a real one, not being ironic or anything): why does the PPU seem to be involved in anything complex? (The OS still runs on the PPU even if it only takes some percent of its time, and the SPURS queue scheme relies on the PPU too if my memory serves right.)

May be "autonomous" was the wrong word or not taken in its literal/technical meaning, what I mean is that on a Larrabee I believe that (while useless) you could run multitple instances of a single or multiple games or running multiple Os. Could the SPUs achieve that alone? Could the SPUs boot the system without the PPU? etc. I meant autonomous in that way (whether I'm right or wrong as I said I'm willing to learn more on the matter).

Well, the obvious choice would be to run the OS on the PPU if you have one; however, the SpursEngine doesn't have a PPU, so whatever OS it may run is on the SPUs (it's sometimes configured as a coprocessor though). What you need to boot up an OS is a boot loader; that's true for pretty much any processor, including Larrabee.
 
That is indeed an interesting point. How good are the SPUs at emulating the shader instructions of the RSX? What kind of graphical tasks benefit most from dedicated hardware?
They have to be terrible as they can't hide texturing latency.
We know that the SPUs are damn good at vertex pre-processing, and at different types of high-quality post-processing: MLAA, etc. Sony seems to encourage delegating graphics tasks to the Cell in their PhyreEngine, and I wonder if that is engineering work (investment) that will carry over to the PS4.
It's better for them to do so after having invested quite some bucks in R&D on the part. I think that their engineering work will carry over, and not only for Sony tech: the Cell has pushed developers to use fine-grained task parallelism, job queues, etc. This experience will carry over no matter the hardware this programming model is applied to.
We also know that the PS3 was a damn expensive development project which is expected to be amortized over a long life cycle. The PS3 is a very powerful architecture in terms of raw performance numbers, but it requires fine tuning and optimisation of the software to transform the performance numbers to visible results. Five years into the life cycle we are starting to see some pretty good results from the first party developers.
Indeed, I'm not sure that was a plus for Sony over the last five years. They lost plenty of money that they are likely never to make up, at least with the PS3.
The PS2 is still selling pretty well 10 years after its release; sales have declined in Europe and NA, but it has recently been released in some markets in South America, so we can expect Sony to keep manufacturing PS2s for another 3-5 years.
Given the visuals the PS3 is outputting, I expect that it will serve the same purpose as the PS2 over a similar life span, i.e. I expect it to be manufactured until at least 2020.
To some extent that's true for any of the three systems, as long as publishers think it's worth it to support the platform. The PS2 achieved something close to a monopoly last gen, and there are plenty of valuable games on the system. There will be plenty of valuable games on the PS360 and the Wii when console manufacturers are in the situation of feeding emerging markets.
If I were Sony I would try to build a PS4 that shared as many components as possible with the PS3. That would be a way to get high volumes for the components right from the start of the PS4 life cycle, keep them high long after the end of the PS3 life cycle, and also reduce the number of components that have to go through costly die shrinks.

Next fall I expect a PS3 using a 30-34 nm Cell produced on a bulk CMOS process. I don't have any hard evidence of this, but Toshiba has announced their intention to move Cell to a bulk CMOS process, and they have been working in an alliance with IBM on a 32 nm CMOS process for several years, so it seems pretty likely.

Given the normal cost-reduction strategies of consoles, the next step in 2013 would be to merge the CPU and GPU (on, let's say, a 26-28 nm process), just like in the PS2 at 90 nm and recently in the 360 at 45 nm. What if Sony instead made a Cell V2 on a 26-28 nm process with 32 SPEs and 2 PPEs (just an example of a possible V2 configuration) that is capable of emulating Cell V1 and the RSX? Put one of these Cell V2s in a new cost-reduced PS3 and put two of them in a PS4 paired with 2 GB of higher-specced XDR DRAM.

That would be a pretty powerful machine, and the beauty of it would be the synergy with the PS3 both in terms of manufacturing (better yields, shared die-shrink costs) and in terms of software development (tons of optimized routines available from the start). The actual configuration of Cell V2 will probably look quite different and include some dedicated graphics components, as indicated in the recent IBM interview.
From my limited understanding I see two roads (that would make sense for Sony, not so much for IBM).
Either they go with a really tiny upgrade (maybe two better-than-PPU CPUs), ensuring perfect BC with the Cell v1, in which case they still invest heavily in a modern GPU.

Or they really upgrade it, in which case BC could be problematic and it will have to handle a really substantial part of the rendering to make the investment worth it.

Putting myself in Sony's seat I find the first solution safer: you have BC, a standard GPU, and a barely improved Cell would still pack a healthy amount of performance.

It would be interesting if some more knowledgeable person could pinpoint what graphics tasks the current Cell sucks at. What things could be fixed with some minor additions to the SPEs' instruction set, and what needs dedicated processing units?
Those people may come up with a better answer, but as it is it sucks at pretty much everything; even tied to texture units it would suck, as the SPUs can't hide texturing latency. Larrabee may have sucked, but as it is the Cell simply can't handle rendering with anything close to acceptable performance.
Basically you have to multi-thread the SPUs or include more fixed/specialized hardware that would hide the latencies from the SPUs.
In one of his posts nAo states that adding multi-threading to the SPUs may cost you an arm and a leg, as each thread should have its own 256KB of local store (so an SPU supporting four hardware threads would need 1MB of LS).
SPUs with a "strange" GPU is something I would like to see. Something like V3 is talking about.
 
Just curious. Marko "nAo" as in former Ninja Theory dev and now working for Intel?
 
Yes, that nAo, but I don't know where he is now.

EDIT:
So I've a question (a real one, not being ironic or anything): why does the PPU seem to be involved in anything complex? (The OS still runs on the PPU even if it only takes some percent of its time, and the SPURS queue scheme relies on the PPU too if my memory serves right.)

May be "autonomous" was the wrong word or not taken in its literal/technical meaning, what I mean is that on a Larrabee I believe that (while useless) you could run multitple instances of a single or multiple games or running multiple Os. Could the SPUs achieve that alone? Could the SPUs boot the system without the PPU? etc. I meant autonomous in that way (whether I'm right or wrong as I said I'm willing to learn more on the matter).

The PPU sets up the memory and loads the SPUs with binaries. It behaves like a host CPU. The SPUs themselves support overlays: if the binary is larger than the Local Store, the SPU can fetch the linked modules on demand. There is also the SPU isolation mode, where the PPU only has access to a subset of the Local Store (only 192 bytes if my memory still serves me). Under this mode, the SPU is pretty much left alone. The SPUs also support events and interrupts. Communication with external devices (e.g., the GPU) can be done via memory mapping.

Kutaragi talks about multiple OSes on Cell at different levels (Level 0 HV, Level 1 Kernel, etc.). In practice, the PPU orchestrates the SPUs and sets them up to connect to the external world. The PPU is also helpful when running code that accesses memory in a seemingly random fashion.
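For reference, the "host CPU" role described above looks roughly like this on the PPU side with libspe2 (hello_spu is a hypothetical embedded SPU program handle, not a real sample):

```c
/* Rough PPU-side sketch of the "host CPU" role: create an SPE context, load an
 * SPU binary into it, and run it. hello_spu is assumed to be an SPU program
 * embedded into the PPU executable (e.g. via ppu-embedspu). */
#include <libspe2.h>
#include <stdio.h>

extern spe_program_handle_t hello_spu;   /* hypothetical embedded SPU program */

int main(void)
{
    spe_context_ptr_t spe = spe_context_create(0, NULL);
    if (!spe) { perror("spe_context_create"); return 1; }

    if (spe_program_load(spe, &hello_spu)) {
        perror("spe_program_load");
        return 1;
    }

    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_stop_info_t stop_info;

    /* Blocks until the SPU program stops; argp/envp would normally carry the
     * effective addresses of the data the SPU should DMA in. */
    if (spe_context_run(spe, &entry, 0, NULL, NULL, &stop_info) < 0) {
        perror("spe_context_run");
        return 1;
    }

    spe_context_destroy(spe);
    return 0;
}
```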
 
Just curious. Marko "nAo" as in former Ninja Theory dev and now working for Intel?
That's how it looks from the last presentations I saw.
EDIT
I removed a part that might have allowed identifying him (even though he is not trying to hide his identity or break some kind of NDA); by his CV he has clearly been at Intel since 2008 (pretty neat career by the way).
EDIT 2
NDAs suck... he has to know a lot about Larrabee and how it performs(ed) :LOL:

EDIT3
Thanks Patsu for your answer ;)
 
Next fall I expect a PS3 using a 30-34 nm Cell produced on a bulk CMOS process. I don't have any hard evidence of this, but Toshiba has announced their intention to move Cell to a bulk CMOS process, and they have been working in an alliance with IBM on a 32 nm CMOS process for several years, so it seems pretty likely.

Given the normal cost-reduction strategies of consoles, the next step in 2013 would be to merge the CPU and GPU (on, let's say, a 26-28 nm process), just like in the PS2 at 90 nm and recently in the 360 at 45 nm. What if Sony instead made a Cell V2 on a 26-28 nm process with 32 SPEs and 2 PPEs (just an example of a possible V2 configuration) that is capable of emulating Cell V1 and the RSX? Put one of these Cell V2s in a new cost-reduced PS3 and put two of them in a PS4 paired with 2 GB of higher-specced XDR DRAM.

I don't think RSX emulation is as simple as you imply here. Remember that the RSX is a video chip which actually drives video output, for instance, and its internal architecture and programming model are nothing like Cell's.

I expect Sony to create a combined Cell/RSX chip at some point for cost reduction, but there'd be no profit in them trying to design their own integrated GPU functionality just to get rid of the RSX licensing costs, or whatever.
 
In the original patent Sony did envision an SPU-based GPU; they called it the Visualiser if I remember right. They must have done some work on it. Maybe it wasn't ready for the PS3, but for all we know they continued work on it and it may show up in the PS4, with multiple Cells and Visualisers together. It may or may not be as powerful as AMD's offering, but maybe, just like EE+GS, it might be sufficient.
Never did Sony even imply that the SPUs would form a "GPU".
There were some patent applications AFAIR where you'd see a "normal" Cell and a second one which had "graphics units" replacing the SPUs (at least some of them). Think more in the direction of AMD's Fusion: instead of running everything on the same cores you have the PPU for compatibility and easy development, SPUs for heavy computational and vector loads, and a third kind of unit for "graphics" - likely something optimized for pixel shaders and fetching textures.
 
From my limited understanding I see two roads (that would make sense for Sony, not so much for IBM).
Either they go with a really tiny upgrade (maybe two better-than-PPU CPUs), ensuring perfect BC with the Cell v1, in which case they still invest heavily in a modern GPU.

Or they really upgrade it, in which case BC could be problematic and it will have to handle a really substantial part of the rendering to make the investment worth it.

Putting myself in Sony's seat I find the first solution safer: you have BC, a standard GPU, and a barely improved Cell would still pack a healthy amount of performance.

A 2 PPE/16 SPE Cell variant, a GF106, and 2 GB of XDR2 would be a huge improvement, and I honestly don't think Sony needs to go any further than that (though 2 GB of RAM could be really tight for devs). Development is getting way too expensive, and diminishing graphics returns are just not worth the price (but let's at least jump up to 1080p plus some form of AA as standard!). The focus really needs to be on the orchestration of games now. While something like a visible true volumetric fluid simulation is obviously going to be hard on the CPU, it's going to be hard for the GPU to render too, so it will still take quite a bit of visual rendering power, but surfaces are obviously running into a visual wall that monitor resolutions can't keep up with.

Devs would be happy with familiar hardware and fewer tight restrictions on using memory (hence unified). Sony would be incredibly stupid to go over $300 again, so that's a no-no. They need to wait and be patient for process nodes to come down in cost and become feasible. This also ties into the idea that Sony should focus on creating a smaller system, with less of a footprint, less noise, and less power usage. I don't think consoles need to send much of a visual message beyond being nice, sleek, and functional, and workable with varying room styles or other entertainment components. The PS3 designs (lol PS3 grill) don't seem to work all that well with that philosophy in my eyes. Oh, and Sony has no need to expand beyond the control inputs currently available: DualShock 3, PS Move, camera. Making current products "upwards compatible" is a necessity.
 
So many people would be awful tech designers. You don't go with the bare minimum just because you can't foresee people needing more, or just because you don't think there's any need for it. That kind of narrow thinking is what gave us the famous statements of the past we all laugh at today. Exotic tech is half the fun of consoles.
 
Cell features a fundamental design decision that allows it much better scalability - no memory-view coherency. While many people would argue that it's better to have both options - i.e. both coherency and locality, and configure your cores accordingly - in practice the price of coherency will always factor in. If history has taught us anything here, it's that architectures that don't try to make coherency seamless always prevail in long-term scalability (see UMA vs NUMA, or the entire history of transputers, for that matter). What Cell could have had, though, is some 'pseudo coherency' facilities - say, a tag array per LS, and a (macro) op that can update anything dirty from (a subset of) LS A into an identically mapped (subset of) LS B.
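Something like this, conceptually? A plain-C model of the idea (not actual Cell code, just the mechanism: one dirty bit per 128-byte line, and a "push" macro-op that copies only the dirty lines into the identically mapped region of another LS):

```c
/* Plain-C model of the "pseudo coherency" idea: each local store keeps a
 * dirty bit per 128-byte line, and a macro-op pushes only the dirty lines
 * into the identically mapped region of another local store. */
#include <stdint.h>
#include <string.h>

#define LS_SIZE   (256 * 1024)
#define LINE      128
#define NUM_LINES (LS_SIZE / LINE)

typedef struct {
    uint8_t mem[LS_SIZE];
    uint8_t dirty[NUM_LINES];       /* the hypothetical per-LS tag array */
} local_store;

/* A store to [addr, addr+len) marks the lines it touches as dirty. */
static void ls_write(local_store *ls, uint32_t addr, const void *src, uint32_t len)
{
    memcpy(&ls->mem[addr], src, len);
    for (uint32_t line = addr / LINE; line <= (addr + len - 1) / LINE; ++line)
        ls->dirty[line] = 1;
}

/* The proposed macro-op: update everything dirty in a window of LS A into the
 * identically mapped window of LS B, then clear the dirty bits. */
static void ls_push_dirty(local_store *a, local_store *b, uint32_t base, uint32_t len)
{
    for (uint32_t line = base / LINE; line <= (base + len - 1) / LINE; ++line) {
        if (a->dirty[line]) {
            memcpy(&b->mem[line * LINE], &a->mem[line * LINE], LINE);
            a->dirty[line] = 0;
        }
    }
}
```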
 
For high-performance or real-time use cases, I am inclined to agree. But what happens if the average CPU can already handle most of the workload? Is it still worthwhile to give up the convenience of overall cache coherence for specialized tasks? ^_^

Yes, I also think that the Local Store should/will be improved if IBM does another design iteration.
 
So many people would be awful tech designers. You don't go with the bare minimum just because you can't foresee people needing more, or just because you don't think there's any need for it. That kind of narrow thinking is what gave us the famous statements of the past we all laugh at today. Exotic tech is half the fun of consoles.

Agreed.

I hate how many people want the PSP2 to be barely an upgrade over the PSP1, for example.
Hell, some people want a downgrade.
 
Never did Sony even imply that the SPUs would form a "GPU".
There were some patent applications AFAIR where you'd see a "normal" Cell and a second one which had "graphics units" replacing the SPUs (at least some of them). Think more in the direction of AMD's Fusion: instead of running everything on the same cores you have the PPU for compatibility and easy development, SPUs for heavy computational and vector loads, and a third kind of unit for "graphics" - likely something optimized for pixel shaders and fetching textures.

Well, I am sure they had EE+GS in mind. Sony was already heading in that direction before AMD ever announced Fusion. Remember the EE is a MIPS core plus VUs: SPUs replace the EE's VU units for all the computation, and something like the GS does the texturing and rasterization. If I remember right they did call that unit something else, I forget what. In that patent "Broadband Engine" refers to 4 Cells; the PS3's Broadband Engine is just one Cell. Looking back, that PS3 design would only have been possible at 45nm.
 