Is there something that CELL can still do better than modern CPU/GPU?

I think that's because the developers could optimize memory access carefully (to avoid communication overheads). The researchers said they could predict parallel performance rather accurately on paper, since there are no external/random factors outside their control.

It is interesting to observe, because when Cell was first revealed publicly, some said it was just a DSP. ^_^

EDIT: The existing software and techniques that work on Cell have to count for something too, from HDTV to gaming client + server to media processing to financial modelling to supercomputing.
 
Notable here is the absence of Microsoft; could be they've already let IBM know they're going in a different direction next gen.

Could just mean there are contracts with MS that do not allow them to speak openly about it at this point in time, or that the final contracts have not been signed yet.

The only thing we know for sure is that there are contracts with Sony and Nintendo, and they are free to talk about them. We already knew from this leak from last summer that Nintendo and Sony were buying services from the IBM Fishkill facilities. But this is the first time next-gen systems are mentioned.

I also find it interesting that he mentions they are developing "multiple processing components" for "intense graphics requirements". It supports my belief that the next-generation consoles will not have separate CPUs and GPUs.

We should also remember that there is a purpose in releasing this kind of information. It is obviously aimed at IBM stockholders, to let them know "they are staying in the business". It is obviously good business from a cash-flow perspective, considering all the royalty fees that keep ticking in every year.
 
Also, Cell's architecture brings near-linear speed-up per core added.

?
I think that depends mostly on the algorithm used?!

But there are sure a lot of architectures around that scale nearly linearly, at least for our codes (used in scientific computing!).
We just got results last week from our speed-up test on a Blue Gene supercomputer, for instance: scaling on over 4,000 processors without a sweat... weak scaling on over 80k processors without a sweat... but this sure depends on the actual algorithm!
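For anyone unfamiliar with the terminology, here is a minimal sketch of how the two efficiencies are usually computed; the runtimes in it are made-up placeholders, not the actual Blue Gene figures.

[code]
/* A minimal sketch of the two scaling metrics, in C.
 * The runtimes below are made-up placeholders, NOT real benchmark numbers. */
#include <stdio.h>

int main(void)
{
    int    p         = 4096;    /* processor count                                  */
    double t1        = 4096.0;  /* runtime of the fixed-size problem on 1 processor */
    double tp_strong = 1.25;    /* runtime of that same problem on p processors     */
    double t1_weak   = 1.00;    /* runtime of the per-processor-sized problem on 1  */
    double tp_weak   = 1.10;    /* runtime when the problem size grows with p       */

    /* Strong scaling: total work is fixed, ideal is Tp = T1 / p */
    printf("strong-scaling efficiency: %.2f\n", t1 / (p * tp_strong));

    /* Weak scaling: work per processor is fixed, ideal is Tp = T1 */
    printf("weak-scaling efficiency:   %.2f\n", t1_weak / tp_weak);
    return 0;
}
[/code]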
 
?
I think that depends mostly on the algorithm used?!

But there are sure a lot of architectures around that scale nearly linearly, at least for our codes (used in scientific computing!).
We just got results last week from our speed-up test on a Blue Gene supercomputer, for instance: scaling on over 4,000 processors without a sweat... weak scaling on over 80k processors without a sweat... but this sure depends on the actual algorithm!

Doubtful. For instance, Xenon CPU cores do NOT scale in a near-linear fashion, mainly due to the shared L2 cache, etc. The same seems to apply to Intel's i7 CPUs as well.

BTW, could you name some of these consumer-available (off-the-shelf) CPU architectures that scale in a nearly linear fashion per additional core?
 
I also find it interesting that he mentions they are developing "multiple processing components" for "intense graphics requirements". It supports my belief that the next-generation consoles will not have separate CPUs and GPUs.

Yeah, it sort of got proved during the PS3's run. Remember when Sony revealed, just before the PS3's launch, that they had planned to use two Cells, one as the CPU and the other as the GPU, but had to let one go in favor of shader models, putting in nVidia's GPU architecture after customizing it in the ilk of Cell? But that customized GPU (RSX) proved to be far less efficient than the CPU (Cell) in there.

I wonder how the PS3 would have performed with two of those beasts in there. The PS3's versatility lies in its dependence on the CPU for most general work as well as much of the graphics, and in getting really great results out of it, just like the PS2. Things like God of War III, Uncharted 2 and GT5, where we see visuals and physics blended together and peaking, mostly done on the CPU alone, do make you wonder why IBM would ever leave out such a spectacular architecture :)... glad they aren't!
 
Doubtful. For instance, Xenon CPU cores do NOT scale in a near-linear fashion, mainly due to the shared L2 cache, etc. The same seems to apply to Intel's i7 CPUs as well.

BTW, could you name some of these consumer-available (off-the-shelf) CPU architectures that scale in a nearly linear fashion per additional core?

I think you can lock the shared cache to prevent it from messing up the run-time behaviour (but L2 and L3 are slower than the Local Store).

Scaling is mostly a problem space and algorithm thing. The algorithm has to be optimized for the hardware. If the code is not hindered by the small Local Store, then the programmer can usually extract more efficiency from the SPUs because there is no interference from the underlying hardware.

However if the problem is embarrassingly parallel, then it may not matter. The more cores you have, the better.

If the algorithm requires lots of random access, then Cell may not be efficient.
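For what it's worth, here is a rough sketch of the kind of explicit Local Store management being described: SPU-side double buffering, where the next chunk is DMAed in while the current one is processed. It assumes the mfc_get/tag-mask intrinsics from IBM's Cell SDK (spu_mfcio.h); process_block() and the sizes are just illustrative placeholders.

[code]
/* Rough sketch of SPU-side double buffering: while one 16 KB block is being
 * processed out of Local Store, the MFC streams the next block in via DMA.
 * Assumes the spu_mfcio.h intrinsics from IBM's Cell SDK; process_block()
 * and the effective address handed to stream() are illustrative placeholders
 * (ea is assumed 128-byte aligned). */
#include <spu_mfcio.h>
#include <stdint.h>

#define BLOCK 16384  /* 16 KB chunks: max single DMA size, well under the 256 KB LS */

static char buf[2][BLOCK] __attribute__((aligned(128)));

/* trivial placeholder for the real work done on each block */
static void process_block(char *data, unsigned size)
{
    unsigned j;
    for (j = 0; j < size; j++)
        data[j] ^= 0xff;
}

void stream(uint64_t ea, unsigned nblocks)
{
    unsigned i, cur = 0;

    /* prefetch the first block, tagged with the buffer index */
    mfc_get(buf[cur], ea, BLOCK, cur, 0, 0);

    for (i = 0; i < nblocks; i++) {
        unsigned next = cur ^ 1;

        /* kick off the DMA for the next block before touching the current one */
        if (i + 1 < nblocks)
            mfc_get(buf[next], ea + (uint64_t)(i + 1) * BLOCK, BLOCK, next, 0, 0);

        /* wait only for the current block's tag, then do the work */
        mfc_write_tag_mask(1u << cur);
        mfc_read_tag_status_all();
        process_block(buf[cur], BLOCK);

        cur = next;
    }
}
[/code]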
 
I think you can lock the shared cache to prevent it from messing up the run-time behaviour (but L2 and L3 are slower than the Local Store).

Scaling is mostly a problem space and algorithm thing. The algorithm has to be optimized for the hardware. If the code is not hindered by the small Local Store, then the programmer can usually extract more efficiency from the SPUs because there is no interference from the underlying hardware.

However if the problem is embarrassingly parallel, then it may not matter. The more cores you have, the better.

If the algorithm requires lots of random access, then Cell may not be efficient.
True, but doesn't that apply to ALL CPUs? I'm talking about the cases where the CPUs are performing at their most efficient. Adding a core still doesn't usually yield near-linear scaling even under those conditions (except on Cell), right?
 
Not necessarily. There are certain types of problems that are ill-suited for Cell or a GPU but run OK on regular CPUs.

I think you meant: when algorithms are "performing at their most efficient" on a CPU. The CPU runs assorted stuff. When it's efficient, it's usually because the software is efficient. If the CPU runs slowly, it's usually because the software sucks, or is not suitable for the architecture.

To scale a solution linearly, the programmer will need to take care of communication overhead (or data locality). Linear speedup is possible provided the serial portion of the code is small or negligible compared to the parallelizable portion.

In rare cases where, say, memory accesses can be amortized better (than expected) over more cores, or where an extra core/thread prunes the problem space ahead of time, the speed-up can be super-linear.
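To put a number on that point about the serial portion: Amdahl's law says speedup(n) = 1 / (s + (1 - s)/n) for serial fraction s. A tiny sketch (the fractions are arbitrary examples, not measurements):

[code]
/* Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n), where s is the serial
 * fraction of the code. The fractions below are arbitrary examples. */
#include <stdio.h>

static double amdahl(double serial_fraction, int cores)
{
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
}

int main(void)
{
    int cores[] = { 1, 2, 4, 8 };
    int i;

    for (i = 0; i < 4; i++) {
        /* 1% serial code caps 8 cores at ~7.5x; 10% already caps them at ~4.7x */
        printf("%d cores: s=0.01 -> %.2fx   s=0.10 -> %.2fx\n",
               cores[i], amdahl(0.01, cores[i]), amdahl(0.10, cores[i]));
    }
    return 0;
}
[/code]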
 
I hear what you're saying, but I've seen test results for Intel i7 CPUs and lower models. The tests showed that inter-core bandwidth, etc. prevents near-linear speed-up for each core added to a task. Those tests indicate that hardware is a big component in being able to reach near-linear speed-up per core added to a task within the CPU. I believe there was another thread touching on this around 2008. I think nAo was speaking on the matter.
 
Cell is:
* Heterogeneous units in one CPU
* Simpler, faster cores for specialized tasks
* Explicit memory management to enable extremely fast memory access
* DMA

I think nAo suggested that the Local Store should be bigger and the PPU should be more powerful. I can't remember anymore, but I think we can have more than one PPU per Cell too.

EDIT: Forgot about low/efficient power consumption, which is another mantra for Cell.

Also, Cell actually works. Intel failed with Larrabee; using cache seems to add a layer of complexity to the design of a many-core architecture. I think nAo wanted cache instead of the LS in Cell, as do many others. I just don't think it's as easy as bolting cache onto a many-core processor. The design team must have had reasons for going with DMA and Local Store.

But going forward, Sony needs to make Cell into a GPU somehow. Perhaps attach some texture units, like Larrabee. There is just no point in putting an 8-billion-transistor Cell version 2 in the PS4 if it is going to be paired with another GPU like in the PS3. But from previous discussions on this board, this is impossible to achieve as long as Cell is using DMA and Local Store. It seems GPUs need cache to function effectively.
 
I hear what you're saying, but I've seen test results for Intel i7 CPUs and lower models. The tests showed that inter-core bandwidth, etc. prevents near-linear speed-up for each core added to a task. Those tests indicate that hardware is a big component in being able to reach near-linear speed-up per core added to a task within the CPU. I believe there was another thread touching on this around 2008. I think nAo was speaking on the matter.


I don't think you should discount what patsu is saying, because it is relevant. The answer is not as cut and dried as you are making it out to be, and he did touch on the hardware side of things when referring to communication overhead and data locality. After all, communication between cores, and even across multiple chips, is largely limited by how they are all connected and can share information with each other.
 
Since we're going down memory lane a bit here, the GScube:

http://www.wired.com/wired/archive/9.05/gs3.html

The GScube - named after the PS2's Graphics Synthesizer chip - is basically a parallel array of 16 beefed-up PlayStations. The 16 GS chips, stocked with several times the standard memory arsenal, are split into four graphics-processing blocks. Four more processors merge the graphics output from the four GPUs and send their output to a final merger chip that cranks out an HDTV-ready 1,920 x 1,080 pixels, 16 times the resolution of today's PS2. A standard PlayStation controller plugged into an SGI or Vaio server running Linux controls the whole setup. But it's the server that runs the game program, farming out image and audio data to the individual processing units inside the cube for rendering. (Future consumer consoles based on the concept would compress both cube and server into a single set-top unit.)
 
... except that your search is only partial. Remove the "Larrabee" keyword and you'll see more diverse opinions from him. We all know now that LRB is not what it was meant to be. So there are still merits in the Cell way.

In particular, you should be able to find nAo's posts regarding how Sony would/should improve Cell for the next iteration. I think some comments were made after the LRB debacle.
 
I don't think you should discount what patsu is saying, because it is relevant. The answer is not as cut and dried as you are making it out to be, and he did touch on the hardware side of things when referring to communication overhead and data locality. After all, communication between cores, and even across multiple chips, is largely limited by how they are all connected and can share information with each other.
I think you're misunderstanding me. I'm not discounting what Patsu is saying (I never have). I'm trying NOT to be discounted myself, because the things I've read seemed to be of some significance. I will try to find the post from the discussion I mentioned.

EDIT: I found this statement from Carl B (1.01.2009) while searching for the post I remembered in this forum: "I would still rate the Cell more highly than the i7, especially from an architectural perspective as you get near linear scaling with additional nodes."

Here is an Ars Technica article mentioning what I'm talking about, also. They reference the bandwidth as the biggest factor for the Cell's "near perfect linear scaling".
http://arstechnica.com/hardware/new...tops-high-performance-computing-benchmark.ars
 
I get what you're saying. Both points can fit in a consistent framework (as in there's no conflict). Someone can probably put it the right way.
 
... except that your search is only partial. Remove the "Larrabee" keyword and you'll see more diverse opinions from him. We all know now that LRB is not what it was meant to be. So there are still merits in the Cell way.

In particular, you should be able to find nAo's posts regarding how Sony would/should improve Cell for the next iteration. I think some comments were made after the LRB debacle.
Actually, you can find posts in this regard in the searches I did. But on the other hand, I read quite some other posts (myself using only his name as the search criterion).
nAo raises quite some interesting points about BC once you start modifying the Cell too much: latencies, the cost of implementing multi-threading. But I think the main issue he raises is that it doesn't make that much sense to have a larger Cell alongside a modern GPU.
In the posts I read, nAo seems to think that investing more silicon in the Cell would make sense if you plan to use it to handle graphical tasks, but at the same time it would mean a departure from what the Cell is right now.
It's not to dismiss the Cell's greatness: the SPUs are <7mm² @ 45nm, they consume really little power, and they have great throughput.
I remember a post where nAo describes how a many-core system could look (sadly I could not find this one, maybe I searched in the wrong forum :( ); basically, from memory, he was describing something pretty close to Larrabee (actually he was asking for a flat/coherent memory space only for the instruction cache) ~ basically many multi-threaded vector processors.

For Larrabee, yep, things didn't go as smoothly as Intel wanted. But we still have no clue about how it would have performed in a closed system like a console, free from API and compatibility issues.

Don't get me wrong, I'm excited to know more about the direction IBM is taking, but it's just that they are fighting an uphill battle.
GPUs have plenty of things going for them; there are now languages like CUDA that allow you to code for both the GPU and the CPU, and it will only get better.
A Cell2 to me would look pretty close to a Larrabee. I would like to see IBM pull this off, but on the other hand, as nAo said, it may no longer have anything to do with the Cell V1.

EDIT

Found the post, it was in the general console section :)
http://forum.beyond3d.com/showpost.php?p=1363589&postcount=39
 
Actually, you can find posts in this regard in the searches I did. But on the other hand, I read quite some other posts (myself using only his name as the search criterion).
nAo raises quite some interesting points about BC once you start modifying the Cell too much: latencies, the cost of implementing multi-threading. But I think the main issue he raises is that it doesn't make that much sense to have a larger Cell alongside a modern GPU.
In the posts I read, nAo seems to think that investing more silicon in the Cell would make sense if you plan to use it to handle graphical tasks, but at the same time it would mean a departure from what the Cell is right now.

Cell has been handling graphical tasks since soon after birth. ^_^

It's not to dismiss the Cell's greatness: the SPUs are <7mm² @ 45nm, they consume really little power, and they have great throughput.
I remember a post where nAo describes how a many-core system could look (sadly I could not find this one, maybe I searched in the wrong forum :( ); basically, from memory, he was describing something pretty close to Larrabee (actually he was asking for a flat/coherent memory space only for the instruction cache) ~ basically many multi-threaded vector processors.

For Larrabee, yep, things didn't go as smoothly as Intel wanted. But we still have no clue about how it would have performed in a closed system like a console, free from API and compatibility issues.

Don't get me wrong, I'm excited to know more about the direction IBM is taking, but it's just that they are fighting an uphill battle.
GPUs have plenty of things going for them; there are now languages like CUDA that allow you to code for both the GPU and the CPU, and it will only get better.
A Cell2 to me would look pretty close to a Larrabee. I would like to see IBM pull this off, but on the other hand, as nAo said, it may no longer have anything to do with the Cell V1.

I don't know if Cell will become LRB. They sound different. They can certainly take design points from both. How hot was LRB?

Found the post, it was in the general console section :)
http://forum.beyond3d.com/showpost.php?p=1363589&postcount=39

Yap, that's the post I remember.

I think nAo suggested that the Local Store should be bigger and the PPU should be more powerful. I can't remember anymore, but I think we can have more than one PPU per Cell too.
 
Doubtful. For instance, Xenon CPU cores do NOT scale in a near-linear fashion, mainly due to the shared L2 cache, etc. The same seems to apply to Intel's i7 CPUs as well.

Sorry, but I don't understand your argument.
Just to make sure we aren't talking about different things:

- I have a piece of software to deal with
- Now I make it parallel, using for instance MPI
- I take several different architectures to test how the software scales

My opinion is that the scaling results you get will depend most strongly on the algorithms used in the software and on your skill in making it parallel, and only secondarily on the architecture you are using, as long, of course, as the same parallelization strategy can be applied to all the considered architectures.
So in short: bad algorithm, bad parallelization, and the software does not scale even on the best machines available...

BTW, could you name some of these consumer-available (off-the-shelf) CPU architectures that scale in a nearly linear fashion per additional core?

This is difficult to answer, because I am not sure how many processors we are talking about here. For instance, it is no problem to scale nearly linearly on a standard Xeon or Opteron cluster up to 100+ processors. If we are talking only about one CPU with an increased number of cores, there is not much available, and the maximum I have tested is a quad core, with scaling on those four cores depending heavily on the algorithm used.

But then, I have never tested shared-memory parallelization, and I am not sure if that is what you are talking about??
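For reference, the kind of scaling test described above boils down to something like the following: run the same fixed-size job at different process counts and compare wall times. The compute_kernel() here is just a dummy stand-in for a real solver.

[code]
/* Bare-bones scaling test: run the same job with different process counts
 * (e.g. mpirun -np N ./scaling_test) and compare the reported wall times.
 * compute_kernel() is a dummy stand-in for the actual solver. */
#include <mpi.h>
#include <stdio.h>

/* placeholder workload; a real code would do its solver step here */
static double compute_kernel(int rank, int nprocs)
{
    double s = 0.0;
    long i;
    for (i = 0; i < 10000000L; i++)
        s += (double)((i + rank) % nprocs);
    return s;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Barrier(MPI_COMM_WORLD);     /* start everyone together    */
    t0 = MPI_Wtime();

    compute_kernel(rank, nprocs);    /* the part being scaled      */

    MPI_Barrier(MPI_COMM_WORLD);     /* wait for the slowest rank  */
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d processes: %.3f s\n", nprocs, t1 - t0);

    MPI_Finalize();
    return 0;
}
[/code]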
 