Is there something that CELL can still do better than modern CPU/GPU?

Crossbar said:
Couldn't Larrabee pull off this trick?
Larrabee is a CPU: a massively multicore Pentium with refinements to better handle graphics. But you can write and execute x86, and do any task. A GPU, by definition, is designed to process graphics workloads. Any processor designed to process any workload is not a graphics processing unit (unless you are identifying it by the function it serves in a system, e.g. a Z80 CPU could be a GPU if you have it controlling an LCD as part of a simple computer system).

I am not so sure. I mean, what I really wanted to ask: is there really anything that CELL can do better than a GPU with respect to non-graphics stuff?
Yes. It has more immediately available local store to work with, allowing more versatility. I can certainly see Cell being a better fit for audio synthesis and processing than a GPU. Cell is much faster at linear processing than a GPU, which is designed to process lots of data simultaneously. There's a reason supercomputers are using Cell. Going forwards, if GPUs extend their programmability then they should overtake it, but then they'd stop being GPUs and become CPUs - generic processors capable of doing any job you ask of them.
 
Sebbbi pointed out other issues, such as synchronization: it takes some time for the CPU to retrieve the results of calculations done on the GPU. Fusion may be a huge win in this regard, or, the other way around, you may run a lot more stuff on the GPU.
The latency problem is basically caused by the fact that you are running graphics rendering and GPGPU processing on the same GPU. In the best case GPU and CPU are running completely asynchronously. To keep the GPU perfectly fed 100% of the time, there's always a lot of graphics stuff waiting to be executed in the GPU command buffers. There is no way to insert all the low latency GPGPU stuff ahead of the graphics stuff in the command buffer with the current driver model. The driver model could be changed to support low latency privileged GPU access, but it's a really big can of worms. The whole GPU state would first have to be stored, then the low latency stuff executed, and then the GPU state restored so that graphics rendering could continue without problems. Basically the system would require a context switch (like modern multitasking OSes do all the time with CPUs). And a full state load/save for a GPU is a lot of work.

A future gaming console could have for example one GPU dedicated for graphics and one dedicated for general purpose calculation. This way the GPGPU latency would be really low compared to the current situation.
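As a toy illustration of the queuing effect described above, here is a self-contained C++ simulation of a FIFO command buffer (purely a sketch with made-up numbers, not a real driver or GPU API): a compute job enqueued behind a couple of frames' worth of buffered draws is ready much later than the same job submitted to a dedicated queue.

Code:
// Toy model: how long until the "compute" job's result is ready, depending on
// whether it sits behind buffered graphics work or on its own queue.
#include <cstdio>
#include <queue>

struct Cmd { const char* name; double cost_ms; };

// Drain a FIFO command buffer and report the time at which "compute" finishes.
static double timeUntilComputeDone(std::queue<Cmd> q) {
    double t = 0.0;
    while (!q.empty()) {
        Cmd c = q.front(); q.pop();
        t += c.cost_ms;
        if (c.name[0] == 'c') return t;   // the compute job just finished
    }
    return t;
}

int main() {
    std::queue<Cmd> shared;               // driver keeps ~2 frames of draws buffered
    for (int i = 0; i < 2000; ++i) shared.push({"draw", 0.01});
    shared.push({"compute", 0.5});        // our low latency GPGPU job, stuck at the back

    std::queue<Cmd> dedicated;            // a queue (or GPU) reserved for GPGPU work
    dedicated.push({"compute", 0.5});

    std::printf("shared queue:    result ready after %.1f ms\n", timeUntilComputeDone(shared));
    std::printf("dedicated queue: result ready after %.1f ms\n", timeUntilComputeDone(dedicated));
}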
 
The latency problem is basically caused by the fact that you are running graphics rendering and GPGPU processing on the same GPU. In the best case GPU and CPU are running completely asynchronously. To keep the GPU perfectly fed 100% of the time, there's always a lot of graphics stuff waiting to be executed in the GPU command buffers. There is no way to insert all the low latency GPGPU stuff ahead of the graphics stuff in the command buffer with the current driver model. The driver model could be changed to support low latency privileged GPU access, but it's a really big can of worms. The whole GPU state would first have to be stored, then the low latency stuff executed, and then the GPU state restored so that graphics rendering could continue without problems. Basically the system would require a context switch (like modern multitasking OSes do all the time with CPUs). And a full state load/save for a GPU is a lot of work.

A future gaming console could have for example one GPU dedicated for graphics and one dedicated for general purpose calculation. This way the GPGPU latency would be really low compared to the current situation.
Thanks for going deeper into the details of how things work :)

In regard to the part I highlighted, it would be interesting if at some point developers (game or not) could explicitly reserve some resources for one type of calculation (say, some Larrabee cores or some SIMD arrays). It sounds like this would be possible on Larrabee at least; current GPUs still have to evolve.

EDIT
I actually got a bit carried away off topic by being a bit too enthusiastic... so I won't delete the non-relevant part, just hide it. I've actually opened a topic on that matter to clear up some misconceptions I have about GPUs. Depending on the responses I get, I may ask in that topic about the trouble tessellation (among other things) may introduce and how GPUs could evolve.
Sorry Sebbbi, I end up asking you a lot in a pretty airy fashion. Anyway, thanks for explaining the ins and outs of doing things on the GPU and how it could be improved :)

I've come to think a bit more about tessellation, and some issues came to mind, especially after watching the Unigine(?) benchmark. Say one were to tessellate the ground and then displace it; so far so good. Then you put in a character, and collisions are done on the CPU... The CPU doesn't know the ground was displaced... Long story short, you end up with a (lone) character with its feet stuck in the cobblestones.
Supposedly the geometry shader should be able to adapt the character mesh to the new topology (maybe a reason why it comes after tessellation in the DirectX pipeline?), but to do so the GPU by itself would have to run a new collision test, or the result of the tessellation would have to be sent back to the CPU to do so (which would defeat the purpose of doing tessellation on the GPU). Things could get worse between two characters, or if the collision should have had an effect on world simulation/AI (you take a hit from a horn, for example).
What I want to say is that the issues you explained to me are likely to grow a lot in the future (at various stages of your pipeline you would want to return data to the CPU to run calculations, or the other way around). Multitasking on one chip is tough; on multiple chips with communication overhead, it could be hell.
So there are different options for the hardware to handle it properly:
*You still have the CPU and GPU as two chips; you keep the CPU's job minimal and do everything on a more advanced GPU.
*You have only one chip with a discrete CPU and GPU "living" in it; they would have a fast way to communicate and might share a coherent memory model.
*You have a bunch of identical discrete entities; same as above, but maybe load balancing would be easier (homogeneous design).
*Last one... my favorite: I forget about tessellation and other stuff that is way over my head.

What is your take in this regard? When AMD/ATI say Fusion is the future, do you think that in the end it could make sense to have one or two CPU cores on the same die as the GPU, even if you have a quad core hanging around? What I mean is having the "GPU" as an autonomous resource in the computer: Windows detects that you want to play a game, sends the .exe to the GPU, and the GPU then does everything by itself. (I admit that simply giving up on the quad core could also make sense, depending on your needs.)
 
I am not so sure. I mean, what I really wanted to ask: is there really anything that CELL can do better than a GPU with respect to non-graphics stuff?

well, as the best GPU available (the latest ATI one) can't 'really' handle recursion,
I think you can safely say Cell does it better :)
 
Couldn't Larrabee pull off this trick?
I mean, the ARM cores in NetPCs are not really powerhouses, yet they still work pretty well.

Couldn't one or a couple of the Larrabee cores be assigned to run some OS and app code at a similar level?

The only catch is that Larrabee isn't so much a GPU as it is a pool of CPUs which Intel is *hoping* will be competitive when used as a GPU. To answer the question, though, I believe the answer is yes, since it's just a bunch of modified CPU cores. Each core should be able to run a fully functional instance of an OS, and/or any number of other functions.
 
The only catch is that Larrabee isn't so much a GPU as it is a pool of CPUs which Intel is *hoping* will be competitive when used as a GPU. To answer the question, though, I believe the answer is yes, since it's just a bunch of modified CPU cores. Each core should be able to run a fully functional instance of an OS, and/or any number of other functions.

OK, yeah, I learnt from Shifty as well that it isn't really labelled as a GPU. I think I got that impression from the plans to put it on graphics cards, but I also learnt that that doesn't necessarily make it a GPU. It will be really interesting to see if it finds a suitable niche.
 
OK, yeah, I learnt from Shifty as well that it isn't really labelled as a GPU. I think I got that impression from the plans to put it on graphics cards, but I also learnt that that doesn't necessarily make it a GPU. It will be really interesting to see if it finds a suitable niche.

It sorta boils down to this: Larrabee is basically a scaled-down computing node, sorta like what you'd expect to see in a supercomputer, i.e. multiple x86-compatible cores, communications, memory, etc., but with ROPs (iirc). It leaves me wondering how this is going to work out - even supercomputers typically rely on clusters of GPUs to render their output.
 
It has more immediately available local store to work with, allowing more versatility.

Uhhh, I really think that the local store is the single worst idea on the whole planet!!! I mean this makes it nearly impossible (maybe a ND ninja could do it) to port our stuff to CELL. In my opinion (which of course does not count at all) the local (small) store is the reason CELL will only be on the fringes.


A future gaming console could have for example one GPU dedicated for graphics and one dedicated for general purpose calculation. This way the GPGPU latency would be really low compared to the current situation.

That is exactly what I asked and thought... why even bother with a CPU :D
For the next console... just buy a Tesla, name it XStation or PlayBox, et voilà, there you have your new super console :mrgreen:


well, as the best GPU available (the latest ATI one) can't 'really' handle recursion,
I think you can safely say Cell does it better :)

Hmmm, interesting. Did not know this.
 
Uhhh, I really think that the local store is the single worst idea on the whole planet!!! I mean this makes it nearly impossible (maybe a ND ninja could do it) to port our stuff to CELL. In my opinion (which of course does not count at all) the local (small) store is the reason CELL will only be on the fringes.
But LS is like a large 'L1' cache, 2 MB in total on a full 1:8 Cell, over which you have control. The only difference is that if you're used to the CPU handling all memory fetches to keep things local for processing, having to manage that yourself will seem very hard. AFAIK there are libraries to help, though.
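Just to illustrate what that explicit management looks like, here is a minimal SPU-side sketch, assuming the IBM Cell SDK's spu_mfcio.h intrinsics (the chunk size, names and the placeholder work loop are purely illustrative, and it obviously only builds with the SPU toolchain):

Code:
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384                              /* one 16 KB slice of the 256 KB local store */
static char buf[CHUNK] __attribute__((aligned(128)));

/* Fetch one chunk from main memory into local store, touch it, write it back. */
void process_chunk(uint64_t ea)                  /* ea = effective address in main memory */
{
    const unsigned tag = 0;

    mfc_get(buf, ea, CHUNK, tag, 0, 0);          /* async DMA: main memory -> local store */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                   /* block until that DMA has completed */

    for (int i = 0; i < CHUNK; ++i)              /* placeholder work on data now in LS */
        buf[i] += 1;

    mfc_put(buf, ea, CHUNK, tag, 0, 0);          /* async DMA: local store -> main memory */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                   /* wait for the write-back */
}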

If you really are interested in how Cell fares and how you could perhaps address performance issues, you should visit the Cell Development forum. From the sounds of it, personal experience has you missing some of the potential, and if you understood some other approaches, you might find it's a very good processor that GPUs still have a little way to go to overtake in all areas.
 
Uhhh, I really think that the local store is the single worst idea on the whole planet!!! I mean this makes it nearly impossible (maybe a ND ninja could do it) to port our stuff to CELL. In my opinion (which of course does not count at all) the local (small) store is the reason CELL will only be on the fringes.
Except that the SPU local store is basically the only thing that CELL has that nobody else does. Each SPU has its own memory and its own memory controller. This provides a huge amount of total memory bandwidth to the whole system. All multicore CPUs and GPU processing units have to fight over shared memory resources.
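That bandwidth is easiest to exploit with the classic double-buffering pattern: because each SPU has its own DMA engine, the next chunk of data can be in flight while the current one is being processed. A hedged SPU-side sketch, again assuming the Cell SDK's spu_mfcio.h (chunk size, names and the checksum "work" are illustrative only):

Code:
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384
static char buf[2][CHUNK] __attribute__((aligned(128)));

/* Placeholder computation on one chunk that already sits in the local store. */
static unsigned work(const char* p)
{
    unsigned sum = 0;
    for (int i = 0; i < CHUNK; ++i)
        sum += (unsigned char)p[i];
    return sum;
}

/* Stream nchunks from main memory, overlapping each DMA with the previous chunk's work. */
unsigned stream(uint64_t ea, int nchunks)
{
    unsigned total = 0;
    int cur = 0;

    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);         /* prefetch chunk 0 (tag = buffer index) */
    for (int i = 0; i < nchunks; ++i) {
        int next = cur ^ 1;
        if (i + 1 < nchunks)                         /* kick off the DMA for chunk i+1 ... */
            mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);

        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();                   /* ... but wait only for chunk i */
        total += work(buf[cur]);                     /* compute overlaps the in-flight DMA */
        cur = next;
    }
    return total;
}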

well, as the best GPU available (the latest ATI one) can't 'really' handle recursion,
I think you can safely say Cell does it better :)
You can index temporary shader variables on recent GPUs. So you can create indexable arrays and, most importantly, a stack. With a stack, you can implement recursive algorithms. Granted, recursion on GPUs is not as easy as with real recursive function calls.
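For what it's worth, here is a small plain-C++ sketch of that technique: the recursive call is replaced by an explicit stack held in an indexable array, which is essentially what an indexable temporary array gives you in a shader (the tree layout and names are just for illustration, not any particular GPU API):

Code:
#include <cstdio>

struct Node { float value; int left; int right; };     // children stored as indices, -1 = none

// Recursive version: what you cannot write directly in a shader.
float sumRecursive(const Node* nodes, int i) {
    if (i < 0) return 0.0f;
    return nodes[i].value
         + sumRecursive(nodes, nodes[i].left)
         + sumRecursive(nodes, nodes[i].right);
}

// Iterative version: a manual stack (indexable array + stack pointer) does the recursion.
float sumWithStack(const Node* nodes, int root) {
    int stack[64];                                     // fixed size is fine for bounded depth
    int sp = 0;
    float total = 0.0f;
    stack[sp++] = root;
    while (sp > 0) {
        int i = stack[--sp];
        if (i < 0) continue;
        total += nodes[i].value;
        stack[sp++] = nodes[i].left;
        stack[sp++] = nodes[i].right;
    }
    return total;
}

int main() {
    Node nodes[] = { {1, 1, 2}, {2, -1, -1}, {3, 3, -1}, {4, -1, -1} };
    std::printf("recursive: %.1f  manual stack: %.1f\n",
                sumRecursive(nodes, 0), sumWithStack(nodes, 0));
}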
 
If you really are interested in how Cell fares and how you could perhaps address performance issues, you should visit the Cell Development forum. From the sounds of it, personal experience has you missing some of the potential, and if you understood some other approaches, you might find it's a very good processor that GPUs still have a little way to go to overtake in all areas.

Even better, you should consult the free online red book on Cell Development (Programmer's Manual or whatever it is called, I'll look it up - read most of it myself, on the PSP no less! :D Wish BookR would be released as a Mini!).

Anyway, among other things it contains sections explaining which algorithms the Cell is suited for and how well suited it is. Also general programming tips, etc. If you know what you are going to do in your project, it would be really easy to see whether or not the Cell would be useful for it. Then you can decide whether or not it is worth investing in adapting your programming style / designs / current codebase.
 
Except that the SPU local store is basically the only thing that CELL has that nobody else does. Each SPU has its own memory and its own memory controller. This provides a huge amount of total memory bandwidth to the whole system. All multicore CPUs and GPU processing units have to fight over shared memory resources.
Yes, I should have been more precise. I think the main concept is not bad, you are completely right. Our problem is how to fit the damn code into these small chunks of memory :devilish:
We discussed a lot with people who ported their stuff (which is roughly comparable to our framework) to CELL.
They said: 'for this part we needed 3 man-years, the other part was a little bit difficult, we needed 6-7 man-years'.
Here in our institute we have the following problem compared to these guys: usually one man-year means, in our case, one man coding for a whole year :mrgreen:
 
Even better, you should consult the free online red book on Cell Development (Programmer's Manual or whatever it is called, I'll look it up - read most of it myself, on the PSP no less! :D Wish BookR would be released as a Mini!).

Anyway, among other things it contains sections explaining which algorithms the Cell is suited for and how well suited it is. Also general programming tips, etc. If you know what you are going to do in your project, it would be really easy to see whether or not the Cell would be useful for it. Then you can decide whether or not it is worth investing in adapting your programming style / designs / current codebase.
Thanks very much. I already knew this book, nonetheless a good hint (time to read it again, just for fun ;)).
 
Larrabee is a CPU: a massively multicore Pentium with refinements to better handle graphics. But you can write and execute x86, and do any task. A GPU, by definition, is designed to process graphics workloads. Any processor designed to process any workload is not a graphics processing unit (unless you are identifying it by the function it serves in a system, e.g. a Z80 CPU could be a GPU if you have it controlling an LCD as part of a simple computer system).

Yes. It has more immediately available local store to work with, allowing more versatility. I can certainly see Cell being a better fit for audio synthesis and processing than a GPU. Cell is much faster at linear processing than a GPU, which is designed to process lots of data simultaneously. There's a reason supercomputers are using Cell. Going forwards, if GPUs extend their programmability then they should overtake it, but then they'd stop being GPUs and become CPUs - generic processors capable of doing any job you ask of them.

I think one thing people are ignoring is cost.

Cost is everything in business, and supercomputing is a business with a fixed budget. A GPU is much more costly than Cell: it has too much hardware that is not for executing programmable code. Cell is a very clean design; the only bit of bloat is the PPU and the L2. A big running cost for supercomputing is also power consumption. A GPU is not only power hungry but also needs a lot of cooling.

Cell is also by now a very mature design for manufacturing, and is very cheap and reliable in that respect too.

I think if they can make a GPU like Fermi cheap to manufacture, cool, and reliable, maybe that will be a good option for supercomputing.
 
Many algorithms require larger data sets with higher data dependencies and are inherently serial and branchy in nature, so they would not map efficiently to a CELL processing model no matter how many man years were spent on development.

A higher "rating" of ops per second by even an order of magnitude or more can easily be found to be meaningless in the face of efficiency.
 
Of course. That's why IBM explicitly indicates which type of work Cell is good/bad at (using a five point rating) in its manual.

Many algorithms require larger data sets with higher data dependencies and are inherently serial and branchy in nature, so they would not map efficiently to a CELL processing model no matter how many man years were spent on development.

A higher "rating" of ops per second by even an order of magnitude or more can easily be found to be meaningless in the face of efficiency.
 
Why, those problems may suck on GPGPUs too. Need to look at the specifics.

As long as there is a way to reorganize and bunch up the data and stream it in chunks of less than 128K, they should run well on Cell. A "branchy" problem is OK when the data can be packed/localized. In one case, the Cell platform outran a traditional supercomputer node in a breadth-first tree search.

Also, if the per-SPU speed-up is modest (i.e., not counter-productive), the programmer can allocate 8 SPUs to do the job, which should allow the programmer to get away with some quick and small parallelization work.

The other nice thing about Cell is that it's very predictable, so scientists can estimate its performance rather accurately before coding it.
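As a rough illustration of the "pack and localize" point (plain C++, nothing Cell-specific; the graph and names are made up), here is a breadth-first search over a graph stored as two flat CSR-style arrays. Once the working set is a few compact arrays like this, a branchy traversal can be streamed through a local store in chunks instead of chasing pointers all over main memory:

Code:
#include <cstdio>
#include <queue>
#include <vector>

int main() {
    // CSR layout: the neighbours of node i are edge[off[i]] .. edge[off[i + 1] - 1].
    std::vector<int> off  = {0, 2, 4, 5, 6, 6};
    std::vector<int> edge = {1, 2, 3, 4, 4, 4};
    const int n = 5;

    std::vector<int> depth(n, -1);                 // -1 = not visited yet
    std::queue<int> frontier;
    depth[0] = 0;
    frontier.push(0);

    while (!frontier.empty()) {                    // classic BFS: branchy, but the data is compact
        int u = frontier.front(); frontier.pop();
        for (int k = off[u]; k < off[u + 1]; ++k) {
            int v = edge[k];
            if (depth[v] < 0) { depth[v] = depth[u] + 1; frontier.push(v); }
        }
    }
    for (int i = 0; i < n; ++i)
        std::printf("node %d: depth %d\n", i, depth[i]);
}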
 
Would serial problems really be an issue on Cell, since you can pipe data from one SPU to another across the internal bus? Just have each SPU work on part of the algorithm, then chain 'em all together... in theory.

Anyone done any work like that?
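Not Cell code, but as a toy analogy of that chaining idea: each stage does part of the work and hands its output to the next one, the way SPUs could pass data over the EIB. Plain C++ threads and a tiny blocking queue stand in for the SPUs and the mailbox/DMA hand-offs (all names are illustrative):

Code:
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

struct Channel {                                   // blocking queue standing in for a mailbox/DMA hop
    std::queue<int> q;
    std::mutex m;
    std::condition_variable cv;
    void push(int v) { std::lock_guard<std::mutex> l(m); q.push(v); cv.notify_one(); }
    int pop() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return !q.empty(); });
        int v = q.front(); q.pop();
        return v;
    }
};

int main() {
    Channel a, b;
    const int N = 8, DONE = -1;

    std::thread stage1([&] {                       // "SPU 1": first half of the algorithm
        for (int v = a.pop(); v != DONE; v = a.pop()) b.push(v * v);
        b.push(DONE);
    });
    std::thread stage2([&] {                       // "SPU 2": second half, consumes stage 1's output
        for (int v = b.pop(); v != DONE; v = b.pop()) std::printf("result: %d\n", v + 1);
    });

    for (int i = 0; i < N; ++i) a.push(i);         // feed the head of the chain
    a.push(DONE);
    stage1.join();
    stage2.join();
}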
 
A "branchy" problem is ok when the data can be packed/localized. In one case, the Cell platform outran a traditional supercomputer node in a breadth-first tree search.
Local memory offers the SPU a really fast (low latency) and predictable (no latency spikes) path to memory. These things really help SPU branching performance: the less memory latency you have, the less you need complex branch prediction hardware (and the better your branching performance). Of course, the problem needs to be really well localized, so that a branch doesn't force you to move/load new data to/from the local memory. That would really kill the performance.
 
If code runs well on a SPU, it will run well on a regular core. The advantage of the SPU is that it is small, and relatively low power, so you can have more execution resources on the same die. That is the only advantage, everything else is a disadvantage.

It'll be interesting to see how well it holds up against regular cores optimized for power and size, like a Cortex-A9 with a SIMD add-on.

Cheers
 