Is there something that CELL can still do better than modern CPU/GPU

Discussion in 'Console Technology' started by gongo, Nov 4, 2009.

  1. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
You are referring to the possibility of locking cache lines, thus making a normal cache able to offer the advantages of a Local Store and then some, but a fast 256 KB cache with 6 cycles of latency at 3 GHz is not exactly trivial to build, especially on 90 nm a few years ago when CELL came out.
     
  2. archangelmorph

    Veteran

    Joined:
    Jun 19, 2006
    Messages:
    1,551
    Likes Received:
    11
    Location:
    London
Surely the argument here isn't an issue of core vs core but chip vs chip..

    It's not the SPU that makes the Cell a nippy chip. It's the fact that there's eight of them..
     
  3. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,519
    Likes Received:
    852
    No, I'm not.

I'm referring to the fact that if you rewrite your code so it can run out of local store, its locality will be such that it can run out of caches.

    There is no need to lock cache lines if your code uses loads and stores with temporal hints.

    Cheers
     
  4. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    You have 8 of them though, in a tight efficient package. Then throw in the vector unit from the PPU. All -- not just one SPU core -- can run without interfering with each other if designed well.
     
  5. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,598
    Likes Received:
    11,004
    Location:
    Under my bridge
    Gubbi said that ;)
    Though as Panajev says, 3GHz low latency is impressive, and we could get faster. A single SPU is an efficient core, and the simplicity means lots of them. Thus the Cell architecture, as archangelmorph points out, is a Good One.

    Which in summary shows I've just quoted what everyone else has said, which begs the question why did I bother to post anything? :p
     
  6. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    Yeah, I just don't know if he thought about having 8 separate caches for each SPU core, or he's just thinking about a one core scenario for the data locality argument. :)
     
  7. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
I guess that either I do not know much about the effect of temporal hints (highly probable) and they really can guarantee 100% deterministic behavior, or they can only do that 98% of the time and the extra 2% is not worth pursuing given what else a cache gives you.
Still, there might be a need for cache locking to avoid polluting the cache with streamed data, IMHO.
     
  8. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,519
    Likes Received:
    852
Non-temporal loads and stores avoid polluting caches with data that has no temporal locality. This increases the effectiveness of caches; it also avoids read-modify-writes of cache lines.

SPUs are predictable because they operate on data in local store. If you can hide the latency of moving data in and out of the local store (e.g. by splitting your workload into chunks and double buffering), you're fine; if not, you're SOL, and the latter kind of workload is obviously not fit to run on an SPU.

My point is, if you have a piece of code that you get running efficiently on an SPU by thinking long and hard about algorithms, data layout, etc., then back-porting it to your original CPU is likely to increase performance on your CPU as well, because the things that make code run well on an SPU, increased locality above all, greatly benefit regular CPUs too.

    Cheers
     
  9. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
    Agreed, it is also something that many PS3 developers have stated... in a way PS3 did help the industry because all the developers who forced themselves to make code that runs fast on CELL and move as many subsystems on SPU's as possible are now better programmers on all platforms ;).
     
  10. Acert93

    Acert93 Artist formerly known as Acert93
    Legend

    Joined:
    Dec 9, 2004
    Messages:
    7,782
    Likes Received:
    162
    Location:
    Seattle
repi's slides he just linked to indicate as much (PS3 the hardest, will run very poorly if not targeted, lead on the PS3, PS3 shows very good potential, etc.).
     
  11. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    That has been the design philosophy all along.

If the solution is optimized for the Cell architecture, pound-for-pound, it will run faster than a general/complex design because of the added cores and simpler hardware. The fast interconnect between the SPUs helps too.

The GPUs will have even more cores, but they are suited to a different (but overlapping) problem/solution profile.
     
  12. Acert93

    Acert93 Artist formerly known as Acert93
    Legend

    Joined:
    Dec 9, 2004
    Messages:
    7,782
    Likes Received:
    162
    Location:
    Seattle
As in suited meaning, "GPUs have some limitations that prevent even GPU-optimized code from matching SPE code." Which is true. But it is the same point of contention that gets ignored about SPEs versus normal processors. Optimized doesn't always mean better in the big picture, but as the slides I mentioned point out, we are heading in directions where certain tasks dominate cycles, and the important thing is to get your heavy lifting to fit within that envelope.
     
  13. ADEX

    Newcomer

    Joined:
    Sep 11, 2005
    Messages:
    231
    Likes Received:
    10
    Location:
    Here
    The lack of coherence between the Local Stores is probably seen as a disadvantage but once you start to scale Cell it'll turn out to be a big advantage.

    Once you start adding in piles of cores coherent caches will become a major source of latency and power consumption.
     
  14. Terarrim

    Newcomer

    Joined:
    Jun 12, 2007
    Messages:
    177
    Likes Received:
    0
That was one of the larger design decisions in creating the Cell: there is a limit to the amount of cache you can use before you hit diminishing returns, whereas the local store SRAM is not only predictable, it's infinitely scalable.
     
  15. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,519
    Likes Received:
    852
You're right that relying on memory coherence for inter-thread/process communication won't scale as the number of cores approaches infinity. I think message passing will come into vogue.

In terms of coherence, I would really like both.

    Explicit coherence control for message passing, and automatic/strict coherence for each context and for the odd global lock/communication. The big Altixen proves you can have a big single coherent system image, and good performance is usually reached by using message passing.

I think the future is going to be something like an Altix on a chip: hundreds or thousands of coherent cores with primitives that support message passing, like loads and stores that hit a specific level of the on-die memory system (spilling to main memory as necessary), smarter coherence protocols (automatic directories), and virtual channels in memory.

CELL's complete lack of coherence is a major PITA. While it may be worthwhile from a pure performance perspective (which isn't clear cut at all, i.e. there is zero constructive interference in the on-die memory, unlike a cache hierarchy), once you factor in human resources, it isn't (IMO). Hardware cost is monotonically falling; software developer salaries are monotonically rising.

    Cheers
     
    #55 Gubbi, Nov 17, 2009
    Last edited by a moderator: Nov 17, 2009
  16. ps2rocks

    Banned

    Joined:
    May 1, 2010
    Messages:
    32
    Likes Received:
    0
One thing you mentioned that caught my attention was the use of the FFT, the Fast Fourier Transform... and yes, it is totally amazing that Cell is doing this; I've not seen any other processor even attempt it, let alone be built for it. Processing-wise, it can be put into perspective like this: the first game in the history of gaming that did dynamic wave dispersion of water, while also doing procedural wave generation, and interactive for any number of individuals touching it, was Resistance 2. The entire game's water is done with the FFT. The funny thing is that even Crysis hasn't touched it yet, so obviously no interactive wave dispersion there (Crysis only does procedural waves, with neither dynamic dispersion nor interactive generation blending).

For Crysis 2, we're hearing news of it utilizing the FFT. In Resistance 2 both small-body and large-body (ocean) water are rendered separately, but the core FFT algorithm remains the same. Be it 2 ft or 200 metres, it runs the same without framerate issues. That was the first time I was completely sold on Cell's potential, even in the face of modern GPUs with their hundreds of cores.
     
  17. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,598
    Likes Received:
    11,004
    Location:
    Under my bridge
    This isn't a water appreciation thread, guys...
     
  18. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,598
    Likes Received:
    11,004
    Location:
    Under my bridge
    Water discussion moved here.
     
  19. ps2rocks

    Banned

    Joined:
    May 1, 2010
    Messages:
    32
    Likes Received:
    0
    A remarkably interesting read and an excellent sum-up of this thread:


    Some excerpt first:
    *GPU and Cell/B.E. are close cousins from a hardware architecture point of view.

    *They both rely on Single Instruction Multiple Data (SIMD) parallelism — a.k.a vector processing, and

    *they both run at high clock speed (>3GHz) and implement floating point operations using RISC technology achieving single cycle execution even for complex operations like reciprocal or square root estimates. These come in very handy for 3D transformations and distance calculations (used a lot both in 3D graphics and scientific modeling).

    *They both manage to pack over 200 GFlops (billions of floating point operations per second) into a single chip. They are excellent choices for applications like 3D molecular modeling, MM force field computations, docking, scoring, flexible ligand overlay, protein folding.

    BUT

*There are some subtle differences between the two, e.g. the Cell/B.E. supports double precision calculations while GPUs do not (there is some work being done in that direction at Nvidia though), which makes the Cell/B.E. the only suitable choice for quantum chemistry calculations.

    *There is a difference in memory handling too: GPUs rely on caching just like CPUs, while the Cell/B.E. puts complete control into the hands of the programmers via direct DMA programming. This allows the developers to keep “feeding the beast” with data using double buffering techniques without ever hitting a cache-miss causing stalls in the computation.

*Another difference is that GPUs use wider registers (256 bits), while the Cell/B.E. uses 128 bits but with a dual pipe, which allows two operations to execute in a single cycle. The two approaches may sound equivalent at a cursory look, but again there is a subtle difference. 128 bits house 4 floats, enough for a 3D transformation row or a point coordinate (typically extended to 4 components instead of 3 to handle perspective), so you can execute 2 different operations on them on the Cell/B.E. while the GPU can only do the same operation on more data.
    So if the purpose is to apply one operation to a lot of data, that comes down to the same thing, but a more complex computation series on a single 3D matrix can be done twice as fast on the Cell/B.E.

*The 8 Synergistic Processor Units of the Cell/B.E. can transfer data between each other's memories via a 192 GB/s bus, while the fastest GPU (GeForce 8800 Ultra) has a bandwidth of 103.7 GB/s and all others fall well below 100 GB/s. The high-end GPUs have over 300 GFlops of theoretical throughput, but due to memory bus speed limitations and cache-miss latency the practical throughput falls far short of that, while the Cell/B.E. has demonstrated benchmark results (e.g. for a real-time ray tracing application) far superior to those of the G80 GPU despite its lower theoretical throughput.


    Here's the rest of it, even cost effectiveness is discussed:
    http://www.simbiosys.ca/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/
     
    #59 ps2rocks, Sep 16, 2010
    Last edited by a moderator: Sep 16, 2010
  20. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
Hmm, your sum-up is inaccurate on many accounts.
     