Is there something that CELL can still do better than modern CPU/GPU

Discussion in 'Console Technology' started by gongo, Nov 4, 2009.

  1. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
You are referring to the possibility of locking cache lines, thus making a normal cache able to offer the advantages of a Local Store and then some, but a fast 256 KB cache with 6 cycles of latency at 3 GHz is not exactly trivial to build, especially on 90 nm a few years ago when CELL came out.
     
  2. archangelmorph

    Veteran

    Joined:
    Jun 19, 2006
    Messages:
    1,551
    Likes Received:
    11
    Location:
    London
Surely the argument here isn't an issue of core vs core but chip vs chip..

    It's not the SPU that makes the Cell a nippy chip. It's the fact that there's eight of them..
     
  3. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,519
    Likes Received:
    852
    No, I'm not.

I'm referring to the fact that if you rewrite your code so it can run out of local store, its locality will be such that it can run out of caches.

    There is no need to lock cache lines if your code uses loads and stores with temporal hints.

    Cheers
     
  4. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    You have 8 of them though, in a tight efficient package. Then throw in the vector unit from the PPU. All -- not just one SPU core -- can run without interfering with each other if designed well.
     
  5. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,598
    Likes Received:
    11,004
    Location:
    Under my bridge
    Gubbi said that ;)
    Though as Panajev says, 3GHz low latency is impressive, and we could get faster. A single SPU is an efficient core, and the simplicity means lots of them. Thus the Cell architecture, as archangelmorph points out, is a Good One.

    Which in summary shows I've just quoted what everyone else has said, which begs the question why did I bother to post anything? :p
     
  6. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    Yeah, I just don't know if he thought about having 8 separate caches for each SPU core, or he's just thinking about a one core scenario for the data locality argument. :)
     
  7. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
I guess that either I do not know much about the effect of temporal hints (highly probable) and they really can guarantee 100% deterministic behavior, or they can only do that 98% of the time and the extra 2% is not worth pursuing given what else a cache gives you.
Still, there might be a need for cache locking to avoid polluting the cache with streamed data, IMHO.
     
  8. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,519
    Likes Received:
    852
Non-temporal loads and stores avoid polluting caches with data that has no temporal locality. This increases the effectiveness of caches; it also avoids read-modify-writes of cache lines.

SPUs are predictable because they operate on data in local store. If you can hide the latency of moving data in and out of the local store (e.g. by splitting your workload into chunks and double buffering), you're fine; if not, you're SOL, and the latter kind of workload is obviously not fit to run on an SPU.

My point is, if you have a piece of code that you get running efficiently on an SPU by thinking long and hard about algorithms, data layout, etc., then back-porting it to your original CPU is likely to increase performance on your CPU as well, because the things that make code run well on an SPU, increased locality above all, greatly benefit regular CPUs too.

    Cheers
     
  9. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
    Agreed, it is also something that many PS3 developers have stated... in a way PS3 did help the industry because all the developers who forced themselves to make code that runs fast on CELL and move as many subsystems on SPU's as possible are now better programmers on all platforms ;).
     
  10. Acert93

    Acert93 Artist formerly known as Acert93
    Legend

    Joined:
    Dec 9, 2004
    Messages:
    7,782
    Likes Received:
    162
    Location:
    Seattle
repi's slides he just linked to indicate as much (PS3 the hardest, will run very poorly if not targeted, lead on the PS3, PS3 shows very good potential, etc.).
     
  11. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    That has been the design philosophy all along.

If the solution is optimized for the Cell architecture, pound-for-pound, it will run faster than a general/complex design because of the added cores and simpler hardware. The fast interconnect between the SPUs helps too.

The GPUs will have even more cores, but they are suited to a different (but overlapping) problem/solution profile.
     
  12. Acert93

    Acert93 Artist formerly known as Acert93
    Legend

    Joined:
    Dec 9, 2004
    Messages:
    7,782
    Likes Received:
    162
    Location:
    Seattle
As in suited meaning, "GPUs have some limitations that prevent even GPU-optimized code from matching SPE code." Which is true. But it is the same point of contention that gets ignored about SPEs versus normal processors. Optimized doesn't always mean better in the big picture, but as the slides I mentioned point out, we are heading in directions where certain tasks dominate cycles, and the important thing is to get your heavy lifting to fit within that envelope.
     
  13. ADEX

    Newcomer

    Joined:
    Sep 11, 2005
    Messages:
    231
    Likes Received:
    10
    Location:
    Here
    The lack of coherence between the Local Stores is probably seen as a disadvantage but once you start to scale Cell it'll turn out to be a big advantage.

    Once you start adding in piles of cores coherent caches will become a major source of latency and power consumption.
     
  14. Terarrim

    Newcomer

    Joined:
    Jun 12, 2007
    Messages:
    177
    Likes Received:
    0
That was one of the larger design decisions in creating the Cell: there is a limit to the amount of cache you can use before you hit diminishing returns, whereas the local store SRAM is not only predictable, it's infinitely scalable.
     
  15. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,519
    Likes Received:
    852
You're right that relying on memory coherence for inter-thread/process communication won't scale as the number of cores approaches infinity. I think message passing will come into vogue.

In terms of coherence, I would really like both.

    Explicit coherence control for message passing, and automatic/strict coherence for each context and for the odd global lock/communication. The big Altixen proves you can have a big single coherent system image, and good performance is usually reached by using message passing.

I think the future is going to be something like an Altix on a chip: hundreds or thousands of coherent cores with primitives that support message passing, like loads and stores that hit a specific level of the on-die memory system (spilling to main memory as necessary), smarter coherence protocols (automatic directories), and virtual channels in memory.

CELL's complete lack of coherence is a major PITA. While it may be worthwhile from a pure performance perspective (which isn't clear cut at all, i.e. there is zero constructive interference in the on-die memory, unlike a cache hierarchy), once you factor in human resources, it isn't (IMO). Hardware cost is monotonically falling; software developer salaries are monotonically rising.

    Cheers
     
    #55 Gubbi, Nov 17, 2009
    Last edited by a moderator: Nov 17, 2009
  16. ps2rocks

    Banned

    Joined:
    May 1, 2010
    Messages:
    32
    Likes Received:
    0
One thing you mentioned that caught my attention was the use of the FFT, the Fast Fourier Transform... and yes, it is totally amazing that Cell is doing this; I've not seen any other processor even attempt it, let alone be built for it. Processing-wise, it can be put into perspective like this: the first game in the history of gaming that did dynamic wave dispersion of water, while also doing procedural wave generation, and interactive for any number of individuals touching it, was Resistance 2. The entire game's water is done with the FFT. The funny thing is that even Crysis hasn't touched it yet, so obviously no interactive wave dispersion there (Crysis only does procedural waves, with neither dynamic dispersion nor interactive generation blending).

For Crysis 2, we're hearing news of it utilizing the FFT. In Resistance 2 both small-body and large-body (ocean) water are rendered separately, but the core FFT algorithm remains the same. Be it 2 ft or 200 metres, it runs the same without framerate issues. That was the first time I was completely sold on Cell's potential, even in the face of modern GPUs with their hundreds of cores.
     
  17. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,598
    Likes Received:
    11,004
    Location:
    Under my bridge
    This isn't a water appreciation thread, guys...
     
  18. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,598
    Likes Received:
    11,004
    Location:
    Under my bridge
    Water discussion moved here.
     
  19. ps2rocks

    Banned

    Joined:
    May 1, 2010
    Messages:
    32
    Likes Received:
    0
    A remarkably interesting read and an excellent sum-up of this thread:


    Some excerpt first:
    *GPU and Cell/B.E. are close cousins from a hardware architecture point of view.

    *They both rely on Single Instruction Multiple Data (SIMD) parallelism — a.k.a vector processing, and

    *they both run at high clock speed (>3GHz) and implement floating point operations using RISC technology achieving single cycle execution even for complex operations like reciprocal or square root estimates. These come in very handy for 3D transformations and distance calculations (used a lot both in 3D graphics and scientific modeling).

    *They both manage to pack over 200 GFlops (billions of floating point operations per second) into a single chip. They are excellent choices for applications like 3D molecular modeling, MM force field computations, docking, scoring, flexible ligand overlay, protein folding.

    BUT

*There are some subtle differences between the two, e.g. the Cell/B.E. supports double precision calculations while GPUs do not (there is some work being done in that direction at Nvidia though), which makes the Cell/B.E. the only suitable choice for quantum chemistry calculations.

    *There is a difference in memory handling too: GPUs rely on caching just like CPUs, while the Cell/B.E. puts complete control into the hands of the programmers via direct DMA programming. This allows the developers to keep “feeding the beast” with data using double buffering techniques without ever hitting a cache-miss causing stalls in the computation.

*Another difference is that GPUs use wider registers (256 bits), while the Cell/B.E. uses 128 bits but with a dual pipe, which allows two operations to execute in a single cycle. The two approaches may sound equivalent at a cursory look, but again there is a subtle difference. 128 bits house 4 floats, enough for a 3D transformation row or a point coordinate (typically extended to 4 components instead of 3 to handle perspective), so you can execute 2 different operations on them on the Cell/B.E. while the GPU can only do the same operation on more data.
    So if the purpose is to apply one operation to a lot of data, that comes down to the same thing, but a more complex computation series on a single 3D matrix can be done twice as fast on the Cell/B.E.

*The 8 Synergistic Processor Units of the Cell/B.E. can transfer data between each other's memories via a 192 GB/s bus, while the fastest GPU (GeForce 8800 Ultra) has a bandwidth of 103.7 GB/s and all others fall well below 100 GB/s. The high-end GPUs have over 300 GFlops of theoretical throughput, but due to memory bus speed limitations and cache-miss latency the practical throughput falls far short of that, while the Cell/B.E. has demonstrated benchmark results (e.g. for a real-time ray tracing application) far superior to those of the G80 GPU despite its lower theoretical throughput.


    Here's the rest of it, even cost effectiveness is discussed:
    http://www.simbiosys.ca/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/
     
    #59 ps2rocks, Sep 16, 2010
    Last edited by a moderator: Sep 16, 2010
  20. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
Hmm, your sum-up is inaccurate on many accounts.
     