Does the Cell processor have a chance?

Discussion in 'CellPerformance@B3D' started by stof, Nov 25, 2007.

  1. stof

    Newcomer

    Joined:
    Nov 9, 2006
    Messages:
    3
    Likes Received:
    0
    I occasionally hear hype about the Cell processor, but I wonder if it really has a chance. It seems to have fatal flaws.

    The upside of the Cell is that it has 200 GFLOPS peak performance per chip. This performance number comes from each SPU running at 3.2 GHz, able to perform 4 multiplies & 4 adds simultaneously, which is 25 GFLOPS per SPU, times the 8 SPUs on the chip.
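
    The arithmetic behind those figures can be sketched as follows. This is only a back-of-envelope check using the numbers quoted above; the helper name `peak_gflops` is made up for illustration.

```python
# Back-of-envelope peak throughput, using only the figures quoted in the post.
def peak_gflops(clock_ghz: float, flops_per_cycle: int, units: int = 1) -> float:
    """Peak GFLOPS = clock (GHz) x flops per cycle per unit x number of units."""
    return clock_ghz * flops_per_cycle * units

# One SPU: 4 multiplies + 4 adds per cycle at 3.2 GHz.
per_spu = peak_gflops(3.2, 4 + 4)
# Whole chip: 8 SPUs, as stated in the post.
per_chip = peak_gflops(3.2, 4 + 4, units=8)

print(f"per SPU:  {per_spu:.1f} GFLOPS")
print(f"per chip: {per_chip:.1f} GFLOPS")
```

    So the "25 GFLOPS per SPU" and "200 GFLOPS per chip" figures are rounded-down versions of 25.6 and 204.8.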

    I wonder if you can really get to 50% of peak performance.

    A single modern Intel core running at 3 GHz can do 4 multiplies or adds at the same time, which is 12 GFLOPS. It's not that hard to get to peak performance of an x86-64 CPU. That's 100 GFLOPs too.
    And you can put 8 of them in a cheap box.
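
    For reference, the x86 side of the comparison works out roughly like this. It is a sketch that takes the per-core figure above at face value (that figure is disputed later in the thread); the variable names are illustrative only.

```python
# x86 estimate using the post's own numbers.
clock_ghz = 3.0          # "a single modern Intel core running at 3 GHz"
flops_per_cycle = 4      # "4 multiplies or adds at the same time"
cores_in_box = 8         # "you can put 8 of them in a cheap box"

per_core_gflops = clock_ghz * flops_per_cycle
box_gflops = per_core_gflops * cores_in_box

print(f"per core: {per_core_gflops:.0f} GFLOPS")
print(f"8 cores:  {box_gflops:.0f} GFLOPS")
```

    Eight such cores give 96 GFLOPS, which is presumably where the "100 GFLOPs" figure comes from.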


    A major problem with the Cell is that it uses expensive XDR memory and you can only put 2 Gbytes on a node. That is very limiting. A Cell blade is very expensive, ~$10,000. And, Cell isn't improving as fast as Intel/AMD is.

    So, the Cell doesn't look that great with price/performance, it has limited memory, it has little software infrastructure, and uncertainty with its future.

    Does it have a chance?
     
  2. Vitaly Vidmirov

    Newcomer

    Joined:
    Jul 9, 2007
    Messages:
    108
    Likes Received:
    10
    Location:
    Russia
    It is possible to get 99% of peak performance on certain tasks like matrix multiply.

    So what did you expect? CELL in place of x86?
    x86 is not the most popular processor in the world, anyway.
     
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,295
    Likes Received:
    3,622
    Location:
    Well within 3d

    This is highly dependent on workload, and for many problem types and system sizes, 50% utilization would be something any architecture would kill for.

    It's also not too hard to make x86 run below peak. Even on Linpack, the going rate is something like 80% peak, and Linpack is a standard benchmark everyone targets.
    Since memory latency and bandwidth have become so important, the greater control Cell has over memory access is in many areas far superior to the current broadcast coherency schemes of x86 chips.

    It should also be noted that the x86 system that can hit 100 GFLOPS does so with two chips with TDPs of 120W.
    That's several times the TDP of one Cell.
    Power concerns are going to be dominant from now on, as it now constrains clock speeds, system footprint, and operating costs for a system.

    A valid point, which is why the HPC variant of Cell uses DDR2.
    I'll cover the volume and price considerations at the end of this.

    Does it have a chance in what field?

    The desktop? Basically none.
    Future game consoles? Maybe one of them.
    HPC? Probably the best chance it has for creating a niche, much in the way Blue Gene's processors have their own small space.
    Other fields? Maybe something here or there, but the support isn't all that enthusiastic.

    The primary reason for doubt is that Cell so far has not realized the volume that commodity x86 has attained.
    Given market trends and costs, this may prove telling.
    The more likely outcome is that future x86 chips are going to copy most of what makes Cell perform so well, leaving Cell with little to offer.
     
  4. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,614
    Likes Received:
    767
    Location:
    Guess...
    I thought Core2 could perform 4 double-precision but 8 single-precision operations per cycle (per core, that is)?
     
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,295
    Likes Received:
    3,622
    Location:
    Well within 3d
    I was going by the DP throughput of a two-socket Yorkfield system which is roughly 100 GFLOPS, while the HPC Cell with enhanced DP throughput also tops out at ~100 DP GFLOPS.

    edit:
    Cell would also have double the SP throughput over DP for the HPC version.
     
  6. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,614
    Likes Received:
    767
    Location:
    Guess...
    Ah cool. Just wanted to make sure I wasn't mistaken. So the HPC Cell pretty much doubles Yorkfield's peak throughput in either SP or DP.

    I wonder if we'll see a new, beefier Cell before Nehalem arrives. I expect so but it would be strange to see a single socket x86 matching or exceeding Cell in peak floating point.
     
  7. stof

    Newcomer

    Joined:
    Nov 9, 2006
    Messages:
    3
    Likes Received:
    0
    Where can I find more information on Cell boards with DDR2? The IBM web site doesn't have any.

    The Core2 can do only 4 single precision operations per cycle and 2 double precision. It can't do simultaneous multiply & add, like the SPE on the Cell. But, it's hard to keep simultaneous multiply & adds busy.
     
  8. Carl B

    Carl B Friends call me xbd
    Moderator Legend

    Joined:
    Feb 20, 2005
    Messages:
    6,266
    Likes Received:
    63
    What, are you going to buy some?

    Anyway this is the thread that would probably be your best introduction to the DDR2/HPC Cell: http://forum.beyond3d.com/showthread.php?t=40661

    I'll mention also that it's this version of Cell that's going to go into Roadrunner. It's not available to the 'general' public right now, but as time goes on I'm sure you'll see it pop up. As to the original point of the thread, frankly I think Cell has done very well for itself considering it's a new architecture.
     
  9. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,614
    Likes Received:
    767
    Location:
    Guess...
    According to this Core2 is capable of 8 SP operations per cycle:

    http://www.behardware.com/articles/623-5/intel-core-2-duo-test.html

    "Core uses two floating point calculation units, one dedicated to addition and the other to multiplication and division. Theoretical calculation capacity is 2 x87 instructions per cycle and 2 SSE 128 bit floating point instructions per cycle (that is 8 operations on 32 bit simple precision floating points, or 4 operations for double precision 64 bit floating points). Core is, in theory, two times faster for this type of instruction than Mobile, Netburst and K8."

    That would result in a theoretical peak of 96 GFLOPs for the fastest single socket CPU.
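
    The 96 GFLOPS figure can be reproduced as below. The per-cycle numbers come from the quoted article; the core count and clock speed are assumptions (a quad-core part at 3.0 GHz), since the post does not name the CPU.

```python
# Peak estimate from the quoted per-cycle figures.
sp_flops_per_cycle = 8   # per core, single precision (from the quoted article)
dp_flops_per_cycle = 4   # per core, double precision (from the quoted article)
cores = 4                # assumption: quad-core CPU
clock_ghz = 3.0          # assumption: 3.0 GHz clock

sp_peak = sp_flops_per_cycle * cores * clock_ghz
dp_peak = dp_flops_per_cycle * cores * clock_ghz

print(f"single socket, SP: {sp_peak:.0f} GFLOPS")
print(f"single socket, DP: {dp_peak:.0f} GFLOPS")
print(f"two sockets,   DP: {2 * dp_peak:.0f} GFLOPS")
```

    Under those assumptions, two sockets come to 96 DP GFLOPS, which lines up with the "roughly 100 GFLOPS" quoted earlier for a two-socket Yorkfield system.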
     
  10. stof

    Newcomer

    Joined:
    Nov 9, 2006
    Messages:
    3
    Likes Received:
    0
    I am an HPC software developer. My software is used on about $100 million of hardware. It's pretty important for new hardware to recruit HPC software developers.

    I need to be careful about what I invest my time in. With the high cost of Cell boards, the limited memory, and the limited install base, I don't have confidence Cell will become mainstream for commercial HPC ($500K-$10 million clusters). I agree with the above comments that the x86 chips will probably do it at a lower price and with better software infrastructure.

    And while I don't want to divert this discussion onto Intel hardware, the above Intel information is misleading. Yes, the Intel chips can work on a SIMD multiply and add at the same time, but they take more than a clock cycle. You can submit an SSE multiply, but it takes 5 cycles to complete. One clock cycle after the submit, you can submit another SSE instruction, such as an SSE add, and they will work at the same time, but you won't get 8 flop throughput per cycle. You can only submit one SSE instruction at a time.
     
  11. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    stof, what kind of HPC software? Is it media related, or scientific computing?
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,295
    Likes Received:
    3,622
    Location:
    Well within 3d
    I read that the FP multiplier has a throughput of 1 per cycle and a latency of 4. Only 80-bit FP multiply has a throughput of less than 1 per cycle.

    Core2 also has SSE units on 3 issue ports, 1 port for FADD, 1 port for FMUL, and 1 port for other ops.

    The peak number would seem to hold unless you can't find any non-dependent multiplies.
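
    The latency-versus-throughput point can be illustrated with a toy model. This is a simplified sketch, not a description of the real Core2 pipeline: it assumes a single fully pipelined multiplier that accepts one new operation per cycle and delivers each result a fixed number of cycles after issue. The function name `cycles_needed` is made up for the example.

```python
def cycles_needed(n_ops: int, latency: int, independent: bool) -> int:
    """Cycles to finish n_ops multiplies on one fully pipelined unit.

    Toy model: one new op may issue per cycle, and each result is
    ready `latency` cycles after its op issued.
    """
    if n_ops == 0:
        return 0
    if independent:
        # Ops issue back to back; the last one issues at cycle n_ops - 1
        # and completes `latency` cycles later.
        return (n_ops - 1) + latency
    # Dependent chain: each op waits for the previous result before issuing.
    return n_ops * latency

LATENCY = 4  # multiply latency quoted in the post above

for n in (4, 100, 10_000):
    indep = cycles_needed(n, LATENCY, independent=True)
    chain = cycles_needed(n, LATENCY, independent=False)
    print(f"{n:>6} multiplies: independent {n / indep:.2f}/cycle, "
          f"dependent chain {n / chain:.2f}/cycle")
```

    In this model a long run of independent multiplies approaches one result per cycle despite the 4-cycle latency, while a fully dependent chain is stuck at one result every 4 cycles.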
     
  13. Nite_Hawk

    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    1,202
    Likes Received:
    35
    Location:
    Minneapolis, MN
    Hi Stof,

    I'm a developer at the Minnesota Supercomputing Institute. Similar feelings about Cell. I really wish they would make development hardware cheaper to attract more attention. $10k isn't that much in the grand scheme of things, but it's not exactly throw away money either. Cell is a popular topic around here (MSI) mostly because it's neat and exotic. There are few people here that are actually doing any real work on them.

    Nite_Hawk
     
  14. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    42,283
    Likes Received:
    13,864
    Location:
    Under my bridge
    For the sake of experimentation, isn't PS3 a suitable introduction to try things out and gauge performance? IBM's libraries support distributed processing over networked PS3s, right? So you could get 2 or 3 and try out some algorithms and see how well you think it manages for a grand or so. Less if you know a few PS3-owning mates who wouldn't mind lending you their PS3s to run a bit of Linux code on!
     
  15. Mmmkay

    Regular

    Joined:
    Jul 3, 2005
    Messages:
    627
    Likes Received:
    31
    There's little interest at RAL, given its commodity-focused HPC efforts. DP performance of the PS3 is just not worth it, and the eventual HPC Cell products will be out of reach. And that's setting aside the problems with how RAL operates in terms of library and application support. In fact it's probably the latter which has more influence. Neat and exotic just isn't in the language.
     
  16. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    17,856
    Likes Received:
    1,400
    Location:
    Maastricht, The Netherlands
    As Shifty said, precisely what is making the Cell a popular chip in this area is the possibility to just buy that 399 PS3, install Linux on it and get going with the SDKs and excellent documentation. And you can even see examples out there already from people stacking several PS3s too.
     
  17. seebs

    Newcomer

    Joined:
    Nov 29, 2007
    Messages:
    44
    Likes Received:
    0
    Location:
    Minnesota
    It's a good testbed, I think. I did a bunch of stuff on cell simulators early on, and the PS3's faster, even if it's not quite the same.

    I was going to get one of the actual dev systems, but I never got so much as a call back when I tried to contact the nice folks at Mercury. Apparently, they're WAY too busy with important things to even bother to tell me that they don't want my business. :p
     
  18. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    17,856
    Likes Received:
    1,400
    Location:
    Maastricht, The Netherlands
    That's a shame. On the other hand, I guess that also partly answers the thread title. :D
     
  19. seebs

    Newcomer

    Joined:
    Nov 29, 2007
    Messages:
    44
    Likes Received:
    0
    Location:
    Minnesota
    Well, to be fair, I'm just some guy. I wasn't even affiliated with a company -- I just wanted a cell blade system because I do a lot of technical writing, and I could have taken it as a deductible expense, and PROBABLY paid for it with work eventually.

    But I'm just one guy, there's no company involved, so I assume they just figured there wasn't enough business there to justify the effort. It's not as though, if I wrote a lot of articles about it, I'd come back and buy fifty or a hundred more.
     
  20. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    17,856
    Likes Received:
    1,400
    Location:
    Maastricht, The Netherlands
    Probably not, but if they were genuinely bored (i.e. not at 100%+ work capacity), my guess would have been that they'd have gladly sold you one, precisely because you do write articles about it. That's just speculation on my part though.
     
