The SPE as general purpose processor

Discussion in 'Console Technology' started by Frank, Mar 7, 2006.

  1. DarkRage

    Newcomer

    Joined:
    Jul 25, 2005
    Messages:
    70
    Likes Received:
    1
    Location:
    Spain
    I would also recommend that everybody read the 5 documents (well, at least the first 3) about compilation on Cell, which you can find at http://www.research.ibm.com/cell/

    Taken together with an analysis of the ISA, they make it clear that SPEs are designed for single-precision FP vector calculation. Any other usage, such as integer work, is a workaround built on the floating-point vector capabilities. You will find that something like a = b[i+1]+c[i+2] can be extremely dangerous for performance (by the way, I think this was already discussed by DeanoC and aaaaa00 or ERP a long time ago).
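
    To make the a = b[i+1]+c[i+2] concern concrete, here is a minimal Python sketch. It assumes 4-byte elements, arrays that start on a 16-byte boundary, and loads that always fetch a whole aligned 16-byte quadword; the helper names are made up for illustration. It only shows that the two operands never land in the same slot of their quadwords, which is why vectorizing such a statement needs extra realignment work.

```python
# Sketch only: assumes 4-byte elements, arrays starting on a 16-byte
# boundary, and loads that always fetch one aligned 16-byte quadword.
QUADWORD_BYTES = 16
ELEMENT_BYTES = 4  # assumption: 32-bit ints or floats


def slot_in_quadword(index, base_address=0):
    """Return which 4-byte slot of its aligned quadword an element occupies."""
    byte_address = base_address + index * ELEMENT_BYTES
    return (byte_address % QUADWORD_BYTES) // ELEMENT_BYTES


def needs_realignment(i):
    """True when b[i+1] and c[i+2] sit in different slots of their quadwords."""
    return slot_in_quadword(i + 1) != slot_in_quadword(i + 2)


# Columns: i, slot of b[i+1], slot of c[i+2], realignment needed?
for i in range(4):
    print(i, slot_in_quadword(i + 1), slot_in_quadword(i + 2), needs_realignment(i))
```

    Under these assumptions the two slots always differ by one, so every iteration reports that realignment is needed.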

    There are no magic tricks in Cell; it is the same technology we can find in other top-of-the-line products. It has just been tuned for a very specific usage, sacrificing other areas. Integer operations are significantly slower on SPEs than on any other similar core (Xenon, P4 or AMD64). However, FP vector calculation is very strong. It is a very interesting trade-off.

    SPEs can execute almost everything. But nobody has said it can be done efficiently. What's more, IBM is saying that in many cases it cannot.
     
  2. Nemo80

    Banned

    Joined:
    Sep 5, 2005
    Messages:
    128
    Likes Received:
    3


    Erm, the SPEs have the same amount of ALUs as FPUs...
     
  3. add n to (x)

    Newcomer

    Joined:
    Dec 5, 2004
    Messages:
    13
    Likes Received:
    1
    Location:
    London, United Kingdom
    Erm, did you even read the documentation? Try reading section 5 of the SPU Instruction Set Architecture Manual and tell me that integer operations are a "workaround".


    You can't get much faster than single cycle throughput of an instruction. Unless one of those other processors has a >128-bit integer datapath that I don't know about, the SPEs are just as fast for the majority of integer operations.
     
  4. Edge

    Regular

    Joined:
    Apr 26, 2002
    Messages:
    613
    Likes Received:
    10
    Not only that, but you have seven of them at 3.2 GHz each. Anyone who thinks the SPEs are not integer monsters as well as the generally accepted "floating point monsters" has not read the documentation on the CELL SPEs, like you said. You're looking at a *maximum* throughput of 22.4 billion integer instructions per second!
     
  5. Frank

    Frank Certified not a majority
    Veteran

    Joined:
    Sep 21, 2003
    Messages:
    3,187
    Likes Received:
    59
    Location:
    Sittard, the Netherlands
    Linux runs on a lot of processors that don't support pre-emptive multitasking in hardware. It helps when the processor does, but it's not a requirement. And neither are the other things mentioned. And I think it fits the bill of a current, full-fledged OS pretty well.

    So, while the PPE might run a very small micro-kernel of at most a few kB in size to handle dispatching, interrupting and page switching, everything else (32+ MB) can run on an SPE. And while that SPE wouldn't run the whole OS, most other platforms use dedicated hardware and/or other small processors for dedicated things as well.

    Like, the processor in your keyboard. Would we want the main CPU to spend time handling that as well?
     
  6. danteye

    Newcomer

    Joined:
    May 25, 2005
    Messages:
    33
    Likes Received:
    0
    I would like to know how many integer operations Xenon can do instead... does anyone know?
     
  7. Frank

    Frank Certified not a majority
    Veteran

    Joined:
    Sep 21, 2003
    Messages:
    3,187
    Likes Received:
    59
    Location:
    Sittard, the Netherlands
    Btw, there are a great many computations that even hardware as dedicated as a GPU has to do that are purely integer. Indexes, for a start. You cannot build ANY kind of processor that cannot handle those.
     
  8. Robert.L

    Newcomer

    Joined:
    Feb 25, 2006
    Messages:
    84
    Likes Received:
    0
    Well, Xenon has 2 integer/fixed-point units if I'm not mistaken, and AltiVec can also do integers if I remember correctly. Since not much is known about XeCPU's AltiVec, it's hard to say how many GOPS XeCPU does... but there were some slides from a presentation on the 360 architecture that rated each core of XeCPU at 6400 MIPS.
     
  9. Edge

    Regular

    Joined:
    Apr 26, 2002
    Messages:
    613
    Likes Received:
    10
    If we assume one integer instruction per cycle, then that is 9.6 billion integer instructions per second. I think the cores on the 360 CPU, being more complex (dual integer units per core), probably average more like 1.2 to 1.4 integer instructions per cycle, giving 11.52 to 13.44 billion integer instructions per second.

    The two rates I just gave are quite meaningless for figuring out overall throughput, but they give an idea of localized performance.

    My original point was simply that the SPEs are no weaklings in integer performance and, with good programming, can outperform the 360 CPU in that area, especially considering that the PPE on CELL adds an extra 3.2 to 4.48 billion instructions per second (the latter if a maximum of 1.4 instructions per cycle is assumed), bringing the previous 22.4 billion total to roughly 25.6 to 26.9 billion integer instructions per second.

    Hopefully this dispels the myth that CELL is not good at integer work!
     
    #29 Edge, Mar 10, 2006
    Last edited by a moderator: Mar 10, 2006
    Carl B likes this.
  10. j^aws

    Veteran

    Joined:
    Jun 1, 2004
    Messages:
    1,939
    Likes Received:
    42
    7 SPUs * 1 int instruction per core * 3.2 GHz ~ 22.4 Ginst/sec

    Also add the PPE. I believe the PPE can't dual-issue 2 int instructions per cycle,

    1 * 1 int instruction per core * 3.2 GHz ~ 3.2 Ginst/sec

    Cell ~ 22.4 + 3.2 ~ 25.6 Ginst/sec (integer)

    XeCPU ~ 3*3.2 ~ 9.6 Ginst/sec (integer)
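
    The same peak-rate arithmetic as a small Python sketch, assuming (as above) one integer instruction per core per cycle at 3.2 GHz. The function name is hypothetical, and these are theoretical peaks, not measured throughput.

```python
CLOCK_GHZ = 3.2  # clock rate quoted in the thread


def peak_ginst_per_sec(cores, inst_per_cycle=1.0, clock_ghz=CLOCK_GHZ):
    """Peak instruction rate in billions of instructions per second."""
    return cores * inst_per_cycle * clock_ghz


spe_total = peak_ginst_per_sec(7)      # seven SPEs
ppe_total = peak_ginst_per_sec(1)      # one PPE, assumed single int issue
cell_total = spe_total + ppe_total
xecpu_total = peak_ginst_per_sec(3)    # three Xenon cores

print(f"SPEs:  {spe_total:.1f} Ginst/sec")
print(f"PPE:   {ppe_total:.1f} Ginst/sec")
print(f"Cell:  {cell_total:.1f} Ginst/sec")
print(f"XeCPU: {xecpu_total:.1f} Ginst/sec")
```

    Passing a different inst_per_cycle value lets you try other issue-rate assumptions without changing the rest.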
     
  11. j^aws

    Veteran

    Joined:
    Jun 1, 2004
    Messages:
    1,939
    Likes Received:
    42
    I don't think the PPE can dual issue 2 integer instructions per cycle. The peak would still be 1 int inst/ cycle.
     
  12. aaaaa00

    Regular

    Joined:
    Jul 24, 2002
    Messages:
    790
    Likes Received:
    20
    When people say "integer" they mean "everything that is not a flop".

    The most important integer operations are conditionals, branches, loads, and stores. Not integer math and bitwise operators.

    So to be clear, just counting the number of integer math instructions that an SPE can execute per second doesn't give you a clear picture of how good or bad an SPE is at integer operations.
     
  13. j^aws

    Veteran

    Joined:
    Jun 1, 2004
    Messages:
    1,939
    Likes Received:
    42
    Not always.

    They're all important depending on what you're doing. Being explicit about what you're referring to is the best course...

    We've had this discussion before and I agree it's confusing. But I think integer and FP are self evident. Just be explicit with what you're referring to...
     
  14. Edge

    Regular

    Joined:
    Apr 26, 2002
    Messages:
    613
    Likes Received:
    10
    Obviously, if you are talking about integer instructions, you have to include integer math and bitwise operators. It's fine by me to include conditionals, branches, loads and stores, as those are actually handled on the secondary pipeline in the SPE. Even if some of those have multi-cycle execution times, you still have the main pipeline available to do work.

    Also note that we are not discussing operations per second: integer math on 8-bit values in a 128-bit register would be 16 operations per cycle, or 358 billion operations per second on a seven-SPE CELL chip running at 3.2 GHz. I have to check, as I forget whether the SPEs have 8-bit parallel operations on those 128-bit registers. Maybe 16-bit instead?
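
    As a rough sketch of the operations-versus-instructions arithmetic: if one instruction per cycle works on a full 128-bit register, the operation count is just the number of lanes. The Python below assumes seven SPEs at 3.2 GHz and uses a hypothetical helper name; it does not tell you which lane widths the SPE instruction set actually supports.

```python
REGISTER_BITS = 128  # SPE register width mentioned in the thread


def peak_gops(lane_bits, cores=7, clock_ghz=3.2, inst_per_cycle=1):
    """Peak operations per second (in billions) for a given lane width."""
    lanes = REGISTER_BITS // lane_bits  # operations per SIMD instruction
    return lanes * inst_per_cycle * cores * clock_ghz


for lane_bits in (8, 16, 32):
    print(f"{lane_bits:2d}-bit lanes: {peak_gops(lane_bits):.1f} Gops/sec")
```

    With 8-bit lanes this gives the 358 billion figure; halving the lane count halves the result each time.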
     
    #34 Edge, Mar 10, 2006
    Last edited by a moderator: Mar 10, 2006
  15. danteye

    Newcomer

    Joined:
    May 25, 2005
    Messages:
    33
    Likes Received:
    0

    In fact, I asked this because I always read about the strength of Xenon over Cell in integer operations, but I didn't know exactly how many int operations Xenon could do!
     
  16. danteye

    Newcomer

    Joined:
    May 25, 2005
    Messages:
    33
    Likes Received:
    0
    But I've also read on IBM.com that every SPE can do 4 32-bit operations per cycle, which means about 90 GOPS for seven SPEs at 3.2 GHz. Is that right?

    And I've read that every SPE can do 4 single-precision floating-point operations per cycle, which means 12.8 GFLOPS per SPE, but this result contradicts the 25 GFLOPS in the IBM documents...

    How can you explain this?
     
  17. Guden Oden

    Guden Oden Senior Member
    Legend

    Joined:
    Dec 20, 2003
    Messages:
    6,201
    Likes Received:
    91
    I think it's important to distinguish between instructions and operations in these kinds of discussions, as people sometimes confuse the two: one person might mean one thing without saying so explicitly, and another might assume they mean the other, also without saying so, and so on.

    Each core in the Xenon CPU does two instructions per cycle. Peak! It will likely be (much) less in reality. Either of these instructions can be ONE of: integer op (math, bit manipulation etc), branching/load/store, float math, or VMX math.

    So, you can't issue two VMX instructions per clock, or two float math instructions, etc. But you could have one VMX math instruction and a load/store. There might be other restrictions that apply as well; I haven't read any detailed technical docs on this subject, and this information might be restricted-access anyway...

    Now, a VMX instruction might be multiple operations all in one single instruction. For example, multiplying 3.141593 by four different 32-bit numbers packed into one 128-bit register, or some such; this is called SIMD, which stands for single instruction multiple data. So the operations count might be higher than two per cycle, but the instructions count won't ever be higher than 2.
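
    A toy Python model of that distinction, purely for illustration: it treats one SIMD multiply as a single instruction that performs one operation per lane. The function name and the returned counters are assumptions of this sketch, not anything from VMX or the SPE instruction set.

```python
def simd_multiply(scalar, lanes):
    """Model one SIMD multiply over the packed values in a register.

    Returns (values, instructions issued, operations performed).
    """
    values = [scalar * x for x in lanes]
    return values, 1, len(lanes)


# Four 32-bit numbers packed into one 128-bit register (modelled as a list).
packed = [1.0, 2.0, 3.0, 4.0]
values, instructions, operations = simd_multiply(3.141593, packed)
print(values)
print(f"{instructions} instruction, {operations} operations")
```

    The instruction counter stays at one however many lanes are packed in; only the operation counter grows.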

    As for Cell, it was long said the PPE core also had dual-issue capabilities, and Jaws thinks that might not be the case, well, who knows for sure really. :) In any case, each SPE has one instruction pipe that exclusively deals with floats and one that exclusively deals with integers/everything else. I think these co-issue as well, but it might be one at a time only. In any case, the instruction versus operation distinction applies here as well. Cell SPEs have integer SIMD instructions, I don't know if Xenon/VMX does, but it's likely that's the case. After all, good ol' x86 has had it since the late 90s when MMX appeared on the scene with much thunder and little else. ;)
     
    Shifty Geezer likes this.
  18. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,586
    Likes Received:
    981
    aaaaa00 is right.

    Repetition: Back in the day, floating point operations were expensive multi-cycle operations carried out by off-chip (and expensive) co-processors. Back then it made sense to look at the performance of these bolt-on devices in isolation, hence people talked about floating point performance. The main CPU would do the program control flow, calculate addresses for loading and storing values, etc., all work done on integer registers. Therefore integer performance was used to describe the performance of the main CPU.

    Back then FP operations were so expensive (both high latency and low throughput) as to render almost all other house-keeping tasks (integer) of a compute heavy program irrelevant.

    So back then it made sense to characterize a system by its floating point and integer performance. Floating point performance being decisive for the overall system performance.

    Fast forwarding from the 80s:

    Then these co-processors got integrated onto the CPU dies... Then they got pipelined.

    The huge increase in throughput (from 1 every 50th cycle to 1 every cycle) and the massive reduction in FP op latency (from 50+ cycles to 3-5) now meant that the remaining "integer" performance of the system started to be more and more important.

    Modern CPUs added short vector support (SIMD) in their FPUs. These do not only handle arithmetic with floating point numbers but also arithmetic with various bit-width integers.

    While the arithmetic part of the CPU has enjoyed a massive increase in throughput, the remainder, the part described as "integer" in the old-fashioned nomenclature, has not. Simply because it is a lot harder.

    The main CPU still has to do all the program flow, the calculation of addresses, loading and storing values. Program flow is inherently sequential in nature; load/store is increasingly limited by bandwidth and, more importantly, latency.

    Today it would be better to characterize a CPU by:
    1. Arithmetic (integer or floating point)
    2. Program control flow (branch resolution, predication etc.)
    3. load/store (bandwidth, latency and number of transactions/cycle)

    If you absolutely insist on describing a modern CPU by "floating point" and "integer" performance you should *not* count integer arithmetic towards integer performance, since that would tell you nothing about 2.) and 3.) of a particular CPU.


    Using the above to characterize modern CPUs:
    Xenon (1 core): 1 - very high, 2 - high, 3 - high
    PPE: 1 - very high, 2 - high, 3 - high
    SPE: 1 - very high, 2 - low, 3 - mixed (*)
    Merom/Conroe: 1 - very high, 2 - very high, 3 - very high
    A64: 1 - high, 2 - very high, 3 - high to very high

    (*) mixed: very high bandwidth, somewhat low latency in local store, high latency out of LS with an archaic (1960s) memory model.

    Cheers
    Gubbi
     
    #38 Gubbi, Mar 10, 2006
    Last edited by a moderator: Mar 10, 2006
    blakjedi and Shifty Geezer like this.
  19. danteye

    Newcomer

    Joined:
    May 25, 2005
    Messages:
    33
    Likes Received:
    0

    ok, i understand!! thank you very much!

    Last question: how do you explain the fact that the IBM documents say an SPE can do 4 flops per cycle, which means 13 GFLOPS instead of 25? Maybe every SPE can do 2 instructions per cycle?
     
  20. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,586
    Likes Received:
    981
    It depends on how you count. An SPE can do a 4-way fused multiply-add every cycle, which is 4 muls and 4 adds, or 8 ops. Some count the mul-add as one op, and then you only get 4.
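
    The two ways of counting, as a small Python sketch. It assumes one 4-way fused multiply-add per cycle at 3.2 GHz per SPE; the function name and parameters are made up for illustration.

```python
def spe_peak_gflops(flops_per_fma, lanes=4, clock_ghz=3.2):
    """Peak GFLOPS for one SPE issuing one 4-way multiply-add per cycle."""
    return lanes * flops_per_fma * clock_ghz


counted_as_two = spe_peak_gflops(flops_per_fma=2)  # mul and add counted separately
counted_as_one = spe_peak_gflops(flops_per_fma=1)  # mul-add counted as a single op

print(f"mul + add as 2 ops: {counted_as_two:.1f} GFLOPS")
print(f"mul-add as 1 op:    {counted_as_one:.1f} GFLOPS")
```

    Same hardware in both cases; only the definition of an "op" changes, which is where the roughly 13 versus 25 figures come from.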

    Cheers
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.