ISSCC 2005

Discussion in 'Console Technology' started by ChryZ, Jan 20, 2005.

  1. Jov

    Jov
    Regular

    Joined:
    Dec 16, 2002
    Messages:
    506
    Likes Received:
    3
    Is it possible to integrate one PE + nVidia GPU on the same silicon in the future (I guess anything is possible, but how practical is another question)?

    If Sony doesn't plan to have a PE+GPU, then they'll probably want to utilise the transistor count on 65nm, no? Thus it’s a good possibility to have more than one PE.
     
  2. AutomatedMech

    Newcomer

    Joined:
    Feb 7, 2005
    Messages:
    10
    Likes Received:
    0
    Kutaragi Ken's numbers.

    Frequency : 4 GHz logical.
    FLOPS : 256 GFLOPS with compressed floats.

    My REAL numbers.

    Frequency : 1 GHz physical
    FLOPS : 64 GFLOPS with uncompressed normal floats.

    The only thing revolutionary about CELL is the use of compressed floats to boost FLOPS performance figure. This, I gotta give it to Kutaragi Ken.

    Too bad this trick does not work on scientific and physics computing. Run LINPACK on CELL and you are lucky to get 32 GFLOPS sustained.
     
  3. ninelven

    Veteran

    Joined:
    Dec 27, 2002
    Messages:
    1,742
    Likes Received:
    152
  4. Psikotiko

    Newcomer

    Joined:
    Nov 18, 2002
    Messages:
    192
    Likes Received:
    0
    Location:
    Lisbon, Portugal
    Ars Article about the Cell (Part 1)

    http://arstechnica.com/articles/paedia/cpu/cell-1.ars
     
  5. SiBoy

    Newcomer

    Joined:
    Apr 26, 2004
    Messages:
    117
    Likes Received:
    1
    Re: hahahahaha

    Sorry, you're completely and utterly wrong. The ALU runs at 4GHz+. The presentations went into the number of FO4 levels per pipe stage, etc.

    Regarding power consumption, there was the peculiar comment that the power given was JUST the dynamic power consumption. One entry in the shmoo plot was 2W, but this didn't include 1.3W of leakage and 1.7W of clock power (5W total). I might have the leakage vs. clock backwards, but those were the raw numbers.
     
  6. one

    one Unruly Member
    Veteran

    Joined:
    Jul 26, 2004
    Messages:
    4,838
    Likes Received:
    167
    Location:
    Minato-ku, Tokyo
    From left to right, Yoshio Masubuchi, Director of Engineering, Toshiba America Electronic Components, Jim Kahle, IBM Fellow, and Masakazu Suzuoki, Vice President of Microprocessor Development Department, SCEI, 0WN j00 :p
     
  7. SiBoy

    Newcomer

    Joined:
    Apr 26, 2004
    Messages:
    117
    Likes Received:
    1
    I didn't get this impression. The local store was single ported only, and stores all instructions+data. I think the 4 banks were a physical partitioning only.

    There are 3 "customers" for the local store, each arbitrating for access on a per cycle basis:

    1) DMA transfer - highest priority, 128 bytes per access
    2) data load/store - 2nd priority
    3) instruction fetch - lowest priority, 128 bytes per access

    They claimed they could get 80-90% utilization of the local store's single port interface.
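
    To make the priority order concrete, here is a rough sketch of a fixed-priority, one-grant-per-cycle arbiter. This is only my reading of the list above, not the actual hardware logic; the requester names and the `arbitrate` function are made up for illustration.

```python
# Fixed-priority arbitration for a single-ported local store (illustrative only).
# Order follows the notes: DMA first, then load/store, then instruction fetch.
PRIORITY = ("dma", "load_store", "ifetch")


def arbitrate(requests):
    """Return the requester granted the port this cycle, or None if it is idle."""
    for name in PRIORITY:
        if name in requests:
            return name
    return None


# One set of pending requests per cycle (hypothetical traffic).
cycles = [{"ifetch"}, {"load_store", "ifetch"}, {"dma", "load_store"}, set()]
grants = [arbitrate(r) for r in cycles]
print(grants)  # ['ifetch', 'load_store', 'dma', None]
```

    Instruction fetch only wins when nobody else is asking, which fits with it pulling 128 bytes per access so it does not need the port very often.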
     
  8. SiBoy

    Newcomer

    Joined:
    Apr 26, 2004
    Messages:
    117
    Likes Received:
    1
    Here are my other notes from the SPU talk (the Cell talk is tomorrow). Sorry they are a little disorganized, they are scribbled on an envelope :)

    One of the goals of the SPU was obviously simplicity. The local store is not a cache, so there are no misses, no tags, no backing store. Likewise there are no complex instructions (I guess your definition of complex is relative). But no divide - multiply-add and permute seemed the most complex of the instructions. The philosophy was that every time something complex came up, they asked themselves if it was better off adding it, or keeping the SPU simple and packing more SPU's on a chip.

    The DMA was presented as a big deal. They support scatter/gather, etc. DMA can be overlapped with computation by using S/W multithreading on a single SPU (run one compute thread while another is waiting for a DMA, etc.). DMA accesses are up to 16 kilobytes each.
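
    The overlap idea is basically double buffering. A minimal sketch, assuming a worker thread stands in for the DMA engine; `dma_get`, `compute`, `CHUNK` and `MAIN_MEMORY` are all invented for the example and have nothing to do with the real SPU interface:

```python
from concurrent.futures import ThreadPoolExecutor

MAIN_MEMORY = list(range(32))  # stand-in for system memory
CHUNK = 8                      # elements per transfer (hypothetical size)


def dma_get(offset):
    """Hypothetical stand-in for a DMA read into the local store."""
    return MAIN_MEMORY[offset:offset + CHUNK]


def compute(buf):
    """Some work on the data that has already arrived."""
    return sum(x * x for x in buf)


def run():
    total = 0
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(dma_get, 0)  # prefetch the first chunk
        for offset in range(CHUNK, len(MAIN_MEMORY) + CHUNK, CHUNK):
            buf = pending.result()        # wait for the transfer in flight
            if offset < len(MAIN_MEMORY):
                pending = dma.submit(dma_get, offset)  # kick off the next one
            total += compute(buf)         # runs while that transfer proceeds
    return total


print(run())
```

    The point is just that the next transfer is started before the current buffer is processed, so the wait is mostly hidden behind the compute.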

    Some definition clarifications. SPE referred to the combination of the SPU and its DMA unit.

    Most SPU instructions take three 128-bit input operands.

    A single 128x128-bit register file is shared for fixed and floating point values.

    GFLOPS rating followed the simple math - 4-way SIMD of multiply-add operation. 2 ops (multiply+add) x 4-way SIMD x 4GHz = 32 GFLOPS per SPU. 8 SPU's per BE (yes, the Cell was explicitly referred to as the broadband engine) is 256 GFLOPS total.
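
    The same peak-rate arithmetic spelled out (variable names are mine; the numbers are the ones quoted in the talk):

```python
ops_per_fma = 2   # a multiply-add is counted as two floating point ops
simd_lanes = 4    # 4-way single-precision SIMD
clock_ghz = 4     # quoted clock
spus = 8          # SPUs per BE

gflops_per_spu = ops_per_fma * simd_lanes * clock_ghz
gflops_total = gflops_per_spu * spus
print(f"{gflops_per_spu} GFLOPS per SPU, {gflops_total} GFLOPS per chip")
```

    This is a peak figure: it assumes every SPU issues a multiply-add on all four lanes every cycle.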

    Branch mis-predicts are 18 cycles, so this has to be carefully managed in S/W. Mux instruction is used to avoid branches (compute both sides of an if-then and select the result instead of branching around one).
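
    A rough illustration of the compute-both-sides-and-select pattern on 32-bit values. It is in Python only to show the idea (Python itself still branches internally), and `select` and `branchless_max` are names I made up, not SPU mnemonics:

```python
MASK32 = 0xFFFFFFFF


def select(a, b, mask):
    """Bitwise select: bits of b where mask is 1, bits of a where mask is 0."""
    return (a & ~mask & MASK32) | (b & mask)


def branchless_max(x, y):
    """Max of two unsigned 32-bit values without an if/else on the result."""
    mask = -(x < y) & MASK32  # all ones if x < y, otherwise all zeros
    return select(x, y, mask)


print(branchless_max(3, 7), branchless_max(9, 2))  # 7 9
```

    Both candidate results exist before the select, so there is no branch to mispredict; the cost is doing the work for both sides every time.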

    Load/store unit has 6 cycle latency for accesses to the local store.

    Presented as a middleground between a CPU and a GPU.

    Interestingly enough, all the power numbers being quoted were for the example of a single-precision transformation+lighting benchmark. They claimed achieving 1.4 IPC for this. The loop was unrolled 4 times to hide the 6 cycle latency.

    The SPU is dual issue, but it is completely in-order. There is no register renaming or reordering of anything.
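
    To see why unrolling helps on an in-order machine, here is a toy cycle-count model. It assumes one instruction issued per cycle and that each instruction depends only on the previous one in its own chain; this is a big simplification of the real pipeline, and `issue_cycles` is just an illustrative helper:

```python
def issue_cycles(n_ops, chains, latency):
    """Cycles until the last result is ready when n_ops are interleaved
    round-robin over `chains` independent dependency chains, in order."""
    issue = []
    for i in range(n_ops):
        earliest = issue[-1] + 1 if issue else 0  # in-order, one per cycle
        if i >= chains:
            # must wait for the previous op of the same chain to finish
            earliest = max(earliest, issue[i - chains] + latency)
        issue.append(earliest)
    return issue[-1] + latency


print(issue_cycles(24, 1, 6))  # 144: every op waits out the full latency
print(issue_cycles(24, 4, 6))  # 39: four independent chains fill most gaps
```

    In this toy model four chains do not hide the 6 cycles completely; a real loop body has several instructions per iteration, which fills the remaining slots.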

    Circuits are about 20% dynamic logic, 80% static logic.

    Another interesting factoid, the interconnect between SPU's is set up as a ring, so adjacent SPU's can pass data between their 256KB local stores. In this way the SPU's can be set up as a simple pipeline.

    That's it for now, I'll take more notes in tomorrow's BE presentation :)
     
  9. Inane_Dork

    Inane_Dork Rebmem Roines
    Veteran

    Joined:
    Sep 14, 2004
    Messages:
    1,987
    Likes Received:
    46
    Thanks so much. I appreciate it. :)
     
  10. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,723
    Likes Received:
    242
    http://arstechnica.com/articles/paedia/cpu/cell-1.ars/2

    so each Synergistic Processing Element (formerly known as the APU) has 21 million transistors. that's more than the entire Emotion Engine which was 13 ~ 13.5 million transistors.

    it's nice that they doubled the Local Storage from 128K for the APU in the patent to 256K in the actual implementation, now called the SPE.


    Now be bold Sony and give us 32 to 64 SPEs in Playstation3. I don't care how they're divided up. 8)
     
  11. Acert93

    Acert93 Artist formerly known as Acert93
    Legend

    Joined:
    Dec 9, 2004
    Messages:
    7,782
    Likes Received:
    162
    Location:
    Seattle
    Thanks SiBoy. That clears up a lot. Good info, and also a great tid-bit on the BE.

    I know some of you are disappointed, but holy cow! 256 GFLOPS?! If you consider what Sony did with the PS2 and arguably the least powerful console, think of what they will do with possibly the most powerful HW.
     
  12. randycat99

    Veteran

    Joined:
    Jul 24, 2002
    Messages:
    1,772
    Likes Received:
    12
    Location:
    turn around...
    I think the only persons here that are "disappointed" are the ones that live to pick apart every bone in a Sony product, anyway. The rest of us are quite hopeful of what "256 GFLOPs" will bring in future PS3 games, for sure. To say this has been an ambitious project does not really do it justice. I think it is safe to say that those here who are genuinely intrigued with the project are happy to see it is progressing along and reaching its milestones pretty much on schedule.
     
  13. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Oh god, fugly ... so the software can know 18 cycles ahead of time which way a branch will go, but will not have a way to tell the hardware that? Personally I would even prefer delay slots over this (split branches are even better, don't really expose the pipeline and they save you the headache of pro/epilogue code, but I guess they might be patented). I still prefer a disposable ISA over inefficient hardware.
     
  14. Bohdy

    Regular

    Joined:
    Jun 9, 2003
    Messages:
    731
    Likes Received:
    4
    Hang on, since when are multiply-adds counted as 2 ops? Afaik current processors specs treat a fused multiply-add as one op.
     
  15. one

    one Unruly Member
    Veteran

    Joined:
    Jul 26, 2004
    Messages:
    4,838
    Likes Received:
    167
    Location:
    Minato-ku, Tokyo
    It seems so.
    A possible configuration in the PS3:
    128MB XDR-DRAM @ 25.6 GB/s - CPU - (76.8GB/s FlexIO) - GPU - 128MB XDR-DRAM @ 25.6 GB/s
     
  16. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,360
    Likes Received:
    1,377
    It has been par for the course in supercomputing bragging rights for literally decades, and the practise has carried over into other fields of computing as well.
    It makes some sense - if one architecture is capable of single cycle multiply-add, and another is not, should not the one capable of performing two floating point operations in one cycle be credited for that? And it is two floating point ops fused into one computer instruction.
    Again, it's standard practise when counting max FLOPS.
     
  17. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    I read the "manage in software" as "we'll leave it to the programmer to sort it out". I.e. the programmer (or compiler, but let's be real :)) has to ensure that the predicate used in the branch is calculated 18 cycles ahead (36 instructions on a dual-issue pipe, *ugh*); if the predicate is not calculated in time, the hint bit will decide which way the branch is speculated.

    Cheers
    Gubbi
     
  18. PZ

    PZ
    Newcomer

    Joined:
    Nov 29, 2004
    Messages:
    16
    Likes Received:
    0
    Yeah, I am concerned that this chip will require a super compiler, OS, and scheduler to work properly without overwhelming the programmer. It seems as though they took a lot of the complexity out of the hardware in order to get speed and shifted the complexity to the OS and compiler and ultimately back onto the general purpose PPC core (is that good?). I wonder how much these tools will do versus the programmer having to manage the cache, set up pipelines, control local stores, etc. and how much real world performance will be lost without the dedicated cache/memory management.
     
  19. phat

    Regular

    Joined:
    Feb 13, 2002
    Messages:
    496
    Likes Received:
    3
    Location:
    Waterloo, ON Canada
    I think actually the convention is an FMAC counts as 2 FP ops. The FMAC's I've encountered can all be used to either just add (multiplier has selectable hard value 1), just multiply (accumulator has selectable assignment instead of summation), or add and multiply, so, conceptually, each FMAC can be viewed as two independently selectable operations.
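
    Conceptually something like this, as a minimal sketch. The function names are mine, and plain Python floats round after the multiply and again after the add, so this is not a true fused operation, just the dataflow:

```python
def fmac(a, b, c):
    """Multiply-accumulate: a * b + c (not actually fused in Python)."""
    return a * b + c


def add_only(a, c):
    """Use the unit as an adder by forcing the multiplier input to 1."""
    return fmac(a, 1.0, c)


def mul_only(a, b):
    """Use the unit as a multiplier by forcing the addend to 0."""
    return fmac(a, b, 0.0)


print(fmac(2.0, 3.0, 4.0), add_only(2.0, 3.0), mul_only(2.0, 3.0))  # 10.0 5.0 6.0
```

    One unit covers the add, the multiply, or both at once, which is the argument for counting a full FMAC as two ops.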
     
  20. PZ

    PZ
    Newcomer

    Joined:
    Nov 29, 2004
    Messages:
    16
    Likes Received:
    0
    Hmmm... did PS3 programming just start to suck :D
     