Creating High Performance Radar Applications with the Cell Broadband Engine

Discussion in 'CellPerformance@B3D' started by Carl B, Mar 28, 2007.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    I skimmed over and missed the section with the distributed transpose/adjoin.

    For the given application, distributing with the SPEs is a major win. I'd rather they didn't have to spend time working around the PPE, but the overall results are very good.
     
  2. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,725
    Likes Received:
    11,196
    Location:
    Under my bridge
    A big question then is what should happen to the PPE. Why's it so uninspired, and how big and powerful should it go in future Cells? In these tasks, what's holding it up, and could the hardware be improved to negate that - i.e. is it a problem with memory access patterns that can't be solved, or is the hardware just plain gimped for some reason?
     
  3. Bigus Dickus

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    943
    Likes Received:
    16
    That was my first thought as well when reading the first post or three in the thread. But if the architecture itself is promising, no doubt money can be found to work on hardening.
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The PPE's a narrow in-order core with a long pipeline, 2-cycle latency for many common integer instructions, okay branch prediction, and an okay amount of cache that has to contend with the variable memory traffic of up to eight other cores, possibly arbitrating between them at unpredictable intervals.
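
    To make that concrete, here's a minimal C sketch (mine, not from the article under discussion) of why 2-cycle integer latency stings on an in-order core: a serially dependent chain stalls on every instruction, and the standard workaround is to split the work across independent accumulators so the pipeline can overlap them.

    ```c
    /* Minimal sketch: a serially dependent add chain. On a narrow
     * in-order core with 2-cycle integer latency, every add waits on
     * the previous one, so roughly half the issue slots go empty. */
    long serial_sum(const long *a, int n)
    {
        long s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];              /* each add depends on the last */
        return s;
    }

    /* The usual software fix: independent accumulators break the chain
     * so back-to-back adds can overlap in the pipeline. */
    long unrolled_sum(const long *a, int n)
    {
        long s0 = 0, s1 = 0;
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            s0 += a[i];             /* these two adds are independent */
            s1 += a[i + 1];
        }
        if (i < n)
            s0 += a[i];             /* odd tail element */
        return s0 + s1;
    }
    ```

    An out-of-order core finds that overlap on its own; on the PPE the compiler or programmer has to expose it by hand.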

    It's multithreaded, but it's so narrow that the simultaneous part of SMT is kind of uncertain. Resource contention is such that the best case is one thread working heavily with integer code while the other works with VMX. However, the SPEs are taking more of the FP/VMX code, so that probably leaves mostly less-than-optimal integer code.

    Perhaps there are scenarios where both threads are integer-heavy and likely to have long periods where both are stalled on the in-order core.

    It's hard to say because I haven't seen an in-depth comparison; Cell's newness and more limited exposure mean the web doesn't yet have the kind of very good technical analyses that exist for x86 cores.

    addendum: In the future, the SPEs may be made more effective at working around the PPE, or the PPE may be beefed up or more heavily threaded.

    Perhaps, though I wonder if it has to be more than that. I know that hardware in military and space applications usually lags in process size. Reliability problems due to interference and other effects are known to worsen exponentially with process shrinks. Current chips have ECC and other error-correction measures put in just to keep error rates within the same neighborhood as larger-geometry designs.

    If there isn't a redesign, it may mean Cell must be made on a larger process.
    A 0.18 or 0.13 micron Cell would be huge, and thermal requirements could keep it clocked well below current speeds.

    This may keep it from entering fields that require very high tolerances.
    Cell's big win is not that it's new or revolutionary, but that it can do all of this on-chip. That simply was infeasible prior to 90nm. If hardening requirements mean the chip must be manufactured at a larger geometry, Cell becomes infeasible again.

    Multi-chip systems with control processors and multiple DSPs have already been done. For highly specialized hardened apps, Cell may be too much and still too general.

    I imagine there are less stringent areas that may be friendlier to adopting Cell.
    I also imagine the software and tool sets being demonstrated could be made to apply to other heterogeneous platforms, if Cell proves inappropriate for some applications.
     
    #24 3dilettante, Mar 30, 2007
    Last edited by a moderator: Mar 30, 2007
  5. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,725
    Likes Received:
    11,196
    Location:
    Under my bridge
    Do you think it'd be a viable option to lose the PPE's VMX abilities and spend the transistors on better branching and integer performance? I guess there's an amount of legacy support needed for PPC code, but going forward, Cell-specific code would on the whole be avoiding the PPE's VMX, no? I've no idea how much room VMX takes up, or how much effort is needed to improve the other aspects. Given how many transistors a conventional PPC uses, you're not going to fit something with that performance in the PPE's space.
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    I can't find a die shot that shows the outline of the PPE's units.

    I'm not sure the savings would be enough to justify a break in PPC compatibility.
    It's also up to the designers of Cell2 to decide if they want to break backwards compatibility.
     
  7. inefficient

    Veteran

    Joined:
    May 5, 2004
    Messages:
    2,121
    Likes Received:
    53
    Location:
    Tokyo
    You wouldn't necessarily have to break compatibility if the SIMD instructions were implemented in microcode: broken down into simpler operations and then executed over multiple cycles.

    You would lose performance. But if Cell2 were going to be several times faster anyway, everything could balance out so that, at worst, Cell2 ran VMX code only as fast as Cell1.
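
    As a rough illustration of what that decomposition would mean (a hypothetical sketch, not the actual PPE microcode format), a single 128-bit VMX vaddfp could be cracked into four scalar float adds that complete over several cycles:

    ```c
    #include <stdio.h>

    /* Hypothetical illustration: one 128-bit VMX vaddfp (vector float
     * add) "cracked" into four scalar adds. Types and names are mine,
     * not a real microcode format. */
    typedef struct { float f[4]; } vec128;

    vec128 vaddfp_cracked(vec128 a, vec128 b)
    {
        vec128 r;
        for (int lane = 0; lane < 4; lane++)
            r.f[lane] = a.f[lane] + b.f[lane];  /* one micro-op per lane */
        return r;
    }

    int main(void)
    {
        vec128 x = { { 1.0f, 2.0f, 3.0f, 4.0f } };
        vec128 y = { { 0.5f, 0.5f, 0.5f, 0.5f } };
        vec128 z = vaddfp_cracked(x, y);
        printf("%.1f %.1f %.1f %.1f\n", z.f[0], z.f[1], z.f[2], z.f[3]);
        return 0;
    }
    ```

    The architectural contract stays intact; only throughput suffers, which is the trade described above.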

    I think it is unlikely, though. The VMX unit's cost should be nothing compared to the rumored 40% transistor increase for an out-of-order design.
     
  8. Kryton

    Regular

    Joined:
    Oct 26, 2005
    Messages:
    273
    Likes Received:
    8
    Cell takes Amdahl's law very seriously, along with the whole RISC philosophy. I can't personally see what sort of microcode they could use, because many of the operations are already the basic building blocks (yes, a few could be built out of microcode).
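
    For anyone wanting numbers, Amdahl's law is speedup = 1 / ((1 - p) + p/s), where p is the parallelizable fraction and s the speedup on that fraction. A quick C check with made-up figures (mine, purely illustrative) shows how the serial fraction left on the PPE caps the whole chip:

    ```c
    #include <stdio.h>

    /* Amdahl's law: overall speedup when fraction p of the work
     * gets a factor-s speedup and the rest stays serial. */
    static double amdahl(double p, double s)
    {
        return 1.0 / ((1.0 - p) + p / s);
    }

    int main(void)
    {
        /* Illustrative numbers: even if the SPEs give an 8x boost on
         * 90% of the work, the serial 10% left on the PPE caps the
         * whole chip at about 4.7x. */
        printf("speedup = %.2f\n", amdahl(0.90, 8.0));
        return 0;
    }
    ```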

    Anyway, if they put the microcode converter on the front of the instruction cache, the only lengthening of the critical path would be in the memory fetch. That latency is already ridiculously large (in CPU terms), so the translation phase would have an almost negligible impact.
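
    A toy model of that idea (my sketch, with invented sizes and encodings): translate ops into micro-ops while the instruction-cache line is being filled, so the translator only ever runs on the miss path, where memory latency dominates anyway.

    ```c
    #include <stdint.h>

    #define LINES 64
    #define UOPS_PER_LINE 16

    /* Toy translate-on-fill icache, direct-mapped with 64-byte lines. */
    typedef struct {
        uint32_t tag;
        int      valid;
        uint32_t uops[UOPS_PER_LINE];      /* pre-translated micro-ops */
    } cache_line;

    static cache_line icache[LINES];

    /* Stub for the miss path: in hardware the translation would overlap
     * the (already long) memory fetch, which is the point above. */
    static void fetch_and_translate(uint32_t addr, uint32_t *uops)
    {
        for (int i = 0; i < UOPS_PER_LINE; i++)
            uops[i] = addr + (uint32_t)i;  /* dummy micro-op encoding */
    }

    /* Hits return pre-translated micro-ops with zero added latency;
     * only misses ever touch the translator. */
    const uint32_t *fetch_uops(uint32_t addr)
    {
        uint32_t idx = (addr >> 6) % LINES;  /* index: address bits 6-11 */
        uint32_t tag = addr >> 12;           /* tag: the bits above those */

        if (!icache[idx].valid || icache[idx].tag != tag) {
            fetch_and_translate(addr & ~63u, icache[idx].uops);
            icache[idx].tag = tag;
            icache[idx].valid = 1;
        }
        return icache[idx].uops;
    }
    ```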
     