POWER8 - IBM goes ballistic on big processors

Discussion in 'PC Industry' started by fellix, Aug 27, 2013.

  1. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,525
    Likes Received:
    460
    Location:
    Varna, Bulgaria
    4GHz 12-core Power8 for badass boxes
    :shock:
     
  2. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
The introduction of Centaur, an L4-cache/memory-controller die, is very interesting. Combined with stacked RAM, it could eventually obviate the need for separate commodity DIMMs and memory interfaces on motherboards. The marginal benefit of stamping out more identical cores versus adding cache looks like it's starting to favor cache now, even in GPU-like products. I know GCN has a rather sophisticated cache hierarchy, for instance.
     
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,523
    Likes Received:
    4,590
    Location:
    Well within 3d
    IBM has brought back the north bridge, or rather bridges. It's interesting to see articles about the 40ns link latency. I remember when the removal of that latency was celebrated with IMCs.
    The latency didn't matter as much for the server loads POWER targets, so it's not as strong a reversal as it would be if a desktop chip did the same.

    I find it interesting that people bring up the massive DRAM bandwidth to the Centaur chips, given the much narrower link to the CPU. I would think that is more of a capacity play than a bandwidth one. Why not load up on a massive amount of slower DRAM for less money, so long as you still match the more limited link bandwidth?

    I do recall Nvidia patenting something like this for GPUs, although it never came to market.
    The rationalization was the same: being free from having to redesign the main chip for different memory types.
    The other side of that, however, is an admission that it has been decided that the main chip is going to evolve more slowly than the glacial cadence of memory standards. Nvidia instead went back and reworked its memory controllers after a generation or two of potentially less effective implementations.

    This could be a hedge due to some kind of uncertainty as to what memory is going to be most useful in the next couple of years, or a provision for different markets if licensing takes off.
    The opening of IBM's design platform might mean that it could wind up in places that like one form of memory over the other, which the decoupled memory architecture might simplify.

    The thresholds for in-package memory, Hybrid Memory Cube, DDR4, optical links to DRAM on a different rack, and possibly some evolution of current standards in stacked form might all hit in an uncomfortably close time window. In theory, a chip like Centaur could become the base layer of a memory stack, along the same lines as HMC.

    Either scenario happens if IBM's silicon cadence can't match the speciation of memory platforms, although the one where the explosion of options is impractical to match is more positive than the one where IBM has slowed below the rate of basic DDR evolution.


    Eh. It's more sophisticated than previous AMD GPU designs. I'm not sure it measures up to any CPU capable of SMP between two caches. The protocol for the next-gen consoles, for example, seems very primitive: ordering is almost nonexistent, and for internal coherence it relies on a shared L2 with sections statically tied to memory controllers.
     
    #3 3dilettante, Aug 28, 2013
    Last edited by a moderator: Aug 28, 2013
  4. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,175
    Location:
    La-la land
    It has 8-way multithreading and absolutely oodles of cache, so a little extra main RAM latency probably won't affect it much at all, regardless of load. :) It's not as if processors suffered horribly because of RAM latency in the era before integrated MCs anyway...

    In any case, what a monster chip. Absolutely frightening specs on just about every level. Too bad it'll never see the insides of any consumer gear. Makes me sad.
     
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,523
    Likes Received:
    4,590
    Location:
    Well within 3d
    For the desktop, it was the major performance differentiator between K7 and K8, and kept AMD ahead of Northwood until the Prescott misfire made the difference obvious.

    The massively cached EE chips were what Intel resorted to in order to make up for the latency, so there was suffering involved for non-server loads.

    There was also a bandwidth limitation with the FSB that IMCs could get around, mostly. AMD eventually managed to bottleneck itself in its on-die crossbar when memory speeds went higher and the on-die northbridge clocks didn't keep pace.
    POWER8 reintroduces the bottleneck with that rather pedestrian link bandwidth, although it sort of compensates by having multiple links. It seems unfortunate the chip link couldn't be a bit closer to the massive DRAM to Centaur bandwidth.
     
  6. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    Since we were looking at something like 100 ns of latency from Sandy Bridge to DDR3, 40 ns doesn't sound too bad, provided there isn't much more added between Centaur and the actual RAM. I'd like to see a graph like this for the POWER8 cache hierarchy:

    http://images.anandtech.com/doci/7003/latency_575px.png

    http://images.anandtech.com/doci/6993/bandwidth_575px.png

    Intel seems to have kept latency nicely under control with its four-tiered cache, with clear latency benefits within the reach of the L4.

    Yes, GCN doesn't have a cache hierarchy that's anywhere near as sophisticated as a CPU's; I just meant that AMD saw fit to devote a lot of die space to cache and cache-related logic in that generation.
     
  7. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    By the same token, Core 2 did okay without it, and while Nehalem was faster, it's hard to say whether the IMC contributed the bulk of that. Core 2's latency was also far lower than Prescott's (and IIRC far lower than K7's as well), so it at least shows what can be accomplished without an IMC.

    Something I'm wondering about these Centaur chips: if they really can come with stacked DRAM, could that have any positive benefit for latency, helping to negate the cost of moving the controller off-chip?
     
  8. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,175
    Location:
    La-la land
    I think you speak too lightly of bottlenecks; you don't know all the details and what secret sauce is in this chip.

    Also, K7 is a decade old by now and completely outclassed by this chip or any other current offering; ten years is half an eternity in computing terms, technology has progressed enormously since then, and the issues off-die MCs suffered back then won't necessarily apply to POWER8. For example, memory traffic traversed the CPU I/O bus in pre-K8 and pre-Nehalem systems; that's not true for POWER8.
     
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,523
    Likes Received:
    4,590
    Location:
    Well within 3d
    Sandy Bridge has a memory latency of less than 50ns.

    The DRAM to Centaur latency should be in the same neighborhood as what you get with an IMC, minus the full cache hierarchy.
    POWER8 would then add the 40ns link, and then the latency of the on-die memory hierarchy.
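
    The additive reasoning above can be sketched as a back-of-the-envelope model. Only the 40 ns link figure comes from the articles; the Centaur-to-DRAM and on-die numbers below are illustrative guesses, not measurements:

    ```python
    # Rough, additive latency model for a POWER8 DRAM access.
    # Only link_ns is a quoted figure; the other two are illustrative guesses.
    centaur_dram_ns = 50   # assumed IMC-class Centaur-to-DRAM latency
    link_ns = 40           # quoted CPU <-> Centaur link latency
    on_die_ns = 25         # assumed traversal of the on-die cache hierarchy

    power8_dram_ns = centaur_dram_ns + link_ns + on_die_ns
    print(power8_dram_ns)  # ~115 ns under these assumptions
    ```

    Under those assumptions, the off-die hop roughly doubles the miss penalty relative to a sub-50 ns IMC design, which is why the comparison keeps coming back to how much the big L4 can hide.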

    It's difficult to say since the die shots show automated layout that makes the cache arrays less obvious, but the GPU devotes the vast majority of its area to non-cache things.

    Going on-die brought tens of nanoseconds in latency savings, if you correct for prefetching.
    It's particularly relevant for the K7 to K8 transition because the cores are so similar.

    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.140.8981&rep=rep1&type=pdf
    When comparing K8 to Conroe, it was over 20ns saved.

    A similar latency drop was detectable between Nehalem and Conroe, although performance comparison is murky since the architectures are less similar than K7 to K8.
    Nehalem is what killed Opteron's server niche, since along with whatever latency benefits it got, it brought more bandwidth scalability, which was the last refuge for Opteron in HPC and the like.


    The DRAM bandwidth to Centaur is on an absolute level massively higher than the link bandwidth.
    On either side of the link, there are higher-bandwidth subsystems, so it is some form of bottleneck.
     
  10. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    Centaur does add a sizable L4, which hides some of that latency, and yes, for ideal latency we'd have that silicon on-die, but perhaps they are planning on using Centaur as some kind of flexible substrate for stacked DRAM, as you mentioned. If they could improve Centaur-to-DRAM latency through the use of stacked dies, they might get a nice net performance boost in the future while keeping that complexity and heat off the main die.

    After looking into it a little more: Intel has implemented what also looks like an off-die solution with the L4 on the i7-4950, which according to Anand has a latency of 30-32 ns (so IBM's latency to L4 is slightly worse). The i7-4950 just looks like a convenient block of eDRAM bolted onto Haswell, and Intel's L4 is a victim cache for the L3. The memory controllers still appear to be attached to the i7-4950's L3, whereas going through Centaur and adding 40 ns seems to be POWER8's only path to RAM. So you're right: they're definitely adding that 40 ns to DRAM access, whereas Intel isn't. Again, they might gain all of that back and then some once they go to stacked DRAM, and Intel might likewise move its memory controller onto some kind of L4-plus-memory-controller interface for the inevitable transition to stacked RAM, to alleviate heat and complexity issues.

    EDIT: I do agree that ideally all of this should just be on a single die. However, there seem to be restrictions on the density and type of memory that can be etched onto a logic die for now, and creating TSVs sounds like a new and delicate manufacturing step that risks an otherwise good die. Something like Centaur seems to be a good compromise for all these things.
     
    #10 Raqia, Aug 28, 2013
    Last edited by a moderator: Aug 28, 2013
  11. Wynix

    Veteran Regular

    Joined:
    Feb 23, 2013
    Messages:
    1,052
    Likes Received:
    57
    That is one beastly processor.

    I don't know much about the server market; can someone explain how competitive this will be against Intel's offerings?
     
  12. grndzro

    Newcomer

    Joined:
    Jul 11, 2013
    Messages:
    45
    Likes Received:
    0
    If the price is somewhat competitive, I think "demolish" would be an apt term.
     
  13. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,175
    Location:
    La-la land
    That doesn't mean the processor actually bottlenecks on the link; that's what I'm trying to point out. It's meaningless to talk about bottlenecks if the "bottleneck" is never actually maxed out. We don't have the information necessary to judge whether there's a real bottleneck, and I think the engineers at IBM are well aware of the various capacities of the links in the CPU they've designed and have balanced them accordingly... They're not a bunch of idiots, you can be sure of that. ;)
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,523
    Likes Received:
    4,590
    Location:
    Well within 3d
    Designers have to give a nod at some point to physical reality when it comes to implementing the system. The bandwidth appears to be massively improved, but at the clocks and thread counts in question, there are non-theoretical workloads that benefit greatly from the bandwidth while being all too ready to consume more. It's not denigrating their work to point out that there's a link in the chain that isn't quite at the same level as the others.

    Read bandwidth at the DRAM level is almost four times that of the link, although part of the problem I'm facing is that I can't get the quoted 9.6 GB/s for the link to add up correctly to IBM's bandwidth total, with or without IO added in or with any combination of bidirectional or partially bidirectional bandwidth.
    The L3 side of the link is even higher bandwidth, so if it were possible to cut the buffer chip out, the CPU could readily soak up the bandwidth.

    Going to DDR4 with that link would make the mismatch more noticeable.
     
  15. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,235
    Likes Received:
    2,841
    Location:
    Germany
    Is the complete PDF already available? At the Hot Chips site, the archive is still password-protected.

    In the Register's article, there are a couple of seemingly conflicting numbers:

    • „At a 4GHz clock speed, you can move data into L3 cache from the external L4 cache at 128GB/sec“
    -> divided by eight memory channels, that'd make 16 GB/s of read bandwidth for each centaur
    • „That memory link between the Power8 package and the Centaur memory buffer chip has a 40-nanosecond latency and 9.6GB/sec of bandwidth“
    • „That socket would have eight memory channels, for a total of 230GB/sec of sustained bandwidth into and out of the processor …“
    -> divided by eight: 28.75 GB/s
    • „… and the 32 DDR memory ports hanging off one twelve-core chip would have 410GB/sec of peak bandwidth at the DRAM level.“

    Leaves me quite confused.
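
    The per-channel divisions above can be checked mechanically. The totals are the article's figures, and the division by eight channels is the same one done in the post; nothing here resolves which total is the "right" one:

    ```python
    # Reproduce the per-Centaur arithmetic from the Register article's figures.
    channels = 8                # memory channels (one Centaur each) per socket

    l4_to_l3_total = 128.0      # GB/s, external L4 into L3, per the article
    sustained_total = 230.0     # GB/s, sustained into and out of the processor
    dram_peak_total = 410.0     # GB/s, peak across the 32 DDR ports

    per_centaur_read = l4_to_l3_total / channels        # 16.0 GB/s
    per_centaur_sustained = sustained_total / channels  # 28.75 GB/s

    print(per_centaur_read, per_centaur_sustained)
    ```

    So 16 GB/s per Centaur for L4-to-L3 reads versus 28.75 GB/s per channel of sustained traffic, against a 9.6 GB/s quoted link figure: the numbers only reconcile if they describe different directions or different aggregation levels.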
     
  16. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,605
    Likes Received:
    1,021
    Core 2 introduced speculative load/store reordering, which meant more memory ops in flight and lower apparent latency. That was the primary reason for the large performance jump from Core to Core 2. It also featured lots of smart prefetchers, which K8 didn't.

    IMO, the performance jump from Core 2 to the Nehalem i7s was largely from the lower-latency memory subsystem. Nehalem also featured slightly higher bandwidth and a bigger OOOE window (128 entries). Even though in SMT mode each thread has only half the ROB and half the store buffers at its disposal, it still beats Core 2 silly.

    I went from a 2.2GHz Athlon XP to a 2.2GHz Athlon 64 back in 2004 and more or less doubled performance in everything from games to large software builds. Other than the 64-bit datapaths and the integrated memory controller, the XP and the 64 were almost identical microarchitecturally (same I- and D-caches, same number of memory ops per cycle, same decode width, same ROB size, etc.).

    With this, the P8 seems to focus on *large* memory configurations and throughput (8x SMT) over all-out single-thread performance.

    Cheers
     
  17. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Not really, if you look at all the current high end server CPUs, they all use BoB (buffer on board) technology to connect to DRAM: Intel Xeon E7s, Oracle Sparc processors, and even the existing Power7/7+.

    It is both a bandwidth and capacity play. The high speed links provide significantly more bandwidth per pin than the actual DRAM interfaces.


    The chip link doesn't need to match the DRAM bandwidth. It is quite rare that DRAM reaches its peak bandwidth, even 50% of peak isn't achievable in many workloads that are quite well behaved. When you look at workloads like DBs it is even lower. The main reasons they have so many channels per BoB is to increase concurrency (which lowers effective latency with highly MT workloads) and to enable increased memory capacity.

    While you see it as a bottleneck, it is unlikely to be an actual bottleneck in any real workloads, or even in microbenchmarks like STREAM.

    The quoted 9.6 GB/s is going to be CPU->BoB bandwidth. The BoB->CPU bandwidth is going to be ~19.2 GB/s. AKA, the interface is likely 20b from BoB to CPU and 10b from CPU to BoB running at around 10 GT/s.
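
    The lane math above works out if one assumes an 8b/10b-style encoding (8 payload bits per 10-bit transfer group) and a rate of 9.6 GT/s rather than a round 10; both of those are my assumptions, chosen because they make the quoted numbers line up exactly:

    ```python
    # Payload bandwidth of an asymmetric serial link, assuming an 8b/10b-style
    # encoding. Lane counts are from the post; the 9.6 GT/s rate and the
    # encoding overhead are assumptions made for illustration.
    def payload_gb_per_s(lanes, gt_per_s, payload_bits=8, group_bits=10):
        """Payload GB/s for `lanes` lanes at `gt_per_s` transfers/s per lane."""
        return lanes * gt_per_s * (payload_bits / group_bits) / 8.0

    cpu_to_bob = payload_gb_per_s(10, 9.6)  # 10 lanes -> 9.6 GB/s
    bob_to_cpu = payload_gb_per_s(20, 9.6)  # 20 lanes -> 19.2 GB/s
    print(cpu_to_bob, bob_to_cpu)
    ```

    The asymmetry (2:1 in favor of reads) matches the read-heavy traffic a memory buffer would see, which is presumably why the CPU→BoB direction gets the narrower half.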
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,523
    Likes Received:
    4,590
    Location:
    Well within 3d
    That clears up my understanding of the article's numbers. The bandwidth between the Centaur chips and the CPU is much closer to the rest of the system now that I understand the quoted number isn't the total.
     
    #18 3dilettante, Aug 30, 2013
    Last edited by a moderator: Aug 30, 2013
  19. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Yep.

    The two new things with IBM's BoB are the shift of the actual memory-scheduling logic from the CPU to the BoB and the addition of the large cache on the BoB. The front-end coherence/consistency portion of the memory controller is likely still on-die, as there is no advantage to moving it behind the cache, and doing so would likely hurt performance.
     
  20. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,605
    Likes Received:
    1,021
    How is the BoB cache supposed to work? As a per-memory-channel victim cache? As a massive write-coalescing buffer? (Or both?)

    Cheers
     