POWER8 - IBM goes ballistic on big processors

fellix

Veteran
4GHz 12-core Power8 for badass boxes
With the Power8 chip, IBM has a few goals. First, the company is shifting from the 32-nanometer processes used for the relatively recent Power7+ chips to a 22-nanometer process. The shrinking of the transistor gates allows IBM to add more features to a die, crank the clocks, or do a little of both.

Judging from the Power8, it looks like IBM is content to stay in the same clock speed range as the Power7+ chips - around 4GHz, give or take a little. It'll also move PCI-Express 3 controllers into the chip package to keep those hungry little Power8 cores fed; these controllers will offer a coherent memory protocol to external accelerators, alongside a new cache hierarchy that goes all the way out to the L4 cache.

The Power8 chip is implemented in IBM's familiar high-k metal gate, copper, and silicon-on-insulator technologies, now on a 22-nanometer process. The precise transistor count was not given during the presentation, but the Power8 chip weighs in at 650 square millimetres; this is a bit bigger than Power7+, which used a 32-nanometer process and had 2.1 billion transistors on a surface area of 567 square millimetres.

The Power8 core has a total of sixteen execution pipes. These include two load-store units (LSUs), a condition register unit (CRU), a branch register unit (BRU), and two instruction fetch units (IFUs). There are also two fixed-point units (FXUs), two vector math units (VMXs), a decimal floating-point unit (DFU), and one cryptographic unit (not labeled in the core diagram).

Each core now has eight threads implemented using simultaneous multithreading (what IBM calls SMT8), instead of four threads per core with the Power7 and Power7+ chips. And like earlier Power chips, this SMT is dynamically tuneable so a core can have one, two, four, or eight threads fired up.
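To put those thread counts in perspective, here is a trivial sketch (plain arithmetic, nothing IBM-specific) of how many schedulable hardware threads one 12-core socket exposes in each of the selectable SMT modes:

```python
# Hardware threads exposed by one 12-core POWER8 socket in each of the
# dynamically selectable SMT modes (core count and per-core thread counts
# are taken from the article above).
cores_per_socket = 12
for smt_mode in (1, 2, 4, 8):
    print(f"SMT{smt_mode}: {cores_per_socket * smt_mode} hardware threads per socket")
# SMT8 means 96 hardware threads per socket for the OS to schedule onto.
```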

:oops:
 
The introduction of Centaur (L4 cache / memory controller dies) is very interesting. This plus stacked RAM should eventually obviate the need for separate commodity DIMMs and interfaces on motherboards. The marginal benefit balance of stamping out more identical cores vs. having additional cache looks like it's starting to favor cache now, even in GPU-like products. I know GCN has a rather sophisticated cache hierarchy for instance.
 
IBM has brought back the north bridge, or rather bridges. It's interesting to see articles about the 40ns link latency. I remember when the removal of that latency was celebrated with IMCs.
The latency didn't matter as much for the server loads POWER targets, so it's not as strong a reversal as it would be if a desktop chip did the same.

I find it interesting that people bring up the massive DRAM bandwidth to the Centaur chips, given the much narrower link to the CPU. I would think that is more of a capacity play than a bandwidth one. Why not load up on a massive amount of slower DRAM for less money, so long as you still match the more limited link bandwidth?

I do recall Nvidia patenting something like this for GPUs, although it never came to market.
The rationalization was the same: being free from having to redesign the main chip for different memory types.
The other side of that, however, is an admission that it has been decided that the main chip is going to evolve more slowly than the glacial cadence of memory standards. Nvidia instead went back and reworked its memory controllers after a generation or two of potentially less effective implementations.

This could be a hedge against uncertainty over what memory is going to be most useful in the next couple of years, or a provision for different markets if licensing takes off.
The opening of IBM's design platform might mean that it could wind up in places that like one form of memory over the other, which the decoupled memory architecture might simplify.

Package memory, Hybrid Memory Cube, DDR4, optical links to DRAM on a different rack, and possibly some evolution of current standards in stacked form might all be hitting in an uncomfortably close time window. In theory, a chip like Centaur could become the lower layer of a memory stack, along the same lines as HMC.

Either scenario happens if IBM's silicon cadence can't match the speciation of memory platforms, although the one where the explosion of options is impractical to match is more positive than the one where IBM has slowed below the rate of basic DDR evolution.


I know GCN has a rather sophisticated cache hierarchy for instance.
Eh. It's more sophisticated than the previous AMD GPU designs. I'm not sure it measures up to any CPU capable of SMP between two caches. The protocol for the next-gen consoles, for example, seems to be very primitive: ordering is almost nonexistent, and for internal coherence it relies on a shared L2 with sections statically tied to memory controllers.
 
The latency didn't matter as much for the server loads POWER targets, so it's not as strong a reversal as it would be if a desktop chip did the same.
It has 8-way multithreading and absolutely oodles of cache, so a little extra main RAM latency probably won't affect it much at all, regardless of load. :) It's not as if processors suffered horribly because of RAM latency in the era before integrated MCs anyway...

In any case, what a monster chip. Absolutely frightening specs on just about every level. Too bad it'll never see the insides of any consumer gear. Makes me sad.
 
It has 8-way multithreading and absolutely oodles of cache, so a little extra main RAM latency probably won't affect it much at all, regardless of load. :) It's not as if processors suffered horribly because of RAM latency in the era before integrated MCs anyway...
For the desktop, it was the major performance differentiator between K7 and K8, and kept AMD ahead of Northwood until the Prescott misfire made the difference obvious.

The massively cached EE chips were what Intel resorted to in order to make up for the latency, so there was suffering involved for non-server loads.

There was also a bandwidth limitation with the FSB that IMCs could get around, mostly. AMD eventually managed to bottleneck itself in its on-die crossbar when memory speeds went higher and the on-die northbridge clocks didn't keep pace.
POWER8 reintroduces the bottleneck with that rather pedestrian link bandwidth, although it sort of compensates by having multiple links. It seems unfortunate the chip link couldn't be a bit closer to the massive DRAM to Centaur bandwidth.
 
Since we were looking at something like 100 ns of latency for Sandy Bridge to DDR3, 40 ns doesn't sound too bad, provided there isn't too much more added between Centaur and the actual RAM. I'd like to see a graph like this for the POWER8 cache hierarchy:

http://images.anandtech.com/doci/7003/latency_575px.png

http://images.anandtech.com/doci/6993/bandwidth_575px.png

Intel seems to have kept latency nicely in control with its four-tiered cache, with clear benefits in latency within the range of the L4 cache.
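For what it's worth, the effect of slotting a big L4 between L3 and DRAM is easy to show with a back-of-envelope average-memory-access-time calculation. The miss rates and latencies below are invented round numbers for illustration, not measured Haswell or POWER8 values:

```python
# Illustrative AMAT (average memory access time) calculation showing why a large
# L4 between L3 and DRAM can pull effective latency down. All miss rates and
# latencies are made-up round numbers, not measured figures.
def amat(levels):
    """levels: list of (local_miss_rate, latency_ns); the last level is DRAM."""
    total_ns, fraction_reaching = 0.0, 1.0
    for miss_rate, latency_ns in levels:
        total_ns += fraction_reaching * latency_ns
        fraction_reaching *= miss_rate
    return total_ns

# A workload whose working set spills out of L3 but mostly fits in an L4.
without_l4 = [(0.10, 1.0), (0.40, 4.0), (0.80, 12.0), (0.00, 90.0)]
with_l4    = [(0.10, 1.0), (0.40, 4.0), (0.80, 12.0), (0.10, 35.0), (0.00, 90.0)]

print(f"AMAT without an L4: {amat(without_l4):.2f} ns")
print(f"AMAT with an L4:    {amat(with_l4):.2f} ns")
```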

Yes, GCN doesn't have a cache hierarchy that's anywhere near as sophisticated as a CPU's; I just meant that AMD saw fit to devote a lot of die space to cache and cache-related logic with that gen.
 
For the desktop, it was the major performance differentiator between K7 and K8, and kept AMD ahead of Northwood until the Prescott misfire made the difference obvious.

By the same token, Core 2 did okay without it, and while Nehalem was faster, it's hard to say if the IMC contributed the bulk of that. Core 2's latency was also far lower than Prescott's (and IIRC far lower than K7's as well), so it at least shows what can be accomplished without an IMC.

Something I'm wondering about these Centaur chips: if they really can come with stacked DRAM, could that have any positive benefit for latency, to help negate the cost of moving the controller off chip?
 
POWER8 reintroduces the bottleneck with that rather pedestrian link bandwidth, although it sort of compensates by having multiple links.
I think you speak too lightly of bottlenecks; you don't know all the details and what secret sauce is in this chip.

Also, K7 is a decade old by now and completely outclassed by this chip or any other current offering; ten years is half an eternity in computing terms, technology has progressed enormously since then, and all of the issues off-die MCs suffered back then won't necessarily be true for POWER8. For example, memory traffic traversed the CPU I/O bus in pre-K8 and pre-Nehalem systems; that's not true for POWER8.
 
Since we were looking at something like 100 ns of latency for Sandy Bridge to DDR3, 40 ns doesn't sound too bad, provided there isn't too much more added between Centaur and the actual RAM.
Sandy Bridge has a memory latency of less than 50ns.

The DRAM to Centaur latency should be in the same neighborhood as what you get with an IMC, minus the full cache hierarchy.
POWER8 would then add the 40ns link, and then the latency of the on-die memory hierarchy.
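Putting those pieces together as a rough stack-up (only the 40 ns link figure comes from the article; the other two components are assumptions for illustration):

```python
# Back-of-envelope stack-up for a POWER8 load that misses everything and goes
# all the way to DRAM. Only the 40 ns link figure is from the article; the
# other two components are assumed round numbers.
on_die_hierarchy_ns = 25   # assumed: L1/L2/L3 lookups plus the on-die fabric
cpu_to_centaur_ns   = 40   # CPU <-> Centaur link latency quoted in the article
centaur_to_dram_ns  = 50   # assumed: roughly what an integrated controller pays

total_ns = on_die_hierarchy_ns + cpu_to_centaur_ns + centaur_to_dram_ns
print(f"worst-case load-to-use estimate: {total_ns} ns")  # ~115 ns under these assumptions
```

Under those assumptions the full trip lands well above the sub-50 ns Sandy Bridge figure, with the Centaur L4 there to absorb a chunk of the trips before they ever reach DRAM.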

Yes, GCN doesn't have a cache hierarchy that's anywhere near as sophisticated as a CPU's; I just meant that AMD saw fit to devote a lot of die space to cache and cache-related logic with that gen.
It's difficult to say since the die shots show automated layout that makes the cache arrays less obvious, but the GPU devotes the vast majority of its area to non-cache things.

By the same token, Core 2 did okay without it, and while Nehalem was faster, it's hard to say if the IMC contributed the bulk of that. Core 2's latency was also far lower than Prescott's (and IIRC far lower than K7's as well), so it at least shows what can be accomplished without an IMC.
Going on-die brought tens of nanoseconds in latency savings, if you correct for prefetching.
It's particularly relevant for the K7 to K8 transition because the cores are so similar.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.140.8981&rep=rep1&type=pdf
When comparing K8 to Conroe, it was over 20ns saved.

A similar latency drop was detectable between Nehalem and Conroe, although performance comparison is murky since the architectures are less similar than K7 to K8.
Nehalem is what killed Opteron's server niche, since along with whatever latency benefits it got, it brought more bandwidth scalability, which was the last refuge for Opteron in HPC and the like.


I think you speak too lightly of bottlenecks; you don't know all the details and what secret sauce is in this chip.
The DRAM bandwidth to Centaur is on an absolute level massively higher than the link bandwidth.
On either side of the link, there are higher-bandwidth subsystems, so it is some form of bottleneck.
 
Centaur does add a sizable L4, which hides some of that latency, and yes, for ideal latency we'd have that silicon on die, but perhaps they are planning on using Centaur as some kind of flexible substrate for stacked DRAM, as you mentioned. So if they could improve latencies from Centaur to DRAM through the use of stacked dies, they might have a nice net performance boost in the future and separate that complexity and heat from the main die.

After looking into it a little more, Intel has implemented what also looks like an off-die solution with its L4 on the i7-4950HQ, which according to Anand has a latency of 30-32 ns (so IBM's latency to L4 is slightly worse). The i7-4950HQ just looks like a convenient block of eDRAM bolted onto Haswell, and Intel's L4 is a victim cache for L3; it looks like the memory controllers are still attached to the i7-4950HQ's L3, whereas going through Centaur and adding 40 ns seems to be POWER8's only path to RAM. So you're right, they're definitely adding that 40 ns to DRAM access whereas Intel isn't. Again, they might gain all of that back and then some once they go stacked DRAM, and Intel might move its memory controller onto some kind of L4 + memory controller interface as well for the inevitable transition to stacked RAM, to alleviate heat and complexity issues.

EDIT: I do agree that ideally all of this should just be on a single die. However, there seem to be restrictions on the density and type of memory that can be etched onto a die for now, and connecting / creating TSVs sounds like a new and delicate manufacturing process that risks an otherwise good die. Something like Centaur seems to be a good compromise for all these things.
 
That is one beastly processor.

I don't know much about the server market; can someone simplify how competitive this will be against Intel's offerings?
 
That is one beastly processor.

I don't know much about the server market; can someone simplify how competitive this will be against Intel's offerings?

If the price is somewhat competitive, I think "demolish" would be an apt term.
 
On either side of the link, there are higher-bandwidth subsystems, so it is some form of bottleneck.
Doesn't mean the processor actually bottlenecks on the link; that's what I'm trying to point out. It's meaningless talking about bottlenecks if there's a "bottleneck" that is never actually maxed out. We don't have the information necessary to judge if there's a real bottleneck, and I think the engineers at IBM are well aware of the various capacities of the links in the CPU they've designed, and have balanced accordingly... They're not a bunch of idiots, you can be sure of that. ;)
 
Designers have to give a nod at some point to physical reality when it comes to implementing the system. The bandwidth appears to be massively improved, but at the clocks and thread counts in question, there are non-theoretical workloads that benefit greatly from the bandwidth while being all too ready to consume more. It's not denigrating their work to point out that there's a link in the chain that isn't quite at the same level as the others.

Read bandwidth at the DRAM level is almost four times that of the link, although part of the problem I'm facing is that I can't get the quoted 9.6 GB/s for the link to add up correctly to IBM's bandwidth total, with or without IO added in or with any combination of bidirectional or partially bidirectional bandwidth.
The L3 side of the link is even higher bandwidth, so if it were possible to cut the buffer chip out, the CPU could readily soak up the bandwidth.

Going to DDR4 with that link would make the mismatch more noticeable.
 
Is the complete PDF already available? At the Hot Chips site, the archive is still password-protected.

In The Register's article, there are a couple of seemingly conflicting numbers:

• „At a 4GHz clock speed, you can move data into L3 cache from the external L4 cache at 128GB/sec“
-> divided by eight memory channels, that'd make 16 GB/s of read bandwidth for each Centaur
• „That memory link between the Power8 package and the Centaur memory buffer chip has a 40-nanosecond latency and 9.6GB/sec of bandwidth“
• „That socket would have eight memory channels, for a total of 230GB/sec of sustained bandwidth into and out of the processor …“
-> divided by eight: 28.75 GB/s
• „… and the 32 DDR memory ports hanging off one twelve-core chip would have 410GB/sec of peak bandwidth at the DRAM level.“

Leaves me quite confused.
 
By the same token, Core 2 did okay without it, and while Nehalem was faster, it's hard to say if the IMC contributed the bulk of that. Core 2's latency was also far lower than Prescott's (and IIRC far lower than K7's as well), so it at least shows what can be accomplished without an IMC.

Core 2 introduced speculative load/store reordering, that meant more memory ops in flight for lower apparent latency. This was the primary reason for the large performance jump from Core -> Core 2. It also featured lots of smart prefetchers, which K8 didn't.
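The "more memory ops in flight" point is easy to put in numbers. A toy sketch, assuming a flat 80 ns DRAM latency (an assumed round number, not a measured K8 or Core 2 figure):

```python
# Toy illustration of memory-level parallelism: with several independent misses
# overlapped, the stall effectively charged to each one shrinks.
dram_latency_ns = 80.0  # assumed round number for illustration
for misses_in_flight in (1, 2, 4, 8):
    apparent_ns = dram_latency_ns / misses_in_flight
    print(f"{misses_in_flight} overlapped misses -> ~{apparent_ns:.0f} ns apparent stall per miss")
```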

IMO, the performance jump from C2 to Nehalem I7s was largely from the lower latency memory subsystem. Nehalem also featured slightly higher bandwidth and a bigger OOOE window (128 entries). However in SMT mode each thread only has half the ROB and half the store buffers to its disposal and it still beats Core 2 silly.

I went from a 2.2GHz Athlon XP to a 2.2GHz Athlon 64 back in 2004 and more or less doubled performance in everything from games to large software builds. Other than the 64-bit datapaths and the integrated memory controller, XPs and 64s were almost identical micro-architecture-wise (same I&D caches, same number of memory ops per cycle, same decode width, same ROB size, etc.).

The P8 seems to focus on *large* memory configurations with this, as well as throughput (8x SMT) over all-out single-thread performance.

Cheers
 
IBM has brought back the north bridge, or rather bridges. It's interesting to see articles about the 40ns link latency. I remember when the removal of that latency was celebrated with IMCs.
The latency didn't matter as much for the server loads POWER targets, so it's not as strong a reversal as it would be if a desktop chip did the same.

Not really; if you look at all the current high-end server CPUs, they all use BoB (buffer-on-board) technology to connect to DRAM: Intel Xeon E7s, Oracle SPARC processors, and even the existing Power7/7+.

I find it interesting that people bring up the massive DRAM bandwidth to the Centaur chips, given the much narrower link to the CPU. I would think that is more of a capacity play than a bandwidth one. Why not load up on a massive amount of slower DRAM for less money, so long as you still match the more limited link bandwidth?

It is both a bandwidth and capacity play. The high speed links provide significantly more bandwidth per pin than the actual DRAM interfaces.


POWER8 reintroduces the bottleneck with that rather pedestrian link bandwidth, although it sort of compensates by having multiple links. It seems unfortunate the chip link couldn't be a bit closer to the massive DRAM to Centaur bandwidth.

The chip link doesn't need to match the DRAM bandwidth. It is quite rare that DRAM reaches its peak bandwidth; even 50% of peak isn't achievable in many workloads that are quite well behaved. When you look at workloads like DBs it is even lower. The main reasons they have so many channels per BoB are to increase concurrency (which lowers effective latency with highly MT workloads) and to enable increased memory capacity.
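The concurrency argument can be made concrete with Little's Law. A sketch, using the article's bandwidth figures but assuming 128-byte cache lines and a ~100 ns end-to-end latency (both of those are assumptions, not published numbers):

```python
# Little's Law sketch: to sustain a given bandwidth at a given latency you need
# bandwidth * latency bytes in flight. The 128-byte line size and ~100 ns
# end-to-end latency are assumptions; 9.6 GB/s and 230 GB/s are the article's figures.
def lines_in_flight(bandwidth_gb_per_s, latency_ns, line_bytes=128):
    outstanding_bytes = bandwidth_gb_per_s * latency_ns  # GB/s * ns == bytes
    return outstanding_bytes / line_bytes

print(f"one 9.6 GB/s link at 40 ns:  {lines_in_flight(9.6, 40):.0f} lines outstanding")
print(f"230 GB/s socket at ~100 ns: {lines_in_flight(230, 100):.0f} lines outstanding")
```

Keeping a couple of hundred misses outstanding at the socket level is exactly what lots of channels, lots of DRAM banks, and SMT8's thread count are there for.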

The DRAM bandwidth to Centaur is on an absolute level massively higher than the link bandwidth.
On either side of the link, there are higher-bandwidth subsystems, so it is some form of bottleneck.

While you see it as a bottleneck, it is unlikely to be an actual bottleneck in any real workload, or even in microbenchmarks like STREAM.

Read bandwidth at the DRAM level is almost four times that of the link, although part of the problem I'm facing is that I can't get the quoted 9.6 GB/s for the link to add up correctly to IBM's bandwidth total, with or without IO added in or with any combination of bidirectional or partially bidirectional bandwidth.
The L3 side of the link is even higher bandwidth, so if it were possible to cut the buffer chip out, the CPU could readily soak up the bandwidth.

Going to DDR4 with that link would make the mismatch more noticeable.

The quoted 9.6 GB/s is going to be CPU->BoB bandwidth. The BoB->CPU bandwidth is going to be ~19.2 GB/s. AKA, the interface is likely 20b from BoB to CPU and 10b from CPU to BoB running at around 10 GT/s.
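Under that reading, the article's socket-level numbers do add up. A quick sketch (the 19.2/9.6 GB/s read/write split and DDR3-1600 behind the DRAM ports are assumptions, not IBM's published breakdown):

```python
# Reconciling The Register's figures under the asymmetric-link reading above.
# The 19.2/9.6 GB/s split and DDR3-1600 behind the 32 DRAM ports are assumptions.
read_gb_s, write_gb_s = 19.2, 9.6   # per Centaur channel: BoB -> CPU, CPU -> BoB
channels, dram_ports  = 8, 32
ddr3_1600_gb_s        = 12.8        # one 64-bit DDR3-1600 port

print(f"sustained into/out of the socket: {channels * (read_gb_s + write_gb_s):.1f} GB/s")  # ~230 GB/s
print(f"peak at the DRAM level:           {dram_ports * ddr3_1600_gb_s:.1f} GB/s")          # ~410 GB/s
```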
 
The quoted 9.6 GB/s is going to be CPU->BoB bandwidth. The BoB->CPU bandwidth is going to be ~19.2 GB/s. AKA, the interface is likely 20b from BoB to CPU and 10b from CPU to BoB running at around 10 GT/s.

That clears up my understanding of the article's numbers. The bandwidth between the Centaur chip and the CPU is much closer to the DRAM-side bandwidth now that I'm clear that the quoted number isn't the total.
 
That clears up my understanding of the article's numbers. The bandwidth between the Centaur chip and the CPU is much closer to the DRAM-side bandwidth now that I'm clear that the quoted number isn't the total.

Yep.

The two new things with IBM's BoB are the shifting of the actual memory scheduling logic from the CPU to the BoB and the addition of the large cache on the BoB. The front-end coherence/consistency portion of the memory controller is still likely on die, as there is no advantage to shifting it behind the cache and it would likely result in worse performance.
 
Yep.

The two new things with IBM's BoB are the shifting of the actual memory scheduling logic from the CPU to the BoB and the addition of the large cache on the BoB. The front-end coherence/consistency portion of the memory controller is still likely on die, as there is no advantage to shifting it behind the cache and it would likely result in worse performance.

How is the BoB cache supposed to work? As a per-memory-channel victim cache? Or as a massive write-coalescing buffer? (Or both?)

Cheers
 