> How is it possible that Bobcat is so small and yet so powerful compared to the Atom core, which still contains only the necessary stuff?

Better architecture?
The other elephant in the room is the 32nm gate-first HKMG SOI process, which, for all of AMD's and GF's bluster, has not been shown to have overcome its known problems, against the public weight of pretty much everybody else going gate-last, ahead of AMD, faster than AMD, and to great effect in terms of yields and variability.
> Better architecture?

AFAIK Atom was based on old Pentium CPUs, while Bobcat was newly engineered from the ground up.
> I am a few days late, but this surprised me a little bit. Apart from Intel, who is going gate-last and ahead of GloFo, let alone faster and with better yields?

TSMC will be a close finish, and that is with the consideration that it did an about-face and switched to gate-last after seeing what happened with gate-first.
> I believe IBM went gate-first and has been making 45nm CPUs with HK/MG for a while.

It's been producing chips with margins and service revenue that make otherwise horrible yields acceptable.
> TSMC is going gate-last, but doesn't appear to have any significant advantage over GloFo at this point, whether in terms of time to market, yields or performance.

It might be a close finish even after TSMC changed its mind halfway through.
> An unknown about the implementation of AVX on BD could make it worse: if the FP registers are physically 128-bit, the register count for 256-bit values is effectively halved, whereas SB natively supports the full width.

How, exactly, can you support AVX and NOT have ymm registers?
With all the other changes being very carefully chosen then there must surely be a good reason for such a small L1D.The L1 is small, and the shared L2, reminiscent of Conroe or Penryn, is slower on a per-cycle basis.
> How, exactly, can you support AVX and NOT have ymm registers?
IIRC, the ISA says there are 16 128-bit registers (the xmm series) and 16 256-bit registers (the ymm series) aliased to the xmm series. So if the ISA says there are 16 128-bit registers and 16 256-bit registers, you could in reality have, say, 64 or 80 actual 128-bit physical registers; all the architectural values would fit into those, and you would still have some "freedom of renaming to avoid antidependencies".
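As a minimal sketch of that renaming idea (this is not AMD's actual design; the physical register count, the table layout, and all the names below are made up for illustration), each architectural ymm could map onto a pair of 128-bit physical entries, with the xmm alias being the low half:

```c
/* Illustrative rename sketch, assuming a hypothetical pool of 64 physical
 * 128-bit registers. A ymm value occupies two entries (low/high halves);
 * an xmm value, being the low half of its ymm alias, occupies one. */
#include <stdio.h>

#define ARCH_REGS 16   /* ymm0..ymm15; xmm0..xmm15 alias their low halves */
#define PHYS_REGS 64   /* assumed physical 128-bit register count */

typedef struct {
    int lo;            /* physical register holding bits 0..127   */
    int hi;            /* physical register holding bits 128..255 */
} RenameEntry;

static RenameEntry rat[ARCH_REGS];          /* register alias table */
static int free_list[PHYS_REGS], free_top;

static int alloc_phys(void) { return free_list[--free_top]; }

/* A write to ymmN gets two fresh physical registers, which is what breaks
 * write-after-write and write-after-read antidependencies. */
static void rename_ymm_write(int n) {
    rat[n].lo = alloc_phys();
    rat[n].hi = alloc_phys();
}

/* A write to xmmN only re-allocates the low half. (Real VEX-encoded writes
 * also zero the upper half; that detail is glossed over here.) */
static void rename_xmm_write(int n) {
    rat[n].lo = alloc_phys();
}

int main(void) {
    for (int i = 0; i < PHYS_REGS; i++) free_list[free_top++] = i;
    for (int i = 0; i < ARCH_REGS; i++) rename_ymm_write(i);
    rename_xmm_write(0);   /* a new xmm0 value: only the low half moves */
    printf("ymm0 -> P%d (lo), P%d (hi)\n", rat[0].lo, rat[0].hi);
    return 0;
}
```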
> With all the other changes being very carefully chosen, there must surely be a good reason for such a small L1D.

AMD is targeting high clocks, and a larger L1 might have been too much to fit into the shorter cycle time, at least not without increasing the L1 latency even further.
> I'm presuming that they are very confident that the new prefetchers will make sure the L1D is nearly always populated with the next piece of needed data, or at least that sufficient other instructions will be ready in the L1I to execute while waiting for the L1D to be populated. Alternatively, perhaps their modelling found that most of the time 16KB or less of the L1D is actually ever re-used, so having a bigger one is just a waste because nearly all the data gets evicted or invalidated anyway.

The smaller L1 may be an acceptable sacrifice for a multithreaded server load.
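That 16KB-reuse hypothesis is the kind of thing a crude microbenchmark can probe. Here is a rough sketch (generic POSIX timing; the buffer sizes are picked arbitrarily and nothing here is specific to BD) that walks buffers of different sizes and watches where the time per pass falls off a cliff:

```c
/* Rough working-set probe: repeatedly touch one byte per 64-byte cache line
 * in buffers of several sizes. If the re-used set really fits in 16KB, the
 * smallest buffer should run at L1 speed and the larger ones fall to L2/L3. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double walk(volatile char *buf, size_t size, long passes) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long p = 0; p < passes; p++)
        for (size_t i = 0; i < size; i += 64)   /* one touch per cache line */
            (void)buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    size_t sizes[] = { 16 * 1024, 64 * 1024, 1024 * 1024 };
    for (int i = 0; i < 3; i++) {
        char *buf = malloc(sizes[i]);
        memset(buf, 0, sizes[i]);                   /* fault the pages in   */
        long passes = 2000000000L / (long)sizes[i]; /* ~constant total work */
        printf("%8zu bytes: %.3f s\n", sizes[i], walk(buf, sizes[i], passes));
        free(buf);
    }
    return 0;
}
```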
> High per-cycle latency for the L2 should be fine as long as the actual clock rate is suitably high. (Is cache latency measured in core clocks or cache clocks?)

The latency should be measured in core clocks.
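To make the core-clocks point concrete, here is a trivial worked example with made-up numbers (neither clock nor cycle count refers to a real part): a higher per-cycle latency at a higher clock can be the same wall-clock latency.

```c
/* Cycles vs. wall-clock: the same absolute latency looks "worse" when
 * counted in cycles on a faster-clocked chip. All numbers are invented. */
#include <stdio.h>

int main(void) {
    double ghz_a = 3.0, cyc_a = 15;  /* hypothetical moderate-clock design */
    double ghz_b = 4.0, cyc_b = 20;  /* hypothetical high-clock design     */
    printf("A: %2.0f cycles @ %.1f GHz = %.2f ns\n", cyc_a, ghz_a, cyc_a / ghz_a);
    printf("B: %2.0f cycles @ %.1f GHz = %.2f ns\n", cyc_b, ghz_b, cyc_b / ghz_b);
    return 0;   /* both come out to 5.00 ns */
}
```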
> I still hold out a probably forlorn hope for BD's L3 to be a big eDRAM or similar high-density tech.

AMD is free to whip out a surprise for the L3. AMD's cache subsystem has normally been sub-par compared to Intel's, so it would be nice to see a change.
For much of BD's market, the additional throughput is worth the single-threaded IPC loss, but that goes to show that the desktop and laptop markets are not the primary target.
> If you're only running a single-threaded workload on a BD module, resources seem more than ample.

Sandy Bridge is significantly better-provisioned for the single-threaded case than BD in terms of load/store queue depth, integer rename, and ROB capacity. Its L1 is larger, and its L2 is faster. BD has more capacity at the L2 level, but the latency numbers are ho-hum and don't include the L3, which unfortunately cannot be massively faster because the L2's long latency sits in the way.
> The data cache is smaller but is 4-way associative vs 2-way for K8. The cache structure of the D$s and L2 is now inclusive, so a cache miss is simpler to serve; in particular, back-to-back misses don't see the latency penalty associated with swapping cache lines on earlier AMD CPUs.

Despite the simpler arrangement, the L2 latency is still slower on a cycle basis than other shared-cache implementations. The associativity of the L1 is twice as high, but the cache is a quarter of the size, so we might see this as break-even at best.
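For what it's worth, a toy contrast of the two policies (schematic only, not a model of the real pipelines): an exclusive hierarchy couples the L1 fill with moving the evicted victim down into the L2, so back-to-back misses queue behind the swap, while an inclusive one can simply overwrite the victim because the L2 already holds a copy.

```c
/* Toy sketch of exclusive vs. inclusive miss handling; "CacheSlot" and the
 * single-slot model are invented for illustration. */
#include <stdio.h>

typedef struct { int line; } CacheSlot;

/* Exclusive (older AMD style): the fill and the victim writeback are one
 * coupled swap, occupying the L1<->L2 path in both directions. */
static void miss_exclusive(CacheSlot *l1, CacheSlot *l2_victim_slot, int needed) {
    int victim = l1->line;
    l1->line = needed;             /* fill L1 from L2 ...           */
    l2_victim_slot->line = victim; /* ... and push the victim to L2 */
}

/* Inclusive (BD style, per the post above): L2 already has the victim,
 * so the L1 copy is just dropped and the next miss can start sooner. */
static void miss_inclusive(CacheSlot *l1, int needed) {
    l1->line = needed;
}

int main(void) {
    CacheSlot l1 = { 1 }, l2slot = { 0 };
    miss_exclusive(&l1, &l2slot, 3);
    printf("exclusive: L1=%d, L2 slot=%d (victim moved down)\n", l1.line, l2slot.line);
    miss_inclusive(&l1, 4);
    printf("inclusive: L1=%d (victim simply dropped)\n", l1.line);
    return 0;
}
```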
> With speculative loads and 128 instructions in flight, there should be no problem covering the latency of the L2. In a single-threaded situation, all of the L2's resources will be dedicated to a single core.

The L2's capacity is the big non-debatable advantage BD has over SB that I see.
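Back-of-the-envelope on the coverage claim: the 128-in-flight figure is from the post above, while the issue width and L2 latency below are assumed values, not published numbers.

```c
/* Can 128 instructions in flight hide the L2 latency? A window of 128
 * instructions at a sustained 4-per-cycle issue rate represents 32 cycles
 * of work, comfortably more than an assumed ~20-cycle L2 load-to-use. */
#include <stdio.h>

int main(void) {
    int window = 128;  /* instructions in flight (from the post)    */
    int width  = 4;    /* assumed sustained issue width, per cycle  */
    int l2_lat = 20;   /* assumed L2 load-to-use latency, in cycles */
    int cover  = window / width;
    printf("window covers %d cycles vs. L2 latency of %d -> %s\n",
           cover, l2_lat, cover >= l2_lat ? "covered" : "exposed stall");
    return 0;
}
```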
> Within the processor core, BD has an assumed small per-clock advantage in its FP rename resources when not using AVX. It has potentially higher 128-bit throughput, though the extent to which this can be exploited is capped by the load/store capability of a single core and a smaller number of issue ports for FP operations.

For single-threaded work, shouldn't FP loads and stores from a single thread be able to use both load/store paths? L2-to-L1 data only has to be duplicated so that everything in either L1 is in the other, to enable arbitrary loads, and stores from the FP unit can use both paths to keep the two L1s coherent.