AMD Bulldozer Core Patent Diagrams

hoom · Aug 29, 2010

No, because AM3 processors work in the AM3+ socket. Thuban should have been AM3 to go in a newly launched AM3+ socket.

Hmm, I see what you're getting at there, would have been nice yes.

1 SB core ~2x the fp rate of 1 BD core.

Presumably function split like with i7 so one is ADD, the other MUL? Which is not quite the same as 2* 256bit.

Comparison I thought goes like this for 4 module BD vs 4 core SB: 8 cores vs 4 cores, 8 threads vs 8 threads & 4*256bit FP vs 4*256bit FP

itsmydamnation · Aug 29, 2010

rpg.314 said:
Yes. 1 SB core ~2x the fp rate of 1 BD core.

from my understanding.

depends on what's actually being done and what you are comparing to what. if you compare a "module" to an SMT core then from my understanding BD has a whole heap more FP resources (2x 128bit add + 2x 128bit mul, or 2x 128bit FMA, vs 1x 128bit add + 1x 128bit mul) except for AVX, where both SB and BD appear to be able to do 1x 256bit add and 1x 256bit mul a cycle. Remember that AMD has implemented an FMA bridge to allow separate muls and adds from the 2 128bit FMA units.

if you then compare two SB cores (4 threads) to one BD module (2 cores, 2 threads) then 128bit FP execution resources are the same but SB has twice the 256bit AVX throughput.

rpg.314 · Aug 29, 2010

itsmydamnation said:
from my understanding. if you then compare two SB cores (4 threads) to one BD module (2 cores, 2 threads) then 128bit FP execution resources are the same but SB has twice the 256bit AVX throughput.

The 256bit and 128 bit resources aren't separate. The same FP unit is used according to the incoming instruction.

1 BD module can do 2x128b FMA/clock.

1 SB core can do 1x256b mul and 1x256b add per clock.

Which is why the fp throughput in SB is 2x that of BD module.

itsmydamnation · Aug 29, 2010

rpg.314 said:
The 256bit and 128 bit resources aren't separate. The same FP unit is used according to the incoming instruction.

1 BD module can do 2x128b FMA/clock.

1 SB core can do 1x256b mul and 1x256b add per clock.

Which is why the fp throughput in SB is 2x that of BD module.

each BD module can do 2X128bit muls and 2x128bit adds or 2X128bit FMA.

Figure 2 shows a high level block diagram of the bridge fused multiply-add architecture. The
design begins with common floating-point multiplier and floating-point adder units capable of
independent execution. Several blocks are added between the two arithmetic units, creating a
“bridge” capable of carrying data from one unit to the other to perform a fused multiply-add
instruction.

http://users.ece.utexas.edu/~quinnell/Research/Bridged Floating-Point Fused Multiply-Add Design.pdf

now as far as I'm aware SB can't do 4x 128bit non-AVX (SSE) ops per core (2 add, 2 mul) per cycle but per module BD can. So as far as I'm aware I'm correct?

AVX is another matter: BD should be able to do 1x 256bit add and 1x 256bit mul a cycle.
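
To make the competing tallies in this exchange concrete, here is a rough Python sketch. It assumes single-precision (32-bit) lanes and takes the unit counts straight from the posts above. The function name and scenario labels are made up for illustration, and the bridged case is an unconfirmed assumption about Bulldozer, not a known spec.

```python
# Peak single-precision FLOPs per cycle under the unit counts claimed in this thread.
# All figures are the posters' assumptions, not confirmed hardware specs.

def flops_per_cycle(units):
    """units: iterable of (count, width_bits, flops_per_lane) tuples."""
    return sum(count * (width // 32) * flops for count, width, flops in units)

scenarios = {
    "SB core, AVX (1x256 mul + 1x256 add)": [(1, 256, 1), (1, 256, 1)],
    "SB core, SSE (1x128 mul + 1x128 add)": [(1, 128, 1), (1, 128, 1)],
    "BD module, FMA (2x128 FMA)": [(2, 128, 2)],  # an FMA counts as 2 FLOPs per lane
    "BD module, bridged (2x128 mul + 2x128 add)": [(2, 128, 1), (2, 128, 1)],
    "BD module, unfused, one op per FMA pipe": [(2, 128, 1)],
}

results = {name: flops_per_cycle(units) for name, units in scenarios.items()}

for name, value in results.items():
    print(f"{name}: {value} FLOPs/cycle")
```

On this counting, a 2x gap in favour of an SB core only appears if each BD FMA pipe issues a single unfused mul or add per cycle. With FMA code, or with a bridge that really does allow 2 muls plus 2 adds, a module and an SB core tie on paper.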

Ethatron · Aug 29, 2010

itsmydamnation said:
each BD module can do 2X128bit muls and 2x128bit adds or 2X128bit FMA.

You mean 256bit.
It can also issue 1x256bit mul + 1x256bit add, and 2x128bit FMA (if Dresdenboy's FMA-block is correct; which is in line with what Cypress can do).

More problematic is that previously the integer SSE2 instructions like pshuf had a throughput of 2 per cycle because FADD and FMUL were both able to execute them. Now, I can't believe the FMA-block is able to do that, or else that is some crazy complex piece of logic.

hkultala · Aug 29, 2010

Ethatron said:
You mean 256bit.

No, he does not.

His "and" is "can do those at the same time".

Ethatron · Aug 29, 2010

hkultala said:
No, he does not.

His "and" is "can do those at the same time".

My bad, I even repeated it. :smile:

I think the "MMX"-pipe should have the old FADD capabilities and the "FMA"-pipe should have the old FMUL capabilities including defuseable fma. If that is correct, the add-branch of the fma should be able to do bit-operations (and/or/xor/unpack etc.).

That could allow
2x256bit add/etc. + 1x256bit mul or
1x256bit add/etc. + 1x256bit fma.

fellix · Aug 30, 2010

Bulldozer 20 questions -- part II

fehu · Aug 30, 2010

the best explanation that i've read until now :|

Ask yourself, would you rather have a 4-cylinder engine that delivered 300HP or a 6-cylinder engine that delivered 360HP and consumed less gas? The cylinder to horsepower ratio for 4-cylinder is obviously higher (75HP/cylinder vs. the V6’s 60HP/cylinder), meaning that each cylinder can give you more performance. However, looking at the overall engine, you are getting less total output; and you are getting that lower output at a higher cost (higher gas consumption).
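
The per-unit versus total trade-off in that analogy is simple arithmetic. A tiny Python sketch using only the figures quoted above (the dictionary layout and variable names are just for illustration):

```python
# Figures from the engine analogy: (cylinders, horsepower).
engines = {"4-cylinder": (4, 300), "6-cylinder": (6, 360)}

# Output per cylinder (the "per core" view) versus total output (the "per chip" view).
per_cylinder = {name: hp / cyl for name, (cyl, hp) in engines.items()}
total = {name: hp for name, (cyl, hp) in engines.items()}

best_per_unit = max(per_cylinder, key=per_cylinder.get)
best_total = max(total, key=total.get)

print(per_cylinder)
print("best per cylinder:", best_per_unit, "| best total:", best_total)
```

The two rankings disagree, which is the whole point of the analogy: the winner per unit is not the winner overall.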

Albuquerque · Aug 30, 2010

fehu said:
the best explanation that i've read until now :|

...Except that, ever since the Core 2 series, AMD has been the four cylinder of this equation. When compared to a performance-equivalent Intel part, AMD has been clocked higher and consuming more power.

Here's hoping that Bulldozer is all they hope it to be, but I still don't see it hitting the perf/watt capacity of Nehalem at 32nm -- or perhaps even Nehalem at 45nm.

3dilettante · Aug 30, 2010

One of AMD's central problems was pushing the K8 core as long as it has. Aside from incremental changes, the scheduling, branch prediction, and memory pipeline weaknesses were inextricably linked to design decisions made a decade ago.

I would hope a new core redesigned to face the environment at the current process geometries and design paradigms would do better.

At 32nm, I would also hope Bulldozer manages to do better than a 45nm Nehalem.
A Bulldozer chip should have peak FP resources much higher than a current Nehalem, so hopefully it can at least manage that.
Integer performance is less clear cut, particularly in a single-threaded situation.
One of the unknowns that could shift things is if the design's emphasis on higher clocks and power management succeeds, and a core's turbo can hit the 25-30% higher clocks, given the alleged FO4 reduction per stage. Naively, a Phenom with a pipeline that allows turbo to 3.6 GHz would, if transformed into a Bulldozer core, have clocks around 4.5 GHz (much laughter at the simplification of a complex problem with specious math aside).
That hopefully will put it at least a little ahead of the current Nehalems. (edit: this does exclude the claimed IPC advantage BD has over Phenom)
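
For what it's worth, the naive clock projection above works out as follows. This is a back-of-the-envelope Python sketch using only the 3.6 GHz turbo figure and the 25-30% uplift mentioned in the post. The variable names are illustrative, and it deliberately ignores IPC, power limits and everything else that makes the real problem hard.

```python
# Naive clock scaling: same design, higher clocks from the alleged FO4 reduction.
base_turbo_ghz = 3.6          # Phenom turbo figure used in the post
uplift_range = (0.25, 0.30)   # claimed 25-30% clock gain

projected = [round(base_turbo_ghz * (1 + uplift), 2) for uplift in uplift_range]

print(f"projected turbo: {projected[0]} to {projected[1]} GHz")
```

The low end of the range gives the "around 4.5 GHz" figure quoted above.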

One problem is that server-derived Zambezi looks to be a poor fit for desktop workloads. There will not be a massive need for 8 int cores, and the FP unit does not get the full L1 bandwidth it could use if the chip is idling one of the cores. It may not hurt too much against a 45nm Nehalem, but Sandy bridge will not be operating at that level.

edit edit:
Sorry for the edit-fest, but one thing that is really starting to bug me is the prevalence of the claim that Phenom can only do a mix of 3 ALU/AGU ops per cycle, whereas BD can do ALU+ALU+AGU+AGU per cycle, thus claiming that BD is wider.

The three instruction schedulers in K8 receive macro ops.
A macro op is similar to a fused micro op in an Intel chip.
That means it is often an ALU and AGU op in the same entry.
Each scheduler is capable of breaking a macro op down into its constituent micro ops, that is up to one ALU and one AGU per cycle.

In theory, K8 can send off a burst of ALU+ALU+ALU+AGU+AGU+AGU in a cycle. Two memory ops can be sent through the AGUs, along with one LEA.

So that means Phenom is definitely wider than Bulldozer, though much less capable of hitting that peak.

source: www.agner.org/optimize/microarchitecture.pdf
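
The width argument above can be tallied in a few lines of Python. This is a sketch of the peak issue counts as described in the post (three K8 schedulers, each able to issue one ALU and one AGU micro-op per cycle, versus the ALU+ALU+AGU+AGU claim for a Bulldozer core). The constant and variable names are made up for illustration, and none of this says anything about sustained throughput.

```python
# Peak integer micro-ops issued per cycle, per the description in the post.
K8_SCHEDULERS = 3                         # each scheduler holds macro-ops
K8_PER_SCHEDULER = {"ALU": 1, "AGU": 1}   # micro-ops one scheduler can issue per cycle

k8_peak = {kind: K8_SCHEDULERS * n for kind, n in K8_PER_SCHEDULER.items()}
bd_peak = {"ALU": 2, "AGU": 2}            # per Bulldozer core, as claimed in the thread

k8_total = sum(k8_peak.values())
bd_total = sum(bd_peak.values())

print("K8 peak:", k8_peak, "total", k8_total)
print("BD peak:", bd_peak, "total", bd_total)
```

On paper that is 6 versus 4, which is the sense in which Phenom is "wider" even though it rarely gets near that peak.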

hoom · Aug 31, 2010

would you rather have a 4-cylinder engine that delivered 300HP or a 6-cylinder engine that delivered 360HP and consumed less gas?

Depends on whether the distributor can be trusted to properly distribute firing commands between cylinders, if the petrol gets to choose how many cylinders to fire on & if a lot of the petrol I use is likely to only fire on one cylinder

source: www.agner.org/optimize/microarchitecture.pdf

Cool, will be studying that

Jawed · Aug 31, 2010

David Kanter's article

http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=1

3dilettante · Aug 31, 2010

After a quick read-through, it appears that a single Bulldozer core in many aspects acquits itself rather well against Westmere.
If one core is idled, there is an embarrassment of riches when it comes to decode hardware, though the extent of it would depend on whether the single-cycle switching allows for one thread to take up both cycles in single-thread mode, instead of a fetch/idle/fetch/idle pattern. There is no mention of a loop cache per se, which would be handy in certain cases if it involved idling that hefty front end, or allowing one core to loop while the other could grab the whole front end.

The load/store unit is better than K8, though in numerical terms it is weaker than Westmere in a single-threaded case. In multithreaded cases, Westmere would have its L/S capabilities halved per-thread, while Bulldozer would have the full complement per thread and much more in aggregate.

There are too many variables, and there is the prospect that AMD retooled the coherence protocol, which may help quite a bit, as Intel's MESIF likely contributed to some of its scalability with Nehalem.

I find it difficult to compare the capabilities of Westmere and Bulldozer when it comes to FP. There are a number of questions there. It does seem that the FP unit's sustainable read bandwidth is capped at what a single core can offer, which puts a dent in what can be expected with the surfeit of bandwidth available with two cores with their memory pipelines supplying operands. The general level of utilization should be decent. Westmere looks like it can handle more per clock, and it can muster higher math throughput without losing an FP pipe to a shuffle, which the XBAR ops look to do in Bulldozer.

The L2 cache latency is a potential Achilles heel, particularly coupled with the tiny L1. This chip, and the L2 in particular, better clock high to justify the yawning gap in latency. It is nicely sized, though. (edit: but damn, if it takes that long to get the L2, how long is it going to be for the L3? The L3 still may be too small.)
Maximal FP throughput may not be achievable without using FMAC instructions (the bridged FMAC scheme is not confirmed or denied, exactly).

Then there is the problem that it is not Westmere Bulldozer needs to worry about.
I still do not see Zambezi as being well-fitted to the desktop, and making Bulldozer match that market may be a significant test of AMD's modular philosophy.

rpg.314 · Aug 31, 2010

3dilettante said:
I still do not see Zambezi as being well-fitted to the desktop, and making Bulldozer match that market may be a significant test of AMD's modular philosophy.

My guess would be that AMD is relying on Fusion to save the desktop market.

It's a pity fusion based chips can't have bandwidth anywhere near their discrete brethren.

3dilettante · Aug 31, 2010

I am skeptical that Llano can hold ground against what is coming, especially since it has attained that all-too-common AMD product adjective: delayed.
Even if Llano were on-time, that would only be relative to a seriously delayed rollout, and Intel may have a number of quarters where it can saturate much of the market with superior CPUs with inferior but acceptable graphics.

One thing I failed to note earlier for Bulldozer is that integer side is still slanted in favor of Intel.
The scheduling capabilities are somewhat close, but the peak capabilities are significantly lower for Bulldozer. In a multithreaded case this gap would not apply, but without very high turbo the single-threaded case is in trouble. Even with high turbo, that cache latency could rapidly exhaust the amount of speculative work the chip can marshal.

Given the lack of a true desktop variant of Zambezi, the unknown delay before there is a Bulldozer Fusion variant might presage enough stagnation that Llano vs Sandy Bridge is the *best* matchup AMD can manage, and that it may get worse with Ivy Bridge onwards for another generation or so before Fusion2.

The other elephant in the room is the 32nm gate-first HiKMG SOI process.
Which, for all of AMD's and GF's bluster, has not been shown to have overcome known problems, and it faces the public weight of pretty much everybody else going gate-last, ahead of AMD, faster than AMD, and with great effect in terms of yields and variability.

Jawed · Aug 31, 2010

3dilettante said:
After a quick read-through, it appears that a single Bulldozer core in many aspects acquits itself rather well against Westmere.
If one core is idled, there is an embarrassment of riches when it comes to decode hardware, though the extent of it would depend on whether the single-cycle switching allows for one thread to take up both cycles in single-thread mode, instead of a fetch/idle/fetch/idle pattern. There is no mention of a loop cache per se, which would be handy in certain cases if it involved idling that hefty front end, or allowing one core to loop while the other could grab the whole front end.

This, "These stages are effectively multi-threaded with single cycle switching between threads. The arbitration between the two cores is determined by a number of factors including fairness, pipeline occupancy and stalling events." seems to imply that a single thread can entirely occupy the front-end. Otherwise there'd be no reason to have arbitration, as they'd run in a simplistic time-sliced fashion.
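
A toy Python model of the difference, purely as a sketch: the two functions and their names are hypothetical, and the "arbiter" here only implements a simple take-turns fairness rule, not the pipeline-occupancy or stall inputs the article mentions. It just shows why arbitration lets a lone thread use every front-end cycle while strict time-slicing would waste half of them.

```python
def time_sliced(active, cycles):
    """Strict alternation: cycle i belongs to thread i % 2, wasted if that thread is idle."""
    slots = [0, 0]
    for cycle in range(cycles):
        owner = cycle % 2
        if active[owner]:
            slots[owner] += 1
    return slots


def arbitrated(active, cycles):
    """Toy arbiter: each cycle goes to a ready thread, preferring the one not served last."""
    slots = [0, 0]
    last = 1  # so thread 0 goes first when both are ready
    for _ in range(cycles):
        ready = [t for t in (0, 1) if active[t]]
        if not ready:
            continue
        pick = next((t for t in ready if t != last), ready[0])
        slots[pick] += 1
        last = pick
    return slots


one_thread = [True, False]
both_threads = [True, True]

print("time-sliced, one thread:", time_sliced(one_thread, 100))
print("arbitrated,  one thread:", arbitrated(one_thread, 100))
print("arbitrated,  two threads:", arbitrated(both_threads, 100))
```

With one core idle, the time-sliced version gives the fetch/idle/fetch/idle pattern, while the arbitrated version hands every cycle to the running thread.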

The L2 cache latency is a potential Achilles heel, particularly coupled with the tiny L1. This chip, and the L2 in particular, better clock high to justify the yawning gap in latency. It is nicely sized, though. (edit: but damn, if it takes that long to get the L2, how long is it going to be for the L3? The L3 still may be too small.)

The whole thing seems consistently biased towards high clocks and making prediction considerably more robust.

Maximal FP throughput may not be achievable without using FMAC instructions (the bridged FMAC scheme is not confirmed or denied, exactly).

I get a strong sense that AMD isn't interested in maximal FP throughput.

Jawed · Aug 31, 2010

3dilettante said:
I am skeptical that Llano can hold ground against what is coming, especially since it has attained that all-too-common AMD product adjective: delayed.
Even if Llano were on-time, that would only be relative to a seriously delayed rollout, and Intel may have a number of quarters where it can saturate much of the market with superior CPUs with inferior but acceptable graphics.

I'm dubious that x86 CPUs with on-die graphics will be compared on CPU performance. Particularly when SB is aiming merely to outclass Ontario, while Llano is likely to be considerably faster.

rpg.314 · Aug 31, 2010

3dilettante said:
I am skeptical that Llano can hold ground against what is coming, especially since it has attained that all-too-common AMD product adjective: delayed.
Even if Llano were on-time, that would only be relative to a seriously delayed rollout, and Intel may have a number of quarters where it can saturate much of the market with superior CPUs with inferior but acceptable graphics.

True.

BD and Fusion are coming 2 years later than AMD's original plans. If they had been, say, even a year earlier, we would have seen a quite different environment.

The other elephant in the room is the 32nm gate-first HiKMG SOI process.
Which, for all of AMD's and GF's bluster, has not been shown to have overcome known problems, and it faces the public weight of pretty much everybody else going gate-last, ahead of AMD, faster than AMD, and with great effect in terms of yields and variability.

GF is sticking with gate-first for 28nm bulk too. Seems like they have drunk too much of their own "10% more density" kool-aid. Let's hope they get their priorities right for 22nm.

3dilettante · Aug 31, 2010

Jawed said:
I'm dubious that x86 CPUs with on-die graphics will be compared on CPU performance. Particularly when SB is aiming merely to outclass Ontario, while Llano is likely to be considerably faster.

The product that gluts the market first wins.
Much of the market is not going to care past the fact that Sandy Bridge's frame rates aren't an absolute travesty (though Intel is free to royally screw up its drivers).
More importantly, it will likely be available first and in volume.

I think Intel's movements in the mainstream indicate this is a likely goal. If AMD were uncontested for a period of time, the upside for Llano would have been higher.

It will be interesting to see how AMD fares in producing Llano, since any difficulty in manufacturing a known design versus Bulldozer may impact the speed racer more so than the design that has given up on circuit speed leadership.
