AMD Bulldozer Core Patent Diagrams

rpg.314 · Aug 24, 2010

Squilliam said:
Thanks! It looks like they got it out a day early, ZOMG!

I don't think this is their entire presentation @HotChips.

Ethatron · Aug 24, 2010

Instruction-set extensions ... does that mean what I think? Real Bulldozer (or 2i:1f shared lego) instructions? Bypass this or that at front or back, exclusive FP mode for a thread, FP-free thread declaration, 128-bit/256-bit FP-stack distinction, etc.?

rpg.314 · Aug 24, 2010

Ethatron said:
Instruction-set extensions ... does that mean what I think? Real Bulldozer (or 2i:1f shared lego) instructions? Bypass this or that at front or back, exclusive FP mode for a thread, FP-free thread declaration, 128-bit/256-bit FP-stack distinction, etc.?

My guess would be SSSE3 + SSE4.1 + SSE4.2 + AVX + XOP + FMA3 + FMA4.
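For anyone who wants to check that guess against real hardware later, here is a minimal sketch, assuming a Linux box where `/proc/cpuinfo` exposes a `flags` line. The helper name `supported_extensions` and the `GUESSED_FLAGS` table are made up for illustration; the flag strings are the names the Linux kernel reports for these extensions (FMA3 shows up as plain `fma`).

```python
# Hypothetical helper, not from the thread: maps the guessed extensions
# to the flag names Linux prints in /proc/cpuinfo.
GUESSED_FLAGS = {
    "SSSE3": "ssse3",
    "SSE4.1": "sse4_1",
    "SSE4.2": "sse4_2",
    "AVX": "avx",
    "XOP": "xop",
    "FMA3": "fma",
    "FMA4": "fma4",
}


def supported_extensions(cpuinfo_text):
    """Return {extension name: bool} using the first 'flags' line found."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            # Everything after the colon is a space-separated flag list.
            flags = set(line.split(":", 1)[1].split())
            break
    return {name: flag in flags for name, flag in GUESSED_FLAGS.items()}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or file unreadable: everything reports "no"
    for name, present in supported_extensions(text).items():
        print(f"{name:7s} {'yes' if present else 'no'}")
```

The parsing lives in a function that takes text rather than a path, so it can be tried on a pasted `flags` line from any machine.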

rpg.314 · Aug 24, 2010

Finally, a non-fluff piece.

http://www.anandtech.com/show/3863/amd-discloses-bobcat-bulldozer-architectures-at-hot-chips-2010

However, this is something I don't understand at all.

If 2 module (4-core) Bulldozer CPUs go up against dual-core Sandy Bridge things could get very interesting.

Giving up on BD by THAT margin already? Or is it just a typo?

hoom · Aug 24, 2010

Haven't read any of it yet but here's a few more bits:
John Fruehe 20 questions on Bulldozer

PC Perspective

Hotchips slides at [H]

trinibwoy · Aug 24, 2010

rpg.314 said:
Giving up on BD by THAT margin already? Or is it just a typo?

Well, if a single BD module (2 cores) is only as wide as a single SB core, then why not? BD looks like it might do well in terms of area efficiency, but I don't see anything to indicate blockbuster absolute performance vs Intel's stuff, especially in lightly threaded applications.

Jawed · Aug 24, 2010

I think AMD has just pulled another socket 939 fiasco. AM3 motherboards won't accept Bulldozer.

Psycho · Aug 24, 2010

Is that confirmed? This is the only place I've seen it mentioned. I think AM3 was said to be BD-compatible some months ago.

AlexV · Aug 24, 2010

Jawed said:
I think AMD has just pulled another socket 939 fiasco. AM3 motherboards won't accept Bulldozer.

AFAIK it'll sort of work, similarly to how K8L worked in some AM2 boards. The problem is that you're likely to lose some functionality and are at the mercy of motherboard makers, which apparently hire monkeys to work on their BIOSes more often than not, and thus have trouble making even more trivial things work. Mind you, my data about AM3 compatibility may be outdated/incorrect, so take it with a grain of salt.

hoom · Aug 24, 2010

Jawed said:
AM3 motherboards won't accept Bulldozer.

Oh bugger

I guess I'm not especially surprised though, AMD sockets have been around for ~2 years for the last several generations & AM3 will be about that age when Bulldozer is out.

2 years is no worse than the longevity of most Intel sockets.
LGA 775 was abnormally long-lived for Intel, covering 4 years & 3 generations of CPU (P4 to Core 2 Quad), but within that period there were, I think, 3 sub-variants, with each new CPU generation requiring new mobo revisions.

Squilliam · Aug 24, 2010

So how does Bulldozer compare to Bobcat? I mean in terms of performance/watt and area? Anyone want to hazard a guess on this?

V3 · Aug 24, 2010

I am just not seeing how BD will be competitive with Sandy Bridge. But I'd rather they designed it for a new socket; might as well, it's a 2011 part (with desktop after server parts). No need to take a compromise route with AM3.

3dilettante · Aug 24, 2010

The integer side is narrower than Phenom, with a 2 INT and 2 AGU setup. I had thought it could have been something a little more than that, such as having a 4-way INT setup with two of the lanes capable of AGU and some simple operations. It does not seem to be that way, going with what has been released so far.
It's not necessarily a killer because with so many reg/mem ops out there the expected throughput would be closer to the maximum number of sustainable loads, not the number of integer units stuck doing nothing. I would expect a per-clock deficit in codes that do manage to find a way to use 3 or more integer units.

That is per-clock, though. It seems the design is very much interested in upping that part.

The decoupled branch predictor would have been needed when the front-end was shared. It may correct a known deficit with K10, where even predicted branches insert a single-cycle bubble.

There are hints, but not much meat about the significant front-end and memory pipe changes. If the turbo mode can drive a single core to very high clocks, it can make up for a lot, particularly in low-IPC pointer chasing.
More would be revealed in later disclosures. A lot of interesting details hinted at at this early time could change things.

One possible downside is that such dramatic clocking may be held back by the module-level gating and the shared FPU.
Also, the higher core count will hurt AMD in software that is licensed at a per-core level.
I'd think Bulldozer would be hard-pressed to make up for a full node disadvantage (edit: as in that is what is likely to be its competition for most of its volume release). I don't think it is revolutionary enough to compensate for being late to the party, and that is with the hope that process issues won't constrain it. The smaller die size may be necessary just to keep yield acceptable.
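To put rough numbers on the 2 ALU vs 3 ALU argument above, here is a toy bound, not a simulator. It assumes sustained IPC is capped by whichever runs out first: integer ALUs or sustainable memory ops per cycle. The function name `ipc_bound` and every instruction mix below are illustrative assumptions, not measurements of K10 or Bulldozer.

```python
def ipc_bound(n_alu, mem_ops_per_cycle, alu_per_inst, mem_per_inst):
    """Upper bound on instructions per clock for a simple resource model.

    n_alu             -- integer ALUs available each cycle
    mem_ops_per_cycle -- memory operations the core can sustain each cycle
    alu_per_inst      -- average ALU operations per instruction in the mix
    mem_per_inst      -- average memory operations per instruction in the mix
    """
    bounds = []
    if alu_per_inst > 0:
        bounds.append(n_alu / alu_per_inst)
    if mem_per_inst > 0:
        bounds.append(mem_ops_per_cycle / mem_per_inst)
    if not bounds:
        raise ValueError("instruction mix uses neither ALUs nor memory")
    return min(bounds)


# Assumed mixes: one heavy on reg/mem ops, one dense in pure ALU work.
mixes = {
    "reg/mem heavy": (0.7, 0.8),
    "ALU dense": (1.0, 0.2),
}

for label, (alu, mem) in mixes.items():
    wide = ipc_bound(3, 2, alu, mem)    # hypothetical 3-ALU core
    narrow = ipc_bound(2, 2, alu, mem)  # hypothetical 2-ALU core
    print(f"{label:14s} 3-ALU bound {wide:.2f}  2-ALU bound {narrow:.2f}")
```

Under these assumed mixes the reg/mem-heavy case is memory-bound on both configurations, so dropping the third ALU costs nothing, while the ALU-dense case is where the per-clock deficit appears.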

rpg.314 · Aug 24, 2010

3dilettante said:
I'd think Bulldozer would be hard-pressed to make up for a full node disadvantage (edit: as in that is what is likely to be its competition for most of its volume release). I don't think it is revolutionary enough to compensate for being late to the party, and that is with the hope that process issues won't constrain it. The smaller die size may be necessary just to keep yield acceptable.

Be that as it may, any improvement over the current competitive situation is welcome. I dare say, if it can even match Nehalem on the desktop, it would be a big sigh of relief. Not that it will be easy.

fellix · Aug 24, 2010

3dilettante said:
The integer side is narrower than Phenom, with a 2 INT and 2 AGU setup. I had thought it could have been something a little more than that, such as having a 4-way INT setup with two of the lanes capable of AGU and some simple operations. It does not seem to be that way, going with what has been released so far.
It's not necessarily a killer because with so many reg/mem ops out there the expected throughput would be closer to the maximum number of sustainable loads, not the number of integer units stuck doing nothing. I would expect a per-clock deficit in codes that do manage to find a way to use 3 or more integer units.

The third ALU/AGU in K8 and K10 was actually mostly redundant from a utilization standpoint. AMD left it in the last iteration of its architecture to keep the dispatch symmetry and avoid a major and costly layout redesign.

3dilettante · Aug 24, 2010

fellix said:
The third ALU/AGU in K8 and K10 was actually mostly redundant from a utilization standpoint. AMD left it in the last iteration of its architecture to keep the dispatch symmetry and avoid a major and costly layout redesign.

Per AMD, via Anandtech, the third AGU was kept around for symmetry. Given how minuscule it was, it wasn't a major sacrifice to make the job of scheduling a little easier.
I've seen that reasoning for the third AGU before.

I did not see that argument made for the third ALU, particularly since that would have significantly bumped up the size of the scheduling portion of the chip.

Some of the undisclosed information is how the alleged 2 Load/1 Store architecture aims to maintain that throughput with 2 AGUs.

eastmen · Aug 24, 2010

Only 4 cores, huh. I was hoping to see 6 and 8 core variants. Will AMD ever compete at the high end again?

3dilettante · Aug 24, 2010

The products are outlined to go up to 4 modules per chip, which is 8 cores, and 16 cores with an MCM of two 4-module chips.

hoom · Aug 24, 2010

From the Bulldozer 20 questions:

As for core counts, here is what we have committed to at this point:
* “Interlagos” – 16-core server processor
* “Valencia” – 8-core server processor
* “Zambezi” – 8-core client processor

Valencia is a 4 module chip.
Interlagos is a Magny Cours type MCM.
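The module arithmetic is easy to trip over, so here is a small sketch of it. It assumes two cores per module and takes the module and die counts from the posts above; the function name `total_cores` is made up for illustration.

```python
CORES_PER_MODULE = 2  # per the thread: one Bulldozer module = two cores


def total_cores(modules_per_die, dies_per_package=1):
    """Core count for a package built from identical dies."""
    return modules_per_die * CORES_PER_MODULE * dies_per_package


# Configurations as described in the thread.
products = {
    "2-module part from the Anandtech quote": total_cores(2),
    "Valencia (4 modules, single die)": total_cores(4),
    "Interlagos (MCM of two 4-module dies)": total_cores(4, dies_per_package=2),
}

for name, cores in products.items():
    print(f"{name}: {cores} cores")
```

Zambezi is left out of the table because the thread gives only its core count, not its module layout.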

Wow, Anandtech says 16KB L1 Data cache

I'd been expecting they'd probably drop to 32KB.
They must be very confident in their new pre-fetching then, & being so small it had better be really fast.

No mention of L2 & L3 sizes yet.
I don't speak German, but I think Dresdenboy's article seemed to be revising his estimate of L2 down to 1MB vs the 2MB that's previously been thrown around? Also, I think he was suggesting a move to an inclusive cache structure instead of the historical AMD exclusive cache. That would also fit with the smaller L1 Data cache.

Confirmed 2 ALUs + 2 AGUs per core is kind of relieving, kind of sad.
Relieving because it seemed insane, in terms of decoding/scheduling complexity, to try to feed 2 x 4-wide cores with a single decoder. Kind of sad that AMD engineers haven't worked out some loophole in physics/scheduling complexity that would have enabled it.

Bulldozer 20 questions said:
Modules do impact the way that certain CPU features are addressed

A hint that they have made sure the Windows scheduler will play nicely with modules? Or something more mundane, like Anand saying power gating etc. is module-level.

Bulldozer 20 questions said:
Today most workloads are integer with a much smaller portion being floating point.

I'll take that as confirmation of my premise that the INT:FP instruction ratio is typically at least 2:1.

Jawed · Aug 24, 2010

My understanding is that it's possible to issue AGU+AGU+ALU+ALU in a single core.
