AMD Bulldozer Core Patent Diagrams

1: Why do this? It seems that this architecture is going to lean much more heavily on microcode than previously, going by the patents.
Does this mean a regression of sorts, are the direct-path encoders weaker than they have been in the past, or is it that there are going to be a ton of new microcode entries?
There are patents that are interesting, if vaguely reminiscent of failed CISC architectures that went for extremely programmable microcode engines.

Going by the diagram, with separate AGUs and ALUs, I think they intend to crack instructions with memory operands, similar to what Intel does and unlike K7/8/x. Fastpath instructions would just be instructions with a one-to-one correspondence to internal micro-ops. The micro-decoder would output two (or more) micro-ops per instruction.

Cheers
 
Going by the diagram, with separate AGUs and ALUs, I think they intend to crack instructions with memory operands, similar to what Intel does and unlike K7/8/x.
Separate AGUs are nothing new for AMD chips.
I'd view the ALU/AGU pairs as evidence that this hasn't changed.

Micro-op fusion for Pentium M and onwards indicates Intel has moved towards not cracking ALU/MEM ops as well (or cracking them and then immediately fusing them back together or however else Intel's design does this internally).

I wondered a little bit about the Bulldozer CMT feature after reading / glancing over the PDFs mentioned in the latest information from Dresdenboy:

http://citavia.blog.de/2009/09/06/architecture-cpu-cluster-microarchitecture-multithreading-6910375/

It seems that CMT hurts the per-MHz performance of normal single-threaded workloads by between 5% and 40%. So does it really make sense to use a clustered integer pipeline?

Clustering cuts down on the complexity of structures like the register file, reorder buffer, and forwarding network versus a structure where the same number of execution resources are linked together completely.
If all else were equal, clustering could potentially be a loss.

However, as the diagram suggests, integer resources increase by a third.
Each local register file can be physically smaller for a given amount of capacity, as it doesn't need to have as many register read/write ports.
This can either allow a larger register file per cluster, or a faster register file.

A larger file can mean more instructions in flight, which means more opportunities for OOE.
A faster register file might allow higher clock speeds.
Either benefit could offset the penalties of clustering.
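The port-count argument can be put in rough numbers with the common rule of thumb that register-file cell area grows with roughly the square of the port count, since every extra port adds a wordline and a bitline to each cell. A toy sketch; all entry counts and port counts are illustrative, not actual Bulldozer parameters:

```python
# Toy model: register-file area scales roughly with entries * ports^2,
# since every extra port adds a wordline and a bitline to each cell.
# All widths and port counts below are illustrative, not Bulldozer's.

def rf_relative_area(entries, read_ports, write_ports):
    ports = read_ports + write_ports
    return entries * ports ** 2

# One unified 4-issue integer core: ~8 read + 4 write ports (assumed).
unified = rf_relative_area(entries=64, read_ports=8, write_ports=4)

# Two 2-issue clusters: each needs only ~4 read + 2 write ports.
clustered = 2 * rf_relative_area(entries=64, read_ports=4, write_ports=2)

print(unified / clustered)  # 2.0: the unified file costs ~2x the area
```

Under these made-up numbers the two clustered files together take about half the area of one unified file with the same total capacity, which is the headroom the clusters could spend on extra entries or a shorter access time.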

Either way, it's not an apples-to-apples comparison with a larger unclustered core, which AMD apparently decided would not be worth implementing.

Given that there is a resource monitor block in a lot of places, I'm curious if there is another opportunity for power savings and higher clocks.

In a single-threaded case where the resource monitor detects that the core cannot achieve full issue width (some kind of narrow pointer-chasing code), it would be potentially easier to hit a higher turbo mode if a whole cluster is deactivated.
That means half of the issue logic is shut off, and half of the resources that would have been wasted anyway are shut off.
The rest of the core can ramp up frequency higher with a given power envelope than a big core that must ramp up a full complement of units.

The front end and dispatch parts are more complex, however.
This might not impact clock speeds directly, since the work can be distributed over multiple stages, but it would add to the branch mispredict penalty and to power consumption.

There seem to be additional levels of logic dedicated to speeding up mispredict handling.
 
Separate AGUs are nothing new for AMD chips.
I'd view the ALU/AGU pairs as evidence that this hasn't changed.

Micro-op fusion for Pentium M and onwards indicates Intel has moved towards not cracking ALU/MEM ops as well (or cracking them and then immediately fusing them back together or however else Intel's design does this internally).

Found this (pages 8 & 9), which details micro-op fusion. Micro-op fusion increases the instruction density in the ROB, which results in higher performance. The trade-off is a more complex dispatcher, since one entry in the ROB can result in two partial dispatches, or ..... ?

In K7/8 an instruction with a memory operand cannot be dispatched for execution before all data dependencies on the registers are resolved. Intel can start the calculation of the address as soon as the base (and index) registers are ready and start accessing the memory hierarchy. This can be a huge win; as an example, think of a tight loop summing values from memory into a register. The loop would be dominated by memory latency. With a cache miss, execution on K7/8 would grind to a halt because subsequent adds cannot start dispatching until the previous add completes and the RAW hazard on the accumulator is resolved. On Intel's P6 derivatives the address calculations and loads for the subsequent adds would still be dispatched, and the data would be ready sooner for the actual add to take place.
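The summing-loop example can be sketched with a toy cycle-count model. All latencies here are made up for illustration; this is not a real pipeline simulation, just the shape of the dependency argument:

```python
# Toy cycle counts for a loop of "add acc, [mem]" iterations.
# Latencies are made up for illustration; this is not a pipeline simulator.

LOAD_LATENCY = 100  # pretend every load misses far out in the hierarchy
N = 8               # loop iterations

# Un-cracked (the claim about K7/8): the whole reg/mem op waits on the
# accumulator, so each iteration's load starts only after the previous
# add has finished.
uncracked = N * (LOAD_LATENCY + 1)

# Cracked (P6-style): address generation doesn't depend on the
# accumulator, so all N loads overlap; only the 1-cycle adds serialize
# behind the first load's return.
cracked = LOAD_LATENCY + N

print(uncracked, cracked)  # 808 108
```

The point is only that the miss latency is paid once when loads can issue independently, versus once per iteration when the whole reg/mem op serializes on the accumulator.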

I'm mildly surprised micro-op fusion is a win at all. Why not just increase the size of the ROB?

*scratches head*

Intel must do something clever to split entries in the ROB and still be able to track them.

I can see the benefit of macro-op fusion where compares and conditional jumps are fused. The jump is data-dependent on the compare, so there's no point in injecting it into the ROB as two separate micro-ops to try to extract parallelism.

Cheers
 
In K7/8 an instruction with a memory operand cannot be dispatched for execution before all data dependencies on the registers are resolved. Intel can start the calculation of the address as soon as the base (and index) registers are ready and start accessing the memory hierarchy.
Going from what I've read, a reg/mem macro-op in a K8 reservation station dispatches the AGU operation first. The ALU will dispatch once the data comes back.
AGU ops issue when the address value and the base and index registers are ready.
The way this has been described seems very similar to what Intel does with its micro-op fusion.

I'm mildly surprised micro-op fusion is a win at all. Why not just increase the size of the ROB?
The ROB is a rather important structure to the execution pipeline, so perhaps it is not so easy to grow.
Even then, the actual retirement logic can only retire a finite number of operations.
Splitting the load/op instruction into two instructions means that one front-end instruction has twice the burden on retirement of one that was not split.
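As a back-of-envelope illustration of that retirement argument (the retire width here is an assumed round number, not a documented figure for any particular core):

```python
# Back-of-envelope: retirement cost of split vs fused load+op entries.
# Retire width is an assumed round number, purely for illustration.

RETIRE_WIDTH = 3     # ROB entries retired per cycle (assumed)
INSTRUCTIONS = 300   # a stream of load+op x86 instructions

# Split into separate load and ALU micro-ops: 2 ROB entries each.
cycles_split = INSTRUCTIONS * 2 / RETIRE_WIDTH

# Fused: one ROB entry per x86 instruction.
cycles_fused = INSTRUCTIONS * 1 / RETIRE_WIDTH

print(cycles_split, cycles_fused)  # 200.0 100.0
```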

Intel must do something clever to split entries in the ROB and still be able to track them.
I suppose we could ask AMD what they do.

The way it's described in the Intel paper, the ROB entry won't retire until both ops complete.
Since the op portion can't complete until after the load does, it may just be that a fused micro-op's load will neglect to set a finished status on the combined ROB entry, and only the ALU operation can set it.
 
100% a fake.

a) I don't think GF's 32 nm is that far along yet

b) 28MB L3 + 8MB L2 + 16 cores: is this thing a 1000 mm² chip??? Unless they threw away all the OoO hardware, I don't see how that is possible. I don't see the former happening in the first place either.
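For what it's worth, here's a crude area estimate for the rumored cache alone. The SRAM cell size and the array overhead factor are assumptions pulled from ballpark published figures, not measurements of any AMD die:

```python
# Crude area estimate for 28MB L3 + 8MB L2 worth of SRAM.
# Cell size and overhead factor are assumptions, not measured values.

MB_BITS = 8 * 1024 * 1024
cache_bits = (28 + 8) * MB_BITS   # the rumored L3 + L2 capacity

CELL_UM2 = 0.17   # ballpark 6T SRAM cell at 32nm (assumed)
OVERHEAD = 2.0    # tags, decoders, sense amps, wiring (a guess)

area_mm2 = cache_bits * CELL_UM2 * OVERHEAD / 1e6
print(round(area_mm2))  # ~103 mm^2 for the arrays alone, before 16 cores
```

So the cache by itself wouldn't force a 1000 mm² die at 32nm, but 16 OoO cores plus uncore on top of it is where the size question really lives.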
 
Err, it's overclocked, and HTRef fluctuations aren't abnormal.

This is at least dual-die thanks to G34 requiring such (no Lisbons/R2 Istanbuls on G34), and could even be dual-die + dual-socket for the extra bandwidth.

2x (8 cores / 8x512K L2 / 16MB L3) OR 4x (4 cores / 4x512K L2 / 8MB L3). Still seems big and impossible for the L3 unless you consider eDRAM, which cuts the L3 die-area requirement so much that everything could make sense at ~300mm^2.

The only thing that doesn't make sense is a 32nm CPU with 45nm eDRAM (IBM's timelines only accommodate that), but since IBM takes its POWER iterations so slowly, without much regard to process advancements, I guess AMD would have to get second choice. Plus, it's still small. :p
 
This is a leading question, but ...

How does CPU-Z figure out the process technology used to manufacture a processor?
 
This is a leading question, but ...

How does CPU-Z figure out the process technology used to manufacture a processor?

I have no idea. :p

So do multi-socket setups show up as a single CPU (i.e. just like a single-socket CPU) in CPU-Z?

The CPU-Z ("fixed"/updated) shot I showed has a processor drop-down button.

This is Magny.

EDIT: I stand corrected. The Magny shot is 2 Sockets, 12 cores each.


Speculatively the "Bulldozer" shot would be 2*(2*8).
 
The 12 core CPU is a dual die package with dual hex core dies. Each die has its own two RAM interfaces too, so the CPU essentially has a quad channel RAM interface. I'm sensing special, expensive sockets and mobos. ;)

Also, 1MB of the 6MB L3 per die goes to something called the probe filter. That's why it's reading 10MB L3.
http://www.techreport.com/forums/viewtopic.php?f=2&t=68022
[attached image: AMD_chart.gif]
 
So, Sandy Bridge should still have an edge in SIMD performance with its native 256-bit AVX implementation?!
In K8, AMD went for two 64-bit (64-bit, plus an 80-bit x87 legacy path, to be exact) FP/SIMD units acting as one for full-speed 128-bit SSEx processing, and now in Bulldozer we see similar tactics.
 
Sorry, I think I don't understand that diagram.

Is that a single Bulldozer core?
Is every Bulldozer core something like a dual core? (for lack of better terms)
 