AMD Bulldozer Core Patent Diagrams

Btw, AnandTech has updated its piece about Bulldozer to clear up confusion on core count etc.
And AMD indeed states that Bulldozer has 10-35% better integer performance per int core compared to Phenom II (at "similar" clock). Given the fewer execution units, that's really much better than what I expected. Though we'll see how these numbers hold up...
 
Potentially a bulldozer quadcore (2 modules) can also be slower than a comparable phenom2 quadcore :S
but what about die size? maybe a 3 modules / 6 core can be equal in size or even smaller than a phenom2...
 
AMD's statements are that these INT cores are wider than the INT side of Phenom.
If the per-core statements are correct, it doesn't make sense to then state that 4 Bulldozer cores that are 10-30% faster per-core are somehow slower than 4 Phenom cores that they just stated are slower.

I'm not sure the 10-30% integer improvement over Phenom at similar clocks is sufficient. That would make a nice match to Nehalem right now, not a successor in 2011.
 
But AMD is saying that on well threaded integer workloads, it can get 80% more perf with ~5% more area. But yes, Bulldozer is going up against Sandy Bridge, with probably a process handicap to boot, not nehalem.
 
Anandtech corrected that. Somebody dropped a zero off of that figure, as the newest edit indicates it is 50%.
 
That balances it much more. AMD says up to 80% perf increase for integer threaded workloads, so it's more like 50-60% more often. So, umm..., perf/area is about the same as now most of the time. It will do better than that on some workloads though.
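
For what it's worth, the perf/area arithmetic here can be sketched in a few lines of Python. This is only a back-of-envelope check: the 50% area figure and the 50-80% throughput range are the ones quoted in this thread, while the helper name and the simple ratio model are my own assumptions, not anything AMD has published.

```python
def perf_per_area(throughput_gain, area_gain):
    """Relative perf/area versus a baseline of 1.0 (hypothetical helper).

    Both arguments are fractional gains, e.g. 0.80 for "+80%".
    """
    return (1.0 + throughput_gain) / (1.0 + area_gain)


AREA_GAIN = 0.50  # corrected area figure quoted in the thread

for gain in (0.50, 0.60, 0.80):
    ratio = perf_per_area(gain, AREA_GAIN)
    print(f"+{gain:.0%} throughput, +{AREA_GAIN:.0%} area -> {ratio:.2f}x perf/area")
```

Under this model a 50-60% throughput gain lands at roughly 1.0-1.07x perf/area, and only the 80% best case reaches 1.2x.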
 
AMD's statements are that these INT cores are wider than the INT side of Phenom.
Yes, but the diagrams don't indicate this. 2 alus, 1 store, 1 load unit. Compared to K8/K10, 3 alus, plus 1 shared load/store unit (which can do 2 ops per clock however). Maybe there's a mistake somewhere in those diagrams though.

I'm not sure the 10-30% integer improvement over Phenom at similar clocks is sufficient. That would make a nice match to Nehalem right now, not a successor in 2011.
I dunno, so far there's no indication really that sandy bridge will get any faster on the int side (it'll surely have beefed up float units, though I don't know if they'll actually be any faster when executing old code instead of avx).

fehu said:
Potentially a bulldozer quadcore (2 modules) can also be slower than a comparable phenom2 quadcore :S
Seems quite possible at least with FP code - K10 has 1 fadd + 1 fmul fp pipe, and bulldozer needs to share its 2 fmac pipes with 2 cores, so it could be quite a bit slower at least when not using fmac instructions.
but what about die size? maybe a 3 modules / 6 core can be equal in size or even smaller than a phenom2...
A good question. The slides though don't indicate really much higher core count, so it might not be much smaller (per 2 cores) or smaller at all. In particular, the biggest native server chip (not counting 2 stitched chips together) seems to be 4 modules / 8 cores, at 32nm even, compared to current 6 cores at 45nm.
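
To make the FP concern concrete, here is a rough Python sketch of peak FP operations per core per clock. It assumes what is stated in this thread (K10: 1 FADD + 1 FMUL pipe per core; Bulldozer: 2 FMAC pipes shared by the 2 cores of a module) and additionally assumes that an FMAC pipe issues one add or one multiply per clock on non-FMA code and counts as two operations on FMA code. The function and its parameters are hypothetical, and the model ignores vector width, scheduling and memory entirely.

```python
def peak_fp_ops_per_core(pipes, cores_sharing, ops_per_pipe):
    """Peak FP ops per core per clock under a simple pipe-count model.

    pipes         -- FP pipes available to the group of cores
    cores_sharing -- busy cores sharing those pipes
    ops_per_pipe  -- 1 for a plain add or mul, 2 for a fused multiply-add
    """
    return pipes * ops_per_pipe / cores_sharing


# K10: private FADD + FMUL pipes per core
k10 = peak_fp_ops_per_core(pipes=2, cores_sharing=1, ops_per_pipe=1)

# Bulldozer module, both cores busy, old (non-FMA) code
bd_legacy = peak_fp_ops_per_core(pipes=2, cores_sharing=2, ops_per_pipe=1)

# Bulldozer module, both cores busy, FMA code
bd_fma = peak_fp_ops_per_core(pipes=2, cores_sharing=2, ops_per_pipe=2)

# Bulldozer module, only one core running FP, old code
bd_single = peak_fp_ops_per_core(pipes=2, cores_sharing=1, ops_per_pipe=1)

print(k10, bd_legacy, bd_fma, bd_single)
```

In this toy model the shared unit only falls behind K10 when both cores of a module run non-FMA FP code at the same time; with a single FP-heavy thread per module, or with FMA code, it is at parity.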
 
Yes, but the diagrams don't indicate this. 2 alus, 1 store, 1 load unit. Compared to K8/K10, 3 alus, plus 1 shared load/store unit (which can do 2 ops per clock however). Maybe there's a mistake somewhere in those diagrams though.
The diagrams for Bulldozer do not indicate 2xINT+Load+Store, they indicate 4xINT.

Bobcat's diagram has the explicit load and store pipes. People are assuming the same applies to Bulldozer. AMD's statements, while not fully clear, seem to indicate more than just load and store can be done on the pipes.

I dunno, so far there's no indication really that sandy bridge will get any faster on the int side (it'll surely have beefed up float units, though I don't know if they'll actually be any faster when executing old code instead of avx).
Sandy Bridge would need to be at a complete standstill or regress. The rumored trace cache, which might be a further elaboration of the decoded instruction loop buffer in Nehalem, should help. Further general enhancements to scheduling logic like more reorder entries or load/store buffers seem likely and should deliver some incremental improvement.

Seems quite possible at least with FP code - K10 has 1 fadd + 1 fmul fp pipe, and bulldozer needs to share its 2 fmac pipes with 2 cores, so it could be quite a bit slower at least when not using fmac instructions.
Unless the unit has the necessary hardware and data paths to possibly separate the components of the FMAC units and use the FADD and FMUL hardware separately.
There are AMD patents that indicate the idea has at least been thought about.
 
They have done far more than just think about it. There is a document in this thread that shows they have built and tested an FMAC unit using 65nm K10 design libs where the FADD and FMUL in each FMAC unit can execute independently. This FMAC unit is around 50% larger than a regular FMAC unit and has identical FADD and FMUL perf to standalone units, but is slower for FMAC.


Also, given that int hardware takes so little space compared to floating point, I don't really understand how adding an extra int unit can add 50% space; that makes no sense. What makes more sense is that a Bulldozer module (int+FP) is 50% larger than a "regular" core, which, considering you're doubling your hardware execution units, seems very good to me, considering Sandy Bridge is still just a "traditional" run-of-the-mill CPU where each core's resources are very separate.


cheers


edit:

to quote John Fruehe

I guess the easiest way for me to address all of the folks that think that this is somehow a step back or that shared resources are somehow a limiting factor, let me just say this: Pound for pound, when comparing both integer and FP to a Magny Cours processor, there is no place where you will not see a significant improvement. Integer. FP. All of it will be significantly higher on Bulldozer.

For those trying to do the math in their heads, trying to triangulate the capabilities, there is a lot that we have not shared. You can't use the old world constraints to try to figure out the new architecture.


edit 2:

also from john

What I was talking about was the total size of the duplicated circuitry of integer unit relative to the rest of the die. In simple terms, if you pulled out the second dedicated integer core in each module, the die would shrink by ~5%. This has no bearing on our existing die, nor the competition.


edit3:

dresdenboy had some slides up about power management but they have been pulled for some reason :oops:
 
Also, given that int hardware takes so little space compared to floating point, I don't really understand how adding an extra int unit can add 50% space; that makes no sense.
The ALUs themselves are small, though that part should be 33% larger per core.
The scheduler, L1 data cache, and the Load/Store units would be duplicated as well, and those are very big.

What makes more sense is that a Bulldozer module (int+FP) is 50% larger than a "regular" core, which, considering you're doubling your hardware execution units, seems very good to me, considering Sandy Bridge is still just a "traditional" run-of-the-mill CPU where each core's resources are very separate.
That was what Anand wrote about, he just had it wrong by an order of magnitude.
 
The diagrams for Bulldozer do not indicate 2xINT+Load+Store, they indicate 4xINT.

Bobcat's diagram has the explicit load and store pipes. People are assuming the same applies to Bulldozer. AMD's statements, while not fully clear, seem to indicate more than just load and store can be done on the pipes.
Well, the diagrams just indicate 4x pipe, so it's really hard to say what they can do. You're right though, it should be different from Bobcat, otherwise this doesn't make sense (Bobcat is supposed to have an int core 90% as fast as K10, whereas Bulldozer should be 10-35% faster than K10). Hmm...
 
here:
20091202215451672.jpg


:LOL:
 
so 32nm and fusion in about 6 months? :oops:

I reckon 9-11 months; they always like to be (too) late for the Christmas shopping season (for example the 780G Neo II platform).

But nevertheless, with the 480 SPs and (hopefully) resolved bottleneck issues, the internal GPU should perform at near HD 4650 levels, perhaps 3-4x current IGP standards, wiping out basically everything low-end and actually selling K10...
 
found this link at http://citavia.blog.de/

The Bulldozer architecture can provide up to 80% greater expected throughput when running 2 threads simultaneously compared to a single thread running on a single integer core. Our engineers estimate that the amount of discrete circuitry that is added to each Bulldozer module in order to allow for a second integer thread to run only adds ~12% additional circuitry to each module, which translates into only ~5% of circuitry to the total Bulldozer die.

source: http://blogs.amd.com/work/2009/12/11/aiming-for-the-sweet-spot-in-2010-and-beyond/
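
A quick Python sketch of what the quoted blog numbers imply, if taken at face value. The 80%, ~12% and ~5% figures are from the quote; everything else is an assumption on my part, in particular that both percentages are measured against the same baseline (a die without the second integer cores) and that "circuitry" can be treated as a simple additive quantity. The variable names are made up for illustration.

```python
THROUGHPUT_GAIN = 0.80   # "up to 80%" with the second thread, from the quote
MODULE_OVERHEAD = 0.12   # extra circuitry per module, from the quote
DIE_OVERHEAD = 0.05      # same extra circuitry relative to the die, from the quote

# Throughput gained per unit of module circuitry, relative to a
# single-integer-core module (best case, since 80% is an "up to" figure).
module_throughput_per_circuitry = (1.0 + THROUGHPUT_GAIN) / (1.0 + MODULE_OVERHEAD)

# If the same added circuitry is 12% of the modules but only 5% of the die,
# the modules themselves would make up about this fraction of the die.
implied_module_share_of_die = DIE_OVERHEAD / MODULE_OVERHEAD

print(f"throughput per module circuitry: {module_throughput_per_circuitry:.2f}x")
print(f"implied module share of die:     {implied_module_share_of_die:.0%}")
```

Read this way, the modules would account for a bit over 40% of the die, with the rest presumably going to things like L3 cache, northbridge and I/O, though the quote does not say that.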
 