AMD Bulldozer Core Patent Diagrams

Oh, c'mon. How much area per core does Nehalem's SMT use? It should be really low (lower than 5%), since it only means duplicating some registers and adding a few more wires. If Intel's 5% area brings 20-50% perf, then what would be the big thing about AMD's 50% extra area (per core?) for 30-80% perf?? Of course it's a semi-accurate comparison, but I think it stands.


Anyway, Anand's 50% comes from his readers (mis)understanding, in my opinion, the same info commented on in the Bulldozer blog. Anand listens to their comments and corrects his article.
 
I think you will find that people think a module takes 50% more space than a regular single core.

a module being 2x 128-bit FMAC and 2x int
a core being 1x 128-bit FMAC and 1x int

I have followed John's posts a lot in regards to Bulldozer, and he has said several times that the floating point units in Bulldozer are massive, and that having a module configuration is as much about lowering power consumption as it is about perf/mm².

AMD are really pushing the 2P and 4P server markets with this chip, and once you factor in half-height blade enclosures and the fact that you can fit 3-4 of them in a 40-48 RU rack, heat becomes a massive issue in datacentres. From what I can see, Bulldozer is aiming for the same power footprint as Istanbul/Magny-Cours with more "cores" and higher perf per core.

16 cores x 2P x 16 blades x 3-4 enclosures = a massive amount of perf per rack compared to what can be had today, even if you assume current K10.5 levels of performance.

If only AMD had their own compiler to start taking advantage of their FMAC from day one.
 
Either way it seems AMD is going the opposite way compared to their older designs which emphasized FP performance, in my layman's understanding of it. Very eager to see the performance of it.
 
Either way it seems AMD is going the opposite way compared to their older designs which emphasized FP performance, in my layman's understanding of it. Very eager to see the performance of it.

That's because they expect floating point workloads to migrate to GPUs.
 
I'd like to see some sort of utilisation stats INT|FP or x86|SSEx for typical tasks ie Windows desktop stuff, Flash, browsers, videos, games etc.
It seems like AMD & Intel are taking wildly different directions on INT|FP ratio, with i7 having 3*128bit FP|core vs Bulldozer heading for 1*128bit/0.5*256bit FP|core :oops:

I guess if you take HT into account it's less dramatic: 1.5*128bit FP|thread vs 1*128bit FP|thread.
But then the INT|thread becomes massively in AMD's favor, with 1.5 ALUs|thread vs 4 ALUs|thread :???:

Someone has to be very wrong here. Or at least targeting a very different market.

That's because they expect floating point workloads to migrate to GPUs.
This could be the cause of the difference. Though if so, why bother to make a supposedly very large 256bit FP unit at all rather than staying with 128bit?
 
I'd like to see some sort of utilisation stats INT|FP or x86|SSEx for typical tasks ie Windows desktop stuff, Flash, browsers, videos, games etc.

For the desktop stuff, flash, browsers, it's almost fully integer based stuff.

It seems like AMD & Intel are taking wildly different directions on INT|FP ratio, with i7 having 3*128bit FP|core vs Bulldozer heading for 1*128bit/0.5*256bit FP|core :oops:

I thought i7 had one 128 bit sse unit per core.
 
This could be the cause of the difference. Though if so, why bother to make a supposedly very large 256bit FP unit at all rather than staying with 128bit?

This FPU seems like something modular, shared by a pair of "cores", rather than something integrated into the core itself. For now, they'll need at least 256-bit width to be competitive with Sandy Bridge in executing next-gen AVX-style instructions. In the future, they'll probably replace it with a block of stream processors that can be freely used as CPU or GPU resources.
 
For the desktop stuff, flash, browsers, it's almost fully integer based stuff.
Sure about flash (video)?

I thought i7 had one 128 bit sse unit per core.
Depends on how you count them. But yes, this is usually referred to as 3 units: 1 mul unit, 1 add unit, 1 mov unit. Needless to say, a mov unit isn't exactly powerful. And if you compare that to BD, where both units can do FMAC, it gets a bit complicated, though as was mentioned it may be possible the FMAC units can be split, which would basically double the unit count.
Though really this needs to be compared to Sandy Bridge, I guess, which supposedly has twice as wide units, though it's unclear to me yet whether these are actually physically twice as wide, and if so, what happens with "old" SSE code (does half the unit just idle or not?).
 
Ok, that sounds more likely than what I was getting out of that diagram I posted earlier :)

Seemed unlikely but nobody complained about my reading of 3*128bit before :eek:
 
Sure about flash (video)?

It's just video decode, right? Except possibly in scaling frames, why should it have any floating point business?


Depends on how you count them. But yes, this is usually referred to as 3 units: 1 mul unit, 1 add unit, 1 mov unit. Needless to say, a mov unit isn't exactly powerful. And if you compare that to BD, where both units can do FMAC, it gets a bit complicated, though as was mentioned it may be possible the FMAC units can be split, which would basically double the unit count.

No I don't think fmac can be split. And I count mov+add+mul as one unit.

Though really this needs to be compared to Sandy Bridge, I guess, which supposedly has twice as wide units, though it's unclear to me yet whether these are actually physically twice as wide, and if so, what happens with "old" SSE code (does half the unit just idle or not?).
There are 2 128 bit units to take care of old sse code.
 
It's just video decode, right? Except possibly in scaling frames, why should it have any floating point business?




No I don't think fmac can be split. And I count mov+add+mul as one unit.


There are 2 128 bit units to take care of old sse code.

Yes, FMAC can be split; it's been linked in this thread:
http://users.ece.utexas.edu/~quinnell/Research/Bridged Floating-Point Fused Multiply-Add Design.pdf

it appears that the paper has been pulled?

Basically, they (two AMD engineers plus another researcher) designed and implemented an FMAC bridge using the 65nm K10 design library and had it running at around 700 MHz. They then tested FMAC performance, and straight add and mul performance, against discrete units and a straight FMAC unit. The add and mul performance was identical to separate units, but the FMAC took what looked to be about a 30% performance hit. The bridged FMAC unit was 50% larger than a "traditional" FMAC.

They also explained that running just an add or a mul through an FMAC unit is much slower than doing a full FMAC, so if they don't do a bridge, they either have extra adds and muls or they take a big "traditional" floating point hit.

It's just a question of whether they have run with the design; the proof of concept is more than there.


edit: this post here has some concept details:
http://citavia.blog.de/2009/11/23/some-additional-bits-of-information-7441398/
 
they also explained that running just an add or a mul through an FMAC unit is much slower than doing a full FMAC, so if they don't do a bridge, they either have extra adds and muls or they take a big "traditional" floating point hit.

This makes zero sense. Adds or muls can never be slower than an FMAC in an FMAC unit - they are just special cases of the ordinary operation:
fmac: a+b*c
add: a+b*1
mul: 0+b*c
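Those identities are easy to sanity-check in code. A minimal sketch (emulated with plain float ops, so unlike a real FMAC it rounds twice rather than once; the helper names are mine, purely illustrative):

```python
def fmac(a, b, c):
    # emulated multiply-add: a + b*c (a hardware FMAC would round only once)
    return a + b * c

def fp_add(a, b):
    return fmac(a, b, 1.0)   # add as the special case a + b*1

def fp_mul(b, c):
    return fmac(0.0, b, c)   # mul as the special case 0 + b*c
```

(Signed zeros are a corner case: 0 + (-0.0)*x gives +0.0 where a true mul gives -0.0, one reason real hardware special-cases these paths.)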

Cheers
 
Two things the way I understand it:
1: Latency
Going all the way through an FMAC unit for an ADD means traversing more pipeline stages than a dedicated ADD unit would take.

2: Scheduling/utilisation efficiency
If you need A+B and C*D, doing A+B*1 then 0+C*D on a single FMAC unit takes longer than doing A+B in parallel with C*D using the bridged FMAC's ADD & MUL units in independent mode.

The price is that (A+B)*C with the bridged FMAC is slower than the dedicated FMAC (but faster than completely separate ADD & MUL units).
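That scheduling difference can be sketched with a toy cycle count. The latencies below are invented purely for illustration (not real Bulldozer or K10 numbers):

```python
# Invented pipeline latencies, for illustration only
ADD_LAT, MUL_LAT, FMAC_LAT = 3, 4, 5

def single_fmac_pipe():
    """A+B then 0+C*D serialized through one pipelined FMAC, 1 issue/cycle:
    the second op issues on cycle 1 and completes FMAC_LAT cycles later."""
    return 1 + FMAC_LAT

def bridged_independent():
    """Bridged FMAC in independent mode: the ADD and MUL halves execute
    A+B and C*D in parallel, so we only wait for the slower half."""
    return max(ADD_LAT, MUL_LAT)
```

With these numbers the bridged unit has both results ready on cycle 4 vs cycle 6, and the gap grows with longer chains of independent adds and muls.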
 
taken from the document that i linked that is now back up:

The use of a fused multiply-add unit in place of a floating-point adder and floating-point multiplier has yet another drawback. Due to their large area and power consumption, implemented fused multiply-add blocks typically replace the floating-point adder and floating-point multiplier entirely. This replacement removes the ability to have floating-point add and multiply instructions execute independently in different parallel units. For code that needs strings of floating-point adds and multiplies executed independently, the use of a fused multiply-add unit will reduce the throughput by 30% to 75%.
 
taken from the document that i linked that is now back up:

Normally fused multiply-add is used for instruction-issue bandwidth reasons.


Being able to co-issue muls and adds to a single FMAC unit is completely pointless AFAICS.

A fmadd has 3 operands and produces a single result; a mul and an add have 4 operands and produce two results in total. So not only would you need logic to separate the add and the mul, you'd also need more bandwidth going into and out of the unit.

It would be much easier to just add a separate add unit - adders are a fraction of multipliers in terms of size and power. Run madds and muls through your fused multiply-adder and run pure adds through the adder.
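The operand-count argument can be made concrete with some simple bookkeeping (illustrative counting only, not a model of any real register file):

```python
def ports(ops):
    """Total register-file reads/writes needed to issue `ops` together,
    given (reads, writes) tuples per operation."""
    return (sum(r for r, w in ops), sum(w for r, w in ops))

FMADD = (3, 1)  # d = a + b*c : three source reads, one result write
MUL   = (2, 1)  # e = a*b
ADD   = (2, 1)  # f = c + d
```

Co-issuing an independent mul and add through one unit needs 4 reads and 2 writes per cycle, versus 3 reads and 1 write for a single fused op - that's the extra bandwidth into and out of the unit.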

Cheers
 
Just finished reading that paper :)
It's mostly in simple language and not too long; worth reading (or at least the tables in the Results section).

Bridged FMA keeps old code running at full speed with the same power usage, while still giving some performance boost to FMA-coded software, at the expense of taking more area and using quite a lot more power when doing the FMA operation.

In combination with upgrading to 256-bit width and offloading a lot of FP stuff like video decoding to the GPU, that certainly would push strongly towards fewer, higher-utilisation FP units like the Bulldozer setup :D
 
It's just video decode, right? Except possibly in scaling frames, why should it have any floating point business?
Pretty sure that at least for some video formats, things like the IDCT are usually done with float arithmetic. Given the general slowness of Flash video, I wouldn't be surprised if it does everything with traditional x87 code :).
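To illustrate why the IDCT is naturally a float computation, here's a naive orthonormal 8-point inverse DCT of the kind MPEG-era codecs apply to each block row/column - a reference sketch only; real decoders use fixed-point or SIMD versions:

```python
import math

def idct8(X):
    """Naive 8-point inverse DCT (orthonormal DCT-III), all float arithmetic."""
    N = len(X)
    out = []
    for n in range(N):
        s = 0.0
        for k in range(N):
            ck = math.sqrt(0.5) if k == 0 else 1.0
            s += ck * X[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
        out.append(math.sqrt(2.0 / N) * s)
    return out
```

Feeding it a pure-DC coefficient block yields a flat row of samples, which is a handy correctness check.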
 