While it hasn't been emphasized, I do not think many were under the impression that the MMX units in the Bulldozer diagrams were merely MMX. I think it was commonly assumed they would include the integer vector ops.
That is 2 FP vector pipes and 2 INT vector pipes.
JFAMD's listing does simplify certain complications, and it leaves some things unclear. The integer MAC operation has to go down the FMAC pipe. Where do shuffles go?
I am curious about the 256b INT case. Is this accurate? Is this an XOP peculiarity, since AVX does not provision for 256b integer ops?
Move elimination sounds interesting. It may have been enabled by the rearchitecting of the scheduler, and could mean moves resolve to register renames further up the pipeline. If so, Bulldozer may have an expanded set of operations that do not take up execution slots compared to other chips (beyond FXCH, register stack engine ops, etc.), but on this I am not sure.
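To illustrate the general idea, here is a minimal, hypothetical sketch of move elimination at the rename stage. This is not based on any disclosed BD internals; it just shows how a reg-reg move can be resolved by aliasing the rename table entry instead of consuming an execution slot.

```python
# Hypothetical sketch of rename-stage move elimination.
# Not Bulldozer's actual mechanism; just the general concept.

class Renamer:
    def __init__(self):
        self.rat = {}         # architectural reg -> physical reg
        self.next_phys = 0
        self.issued_uops = 0  # ops that actually consume an execution slot

    def write(self, arch_reg):
        """A producing op allocates a fresh physical register and issues."""
        self.rat[arch_reg] = self.next_phys
        self.next_phys += 1
        self.issued_uops += 1

    def mov(self, dst, src):
        """Reg-reg move: alias the mapping, no execution slot used."""
        self.rat[dst] = self.rat[src]

r = Renamer()
r.write("xmm0")        # e.g. addps xmm0, ... -> needs a pipe
r.mov("xmm1", "xmm0")  # movaps xmm1, xmm0 -> eliminated at rename
print(r.issued_uops)   # 1: the move consumed no execution slot
print(r.rat["xmm1"] == r.rat["xmm0"])  # True: same physical register
```

A real implementation would also need reference counting so the shared physical register is only freed once no architectural register maps to it.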
BD has decent width in an INT/FP SSE case, though the load and store restrictions do limit things somewhat. A BD2 with expanded L/S capability would be more able to sustain its throughput in optimized routines that are typically L/S limited.
I'm not sure if SB has the hardware to match all or part of BD's move elimination, so this could be an advantage in 128-bit code (JFAMD stated this was done for 128-bit ops).
SB, by comparison, has a different loadout.
I am glossing over some areas where I am not certain how BD and SB can be equivalently compared.
Roughly, it is as follows.
3x 128b FP or INT (there are complications; the rough mix is 1 ADD, 1 MUL, 1 other, such as a shuffle).
With 256-bit AVX, it is 3x256 FP (same mix as above, but the FP part repurposes the INT path). AVX doesn't provide for 256-wide integer ops, so I am curious about the AMD listing. In SB, the registers are 256b for integer as well, but it may take shuffling things around to reach all parts of the extended register when using it for integer purposes.
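As a back-of-envelope comparison of the peak math rates implied above, assuming BD's two 128b FMAC pipes each do a fused multiply-add (2 FLOPs per lane) and SB issues one 256b ADD plus one 256b MUL per cycle (the third port handling shuffles and other non-FLOP work), the single-precision peaks come out even:

```python
# Rough peak single-precision FLOPs/cycle. Assumptions (mine, not from
# the listings above): BD module = 2x 128b FMA pipes; SB = 1x 256b ADD
# + 1x 256b MUL per cycle, third port contributing no FLOPs.

SP_LANES_128 = 128 // 32   # 4 single-precision lanes per 128b pipe
SP_LANES_256 = 256 // 32   # 8 lanes per 256b op

bd_flops = 2 * SP_LANES_128 * 2               # 2 pipes x 4 lanes x 2 (FMA)
sb_flops = SP_LANES_256 + SP_LANES_256        # ADD port + MUL port

print(bd_flops)  # 16
print(sb_flops)  # 16
```

So on paper the per-cycle peaks match; the difference comes down to how well real instruction mixes map onto FMA vs separate ADD/MUL ports.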
BD has a possible throughput advantage in 128-bit math when it comes to loads and stores done in the multi-threaded case. With 2 AGUs, one thread can only manage 2 memory ops, but with two cores the FP unit can do 2 loads and 1 store.
SB can only issue 2 loads or 1 load+1 store with two AGUs. In the 256-bit case, SB can issue 2 loads and 1 store, because the ports take twice as long to load, but not twice as long to calculate addresses.
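Turning the figures above into bytes per cycle (these are the numbers from this post, not measurements) gives a rough sense of the sustained data rates:

```python
# Rough memory bandwidth per cycle, in bytes, from the figures above.

B128 = 16  # bytes per 128b op

# BD module, both threads active: shared FP unit sees 2 loads + 1 store
bd_load_bytes  = 2 * B128   # 32 B/cycle of load data
bd_store_bytes = 1 * B128   # 16 B/cycle of store data

# SB, 128b code: 2 AGUs -> 2 loads, or 1 load + 1 store
sb128_all_load = 2 * B128             # 32 B/cycle, load-only pattern
sb128_mixed    = (1 * B128, 1 * B128) # 16 B load + 16 B store

# SB, 256b code: each 256b access occupies a 128b port for 2 cycles,
# but the address is generated only once, so 2 loads + 1 store sustain
sb256_load_per_cycle  = 2 * 32 / 2    # 32 B/cycle of load data
sb256_store_per_cycle = 1 * 32 / 2    # 16 B/cycle of store data

print(bd_load_bytes, bd_store_bytes)                  # 32 16
print(sb256_load_per_cycle, sb256_store_per_cycle)    # 32.0 16.0
```

The interesting case is the 128b mixed load/store pattern, where the BD module (with both threads active) matches what SB only reaches with 256b ops.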
The gist of this is that BD may offer higher utilization in a multithreaded 128-bit case with a good mix of integer and FP vector ops. SB's focus seems stronger in the single-threaded case with a more focused mix, particularly with AVX.