AMD Bulldozer Core Patent Diagrams

While it hasn't been emphasized, I do not think many were under the impression that the MMX units in the Bulldozer diagrams were merely MMX. I think it was commonly assumed they would include the integer vector ops.

That is 2 FP vector pipes and 2 INT vector pipes.

JFAMD's listing does simplify certain complications and it doesn't clarify some things. The integer MAC operation has to go down the FMAC pipe. Where do shuffles go?

I am curious about the 256b INT case. Is this accurate? Is this by an XOP peculiarity since AVX does not provision for 256b integer ops?

Move elimination sounds interesting, it may have been enabled by the rearchitecting of the scheduler and could mean moves resolve to register renames further up in the pipeline. This may mean Bulldozer has an expanded number of operations that do not take up execution slots compared to other chips (beyond FXCH, register stack engine ops, etc.), but on this I am not sure.

BD has decent width in an INT/FP SSE case, though the load and store restrictions do limit things somewhat. A BD2 with expanded L/S capability would be more able to sustain its throughput in optimized routines that are typically L/S limited.
I'm not sure if SB has the hardware to match all or part of BD's move elimination, so this could be an advantage in 128-bit code (JFAMD stating this was done for 128 bit ops).

SB, by comparison, has a different loadout.
I am glossing over some areas where I am not certain how BD and SB can be equivalently compared.
Roughly, it is as follows.

3x 128 FP or INT (there are complications; the rough mix is 1 ADD, 1 MUL, 1 Other, such as a shuffle).
With 256-bit AVX, it is 3x256 FP (same mix as above, but the FP part repurposes the INT path). On the INT side, AVX doesn't provide for 256-bit-wide integer ops, so I am curious about the AMD listing. In SB, the registers are 256b for INT, but it may take shuffling things around to get to all parts of the extended register if using it for integer purposes.

BD has a possible throughput advantage in 128-bit math when it comes to loads and stores done in the multi-threaded case. With 2 AGUs, one thread can only manage 2 memory ops, but with two cores the FP unit can do 2 loads and 1 store.
SB can only issue 2 loads or 1 load+1 store with two AGUs. In the 256-bit case, SB can issue 2 loads and 1 store, because the ports take twice as long to load, but not twice as long to calculate addresses.

The gist of this is that BD may offer higher utilization in a multithreaded 128-bit case with a good mix of integer and FP vector ops. SB's focus seems stronger in the single-threaded case with a more focused mix, particularly with AVX.
 
Move elimination sounds interesting, it may have been enabled by the rearchitecting of the scheduler and could mean moves resolve to register renames further up in the pipeline. This may mean Bulldozer has an expanded number of operations that do not take up execution slots compared to other chips (beyond FXCH, register stack engine ops, etc.), but on this I am not sure.

With a physical register file, move elimination is possible in the rename stage. The destination of the instruction just copies the register pointer of the source, so that both point to the same register file entry. It saves register file entries too, since no real result is produced by the register-register move.
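To make the pointer-copy idea concrete, here is a rough toy model of a renamer in Python. It is only a sketch under my own assumptions: the class name, the reference counting, and the immediate release of old mappings are illustrative simplifications (real hardware frees registers at retirement), and none of it is taken from AMD or Intel documentation.

```python
class Renamer:
    """Toy register renamer backed by a physical register file (PRF).

    Hypothetical model for illustration only; not any vendor's design.
    """

    def __init__(self, arch_regs, num_phys):
        self.free = list(range(num_phys))   # free physical register ids
        self.refcount = [0] * num_phys      # architectural names per physical reg
        self.map = {}                       # architectural name -> physical id
        self.executed = 0                   # ops that needed an execution slot
        for reg in arch_regs:
            self.map[reg] = self._alloc()

    def _alloc(self):
        phys = self.free.pop(0)
        self.refcount[phys] = 1
        return phys

    def _release(self, phys):
        # Simplification: release immediately instead of at retirement.
        self.refcount[phys] -= 1
        if self.refcount[phys] == 0:
            self.free.append(phys)

    def rename_op(self, dst, srcs):
        """A real ALU op: takes a fresh physical register and an execution slot."""
        old = self.map[dst]
        new = self._alloc()
        self.map[dst] = new
        self._release(old)
        self.executed += 1
        return new

    def rename_move(self, dst, src):
        """Register-register move: dst simply aliases src's physical register."""
        old = self.map[dst]
        phys = self.map[src]
        self.refcount[phys] += 1
        self.map[dst] = phys
        self._release(old)
        return phys


r = Renamer(["xmm0", "xmm1", "xmm2"], num_phys=8)
r.rename_move("xmm1", "xmm0")            # no execution slot, no new PRF entry
r.rename_op("xmm0", ["xmm1", "xmm2"])    # xmm0 gets a new entry, xmm1 keeps the old value
```

After the move, xmm0 and xmm1 share one physical entry and the execution counter has not moved. The later write to xmm0 allocates a fresh entry, so xmm1 still sees the old value.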

Sandy Bridge does the same AFAIK.

Cheers
 
Opteron's FPU has a physical register file. Perhaps it is a limitation of the front-end or scheduler in K8 that it isn't a feature there, or JFAMD isn't knowledgeable enough about the particulars of the older chips to know that this isn't quite so new.

JFAMD's statement only applied to SSE moves, but it seems like it could also apply to the integer side, which would be new with BD.

EDIT:
I checked the diagram at RWT for SB, and in an interesting aside, it seems the 128-bit MOV unit in Nehalem is no longer listed for SB, hinting at it not being necessary.

BD does have a few weird kinks in it.
The INT MAC instruction has a throughput hit on the FP side since it blocks one FMAC, and since there is only one IMAC, it has half the throughput. This may require care in its use, and care in a multithreaded case.
The store pipe being one of the MMX pipes may also be annoying, possibly more so with two threads.
 
While it hasn't been emphasized, I do not think many were under the impression that the MMX units in the Bulldozer diagrams were merely MMX. I think it was commonly assumed they would include the integer vector ops.
Yes, given that SSE integer ops are MMX widened to 128 bits (for the most part), I don't think this should be too surprising. It really makes no sense to put plain MMX anywhere.
I am curious about the 256b INT case. Is this accurate? Is this by an XOP peculiarity since AVX does not provision for 256b integer ops?
Err.. I thought all AVX ops could be brought to 256/512 bit at a later time?
At least integer loads/stores can already be 256-bit AFAIK, which could help out if the FPU L/S pipes are busy.
 
Vector integer ops have not yet been promoted to 256b width. I think the shuffle operations that allow you to take the high half of a 256b register down to the lower half to perform a calculation have been promoted.
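As a rough illustration of that workaround, here is a Python sketch that models a 256b register as a list of eight 32-bit lanes and does an integer add by splitting it into 128b halves. The helper names are hypothetical. The comments map them loosely onto the AVX VEXTRACTF128 / VINSERTF128 instructions and the 128-bit PADDD, and say nothing about actual latency or port usage.

```python
MASK32 = 0xFFFFFFFF  # each lane is an unsigned 32-bit integer


def extract_low128(ymm):
    """Lower four lanes (the XMM view of the register)."""
    return ymm[:4]


def extract_high128(ymm):
    """Upper four lanes; roughly what VEXTRACTF128 with imm=1 provides."""
    return ymm[4:]


def paddd128(a, b):
    """Model of a 128-bit PADDD: four independent wraparound 32-bit adds."""
    return [(x + y) & MASK32 for x, y in zip(a, b)]


def add_epi32_256(a, b):
    """256-bit integer add built from two 128-bit integer adds."""
    lo = paddd128(extract_low128(a), extract_low128(b))
    hi = paddd128(extract_high128(a), extract_high128(b))
    return lo + hi  # recombining the halves, roughly VINSERTF128


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10] * 8
result = add_epi32_256(a, b)
```

The point is just that each half goes through a 128b integer op, with extra extract/insert work around it that a native 256b integer instruction would avoid.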

I'm not sure what you mean about integer L/S, both data types go through the exact same L/S units.
 
Vector integer ops have not yet been promoted to 256b width. I think the shuffle operations that allow you to take the high half of a 256b register down to the lower half to perform a calculation have been promoted.
OK, I thought a good part of the simpler non-FPU operations were already 256-bit (bit ops etc.).

I'm not sure what you mean about integer L/S, both data types go through the exact same L/S units.
VMOVDQA/VMOVDQU. The FP load instructions need to handle denormals based on the mode so I figured these could be launched from the simpler INT pipe.
 
I checked the instruction listing at Agner Fog's site, and I think the move instructions that seem to be closest to your example don't differentiate data type within an XMM register.

The actual Load/Store part of instruction execution goes through ports 2 and 3, regardless of data type.
As far as skipping denormals or whatever else, it looks like INT AVX and FP AVX moves have the same latency for a load from memory, 1 cycle longer than a standard move on the integer pipeline.

The determining factor seems to be the destination of the load, and the design does not seem to provide a special path for integer data for the AVX register, if I interpreted the table correctly.
 
hehe, I hope BattleField 3 makes full use of those 8 juicy cores :cool: If they price this one right, there will be a flood of upgrading come summertime.
 
If there is any truth to those images the brand name will be FX.
That'd fit with previous rumors of FX branding being dusted off but an interesting twist.
 
only if you're the current owner of a cannabis husband, if we're to believe the article :)

http://event.asus.com/2011/mb/AM3_PLUS_Ready/

ASUS, the worldwide leader in motherboard design and sales, today announced the release of the industry's first AM3+ CPU enabled motherboard solution based on the existing AMD 8-Series Chipsets. Current owners of an AM3-based board* will make their AMD 8-Series motherboards compatible with the latest AM3+ CPUs with a simple BIOS** update from the official ASUS website.




ASUS provides options for users to be the first to enjoy AM3+

[Crosshair IV Extreme and M4A89TD PRO/USB3 support AMD AM3+] AMD’s new AM3+ CPU is a complete microarchitecture redesign from previous AMD CPUs, and offers better performance over the previous generation. As such, ASUS is committed to provide continuous support for the latest technologies, and is the first to market for a product solution for user’s needs. Current ASUS 890FX and 890GX series motherboards can be upgraded to enjoy the extra performance offered by future AM3+ CPUs. ASUS will also be releasing*** the AMD 8-Series Chipset motherboards based on 880G and 870 as well as the 760G Chipset on the AM3+ socket for increased selection so users can enjoy AM3 and AM3+ CPUs.

I'd like to see more info on this. For example, is it only for specific versions of the boards that are fitted with the newer "grey" sockets, and what limitations will you have to expect compared with a BD in a 9xx series mobo?

MSI are also claiming compatibility with AM3+ btw
 
Just 8 series chipset from what I can see swaaye. Just wish they'd hurry up & release the damned things, mobos & BD!

Have to say that the return of "FX" worries me tho. Might mean it's toooo pricey for Wifey which will mean I either don't get a new system. ...... or I get a divorce .... ;) :LOL:

EDIT: Plus 760 chipset which I assume would be for OEMs to con ... I mean, offer the latest CPUs in "affordable" systems to fools ... I mean, loyal customers.
 