While it hasn't been emphasized, I do not think many were under the impression that the MMX units in the Bulldozer diagrams were merely MMX. I think it was commonly assumed they would include the integer vector ops.
That is 2 FP vector pipes and 2 INT vector pipes.
JFAMD's listing does simplify certain complications, and it leaves some things unclear. The integer MAC operation has to go down the FMAC pipe. Where do shuffles go?
I am curious about the 256b INT case. Is this accurate? Is this an XOP peculiarity, since AVX does not provision for 256b integer ops?
Move elimination sounds interesting. It may have been enabled by the rearchitecting of the scheduler, and could mean moves resolve to register renames further up the pipeline. If so, Bulldozer may have an expanded set of operations that do not take up execution slots compared to other chips (beyond FXCH, register stack engine ops, etc.), but on this I am not sure.
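To illustrate the general idea, here is a minimal, hypothetical sketch of move elimination at the rename stage. This is not based on any disclosed BD internals; it just shows how a reg-reg move can be resolved by aliasing the rename table entry instead of consuming an execution slot.

```python
# Hypothetical sketch of rename-stage move elimination.
# Not Bulldozer's actual mechanism; just the general concept.

class Renamer:
    def __init__(self):
        self.rat = {}         # architectural reg -> physical reg
        self.next_phys = 0
        self.issued_uops = 0  # ops that actually consume an execution slot

    def write(self, arch_reg):
        """A producing op allocates a fresh physical register and issues."""
        self.rat[arch_reg] = self.next_phys
        self.next_phys += 1
        self.issued_uops += 1

    def mov(self, dst, src):
        """Reg-reg move: alias the mapping, no execution slot used."""
        self.rat[dst] = self.rat[src]

r = Renamer()
r.write("xmm0")        # e.g. addps xmm0, ... -> needs a pipe
r.mov("xmm1", "xmm0")  # movaps xmm1, xmm0 -> eliminated at rename
print(r.issued_uops)   # 1: the move consumed no execution slot
print(r.rat["xmm1"] == r.rat["xmm0"])  # True: same physical register
```

A real implementation would also need reference counting so the shared physical register is only freed once no architectural register maps to it.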
BD has decent width in an INT/FP SSE case, though the load and store restrictions do limit things somewhat. A BD2 with expanded L/S capability would be more able to sustain its throughput in optimized routines that are typically L/S limited.
I'm not sure if SB has the hardware to match all or part of BD's move elimination, so this could be an advantage in 128-bit code (JFAMD stated this was done for 128-bit ops).
SB, by comparison, has a different loadout.
I am glossing over some areas where I am not certain how BD and SB can be equivalently compared.
Roughly, it is as follows.
3x 128b FP or INT (there are complications; the rough mix is 1 ADD, 1 MUL, 1 other, such as a shuffle).
With 256-bit AVX, it is 3x256 FP (same mix as above, but the FP part repurposes the INT path). AVX doesn't provide for 256-wide integer ops, so I am curious about the AMD listing. In SB, the registers are 256b for integer as well, but it may take shuffling things around to reach all parts of the extended register when using it for integer purposes.
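As a back-of-envelope comparison of the peak math rates implied above, assuming BD's two 128b FMAC pipes each do a fused multiply-add (2 FLOPs per lane) and SB issues one 256b ADD plus one 256b MUL per cycle (the third port handling shuffles and other non-FLOP work), the single-precision peaks come out even:

```python
# Rough peak single-precision FLOPs/cycle. Assumptions (mine, not from
# the listings above): BD module = 2x 128b FMA pipes; SB = 1x 256b ADD
# + 1x 256b MUL per cycle, third port contributing no FLOPs.

SP_LANES_128 = 128 // 32   # 4 single-precision lanes per 128b pipe
SP_LANES_256 = 256 // 32   # 8 lanes per 256b op

bd_flops = 2 * SP_LANES_128 * 2               # 2 pipes x 4 lanes x 2 (FMA)
sb_flops = SP_LANES_256 + SP_LANES_256        # ADD port + MUL port

print(bd_flops)  # 16
print(sb_flops)  # 16
```

So on paper the per-cycle peaks match; the difference comes down to how well real instruction mixes map onto FMA vs separate ADD/MUL ports.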
BD has a possible throughput advantage in 128-bit math when it comes to loads and stores done in the multi-threaded case. With 2 AGUs, one thread can only manage 2 memory ops, but with two cores the FP unit can do 2 loads and 1 store.
SB can only issue 2 loads or 1 load+1 store with two AGUs. In the 256-bit case, SB can issue 2 loads and 1 store, because the ports take twice as long to load, but not twice as long to calculate addresses.
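Turning the figures above into bytes per cycle (these are the numbers from this post, not measurements) gives a rough sense of the sustained data rates:

```python
# Rough memory bandwidth per cycle, in bytes, from the figures above.

B128 = 16  # bytes per 128b op

# BD module, both threads active: shared FP unit sees 2 loads + 1 store
bd_load_bytes  = 2 * B128   # 32 B/cycle of load data
bd_store_bytes = 1 * B128   # 16 B/cycle of store data

# SB, 128b code: 2 AGUs -> 2 loads, or 1 load + 1 store
sb128_all_load = 2 * B128             # 32 B/cycle, load-only pattern
sb128_mixed    = (1 * B128, 1 * B128) # 16 B load + 16 B store

# SB, 256b code: each 256b access occupies a 128b port for 2 cycles,
# but the address is generated only once, so 2 loads + 1 store sustain
sb256_load_per_cycle  = 2 * 32 / 2    # 32 B/cycle of load data
sb256_store_per_cycle = 1 * 32 / 2    # 16 B/cycle of store data

print(bd_load_bytes, bd_store_bytes)                  # 32 16
print(sb256_load_per_cycle, sb256_store_per_cycle)    # 32.0 16.0
```

The interesting case is the 128b mixed load/store pattern, where the BD module (with both threads active) matches what SB only reaches with 256b ops.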
The gist of this is that BD may offer higher utilization in a multithreaded 128-bit case with a good mix of integer and FP vector ops. SB's focus seems stronger in the single-threaded case with a more focused mix, particularly with AVX.