Also, no program will have 100% of its instructions be FPU/SIMD instructions, since you have to calculate addresses (array indices, pointers, etc.), do logic operations, load and store data, and branch.
BD has tradeoffs in this area as well. Address calculation is better in the multithreaded case, but in the single-threaded case it is equivalent to one SB core and weaker in peak than a K8 (various other inflexibilities in that core aside).
Branching, permutation, and data load/store are a mixed bag.
There's one front-end, so branching is not better than a hypothetical 2 core solution. There is only one pipe for shuffle operations, so this is a step down, although the pipe does have a more capable permute instruction. In code that uses shuffles and blends, particularly if it targets the multiple shuffle units of Intel architectures, this could penalize BD.
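As a rough way to see why a single shuffle pipe can hurt, here is a toy port-pressure bound. The pipe counts and the op mix are made-up numbers for illustration, not measured figures for BD or any Intel core, and the model ignores latency, dependencies, and front-end limits.

```python
from math import ceil

def min_cycles(shuffle_ops, other_ops, shuffle_pipes, total_pipes):
    """Lower bound on cycles for a block of vector ops.

    Toy model: shuffles may only issue on `shuffle_pipes` of the
    `total_pipes` vector pipes; all other ops may issue on any pipe.
    Both pipe counts are hypothetical parameters.
    """
    shuffle_bound = ceil(shuffle_ops / shuffle_pipes)
    issue_bound = ceil((shuffle_ops + other_ops) / total_pipes)
    return max(shuffle_bound, issue_bound)

# Hypothetical shuffle-heavy kernel: 12 shuffles/blends, 8 other vector ops.
one_pipe = min_cycles(12, 8, shuffle_pipes=1, total_pipes=4)
two_pipes = min_cycles(12, 8, shuffle_pipes=2, total_pipes=4)
print(one_pipe, two_pipes)  # 12 6
```

With this mix the shuffles set the bound in both cases, so halving the shuffle pipes doubles the minimum cycle count. With few shuffles the total issue width dominates and the difference disappears.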
The load/store bandwidth is that of a single core.
Assuming other games have similar FPU/SIMD usage patterns, it would be safe to say that sharing the floating point coprocessor between two cores shouldn't reduce the Bulldozer gaming performance at all compared to having a separate floating point unit in each core...
Two separate cores could branch better, shuffle better, and read/write better, depending on the mix.
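To put a rough number on "the mix", here is a hand count for one iteration of a naive scalar y[i] = a*x[i] + y[i] loop. The per-iteration counts are my own assumption for an unoptimized, non-unrolled loop with the address math folded into the loads and the store; a real compiler will produce something different.

```python
# Hand-counted ops for one iteration of: y[i] = a * x[i] + y[i]
# (assumed: scalar code, no unrolling, address math folded into load/store)
ops_per_iteration = {
    "load": 2,    # x[i] and y[i]
    "store": 1,   # y[i]
    "fp": 2,      # one multiply, one add
    "int": 2,     # i += 1, compare i against n
    "branch": 1,  # conditional jump back to the loop top
}

def share(ops, kind):
    """Fraction of all ops in the iteration that are of the given kind."""
    return ops[kind] / sum(ops.values())

fp_share = share(ops_per_iteration, "fp")
print(f"FP share of the loop: {fp_share:.0%}")  # FP share of the loop: 25%
```

Even in a loop that exists only to do floating point math, three quarters of the ops in this count land on the load/store, integer, and branch resources.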
Actually it might be the opposite. The Bulldozer FPU/SIMD coprocessor has improved a lot compared to AMD's previous designs, and those weren't that bad either. The optimization guide states that the Bulldozer FPU/SIMD coprocessor has four times the throughput of the previous AMD architecture.
Is half of that based on using FMA?
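For reference, this is the usual way peak numbers get counted, and how FMA alone doubles them. The pipe count and vector width below are placeholder parameters I picked for the arithmetic, not a claim about where the guide's figure comes from.

```python
def peak_flops_per_cycle(pipes, vector_bits, element_bits, fma):
    """Peak FLOPs per cycle for a set of identical FP pipes.

    An FMA is conventionally counted as 2 FLOPs (multiply + add),
    a plain add or multiply as 1.
    """
    lanes = vector_bits // element_bits
    flops_per_lane = 2 if fma else 1
    return pipes * lanes * flops_per_lane

# Placeholder configuration: 2 pipes, 128-bit vectors, 32-bit floats.
without_fma = peak_flops_per_cycle(2, 128, 32, fma=False)
with_fma = peak_flops_per_cycle(2, 128, 32, fma=True)
print(without_fma, with_fma)  # 8 16
```

The same hardware gets credited with twice the peak once FMA is counted, and code that cannot fuse its multiplies and adds never sees that half.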
The processor has improved LHS stall reduction systems, improved memory/cache systems (to reduce cache stalls),
Some things are better, some less so.
The cache is more scalable with regards to clock and power, which is likely why the L1 latency went up even after quartering its size.
The L2 is bigger, but slower.
The L3 appears like it could be partitioned, but we'll have to wait to see if this helps with the latency numbers, which are at best not very good for AMD's current chips.
There are some very nasty corner cases for write combining, so much so that AMD's optimization guide flat-out states there's going to be a new version that won't be so horrible.
edit:
On the plus side, there are some serious bandwidth and utilization problems with the current AMD northbridges and memory controllers, which BD can't help but improve.
AMD chose to go with simpler integer/logic cores, have more of them, clock them higher and share a beefy FPU/SIMD unit between pairs of cost-efficient cores. I was hoping they would go as high as Power6 in clocks (5 GHz), since one of the original goals of the Bulldozer architecture was to allow higher clock frequencies.
The clocks promised so far are higher, but not massively higher than competitors and predecessors.
The more likely goal was not higher clock speeds, but a pipeline that could maintain mostly equivalent IPC and a modest clock gain without having as good a process or the same level of custom circuit design as other high-performance x86 chips.
We'll need to wait on the final clocks and performance figures, particularly for the desktop market for which it seems less suited.
The top FX SKU seems to be priced around the level of a 2600K.
The AMD chip would come with a 30% larger die and 30% higher TDP.
This is somewhat higher than I had originally thought. I thought it would be between Westmere and Sandy Bridge, though closer to Westmere.
It might be a shade higher.
One thing I did not anticipate was the lack of clock scaling for Intel from the release clocks.
Aside from the Xeon model that shuts off the GPU for an extra speed grade, the clocks have not advanced, where I had expected 1 or 2 bumps over the better part of a year. Perhaps there is something that makes this problematic, or Intel hasn't seen the need yet.