I don't think a 4-core SB vs. a 2-module BD is an apples-to-apples comparison, nor is a 4-core SB vs. a 4-module BD, hence my comment above about apples and oranges.
That comparison will run from the i7 down to the i5.
A 2-module BD would be going against a SB i3, or rather it would if it were released in the same year and not, probably, after Ivy Bridge.
The number of FP units looks like a decent marketing point for gauging comparisons.
At best, an i3 versus a 2-module BD will likely show the same performance profile as the larger chips: weaker integer IPC, potentially better integer multithreading, roughly equal FP in SSE code, and a likely loss in AVX code.
In order to evaluate BD and SB we need to look at:
1. Go-for-broke single thread performance of a SB core with SMT disabled vs. a BD core/module.
This is the common client workload, to the extent performance matters there at all.
2. Massively multithreaded performance: as many cores as can fit in a package within a given power envelope.
Not a significant factor for the client space.
3. Something in between, like a multithreaded application that can reasonably use 8-16 contexts. Again limited by power envelope.
Not too common as of yet, outside of a few cases like multimedia apps. SB can trade blows here, unless it's an app that loves AVX.
Power is an unknown factor. If AMD's process does not reasonably match Intel's, BD may not be able to clock as high as it needs to.
The data cache is smaller; the instruction cache isn't.
Instruction cache misses are a fraction of what a data cache experiences, so I was not considering the Icache.
A 16KB 4-way cache has roughly a 1% miss rate on SPECint 2000 (old, but that's the figure I can remember off the top of my head); a 64KB 2-way cache might have half that. That difference doesn't change average latency much if the smaller cache allows the operating frequency to be even a few percent higher.
The L1 latency is 33% longer, and so is the cost of an L2 hit.
With the ratio of L1 to L2 hits unchanged, the average latency is therefore 33% longer by default.
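To make the arithmetic concrete, here is a back-of-the-envelope AMAT (average memory access time) sketch. Every latency, miss rate, and clock figure in it is an illustrative assumption rather than a measured BD or K10 number; the miss rates follow the SPECint figures above, the 3- vs 4-cycle L1 and proportionally scaled L2 follow the 33% claim, and the 5% clock bump is the hypothetical "few percent higher frequency."

```c
/* Back-of-the-envelope AMAT comparison. All numbers are illustrative
 * assumptions, not measured figures for any real part.
 */
#include <stdio.h>

static double amat_ns(double l1_hit_cyc, double l2_hit_cyc,
                      double l1_miss_rate, double ns_per_cycle)
{
    /* Ignore L2 misses; both designs are assumed to miss to L3 equally often. */
    double cycles = l1_hit_cyc + l1_miss_rate * l2_hit_cyc;
    return cycles * ns_per_cycle;
}

int main(void)
{
    double base_clock_ns = 1.0 / 3.5;            /* 3.5 GHz baseline            */
    double fast_clock_ns = 1.0 / (3.5 * 1.05);   /* same design, 5% higher clock */

    /* 64KB 2-way style cache: 3-cycle L1, 15-cycle L2, ~0.5% L1 miss rate */
    double big   = amat_ns(3.0, 15.0, 0.005, base_clock_ns);
    /* 16KB 4-way style cache: 4-cycle L1, 20-cycle L2, ~1% L1 miss rate,
     * with the smaller array assumed to permit the higher clock.          */
    double small = amat_ns(4.0, 20.0, 0.010, fast_clock_ns);

    printf("64KB-style AMAT: %.3f ns\n", big);
    printf("16KB-style AMAT: %.3f ns\n", small);
    return 0;
}
```

Under these assumptions the miss-rate delta contributes only a fraction of a cycle, while the longer L1/L2 hit latency dominates; a few percent of extra frequency narrows the gap but doesn't close it.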
On top of that, AMD must be confident that they can schedule around the L2 latency; otherwise they would have made it smaller and faster.
There are other reasons why they wouldn't or couldn't, some of which depend on the latency and effectiveness of the L3 and uncore.
If the L3 is prohibitively slow, or if whatever xbar is used to communicate with the memory controller, L3, and other cores prioritizes bandwidth over latency, then minimizing L2 misses is worthwhile even if it hurts performance in the cases where accesses stay within the L1 or L2.
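A hypothetical two-line calculation of that tradeoff, with every latency and miss rate made up purely for illustration: once an access has already missed the L1 and the L3/uncore round trip is expensive, a bigger-but-slower L2 that converts enough L2 misses into L2 hits wins on average latency.

```c
/* Hypothetical illustration of the L2 sizing tradeoff. Every number here is
 * an assumption for the sake of the arithmetic, not a measured BD figure.
 */
#include <stdio.h>

static double avg_cycles(double l2_hit, double l2_miss_rate, double l3_cost)
{
    /* Average cost of an access that has already missed the L1. */
    return l2_hit + l2_miss_rate * l3_cost;
}

int main(void)
{
    double slow_l3 = 60.0;   /* assumed L3 + xbar round trip, in core cycles */

    /* small, fast L2: 12-cycle hit, but 30% of L1 misses also miss the L2 */
    printf("small/fast L2: %.1f cycles past the L1\n",
           avg_cycles(12.0, 0.30, slow_l3));
    /* large, slow L2: 20-cycle hit, only 10% of L1 misses go on to the L3 */
    printf("large/slow L2: %.1f cycles past the L1\n",
           avg_cycles(20.0, 0.10, slow_l3));
    return 0;
}
```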
Pathological cases where you are pointer chasing in a working set that is larger than 16KB and smaller than 64KB will suck on BD compared to Opteron, but that case is just that, pathological.
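For anyone who wants to see that pathological pattern in practice, below is a minimal pointer-chase sketch; the buffer size, iteration count, and use of Sattolo's shuffle are my own choices for illustration. A ~32KB randomly permuted chain fits comfortably in a 64KB L1 but not in a 16KB one, so on the smaller cache each dependent load becomes an L1 miss.

```c
/* Minimal pointer-chase sketch of the pathological case: a randomly permuted
 * cyclic chain with a ~32KB footprint, chased with fully dependent loads so
 * latency (not bandwidth) dominates. Sizes and counts are arbitrary.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NODES (32 * 1024 / sizeof(size_t))   /* ~32KB working set */

int main(void)
{
    size_t *next = malloc(NODES * sizeof *next);
    size_t i, j, tmp, steps = 100u * 1000u * 1000u;
    if (!next) return 1;

    /* Build an identity chain, then shuffle it into one long random cycle
     * (Sattolo's algorithm) so hardware prefetchers can't help.           */
    for (i = 0; i < NODES; i++) next[i] = i;
    srand(1);
    for (i = NODES - 1; i > 0; i--) {
        j = rand() % i;                 /* j < i guarantees a single cycle */
        tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    /* Chase the chain; each load depends on the previous one. */
    clock_t t0 = clock();
    for (i = 0, j = 0; i < steps; i++) j = next[j];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("%.1f ns per hop (j=%zu)\n", secs * 1e9 / steps, j);
    free(next);
    return 0;
}
```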
16KB is pretty small on an absolute scale. It's not tiny like the P4's 8KB data cache, but that cache was uncomfortably small a decade ago. Working sets fortunately don't follow Moore's law, but they do grow slowly as time passes. I'd be curious how close they've come to doubling in a decade.
The vast majority of misses at that size are capacity misses, and that was with old SPEC benchmarks.