In order to get that 2 mops/cycle you'd need to decode 4 instructions (assuming a single path) every cycle. But since the decoder *alternates* between cores every cycle, each core only gets (up to) 4 instructions every TWO cycles. I highly doubt you can sustain that: say you decode 3 mops/cycle, which is already quite high. The core then has 2 full cycles to execute those 3 mops, filling only 1.5 ALU slots, and that's assuming no AGLU usage at all, not even for mov r,r / mov r,m. Averaging decode between 2 and 3 mops/cycle, your 2+2 execution resources sit quite underused (between 1 and 1.5 mops/cycle per core!).
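The arithmetic above can be sketched as a quick back-of-envelope model (the decode rates are the assumed figures from the argument, not measurements):

```python
# Back-of-envelope model of Bulldozer's shared decoder, per the argument above.
# The decoder serves each core of a module on alternating cycles, so one core
# only sees decode bandwidth every other cycle.

def per_core_throughput(decode_rate_mops_per_cycle):
    """mops/cycle available to one core when the decoder alternates."""
    return decode_rate_mops_per_cycle / 2.0  # one decode slot every two cycles

for rate in (2.0, 3.0, 4.0):
    print(f"decode {rate} mops/cycle -> {per_core_throughput(rate)} mops/cycle per core")
# decode 2.0 mops/cycle -> 1.0 mops/cycle per core
# decode 3.0 mops/cycle -> 1.5 mops/cycle per core
# decode 4.0 mops/cycle -> 2.0 mops/cycle per core
```

This is just the halving effect of the alternating front end; it matches the claim that you'd need a sustained 4/cycle decode to feed 2/cycle per core.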
Masking the other core off showed a performance improvement, although it was pretty modest.
There are just so many other ways to lose utilization that the decoupled front end is one weakness of many.
There are issue restrictions on which EXE pipeline can do what, MUL and DIV for example, and branches can only use one pipeline. In branchy code the core can look 50% thinner on a given cycle.
edit: Also a lack of move elimination, which is more noticeable with the claustrophobic 2 issue slots. Later iterations of the architecture will give the AGU ports the ability to handle moves, though. Intel's design does better.
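A toy model of why missing move elimination hurts a narrow machine (the mop stream and costs here are invented for illustration, not taken from any real trace):

```python
# Toy model: count execution slots consumed by a mop stream with and without
# move elimination. A core that supports elimination resolves reg-to-reg moves
# at rename time, so they never occupy an issue slot.

def exec_slots_needed(mops, eliminate_moves):
    """Total execution slots consumed under a naive one-slot-per-mop model."""
    slots = 0
    for op in mops:
        if op == "mov_rr" and eliminate_moves:
            continue  # handled at rename, zero execution cost
        slots += 1
    return slots

stream = ["add", "mov_rr", "mul", "mov_rr", "sub", "add"]
print(exec_slots_needed(stream, eliminate_moves=False))  # 6
print(exec_slots_needed(stream, eliminate_moves=True))   # 4
```

With only 2 issue slots per core, those extra mov slots translate directly into extra cycles, which is why the lack of elimination is more noticeable here than on a wider design.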
Because of hyperthreading. A module runs two threads, and an Intel core runs two threads (granted, using very different mechanisms). The Intel core uses all its resources as best it can and delivers high single-thread performance, whereas the AMD module gives slower single-thread performance in exchange for a bigger 2-thread IPC.
For all but the most friendly apps, Bulldozer doesn't provide superior aggregate throughput.
Unless you consider the AMD 4M/8C a full 8-core processor; given the shared front end, I think it is not.
The cores have separate memory pipelines, issue, and control hardware.
Well, if you consider mostly simple instructions for IPC they compare fine, since the MOP output from the decoder is fairly similar between AMD and Intel (not identical, though: even a push gets decoded differently, for example...).
IPC as traditionally used is the number of instructions a design can execute for a thread in a cycle. In more general terms, it is what a core can sustain when given a non-toy workload.
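For reference, the basic ratio behind that definition (a trivial sketch; the counts are invented, not measured):

```python
# IPC = retired instructions / elapsed core cycles, per the definition above.
# The numbers below are made up purely for illustration.

def ipc(instructions_retired, cycles):
    """Instructions per cycle for one thread over a measurement window."""
    return instructions_retired / cycles

# e.g. a thread retiring 1.5 billion instructions over 1 billion cycles:
print(ipc(1_500_000_000, 1_000_000_000))  # 1.5
```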
Regardless, actual benchmarks show that those cores wind up stalling more, so even in aggregate terms the instruction throughput per clock is weak.
Yeah, this has always been a weak point for AMD. If you need to optimize assembly, you go for Intel's processors.
The bigger problem is that general-purpose processors have trended towards being resilient enough to not require so much handholding.
Sandy Bridge has a few optimization considerations of its own, such as keeping hot code small enough to fit in the uop cache and the complex/simple decoder arrangement. It's still very strong in non-ideal situations.
Bulldozer has a raft of other problems on top of that, and it falls off the ideal case very quickly.