One of AMD's central problems has been pushing the K8 core for as long as it has. Aside from incremental changes, the scheduling, branch prediction, and memory pipeline weaknesses were inextricably linked to design decisions made a decade ago.
I would hope a new core redesigned to face the environment at the current process geometries and design paradigms would do better.
At 32nm, I would also hope Bulldozer manages to do better than a 45nm Nehalem.
A Bulldozer chip should have peak FP resources much higher than a current Nehalem, so hopefully it can at least manage that.
Integer performance is less clear cut, particularly in a single-threaded situation.
One of the unknowns that could shift things is whether the design's emphasis on higher clocks and power management succeeds, and a core's turbo can hit clocks 25-30% higher, given the alleged FO4 reduction per stage. Naively, a Phenom with a pipeline that allows turbo to 3.6 GHz would, if transformed into a Bulldozer core, have clocks around 4.5 GHz at the low end of that range (much laughter at the simplification of a complex problem with specious math aside).
That hopefully will put it at least a little ahead of the current Nehalems. (edit: this does exclude the claimed IPC advantage BD has over Phenom)
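For anyone who wants the specious math spelled out, here is a rough sketch of it. The function name is made up for illustration, and the only inputs are the 3.6 GHz turbo figure and the 25-30% uplift quoted above; it assumes clocks scale linearly with the alleged per-stage FO4 reduction, which is exactly the oversimplification already being laughed at, and it ignores any IPC difference.

```python
# Naive clock scaling: assumes frequency rises in direct proportion to the
# claimed uplift. Real designs are limited by power, wire delay, and more.

def naive_scaled_clock(base_ghz: float, uplift: float) -> float:
    """Return base_ghz raised by a fractional uplift (0.25 means +25%)."""
    return base_ghz * (1.0 + uplift)

PHENOM_TURBO_GHZ = 3.6  # hypothetical Phenom turbo clock from the post

low_estimate = naive_scaled_clock(PHENOM_TURBO_GHZ, 0.25)
high_estimate = naive_scaled_clock(PHENOM_TURBO_GHZ, 0.30)

print(f"Naive Bulldozer turbo range: {low_estimate:.2f} to {high_estimate:.2f} GHz")
```

The low end is the 4.5 GHz figure mentioned above; the high end is just the same multiplication with 30% instead of 25%.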
One problem is that server-derived Zambezi looks to be a poor fit for desktop workloads. There will not be a massive need for 8 int cores, and the FP unit does not get the full L1 bandwidth it could use if the chip is idling one of the cores. It may not hurt too much against a 45nm Nehalem, but Sandy Bridge will not be operating at that level.
edit edit:
Sorry for the edit-fest, but one thing that is really starting to bug me is the prevalence of the claim that Phenom can only do a mix of 3 ALU/AGU ops per cycle, whereas BD can do ALU+ALU+AGU+AGU per cycle, thus claiming that BD is wider.
The three instruction schedulers in K8 receive macro ops.
A macro op is similar to a fused micro op in an Intel chip.
That means it is often an ALU and AGU op in the same entry.
Each scheduler is capable of breaking a macro op down into its constituent micro ops, issuing up to one ALU op and one AGU op per cycle.
In theory, K8 can send off a burst of ALU+ALU+ALU+AGU+AGU+AGU in a cycle. Two memory ops can be sent through the AGUs, along with one LEA.
So that means Phenom is definitely wider than Bulldozer, though much less capable of hitting that peak.
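The peak-width comparison can be tallied in a few lines. This is only a back-of-the-envelope sketch: the names are invented for illustration, the K8 numbers come from the three-scheduler description above, and the Bulldozer numbers are just the ALU+ALU+AGU+AGU claim being repeated around the forums, per integer core. It counts theoretical peak micro ops per cycle and says nothing about how often either design actually reaches that peak.

```python
# Theoretical peak integer micro ops issued per cycle, per core.
# These are paper numbers from the discussion, not measurements.

K8_SCHEDULERS = 3  # three schedulers, each fed macro ops

# Each K8 scheduler can issue up to one ALU op and one AGU op per cycle.
k8_peak = {"alu": K8_SCHEDULERS * 1, "agu": K8_SCHEDULERS * 1}

# Bulldozer as commonly claimed: ALU+ALU+AGU+AGU per cycle.
bd_peak = {"alu": 2, "agu": 2}

k8_total = sum(k8_peak.values())
bd_total = sum(bd_peak.values())

print(f"K8 peak: {k8_peak['alu']} ALU + {k8_peak['agu']} AGU = {k8_total} ops/cycle")
print(f"BD peak: {bd_peak['alu']} ALU + {bd_peak['agu']} AGU = {bd_total} ops/cycle")
print("K8 is wider at peak" if k8_total > bd_total else "BD is wider at peak")
```

On these assumptions K8 comes out at six micro ops per cycle against four, which is the whole point of the complaint: the "3 versus 4" framing counts macro ops on one side and micro ops on the other.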
source:
www.agner.org/optimize/microarchitecture.pdf