From my understanding the 32nm process was genuinely bad at the start but had matured nicely by the time AMD moved on to 28nm bulk (look at the FX-8370E). BD isn't a speed racer design; its integer pipeline is shorter than Intel's. BD was just a bad target really: it wasn't what the market wanted, and for some reason the modules are huge. Add in all the glass jaws the design seemed to have and there was no redemption.
Part of that might be the lack of a uop cache, which allowed Intel's effective branch misprediction penalty to appear shorter, since several stages of the full instruction fetch path get skipped on a uop cache hit. Even on a uop cache miss, Sandy Bridge's mispredict penalty was ~17 cycles to Bulldozer's ~20 (or sometimes worse, because it's Bulldozer).
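To make that concrete, here's a toy back-of-the-envelope model of how a uop cache pulls the average recovery cost down by letting mispredicts that hit in the uop cache skip the legacy fetch/decode stages. The hit rate and cycle counts are my own illustrative numbers, not vendor figures:

```python
# Toy model: average mispredict penalty when recovery can sometimes
# restart from the uop cache instead of the full fetch/decode path.
# All rates and cycle counts here are made up for illustration.
def avg_mispredict_penalty(uop_hit_rate, penalty_uop_hit, penalty_legacy):
    """Blend the two recovery paths by how often each one is taken."""
    return uop_hit_rate * penalty_uop_hit + (1 - uop_hit_rate) * penalty_legacy

# Hypothetical core with a uop cache: cheap recovery most of the time.
print(avg_mispredict_penalty(0.80, 14, 17))  # ~14.6 cycles on average

# A core with no uop cache always pays the full fetch path, e.g. ~20 cycles.
print(avg_mispredict_penalty(0.00, 14, 20))  # 20.0 cycles
```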
More recent Intel cores do appear to have added more stages, although that isn't discussed in much public detail these days.
However, this is multiple nodes past 32nm, so despite the longer pipelines they've packed more hardware into each stage, the opposite of a conceptual speed racer that would run a longer pipeline with much less in each stage.
In that respect, speed racer or speed demon may be something of a subjective measure or one based on context.
If traditional scaling from the time of the concept had continued (per Intel's Tejas plan), a speed racer would be clocking at a multiple of some of the suicide OC runs done with modern cores.
Factors like using extra stages to buy timing margin in certain functions at lower voltages, or to combat the lack of scaling in wire delay, encourage deeper pipelines, while transistor scaling lets those stages be fatter than would have been feasible on earlier nodes.
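That tradeoff falls out of the textbook pipelining relation: split a fixed amount of logic across N stages and cycle time approaches the per-stage latch/skew overhead, so frequency gains diminish as N grows. A quick sketch with made-up delay numbers:

```python
# Classic pipeline-depth model: cycle time = logic_delay/N + overhead.
# More stages raise Fmax, but latch overhead and unscaled wire delay
# put a ceiling on it. Numbers are illustrative, not any real core's.
def fmax_ghz(logic_delay_ns, n_stages, per_stage_overhead_ns):
    cycle_ns = logic_delay_ns / n_stages + per_stage_overhead_ns
    return 1.0 / cycle_ns

for n in (10, 15, 20, 30):
    print(f"{n:2d} stages -> {fmax_ghz(5.0, n, 0.05):.2f} GHz")
# 10 stages -> 1.82 GHz ... 30 stages -> 4.62 GHz: each added stage
# buys less, which is why the extra stages get spent on fatter logic.
```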
That said, I think it could be argued that BD wasn't a speed demon, although it did have elements that showed a preference for the higher end of the clock range. It didn't skip features like bypassing, and it had a decent amount of width in some areas, perhaps marking it as a troubled compromise after speed racer clocks were definitively ruled out; maybe AMD aborted a more ambitious design effort that CMT would have enabled.
The funny thing is that the CON cores' Fmax is limited by the L2 cache. Given how bloody slow that thing is (in cycles per access), I bet no one saw that coming... lol
I saw some discussion relaying third-hand conversations saying the L2 array engineers were pretty insistent that their portion was fast enough. Something just didn't work out with BD in any concurrent memory access scenario, whether between the cores in a module or between modules.
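For a sense of what that L2 cycle count means in practice, here's a rough average-memory-access-time sketch. The hit rates and latencies are illustrative, not measured Bulldozer figures, though the small write-through L1D and a ~20-cycle L2 are in the right ballpark:

```python
# Rough AMAT sketch: every L1 miss pays the full L2 latency, so a
# high-latency L2 behind a small, miss-prone L1 dominates the average.
# Hit rates and latencies below are illustrative, not measurements.
def amat_cycles(l1_hit_rate, l1_latency, l2_latency):
    return l1_latency + (1.0 - l1_hit_rate) * l2_latency

print(amat_cycles(0.90, 4, 20))  # 6.0 cycles: small L1, ~20-cycle L2
print(amat_cycles(0.97, 4, 20))  # 4.6 cycles: same L2 behind a better L1
```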