Found the original Bobcat technical whitepaper (from IEEE Micro). Some good stuff inside it, including the confirmation that Bobcat was aimed at 90% of the performance of K8 (not 90% of the IPC of K10). I am making a comparison table based on the information.
"... Totally different design."
AMD's own Bobcat technical whitepaper states on many occasions that they reused, improved and fine-tuned old concepts and hardware blocks (from K8 and K10/Barcelona). For example, they state that the Bobcat floating point "coprocessor" is very similar to the K8 floating point "coprocessor", except that they dropped one of the 3 pipelines.
"You left out that Bobcat can only decode/issue 2 instructions per cycle, while PD can decode 4 and issue 4 to an integer core and/or 4 to the FPU. And Bobcat can only do 1 load + 1 store per cycle, while PD can do 2 loads (I don't think it can actually do 2 stores though). These are not small differences - mainly, being able to support a load/store or two in conjunction with two ALU/branch/multiply/etc is a big deal, especially for x86. Even in FPU heavy code it's nice to be able to issue at least one integer instruction in addition to two FP ops for flow control/pointer arithmetic/etc."
Incorrect.
A Bobcat/Jaguar core can issue up to 6 instructions per clock (it has a dual port integer scheduler, a dual port AGU scheduler and a float "coprocessor" scheduler that can issue 2 instructions per clock). However, it cannot decode/retire more than 2 instructions per clock.
In comparison, a BD/PD module (2 cores) can issue 2x4+4 = 12 instructions per clock, and decode/retire 4 instructions per clock. So a BD/PD module (two cores) provides exactly the same peak and sustained rates as two Bobcat/Jaguar cores.
Of course shared decode and especially shared retire are a boon for BD/PD average IPC, since the processor can temporarily boost the sustained decode/retire rate of one core when the other core stalls (pipeline bubbles, cache stalls, branch misprediction, etc). The peak (all cores running at full steam) instruction throughput of Jaguar and BD/PD cores is however exactly the same.
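To make the arithmetic above explicit, here is a minimal sketch that just adds up the per-clock figures quoted in this post. The `Unit` class and the `times` helper are hypothetical names made up for illustration; the numbers themselves are the ones stated above, not new measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """Peak per-clock rates for a core or a module (illustrative only)."""
    name: str
    issue_per_clock: int
    decode_retire_per_clock: int


def times(unit: Unit, count: int) -> Unit:
    """Scale a unit's peak rates by the number of independent copies."""
    return Unit(
        name=f"{count}x {unit.name}",
        issue_per_clock=unit.issue_per_clock * count,
        decode_retire_per_clock=unit.decode_retire_per_clock * count,
    )


# Bobcat/Jaguar core: 2 integer + 2 AGU + 2 float issue ports, 2-wide decode/retire.
bobcat_core = Unit("Bobcat/Jaguar core", issue_per_clock=2 + 2 + 2,
                   decode_retire_per_clock=2)

# BD/PD module: 2 integer cores x 4 issue, plus 4 for the shared FPU,
# with a 4-wide decode/retire figure for the whole module (as stated in the post).
bd_module = Unit("BD/PD module", issue_per_clock=2 * 4 + 4,
                 decode_retire_per_clock=4)

two_bobcats = times(bobcat_core, 2)

for unit in (two_bobcats, bd_module):
    print(f"{unit.name}: issue {unit.issue_per_clock}/clk, "
          f"decode/retire {unit.decode_retire_per_clock}/clk")
```

This only compares peak rates with every core busy. It does not model the case where one BD/PD core stalls and the other borrows the shared decode/retire width.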
"As far as L1D is concerned, Jaguar does have the bigger cache but loses in associativity (2-way vs 4-way) which is a liability on some workloads."
Incorrect. Bobcat/Jaguar L1D is 8-way. It has twice the size and twice the associativity of the BD/PD L1D. And Bobcat/Jaguar L1D latency is 3 cycles, while BD/PD L1D latency is 4 cycles.
--> Bobcat/Jaguar L1D is better than BD/PD L1D in every way.
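A quick sketch of what "twice the size and twice the associativity" means for the cache geometry. The capacities (32 KB for Bobcat/Jaguar, 16 KB for BD/PD) and the 64-byte line size are my assumptions here, not figures stated in this post; the way counts and latencies are the ones given above. `L1D` and its properties are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class L1D:
    name: str
    size_bytes: int       # assumed capacity, not stated in the post
    ways: int             # associativity
    latency_cycles: int   # load-to-use latency quoted in the post
    line_bytes: int = 64  # assumed cache line size

    @property
    def sets(self) -> int:
        """Number of sets = capacity / (ways * line size)."""
        return self.size_bytes // (self.ways * self.line_bytes)

    @property
    def way_bytes(self) -> int:
        """Bytes covered by a single way."""
        return self.size_bytes // self.ways


jaguar = L1D("Bobcat/Jaguar", size_bytes=32 * 1024, ways=8, latency_cycles=3)
bd_pd = L1D("BD/PD", size_bytes=16 * 1024, ways=4, latency_cycles=4)

for c in (jaguar, bd_pd):
    print(f"{c.name}: {c.size_bytes // 1024} KB, {c.ways}-way, "
          f"{c.sets} sets, {c.way_bytes // 1024} KB per way, "
          f"{c.latency_cycles} cycles")
```

Under these assumptions both caches have the same number of sets, so the extra capacity in Bobcat/Jaguar comes entirely from the extra ways.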
"And from test numbers I've seen its L2 is not just lower bandwidth but at least as high latency."
That's not right. Bobcat/Jaguar have a 17-cycle L2 latency, and Bulldozer L2 latency is 20-22 cycles.