@3dilettante
So an FX8350 at its stock speed is faster than the 8-core Jaguar found in consoles?
How is that clock for clock?
The FX8350 is a 220W (edit: correction--that's the 9590, 125W for Vishera) processor running from 4 to 4.2 GHz versus the 1.6 GHz Jaguar cores.
For non-FPU workloads, Vishera had as many cores as the PS4, each running over twice as fast. That's vastly better multithreaded throughput and single-threaded performance.
The shared FPUs made it so that there were 4 FPUs, but they were individually much more heavyweight than Jaguar. The exact mixes each design favored do not align all the time (FMA can be better generally, but in a few spots worse versus separate MUL and ADD), but Vishera's FPU could hit the same peak numbers as two Jaguar FPUs and supported a more robust set of shuffle and vector integer operations, even without considering it was running over twice as fast.
Then there's the data paths and cache subsystem, which were wider and faster versus the power-conscious Jaguar.
However, Vishera was a vastly bigger investment in terms of power and silicon, so being much faster doesn't mean it was necessarily an order of magnitude faster than the more modest Jaguar.
Overall, per-clock performance and other efficiency measures would have ruled out the first two Bulldozer core variants. However, by the time Steamroller came about, process and architectural changes had fixed bugs and measurably improved per-clock performance. Steamroller at half the cores but twice the clock speed of Jaguar was what Sony may have nearly gone with for Orbis.
Ahhh... I'd read that Jaguar's FPUs were double or two-way 128-bit FPUs, which I equated as being the same as the FMAC in Bulldozer. So is Jaguar 128-bit per pipe with a performance penalty to combine 2 pipes into DP, and Bulldozer is 2x128-bit per module (2 cores) with no performance penalty for DP?
Jaguar has one 128-bit pipe for addition, and one 128-bit pipe for multiplication. Miscellaneous operations and vector integer ops are distributed among those two ports.
Bulldozer has 4 FPU pipes, two for FMA. Permutes, moves, and integer ops were spread among all four.
Steamroller and later had 3 FPU pipes, with some of the integer operations and miscellaneous instructions moved onto the remaining three.
There's a strong mixture of extensions and operations supported by Bulldozer versus Jaguar, so a lot of the other less glamorous elements of floating point workloads could be better handled by Bulldozer without fighting for cycles on Jaguar's smaller number of ports.
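To put rough numbers on the pipe layouts above, here is a back-of-the-envelope sketch of peak single-precision throughput. It assumes every pipe issues one 128-bit op per cycle, counts an FMA as 2 operations, and uses 4.0 GHz for Vishera and 1.6 GHz for Jaguar as in the earlier post. The function names are made up for illustration, and real code never reaches these peaks.

```python
# Peak single-precision FLOPs per cycle, derived from the pipe layouts
# described in this thread. Idealized: no stalls, no port contention.
SP_LANES = 128 // 32  # four 32-bit lanes per 128-bit pipe

def jaguar_core_sp_flops_per_cycle():
    # One 128-bit add pipe plus one 128-bit multiply pipe per core.
    return SP_LANES + SP_LANES

def bulldozer_module_sp_flops_per_cycle():
    # Two 128-bit FMA pipes per module, each FMA counted as 2 operations.
    return 2 * SP_LANES * 2

def peak_gflops(units, flops_per_cycle, ghz):
    # units = cores (Jaguar) or modules (Bulldozer family)
    return units * flops_per_cycle * ghz

jaguar = peak_gflops(8, jaguar_core_sp_flops_per_cycle(), 1.6)        # 8 cores
vishera = peak_gflops(4, bulldozer_module_sp_flops_per_cycle(), 4.0)  # 4 modules
ratio = vishera / jaguar
```

Under these assumptions one Bulldozer module matches two Jaguar cores per clock (16 vs 2 x 8), and the chip-level gap comes out to about 2.5x, which is large but well short of an order of magnitude.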
For DP, the Bulldozer line could get half-rate. The FMA pipe is counted as performing 2 operations and provides additional flexibility in the pure addition or pure multiplication case. It's not as strong if there's an equal mix of additions and multiplications that cannot be chained together into an FMA.
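The point about instruction mixes can be illustrated with a toy issue model. This is only a sketch: it counts 128-bit vector instructions, ignores latency, dependencies and every other kind of op, and assumes each pipe accepts one instruction per cycle. The function names are hypothetical.

```python
def jaguar_cycles(adds, muls):
    # Separate pipes: adds only go to the add pipe, muls only to the mul pipe,
    # so the busier pipe sets the cycle count.
    return max(adds, muls)

def bulldozer_cycles(adds, muls, fused_pairs=0):
    # Two FMA pipes per module; each can take an add, a mul, or a fused
    # multiply-add. fused_pairs = add/mul pairs that can be chained into FMAs.
    if fused_pairs > min(adds, muls):
        raise ValueError("cannot fuse more pairs than there are adds or muls")
    instructions = adds + muls - fused_pairs  # each fused pair is one instruction
    return -(-instructions // 2)              # ceiling division over 2 pipes

# Pure addition: Bulldozer's two flexible pipes beat Jaguar's single add pipe.
pure_add = (jaguar_cycles(100, 0), bulldozer_cycles(100, 0))
# Equal mix, nothing fusable: both designs take the same number of cycles.
mixed_unfused = (jaguar_cycles(100, 100), bulldozer_cycles(100, 100))
# Equal mix, fully fusable: Bulldozer halves its cycle count.
mixed_fused = (jaguar_cycles(100, 100), bulldozer_cycles(100, 100, fused_pairs=100))
```

In this model the unfusable equal mix is the one case where a module has no per-clock advantage over a single Jaguar core, which is the weak spot described above.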
For Jaguar, the addition pipe could perform a DP addition at half rate. For DP multiplication, the unit could produce the expected half-rate per instruction, but it then had to loop back through the smaller multiplier for an extra cycle--cutting performance further.
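As a rough illustration of the DP case, the sketch below repeats the peak-rate arithmetic with 64-bit lanes. The Jaguar multiply penalty is modeled as the pipe being occupied for one extra cycle per 128-bit DP multiply; that is my reading of the "extra cycle" described above and is exposed as a parameter rather than a confirmed figure. The names are hypothetical.

```python
DP_LANES = 128 // 64  # two 64-bit lanes per 128-bit pipe (half the SP lane count)

def bulldozer_module_dp_flops_per_cycle():
    # Two FMA pipes, half-rate DP, each FMA counted as 2 operations.
    return 2 * DP_LANES * 2

def jaguar_core_dp_flops_per_cycle(extra_mul_cycles=1):
    # Add pipe: half-rate DP, one 128-bit op per cycle.
    add_rate = DP_LANES
    # Mul pipe: ASSUMPTION, each DP multiply blocks the pipe for
    # (1 + extra_mul_cycles) cycles because of the loop back through the multiplier.
    mul_rate = DP_LANES / (1 + extra_mul_cycles)
    return add_rate + mul_rate

bd_dp = bulldozer_module_dp_flops_per_cycle()
jag_dp = jaguar_core_dp_flops_per_cycle()
jag_dp_no_penalty = jaguar_core_dp_flops_per_cycle(extra_mul_cycles=0)
```

With the assumed one-cycle penalty, a Jaguar core drops from 4 to 3 peak DP operations per cycle, against 8 for a Bulldozer module.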
For the consoles, the DP case isn't all that important, however.