My thinking on why Sony was allegedly considering Steamroller is that the clocks were never intended to be the same, and would have differed by a very significant margin.
A Steamroller module (2 cores with a shared FPU) has the same peak flops as 2 Jaguar cores (half a module) at the same clocks.
The base clocks for the desktop Kaveri SKUs seem to indicate that at least doubling Jaguar's clocks could have been possible, and the PS4 Pro's uptick in power consumption is an example of where Sony was fine with a marginal increase in overall power consumption over the release version of the PS4.
There's loss of peak against matched FADD and FMUL operations that cannot be turned into an FMA, which Steamroller cannot avoid. Loss of peak due to a mix that is slanted towards FADD or FMUL is something Jaguar also experiences, which I think makes the decision less straightforward at an architectural level. Jaguar has more issue port contention on a per-clock basis because it has one fewer FPU pipeline, and it has a less feature-rich ISA. The extra instructions AMD had for the Bulldozer line saw limited use in the desktop market, but they would have been presented differently as part of a proprietary platform.
2x Jag cores do a total of 2xFADD and 2xFMUL per cycle (xSIMD4) = 4*4 = 16 flop. A Steamroller module does a total of 2xFMA (xSIMD4) = 2*2*4 = 16 flop. So in theory these are tied. However, Steamroller needs FMA to reach max throughput. If the code is not specifically optimized for FMA, you lose half of the theoretical flops (and even in optimized code, FMA pairing efficiency is never even close to 100%). Also, Steamroller FPU instructions tend to have much higher latency than their Jaguar equivalents, so it is harder to get full utilization out of it. I also remember reading somewhere that the shared FPU can cause additional stalls if both cores are utilizing it heavily.
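To make the per-cycle arithmetic above concrete, here is a rough Python sketch. It is a toy model, not a simulation: it assumes 4-wide single-precision SIMD, that a Jaguar core can issue one FADD and one FMUL per cycle, and that either Steamroller FMA pipe can take any FP op. The function names and the add_fraction / fused_fraction parameters are made up for illustration, and latency, issue contention, and shared-FPU stalls are all ignored.

```python
SIMD_WIDTH = 4  # assumed: 128-bit vectors, single-precision lanes


def jaguar_pair_flops_per_cycle(add_fraction=0.5):
    """Peak flops/cycle for 2 Jaguar cores (1 FADD pipe + 1 FMUL pipe each).

    add_fraction is the share of FP ops that are adds; the rest are
    multiplies. The busier pipe limits throughput (hypothetical parameter).
    """
    busiest_share = max(add_fraction, 1.0 - add_fraction)
    ops_per_cycle_per_core = 1.0 / busiest_share
    return 2 * ops_per_cycle_per_core * SIMD_WIDTH


def steamroller_module_flops_per_cycle(fused_fraction=1.0):
    """Peak flops/cycle for 1 Steamroller module (2 shared FMA pipes).

    fused_fraction is the share of issued FP instructions that are FMAs
    (2 flops per lane); the rest count as 1 flop per lane.
    """
    flops_per_instruction = 1.0 + fused_fraction
    return 2 * flops_per_instruction * SIMD_WIDTH


def peak_gflops(flops_per_cycle, clock_ghz):
    """Scale a per-cycle figure by clock speed (linear, no other limits)."""
    return flops_per_cycle * clock_ghz


if __name__ == "__main__":
    print("Jaguar pair, even mix:      ", jaguar_pair_flops_per_cycle(0.5))
    print("Jaguar pair, adds only:     ", jaguar_pair_flops_per_cycle(1.0))
    print("Steamroller, all FMA:       ", steamroller_module_flops_per_cycle(1.0))
    print("Steamroller, no FMA:        ", steamroller_module_flops_per_cycle(0.0))
```

In this model both come out at 16 flops per cycle with an even add/multiply mix and full FMA fusion. With no fusion the Steamroller module drops to 8, and the Jaguar pair drops to 8 if the mix is all adds or all multiplies. Any clock advantage scales the Steamroller figure linearly via peak_gflops.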
As for stalls in a multi-threaded context, I can think of some limitations like the store buffer out of the shared FPU not being able to supply the bandwidth of two cores, but this is a case of faltering into a tie rather than losing to 8 Jaguar cores due to Jaguar's narrower architecture. Some stalls with Piledriver and some possible errata were fixed with Steamroller.
There are non-FPU related performance problems, such as odd regressions in multithreaded performance, cache bandwidth, and decode. Decode was generally doubled to make choking there unlikely, with some additional writeback and errata fixes that might have done a little for the rest.
Are there examples in the PC space where having two cores sharing the FPU leads to a loss in performance versus one core per FPU? The benchmarks I can remember didn't show performance going down.
Steamroller beats Jaguar handily in FP/AVX code if no more than half of the cores are used (one core per module). It is definitely a better fit for PC application workloads (except for Cinebench/Povray/encoding style tasks).
There would be little reason to use a core optimized for a ~4GHz peak if it were to be clocked down to the level of one that struggled to hit 2GHz.
Jaguar, on the other hand, has significantly higher throughput when all cores are used (assuming similar clocks and similar die space = possibly similar power consumption).
If the leaked Steamroller variant of the PS4 is a legitimate early version, it would indicate that concerns such as module area and FPU sharing were considered acceptable since those would have been known in advance.
The actual health of the manufacturing process and the validation of the architecture and overall SOC might have been where the momentum changed.
Perhaps Steamroller couldn't clock sufficiently faster for Sony's requirements with GF's process, or AMD could not make its deliverables for that architecture despite Kaveri seemingly inheriting some high-level similarities with Orbis.
I have seen some discussion that the firmware situation for Kaveri was problematic, and it was notably later to be released than the consoles.
Jaguar's hop to TSMC cost AMD money, and seemingly hindered its bug-fixing and physical characterization--hence lack of turbo in general until the return to GF.
One scenario I've been mulling over is that AMD determined it could not get a validated Steamroller APU out of GF in time, within its clock/power parameters, and in volume.
A Jaguar-based core, and an SOC that used it, could make a jump to TSMC, whereas a Steamroller APU with some apparent teething pains could not.