Well obviously the register files would have to be unified, or at least close together, for reverse Hyper-Threading to work. Note that the FlexFP unit is fully shared, and Intel's Hyper-Threading shares everything, so it's clearly feasible.
The schedulers can ill afford to step on each other like that. The FP unit has its own scheduler along with the unified FP reg file.
If the integer register files were unified, it would almost follow that the scheduling and issue logic would be unified into a single scheduler.
Reverse-hyperthreading would be better characterized as reversing AMD's design decision and making it a one-core module.
That seems like a contradiction to me. If they're small, then why would the routing paths suddenly be too long when you have 3 of them instead of 2? Especially since AMD has had 3 ALUs for over a decade and process technology shrinks faster than clock rates rise, that doesn't seem very problematic to me.
The design targets higher clocks, but its designers also emphasize streamlining, reducing the amount of logic and complexity per stage beyond what the clock target alone would require. The delay of adding extra forwarding paths, and the extra burden on the high-speed scheduler, may have been more than AMD could manage. One more ALU would mean 4 new forwarding paths from the new ALU to the other ALUs and AGUs in the core, plus 4 more paths, one from each of those units back to the new one.
Sharing 2 ALUs between the cores of a module would mean the shared ALUs carry more than double the burden compared with adding one ALU to a single core.
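To put rough numbers on that, here's a back-of-the-envelope sketch assuming a fully connected bypass network and Bulldozer's 2-ALU/2-AGU integer core; the peer counts are my own illustration, not AMD's actual wiring:

    # Count the point-to-point bypass paths one new execution unit adds,
    # assuming every unit forwards results to every other unit.
    def new_paths(peers):
        # One path out to each peer, plus one path back from each peer.
        return 2 * peers

    # An extra ALU inside one core that already has 2 ALUs + 2 AGUs:
    per_core = new_paths(4)   # 8 new paths
    # An ALU shared between both cores of a module (one of a shared pair):
    # 4 units in each core plus the other shared ALU = 9 peers.
    shared = new_paths(9)     # 18 new paths
    print(per_core, shared, shared > 2 * per_core)  # 8 18 True

On those assumptions, each shared ALU really does carry more than twice the forwarding burden of a plain fourth ALU in one core, which is the point above.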
As for saying AMD shouldn't have problems since it had 3 ALUs 10 years ago: note that it has failed to consistently improve on a design it made 10 years ago, and has been run ragged ever since.
It seems to me that AMD has underestimated the importance of IPC, and tried to compensate for it with higher clock frequencies and more cores. But the cure seems worse than the disease.
They're smarter than that. The problem is that high-performance designs are hard, designs competitive with Intel's are harder, and doing it knowing they wouldn't have the resources or a competitive manufacturing process is harder still.
BD's design makes sense from the viewpoint that the designers wanted to get as much performance as they could, knowing that they could not optimize it the way Intel can and could not count on a good process from GF.
There were some rumors about the use of T-RAM technology. Perhaps that could save Bulldozer's cache hierarchy.
I'm not sure it would. The L3 is a waste for non-server loads, and T-RAM's access speeds make it too slow for the L1 and L2.
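For intuition, a quick average-memory-access-time calculation (with made-up cycle counts, not T-RAM datasheet figures) shows why hit time in the L1 and L2 dominates: every load pays the L1 hit time, so a few extra cycles there outweigh much larger changes further out.

    # AMAT = L1_hit + L1_miss_rate * (L2_hit + L2_miss_rate * L3_hit)
    # All latencies in cycles; the numbers are illustrative only.
    def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, l3_hit):
        return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * l3_hit)

    fast_l1_l2 = amat(4, 0.05, 18, 0.30, 60)   # 5.8 cycles on average
    slow_l1_l2 = amat(8, 0.05, 30, 0.30, 60)   # 10.4 cycles on average
    print(fast_l1_l2, slow_l1_l2)

Doubling the L1 hit time nearly doubles the average access time here, even with the L3 latency held constant.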
AMD's cache hierarchy and interconnect just aren't all that much better than what preceded them, and what preceded them hasn't been all that good for years.