I know it's not as simple as just adding more ALUs. There's additional complexity for the schedulers, the forwarding networks, the register files, cache bandwidth, instruction decoding rate, etc.
However, as far as I can tell, my suggestion falls somewhere between already existing designs. It could bring single-threaded performance back on par without the cost of two full-blown cores. So I don't see why it should be dismissed that easily.
In theory, it would have the benefit of an additional ALU. In practice, it would have a cost. Not in die area, but in clock speed.
The number of ALUs is not dictated by cost or area. A simple integer ALU is ridiculously small. We are talking about a few tens or hundreds of thousands of transistors in a chip made of billions of them. If they wanted more, they wouldn't share. They'd just add more.
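To put very rough numbers on that (both figures below are ballpark assumptions of mine, not anything from a datasheet), here's the back-of-the-envelope version:

```c
#include <stdio.h>

/* Back-of-the-envelope: what fraction of a modern die is one simple ALU?
 * Both figures are rough assumptions, for illustration only. */
int main(void)
{
    double alu_transistors = 100e3;  /* ~100k transistors for a simple integer ALU */
    double die_transistors = 1.2e9;  /* ~1.2 billion transistors for the whole chip */

    printf("One ALU is about %.4f%% of the die.\n",
           100.0 * alu_transistors / die_transistors);
    /* Prints roughly 0.0083% -- die area is clearly not what limits the ALU count. */
    return 0;
}
```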
The forwarding network is responsible for getting the result of one ALU to another "between cycles", so that one ALU can start an operation on cycle n+1 using a result produced by another ALU in cycle n. Since the time it takes a signal to propagate through the forwarding network is added to every clock, making it take longer directly hurts the clock speed of the chip. Going from "4 places to forward results from/to" to "5 places to forward results from/to" means a minimum of one additional gate with a high fanout on the critical path; with a cycle budget of roughly 16 gate delays, that one extra gate alone is about a 6% hit in clock speed, without even looking at the routing.

Then comes the problem that the signal needs to travel from every unit that is part of the forwarding network to every other unit in that time "between clocks". This means the physical distance between the units is very strictly limited. If the fifth unit were sandwiched between the two clusters, the extra routing would be, at minimum, as long as the width of a single unit. That's another few percent of speed gone.

Then, since the units would have to sit right next to each other, this would essentially burn more than a quarter of the space available within very short distance of the register file. That space is at a very high premium: everything that needs fast access to the reg file has to be there, so reorganizing the clusters together would force something else that is near the reg file to be a little farther away. That's probably a few percent again.
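Here's the arithmetic behind that 6% figure as a sketch. The 16-gate-delay cycle budget and the 4 GHz baseline are my assumptions (a common ballpark for high-clock designs), not numbers from any specific chip:

```c
#include <stdio.h>

/* Sketch of how one extra gate delay on the critical path eats clock speed.
 * Assumes a cycle budget of ~16 gate delays; the real number varies by design. */
int main(void)
{
    double gate_delays_per_cycle = 16.0;  /* assumed critical path length */
    double base_clock_ghz = 4.0;          /* assumed baseline clock */

    /* one more gate in the path stretches the cycle from 16 to 17 delays */
    double new_clock_ghz = base_clock_ghz * gate_delays_per_cycle
                           / (gate_delays_per_cycle + 1.0);

    printf("Clock drops from %.2f GHz to %.2f GHz (%.1f%% slower),\n",
           base_clock_ghz, new_clock_ghz,
           100.0 * (1.0 - new_clock_ghz / base_clock_ghz));
    printf("before counting any extra wire length in the forwarding network.\n");
    return 0;
}
```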
You are thinking in terms of the units being the primary design detail and constraint, and the scheduler/reg file/forwarding being minor details that someone else can think about. The exact opposite is true. The speed and design of modern high-performance CPUs are mainly determined by getting data and ops to where they are needed. The design, including the number of units, is dictated by how complex a reg file and forwarding network they can get away with. The simple units are such a small portion of the actual execution core that, if they can get ops and data to them, they can just slap on as many as they feel like. That's why SIMD exists.
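A crude way to see that last point: with SIMD, one decoded instruction, one scheduler slot, and one set of register-file ports drive several ALU lanes at once. A minimal sketch with SSE2 intrinsics (the 4-lane width here is just illustrative):

```c
#include <stdio.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* One instruction, four integer additions: the adders themselves are cheap,
 * so once decode, scheduling, and register access for one op are paid for,
 * widening the datapath is nearly free by comparison. */
int main(void)
{
    __m128i a = _mm_set_epi32(4, 3, 2, 1);
    __m128i b = _mm_set_epi32(40, 30, 20, 10);
    __m128i sum = _mm_add_epi32(a, b);   /* a single PADDD feeds four adders */

    int out[4];
    _mm_storeu_si128((__m128i *)out, sum);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* 11 22 33 44 */
    return 0;
}
```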
Well, obviously the register files would have to be unified, or at least close together, for reverse Hyper-Threading to work. Note that the FlexFP unit is fully shared, and Intel's Hyper-Threading shares everything, so it's clearly feasible.
The FlexFP is 4-wide, like the individual integer clusters. And as for putting the reg files close together, I already covered the cost of that above.
If the reg files are unified, there is no sense in having separate clusters at all. That gets you the Intel design: one beefy reg file feeding one beefy execution cluster with 6 units (one of which is the memory data write). The only problem is, AMD isn't Intel. AMD doesn't have the process tech or the design resources Intel has; if AMD made a direct copy of the Intel design, it would run much slower. Intel has a huge manufacturing lead, even at the same nominal process node, and it spends most of that lead on having the wider execution cluster.
I'm not sure it would. The L3 is a waste for non-server loads, and T-RAM's access speeds are too slow for the L1 and L2.
AMD's cache hierarchy and interconnect just aren't all that much better than what preceded them, and that hasn't been all that good for years.
The problem with present AMD designs isn't the L3. It's the L2/L1. On write-heavy loads the BD is just sad.
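If you want to see it for yourself, a store-heavy loop like the rough sketch below is the kind of thing that exposes it (buffer size and repeat count are arbitrary choices of mine); BD's write-through L1D funnels stores through a small write-coalescing cache, so sustained store bandwidth is where it hurts.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Rough store-bandwidth sketch. Buffer size and repeat count are arbitrary;
 * a real test would pin the thread, use a better timer, and guard harder
 * against the compiler optimizing the store loop away. */
int main(void)
{
    size_t words = (32 * 1024) / sizeof(uint64_t);   /* ~32 KB, fits in L1D */
    int repeats = 100000;
    uint64_t *buf = malloc(words * sizeof(uint64_t));
    if (!buf) return 1;

    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        for (size_t i = 0; i < words; i++)
            buf[i] = (uint64_t)r + i;                /* pure store traffic */
    clock_t end = clock();

    uint64_t check = 0;         /* touch the data so the stores stay observable */
    for (size_t i = 0; i < words; i++) check += buf[i];

    double secs = (double)(end - start) / CLOCKS_PER_SEC;
    printf("~%.2f GB/s of stores (checksum %llu)\n",
           (double)words * sizeof(uint64_t) * repeats / secs / 1e9,
           (unsigned long long)check);
    free(buf);
    return 0;
}
```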
That's one thing I wish they would try to copy from Intel. What they need is small, fast L1 and L2 caches, backed by a large, partitioned L3 with a lot of bandwidth.
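The appeal of that layout is easy to see with a back-of-the-envelope average-memory-access-time calculation; every latency and hit rate below is an invented round number, just to show the shape of the trade-off:

```c
#include <stdio.h>

/* AMAT sketch: small/fast L1 and L2 keep the common case cheap, so the big L3
 * behind them can afford to be slower. All numbers are made up for illustration. */
int main(void)
{
    /* assumed total latency, in cycles, for an access serviced at each level */
    double l1 = 4, l2 = 12, l3 = 35, dram = 200;
    /* assumed fraction of accesses serviced at each level */
    double h1 = 0.90, h2 = 0.06, h3 = 0.03, hm = 0.01;

    double amat = h1 * l1 + h2 * l2 + h3 * l3 + hm * dram;
    printf("AMAT ~ %.1f cycles\n", amat);
    /* ~7.4 cycles: the L3 only matters for the few percent of accesses that
     * fall out of L1/L2, which is why it can trade latency for size. */
    return 0;
}
```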
But can they actually build that?