I disagree on the dark silicon issue, as I already posted. We already have a way to profitably use every last transistor (even if with diminishing returns); we're not going to start reducing cache to plop in extra FPUs that will sit idle most of the time.
Why settle for diminishing returns already? That can always serve as a last resort. Intel has been able to keep the L3 cache for four cores at 8 MB for three process nodes now, using the growing transistor budget for better purposes. I'd rather know that each core is concentrating on one task, achieving maximum performance/Watt, than ensure full utilization of a few units while leaving performance on the table.
There are three basic kinds of workloads to deal with: high ILP, high TLP, and high DLP. For high ILP, you want four scalar execution ports, like Haswell has. For high TLP, you want to share those between two threads. And for high DLP, you just want wide SIMD units, and lots of them, running at a lower frequency, hiding latency with long-running instructions while maximizing data locality.
The architecture I'm proposing here is designed for each of these, and any mix of them. Basically, CPUs like Haswell are already very good at ILP and TLP, but they need GPU-like SIMD units, without the loss of data locality that comes from running lots of threads, and without the other overhead associated with that.
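To make the ILP/DLP distinction concrete, here's a minimal C sketch (the function names are mine): the linked-list walk resists vectorization but feeds several independent scalar operations to a wide out-of-order core, while the axpy loop is pure data parallelism that belongs on wide SIMD units. Running many independent copies of either is the TLP case.

#include <limits.h>
#include <stddef.h>

struct node { long val; struct node *next; };

/* High ILP, low DLP: the pointer chase and data-dependent compare make this
   a poor fit for SIMD, but each node offers several independent scalar
   operations that four execution ports can work on in parallel. */
void walk(const struct node *p, long *sum, long *max, long *count) {
    long s = 0, m = LONG_MIN, c = 0;
    while (p) {
        s += p->val;
        if (p->val > m) m = p->val;
        c++;
        p = p->next;                 /* the serial part */
    }
    *sum = s; *max = m; *count = c;
}

/* High DLP: every iteration is independent, so this maps directly onto wide
   SIMD units and long-running vector instructions. */
void axpy(double *y, const double *x, double a, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}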
I still hold that the FPU is not the weak point. Every single FPU-heavy load I've ever run on the BD has been cache or memory throughput-limited, not compute-limited. Every last one. I have never managed to write anything that even approaches real work and still runs out of execution resources on BD. The caches are that bad.
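For what it's worth, a quick arithmetic-intensity estimate shows why that happens so easily (the bandwidth figure below is an illustrative assumption, not a Bulldozer measurement):

#include <stddef.h>

/* STREAM-style triad: 2 flops per iteration against 24 bytes of traffic
   (load b[i], load c[i], store a[i]). At an assumed ~15 GB/s of sustained
   bandwidth that caps it around 1.25 GFLOP/s, far below FPU peak, so adding
   FPU throughput changes nothing; the kernel stays bandwidth-limited. */
void triad(double *a, const double *b, const double *c, double s, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}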
I have no first-hand experience with Bulldozer, so I'm not doubting your findings at all. But improving the caches and lowering the latencies would still not make it a good step towards unified computing. To match my proposal, it would need 32 scalar cores. There's no use for that in the consumer market, and it would waste a lot of space (and I'm not talking about low utilization, I'm talking about a complete waste - even using it for more cache would have been better).
Of course the root of the problem is that AMD doesn't want unified computing at all. It wants small scalar cores that share an FPU for legacy purposes, and a big GPU to handle all throughput computing needs. That may sound good on paper, but it's fraught with heterogeneous computing issues. They're hoping developers will miraculously deal with those, while Intel is pampering developers with a better ROI proposition.
Not only is Linpack so unrepresentative of real work that it should never be brought up, but this also sidesteps my point entirely. My point wasn't "HT speeds up FPU-heavy loads"; it was "in mixed-load environments, HT increases efficiency". Run Linpack and 4 scalar threads at the same time, and you bet you're going to see very good gains from HT.
This overhead is cache pressure. There is no overhead to overcome in the actual execution parts. Dedicating FPU clusters does not help overcome the cache pressure overhead in any way, shape or form.
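The mixed-load experiment is easy to approximate, by the way: pin an FP-heavy thread and a scalar integer thread onto the two logical CPUs of one physical core and time it against running them back to back. A rough Linux/pthreads sketch; the kernels are simple stand-ins rather than Linpack, and the 0/4 sibling numbering is machine-specific:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static volatile double fp_sink;
static volatile unsigned long int_sink;

/* Stand-in FP-heavy kernel: a long chain of multiply-adds. */
static void *fp_heavy(void *arg) {
    (void)arg;
    double x = 1.000001;
    for (long i = 0; i < 400000000L; i++)
        x = x * 1.000000001 + 1e-9;
    fp_sink = x;
    return NULL;
}

/* Stand-in scalar integer kernel: shifts, xors and multiplies. */
static void *int_heavy(void *arg) {
    (void)arg;
    unsigned long h = 0x9e3779b9UL;
    for (long i = 0; i < 400000000L; i++)
        h = (h ^ (h >> 13)) * 31u + (unsigned long)i;
    int_sink = h;
    return NULL;
}

static void pin(pthread_t t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t, sizeof(set), &set);
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, fp_heavy, NULL);
    pthread_create(&b, NULL, int_heavy, NULL);
    pin(a, 0);   /* logical CPU 0...                                    */
    pin(b, 4);   /* ...and what is often its HT sibling on a quad-core;
                    check /proc/cpuinfo for the real mapping            */
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}

Time that (e.g. with `time`) and compare against the sum of running each kernel alone on the same core.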
Like I said, I mentioned Linpack only to illustrate that Hyper-Threading has an overhead. I'm not disputing that it increases utilization in mixed-load environments, but I do argue that it doesn't offer the best performance/Watt, precisely because of that inherent overhead.
Then why do I propose keeping Hyper-Threading for the scalar portion of the core? Because there's a tipping point where low utilization becomes a waste: having two threads share four scalar execution ports during high-TLP workloads is more power efficient than one thread using, on average, only a couple of them. Those four ports do still matter for increasing IPC in single-threaded workloads, though, due to Amdahl's law.
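To illustrate that last point with made-up numbers: Amdahl's law says the serial fraction caps the multi-threaded speedup, so the serial part, which only benefits from higher IPC, quickly becomes the limiter.

#include <stdio.h>

/* Amdahl's law: overall speedup on n cores when a fraction p of the work
   is parallelizable and the serial remainder runs at baseline speed. */
static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    /* Illustrative numbers: with 90% of the work parallel, 8 cores give
       only ~4.7x; the serial 10% is the limiter, and that part only gets
       faster through higher single-thread IPC. */
    printf("%.2f\n", amdahl(0.9, 8));
    return 0;
}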
So it's all about finding the right balance for each type of workload. I don't think SMT is optimal for the SIMD units. They suffer the most from cache pressure when the thread count is increased. AVX-1024 offers the necessary latency hiding qualities to increase utilization (the good kind that improves power efficiency) while lowering front-end and scheduling overhead.
I'm open to alternatives, but I really don't think Bulldozer sets a good example.
Eventually, but I think this will take a long time still.
Note that I don't necessarily think this is the best way to go (I haven't studied the problem long enough to pick a position), but I do think that Intel is institutionally predisposed to go this way, and thus it will happen.
I think it will happen regardless of Intel's desire for it. That predisposition, plus their process advantage, will just make it happen sooner than sheer necessity dictates. I have continuously underestimated how fast they would converge things, and I think that's saying something. If Skylake features 512-bit SIMD units, then we're looking at a 32-fold increase in peak computing power between a dual-core Westmere and an 8-core Skylake, in five years' time. That would probably put getting rid of the integrated GPU on the roadmap next.
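For reference, that 32-fold figure is just a peak-FLOP back-of-the-envelope, and the Skylake half of it is pure speculation on my part:

#include <stdio.h>

int main(void) {
    /* Peak single-precision flops per clock, under my own assumptions:
       Westmere core: 128-bit SSE, separate add and mul ports -> 4 + 4 = 8
       Skylake core, IF it gets two 512-bit FMA units -> 2 * 16 * 2 = 64 */
    int westmere = 2 /* cores */ * (4 + 4);
    int skylake  = 8 /* cores */ * (2 * 16 * 2);
    printf("%dx\n", skylake / westmere);   /* 32x */
    return 0;
}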