Thanks for those details. So the solution is more convergence. However, note that only part of the issues would be addressed by having nearby latency-optimized scalar cores handle task scheduling. You'd still have tasks that aren't parallel enough for the many threads that current compute cores demand.
Currently, the amount of parallelism needed is grossly inflated, because the huge setup costs have to be amortized.
Don't overstate what I agree with. Haswell's 2+1 load/store with three AGUs is a fact, and it's an improvement over all previous Intel architectures. But code doesn't consist exclusively of memory accesses. x86 became the dominant ISA despite its relatively low number of logical registers and despite Intel hanging on to 1+1 L/S for a long time. CPUs can easily schedule around any L/S contention, and Haswell is going to be even harder to bottleneck. GPUs, on the other hand, can easily become L/S bottlenecked, and they demand extra threads to work around temporary contention.
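As a rough illustration of that mix (a minimal sketch; the port assignment is the commonly documented Haswell one, not something taken from this thread): a scalar SAXPY loop needs 2 loads, 1 store and 1 FMA per element, which maps exactly onto the 2+1 L/S ports plus an FMA port, so the out-of-order core can sustain roughly one element per cycle without the memory operations tripping over each other.

    /* Minimal sketch: per element this does 2 loads, 1 FMA, 1 store.
     * On Haswell the two loads go to ports 2/3, the store data to port 4
     * (store address to port 7) and the FMA to port 0 or 1, so the loop
     * body fits the 2+1 L/S throughput with the arithmetic on separate
     * ports -- there's nothing to contend on in steady state. */
    void saxpy(float a, const float *x, float *y, int n)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];   /* load x[i], load y[i], FMA, store y[i] */
    }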
Read what I wrote! I just stated the basic fact that almost every architecture (GPUs as well as common x86 CPUs) will be limited by the throughput of its memory system if that is pegged. Nothing more. No need to argue about something obvious.
Sure, the L1 bandwidth per flop is quite a bit lower on GPUs, but they also enjoy a much larger register space, so they don't need to use the L1 as an extended register file the way x86 CPUs do. Depending on the code and conditions, that may be an advantage or a disadvantage for either side. That's how architectural differences play out. But by itself it's not a solid argument for anything.
First of all, with two load ports it never has to be moved far.
How does the number of load ports influence the memory latency?
I don't think so. You're simply going to run out of threads if the contention on your texture or L/S port is too high, even if just locally.
So you're telling me that a GCN CU easily runs out of its up to 40 wavefronts (each wavefront can have up to 16 reads and 8 writes pending)? And typical code also has a few arithmetic instructions in between. Even looking at just a single instruction buffer and vector ALU, a burst of memory accesses from one wavefront isn't going to stall the vALU/sALU/LDS instructions from the other wavefronts, let alone the vALU/sALU/LDS instructions from the wavefronts running on the neighboring instruction buffers and vALUs.
I've no idea how long the queue in front of the AGUs is (i.e. how many vector memory instructions can be issued before the issue port gets blocked; it becomes available again after a few cycles, once the AGUs have processed the next access in the queue), but in any case that would block exclusively the vmem issue port for a short time, not the vALU port, not the sALU port, not the branch port, not the LDS port, not the export (mem write) port, and certainly not the handling of the internal instructions (like synchronization, which are handled directly in the instruction buffer and don't need an issue port). That means all arithmetic or local memory instructions from other wavefronts continue to be issued. If you have such huge amounts of memory accesses that this starts to matter, you end up bandwidth limited anyway. But short bursts of memory accesses (say, 8 reads directly after each other with no arithmetic instructions in between) are usually handled quite well if that is just a part of the kernel. There is apparently no performance degradation from such a grouping.
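To put a rough number on how much latency that wavefront pool can cover (a back-of-the-envelope sketch; the 10-wavefronts-per-SIMD and 4-cycle vALU issue are the usual GCN figures, while the count of independent ALU ops per wavefront is my own assumption):

    /* Back-of-the-envelope latency hiding for one GCN SIMD.
     * Known GCN figures: up to 10 wavefronts resident per SIMD, and a
     * vALU instruction occupies the 16-lane SIMD for 4 cycles (64-wide
     * wavefront).  The independent-ops-per-wavefront count is assumed. */
    #include <stdio.h>

    int main(void)
    {
        const int waves_per_simd  = 10;  /* resident wavefronts per SIMD        */
        const int cycles_per_valu = 4;   /* 64 work-items over 16 lanes         */
        const int indep_valu_ops  = 4;   /* assumed independent ALU ops between */
                                         /* memory accesses, per wavefront      */

        /* While one wavefront sits on its burst of loads, the other nine
         * can keep the SIMD busy for roughly this many cycles: */
        int hidden = (waves_per_simd - 1) * indep_valu_ops * cycles_per_valu;
        printf("~%d cycles of memory latency covered\n", hidden);  /* ~144 */
        return 0;
    }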
My point was that CPUs can still execute independent arithmetic instructions in this situation.
So do GPUs.
What I'm really trying to get at is that CPU architectures are, on average, every bit as good at high-throughput workloads as the GPU. They just lack SIMD width.
It's as good in the sense that it can, of course, finish the task.
But it burns way more power for the same throughput. Having a 4 GHz CPU with fast caches and low latency is great for serial performance, but it's an unnecessary power burden for throughput tasks. You don't need an L1 hit to be served in a single nanosecond or an FMA completed in 1.3 ns. Having more leeway on the latency side (GCN does an SP FMA in ~4 ns, a DP FMA currently in ~16 ns latency) enables a more power-efficient design of the functional units. And that's before we even get to the simpler and therefore less power-hungry scheduling logic. It's obviously better to have a different set of lower-clocked but wider execution units next to the latency-optimized ones. And it's hard, if not impossible, to scale the frequency of a pipeline by a factor of 4 without running into inefficiencies.
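Those latency figures are just cycle counts divided by the clock (a minimal sketch; the 5-cycle Haswell FMA and the ~1 GHz GCN clock are assumed round numbers that reproduce the nanosecond values quoted above):

    /* Pipeline latency in nanoseconds = cycles / frequency (GHz).
     * The cycle counts and clocks below are assumed round numbers that
     * reproduce the latencies quoted above, not measurements. */
    #include <stdio.h>

    static double latency_ns(int cycles, double ghz) { return cycles / ghz; }

    int main(void)
    {
        printf("Haswell FMA:  %.2f ns  (5 cycles @ 4.0 GHz)\n", latency_ns(5, 4.0));
        printf("GCN SP FMA:   %.2f ns  (4 cycles @ 1.0 GHz)\n", latency_ns(4, 1.0));
        printf("GCN DP FMA:  %.2f ns  (16 cycles @ 1.0 GHz)\n", latency_ns(16, 1.0));
        return 0;
    }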
Just look at what happened to the serial performance of Larrabee/Knights Corner. That is a more or less throughput-optimized design which kept some latency optimizations (or rather, it didn't go as far as GPUs) to fare better at intermediate-sized tasks with dynamic and varying amounts and types of parallelism. Nevertheless, it clocks just above 1 GHz, and it also does an SP FMA in ~3.5 to slightly below 4 ns. Why does it have 62 cores @ 1.1 GHz and not 16 cores @ 3.5 GHz? Coincidence or a pattern?
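A rough rule of thumb makes that trade-off concrete (purely illustrative; the cubic scaling is the usual dynamic-power approximation, since supply voltage has to rise roughly with frequency, and the clocks are the ones mentioned above):

    /* Rule-of-thumb sketch: dynamic power ~ C * V^2 * f, and V scales
     * roughly with f, so per-core power grows roughly with the cube of
     * the clock.  Illustrative only -- real scaling curves are messier. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double f_low = 1.1, f_high = 3.5;                   /* GHz */
        double per_core_power_ratio = pow(f_high / f_low, 3.0);   /* ~32x */
        printf("~%.1fx the clock costs roughly %.0fx the per-core power\n",
               f_high / f_low, per_core_power_ratio);
        return 0;
    }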