I think we just need to calculate the worst case performance of both.
I.e. one thread, one AVX port, one core for Haswell.
That isn't an appropriate worst case. Technically, restricting Haswell to one AVX port would render the core inoperable, because no single port supports every AVX operation.
For a single thread, Haswell gets two 256-bit FMA units, 16 software-visible 256-bit registers, and 16 software-visible integer registers. There are 168 internal rename registers for each type.
The core operates at over 3 GHz.
There is a 32KB L1, 256KB L2, and I'll leave you to decide where in the 2-8MB range you want to pick for the L3.
Treating scalar regs as noise, it's 0.5KB of Vregs + 32KB + 256KB + 2-8MB, for one thread running at >3 GHz.
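A quick tally of that per-thread storage (a back-of-envelope sketch; the L3 slice is left as the 2-8MB range, and only the 16 architectural vector registers are counted):

```python
# Haswell per-thread storage tally
vregs_kb = 16 * 32 / 1024           # 16 x 256-bit YMM regs = 0.5 KB
l1_kb, l2_kb = 32, 256
l3_kb_range = (2 * 1024, 8 * 1024)  # 2-8 MB, depending on the part

lo = vregs_kb + l1_kb + l2_kb + l3_kb_range[0]
hi = vregs_kb + l1_kb + l2_kb + l3_kb_range[1]
print(f"{lo / 1024:.2f}-{hi / 1024:.2f} MB")  # roughly 2.28-8.28 MB per thread
```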
With two FMAs per clock, it is >96 GFLOPS.
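The peak-FLOPS arithmetic above can be sketched quickly (using the 3 GHz floor of the clock range, and counting an FMA as two FLOPs):

```python
# Haswell single-core fp32 peak: 2 FMA units x 8 lanes x 2 FLOPs/FMA x clock
fma_units = 2
lanes = 256 // 32    # fp32 elements per 256-bit register
flops_per_fma = 2    # multiply + add
clock_ghz = 3.0      # floor of the ">3 GHz" figure

gflops = fma_units * lanes * flops_per_fma * clock_ghz
print(gflops)  # 96.0, matching the ">96 GFLOPS" floor
```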
The L1 can sustain 64 bytes of reads and 32 bytes of writes per cycle, at full speed.
The L2 can supply it with 64 bytes per cycle.
The L3 ring stop can provide 32 bytes per cycle.
This is at over 3GHz.
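Converting those per-cycle figures into bandwidth at the 3 GHz floor (a sketch using the byte counts above):

```python
# Haswell per-level bandwidth at the 3 GHz floor
clock_ghz = 3.0
paths = {"L1 read": 64, "L1 write": 32, "L2 read": 64, "L3 ring stop": 32}

for name, bytes_per_cycle in paths.items():
    print(f"{name}: {bytes_per_cycle * clock_ghz:.0f} GB/s")
# L1 read: 192, L1 write: 96, L2 read: 192, L3 ring stop: 96 (all GB/s)
```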
And one wavefront (duplicated the needed number of times, all essentially working on the same data), one CU, one SIMD.
This is not a worst-case for the CU. The worst case is one wavefront.
Technically, the worst case for both the CPU and CU would be where only one SIMD lane is used, but I'll leave that one out because the CU would fall to 1/256 of its throughput, whereas Haswell drops to 1/16.
16 FMADDs per cycle at 800MHz is 25.6 GFLOPS.
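The 25.6 GFLOPS figure checks out, counting an FMADD as two FLOPs:

```python
# One GCN SIMD: 16 fp32 FMADDs per cycle at 800 MHz
simd_lanes = 16        # fp32 FMADDs issued per cycle
flops_per_fmadd = 2    # multiply + add
clock_ghz = 0.8

gflops = simd_lanes * flops_per_fmadd * clock_ghz
print(gflops)  # 25.6
```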
That aside, restricting things to one SIMD means a maximum of 10 wavefronts.
There are 512 scalar registers per SIMD. At most 103 scalar registers per wavefront.
At 10 wavefronts, that's ~51 each, although depending on the mode being operated in there are half as many, and some number are devoted to wavefront masks and the like.
There are 256 256-byte vector registers in total, if divvied up equally there are ~25 per wavefront.
16KB L1, 512KB L2.
(2KB + 64KB + 16KB + 512KB)/10 = ~60KB per thread (unless you count the 64 work items per wavefront as threads), running at 800MHz.
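The per-wavefront arithmetic, spelled out (a sketch using the figures above, with the register files and caches divided evenly across 10 wavefronts):

```python
wavefronts = 10
sgpr_bytes = 512 * 4      # 512 32-bit scalar regs per SIMD = 2 KB
vgpr_bytes = 256 * 256    # 256 vector regs x 256 bytes each = 64 KB
l1_bytes = 16 * 1024
l2_bytes = 512 * 1024

per_wavefront_kb = (sgpr_bytes + vgpr_bytes + l1_bytes + l2_bytes) / wavefronts / 1024
print(f"{per_wavefront_kb:.1f} KB")          # 59.4 KB, i.e. ~60 KB per wavefront
print(512 // wavefronts, 256 // wavefronts)  # ~51 scalar and ~25 vector regs each
```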
Restricting the CU to one SIMD has one further impact: only one vector memory operation can begin every 4 cycles. This may or may not hurt the CU's bandwidth, since I'm not clear on how the memory pipeline overlays instruction issue. With a single SIMD, there's no way to start a write on the next cycle, as there would be with multiple SIMDs.
64 bytes per cycle read from the L1.
64 bytes per cycle from the L2.
This is at 800MHz.
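And at 800 MHz, those CU cache figures come to (back-of-envelope):

```python
# CU cache bandwidth at 800 MHz
clock_ghz = 0.8
l1_bytes_per_cycle = 64
l2_bytes_per_cycle = 64

print(l1_bytes_per_cycle * clock_ghz)  # 51.2 GB/s from the L1
print(l2_bytes_per_cycle * clock_ghz)  # 51.2 GB/s from the L2
```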
Then just multiply that by the number of such low-performance tasks each chip can run in parallel.
Better make sure you don't hamstring the CU any further. I might just find some code that can do otherwise parallel chunks serially.