And now there are 4x the number of threads running in parallel.
It was your example, not mine.
My argument also holds if you change the ratio.
Given any fixed alu:mem ratio, you can't get away with fewer threads unless you reduce aggregate ALU throughput or absolute memory latency. GCN has the same aggregate instruction throughput per CU as Cayman has per SIMD. You might get some opportunities for register reuse going from VLIW to scalar, but that's not the general case.
GCN has 10 threads per SIMD. That's enough to hide 400 cycles of memory latency per SIMD given a 10:1 alu:mem ratio. How do you maintain that level of latency hiding with fewer threads without increasing the alu:mem ratio or slowing down the ALUs (using narrower SIMD width)?
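To make the arithmetic explicit, here is a rough back-of-the-envelope sketch in Python. It assumes each ALU instruction occupies the SIMD for 4 cycles (a 64-wide wavefront issued over a 16-lane SIMD), which is how I read the 400-cycle figure: 10 threads x 10 ALU instructions x 4 cycles. The function names are made up for illustration, and the model ignores issue overhead, cache hits and anything else that would muddy the numbers.

```python
def latency_hidden(threads, alu_per_mem, cycles_per_alu_instr):
    """Cycles of memory latency covered by ALU work across all resident threads.

    Simplified model: every thread issues `alu_per_mem` ALU instructions per
    memory access, and each ALU instruction occupies the SIMD for
    `cycles_per_alu_instr` cycles.
    """
    return threads * alu_per_mem * cycles_per_alu_instr


def threads_needed(latency, alu_per_mem, cycles_per_alu_instr):
    """Minimum resident threads needed to cover `latency` cycles (rounded up)."""
    work_per_thread = alu_per_mem * cycles_per_alu_instr
    return -(-latency // work_per_thread)  # ceiling division


# Assumed GCN-like numbers: 10 threads, 10:1 alu:mem, 4 cycles per instruction.
print(latency_hidden(10, 10, 4))   # 400

# Fewer threads only work if something else changes:
print(threads_needed(400, 20, 4))  # higher alu:mem ratio (20:1)
print(threads_needed(400, 10, 8))  # slower ALUs (narrower SIMD, 8 cycles per instruction)
print(threads_needed(200, 10, 4))  # lower absolute memory latency
```

In this model the thread count only drops when the alu:mem ratio goes up, the cycles per ALU instruction go up, or the latency itself comes down, which is the point of the question above.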