If you want to take full advantage of a GPU, you give it a lot of embarrassingly parallel work. On the compute shader queue, work is divided into blocks of threads, and each CU/SM can only hold so many blocks at a time. So more CUs = more blocks that can be issued at once. I typically work with NVIDIA on the compute side, where IIRC a block can have up to 1024 threads and each SM can handle a couple of blocks. Because of the way the threads are serialized and the shared memory between the CUDA cores, you can group work that shares data and get extremely good utilization out of your ALUs. Huge amounts, really.
Cerny said something along the lines of filling up and fully exploiting the CUs rather than spreading work across many and leaving them underutilized.
It seems like the clock boost is supposed to help in that regard. Maybe he thinks the CU count combined with that clock boost will push performance further, so the gap will not be as big as we expect?
Also, I wonder how CU count scales, considering how many GPUs are out there with different CU counts. Games won't be fully optimized for the ones with the highest count, but for an average of what's out there. Unless the games are intelligent enough to scale efficiently and properly.
So in this case, having more CUs is a much greater advantage than having a high clock speed, because more work can be done in parallel, and latency is handled by the amount of thread switching a CU can do. The CUs all have to wait for memory to deliver the next piece of work, so tearing through your compute jobs faster doesn't necessarily improve performance. Given how latent memory can be, the ideal is a large number of cores holding a lot of threads, so they stay saturated while they wait for the next bit of memory to arrive. High parallelism thrives on maximum throughput, provided you've got the bandwidth to feed it. The more work you give it, the more gets done in parallel and the less it stalls. Work done per unit of time is going to be higher with more cores if they are being fully utilized, not to mention significantly more energy efficient.
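To put rough numbers on the thread-switching point, here's a back-of-the-envelope sketch in Python. It's a toy model, not how any real scheduler works: I'm assuming every warp does a fixed number of cycles of ALU work and then stalls on memory for a fixed latency. The function names and the 400-cycle / 20-cycle figures are made up for illustration.

```python
import math


def warps_needed_to_hide_latency(memory_latency_cycles, work_cycles_per_load):
    """Warps needed in flight so that one is always ready to issue.

    Toy assumption: each warp alternates between `work_cycles_per_load`
    cycles of ALU work and a stall of `memory_latency_cycles`.
    """
    # One warp is working; the others have to cover its stall with their own work.
    return 1 + math.ceil(memory_latency_cycles / work_cycles_per_load)


def alu_utilization(resident_warps, memory_latency_cycles, work_cycles_per_load):
    """Fraction of cycles the ALUs are busy under the same toy model."""
    # Each warp is busy for work / (work + latency) of its own timeline.
    period = work_cycles_per_load + memory_latency_cycles
    return min(1.0, resident_warps * work_cycles_per_load / period)


if __name__ == "__main__":
    latency, work = 400, 20  # hypothetical numbers
    print("warps needed:", warps_needed_to_hide_latency(latency, work))
    for warps in (8, 16, 32):
        print(warps, "resident warps ->", round(alu_utilization(warps, latency, work), 2))
```

In this model a faster clock doesn't shrink the stall in wall-clock terms, so the only way to stay busy is to have more warps resident, which is the argument for width over clock.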
tl;dr: programmers don't need to account for scaling to more CUs. The bandwidth needs to scale with the number of CUs. Programmers need to make sure they code in a way that keeps those CUs well fed: be clever about synchronizing threads, etc.
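Here's the bandwidth point as a quick roofline-style estimate, again just a sketch. The assumptions are mine: 128 FLOPs per CU per clock (64 lanes doing a fused multiply-add each), and a workload described by a single "FLOPs per byte fetched" number. The CU counts, clocks and bandwidths below are hypothetical and not any real console or card.

```python
def attainable_tflops(cu_count, clock_ghz, bandwidth_gbs, flops_per_byte,
                      flops_per_cu_per_clock=128):
    """Roofline-style bound: the lower of the compute peak and the memory peak.

    `flops_per_cu_per_clock=128` is an assumption (64 lanes x 2 for FMA).
    """
    compute_peak_gflops = cu_count * flops_per_cu_per_clock * clock_ghz
    memory_peak_gflops = bandwidth_gbs * flops_per_byte
    return min(compute_peak_gflops, memory_peak_gflops) / 1000.0


if __name__ == "__main__":
    # Hypothetical workload doing 16 FLOPs for every byte it pulls from memory.
    base = attainable_tflops(40, 2.0, 400, 16)
    more_cus = attainable_tflops(80, 2.0, 400, 16)
    more_cus_and_bw = attainable_tflops(80, 2.0, 800, 16)
    print(base, more_cus, more_cus_and_bw)
```

With these made-up numbers, doubling the CUs alone changes nothing because the workload is already memory bound at 6.4 TFLOPS. Doubling the bandwidth along with the CUs lifts it to 12.8.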