Things don't really work that way. Pretty much no actual real-life task relies on loading new data for each instruction processed. There are matrix multiplication benchmarks and things of that sort, but you're not going to find many games doing a whole lot of that in the middle of rendering a frame...
Pixel shader programs can run upwards of 1000 instructions in some cases today IIRC, and certainly dozens for anything more than trivial stuff, so caching and chewing over the same data set over and over definitely comes into play. In most actual real-world situations, 18 CUs will whoop up on 12 CUs, even if the latter has a significant bandwidth advantage.
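
If it helps to see the reasoning with numbers, here's a quick roofline-style sketch in Python. Everything in it is made up for illustration: the function name `attainable_rate`, the per-CU throughput, and the bandwidth figures are hypothetical, unitless values and not specs of any real GPU. The only idea it shows is that throughput is capped by whichever is lower, raw compute or bandwidth times the amount of work done per byte fetched.

```python
def attainable_rate(cus, ops_per_cu, bandwidth, ops_per_byte):
    """Roofline-style estimate: the lower of the compute and memory limits."""
    compute_roof = cus * ops_per_cu          # limit if the data is always ready
    memory_roof = bandwidth * ops_per_byte   # limit if we wait on memory
    return min(compute_roof, memory_roof)


# Hypothetical numbers, chosen only to give the 12 CU part more bandwidth.
big = dict(cus=18, ops_per_cu=100.0, bandwidth=150.0)
small = dict(cus=12, ops_per_cu=100.0, bandwidth=200.0)

# ops_per_byte = how much work a shader does per byte it pulls from memory.
for ops_per_byte in (1, 4, 8, 16, 64):
    a = attainable_rate(ops_per_byte=ops_per_byte, **big)
    b = attainable_rate(ops_per_byte=ops_per_byte, **small)
    print(f"{ops_per_byte:>3} ops/byte: 18 CUs -> {a:7.1f}   12 CUs -> {b:7.1f}")
```

With these made-up figures the 12 CU part only comes out ahead below 8 ops per byte, which is the "new data for every instruction" case. Once a shader reuses cached data and does more work per byte, both parts hit their compute limit and the extra CUs win.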