It is quite possible for a shader engine to export at a sustained rate lower than what the ROPs can process. An overclocking test isn't going to make that distinction. A mix of wavefronts with sufficient ALU, TEX, or other resource consumption can lead to gaps in the utilization of the export bus and ROPs, as occupancy limits would prevent the launch of additional wavefronts with their own export requirements.
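To put made-up numbers on that (purely a back-of-envelope; every figure below is an assumption, not a measurement): if each wavefront only exports once near the end of a long ALU/TEX-heavy lifetime, and occupancy caps how many of those lifetimes overlap, the export bus spends most of its cycles idle.

```python
# Back-of-envelope only: how occupancy limits can leave the export bus idle.
# Every number below is an illustrative assumption, not a measured figure.

WAVEFRONT_PIXELS    = 64     # pixels shaded per wavefront
BYTES_PER_PIXEL     = 4      # assuming a single RGBA8 colour export
EXPORT_BUS_BPC      = 64     # assumed bytes/clock the export path and ROPs could accept
WAVE_LIFETIME_CLKS  = 400    # assumed ALU/TEX/latency cycles before a wave exports
MAX_WAVES_IN_FLIGHT = 40     # assumed occupancy cap for this wavefront mix

# Each wave delivers its pixels once per lifetime, so the sustained rate is
# (waves in flight * bytes per wave) / lifetime, regardless of peak bus width.
bytes_per_wave = WAVEFRONT_PIXELS * BYTES_PER_PIXEL
sustained_bpc  = MAX_WAVES_IN_FLIGHT * bytes_per_wave / WAVE_LIFETIME_CLKS

print(f"sustained export {sustained_bpc:.1f} B/clk vs "
      f"{EXPORT_BUS_BPC} B/clk capacity "
      f"({100 * sustained_bpc / EXPORT_BUS_BPC:.0f}% utilised)")
```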
Increasing the number of wavefronts that can be kept in flight can bring the contended export path back into play, particularly since the bandwidth constraint has been significantly relaxed.
Some amount of buffering has to be done on the far end of the export process, and the ROPs have a tile import-filter-export loop that is intimately tied to how well memory or the compression subsystem can move those tiles. So even if the ROPs themselves are not on a memory controller's clock domain, what they plug into is.
Fundamentally, I agree with you.
Whether the ROPs clock with the shader clocks or are fixed, overclocking the shaders is going to expose a limit somewhere in the render back end, whether that's actual fillrate, the delta colour compression system (if, like NVidia's, it is variable in throughput), buffer cache non-reuse, or MC queueing.
Unfortunately, without a lot more data there's no way to tell how all these subsystems' bottlenecks interact and which of them are actually causing the substantially less-than-linear scaling with the overclock.
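The shape of the problem is easy to show even without knowing which subsystem owns the non-scaling part. A crude Amdahl-style split is enough (the 60% core-bound fraction below is an assumption for illustration, not data):

```python
# Toy Amdahl-style model of sub-linear scaling with a core overclock: a
# fraction of frame time scales with the shader clock and the rest is bound
# by parts that don't (memory, a fixed-clock back end, queueing). The 60%
# core-bound split is an assumption for illustration, not data.

def speedup(core_mult, core_bound_fraction=0.6):
    t = core_bound_fraction / core_mult + (1.0 - core_bound_fraction)
    return 1.0 / t

for oc in (1.05, 1.10, 1.20):
    print(f"+{(oc - 1) * 100:2.0f}% core clock -> "
          f"+{(speedup(oc) - 1) * 100:4.1f}% performance")
```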
One of the curiosities of coin mining is that multiple subsystems are "beating" against each other, such that optimal performance can be had by setting the memory clock to a precise, irrational multiplier of the core clock, which is not necessarily the fastest memory clock attainable (and, depending on the mining algorithm, can be vastly lower than the maximum memory clock). And there aren't any ROPs involved, just kernel execution, export, cache, MC and GDDR. Oh, and the potential for temperature- or current-induced throttling.
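A toy way to picture the "beat" (with invented numbers, not a model of any real card or kernel): treat a dependent memory round trip as a fixed count of memory cycles that the core can only observe rounded up to whole core cycles. Throughput then moves in steps with the clock ratio, and past a certain ratio a faster memory clock buys nothing:

```python
# Toy illustration of two clock domains "beating": a dependent memory round
# trip costs a fixed number of memory cycles, but the core only ever sees it
# rounded up to whole core cycles, so throughput moves in steps as the clock
# ratio changes. All numbers are invented to show the shape of the effect,
# not to model any real GPU or mining kernel.

import math

CORE_MHZ        = 1000
MEM_LATENCY_CLK = 200    # assumed dependent-access latency, in memory cycles
COMPUTE_CLK     = 150    # assumed core cycles of hashing per memory access

def hashes_per_us(mem_mhz):
    latency_core = math.ceil(MEM_LATENCY_CLK * CORE_MHZ / mem_mhz)
    loop_cycles = max(COMPUTE_CLK, latency_core)  # latency either hides behind compute or doesn't
    return CORE_MHZ / loop_cycles

for mem_mhz in (1240, 1250, 1300, 1333, 1340, 1400):
    print(f"mem {mem_mhz} MHz ({mem_mhz / CORE_MHZ:.3f}x core): "
          f"{hashes_per_us(mem_mhz):.2f} hashes/us")
```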
And each card will have different optima for clocks, voltages, issue granularity...
So, erm, I'm afraid to say, it's practically impossible to characterise formally.
MRTs? The rasterizer isn't going to guess how many outbound values there could be. Does it know what format or filtering demands may be placed on the ROPs after coverage has been determined?
MRTs are just extra bytes written per work item, the same as writing fatter pixels, so I'm unsure what you're alluding to. The export rate is measured in bytes per clock, which matches the baseline rate of the ROPs for the simplest pixels. At least, that's been the case for ages in AMD/ATI.
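To make "bytes per clock" concrete (the 64 B/clk per engine and 16 ROPs per engine below are assumed figures for the example, not quoted specs): fatter pixels and extra MRTs simply divide the same byte budget, so the simplest format is the one case where the export rate and the ROP rate line up exactly.

```python
# Illustrative arithmetic only: how an export rate in bytes per clock maps to
# pixel rate for different render-target setups. The 64 B/clk export width
# and 16 ROPs per engine are assumed figures for the example, not a spec.

EXPORT_BYTES_PER_CLK = 64   # assumed export bandwidth of one shader engine
ROPS_PER_ENGINE      = 16   # assumed ROPs fed by that engine, one pixel/clock each

targets = {
    "1x RGBA8 (4 B/px)":      4,
    "1x RGBA16F (8 B/px)":    8,
    "2 MRTs RGBA8 (8 B/px)":  8,
    "4 MRTs RGBA8 (16 B/px)": 16,
}

for name, bytes_per_px in targets.items():
    px_per_clk = EXPORT_BYTES_PER_CLK / bytes_per_px
    limiter = "export-bound" if px_per_clk < ROPS_PER_ENGINE else "ROP-bound"
    print(f"{name:24s}: {px_per_clk:5.1f} px/clk ({limiter})")
```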
I'm certainly not trying to excuse the observed scaling. Even though AMD ditched fast double precision, they still couldn't fit in more ROPs. If Tonga is any guide, they spent a hell of a lot of transistors on delta colour compression instead of adding more ROPs.
I wonder if AMD's delta colour compression algorithm acts as a multiplier on certain kinds of fillrate.
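If the back end is bandwidth-bound rather than ROP-bound, compression that shrinks the bytes actually moved does behave like a multiplier for the cases it compresses well. A rough sketch with invented numbers:

```python
# Rough sketch with invented numbers: when the back end is bandwidth-bound
# rather than ROP-bound, anything that shrinks the bytes actually written
# looks like a fillrate multiplier. The bandwidth, ROP rate and compression
# ratios below are assumptions for illustration only.

ROP_RATE_GPIX = 67.2     # e.g. 64 ROPs at 1.05 GHz, if the ROPs were the only limit
MEM_BW_GBPS   = 320.0    # assumed bandwidth available to the back end
BYTES_PER_PX  = 8        # RGBA8 with blending: 4 B written + 4 B read back

def effective_fillrate(compression_ratio):
    # compression_ratio = uncompressed bytes / bytes actually moved
    bw_limited = MEM_BW_GBPS * compression_ratio / BYTES_PER_PX
    return min(ROP_RATE_GPIX, bw_limited)

for label, ratio in (("no compression", 1.0), ("modest DCC", 1.4), ("ideal DCC", 2.0)):
    print(f"{label:15s}: {effective_fillrate(ratio):5.1f} Gpix/s effective")
```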
Perhaps AMD is doing more buffering and sorting after export to feed the delta colour compression algorithm (or it is a fully integrated buffer-and-sort stage for delta colour compression after export). It might be a deliberately long pipeline with some kind of fancy feed-forward that means the effective fillrate averages higher than the nominal rate we associate with ROPs.
A configuration like this could help with small triangles, as the proportion of pixel quads with fewer than four shaded pixels increases substantially.
I just have no idea what's going on.