Because the performance gain going from a middling CPU to a very powerful one is often negligible. You can blame PCIe or system bandwidth but there have been enough tests showing that higher PCIe or system bandwidth is even less relevant to game benchmarks. So what are the other potential culprits if not the GPU itself?
Bandwidth and latency are not the same thing, and systemic PCI Express latencies are real - see any GPGPU effort for the work-arounds required to minimise their impact.
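To make the latency point concrete, here's a minimal sketch of the classic GPGPU workaround - coalesce lots of tiny uploads into one transfer from pinned memory so the round-trip cost is paid once per batch. The CUDA runtime host API is used purely as an illustration; the buffer names and sizes are invented:

```cpp
// Hedged sketch, not from the thread: the usual GPGPU workaround for per-transfer
// PCIe latency is to pack many small uploads into pinned (page-locked) memory and
// send them as one copy, so the round-trip overhead is paid once per batch rather
// than once per object. All names and sizes here are made up.
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

int main()
{
    const int    kObjects   = 1024;
    const size_t kBytesEach = 64;        // e.g. one small matrix per object
    const size_t kTotal     = kObjects * kBytesEach;

    char* staging = nullptr;             // pinned host memory: needed for fast DMA
    cudaMallocHost(reinterpret_cast<void**>(&staging), kTotal);

    void* deviceBuf = nullptr;
    cudaMalloc(&deviceBuf, kTotal);

    std::vector<char> perObject(kBytesEach, 0);

    // Naive version (commented out): 1024 tiny copies, each paying the full
    // submission + PCIe round-trip latency.
    // for (int i = 0; i < kObjects; ++i)
    //     cudaMemcpy(static_cast<char*>(deviceBuf) + i * kBytesEach,
    //                perObject.data(), kBytesEach, cudaMemcpyHostToDevice);

    // Workaround: assemble the whole batch on the CPU, then issue one transfer
    // so the latency is amortised across all 1024 objects.
    for (int i = 0; i < kObjects; ++i)
        std::memcpy(staging + i * kBytesEach, perObject.data(), kBytesEach);
    cudaMemcpy(deviceBuf, staging, kTotal, cudaMemcpyHostToDevice);

    cudaFree(deviceBuf);
    cudaFreeHost(staging);
    return 0;
}
```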
Additionally, the API has some fairly fundamental, coarse granularities in it. As I said to PeterT earlier, this is why D3D10 implements finer-grained updates of state (e.g. constant buffers) and why D3D11 allows multi-threaded construction of state.
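To put some flesh on that, here's a bare-bones sketch of the D3D11 side of both ideas: a small dynamic constant buffer you can update on its own, and a deferred context recording state into a command list on a worker thread for the immediate context to replay. Purely illustrative - error handling stripped, no real draw calls, nothing lifted from an actual engine:

```cpp
// Hedged sketch of D3D11's finer-grained state update (per-object constant
// buffer) and multi-threaded state construction (deferred context + command
// list). Error handling omitted for brevity.
#include <d3d11.h>
#include <cstring>
#pragma comment(lib, "d3d11.lib")

int main()
{
    ID3D11Device*        device    = nullptr;
    ID3D11DeviceContext* immediate = nullptr;
    D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                      nullptr, 0, D3D11_SDK_VERSION,
                      &device, nullptr, &immediate);

    // A small per-object constant buffer, updated on its own rather than as
    // part of one monolithic DX9-style constant block.
    float perObject[16] = {};                       // e.g. a 4x4 matrix
    D3D11_BUFFER_DESC bd = {};
    bd.ByteWidth      = sizeof(perObject);
    bd.Usage          = D3D11_USAGE_DYNAMIC;
    bd.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
    bd.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
    ID3D11Buffer* cb = nullptr;
    device->CreateBuffer(&bd, nullptr, &cb);

    // In a real engine this part runs on a worker thread, one deferred
    // context per thread.
    ID3D11DeviceContext* deferred = nullptr;
    device->CreateeDeferredContext == nullptr;      // (placeholder removed below)
    device->CreateDeferredContext(0, &deferred);

    D3D11_MAPPED_SUBRESOURCE mapped = {};
    deferred->Map(cb, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
    std::memcpy(mapped.pData, perObject, sizeof(perObject));
    deferred->Unmap(cb, 0);
    deferred->VSSetConstantBuffers(0, 1, &cb);
    // ...record draw calls here...

    ID3D11CommandList* cmdList = nullptr;
    deferred->FinishCommandList(FALSE, &cmdList);

    // Back on the render thread: replay the pre-built command list.
    immediate->ExecuteCommandList(cmdList, FALSE);

    cmdList->Release(); deferred->Release(); cb->Release();
    immediate->Release(); device->Release();
    return 0;
}
```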
All of these things conspire against old graphics engines built on out-of-date techniques atop the DX9 API. The efficiency gains in D3D11 are enough that it's worth running the game/drivers in D3D11 mode even when the hardware is only capable of DX9.
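For anyone unfamiliar with how that works in practice, the mechanism is presumably D3D11's feature levels: the application creates a D3D11 device and simply accepts a 9_x level when that's all the hardware supports. A minimal sketch:

```cpp
// Hedged sketch, not from any particular game: D3D11 "feature levels" let an
// app code against the D3D11 API even on DX9-class hardware, by accepting a
// 9_x level at device creation. Error handling trimmed.
#include <d3d11.h>
#pragma comment(lib, "d3d11.lib")

int main()
{
    // Ask for the best the hardware offers, falling back through the 9_x levels.
    const D3D_FEATURE_LEVEL wanted[] = {
        D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_10_1, D3D_FEATURE_LEVEL_10_0,
        D3D_FEATURE_LEVEL_9_3,  D3D_FEATURE_LEVEL_9_2,  D3D_FEATURE_LEVEL_9_1
    };

    ID3D11Device*        device   = nullptr;
    ID3D11DeviceContext* context  = nullptr;
    D3D_FEATURE_LEVEL    obtained = D3D_FEATURE_LEVEL_9_1;

    D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                      wanted, sizeof(wanted) / sizeof(wanted[0]),
                      D3D11_SDK_VERSION, &device, &obtained, &context);

    // Even if 'obtained' comes back as a 9_x level, the app is still coding
    // against the D3D11 API rather than the old DX9 one.

    if (context) context->Release();
    if (device)  device->Release();
    return 0;
}
```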
Well, there are micro-benchmarks that target specific functions. The problem is that nobody really picks a game apart to see what it's doing within each frame. What we need is a combination of PIX and NVPerfHud (or AMD's equivalent).
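Failing a full PIX/NVPerfHUD capture, even bracketing individual passes with GPU timestamp queries would tell you where the frame time actually goes. A rough sketch (assumes you already have a device/context; 'work' stands in for whichever pass you're measuring):

```cpp
// Rough sketch, not from the thread: time a span of GPU work with D3D11
// timestamp queries. 'work' is a placeholder for whatever pass you want to
// measure (shadow maps, post-processing, etc.).
#include <d3d11.h>
#include <functional>

double GpuMilliseconds(ID3D11Device* device, ID3D11DeviceContext* ctx,
                       const std::function<void()>& work)
{
    D3D11_QUERY_DESC disjointDesc = { D3D11_QUERY_TIMESTAMP_DISJOINT, 0 };
    D3D11_QUERY_DESC tsDesc       = { D3D11_QUERY_TIMESTAMP, 0 };
    ID3D11Query *disjoint = nullptr, *tsBegin = nullptr, *tsEnd = nullptr;
    device->CreateQuery(&disjointDesc, &disjoint);
    device->CreateQuery(&tsDesc, &tsBegin);
    device->CreateQuery(&tsDesc, &tsEnd);

    ctx->Begin(disjoint);
    ctx->End(tsBegin);          // timestamp before the pass
    work();                     // the GPU work being measured
    ctx->End(tsEnd);            // timestamp after the pass
    ctx->End(disjoint);

    // Spin until the results land (a real engine would poll a frame or two later).
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT clock = {};
    while (ctx->GetData(disjoint, &clock, sizeof(clock), 0) != S_OK) {}
    UINT64 begin = 0, end = 0;
    while (ctx->GetData(tsBegin, &begin, sizeof(begin), 0) != S_OK) {}
    while (ctx->GetData(tsEnd,   &end,   sizeof(end),   0) != S_OK) {}

    double ms = clock.Disjoint ? 0.0
                               : double(end - begin) * 1000.0 / double(clock.Frequency);
    disjoint->Release(); tsBegin->Release(); tsEnd->Release();
    return ms;
}
```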
We still don't even have a decent answer for why tessellation in Heaven 1.0 is so slow on ATI. Somehow I think we'll be waiting a long time.
The game can be scaling poorly because it is bandwidth or setup bound. Does that mean the game is inherently not scalable or does it simply mean there wasn't an adequate increase in bandwidth or setup in proportion with other things?
Sure, that would be a hardware limitation. It might be a fairly noticeable bandwidth-efficiency limitation (see the 8xMSAA performance in GT200) or it might simply be not enough bandwidth. Or setup rate. Just have to prove that the specific game is sensitive in that respect.
Of course review sites that even bother to activate 8xMSAA or adaptive/transparency MSAA are pathetically few.
If you're going to make a reasoned comparison of the gains in a replacement GPU, you've gotta take the mix into account: unit counts + bandwidth + serialisations. I don't expect to see many existing games benefit from the dramatically higher setup rate in GF100 - but the architecture's more finely-grained rasterisation (which is a result of the parallel setup architecture) may mean that those same games see "better than expected" scaling on GF100. If that proves to be the case, then it's another parameter to investigate in Cypress scaling - though it seems unlikely there'll be much of that done, either.
That may not turn out to be due to finely-grained rasterisation. It might be to do with the way hardware threads are launched, etc.
At work I'm not allowed to blame the workload if my design isn't adequate. That applies here as well.
Why's that relevant? It's quite clear that a lot of game developers don't have the time/resources to implement a state-of-the-art, scalable and efficient engine. And if the API stands in the way...
The bottom line is you have to prove your test is evaluating what you say it is, before you can say that the test indicates X about the test subject.
I'm certainly not excluding the possibility of end-of-an-architecture problems in Cypress, where scaling has hit the end-stops due to something. I suspect some fixed-function stuff is out of its depth, but I say that because of the poor tessellation performance, not because of existing games.
It's notable that we're (generally - presumably some did know) only now realising that Crysis at 1920x1200 with 4xMSAA is limited by video RAM capacity. How long have reviewers been testing that game at that setting? Why's it taken so long to discover the RAM limitation? What folly it has been to say it doesn't scale, when the card is running out of memory. (Though I think the game is still meant to scale substantially with CPU at that kind of setting, too - honestly, Crysis has long seemed like a red herring, as the analysis has been woeful.)
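For a sense of scale, a quick back-of-the-envelope on the render targets alone at that setting - assuming 32-bit colour and 32-bit depth/stencil, and ignoring compression, alignment and everything else the game keeps resident:

```cpp
// Back-of-the-envelope, not a measurement: rough render-target footprint at
// 1920x1200 with 4xMSAA, assuming 32-bit colour and 32-bit depth/stencil and
// ignoring compression, alignment and the game's own textures/geometry.
#include <cstdio>

int main()
{
    const double kPixels  = 1920.0 * 1200.0;
    const double kMiB     = 1024.0 * 1024.0;
    const int    kSamples = 4;                              // 4xMSAA
    const int    kBpp     = 4;                              // RGBA8 or D24S8

    double msaaColour = kPixels * kSamples * kBpp / kMiB;   // ~35 MiB
    double msaaDepth  = kPixels * kSamples * kBpp / kMiB;   // ~35 MiB
    double resolve    = kPixels * kBpp / kMiB;               // ~9 MiB

    std::printf("one MSAA colour target: %.0f MiB\n", msaaColour);
    std::printf("MSAA depth/stencil:     %.0f MiB\n", msaaDepth);
    std::printf("resolved back buffer:   %.0f MiB\n", resolve);
    // Multiply the colour figure by however many full-screen targets the engine
    // keeps alive, add the texture set on top, and a card of that era gets
    // squeezed quickly.
    return 0;
}
```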
Jawed