I think 3dcgi's whole point is that there is nothing to prove, because the type of workload changes multiple times per frame and thus the location of the bottleneck changes just the same.
It only makes sense to say: x% of the time the bottleneck is here, and y% of the time it is somewhere else.
There is: you can look at the ratio of SMs to TPCs/PolyMorph Engines/Raster Engines and at actual game frame behaviour.
Historically this has been a 1:1 performance relationship (see Arun's tool), but as the SM/CUDA cores scale while those other units stay static, more pressure lands on the front end if you expect the architecture to scale from, say, Pascal to Volta; the cores grew by 42%, yet the 1:1 relationship for geometry is now broken.
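To make the ratio argument concrete, here is a rough sketch in Python; the unit counts and the one-PolyMorph-Engine-per-TPC / one-Raster-Engine-per-GPC layout are illustrative assumptions rather than exact product specs:

# Rough sketch of how the shader array outgrew the fixed-function front end.
# Unit counts are illustrative assumptions, not exact product specs.
configs = {
    "GP102-like (Pascal)": {"sms": 30, "tpcs": 30, "gpcs": 6},
    "GV100-like (Volta)":  {"sms": 80, "tpcs": 40, "gpcs": 6},
}

for name, c in configs.items():
    print(f"{name}: {c['sms'] / c['tpcs']:.1f} SMs per PolyMorph Engine, "
          f"{c['sms'] / c['gpcs']:.1f} SMs per Raster Engine")

# GP102-like (Pascal): 1.0 SMs per PolyMorph Engine, 5.0 SMs per Raster Engine
# GV100-like (Volta):  2.0 SMs per PolyMorph Engine, 13.3 SMs per Raster Engine

In other words, under those assumptions the shader array grows much faster than the geometry/raster front end, which is exactly the pressure being described.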
That broken relationship is further seen in games measured with PresentMon or other frame-time based tools; you can see the influence on frames.
3dcgi was picking up on my post as speculation with no foundation; actually it does have a foundation and is backed up by what is seen with nearly every game so far on the Titan V: instead of a 42% improvement in games we are at an average of 18-25% (or mostly below), with a very rare few in the low 30s% or at times over 40%.
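As a rough way to frame that gap (the 42% and the game figures are just the numbers quoted above, used purely as an illustration):

# How much of the ~42% core/TFLOP increase the quoted game gains actually realise.
compute_scaling = 1.42
game_scalings = [1.18, 1.25, 1.32, 1.40]   # rough averages quoted above

for s in game_scalings:
    realised = (s - 1) / (compute_scaling - 1)
    print(f"game gain {s - 1:.0%} realises {realised:.0%} of the core scaling")

# e.g. an 18% game gain realises only ~43% of the 42% core scaling, a 25% gain ~60%.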
Look at Arun's tool and what was discussed, then look at games monitored from a frame behaviour perspective.
If one wants to argue semantics, then one can say there are no bottlenecks anywhere because the workload changes for everything; the point is that the context was the scaling of compute/TFLOPs/cores versus gaming (geometry aspects that can be shown to have a lower ratio than before, a ratio that is fundamental to Nvidia's architecture).
By that logic you might as well say games are fine on the Titan V and scaling well if we look at the 1% of the time it is fine over period Y, rather than the more real-world view of how it behaves 98% of the time in the game. In reality games are not scaling well, and so far (no other explanation identified) it comes back to what Arun identified with his tool and what was discussed in that thread.
But like I mentioned, to be 100% satisfied with Arun's tool results we need to see the behaviour on P100, due to its SM/CUDA structure; in theory the tool should identify it as 1:2 or better, whereas for V100 it is quite a lot worse than that.
Still, this gives us some indication (Arun's tool showing the front-end performance ratio has dropped), combined with the game behaviour trends we are seeing on the Titan V when the cores scaled by 42%.
Edit:
Worth noting as well that even with the reduced ROPs, in bandwidth-hungry compute applications such as Amber the Titan V still hits an over 40% performance increase, so when comparing scaling against, say, GP102 it is fair to say that is still not a limitation relative to the core scaling.
That said, it would be even higher with the full HBM2 bus width/bandwidth, but it is not limiting performance to below the core scaling.
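As a loose sanity check on that (the bandwidth figures are approximate and only meant as an illustration, not authoritative specs):

# Loose sanity check: bandwidth grew far less than the cores, yet Amber still gains ~40%+,
# so it is evidently not simply bandwidth-bound. Figures are approximate/illustrative.
titan_xp_bw = 548.0   # GB/s, approx (GDDR5X on GP102)
titan_v_bw  = 653.0   # GB/s, approx (3 of 4 HBM2 stacks enabled)
core_gain   = 0.42
amber_gain  = 0.40    # "over 40%" from the post

print(f"bandwidth gain: {titan_v_bw / titan_xp_bw - 1:.0%} vs core gain {core_gain:.0%}, "
      f"Amber gain ~{amber_gain:.0%}+")
# -> bandwidth gain: 19% vs core gain 42%, Amber gain ~40%+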