Isn't that an argument for benchmarks being superior to actual experience? Essentially, image quality doesn't matter so long as framerates are high. Actual performance hasn't been the point of these demonstrations so far, since no framerates are disclosed.
Benchmarks are admittedly imperfect attempts at measuring performance, but they do have the benefit of providing some level of consistency, reproducibility, and concrete values for measurement and comparison. Flaws or unexpected quirks in those tests can be more readily found and analyzed, with a fair amount of shared terminology and process.
You can derive conclusions with reasonable confidence when working with human perception, but if you're trying for a more scientific conclusion you face the daunting prospect that humans are gooey, twitchy, inconstant, non-deterministic, capricious, sometimes inarticulate, and often irrational instruments. Signal has to be extracted from a lot of measurements, and even slight quirks can create a signal that wouldn't exist if not for an unexpected influence introduced by the test itself. The set of variables is much wider and less understood, and it's especially challenging because humans can be swayed by a test in ways that even a badly coded benchmark would be too rational to fall for.
Experimentation is in part about trying to combat the human tendency to fool oneself, and adding a human element to more parts of the experiment allows for an explosion in the number of ways one can fool or be fooled about how one is fooling or being fooled about being fooled, and so on.
For example, maybe it would have made a difference if there were "four" systems, with a randomized mapping of the two real systems to what the testers thought they were playing on. That might help avoid a player getting hung up on something subconsciously in one run and their brain subsequently reinforcing its conclusions during what it knows is a retry.
Also, it seemed like the testing order was sequential and fixed, which might need to be controlled for if this were a broader test; the brain going into the session on machine 1 isn't entirely the same one at the start of machine 2's run. (There's a rough sketch of what I mean below.)
I could be misinterpreting the methodology section on that.
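To make that concrete, here's a minimal sketch in Python of the kind of blinded setup I'm picturing; the labels, function name, and tester names are all made up for illustration, not taken from their methodology. Each of the two real systems hides behind two of four apparent labels, and each tester gets the labels in an independently shuffled order rather than a fixed sequence.

    import random

    REAL_SYSTEMS = ["system_1", "system_2"]   # the two actual machines under test
    APPARENT_LABELS = ["W", "X", "Y", "Z"]    # what the testers believe they are playing on

    def make_trial_plan(testers, seed=None):
        """Blinded plan: each apparent label secretly maps to one real system,
        and each tester plays the labels in an independently shuffled order."""
        rng = random.Random(seed)

        # Each real system ends up behind exactly two of the four labels.
        mapping = dict(zip(APPARENT_LABELS, rng.sample(REAL_SYSTEMS * 2, k=4)))

        plan = {}
        for tester in testers:
            order = APPARENT_LABELS[:]
            rng.shuffle(order)            # counterbalance run order per tester
            plan[tester] = [(label, mapping[label]) for label in order]
        return mapping, plan

    # The mapping stays hidden from testers until all runs are done.
    mapping, plan = make_trial_plan(["tester_1", "tester_2"], seed=1)

The point isn't this particular scheme; it's just that the player's belief about which box they're on stops being a reliable signal of which box they're actually on, and a fixed run order stops being baked into every comparison.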
They admit they lacked the time and resources to do more, which may leave us uncertain about a lot of things.
Benchmarks, after all, are meant to optimize time and resources for getting a measurement of some sort.
Human beings, on the other hand, are complications wrapped in an enigma smothered in biology and dusted with cheetos.