"Diminishing returns" is a quantitative idea, so it kind of implies the question...how do we quantify this? I came up with a thought experiment that I think could quantify it. Get a bunch of screenshots going all the way back to the Atari 2600, photographs, and artistic works (sketches, paintings, etc). Poll a bunch of people and ask them to rate the quality of the graphics from 1 to 10. Tell them only that "1" means "This couldn't be much worse," and "10" meaning "I'm not sure if this is a video game, a picture, or an artist's work."
Then plot each picture's rating against the year the game's target hardware was released. Dollars to donuts you'd see a curve that visibly flattens out some time in the early to mid 2000s. It's not worth the effort for me to do it, but if I were writing for an industry publication, I'd try to put together the poll.
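For what it's worth, here's a minimal sketch of what that plot and fit could look like. The ratings below are made-up placeholder numbers (not real poll data), and the S-curve is just one reasonable choice for a "quality saturates over time" model:

```python
# Hypothetical sketch: fit a saturating curve to (hardware year, mean poll rating).
# The ratings are placeholders invented for illustration, not real poll results.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

years = np.array([1977, 1985, 1990, 1994, 2000, 2005, 2013, 2020])
ratings = np.array([1.2, 2.0, 3.1, 4.5, 6.5, 8.0, 8.8, 9.2])  # placeholder means

def logistic(year, ceiling, midpoint, steepness):
    """Saturating S-curve: ratings climb quickly, then level off."""
    return ceiling / (1.0 + np.exp(-steepness * (year - midpoint)))

params, _ = curve_fit(logistic, years, ratings, p0=[10.0, 2000.0, 0.2])
ceiling, midpoint, steepness = params
print(f"fitted ceiling ~{ceiling:.1f}, steepest improvement around {midpoint:.0f}")

xs = np.linspace(1975, 2025, 200)
plt.scatter(years, ratings, label="mean poll rating (placeholder)")
plt.plot(xs, logistic(xs, *params), label="fitted S-curve")
plt.xlabel("target hardware release year")
plt.ylabel("perceived graphics quality (1-10)")
plt.legend()
plt.show()
```

The point where the fitted curve's slope starts dropping off would be your quantified "diminishing returns" date.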
Hard to quantify diminishing returns precisely, but we can name the processes that lead us there. The number of pixels per triangle on screen, how many operations go into each final screen-space pixel, and how many screen-space pixels we have to render together define how quickly the rate of visual improvement falls off. A simple doubling or even quadrupling of raw capability needs new techniques and methods to improve the image beyond just 4x the triangles, texels, and pixels.
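A rough back-of-the-envelope way to see that (my own illustrative arithmetic, not anything rigorous): if a 4x jump in raw throughput has to be split across resolution, triangle count, and per-pixel shading work, each axis only gets roughly the cube root of the budget.

```python
# Back-of-the-envelope only: spread a 4x raw-throughput budget evenly across
# three axes (pixels, triangles, shading ops per pixel).
budget = 4.0
per_axis = budget ** (1.0 / 3.0)
print(f"4x total budget -> ~{per_axis:.2f}x per axis (pixels, triangles, ops/pixel)")

# What does a ~1.59x pixel budget buy you starting from 720p?
base_pixels = 1280 * 720
scaled_pixels = base_pixels * per_axis
print(f"{base_pixels} pixels -> ~{scaled_pixels:,.0f} pixels "
      f"(roughly 1600x900, nowhere near 4K)")
```

So a generational leap that sounds huge on paper barely moves any single visual dimension, which is why new rendering techniques matter more than the raw multiplier.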
Hence, to me each of the 3D generations (and even the steps within a generation) is about much more than just more polygons, texels, and pixels; it's also about what you can do with them. Each new console has layered on more hallmark capabilities that bring rendering closer to approximating reality.