Custom benchmark timedemos should indeed be rotated with publicly available recorded sessions that differ widely from one another, in order to rule out timedemo-specific "optimizations" by IHVs.
BUT.. that being said.. there are also a number of legitimate reasons why you might see a flip-flop in performance superiority between different IHVs' hardware just by selecting a different timedemo.
Since different 3D hardware has different strengths and weaknesses, it is common sense that a timedemo that exercises one IHV's weakness for the majority of the script will show diminished performance on that IHV's hardware. A timedemo that does the reverse (i.e. spends the majority of its time on a weakness of another IHV's hardware) could then reverse the situation.
I can give you a good example, albeit not with a timedemo since the game at hand does not have such a feature: Morrowind. If you put a 9800 Pro head to head with, say, a GeForce4 card.. with all settings maxed (max view distance, texture quality, shadows, etc.) the 9800 Pro will benchmark much faster in almost all circumstances. A current weakness exists with either the drivers or the shadow method, such that if you spawn a Winged Twilight (fairly rare creature), the 9800 Pro's framerates nosedive whereas an NVIDIA card has no problem. So if you could theorize a timedemo in this game.. one that is taken in a busy town with a long horizon viewing distance, 10 guards walking around.. you'd have a 9800 Pro powering past the NVIDIA card in framerates. Now theorize a timedemo in an underground dungeon, with little viewing distance due to tunnels and passageways, and 3 or 4 Winged Twilights.. Voila, you have a graph that is the exact opposite.
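To put rough numbers on that flip-flop, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the per-scene framerates, the scene names, the frame counts and the `timedemo_fps` helper are hypothetical, not measurements from any real card. It only shows the arithmetic: a timedemo's average fps is total frames divided by total render time, so the slow scenes dominate whenever they make up most of the recording.

```python
def timedemo_fps(frame_counts, fps_by_scene):
    """Average fps over a timedemo: total frames / total render time."""
    total_frames = sum(frame_counts.values())
    total_seconds = sum(
        frames / fps_by_scene[scene] for scene, frames in frame_counts.items()
    )
    return total_frames / total_seconds


# Hypothetical per-scene framerates, invented for illustration only.
card_a = {"open_town": 60.0, "twilight_dungeon": 15.0}  # strong outdoors, weak spot in the dungeon
card_b = {"open_town": 40.0, "twilight_dungeon": 30.0}  # slower outdoors, no weak spot

# Two hypothetical recordings: same scenes, different mix of frames.
town_demo = {"open_town": 900, "twilight_dungeon": 100}
dungeon_demo = {"open_town": 100, "twilight_dungeon": 900}

for name, demo in (("town_demo", town_demo), ("dungeon_demo", dungeon_demo)):
    fps_a = timedemo_fps(demo, card_a)
    fps_b = timedemo_fps(demo, card_b)
    winner = "card A" if fps_a > fps_b else "card B"
    print(f"{name}: card A {fps_a:.1f} fps, card B {fps_b:.1f} fps -> {winner} wins")
```

With these made-up numbers the same two cards trade places depending on which recording is used, with no change to drivers or hardware.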
In a Quake3 sense, the same circumstances might be relevant. Theorize that a particular IHV sees a reduction in performance when looking at the skies. A timedemo outdoors with a viewpoint that always has 20-30% of the screen filled by the skybox would yield poor performance on this IHV... yet timedemos recorded in indoor levels with no skybox, or with the player's viewing angle mostly looking downwards at the ground, might yield the opposite effect.
This is nothing new, and in some cases certain websites would use this to push a particular vendor. By cherry picking certain benchmark scripts, you could pretty much fabricate a "win" just by ensuring a demo/benchmark made ample use of a weakness of a particular IHV that you were trying to compete with. It didn't require any driver "cheats" or "optimizations", just a good eye and an understanding of the things that particular platforms did well.. and what other platforms didn't handle well. (Alpha effects, such as smoke and explosion effects on past generations, come to mind the most..)
So whether this is a case of driver "cheats" still remains to be seen. Whether the timedemos are vastly different in nature, or whether some sort of shortcoming (performance-wise, be it driver or hardware) shows up in one timedemo and not the other, is what should be researched before crying foul play.
If we also take a page from the past, cases like these that turn out to be shortcomings usually improve things for consumers.. a lot of times they point out bugs or inefficiencies in drivers that were simply not caught by IHV X. They might just be pointing out a few lines of dead code or a less than optimal approach to a particular function or feature. When mistakes or inefficiencies like these are exposed by opposing benchmarks, the end user usually sees driver fixes/improvements rather quickly.