On ATI's "optimization"

ATI's official statement:

The 1.9% performance gain comes from optimization of the two DX9 shaders (water and sky) in Game Test 4. We render the scene exactly as intended by Futuremark, in full-precision floating point. Our shaders are mathematically and functionally identical to Futuremark's and there are no visual artifacts; we simply shuffle instructions to take advantage of our architecture. These are exactly the sort of optimizations that work in games to improve frame rates without reducing image quality and as such, are a realistic approach to a benchmark intended to measure in-game performance. However, we recognize that these can be used by some people to call into question the legitimacy of benchmark results, and so we are removing them from our driver as soon as is physically possible. We expect them to be gone by the next release of CATALYST.
 
DemoCoder said:
The only way something like that could not be cheating is if their driver had a generic shader optimizer that parsed shaders and "optimized on the fly" any non-optimal constructs.
I have a hard time believing this is feasible. In general, that's a much more difficult problem than an optimizing on-the-fly HLSL compiler. Maybe some very simple cases could be caught and optimized, but you'd waste CPU cycles all the time analyzing the low-level code. If the non-optimal cases you can find and optimize are frequent enough that such an on-the-fly optimizer would be worth it, you should probably change your hardware instead.

I don't know why, but somehow all this cheating discussion reminds me of the Sun/SPARC SPECfp2000 scores. You'd think they have competitive FP performance if you look at the SPEC numbers. However, if you look closer you can see they managed to boost the score on one particular benchmark (art) by about a factor of 10 with a new compiler, which gets them an overall increase of around 25% or so. It was stated everywhere that this is NOT a cheat but an optimization. Of course, pretty much everything is allowed for compiler builders to boost SPECfp scores, with the important exception that the optimization must not target SPECfp directly (i.e. recognize the benchmark and do some hard-coded optimization). So Sun has found a way to optimize this badly written benchmark without targeting it directly - which means potentially there could be other apps out there which would benefit from it, but the chances of this happening are likely very close to zero.
But I digress... such optimizations are possible with a compiler (as it analyzes the code anyway and probably just gets a tiny bit slower while searching for code that could be optimized that way), but not within a graphics card driver.
(What IMHO is really interesting, however, is that no other chip maker / compiler vendor (such as Intel) has incorporated the same optimizations. Either they are not clever enough to figure out what magic Sun does, or they think they don't need to inflate their scores that way. Both possibilities seem unlikely to me...)

mczak
 
mczak said:
I have a hard time believing this is feasible. In general, that's a much more difficult problem than an optimizing on-the-fly HLSL compiler. Maybe some very simple cases could be caught and optimized, but you'd waste CPU cycles all the time analyzing the low-level code. If the non-optimal cases you can find and optimize are frequent enough that such an on-the-fly optimizer would be worth it, you should probably change your hardware instead.

You only send the shaders over once for a whole bunch of pixels. They're only ~100 instructions. Simple rules for re-ordering should be no problem to implement _in real time_, _in the general case_.

Matter of fact, it's likely the HLSL compiler translates to some generic OP tree thing, then does code generation, then does optimization. For an assembly shader, you'd skip the first two steps.
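To make that assembly path concrete, here is a minimal Python sketch (all names are invented for illustration; this is not any vendor's driver code): the ~100-instruction listing is parsed once into instruction records, and the same generic backend passes an HLSL compiler would finish with can then be run over it.

```python
# Minimal sketch (invented names, not any vendor's driver code) of the
# assembly path: parse the listing once, then run generic backend
# passes over the resulting instruction list.
from collections import namedtuple

Instr = namedtuple("Instr", "op dst srcs")

def parse_asm(listing):
    """'mad r1, r0, c0, c1' -> Instr('mad', 'r1', ('r0', 'c0', 'c1'))."""
    instrs = []
    for line in listing.strip().splitlines():
        op, operands = line.split(None, 1)
        regs = tuple(r.strip() for r in operands.split(","))
        instrs.append(Instr(op, regs[0], regs[1:]))
    return instrs

def optimize(instrs, passes):
    """The same generic passes an HLSL back end would finish with."""
    for backend_pass in passes:
        instrs = backend_pass(instrs)
    return instrs

def drop_redundant_movs(instrs):
    """Example generic rule: delete self-moves like 'mov r0, r0'."""
    return [i for i in instrs if not (i.op == "mov" and i.srcs == (i.dst,))]

shader = parse_asm("""
    texld r0, t0
    mov r0, r0
    mad r1, r0, c0, c1
""")
print(optimize(shader, [drop_redundant_movs]))  # the redundant mov is gone
```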
 
DaveBaumann said:
ATI's official statement:
Sounds much more reasonable to me than NVIDIA's statement...
Of course they are in a much better position to argue than NVIDIA, since they don't cheat as much as NVIDIA does (the clip plane / back buffer clearing cheats are IMHO much more evil than shader replacement, as long as the shader is mathematically equivalent). And, more importantly, they can of course easily remove the cheats and stay ahead of NVIDIA performance-wise (if reviewers use the updated version, or question the validity of NVIDIA's results should the next Detonator driver magically show a 24.1% improvement...).
 
RussSchultz said:
You only send the shaders over once for a whole bunch of pixels. They're only ~100 instructions. Simple rules for re-ordering should be no problem to implement _in real time_, _in the general case_.
If you can improve performance by just re-ordering, I'd consider that a simple case to catch and thus agree it could be done. I don't question that it can be done in real time either, but quite a few of today's games are so CPU rather than GPU limited at almost all resolutions/FSAA levels (at least on high-end cards) that you probably just don't want to spend _any_ CPU time in such an optimizer.

Matter of fact, it's likely the HLSL compiler translates to some generic OP tree thing, then does code generation, then does optimization. For an assembly shader, you'd skip the first two steps.
Yes, but software design also says you don't gain much optimizing at the lowest level (i.e. reordering instructions).
 
DaveBaumann said:
ATI's official statement:

The 1.9% performance gain comes from optimization of the two DX9 shaders (water and sky) in Game Test 4. ...

Great testament to the public opinion of ATi:

Nobody's even suggested testing cat3.5 and/or cat3.6 on old versions of 3DMark03 to test that the optimisation really is out of the driver. 8)
 
mczak said:
If you can improve performance by just re-ordering, I'd consider that a simple case to catch and thus agree it could be done. I don't question that it can be done in real time either, but quite a few of today's games are so CPU rather than GPU limited at almost all resolutions/FSAA levels (at least on high-end cards) that you probably just don't want to spend _any_ CPU time in such an optimizer.
It's only a one-time optimization when the shader is first compiled/passed to the driver. No one cares if loading a game/benchmark takes half a second longer.

mczak said:
Yes, but software design also says you don't gain much optimizing at the lowest level (i.e. reordering instructions).
And CPU design shows that this is phenomenally wrong. ;)
Optimizing inner loops very often results in huge performance improvements, and a shader is nothing but a single "inner loop".
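Some back-of-the-envelope arithmetic (my own purely illustrative numbers, only the ~100-instruction figure comes from the thread) shows why a per-pixel shader behaves like an inner loop:

```python
# Back-of-the-envelope arithmetic (illustrative numbers only) for why a
# per-pixel shader is effectively an inner loop.
pixels_per_frame = 1024 * 768   # assume full-screen coverage, no overdraw
shader_length = 100             # instructions, as mentioned above
saved = 2                       # instructions removed by reordering/packing

print(pixels_per_frame * shader_length)   # 78,643,200 instructions per frame
print(pixels_per_frame * saved)           # 1,572,864 instructions saved per frame
```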
 
DaveBaumann said:
ATI's official statement:

The 1.9% performance gain comes from optimization of the two DX9 shaders (water and sky) in Game Test 4. ...

That answer is definitely a lot better than blaming Futuremark for a problem or issue that is really NVIDIA's own.

Something NVIDIA has not done: actually admit they have done something wrong or controversial.

ATI is looking like a very honest company as of late, and this statement helps strengthen that stance.
 
DaveBaumann said:
ATI's official statement:

The 1.9% performance gain comes from optimization of the two DX9 shaders (water and sky) in Game Test 4. ...

Nvidia responds (take it with a grain of salt...)
http://www.nvnews.net/

Since NVIDIA is not part in the FutureMark beta program (a program which costs of hundreds of thousands of dollars to participate in) we do not get a chance to work with Futuremark on writing the shaders like we would with a real applications developer. We don't know what they did but it looks like they have intentionally tried to create a scenario that makes our products look bad. This is obvious since our relative performance on games like Unreal Tournament 2003 and Doom3 shows that The GeForce FX 5900 is by far the fastest graphics on the market today.
 
saf1 said:
Nvidia responds (take it with a grain of salt...)
http://www.nvnews.net/ ...

It's also on http://news.com.com/2100-1046_3-1009574.html?tag=fd_top
 
They (and NVIDIA) have a runtime driver optimiser, but a generic one will never truly be as good as some hand-tuned code. Yes, most stuff will run reasonably optimally anyway, but on key stuff (read: highly publicised / benchmarked / shaders that don't work well with the optimiser) they are likely to hand-tune it.
 
saf1 said:
Nvidia responds (take it with a grain of salt...)
http://www.nvnews.net/

Since NVIDIA is not part in the FutureMark beta program (a program which costs of hundreds of thousands of dollars to participate in) we do not get a chance to work with Futuremark on writing the shaders like we would with a real applications developer. We don't know what they did but it looks like they have intentionally tried to create a scenario that makes our products look bad. This is obvious since our relative performance on games like Unreal Tournament 2003 and Doom3 shows that The GeForce FX 5900 is by far the fastest graphics on the market today.

LOL. The bold-faced/underlined part is priceless.

Damn, I really want to put that in my sig, but I don't want to have to change it again in such a short time... :?


Added after some brief reflection before hitting Submit:

Yes, let's compare our terrifically bad PS2.0 and PS1.4 3DMark03 performance to our performance in pre-PS1.4 functionality. Now that is PR spin. I can't wait for the first PS2.0 game to come out without any specific NV30 optimisation. =) That'll be a slaughterhouse for NV3x.
 
RussSchultz said:
You only send the shaders over once for a whole bunch of pixels. They're only ~100 instructions. Simple rules for re-ordering should be no problem to implement _in real time_, _in the general case_.

Great, then why doesn't nVidia do this, if that's the problem with NV30? I mean, FM defeating application / shader detection routines surely should not defeat a "no problem to implement" real-time re-ordering optimization.
 
Xmas said:
It's only a one-time optimization when the shader is first compiled/passed to the driver. No one cares if loading a game/benchmark takes half a second longer.
Actually, thinking about it, you're probably right. I guess you typically use the same shaders over and over again, passing them to the GPU only once (or, if you pass them multiple times, the driver could easily detect that the shader was already used and use an already-optimized cached copy). In such a case it makes sense to optimize, I guess.

Xmas said:
And CPU design shows that this is phenomenally wrong. ;)
Optimizing inner loops very often results in huge performance improvements, and a shader is nothing but a single "inner loop".
I wouldn't call it phenomenally wrong, but the gains can indeed be considerable (it won't help a selection sort beat a quicksort, though). The potential for optimizing low-level shader code should however be smaller than for CPUs, I suppose (as you can't do much more than re-ordering, probably). But maybe I'm wrong again - it's a bit too late over here and I definitely need some sleep :?
 
Tagrineth said:
Nobody's even suggested testing cat3.5 and/or cat3.6 on old versions of 3DMark03 to test that the optimisation really is out of the driver. 8)
If you meant 3.2 and 3.3, I have. :)
 
mczak said:
... hard time believing this is feasible. In general, that's a much more difficult problem than an optimizing on-the-fly HLSL compiler.
Re-analysing assembly to generate new code isn't significantly harder than doing it from the high-level version.
 
mczak said:
Actually, thinking about it, you're probably right. I guess you typically use the same shaders over and over again, passing them to the GPU only once (or, if you pass them multiple times, the driver could easily detect that the shader was already used and use an already-optimized cached copy). In such a case it makes sense to optimize, I guess.
When an application passes a shader to the driver/API, the driver caches it and the app gets a handle to that shader. The app then uses those handles to select which shader should be used to render the following polygons. All shaders are kept "alive" until the application deletes them, which is usually at the end of a "level" or when exiting the game.
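A rough sketch of that lifecycle in Python (a hypothetical interface, not the actual Direct3D or driver API): the expensive work happens once in create, and per-draw selection is just a handle lookup.

```python
# Rough sketch (hypothetical interface, not the real D3D/driver API) of
# the handle model described above: optimize once at creation, then
# every draw call is just a lookup.
class ShaderCache:
    def __init__(self):
        self._shaders = {}
        self._next_handle = 1

    def create(self, listing):
        # One-time cost: a real driver would run its reordering/packing
        # passes here; this placeholder just tokenizes the listing.
        compiled = tuple(tuple(line.split()) for line in listing.strip().splitlines())
        handle = self._next_handle
        self._next_handle += 1
        self._shaders[handle] = compiled
        return handle

    def select(self, handle):
        # Per draw call: no recompilation, just a dictionary lookup.
        return self._shaders[handle]

    def delete(self, handle):
        # Typically called at the end of a level or on exit.
        del self._shaders[handle]

cache = ShaderCache()
water = cache.create("texld r0, t0\nmad r1, r0, c0, c1")
assert cache.select(water) is cache.select(water)  # same pre-optimized copy reused
```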

mczak said:
The potential for optimizing low-level shader code should however be smaller than for CPUs, I suppose (as you can't do much more than re-ordering, probably). But maybe I'm wrong again - it's a bit too late over here and I definitely need some sleep :?
The potential for CPUs might be higher. But on GPUs you also have different units (TMU, vec3 + 1 or vec4, complex fp ops) working in parallel which need to be fed. And you can do some neat things by combining scalar operations. If you can save two ops in a 10 op shader, that might give you a 10-30% performance increase.
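A toy model (my own illustration, not a description of ATI's hardware) of the vec3 + scalar co-issue idea: an RGB operation and an independent scalar operation can share one issue slot, so pairing them shrinks the cycle count. It ignores the dependence checks a real driver would also need, but it shows how saving a couple of slots out of ten translates into the kind of percentage gain mentioned above.

```python
# Toy issue-slot model (my own illustration, not ATI's hardware): a vec3
# (RGB) op and an independent scalar op can be co-issued in one slot.
def issue_slots(instrs, co_issue):
    """instrs: list of (width, op); width 3 = vec3, 1 = scalar, 4 = vec4."""
    slots = 0
    scalar_lane_open = False
    for width, _ in instrs:
        if co_issue and width == 3:
            scalar_lane_open = True    # vec3 op leaves the scalar lane free
            slots += 1
        elif co_issue and width == 1 and scalar_lane_open:
            scalar_lane_open = False   # scalar op rides along for free
        else:
            scalar_lane_open = False
            slots += 1
    return slots

shader = [(3, "dp3"), (1, "rsq"), (3, "mul"), (1, "mad"), (4, "mov")]
print(issue_slots(shader, co_issue=False))  # 5 slots
print(issue_slots(shader, co_issue=True))   # 3 slots
```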
 
mczak said:
I wouldn't call it phenomenally wrong, but the gains can indeed be considerable (it won't help a selection sort beat a quicksort, though). The potential for optimizing low-level shader code should however be smaller than for CPUs, I suppose (as you can't do much more than re-ordering, probably). But maybe I'm wrong again - it's a bit too late over here and I definitely need some sleep

Reordering instructions done by a CPU gets you about 30%. That's phenomenal. Reordering shader code could easily remove stalls and reduce contention for execution resources.
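To illustrate the stall-removal point, here is a toy latency model in Python (the four-cycle fetch latency and the instruction sequence are invented): moving independent ALU work between a texture fetch and its first consumer hides part of the fetch latency.

```python
# Toy latency model (invented numbers) for the stall argument: if a
# texture fetch result is consumed immediately, the pipeline waits;
# moving independent ALU work in between hides part of that latency.
TEX_LATENCY = 4  # made-up fetch latency, in cycles

def cycles(instrs):
    """instrs: (op, dst, srcs); one issue per cycle plus dependency stalls."""
    ready_at = {}
    clock = 0
    for op, dst, srcs in instrs:
        need = max((ready_at.get(s, 0) for s in srcs), default=0)
        clock = max(clock, need) + 1
        ready_at[dst] = clock + (TEX_LATENCY if op == "texld" else 0)
    return clock

naive = [
    ("texld", "r0", ("t0",)),
    ("mul",   "r1", ("r0", "c0")),        # consumes the fetch immediately
    ("add",   "r2", ("v0", "c1")),        # independent work
    ("add",   "r4", ("v1", "c3")),        # more independent work
    ("mad",   "r3", ("r1", "r2", "r4")),
]
reordered = [naive[0], naive[2], naive[3], naive[1], naive[4]]

print(cycles(naive), cycles(reordered))   # 9 7 in this toy model
```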
 