On ATI's "optimization"

ATI's official statement:

The 1.9% performance gain comes from optimization of the two DX9 shaders (water and sky) in Game Test 4. We render the scene exactly as intended by Futuremark, in full-precision floating point. Our shaders are mathematically and functionally identical to Futuremark's and there are no visual artifacts; we simply shuffle instructions to take advantage of our architecture. These are exactly the sort of optimizations that work in games to improve frame rates without reducing image quality and as such, are a realistic approach to a benchmark intended to measure in-game performance. However, we recognize that these can be used by some people to call into question the legitimacy of benchmark results, and so we are removing them from our driver as soon as is physically possible. We expect them to be gone by the next release of CATALYST.
 
DemoCoder said:
The only way something like that could not be cheating is if their driver had a generic shader optimizer that parsed shaders and "optimized on the fly" any non-optimal constructs.
I have a hard time believing this is feasible. In general, that's a much more difficult problem than an optimizing on-the-fly HLSL compiler. Maybe some very simple cases could be caught and optimized, but you'd waste CPU cycles all the time analyzing the low-level code. If the non-optimal cases you can find and optimize are frequent enough that such an on-the-fly optimizer would be worth it, you should probably change your hardware instead.

I don't know why, but somehow all this cheating discussion reminds me of the Sun/SPARC SPECfp2000 scores. You'd think they have competitive FP performance if you look at the SPEC numbers. However, if you look closer you can see they managed to boost the score on one particular benchmark (art) by about a factor of 10 with a new compiler, which gets them an overall increase of around 25% or so. It was stated everywhere that this is NOT a cheat but an optimization. Of course, pretty much everything is allowed for compiler builders to boost SPECfp scores, with the important exception that the optimization must not target SPECfp directly (i.e. recognize the benchmark and do some hard-coded optimization). So Sun has found a way to optimize this badly written benchmark without targeting it directly - which means potentially there could be other apps out there which would benefit from it, but the chances of this happening are likely very close to zero.
But I digress... such optimizations are possible with a compiler (as it analyzes the code anyway and probably just gets a tiny bit slower while searching for code that could be optimized that way), but not within a graphics card driver.
(What IMHO is really interesting, however, is that no other chip maker / compiler vendor (such as Intel) has incorporated the same optimizations. Either they are not clever enough to figure out what magic Sun does, or they think they don't need to inflate their scores that way. Both possibilities seem unlikely to me...)

mczak
 
mczak said:
I have a hard time believing this is feasible. In general, that's a much more difficult problem than an optimizing on-the-fly HLSL compiler. Maybe some very simple cases could be caught and optimized, but you'd waste CPU cycles all the time analyzing the low-level code. If the non-optimal cases you can find and optimize are frequent enough that such an on-the-fly optimizer would be worth it, you should probably change your hardware instead.

You only send the shaders over once for a whole bunch of pixels. They're only ~100 instructions. Simple rules for re-ordering should be no problem to implement _in real time_, _in the general case_.

Matter of fact, it's likely the HLSL compiler translates to some generic OP tree thing, then does code generation, then does optimization. For an assembly shader, you'd skip the first two steps.
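To make that assembly path concrete, here is a minimal Python sketch (all names are invented for illustration; this is not any vendor's driver code): the ~100-instruction listing is parsed once into instruction records, and the same generic backend passes an HLSL compiler would finish with can then be run over it.

```python
# Minimal sketch (invented names, not any vendor's driver code) of the
# assembly path: parse the listing once, then run generic backend
# passes over the resulting instruction list.
from collections import namedtuple

Instr = namedtuple("Instr", "op dst srcs")

def parse_asm(listing):
    """'mad r1, r0, c0, c1' -> Instr('mad', 'r1', ('r0', 'c0', 'c1'))."""
    instrs = []
    for line in listing.strip().splitlines():
        op, operands = line.split(None, 1)
        regs = tuple(r.strip() for r in operands.split(","))
        instrs.append(Instr(op, regs[0], regs[1:]))
    return instrs

def optimize(instrs, passes):
    """The same generic passes an HLSL back end would finish with."""
    for backend_pass in passes:
        instrs = backend_pass(instrs)
    return instrs

def drop_redundant_movs(instrs):
    """Example generic rule: delete self-moves like 'mov r0, r0'."""
    return [i for i in instrs if not (i.op == "mov" and i.srcs == (i.dst,))]

shader = parse_asm("""
    texld r0, t0
    mov r0, r0
    mad r1, r0, c0, c1
""")
print(optimize(shader, [drop_redundant_movs]))  # the redundant mov is gone
```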
 
DaveBaumann said:
ATI's official statement:
Sounds much more reasonable to me than NVIDIA's statement...
Of course they are in a much better position to argue than NVIDIA, since they don't cheat as much as NVIDIA does (the clip plane / back buffer clearing cheats are IMHO much more evil than shader replacement, as long as the shader is mathematically equivalent). And, more importantly, they can of course easily remove the cheats and stay ahead of NVIDIA performance-wise (if reviewers use the updated version, or question the validity of NVIDIA's results should the next Detonator driver magically show a 24.1% improvement...).
 
RussSchultz said:
You only send the shaders over once for a whole bunch of pixels. They're only ~100 instructions. Simple rules for re-ordering should be no problem to implement _in real time_, _in the general case_.
If you can improve performance by just re-ordering, I'd consider that a simple case to catch and thus agree it could be done. I don't question that it can be done in real time either, but quite a few of today's games are so CPU rather than GPU limited at almost all resolutions/FSAA levels (at least on high-end cards) that you probably just don't want to spend _any_ CPU time in such an optimizer.

Matter of fact, it's likely the HLSL compiler translates to some generic OP tree thing, then does code generation, then does optimization. For an assembly shader, you'd skip the first two steps.
Yes, but software design also says you don't gain much optimizing at the lowest level (i.e. reordering instructions).
 
DaveBaumann said:
ATI's official statement:

The 1.9% performance gain comes from optimization of the two DX9 shaders (water and sky) in Game Test 4. ...

Great testament to the public opinion of ATi:

Nobody's even suggested testing cat3.5 and/or cat3.6 on old versions of 3DMark03 to test that the optimisation really is out of the driver. 8)
 
mczak said:
If you can improve performance by just re-ordering, I'd consider that a simple case to catch and thus agree it could be done. I don't question that it can be done in real time either, but quite a few of today's games are so CPU rather than GPU limited at almost all resolutions/FSAA levels (at least on high-end cards) that you probably just don't want to spend _any_ CPU time in such an optimizer.
It's only a one-time optimization when the shader is first compiled/passed to the driver. No one cares if loading a game/benchmark takes half a second longer.

mczak said:
Yes, but software design also says you don't gain much optimizing at the lowest level (i.e. reordering instructions).
And CPU design shows that this is phenomenally wrong. ;)
Optimizing inner loops very often results in huge performance improvements, and a shader is nothing but a single "inner loop".
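Some back-of-the-envelope arithmetic (my own purely illustrative numbers, only the ~100-instruction figure comes from the thread) shows why a per-pixel shader behaves like an inner loop:

```python
# Back-of-the-envelope arithmetic (illustrative numbers only) for why a
# per-pixel shader is effectively an inner loop.
pixels_per_frame = 1024 * 768   # assume full-screen coverage, no overdraw
shader_length = 100             # instructions, as mentioned above
saved = 2                       # instructions removed by reordering/packing

print(pixels_per_frame * shader_length)   # 78,643,200 instructions per frame
print(pixels_per_frame * saved)           # 1,572,864 instructions saved per frame
```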
 
DaveBaumann said:
ATI's official statement:

The 1.9% performance gain comes from optimization of the two DX9 shaders (water and sky) in Game Test 4. ...

That answer is definitely a lot better than blaming Futuremark for a problem or issue that is really NVIDIA's own.

Something NVIDIA has not done: actually admit they have done something wrong or controversial.

ATI is looking like a very honest company as of late, and this statement helps strengthen that stance.
 
DaveBaumann said:
ATI's official statement:

The 1.9% performance gain comes from optimization of the two DX9 shaders (water and sky) in Game Test 4. ...

Nvidia responds (take it with a grain of salt...)
http://www.nvnews.net/

Since NVIDIA is not part in the FutureMark beta program (a program which costs of hundreds of thousands of dollars to participate in) we do not get a chance to work with Futuremark on writing the shaders like we would with a real applications developer. We don't know what they did but it looks like they have intentionally tried to create a scenario that makes our products look bad. This is obvious since our relative performance on games like Unreal Tournament 2003 and Doom3 shows that The GeForce FX 5900 is by far the fastest graphics on the market today.
 
saf1 said:
Nvidia responds (take it with a grain of salt...)
http://www.nvnews.net/ ...

It's also on http://news.com.com/2100-1046_3-1009574.html?tag=fd_top
 
They (and NVIDIA) have a runtime driver optimiser, but a generic one will never truly be as good as some hand-tuned code. Yes, most stuff will run reasonably optimally anyway, but on key stuff (read: highly publicised / benchmarked / shaders that don't work well with the optimiser) they are likely to hand-tune it.
 
saf1 said:
Nvidia responds (take it with a grain of salt...)
http://www.nvnews.net/

Since NVIDIA is not part in the FutureMark beta program (a program which costs of hundreds of thousands of dollars to participate in) we do not get a chance to work with Futuremark on writing the shaders like we would with a real applications developer. We don't know what they did but it looks like they have intentionally tried to create a scenario that makes our products look bad. This is obvious since our relative performance on games like Unreal Tournament 2003 and Doom3 shows that The GeForce FX 5900 is by far the fastest graphics on the market today.

LOL. The bold-faced/underlined part is priceless.

Damn, I really want to put that in my sig, but I don't want to have to change it again in such a short time... :?


Added after some brief reflection before hitting Submit:

Yes, let's compare our terrifically bad PS2.0 and PS1.4 3DMark03 performance to our performance in pre-PS1.4 functionality. Now that is PR spin. I can't wait for the first PS2.0 game to come out without any specific NV30 optimisation. =) That'll be a slaughterhouse for NV3x.
 
RussSchultz said:
You only send the shaders over once for a whole bunch of pixels. They're only ~100 instructions. Simple rules for re-ordering should be no problem to implement _in real time_, _in the general case_.

Great, then why doesn't nVidia do this, if that's the problem with NV30? I mean, FM defeating application / shader detection routines surely should not defeat a "no problem to implement" real-time re-ordering optimization.
 
Xmas said:
It's only a one-time optimization when the shader is first compiled/passed to the driver. No one cares if loading a game/benchmark takes half a second longer.
Actually, thinking about it, you're probably right. I guess you typically use the same shaders over and over again, passing them to the GPU only once (or, if you pass them multiple times, the driver could easily detect that the shader was already used and use an already-optimized cached copy). In such a case it makes sense to optimize, I guess.

Xmas said:
And CPU design shows that this is phenomenally wrong. ;)
Optimizing inner loops very often results in huge performance improvements, and a shader is nothing but a single "inner loop".
I wouldn't call it phenomenally wrong, but the gains can indeed be considerable (it won't help a selection sort beat a quicksort, though). The potential for optimizing low-level shader code should however be smaller than for CPUs, I suppose (as you can't do much more than re-ordering, probably). But maybe I'm wrong again - it's a bit too late over here and I definitely need some sleep :?
 
Tagrineth said:
Nobody's even suggested testing cat3.5 and/or cat3.6 on old versions of 3DMark03 to test that the optimisation really is out of the driver. 8)
If you meant 3.2 and 3.3, I have. :)
 
mczak said:
... hard time believing this is feasible. In general, that's a much more difficult problem than an optimizing on-the-fly HLSL compiler.
Re-analysing assembly to generate new code isn't significantly harder than doing it from the high-level version.
 
mczak said:
Actually, thinking about it, you're probably right. I guess you typically use the same shaders over and over again, passing them to the GPU only once (or, if you pass them multiple times, the driver could easily detect that the shader was already used and use an already-optimized cached copy). In such a case it makes sense to optimize, I guess.
When an application passes a shader to the driver/API, the driver caches it and the app gets a handle to that shader. The app then uses those handles to select which shader should be used to render the following polygons. All shaders are kept "alive" until the application deletes them, which is usually at the end of a "level" or when exiting the game.
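A rough sketch of that lifecycle in Python (a hypothetical interface, not the actual Direct3D or driver API): the expensive work happens once in create, and per-draw selection is just a handle lookup.

```python
# Rough sketch (hypothetical interface, not the real D3D/driver API) of
# the handle model described above: optimize once at creation, then
# every draw call is just a lookup.
class ShaderCache:
    def __init__(self):
        self._shaders = {}
        self._next_handle = 1

    def create(self, listing):
        # One-time cost: a real driver would run its reordering/packing
        # passes here; this placeholder just tokenizes the listing.
        compiled = tuple(tuple(line.split()) for line in listing.strip().splitlines())
        handle = self._next_handle
        self._next_handle += 1
        self._shaders[handle] = compiled
        return handle

    def select(self, handle):
        # Per draw call: no recompilation, just a dictionary lookup.
        return self._shaders[handle]

    def delete(self, handle):
        # Typically called at the end of a level or on exit.
        del self._shaders[handle]

cache = ShaderCache()
water = cache.create("texld r0, t0\nmad r1, r0, c0, c1")
assert cache.select(water) is cache.select(water)  # same pre-optimized copy reused
```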

mczak said:
The potential for optimizing low-level shader code should however be smaller than for CPUs, I suppose (as you can't do much more than re-ordering, probably). But maybe I'm wrong again - it's a bit too late over here and I definitely need some sleep :?
The potential for CPUs might be higher. But on GPUs you also have different units (TMU, vec3 + 1 or vec4, complex fp ops) working in parallel which need to be fed. And you can do some neat things by combining scalar operations. If you can save two ops in a 10 op shader, that might give you a 10-30% performance increase.
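A toy model (my own illustration, not a description of ATI's hardware) of the vec3 + scalar co-issue idea: an RGB operation and an independent scalar operation can share one issue slot, so pairing them shrinks the cycle count. It ignores the dependence checks a real driver would also need, but it shows how saving a couple of slots out of ten translates into the kind of percentage gain mentioned above.

```python
# Toy issue-slot model (my own illustration, not ATI's hardware): a vec3
# (RGB) op and an independent scalar op can be co-issued in one slot.
def issue_slots(instrs, co_issue):
    """instrs: list of (width, op); width 3 = vec3, 1 = scalar, 4 = vec4."""
    slots = 0
    scalar_lane_open = False
    for width, _ in instrs:
        if co_issue and width == 3:
            scalar_lane_open = True    # vec3 op leaves the scalar lane free
            slots += 1
        elif co_issue and width == 1 and scalar_lane_open:
            scalar_lane_open = False   # scalar op rides along for free
        else:
            scalar_lane_open = False
            slots += 1
    return slots

shader = [(3, "dp3"), (1, "rsq"), (3, "mul"), (1, "mad"), (4, "mov")]
print(issue_slots(shader, co_issue=False))  # 5 slots
print(issue_slots(shader, co_issue=True))   # 3 slots
```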
 
mczak said:
I wouldn't call it phenomenally wrong, but the gains can indeed be considerable (it won't help a selection sort beat a quicksort, though). The potential for optimizing low-level shader code should however be smaller than for CPUs, I suppose (as you can't do much more than re-ordering, probably). But maybe I'm wrong again - it's a bit too late over here and I definitely need some sleep

Reordering instructions done by a CPU gets you about 30%. That's phenomenal. Reordering shader code could easily remove stalls and reduce contention for execution resources.
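To illustrate the stall-removal point, here is a toy latency model in Python (the four-cycle fetch latency and the instruction sequence are invented): moving independent ALU work between a texture fetch and its first consumer hides part of the fetch latency.

```python
# Toy latency model (invented numbers) for the stall argument: if a
# texture fetch result is consumed immediately, the pipeline waits;
# moving independent ALU work in between hides part of that latency.
TEX_LATENCY = 4  # made-up fetch latency, in cycles

def cycles(instrs):
    """instrs: (op, dst, srcs); one issue per cycle plus dependency stalls."""
    ready_at = {}
    clock = 0
    for op, dst, srcs in instrs:
        need = max((ready_at.get(s, 0) for s in srcs), default=0)
        clock = max(clock, need) + 1
        ready_at[dst] = clock + (TEX_LATENCY if op == "texld" else 0)
    return clock

naive = [
    ("texld", "r0", ("t0",)),
    ("mul",   "r1", ("r0", "c0")),        # consumes the fetch immediately
    ("add",   "r2", ("v0", "c1")),        # independent work
    ("add",   "r4", ("v1", "c3")),        # more independent work
    ("mad",   "r3", ("r1", "r2", "r4")),
]
reordered = [naive[0], naive[2], naive[3], naive[1], naive[4]]

print(cycles(naive), cycles(reordered))   # 9 7 in this toy model
```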
 