ATI & NVidia's vertex processors

Ostsol

Digit-Life's review, like all of their reviews, is annoyingly posted on a single, massive page. However, they tend to run some synthetic benchmarks that other sites don't, which makes for interesting browsing. The following benchmark intrigues me the most:

http://www.ixbt.com/video2/images/r420xt/gps-3diffuse+specular.png

What I'd like to point out is the difference between ATI's and NVidia's branching implementations. Obviously there's some fundamental difference between the two that leads to NVidia taking quite a significant performance penalty. I'm wondering just what that is...

The thought had occurred to me that ATI is unrolling the loops. Since they're static, this makes sense and would be relatively easy to do when compiling the shader. However, if that were the case, there is no way they could achieve the maximum 65k instruction limit they advertise. AFAIK, that number is really the number of instructions that can be executed, including loops -- the maximum stored instruction count of a vertex shader being far lower. So if unrolling isn't the answer, what is?
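To illustrate the distinction: with loops, a tiny stored program can account for a huge executed count. A minimal sketch in vs_2_x-style assembly (the constants and registers are mine, purely illustrative):

Code:
defi i0, 255, 0, 1, 0   // i0 = (iteration count, initial aL, step, unused)
loop aL, i0
	mad r0, r0, c0, c1  // one stored instruction...
endloop                 // ...but 255 executed instructions

Nest a couple of loops like that and a few dozen stored instructions can reach a 65k executed count, which is exactly why the advertised limit can't be met by unrolling at compile time.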
 
Well, I'm a lot more interested in the huge difference between FF and VS1/2 on NV40.
WTH is up with that?
 
I don't understand the FFP vs. VS results on the NV40; I thought the FFP was completely gone. Maybe it's because the FFP path executes hand-tuned vertex shaders. I'll have to find out.

As for the VS differences, dynamic branches probably get unrolled on the R420 (we haven't heard about any true branches being added; Dave said SINCOS was the only new thing added to the vertex pipeline). Dynamic branches on NV40 probably need a smarter compiler. Static branches, I believe, are handled by compiling two versions of the shader, so the static branch performs worst of all due to a) the overhead of compiling and b) the overhead of switching (a state change).
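For reference, a static branch in the vs_2_x model is just an if over a boolean constant that is fixed for the whole draw call, so a driver could plausibly resolve it by baking out two specialized shader variants. A minimal sketch (register choices are mine):

Code:
defb b0, true       // in practice set per draw via SetVertexShaderConstantB
if b0
	mul r0, v1, c4  // variant A: taken when b0 is true
else
	mov r0, c5      // variant B: taken when b0 is false
endif

If the driver re-specializes naively whenever the boolean changes, the compile and switch overhead described above would show up exactly as measured.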

Interesting.
 
VS 2.0 doesn't do dynamic branches --> there are no results for the R420's dynamic VS branch ;)
 
Yeah, I went blind looking at that diagram. I thought each card had 4 bars and assumed the dynamic branch was working (technically, it could be made to work with write disablement)
 
Wouldn't the R420 be slower in the branching case than in the no-branching case if it executes both paths?
 
You mean, static branching should be faster on R420 than just executing both paths, right? Or are you talking about R420 vs. NV40?

I assume the code those shaders run is the same, except one of them uses CMP and the other uses IF_BOOL.
 
This is the code used in the test (I've left out the common parts):

non-branching:
Code:
// Lighting

				// 1

				add TempLight, c[LightPosition0], -TempPosition     // the light vector
				dp3 Temp.w, TempLight, TempLight
				rsq TempLight.w, Temp.w

				dst TempDistance, Temp.w, TempLight.w               // (1, d, d*d, 1/d)
				dp3 Temp.w, TempDistance, c[LightAttenuation0]      // (a0 + a1*d + a2*d^2)
				rcp TempAttenuation.w, Temp.w		            // 1 / (a0 + a1*d + a2*d^2)

				mul TempLight, TempLight, TempLight.w               // normalize the light vector

				add Temp, TempEye, TempLight		            // calculate half-vector (light vector + eye vector)
				nrm TempHalfAngle, Temp		                    // normalize half-vector

				dp3 TempLit.x, TempNormal, TempLight                // N*L
				dp3 TempLit.yz, TempNormal, TempHalfAngle           // N*H

				sge Temp.x, c[LightRange0].y, TempDistance.y        // (range >= d) ? 1:0
				mul TempLit.x, TempLit.x, Temp.x
				mul TempLit.y, TempLit.y, Temp.x

				lit Temp, TempLit		                    // calculate the diffuse & specular factors
				mul TempIntensity, Temp, TempAttenuation.w          // scale by attenuation

				mul Temp, TempIntensity.y, c[LightDiffuse0]	    // calculate diffuse color
				mad TempColor, Temp, c[MaterialDiffuse], TempColor  // add (diffuse color * material diffuse)

				mul Temp, TempIntensity.z, c[LightSpecular0]	    // calculate specular color
				mad TempColor, Temp, c[MaterialSpecular], TempColor // add (specular color * material specular)

				// 2

				add TempLight, c[LightPosition1], -TempPosition     // the light vector
				dp3 Temp.w, TempLight, TempLight
				rsq TempLight.w, Temp.w

				dst TempDistance, Temp.w, TempLight.w               // (1, d, d*d, 1/d)
				dp3 Temp.w, TempDistance, c[LightAttenuation1]      // (a0 + a1*d + a2*d^2)
				rcp TempAttenuation.w, Temp.w		            // 1 / (a0 + a1*d + a2*d^2)

				mul TempLight, TempLight, TempLight.w               // normalize the light vector

				add Temp, TempEye, TempLight		            // calculate half-vector (light vector + eye vector)
				nrm TempHalfAngle, Temp		                    // normalize half-vector

				dp3 TempLit.x, TempNormal, TempLight                // N*L
				dp3 TempLit.yz, TempNormal, TempHalfAngle           // N*H

				sge Temp.x, c[LightRange1].y, TempDistance.y        // (range >= d) ? 1:0
				mul TempLit.x, TempLit.x, Temp.x
				mul TempLit.y, TempLit.y, Temp.x

				lit Temp, TempLit		                    // calculate the diffuse & specular factors
				mul TempIntensity, Temp, TempAttenuation.w          // scale by attenuation

				mul Temp, TempIntensity.y, c[LightDiffuse1]	    // calculate diffuse color
				mad TempColor, Temp, c[MaterialDiffuse], TempColor  // add (diffuse color * material diffuse)

				mul Temp, TempIntensity.z, c[LightSpecular1]	    // calculate specular color
				mad TempColor, Temp, c[MaterialSpecular], TempColor // add (specular color * material specular)

				// 3

				add TempLight, c[LightPosition2], -TempPosition     // the light vector
				dp3 Temp.w, TempLight, TempLight
				rsq TempLight.w, Temp.w

				dst TempDistance, Temp.w, TempLight.w               // (1, d, d*d, 1/d)
				dp3 Temp.w, TempDistance, c[LightAttenuation2]      // (a0 + a1*d + a2*d^2)
				rcp TempAttenuation.w, Temp.w		            // 1 / (a0 + a1*d + a2*d^2)

				mul TempLight, TempLight, TempLight.w               // normalize the light vector

				add Temp, TempEye, TempLight		            // calculate half-vector (light vector + eye vector)
				nrm TempHalfAngle, Temp		                    // normalize half-vector

				dp3 TempLit.x, TempNormal, TempLight                // N*L
				dp3 TempLit.yz, TempNormal, TempHalfAngle           // N*H

				sge Temp.x, c[LightRange2].y, TempDistance.y        // (range >= d) ? 1:0
				mul TempLit.x, TempLit.x, Temp.x
				mul TempLit.y, TempLit.y, Temp.x

				lit Temp, TempLit		                    // calculate the diffuse & specular factors
				mul TempIntensity, Temp, TempAttenuation.w          // scale by attenuation

				mul Temp, TempIntensity.y, c[LightDiffuse2]	    // calculate diffuse color
				mad TempColor, Temp, c[MaterialDiffuse], TempColor  // add (diffuse color * material diffuse)

				mul Temp, TempIntensity.z, c[LightSpecular2]	    // calculate specular color
				mad TempColor, Temp, c[MaterialSpecular], TempColor // add (specular color * material specular)

				mov  OutColor, TempColor		            // final color

static branching:
Code:
// Lighting

				loop aL, i0

					add TempLight, c[LightPosition + aL], -TempPosition // light vector (aL indexes the current light's constants)
					dp3 Temp.w, TempLight, TempLight
					rsq TempLight.w, Temp.w

					dst TempDistance, Temp.w, TempLight.w               // (1, d, d*d, 1/d)
					dp3 Temp.w, TempDistance, c[LightAttenuation + aL]  // (a0 + a1*d + a2*d^2)
					rcp TempAttenuation.w, Temp.w		            // 1 / (a0 + a1*d + a2*d^2)

					mul TempLight, TempLight, TempLight.w               // normalize the light vector

					add Temp, TempEye, TempLight		            // calculate half-vector (light vector + eye vector)
					nrm TempHalfAngle, Temp		                    // normalize half-vector

					dp3 TempLit.x, TempNormal, TempLight                // N*L
					dp3 TempLit.yz, TempNormal, TempHalfAngle           // N*H

					sge Temp.x, c[LightRange + aL].y, TempDistance.y    // (range >= d) ? 1:0
					mul TempLit.x, TempLit.x, Temp.x
					mul TempLit.y, TempLit.y, Temp.x

					lit Temp, TempLit		                    // calculate the diffuse & specular factors
					mul TempIntensity, Temp, TempAttenuation.w          // scale by attenuation

					mul Temp, TempIntensity.y, c[LightDiffuse + aL]	    // calculate diffuse color
					mad TempColor, Temp, c[MaterialDiffuse], TempColor  // add (diffuse color * material diffuse)

					mul Temp, TempIntensity.z, c[LightSpecular + aL]    // calculate specular color
					mad TempColor, Temp, c[MaterialSpecular], TempColor // add (specular color * material specular)

				endloop

				mov  OutColor, TempColor		                    // final color

dynamic branching:
Code:
// Lighting

				loop aL, i0

					add TempLight, c[LightPosition + aL], -TempPosition                 // light vector (aL indexes the current light's constants)
					dp3 Temp.w, TempLight, TempLight
					rsq TempLight.w, Temp.w

					dst TempDistance, Temp.w, TempLight.w                               // (1, d, d*d, 1/d)

					if_lt TempDistance.y, c[LightRange + aL].x			    // Distance < Range

						dp3 Temp.w, TempDistance, c[LightAttenuation + aL]          // (a0 + a1*d + a2*d^2)
						rcp TempAttenuation.w, Temp.w		                    // 1 / (a0 + a1*d + a2*d^2)

						mul TempLight, TempLight, TempLight.w                       // normalize the light vector
						dp3 TempLit.x, TempNormal, TempLight                        // N*L

						if_gt TempLit.x, c[Zero].x				    // NdotL > 0

							add Temp, TempEye, TempLight		            // calculate half-vector (light vector + eye vector)
							nrm TempHalfAngle, Temp		                    // normalize half-vector

							dp3 TempLit.yz, TempNormal, TempHalfAngle           // N*H

							lit Temp, TempLit		                    // calculate the diffuse & specular factors
							mul TempIntensity, Temp, TempAttenuation.w          // scale by attenuation

							mul Temp, TempIntensity.y, c[LightDiffuse + aL]	    // calculate diffuse color
							mad TempColor, Temp, c[MaterialDiffuse], TempColor  // add (diffuse color * material diffuse)

							mul Temp, TempIntensity.z, c[LightSpecular + aL]    // calculate specular color
							mad TempColor, Temp, c[MaterialSpecular], TempColor // add (specular color * material specular)
						endif
					endif

				endloop

				mov  OutColor, TempColor // final color

Any idea why static branching caused a slowdown on the NV40?
 
I suppose one of Microsoft's GDC presentations answers the question of which hardware is which:

HLSL.ppt said:
Using static and dynamic flow control can dramatically slow down shaders

Static flow control is very cheap on some hardware, but not on some

When used correctly can dramatically decrease number of shader switches and associated CPU overhead

i.e. Unrolled loops that have variable end conditions can be dramatically faster than static looping on certain hardware
 
Sorta answers it... ;) I was more interested in the "why?" rather than simple confirmation that this isn't unusual. :)
 
I meant the benches answer which hardware can do static branching for close to free, and which it is expensive for :)
 
I've already asked this question in another thread, but got no answer :(.

Could someone explain to me why dynamic flow is faster than static flow on NV40?

[image: IMG0007772.gif -- vertex shader branching results graph from the article below]

http://www.hardware.fr/articles/494/page3.html
 
Doesn't this suggest that, in spite of the NV40 having both static and dynamic branching, the R420's static branching is actually the most usable vertex flow-control implementation?
This needs further investigation to be sure either way.
 
The "static branching" example is really a test of the constant looping mechanism, which may or may not be getting unrolled by the driver.

Technically, it could be made to run as fast as the first example (no branch) by just fully unrolling the loop.

From what I gather, NVidia has spent all of its effort on optimizing pixel shaders, which were their Achilles' heel on the NV3x. If you look at NVShaderPerf or FXComposer, there is no unified compiler for vertex shaders, whereas for pixel shaders they provide a library containing the same unified compiler that's inside the device drivers, in order to report GPU utilization statistics.

Their first drivers for the NV40 appear to be "get it out the door" drivers. They don't even support their video processing components that were demoed with OpenGL extensions at the launch event.

NVidia should have delayed their launch for a month or two. It would have benefited them in multiple ways: more time to mature the drivers, and a smaller gap between launch and volume shipping.

Edit: I tried a really simple test. I wrote a loop in HLSL which did nothing at all (its results are unused). FXComposer didn't eliminate the dead code. It appears to be doing very little HLSL optimization, just a straight translation to VS2.0. The exact same loop, when compiled for PS_2_0, optimizes away the dead code! I think the vertex shader compiler uses a different code base.
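The test was along these lines -- a loop whose result never reaches the shader output (this is my reconstruction, not the exact code):

Code:
// Hypothetical reconstruction: 'dead' never contributes to the output,
// so the entire loop should be eliminated as dead code.
float4 main(float4 pos : POSITION) : POSITION
{
    float4 dead = 0;
    for (int i = 0; i < 8; i++)
        dead += pos;    // accumulates into a value that is never read
    return pos;         // the output does not depend on 'dead'
}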

Edit: Update 2, really bizarre. I got FXComposer to do loop hoisting and induction variable lifting! But it still leaves dead code behind, like the following:

Code:
rep i0
endrep

Clearly the compiler is brittle and doesn't respond well to every situation.
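(For clarity, by "loop hoisting" I mean pulling loop-invariant work out of the body -- roughly this transformation, with registers of my own choosing:)

Code:
defi i0, 8, 0, 0, 0     // 8 iterations (rep only uses the count)

// before: the mul result is loop-invariant but recomputed every iteration
rep i0
	mul r1, c4, c5
	add r0, r0, r1
endrep

// after hoisting: the invariant product is computed once, outside the loop
mul r1, c4, c5
rep i0
	add r0, r0, r1
endrep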
 
It's still weird. Static branching should be at least as fast as dynamic branching because you can always replace the static branch with a dynamic one.
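In vs_2_x terms, you could always mirror the boolean in a float constant and take the dynamic path instead (a sketch; the constant assignments are mine):

Code:
defb b0, true        // static version branches on the boolean constant
if b0
	mul r0, v1, c4
endif

// dynamic equivalent: c10.x holds 1.0 where b0 would be true, 0.0 otherwise
mov r7.x, c10.x
if_gt r7.x, c11.x    // c11.x = 0.0
	mul r0, v1, c4
endif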
 
Evildeus said:
Could someone explain to me why dynamic flow is faster than static flow on NV40?
Because some code is skipped for vertices that fail the branching condition?
 
BTW, I noticed offhand this idiom
Code:
sge Temp.x, c[LightRange0].y, TempDistance.y        // (range >= d) ? 1:0
mul TempLit.x, TempLit.x, Temp.x 
mul TempLit.y, TempLit.y, Temp.x

can be written like this
Code:
sge Temp.xy, c[LightRange0].yy, TempDistance.yy        // (range >= d) ? 1:0
mul TempLit.xy, TempLit.xy, Temp.xy

To save one instruction per light. (Actually, you could use Temp.xx and avoid touching the extra components at all, freeing them up for later use. One benefit of HLSL is that declaring datatypes like float/2/3/4 lets the compiler pack registers and use write masks efficiently everywhere.)
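That is, something like this (my sketch of that Temp.xx suggestion; in asm a single-component source swizzle replicates across the write mask):

Code:
sge Temp.x, c[LightRange0].y, TempDistance.y   // (range >= d) ? 1:0
mul TempLit.xy, TempLit.xy, Temp.x             // Temp.x replicates; .yzw stay free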
 
991060 said:
Evildeus said:
Could someone explain to me why dynamic flow is faster than static flow on NV40?
Because some code is skipped for vertices that fail the branching condition?

Logically it's the other way around: the very nature of static flow control implies that all vertex instances take the same execution path, which makes it much easier for hardware to skip unexecuted instructions than in the dynamic case.

John.
 
Evildeus said:
Could someone explain to me why dynamic flow is faster than static flow on NV40?
Can't be anything but driver issues. It appears that to date, nVidia has not spent much time optimizing the drivers for branching situations.
 
It wouldn't surprise me if NV didn't have any dedicated static branching support, and instead does it with dynamic branching.

Now, if I'm not mistaken, isn't it true that NV didn't have any constant registers or something? That could mean bad things for static branching.
 