G7x vs R580 Architectural Efficiency

DemoCoder said:
I'm just counting the norms. (2 norms expanded out; norm++ means "+1 norm") There are 4 norms being done. These potentially execute on NV HW with single-cycle throughput if the compiler can recognize them. I'm not saying anything about how you could rewrite them with macros or not (although using NRM would probably be superior, since it is easier for the compiler to map it)

I think it is a relevant observation that roughly 1/3 of the instructions are performing norms.
Ah, I see. But for nVidia hardware, you'd want to add an extra MUL for each TEX, making it:
3:27 for ATI
3:19 for NV
(I think you're off by one on your 3:15 number, because the MAD would become an ADD, not be taken away entirely)

Still a big difference, of course.
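To make the counting concrete, here is a minimal sketch (an editorial illustration, not from the thread) of how each "nrm" expands into the three ALU ops being tallied above: a dp3, an rsq, and a mul. Four norms at three ops each is how roughly a third of a ~30-instruction shader ends up being normalization.
Code:
    import numpy as np

    # Each nrm macro expands to three ALU instructions:
    def nrm_expanded(v):
        d = np.dot(v, v)     # dp3: squared length of v
        inv_len = d ** -0.5  # rsq: reciprocal square root
        return v * inv_len   # mul: scale v to unit length

    print(nrm_expanded(np.array([3.0, 4.0, 0.0])))  # [0.6 0.8 0. ]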
 
OpenGL guy said:
I think you both should go back and reread what the shader is doing: Those first two "nrms" aren't nrms.
I believe the key words are "can be nrms".
 
Reverend said:
I believe the key words are "can be nrms".
Code:
    dp3 r6.x, t1, t1             <--- r6.x = t1 * t1
    dp3 r3.x, t2, t2             <--- r3.x = t2 * t2
    rsq r2.w, r6.x               <--- r2.w = length of t1
    rsq r0.w, r3.x               <--- r0.w = length of t2
    mul r3.xyz, r2.w, t1         <--- r3.xyz = nrm t1
    mad r5.xyz, t2, r0.w, r3     <--- r5.xyz = nrm t1 + nrm t2
    nrm r4.xyz, r5
    add r5.xyz, r2, c3.x
    nrm r2.xyz, r5
    dp3_sat r4.x, r2, r4
    dp3_sat r2.x, r2, r3
    mul r0.w, r1.w, c2.x
    mul r1.xyz, r1, c1
    pow r1.w, r4.x, r0.w         <--- r1.w = r4.x ^ r0.w = r4.x ^ length t2
    mov_sat r0.w, r6.x           <--- r0.w = min{ 1, t1 * t1 }
The intermediate values are used elsewhere, so you gain nothing by using "free" normalization.
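A rough Python paraphrase of that fragment (my reading of it, hedged) makes the point visible: the squared length of t1 (r6.x) is consumed again by the final mov_sat, and the normalized t1 (r3) feeds the second dp3_sat, so a compiler that collapsed the opening sequence into opaque normalizes would have to recompute those values.
Code:
    import numpy as np

    def fragment(t1, t2, r2_in, r1, c1, c2_x, c3_x):
        r6x = np.dot(t1, t1)                 # dp3: |t1|^2, reused at the end
        r3x = np.dot(t2, t2)                 # dp3: |t2|^2
        r2w = r6x ** -0.5                    # rsq: 1/|t1|
        r0w = r3x ** -0.5                    # rsq: 1/|t2|
        r3 = r2w * t1                        # mul: nrm t1, reused below
        r5 = t2 * r0w + r3                   # mad: nrm t1 + nrm t2
        r4 = r5 / np.linalg.norm(r5)         # nrm
        r5 = r2_in + c3_x                    # add
        r2 = r5 / np.linalg.norm(r5)         # nrm
        r4x = np.clip(np.dot(r2, r4), 0, 1)  # dp3_sat
        r2x = np.clip(np.dot(r2, r3), 0, 1)  # dp3_sat: needs the kept nrm t1
        r0w = r1[3] * c2_x                   # mul
        r1_rgb = r1[:3] * c1                 # mul
        r1w = r4x ** r0w                     # pow
        r0w = np.clip(r6x, 0, 1)             # mov_sat: needs the kept |t1|^2
        return r1_rgb, r1w, r2x, r0w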
 
OpenGL guy said:
I think you both should go back and reread what the shader is doing: Those first two "nrms" aren't nrms.

Yet you go on to say

mul r3.xyz, r2.w, t1 <--- r3.xyz = nrm t1
mad r5.xyz, t2, r0.w, r3 <--- r5.xyz = nrm t1 + nrm t2

Claiming that they are nrms (the combined mad is just an optimization and irrelevant; fact is, NRM T1 and NRM T2 were computed)

The intermediate values are used elsewhere, so you gain nothing by using "free" normalization.

Yes, I didn't step through the entire shader, but that doesn't sabotage my overall point, which is that normalization seems a very common operation in shaders, and that "free nrm" seems like a valuable thing to have in hardware. It may even be possible to create an extension of NRM which gives a "free length" in a separate register as well.
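A sketch of what that hypothetical extension could look like (the nrm_with_length name and output layout are my invention): the rsq intermediate is already computed, so the length falls out of one extra mul.
Code:
    import numpy as np

    def nrm_with_length(v):
        d = np.dot(v, v)                  # dp3: squared length
        inv_len = d ** -0.5               # rsq
        return v * inv_len, d * inv_len   # unit vector, plus |v| = |v|^2 * rsq

    u, l = nrm_with_length(np.array([3.0, 4.0, 0.0]))
    print(u, l)  # [0.6 0.8 0. ] 5.0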
 
OpenGL guy said:
The intermediate values are used elsewhere, so you gain nothing by using "free" normalization.
So isn't it possible that the "miracle FEAR drivers" use some kind of shader replacement using an FP16 nrm?
 
DemoCoder said:
Claiming that they are nrms (the combined mad is just an optimization and irrelevant; fact is, NRM T1 and NRM T2 were computed)
The computations involved may normalize two vectors, but the intermediate values are also used, hence it's not "nrm". If it were just "nrm" then the intermediate results would be tossed out.
Yes, I didn't step through the entire shader, but that doesn't sabotage my overall point, which is that normalization seems a very common operation in shaders, and that "free nrm" seems like a valuable thing to have in hardware. It may even be possible to create an extension of NRM which gives a "free length" in a separate register as well.
So now you want to add more "useful" stuff. How much "useful" stuff do you budget? How often will they be used? Also note that the current "free nrm" is FP16 only, so that will limit its usefulness.
 
no-X said:
So isn't it possible that the "miracle FEAR drivers" use some kind of shader replacement using an FP16 nrm?
There are two "nrm"s in the shader that could be replaced by an FP16 "nrm", but that's not part of the spec.
 
OpenGL guy said:
The intermediate values are used elsewhere, so you gain nothing by using "free" normalization.
Okay, fine. But the intermediate values still don't require the full nrm calculation. So this bumps the instruction counts between the two up to:
3:27 for ATI
3:24 for NV

So for that particular shader it isn't that much of an improvement, but it is still an improvement.

P.S. A little nitpicky thing, but it's 1/length that's stored, not the length.
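The nitpick in code form (mine): rsq writes the reciprocal square root, so the register holds 1/length, and recovering the length itself costs an extra mul (or an rcp).
Code:
    import math

    len_sq = 25.0                  # what dp3 t1, t1 produces
    rsq = 1.0 / math.sqrt(len_sq)  # what r2.w actually holds: 0.2, not 5.0
    length = len_sq * rsq          # one extra mul recovers the length: 5.0
    print(rsq, length)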
 
Beafy said:
Actually, it was the other way around: Since the ALU computation was slow, JC (or whoever did the shader) traded it for a texture lookup. This was only possible because the results of the computations were very predictable and easily saved in a 1x8 (or similar) sized texture. This won't be as easily possible in a more advanced scenario...
That's not quite the case. The texture was one-dimensional and expressed a function; until recently that was a reasonably common use of lookup tables, since some hardware was/is nowhere near as fast at ALU as it is at texturing ;).

Should anyone wish to do experiments modifying the Doom3 shader they can do so easily ;).
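For anyone trying such experiments, here is the lookup-table idea in miniature (illustrative numbers; the exponent is my assumption, not Doom3's actual shader): bake a 1D function such as a specular power curve into a small table and replace the per-pixel pow with a fetch.
Code:
    import numpy as np

    SPECULAR_EXP = 8.0  # assumed exponent, purely for illustration
    table = np.linspace(0.0, 1.0, 256) ** SPECULAR_EXP  # the "1D texture"

    def lookup(x):
        # nearest-neighbor fetch; real hardware would filter the texture
        return table[int(np.clip(x, 0.0, 1.0) * 255)]

    print(lookup(0.9), 0.9 ** SPECULAR_EXP)  # table value vs exact ALU result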
 
OpenGL guy said:
So now you want to add more "useful" stuff. How much "useful" stuff do you budget? How often will they be used? Also note that the current "free nrm" is FP16 only, so that will limit its usefulness.

Well, R5xx has native SINCOS budgeted, and I didn't see many trig functions in that shader. :) This is not to say that trig functions aren't "useful". (BTW, on further thought, are they really spending significant gates on it? I don't know how NV implements the free NRM_16, but it stands to reason that, since NVidia's TEX unit is combined with ALU1 and they have sphere-map/cube-map address calculation hardware there, some of those gates can be reused to calculate the intersection of a vector at the origin with the unit cube/unit sphere, to come up with a first approximation to the normal.)
 
DemoCoder said:
Of course, but it's well known that only about 17 bits of precision are needed to represent a normal adequately (no real perceptual loss). FP16 uses about 10. So what Nvidia really needs is NRM_24 (FP24, s15e8), and that would probably be good enough to be perceptually lossless.
I was hoping you'd reply to my post in the other thread when you mentioned this. Especially the part about how cheap a 17-bit renormalizer would be.

Anyway, FP16 normalization doesn't get you anywhere near 17 bits of precision, but rather about 11 bits. That's a big difference, and not enough for a highly specular reflection on a smooth surface. If a point light's reflection has a FWHM angular spread of 10 degrees, you need 10^-4 accuracy for only 38 luminance steps between fully and half lit. If you're doing a cube map texture lookup with a per-pixel reflection vector, the precision requirements are even heavier, since you need a lot of accuracy for a visually continuous reflection image with filtering.
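A back-of-envelope check of that argument (my arithmetic, not Mintmaster's exact numbers): FP16 spaces values in [0.5, 1) by 2^-11, and a high specular exponent amplifies one such step of N.H into a visible luminance band.
Code:
    import numpy as np

    step = float(np.spacing(np.float16(0.9)))  # FP16 ulp in [0.5, 1): ~4.9e-4

    exponent = 100.0                           # a tight specular lobe
    n_dot_h = 0.99
    band = (n_dot_h + step) ** exponent - n_dot_h ** exponent
    print(step, band)  # ~0.019: nearly a 2% luminance jump per FP16 step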

Also, this shader is doing a per-pixel halfway vector calculation, which I don't think will be very common practice, and when it is necessary I doubt FP16 accuracy will be sufficient. Adding a constant to the texture input and renormalizing is also peculiar.

I don't see normalization (esp. FP16) being an appreciable percentage of the workload for shaders in the future. Might as well use the math logic for other uses. It was something that helped short lighting shaders such as those in Doom3 and earlier games. Furthermore, nrm is not adequate at times, like when you use a two channel normal texture (3Dc or R16G16).
 
Well, Mintmaster, that all depends. I mean, are you talking about 17-bit integer accuracy required? Because FP16 will be as good as that for small enough values. I guess my real question is: where in coordinate space do precision problems creep in? Is it uniform with integer normalization? If it's concentrated at the poles, then FP will clearly have an advantage. But is it?
 
Razor1 said:
If it was fillrate limited due to stencil shadows, then the higher the res goes, the more it should go in favor of nV, not the opposite. Personally this was a question I had a long time ago and wasn't able to figure out, so it's good that we are talking about it.
Razor1 said:
How can it be fillrate limited if your resolutions are 800x600?
You have some serious misunderstandings here.

If a game is fillrate or bandwidth limited at one resolution, you will be fillrate limited at all resolutions unless you become CPU or geometry limited. (Note that the CPU limit depends on drivers too.) If you have more pixels on the screen, you have more shaders to run. The instructions-to-pixel ratio does not change with resolution. The bandwidth needed per pixel changes very little with resolution.

In FEAR, OpenGL guy said ATI's cards spend about half their time rendering shadows (here they're fillrate limited) and half their time running shaders. Improving shader speed only helps the latter, which is why R580 was not twice as fast as R520, even though RV530 was twice as fast as RV515.

G70 has almost twice the stencil fillrate of R580 without AA, but has a bit less with AA. You can't draw any conclusions about the in-game shader performance by looking at no-AA results in FEAR. Also, it's clear that ATI is tweaking their memory controller for AA performance. They may not have bothered with high efficiency for the cases without AA/AF (as witnessed by their 3DMark ST fillrate results), as most of the target market doesn't care.
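A toy model of that point (my illustration, with made-up per-pixel costs): GPU work per frame scales with pixel count while CPU/driver work does not, so a fillrate-limited game stays fillrate-limited as resolution rises, and can only stop being so by hitting the CPU wall at low resolutions.
Code:
    FILL_COST_NS = 4.0    # assumed fill/bandwidth cost per pixel
    SHADER_COST_NS = 6.0  # assumed shader cost per pixel
    CPU_MS = 8.0          # assumed CPU/driver time per frame

    for w, h in [(800, 600), (1280, 1024), (1600, 1200)]:
        gpu_ms = w * h * (FILL_COST_NS + SHADER_COST_NS) * 1e-6
        frame_ms = max(gpu_ms, CPU_MS)  # the slower side paces the frame
        print(f"{w}x{h}: GPU {gpu_ms:.1f} ms -> frame {frame_ms:.1f} ms")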
 
Mintmaster said:
You have some serious misunderstandings here.

If a game is fillrate or bandwidth limited at one resolution, you will be fillrate limited at all resolutions unless you become CPU or geometry limited. (Note that the CPU limit depends on drivers too.) If you have more pixels on the screen, you have more shaders to run. The instructions-to-pixel ratio does not change with resolution. The bandwidth needed per pixel changes very little with resolution.

In FEAR, OpenGL guy said ATI's cards spend about half their time rendering shadows (here they're fillrate limited) and half their time running shaders. Improving shader speed only helps the latter, which is why R580 was not twice as fast as R520, even though RV530 was twice as fast as RV515.

G70 has almost twice the stencil fillrate of R580 without AA, but has a bit less with AA. You can't draw any conclusions about the in-game shader performance by looking at no-AA results in FEAR. Also, it's clear that ATI is tweaking their memory controller for AA performance. They may not have bothered with high efficiency for the cases without AA/AF (as witnessed by their 3DMark ST fillrate results), as most of the target market doesn't care.

I don't think FEAR is fillrate limited at all, though. As the res goes up without AA and AF, it goes in favor of ATi; this is what didn't make sense, unless it's bandwidth limited on the G70s, which Dave hinted at. If it was fillrate limited, nV would be the one gaining since, as you mentioned, they have double the fillrate without AA.
 
Chalnoth said:
Well, Mintmaster, that all depends. I mean, are you talking about 17-bit integer accuracy required? Because FP16 will be as good as that for small enough values. I guess my real question is: where in coordinate space do precision problems creep in? Is it uniform with integer normalization? If it's concentrated at the poles, then FP will clearly have an advantage. But is it?
Don't forget the z-axis for nearly "up" normals. In a normal vector, at least one component will be over 0.57, so it doesn't matter where in coordinate space you're doing your calculations. The dot product will be notably discretized with an FP16 normal. For the accuracy, I'm talking about whatever DC is talking about.

FP16 may have a slight advantage over integer for a 2-channel texture format storing mostly small perturbations, because you can derive the third coordinate in full precision, but that's irrelevant to the discussion since we need all three for lighting.
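A quick numeric check of the 0.57 claim (my verification, not Mintmaster's): because x^2 + y^2 + z^2 = 1, the largest component of a unit normal can never fall below 1/sqrt(3), so at least one component always sits up where FP16 spacing is coarse.
Code:
    import numpy as np

    rng = np.random.default_rng(0)
    v = rng.normal(size=(100000, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # random unit normals
    print(np.abs(v).max(axis=1).min())             # >= 1/sqrt(3) ~ 0.577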
 
But memory bandwidth doesn't become more of a constraint at high res. If anything, it becomes less constraining compared to the fillrate requirements of the higher resolution.
 
Chalnoth said:
But memory bandwidth doesn't become more of a constraint at high res. If anything, it becomes less constraining compared to the fillrate requirements of the higher resolution.

Well, it doesn't make sense otherwise, unless there is a shift in bottleneck from somewhere nV should have a pronounced efficiency lead to somewhere ATi has a pronounced lead.
 
Mintmaster said:
Don't forget the z-axis for nearly "up" normals. In a normal vector, at least one component will be over 0.57, so it doesn't matter where in coordinate space you're doing your calculations.
Why would that be?
 
Razor1 said:
Well, it doesn't make sense otherwise, unless there is a shift in bottleneck from somewhere nV should have a pronounced efficiency lead to somewhere ATi has a pronounced lead.
It could be a lot of things, but it's not memory bandwidth.

It could be due to ATI's memory bandwidth savings techniques preferring high resolution.
It could be due to nVidia having lower CPU usage in their drivers, which would tend to inflate the scores for low resolutions.
It could be due to ATI's texture caches being a bit better for magnification, which is more prevalent at high res.
 