Interesting RTHDRIBL results @ Hexus

ERP · May 5, 2004

I was just scanning through the shader mark results, and it's just plain odd the way both cards stack up.

I haven't really looked at the shaders in detail, but based on how I'd write them I just can't see what the pattern is in terms of performance. Has anyone here looked in detail at the shaders in shader mark?

davepermen · May 5, 2004

DemoCoder said:
What do you mean HW does the scheduling? The HW is not going to replace a dp3/rsq/mul with a free normal. The HW is not going to replace an ADD/MUL with a dual-issue automatically. This is done by the compiler packing both into a VLIW instruction. And the HW certainly isn't going to do register allocation and expression inlining.

I took a look at some of the RTHDRIBL shaders via 3DAnalyze, many look to be inefficient, and almost look like they were written by hand, since they don't even do constant folding.

Let me give you an example. This fragment is from a 3DAnalyze dump of RTHDRIBL:

Code:

def c3, 256.000000, 0.111031, 0.000000, -128.000000
mad_pp r4.w, r0.wwww, c3.xxxx, c3.wwww
mul r11.w, r4.wwww, c3.yyyy

Here we have
r11.w = r4.w * c3.y,

substituting for r4.w we have
r11.w = (r0.w * c3.x + c3.w) * c3.y

= r11.w = (r0.w * c3.x*c3.y + c3.w*c3.y)
= with constant folding
= r11.w = (r0.w * c_fold.x + c_fold.y)

lets us rewrite as
MAD r11.w, r0.w, c0.x, c0.y

Just saved 1 register and 1 instruction. I saw one shader that used 11 registers when only 2 were really needed, and most of those 11 were scalars! (r1.x, r2.w, r3.y), and they hadn't hit any port limits, so there was no justifiable reason not to reuse dead registers or pack the scalars.

as it is quite old, i think it was indeed written by hand.. but i'm unsure.. by hand, or with a very first release HLSL compiler..
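The fold DemoCoder describes can be checked numerically. What follows is only a minimal sketch in Python, not shader code: it takes the constants from the quoted `def c3` line, evaluates the original mad + mul pair and the single folded mad, and compares the two. The function names and `FOLD_X` / `FOLD_Y` are made up for illustration. It also ignores the `_pp` partial-precision hint, so on real hardware the two forms could differ in the last few bits.

```python
# Constants from the quoted fragment: def c3, 256.0, 0.111031, 0.0, -128.0
C3_X, C3_Y, C3_Z, C3_W = 256.0, 0.111031, 0.0, -128.0


def original(r0_w):
    """Two instructions, as dumped from the shader."""
    r4_w = r0_w * C3_X + C3_W  # mad_pp r4.w, r0.w, c3.x, c3.w
    return r4_w * C3_Y         # mul    r11.w, r4.w, c3.y


# Folded constants, computed once ahead of time (hypothetical names).
FOLD_X = C3_X * C3_Y
FOLD_Y = C3_W * C3_Y


def folded(r0_w):
    """One instruction after constant folding."""
    return r0_w * FOLD_X + FOLD_Y  # mad r11.w, r0.w, c_fold.x, c_fold.y


if __name__ == "__main__":
    for v in (0.0, 0.25, 0.5, 1.0):
        print(v, original(v), folded(v))
```

Both functions agree to within floating-point rounding for any input, which is why the intermediate r4 register and the extra mul are unnecessary.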

KimB · May 5, 2004

DemoCoder said:
What do you mean HW does the scheduling? The HW is not going to replace a dp3/rsq/mul with a free normal. The HW is not going to replace an ADD/MUL with a dual-issue automatically. This is done by the compiler packing both into a VLIW instruction.

From what I've read, the NV3x was VLIW. The NV4x is not. The instructions carry less information about direct hardware scheduling than on the NV3x, with some of that taken care of by the hardware.

I suspect that the improvement in the NV4x's performance was due both to nVidia producing the compiler along with the architecture, and due to the hardware taking over parts of the low-level scheduling that would be challenging without knowledge of runtime data.

Obviously there are other higher-level things that the compiler must take care of (for example, I suspect that better register allocation in future drivers will help the NV4x's performance).

Code:

def c3, 256.000000, 0.111031, 0.000000, -128.000000
mad_pp r4.w, r0.wwww, c3.xxxx, c3.wwww
mul r11.w, r4.wwww, c3.yyyy

Here we have
r11.w = r4.w * c3.y,

substituting for r4.w we have
r11.w = (r0.w * c3.x + c3.w) * c3.y

= r11.w = (r0.w * c3.x*c3.y + c3.w*c3.y)
= with constant folding
= r11.w = (r0.w * c_fold.x + c_fold.y)

lets us rewrite as
MAD r11.w, r0.w, c0.x, c0.y

Just saved 1 register and 1 instruction. I saw one shader that used 11 registers when only 2 were really needed, and most of those 11 were scalars! (r1.x, r2.w, r3.y), and they hadn't hit any port limits, so there was no justifiable reason not to reuse dead registers or pack the scalars.

Wow, yeah, that's pretty bad. I suppose this is one reason higher-level languages should really be used.

KimB · May 5, 2004

ERP said:
I was just scanning through the shader mark results, and it's just plain odd the way both cards stack up.

I haven't really looked at the shaders in detail, but based on how I'd write them I just can't see what the pattern is in terms of performance. Has anyone here looked in detail at the shaders in shader mark?

Might be interesting, but I think we'll see things change significantly with later drivers anyway, so I'm not sure it will really make a huge difference.

ERP · May 5, 2004

I was trying to establish if it's a difference in texture performance vs ALU op throughput. It's just out of interest on my part.

Since I'm one of the few people who has a reason to care about SM3.0, ATI's part is useless to me anyway.

One thing that did occur to me with RTHDRIBL (or whatever) is that it uses a 16-bit target for the HDR image and then textures from it. Since ATI doesn't support FP texture filtering, it's possible that the code doesn't set it to point sampling explicitly and that NV40 is filtering the FP texture.

Of course it could be any number of other issues as well.

DemoCoder · May 5, 2004

ERP, wow, I thought of the exact same thing, but had no way to verify it! And RTHDRIBL uses a hell of a lot of texture ops per shader too.

991060 · May 5, 2004

Hmm, if there's a difference in the filtering, we should be able to find some visual differences.

ERP · May 5, 2004

Not necessarily, it's probably already doing a blur filter in the shader, which would pretty much make it impossible to tell if the individual samples were bilinear or point sampled.

I have no idea how you would tell without the source code. Probably the easiest is to email the author and ask.
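To illustrate why a blur in the shader would hide the per-sample filtering mode, here is a toy 1D sketch in Python. It is an assumption-laden stand-in, not RTHDRIBL's actual shader: the step-edge texture, the 9-tap box blur, and all function names are invented for the example. The sample position is deliberately the worst case, right on a hard edge.

```python
import math

# Toy 1D "texture": a hard step edge, 8 dark texels then 8 bright ones.
TEX = [0.0] * 8 + [1.0] * 8


def fetch(i):
    """Clamp-to-edge texel fetch."""
    return TEX[max(0, min(len(TEX) - 1, i))]


def sample_point(u):
    """Point sampling; u is in texel units and texel i covers [i, i + 1)."""
    return fetch(math.floor(u))


def sample_linear(u):
    """Linear filtering between the two nearest texel centres (at i + 0.5)."""
    x = u - 0.5
    i0 = math.floor(x)
    f = x - i0
    return fetch(i0) * (1.0 - f) + fetch(i0 + 1) * f


def box_blur(sampler, u, radius=4):
    """Average 2 * radius + 1 taps spaced one texel apart, as a shader blur might."""
    taps = [sampler(u + k) for k in range(-radius, radius + 1)]
    return sum(taps) / len(taps)


u = 8.0  # exactly on the edge between texel 7 and texel 8
single_tap_diff = abs(sample_point(u) - sample_linear(u))
blurred_diff = abs(box_blur(sample_point, u) - box_blur(sample_linear, u))

if __name__ == "__main__":
    print("single tap difference:", single_tap_diff)
    print("after 9-tap blur:     ", blurred_diff)
```

On the edge, a single tap differs by 0.5 between the two modes, but after the 9-tap average only one tap disagrees, so the difference drops to 0.5 / 9. With wider blurs or smoother source images the gap shrinks further, which fits the point that a screenshot alone is unlikely to settle the question.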

991060 · May 5, 2004

Right, but we can extract the shader code with 3DAnalyze; all we need to do is go through the code looking for filtering.

Asking the author may not be a bad idea either; it saves us some time at least.

And judging from the screen shot, it seems to me that there's already a blur filter applied on the background.
http://img.hexus.net/v2/graphics_cards/ati/r420/images/rthdribl_big.png

Do you think nVIDIA should provide a control button to disable the FP filtering when necessary? If the programmer implements the filtering in the shader, it's pointless and harmful to do repeated work.

hstewarth · May 5, 2004

LeStoffer said:
Some RightMark 3D numbers:

RightMark test                     6800 Ultra   X800 XT PE
Procedural Wood - PS1.4            554.4        383.1
Procedural Marble - PS2.0          414.4        594.5
Procedural Marble - PS2.0 FP16     413.1        595.1
Lighting (Blinn) - PS2.0           333.6        495.2
Lighting (Blinn) - PS2.0 FP16      421.6        493.7
Lighting (Phong) - PS2.0           167.7        277.9
Lighting (Phong) - PS2.0 FP16      235.1        277.9

Okay, on second thought maybe DemoCoder has a point here regarding reuse of NV3X code.

This makes me curious about one thing: I remember that in the past some game developers (Half-Life 2) detected NV3x and used 1.x shaders instead of 2.0 shaders. If other applications do this when running on a 6800, then they will not be taking full advantage of the new chip.

Just something that came to mind when reading this thread.

KimB · May 5, 2004

991060 said:
Do you think nVIDIA should provide a control button to disable the FP filtering when necessary? If the programmer implements the filtering in the shader, it's pointless and harmful to do repeated work.

Huh? FP filtering will only be used when requested. If the program requested filtering while assuming that it wouldn't be done because it's an FP target, well, then that's just sloppy.

AlphaWolf · May 5, 2004

hstewarth said:
This makes me curious about one thing: I remember that in the past some game developers (Half-Life 2) detected NV3x and used 1.x shaders instead of 2.0 shaders. If other applications do this when running on a 6800, then they will not be taking full advantage of the new chip.

Just something that came to mind when reading this thread.

You're suggesting that this demo (RTHDRIBL) is doing it?

The reason developers default to a lower shader is to get past a hardware's limitations, rather than hide its strengths. Do you really think Core and Crytek (and their TWIMTBP titles) are sabotaging nvidia's performance by forcing more 1.1 shaders instead of 2.0 on them?

DemoCoder · May 5, 2004

Well, I used 3DAnalyze to log every DX9 call. Judging by the results, it does appear that RTHDRIBL is asking for D3DTEXF_ANISOTROPIC on float textures, but it's hard to track because of the voluminous data. There appears to be a call to CreateCubeTexture for D3DFMT_A16R16G16B16F, followed by a SetTexture for sampler stage 0, followed by a SetSamplerState to D3DTEXF_ANISOTROPIC on that sampler, but I'm not totally sure because I was jumping around a lot in a text file with thousands of lines trying to track back hex addresses.

Sxotty · May 6, 2004

The 7 posts just mysteriously disappeared, eh? Well, I suppose it is for the best.

DemoCoder · May 6, 2004

I'm leaning more towards shader problems myself, but there is heavy use of FP cube maps going on.

Pete · May 11, 2004

Any updates to RTHDRIBL performance? Has anyone contacted the author?

Rys · May 12, 2004

I know NVIDIA are having a look at it themselves, they asked me this morning how I did the testing for RTHDRIBL.

Rys

Pete · May 15, 2004

Bump (before it falls off the front page and I forget) to remind ppl to post if either nV or the "dribble" author responds.

Rys · May 17, 2004

I've asked NVIDIA what optimisation for RTHDRIBL's case they're performing, since it's a somewhat unique problem. I'll post when they reply.

Rys

Evildeus · May 17, 2004

Let us know
