Interesting RTHDRIBL results @ Hexus

I was just scanning through the shader mark results, and it's just plain odd the way both cards stack up.

I haven't really looked at the shaders in detail, but based on how I'd write them I just can't see what the pattern is in terms of performance. Has anyone here looked in detail at the shaders in ShaderMark?
 
DemoCoder said:
What do you mean HW does the scheduling? The HW is not going to replace a dp3/rsq/mul with a free normal. The HW is not going to replace an ADD/MUL with a dual-issue automatically. This is done by the compiler packing both into a VLIW instruction. And the HW certainly isn't going to do register allocation and expression inlining.

I took a look at some of the RTHDRIBL shaders via 3DAnalyze; many appear inefficient, almost as if they were written by hand, since they don't even do constant folding.

Let me give you an example. This fragment is from a 3DAnalyze dump of RTHDRIBL:

Code:
def c3 , 256.000000, 0.111031, 0.000000, -128.000000

mad_pp r4.w , r0.wwww , c3.xxxx , c3.wwww 
mul r11.w , r4.wwww , c3.yyyy

Here we have
r11.w = r4.w * c3.y,

substituting for r4.w we have
r11.w = (r0.w * c3.x + c3.w) * c3.y
      = r0.w * (c3.x * c3.y) + (c3.w * c3.y)

with constant folding (c_fold.x = c3.x * c3.y, c_fold.y = c3.w * c3.y):
r11.w = r0.w * c_fold.x + c_fold.y

which lets us rewrite as
MAD r11.w, r0.w, c0.x, c0.y

Just saved 1 register and 1 instruction. I saw one shader that used 11 registers when only 2 were really needed, and most of those 11 held scalars! (r1.x, r2.w, r3.y), and they hadn't hit any port limits, so there was no justifiable reason not to reuse dead registers or pack the scalars.
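As a sanity check on the folding above, here's a minimal Python sketch (not shader code; the constants come from the shader's "def c3" line, but the helper names are mine) showing the folded single MAD matches the original mad/mul pair for any input:

```python
# Sketch: numerically verify that folding the two constants into one
# MAD is equivalent to the original mad/mul pair.
c3_x, c3_y, c3_w = 256.0, 0.111031, -128.0

def original(r0_w):
    # mad_pp r4.w, r0.w, c3.x, c3.w
    r4_w = r0_w * c3_x + c3_w
    # mul r11.w, r4.w, c3.y
    return r4_w * c3_y

# Fold the constants at compile time:
c_fold_x = c3_x * c3_y   # 256 * 0.111031  =  28.423936
c_fold_y = c3_w * c3_y   # -128 * 0.111031 = -14.211968

def folded(r0_w):
    # MAD r11.w, r0.w, c_fold.x, c_fold.y
    return r0_w * c_fold_x + c_fold_y

for v in (0.0, 0.25, 0.5, 1.0):
    assert abs(original(v) - folded(v)) < 1e-9
```

(On real hardware the two forms could differ very slightly at fp16/fp24 precision, but the saving of one instruction and one register stands either way.)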

As it is quite old, I think it was indeed written by hand... but I'm unsure: by hand, or with a very early release of the HLSL compiler.
 
DemoCoder said:
What do you mean HW does the scheduling? The HW is not going to replace a dp3/rsq/mul with a free normal. The HW is not going to replace an ADD/MUL with a dual-issue automatically. This is done by the compiler packing both into a VLIW instruction.
From what I've read, the NV3x was VLIW. The NV4x is not. The instructions carry less information about direct hardware scheduling than on the NV3x, with some of that taken care of by the hardware.

I suspect that the improvement in the NV4x's performance was due both to nVidia producing the compiler along with the architecture, and due to the hardware taking over parts of the low-level scheduling that would be challenging without knowledge of runtime data.

Obviously there are other higher-level things that the compiler must take care of (for example, I suspect that better register allocation in future drivers will help the NV4x's performance).

Code:
def c3 , 256.000000, 0.111031, 0.000000, -128.000000

mad_pp r4.w , r0.wwww , c3.xxxx , c3.wwww 
mul r11.w , r4.wwww , c3.yyyy

Here we have
r11.w = r4.w * c3.y,

substituting for r4.w we have
r11.w = (r0.w * c3.x + c3.w) * c3.y
      = r0.w * (c3.x * c3.y) + (c3.w * c3.y)

with constant folding (c_fold.x = c3.x * c3.y, c_fold.y = c3.w * c3.y):
r11.w = r0.w * c_fold.x + c_fold.y

which lets us rewrite as
MAD r11.w, r0.w, c0.x, c0.y

Just saved 1 register and 1 instruction. I saw one shader that used 11 registers when only 2 were really needed, and most of those 11 held scalars! (r1.x, r2.w, r3.y), and they hadn't hit any port limits, so there was no justifiable reason not to reuse dead registers or pack the scalars.
Wow, yeah, that's pretty bad. I suppose this is one reason higher-level languages should really be used.
 
ERP said:
I was just scanning through the shader mark results, and it's just plain odd the way both cards stack up.

I haven't really looked at the shaders in detail, but based on how I'd write them I just can't see what the pattern is in terms of performance. Has anyone here looked in detail at the shaders in ShaderMark?
Might be interesting, but I think we'll see things change significantly with later drivers anyway, so I'm not sure it will really make a huge difference.
 
I was trying to establish if it's a difference in texture performance vs ALU op throughput. It's just out of interest on my part.

Since I'm one of the few people who has a reason to care about SM3.0, ATI's part is useless to me anyway.

One thing that did occur to me with RTHDRIBL (or whatever) is that it uses a 16-bit float target for the HDR image and then textures from it. Since ATI doesn't support fp texture filtering, it's possible that the code doesn't set it to point sampling explicitly and that NV40 is filtering the fp texture.

Of course it could be any number of other issues as well.
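To make ERP's hypothesis concrete, here's a purely illustrative Python sketch (1D for simplicity, not how the hardware actually fetches; the texture values and function names are mine) of what point sampling versus bilinear filtering a float texture means per fetch:

```python
# Sketch: point sampling returns the nearest texel; bilinear filtering
# blends the two nearest texel centers. The blend is the extra per-fetch
# work NV40 would be doing on an fp texture if filtering is left enabled.
tex = [0.0, 1.0, 4.0, 9.0]   # tiny 1D "fp16" texture

def point_sample(u):
    # nearest texel, u in [0, 1)
    i = min(int(u * len(tex)), len(tex) - 1)
    return tex[i]

def bilinear_sample(u):
    # linear blend of the two nearest texel centers
    x = u * len(tex) - 0.5
    i = max(int(x), 0)
    j = min(i + 1, len(tex) - 1)
    f = max(x - i, 0.0)
    return tex[i] * (1.0 - f) + tex[j] * f
```

If the app meant to do its own filtering in the shader, every bilinear fetch here is extra (and different) work compared with the point sample it was presumably assuming.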
 
ERP, wow, I thought of the exact same thing, but had no way to verify it! And RTHDRIBL uses a hell of a lot of texture ops per shader too.
 
Not necessarily. It's probably already doing a blur filter in the shader, which would pretty much make it impossible to tell whether the individual samples were bilinear or point sampled.

I have no idea how you would tell without the source code. Probably the easiest is to email the author and ask.
 
Right, but we can extract the shader code with 3DAnalyze; all we need to do is go through the code looking for filtering.

Asking the author may not be a bad idea either; saves us some time at least. :D

And judging from the screen shot, it seems to me that there's already a blur filter applied on the background.
http://img.hexus.net/v2/graphics_cards/ati/r420/images/rthdribl_big.png

Do you think nVIDIA should provide a control to disable the FP filtering when necessary? If the programmer implements the filtering in the shader, it's pointless and harmful to do the same work twice.
 
LeStoffer said:
Some RightMark 3D numbers:

Code:
RightMark 3D                      6800 Ultra   X800 XT PE
Procedural Wood - PS1.4              554.4        383.1
Procedural Marble - PS2.0            414.4        594.5
Procedural Marble - PS2.0 FP16       413.1        595.1
Lighting (Blinn) - PS2.0             333.6        495.2
Lighting (Blinn) - PS2.0 FP16        421.6        493.7
Lighting (Phong) - PS2.0             167.7        277.9
Lighting (Phong) - PS2.0 FP16        235.1        277.9

Okay, on second thought maybe DemoCoder has a point here regarding reuse of NV3X code.


This makes me curious about one thing. I remember in the past some game developers (Half-Life 2) detecting the NV3x and using 1.x shaders instead of 2.0 shaders. If other applications do this and run on a 6800, they will not be taking full advantage of the new chip.

Just something that came to mind when reading this thread.
 
991060 said:
Do you think nVIDIA should provide a control button to disable the FP filtering when necessary? If the programmer implement the filtering in the shader, it's pointless and harmful to do repeated work.
Huh? FP filtering will only be used when requested. If the program requested filtering while assuming it wouldn't be done because it's an FP target, well, then that's just sloppy.
 
hstewarth said:
This makes me curious about one thing. I remember in the past some game developers (Half-Life 2) detecting the NV3x and using 1.x shaders instead of 2.0 shaders. If other applications do this and run on a 6800, they will not be taking full advantage of the new chip.

Just something that came to mind when reading this thread.

You're suggesting that this demo (RTHDRIBL) is doing it?

The reason developers default to a lower shader is to get past a hardware's limitations, rather than hide its strengths. Do you really think Core and Crytek (and their TWIMTBP titles) are sabotaging nvidia's performance by forcing more 1.1 shaders instead of 2.0 on them?
 
Well, I used 3DAnalyze to log every DX9 call. Judging by the results, it does appear that RTHDRIBL is asking for D3DTEXF_ANISOTROPIC on float textures, but it's hard to track because of the voluminous data. There appears to be a call to CreateCubeTexture for D3DFMT_A16R16G16B16F, followed by a SetTexture for sampler stage 0, followed by a SetSamplerState to D3DTEXF_ANISOTROPIC on that sampler, but I'm not totally sure because I was jumping around a lot in a text file with thousands of lines trying to trace back hex addresses.
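For what it's worth, this kind of log triage can be automated. Here's a rough Python sketch with an invented log format (a real 3DAnalyze dump would need its own parser; the line layout below is purely illustrative) that flags sampler stages where an fp texture is bound and filtering other than point is set:

```python
# Sketch: scan a D3D9 call log for float textures that later get a
# filtered sampler state. Log lines here are invented for illustration.
log = [
    "CreateCubeTexture fmt=D3DFMT_A16R16G16B16F handle=0x1a2b",
    "SetTexture stage=0 handle=0x1a2b",
    "SetSamplerState stage=0 D3DSAMP_MAGFILTER D3DTEXF_ANISOTROPIC",
]

fp_handles = set()   # handles of float-format textures
bound = {}           # sampler stage -> currently bound texture handle
suspect_stages = []  # stages filtering an fp texture

for line in log:
    parts = line.split()
    if parts[0] == "CreateCubeTexture" and "A16R16G16B16F" in line:
        fp_handles.add(parts[2].split("=")[1])
    elif parts[0] == "SetTexture":
        stage = parts[1].split("=")[1]
        bound[stage] = parts[2].split("=")[1]
    elif parts[0] == "SetSamplerState" and "D3DTEXF_POINT" not in line:
        stage = parts[1].split("=")[1]
        if bound.get(stage) in fp_handles:
            suspect_stages.append(stage)

print(suspect_stages)  # stages where an fp texture is being filtered
```

That would at least replace the manual hex-address chasing with a single pass over the file.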
 
I know NVIDIA are having a look at it themselves, they asked me this morning how I did the testing for RTHDRIBL.

Rys
 
Bump (before it falls off the front page and I forget) to remind ppl to post if either nV or the "dribble" author responds.
 
I've asked NVIDIA what optimisation for RTHDRIBL's case they're performing, since it's a somewhat unique problem. I'll post when they reply.

Rys
 