Writing shaders in assembly vs HLSL

About that HLSL code

Is the failure when I attempt a "ps_2_a" target the result of hard-coding in fxc.exe, or is it just passing through what the 9.0a HLSL compiler library is capable of? In any case, I'm interested in how MS's HLSL implementation responds to the NV3x challenge (dependency on deprecating part of the underlying spec, in this case the number of registers supported). IOW, I'm wondering whether the "ps_2_a" target increases instruction count to save register usage and, more importantly, how ps_2_a performance compares to Cg's, whatever it is doing. I'm hoping someone can help provide the answer, maybe by running fxc on the HLSL code above with the _2_a profile and listing the beta version/date and code output. ;)

However, what's a question without some of my associated off-the-wall speculation?

Another question I have is whether the beta 2_a profile is actually going to be released. It seems like nVidia is pushing for Cg pretty exclusively to (maybe) allow precision spec circumvention (wrt DX 9) under their control, and a good DX9 HLSL ps_2_a profile, from the NV35's point of view at least, would tend to undercut that effort. If nVidia isn't pushing for it, I'm not sure Microsoft will be prompt in delivering it, given the small number of cards showing significant benefit. Cards below the 5900 don't fit the rest of the ps_2_a profile characteristics outlined in the GDC slides all that well because of precision issues, so I suspect that without nVidia working on their end it might not be a productive effort on MS's part. Then again, maybe something will crop up with this "real" DetFX release that is supposed to occur sometime in the future. Barring that, the NV40's overcoming of the NV3x weaknesses might be foreshadowed in the next few months by how nVidia changes their developer focus, including a possible change regarding DX 9 standards efforts.

Oh well, at least my first question is pretty straightforward. ;)
 
It seems the ps_2_a profile is actually the ps_2_x profile with a known set of extensions. Here is the previous code compiled to the ps_2_a profile with the HLSL compiler from the second-newest beta version of the DX9 SDK update (as was the previous one; I'm just downloading the newest one, thanks for the heads-up) and, for comparison, to the ps_2_x profile with the Cg 1.1 compiler.

DX9 HLSL:

//
// Generated by Microsoft (R) D3DX9 Shader Compiler
//
// Source: deluxearb.cg
// Flags: /E:main /T:ps_2_x
//

// Parameters:
//
// sampler2D $baseMap;
// sampler2D $deLuxMap;
// sampler2D $lightMap;
// sampler2D $normalMap;
// float3 eyeposition;
// float3 lightposition;
//
//
// Registers:
//
// Name Reg Size
// ------------- ----- ----
// eyeposition c0 1
// lightposition c1 1
// $deLuxMap s0 1
// $normalMap s1 1
// $baseMap s2 1
// $lightMap s3 1
//

ps_2_x
def c2, 2, -1, 16, 0
dcl v0.xyz
dcl t0.xyz
dcl t1.xyz
dcl t2.xy
dcl t3.xy
dcl t4.xyz
dcl_2d s0
dcl_2d s1
dcl_2d s2
dcl_2d s3
add r0.xyz, -t4, c1
add r0.xyz, r0, c0
add r1.xyz, r0, -t4
dp3 r0.x, r1, t0
dp3 r0.y, r1, t1
dp3 r0.z, r1, v0
dp3 r0.w, r0, r0
rsq r0.w, r0.w
mul r2.xyz, r0, r0.w
texld r0, t2, s0
texld r1, t3, s1
mad r1.xyz, c2.x, r1, c2.y
dp3_sat r0.w, r2, r1
pow r3.w, r0.w, c2.z
mad r2.xyz, c2.x, r0, c2.y
dp3 r0.x, r2, t0
dp3 r0.y, r2, t1
dp3 r0.z, r2, v0
dp3 r0.w, r0, r0
rsq r0.w, r0.w
mul r0.xyz, r0, r0.w
dp3_sat r2.w, r0, r1
texld r1, t3, s3
texld r0, t3, s2
mad r0, r0.xyzz, r2.w, r3.w
mul r0, r1.xyzz, r0
mov oC0, r0

// approximately 29 instruction slots used (4 texture, 25 arithmetic)


Cg:

// DX9 Pixel Shader by NVIDIA Cg compiler
ps_2_x
// cgc version 1.1.0003, build date Mar 4 2003 12:32:10
// command line args: -profile ps_2_x
//vendor NVIDIA Corporation
//version 1.0.02
//profile ps_2_x
//program main
//semantic main.tangentCube
//semantic main.binormalCube
//semantic main.deLuxMap
//semantic main.normalMap
//semantic main.baseMap
//semantic main.lightMap
//semantic lightposition
//semantic eyeposition
//var float3 lightposition : : c[2] : -1 : 1
//var float3 eyeposition : : c[1] : -1 : 1
//var float3 I.norVec : $vin.COLOR0 : COLOR0 : 0 : 1
//var float3 I.tanVec : $vin.TEXCOORD0 : TEXCOORD0 : 0 : 1
//var float3 I.binVec : $vin.TEXCOORD1 : TEXCOORD1 : 0 : 1
//var float2 I.deLuxCoord : $vin.TEXCOORD2 : TEXCOORD2 : 0 : 1
//var float2 I.texCoord : $vin.TEXCOORD3 : TEXCOORD3 : 0 : 1
//var float3 I.position : $vin.TEXCOORD4 : TEXCOORD4 : 0 : 1
//var samplerCUBE tangentCube : : texunit 0 : 1 : 1
//var samplerCUBE binormalCube : : texunit 1 : 2 : 1
//var sampler2D deLuxMap : : texunit 2 : 3 : 1
//var sampler2D normalMap : : texunit 3 : 4 : 1
//var sampler2D baseMap : : texunit 4 : 5 : 1
//var sampler2D lightMap : : texunit 5 : 6 : 1
//var float4 main : $vout.COLOR : COLOR : -1 : 1
dcl_2d s2
dcl_2d s3
dcl_2d s4
dcl_2d s5
def c0, 2.000000, 1.000000, 16.000000, 0.000000
dcl v0.xyz
dcl t0.xyz
dcl t1.xyz
dcl t2.xy
dcl t3.xy
dcl t4.xyz
texld r0, t2, s2
texld r1, t3, s3
mad r0.xyz, c0.x, r0.xyzx, -c0.y
mad r1.xyz, c0.x, r1.xyzx, -c0.y
add r2.xyz, c2.xyzx, -t4.xyzx
add r2.xyz, r2.xyzx, c1.xyzx
add r2.xyz, r2.xyzx, -t4.xyzx
dp3 r0.w, r2.xyzx, t0.xyzx
mov r3.x, r0.w
dp3 r0.w, r2.xyzx, t1.xyzx
mov r3.y, r0.w
dp3 r0.w, r2.xyzx, v0.xyzx
mov r3.z, r0.w
dp3 r0.w, r0.xyzx, t0.xyzx
mov r2.x, r0.w
dp3 r0.w, r0.xyzx, t1.xyzx
mov r2.y, r0.w
dp3 r0.x, r0.xyzx, v0.xyzx
mov r2.z, r0.x
dp3 r0.x, r3.xyzx, r3.xyzx
rsq r0.x, r0.x
mul r3.xyz, r0.x, r3.xyzx
dp3_sat r0.x, r3.xyzx, r1.xyzx
pow r1.w, r0.x, c0.z
dp3 r0.x, r2.xyzx, r2.xyzx
rsq r0.x, r0.x
mul r2.xyz, r0.x, r2.xyzx
dp3_sat r0.x, r2.xyzx, r1.xyzx
texld r2, t3, s4
texld r3, t3, s5
mad r0.xyz, r2.xyzx, r0.x, r1.w
mul r0.xyz, r0.xyzx, r3.xyzx
mov r0, r0.xyzz
mov oC0, r0
// 34 instructions, 4 R-regs.
// End of program
59 lines, 0 errors.

Both now get the last movs right. HLSL also changes instruction scheduling pretty dramatically (mostly the texlds), which also relieves register pressure.
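To illustrate why moving the texlds matters, here is a rough sketch in Python (not shader code) of a toy liveness count. It assumes straight-line code, treats every value as one whole register, and ignores component packing, so it is not how fxc or cgc actually allocate registers. The function max_live and all value names are made up for the example.

```python
# Toy liveness model: each instruction is (destination, [sources]).
# All names are hypothetical; inputs such as "t2"/"t3" are never temporaries.

def max_live(program, outputs):
    """Return the peak number of temporaries alive at once in straight-line code."""
    last_use = {}
    for i, (_, srcs) in enumerate(program):
        for s in srcs:
            last_use[s] = i
    for name in outputs:
        last_use[name] = len(program)  # outputs stay live to the end

    live, peak = set(), 0
    for i, (dst, srcs) in enumerate(program):
        # A source read for the last time here frees its register,
        # so the destination may reuse it (as in "mul r0.xyz, r0, r0.w").
        live -= {s for s in srcs if last_use[s] == i}
        live.add(dst)
        peak = max(peak, len(live))
    return peak

# All four texture fetches first, then the math.
fetch_early = [
    ("deLux", ["t2"]), ("normal", ["t3"]), ("base", ["t3"]), ("light", ["t3"]),
    ("n", ["normal"]), ("l", ["deLux"]), ("diff", ["l", "n"]),
    ("lit", ["base", "diff"]), ("out", ["light", "lit"]),
]

# Each fetch placed just before its first use.
fetch_late = [
    ("deLux", ["t2"]), ("l", ["deLux"]), ("normal", ["t3"]), ("n", ["normal"]),
    ("diff", ["l", "n"]), ("base", ["t3"]), ("lit", ["base", "diff"]),
    ("light", ["t3"]), ("out", ["light", "lit"]),
]

print(max_live(fetch_early, ["out"]), max_live(fetch_late, ["out"]))  # prints: 4 2
```

Same work in both orderings, but the late-fetch version never needs more than two temporaries at once in this model, which is the kind of saving you get when the base map and light map texlds sit right before the final mad/mul.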

Update: the newest HLSL beta compiler uses the nrm macro instruction instead of dp3/rsq/mul, shuffles the instructions a bit more, and uses 5 registers instead of 4 (as nrm can't use the same register as source and destination). The same applies to the 2.0 shader posted earlier: it now uses nrm and 7 registers instead of 6.
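For anyone wondering what the dp3/rsq/mul triple is doing, here is the arithmetic spelled out as a small Python sketch (again not shader code; the helper names just mirror the instruction mnemonics and are otherwise my own). It computes v * 1/sqrt(v.v), which is the normalization that nrm provides as a single macro.

```python
import math

def dp3(a, b):
    """Three-component dot product, like the dp3 instruction."""
    return sum(x * y for x, y in zip(a[:3], b[:3]))

def rsq(x):
    """Reciprocal square root, like the rsq instruction."""
    return 1.0 / math.sqrt(x)

def mul(v, s):
    """Scale each component of v by the scalar s."""
    return tuple(x * s for x in v)

def normalize_three_op(v):
    # dp3 r0.w, r0, r0
    w = dp3(v, v)
    # rsq r0.w, r0.w
    w = rsq(w)
    # mul r0.xyz, r0, r0.w   (result can overwrite the source register)
    return mul(v, w)

print(normalize_three_op((3.0, 0.0, 4.0)))  # approximately (0.6, 0.0, 0.8)
```

The three-instruction form can keep its scratch value in the w component and write the result back over the source, which is why it gets by with one register fewer than the nrm version.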
 
mboeller said:
Ahh...

the NV3x optimised HLSL-compiler is now available. So Cg can now die. Good.

For D3D perhaps, but not for OpenGL until the GL2 shading language can actually be used...
 
Hmm... D3DXSHADER_DX91BETA_PREDICATION

The predication flag, specify this flag if you want to enable predication and flow control code generation. This flag is only for convenience of testing during the preview, and will be removed before final.

Hmm...
 