Writing shaders in assembly vs HLSL

Simon F said:
Dio said:
Actually, all pixel shader versions are closer to microcode - i.e. lower level - than classic 'assembly language'.
I would have thought it'd be more like VLIW or something like that (at least from what I read of, say, the leaked XBox description). From what I recall of my dim-dark-Uni days, microcode is a way of implementing assembly code by interpreting it using an even simpler system, and so it could take multiple cycles to execute each asm instruction.
I'm a bit hazy on this. My definition of 'microcode' has always been some code that could be used by the state machine to control the ALU without needing significant decoding. That's not to say it necessarily IS done that way, but that it could be (and note my extra-hazy insertion of the word 'significant').

Simon F said:
The reason instructions take three arguments is because of MAD. For obscure technical reasons, it costs virtually nothing to do a MAD over a MUL if you can work out the operand specification and fetch issues.
By that, do you mean throughput? It surely must cost gates to stick on the adder (although chaining them together will save something over independent MULs and ADDs).
When I last looked at this - which was before I worked for ATI, I actually don't know how ATI do it - the MUL had an adder on the back end, therefore it was just one extra CSA (carry-save adder) stage to put the extra adder in, and CSA stages are pretty small (cf. cost of the multiplier, at least).
 
Dio said:
The dot-product can sometimes be as much as half of all the instructions in shader programs, and it takes something like 5 SSE instructions to implement it.
Yes and no. If you're using SoA packed data, SSE DP takes 5 instructions, but you get 4 results, so the overhead isn't that large.
The 5-instruction SSE sequence I was thinking of for the dot-product was a MUL->swizzle->ADD->swizzle->ADD sequence (where you can't avoid the swizzles because you need to add together elements within a vector rather than between two vectors). If you perform multiple dot-products, you may be able to share some adds and swizzles between them, but you won't get below ~3 instructions per element.
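To make that concrete, here's roughly what I mean in intrinsic form (just a sketch; the xyz_ packing, the unused fourth lane and the dp3_aos name are my own inventions):

#include <xmmintrin.h>

// One dot product, both vectors packed x,y,z,_ in a single register (AoS).
// MUL -> swizzle -> ADD -> swizzle -> ADD: five instructions for one result.
static inline __m128 dp3_aos(__m128 a, __m128 b)
{
    __m128 prod = _mm_mul_ps(a, b);                                     // ax*bx, ay*by, az*bz, ?
    __m128 tmp  = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(3, 3, 3, 1));  // bring ay*by to lane 0
    __m128 sum  = _mm_add_ps(prod, tmp);                                // ax*bx + ay*by in lane 0
    tmp         = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(3, 3, 3, 2));  // bring az*bz to lane 0
    return _mm_add_ps(sum, tmp);                                        // dot product in lane 0
}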
As with MAD, dot-product comes virtually free at the ALU level once you've implemented MUL.
True for fixed-point calculations (as you can just glue in a couple of CSAs), false for floating-point (as you need to normalize data before and after the additions, which is expensive).
 
OK, allow me to rephrase. If I'm raving again then please snap me out of it ;)

ps 2.* is a DirectX 9 standard, so as far as I know there is no obligation for the hardware to have micro-instructions that correspond directly to ps 2.* instructions. This also implies that there is always a compilation stage, where ps 2.* is the intermediate code that gets translated to architecture-dependent micro-instructions. Of course it is advantageous for the hardware to have micro-instructions that correspond directly to ps 2.* instructions.

But let's see what could happen in the future. To increase clock frequencies, a la Intel, micro-instructions will have to be kept simple. With ever-increasing programmability we'll also need more general ALUs. So the instructions could get split up - much as with SSE instructions - to become more general and reusable, saving silicon. The ps 2.* translation phase could then become less trivial and need an optimization phase, so inherently some performance is lost. That brings it closer to HLSL and makes it what I called a high-level assembly language.

Sorry for pimping my project again, but there's nothing in the DirectX SDK documentation that forbids me from translating ps 2.* instructions to SSE instructions. What I'm currently working on is an instruction scheduler and a peephole optimizer, so the one-to-one instruction correspondence is totally lost. In this situation it definitely is a high-level assembly language, and translating from HLSL would bring the same performance loss. Oh, and by the way, Prescott will feature new SSE instructions with 'horizontal' operations that could eliminate many swizzle operations.
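To give an idea of what those horizontal operations could buy (a sketch only, assuming a horizontal add along the lines of SSE3's haddps / _mm_hadd_ps; the dp3_sse3 name is made up):

#include <pmmintrin.h>   // SSE3 intrinsics: _mm_hadd_ps

// How a ps 2.0 dp3-style operation could map to SSE with a horizontal add:
// the two horizontal adds replace the explicit swizzle/add pairs.
static inline __m128 dp3_sse3(__m128 a, __m128 b)   // a, b packed as x, y, z, 0
{
    __m128 prod = _mm_mul_ps(a, b);    // ax*bx, ay*by, az*bz, 0
    prod = _mm_hadd_ps(prod, prod);    // ax*bx + ay*by, az*bz + 0, (repeated)
    return _mm_hadd_ps(prod, prod);    // full dot product broadcast to all four lanes
}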

Anyway, I'm probably looking at things from the wrong perspective so take it with a grain of salt...
 
Chalnoth said:
Were the moves used to reduce the number of registers used? If so, then this would be in line with what is expected when optimizing for the NV3x architecture.

It's also why I agree with one very specific part of Cg: it has the capability for hardware-specific compiler targets.

Well, I might as well put the code here; it's not that secret. It's the delux mapping shader from Tenebrae 2.

The original HLSL/Cg code (the same source compiles for both):

struct inputVertex
{
    float3 norVec     : COLOR0;
    float3 tanVec     : TEXCOORD0;
    float3 binVec     : TEXCOORD1;
    float2 deLuxCoord : TEXCOORD2;
    float2 texCoord   : TEXCOORD3;
    float3 position   : TEXCOORD4;
};

uniform float3 eyeposition;
uniform float3 lightposition;

float4 main(inputVertex I, uniform samplerCUBE tangentCube,
            uniform samplerCUBE binormalCube,
            uniform sampler2D deLuxMap, uniform sampler2D normalMap,
            uniform sampler2D baseMap,
            uniform sampler2D lightMap) : COLOR
{
    // normal
    float3 normal = I.norVec.xyz;
    // tangent
    float3 tangent = I.tanVec.xyz;
    // binormal
    float3 binormal = I.binVec.xyz;

    // Get the worldspace delux
    float3 wDelux = 2 * tex2D(deLuxMap, I.deLuxCoord).xyz - 1;
    // Put into tangent space
    float3 tDelux;
    tDelux.x = dot(wDelux, tangent);
    tDelux.y = dot(wDelux, binormal);
    tDelux.z = dot(wDelux, normal);
    tDelux = normalize(tDelux);

    // Get the normal from normal map lookup
    float3 matNormal = 2 * tex2D(normalMap, I.texCoord).xyz - 1;

    // normal . light vector
    float diffdot = saturate(dot(tDelux, matNormal));

    // calculate halfvector
    float3 halfvec = lightposition - I.position + eyeposition - I.position;
    float3 trans;
    trans.x = dot(halfvec, tangent);
    trans.y = dot(halfvec, binormal);
    trans.z = dot(halfvec, normal);
    halfvec = normalize(trans);

    float specdot = saturate(dot(halfvec, matNormal));
    specdot = pow(specdot, 16);

    float3 base = tex2D(baseMap, I.texCoord).xyz;
    float3 lmap = tex2D(lightMap, I.texCoord).xyz;

    float3 res = ((base * diffdot) + specdot) * lmap;
    return res.xyzz;
}


Cg compiler output:

// DX9 Pixel Shader by NVIDIA Cg compiler
ps_2_0
// cgc version 1.1.0003, build date Mar 4 2003 12:32:10
// command line args: -profile ps_2_0
//vendor NVIDIA Corporation
//version 1.0.02
//profile ps_2_0
//program main
//semantic main.tangentCube
//semantic main.binormalCube
//semantic main.deLuxMap
//semantic main.normalMap
//semantic main.baseMap
//semantic main.lightMap
//semantic lightposition
//semantic eyeposition
//var float3 lightposition : : c[2] : -1 : 1
//var float3 eyeposition : : c[1] : -1 : 1
//var float3 I.norVec : $vin.COLOR0 : COLOR0 : 0 : 1
//var float3 I.tanVec : $vin.TEXCOORD0 : TEXCOORD0 : 0 : 1
//var float3 I.binVec : $vin.TEXCOORD1 : TEXCOORD1 : 0 : 1
//var float2 I.deLuxCoord : $vin.TEXCOORD2 : TEXCOORD2 : 0 : 1
//var float2 I.texCoord : $vin.TEXCOORD3 : TEXCOORD3 : 0 : 1
//var float3 I.position : $vin.TEXCOORD4 : TEXCOORD4 : 0 : 1
//var samplerCUBE tangentCube : : texunit 0 : 1 : 1
//var samplerCUBE binormalCube : : texunit 1 : 2 : 1
//var sampler2D deLuxMap : : texunit 2 : 3 : 1
//var sampler2D normalMap : : texunit 3 : 4 : 1
//var sampler2D baseMap : : texunit 4 : 5 : 1
//var sampler2D lightMap : : texunit 5 : 6 : 1
//var float4 main : $vout.COLOR : COLOR : -1 : 1
dcl_2d s2
dcl_2d s3
dcl_2d s4
dcl_2d s5
def c0, 2.000000, 1.000000, 16.000000, 0.000000
dcl v0.xyz
dcl t0.xyz
dcl t1.xyz
dcl t2.xy
dcl t3.xy
dcl t4.xyz
texld r0, t2, s2
texld r1, t3, s3
mad r0.xyz, c0.x, r0, -c0.y
mad r1.xyz, c0.x, r1, -c0.y
add r2.xyz, c2, -t4
add r2.xyz, r2, c1
add r2.xyz, r2, -t4
dp3 r0.w, r2, t0
mov r3.x, r0.w
dp3 r0.w, r2, t1
mov r3.y, r0.w
dp3 r0.w, r2, v0
mov r3.z, r0.w
dp3 r0.w, r0, t0
mov r2.x, r0.w
dp3 r0.w, r0, t1
mov r2.y, r0.w
dp3 r0.x, r0, v0
mov r2.z, r0.x
dp3 r0.x, r3, r3
rsq r0.x, r0.x
mul r3.xyz, r0.x, r3
dp3_sat r0.x, r3, r1
pow r1.w, r0.x, c0.z
dp3 r0.x, r2, r2
rsq r0.x, r0.x
mul r2.xyz, r0.x, r2
dp3_sat r0.x, r2, r1
texld r2, t3, s4
texld r3, t3, s5
mad r0.xyz, r2, r0.x, r1.w
mul r3.xyz, r0, r3
mov r0.xyz, r3
mov r0.w, r3.z
mov oC0, r0
// 35 instructions, 4 R-regs.
// End of program
58 lines, 0 errors.

DX9 HLSL compiler output:

//
// Generated by Microsoft (R) D3DX9 Shader Compiler
//
// Source: deluxearb.cg
// Flags: /E:main /T:ps_2_0
//

// Parameters:
//
// sampler2D $baseMap;
// sampler2D $deLuxMap;
// sampler2D $lightMap;
// sampler2D $normalMap;
// float3 eyeposition;
// float3 lightposition;
//
//
// Registers:
//
// Name Reg Size
// ------------- ----- ----
// eyeposition c0 1
// lightposition c1 1
// $deLuxMap s0 1
// $normalMap s1 1
// $baseMap s2 1
// $lightMap s3 1
//

ps_2_0
def c2, 2, -1, 16, 0
dcl v0.xyz
dcl t0.xyz
dcl t1.xyz
dcl t2.xy
dcl t3.xy
dcl t4.xyz
dcl_2d s0
dcl_2d s1
dcl_2d s2
dcl_2d s3
texld r3, t3, s1
texld r2, t2, s0
texld r1, t3, s2
texld r0, t3, s3
add r4.xyz, -t4, c1
add r4.xyz, r4, c0
add r5.xyz, r4, -t4
dp3 r4.x, r5, t0
dp3 r4.y, r5, t1
dp3 r4.z, r5, v0
dp3 r0.w, r4, r4
rsq r0.w, r0.w
mul r4.xyz, r4, r0.w
mad r3.xyz, c2.x, r3, c2.y
dp3_sat r0.w, r4, r3
pow r1.w, r0.w, c2.z
mad r4.xyz, c2.x, r2, c2.y
dp3 r2.x, r4, t0
dp3 r2.y, r4, t1
dp3 r2.z, r4, v0
dp3 r0.w, r2, r2
rsq r0.w, r0.w
mul r2.xyz, r2, r0.w
dp3_sat r0.w, r2, r3
mad r1.xyz, r1, r0.w, r1.w
mul r1.xyz, r0, r1
mov r0.xyz, r1
mov r0.w, r1.z
mov oC0, r0

// approximately 31 instruction slots used (4 texture, 27 arithmetic)


The only differences from my own hand-written assembly version were a slightly different instruction order and not doing the last MOVs quite so literally; alpha is ignored anyway.
 
It could be interesting to look at the performance of these two versions :) The HLSL output uses 6 registers and the Cg output only 4. Your HLSL code has 17% fewer instructions.

Let's look at the performance of both shaders with some small tests:

- On a Radeon 9800 Pro your HLSL code is 25% faster than your Cg one.

- On a GeForce FX 5600, your HLSL code is 10% slower than your Cg one.

- On a GeForce FX 5600 with the _pp modifier, your HLSL code is 7% faster than your Cg one.

With AA and AF enabled, the HLSL code gains even more: it is faster even on the GeForce FX 5600 without the _pp modifier.

Cg seems faster only on the GeForce FX, and only when the bottleneck is register usage.

(The Radeon 9800 Pro is 10 X faster than the GeForce FX 5600 ;) )


Radeon 9800 Pro HLSL : 125 MPix/s
Radeon 9800 Pro Cg : 100 MPix/s

GeForce FX 5600 HLSL : 11.2 MPix/s
GeForce FX 5600 Cg : 12.4 MPix/s

GeForce FX 5600 HLSL_pp : 14.8 MPix/s
GeForce FX 5600 Cg_pp : 13.8 MPix/s

GeForce FX 5600 HLSL AA/AF : 7.0 MPix/s
GeForce FX 5600 Cg AA/AF : 6.1 MPix/s
 
arjan de lumens said:
Dio said:
The dot-product can sometimes be as much as half of all the instructions in shader programs, and it takes something like 5 SSE instructions to implement it.
Yes and no. If you're using SoA packed data, SSE DP takes 5 instructions, but you get 4 results, so the overhead isn't that large.
The 5-instruction SSE sequence I was thinking of for the dot-product was a MUL->swizzle->ADD->swizzle->ADD sequence (where you can't avoid the swizzles because you need to add together elements within a vector rather than between two vectors). If you perform multiple dot-products, you may be able to share some adds and swizzles between them, but you won't get below ~3 instructions per element.
It's not the smart way to do it. You can't do AoS (xyzw in one SSE register) efficiently in SSE. Instead, you have one register full of X, one of Y, one of Z, and one of W. Then the DP is one MUL and three ADDs (not four as I worked out above) and so you get one XYZW output per instruction.
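In intrinsic form, a DP3 over four vector pairs is something like this (a sketch; the ax/ay/az naming is just my pseudocode):

#include <xmmintrin.h>

// SoA: ax holds the X components of four vectors A0..A3, ay the Ys, az the Zs;
// likewise bx/by/bz for the B vectors. One short sequence yields four dot products.
static inline __m128 dp3_soa(__m128 ax, __m128 ay, __m128 az,
                             __m128 bx, __m128 by, __m128 bz)
{
    __m128 r = _mm_mul_ps(ax, bx);            // X products for all four pairs
    r = _mm_add_ps(r, _mm_mul_ps(ay, by));    // + Y products
    r = _mm_add_ps(r, _mm_mul_ps(az, bz));    // + Z products
    return r;                                 // four DP3 results, one per lane
}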

True for fixed-point calculations (as you can just glue in a couple of CSAs), false for floating-point (as you need to normalize data before and after the additions, which is expensive).
Good point which I had forgotten, although since the unit I was discussing with the hardware guys at the time was floating-point, there must still be some shortcut.
 
Dio said:
It's not the smart way to do it. You can't do AoS (xyzw in one SSE register) efficiently in SSE. Instead, you have one register full of X, one of Y, one of Z, and one of W. Then the DP is one MUL and three ADDs (not four as I worked out above) and so you get one XYZW output per instruction.
1 register for X, 1 for Y, 1 for Z, 1 for W: that's 4 registers, which is a bit beyond what a single MUL can reach, so that doesn't sound very meaningful for a standalone dot-product. If you combine multiple dot-products for, say, a vertex transform, you may be able to sustain a rate of 1 MUL per component, but then you also get 3/4 of an ADD and 1 data unpack per input component, for a total of ~2.75 instructions per result component.

True for fixed-point calculations (as you can just glue in a couple of CSAs), false for floating-point (as you need to normalize data before and after the additions, which is expensive).
Good point which I had forgotten, although since the unit I was discussing with the hardware guys at the time was floating-point, there must still be some shortcut.
You can still do CSAs between the input and output normalize stages, it's just that the normalize stages are unavoidable and a bit expensive.
 
I'm assuming you're executing some kind of shader or transform or similar, where you have several hundred items of data to process. As you say, for a single DP it's no use - but if you're only doing the one, why use SSE anyway? The whole point is that it's supposed to be for vector processing.

For this SoA method, you write exactly the same algorithm as you would for classic non-vector code, except that you process four items of data instead of one. It's exactly vector processing. SSE isn't meant for, or particularly efficient at, non-vectorisable algorithms.

Data pack/unpack does put a limit on peak efficiency, but this isn't much once you are doing several operations before repacking. In most cases you can store the input data in unpacked form and it's all pretty cheap.
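For what it's worth, the unpack itself is only a handful of shuffles - something like this (a sketch; the load_soa name and the memory layout are assumptions):

#include <xmmintrin.h>

// Load four xyzw vertices (AoS in memory) and transpose them into SoA
// registers; _MM_TRANSPOSE4_PS expands to a few shuffles/unpacks.
static void load_soa(const float* v,          // 16 floats: x0 y0 z0 w0  x1 y1 z1 w1 ...
                     __m128& x, __m128& y, __m128& z, __m128& w)
{
    x = _mm_loadu_ps(v + 0);                  // x0 y0 z0 w0
    y = _mm_loadu_ps(v + 4);                  // x1 y1 z1 w1
    z = _mm_loadu_ps(v + 8);                  // x2 y2 z2 w2
    w = _mm_loadu_ps(v + 12);                 // x3 y3 z3 w3
    _MM_TRANSPOSE4_PS(x, y, z, w);            // now x = x0..x3, y = y0..y3, etc.
}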
 
Dio said:
For this SoA method, you write exactly the same algorithm as you would for classic non-vector code, just you process four items of data instead of one. It's exactly vector processing. SSE isn't meant for, or particularly efficient at, non-vectorisable algorithms.
So you process 4 vertices/pixels at the same time, with each component of an SSE register corresponding to one piece of data for each of the 4 vertices/pixels. OK, then it suddenly makes a whole lot more sense to me with those XXXXs and YYYYs :)
 
The other side of the coin in this argument is just as important if not more so.

I call it the laziness factor, because I was taught in one of my first college classes that programmers are always lazy. If all shaders had to be written in assembly we would get relatively few of them, and simple ones at that. If developers are allowed to use an easier language like Cg or HLSL, we will see many more shaders, including more complex ones.

So even if there is a 50% performance hit vs assembly, the 300%+ increase in laziness will outweigh the lost performance over the long run.
:p
 
arjan de lumens said:
So you process 4 vertices/pixels at the same time, with each component of an SSE register corresponding to one piece of data for each of the 4 vertices/pixels. OK, then it suddenly makes a whole lot more sense to me with those XXXXs and YYYYs :)
:) My apologies. Sometimes I talk in my own pseudocode...
 
P4-Fan said:
So even if there is a 50% performance hit vs assembly, the 300%+ increase in laziness will outweigh the lost performance over the long run.
Your laziness argument is valid, but I wouldn't say 300%...

In my opinion, coding in HLSL isn't that much simpler than coding in assembly. You still have to keep track of what you're doing, and vector math is never really trivial. I think there is also a fear factor here: people seem to be afraid of assembly. But just as with the CPU, to be a good C programmer you have to be a good x86 programmer.

So let's say 150% laziness and 150% fear :devilish:
 
I've coded a lot of assembly for the 486 and Pentium: counting instruction cycles and stalls, doing U/V pairing optimizations, testing and rearranging for cache efficiency.

Still, HLSL is a valuable tool, if for nothing else than prototyping shaders. The rule is: implement first, optimize later. Asm is only for the optimization stage.
It would at least double the work if asm had to be used at the research/trial-and-error stage.

At the end it might be worth taking a look at the assembly output and hand-optimizing it if it's worth it. (It can be something like 30% in some cases...)

One of the problems on R300 is when the final code could fit within the - quite strict! - 64-instruction limit, but the HLSL-compiled code doesn't.
 
Hyp-X said:
Still, HLSL is a valuable tool, if for nothing else than prototyping shaders. The rule is: implement first, optimize later. Asm is only for the optimization stage.
It would at least double the work if asm had to be used at the research/trial-and-error stage.
That's true, but writing in assembly doesn't automatically mean you -have- to optimize. In the Win32asm community they write complete programs in assembly, but they don't constantly worry about performance. Like I said before, I think some people are simply afraid of it.

In some cases I even think that using assembly can be more straightforward. It isn't necessarily harder to learn or use than a C-style language, especially for relatively short programs like shaders...
 
jpaana said:
Well, I might as well put the code here; it's not that secret. It's the delux mapping shader from Tenebrae 2.

The original HLSL/Cg code (the same source compiles for both):

Thank you for posting that; it answers the question of how efficient Cg is for other cards. 25% slower than HLSL is significant.
 
Nick said:
In some cases I even think that using assembly can be more straightforward. It isn't necessarily harder to learn or use than a C-style language, especially for relatively short programs like shaders...
Short programs are the key. My guess is that assembler is easy if you can hold the entire program in your head, and it isn't if you can't.

Portability - not of the code to a different architecture, but from one head to another - is also a problem. One man's assembly is generally just that: it's much harder to get into someone else's assembly code than into someone else's high-level code. (The converse can be true too - it's possible to write particularly incomprehensible code in HLLs, particularly with C++.)
 
Nick said:
That's true but writing in assembly doesn't automatically mean you -have- to optimize.

Writing assembly is slower - there's usually more code and more operations to write.

Assembly is less clear: having more registers and fewer named variables makes things harder to read.
Of course you can write comments, but the tradeoff is even slower programming.

When the project gets larger, assembly tends to produce slower code than C.
For example, you can always inline functions in C, and each incarnation of the function will use whichever registers happen to be free at that point.
You won't do that in assembly because it would increase the work exponentially.
You end up with strict register convention rules to keep code reusable in large projects, using a lot of push/pop - causing slower code.

Most of these points apply to shaders as well (more or less).
Where they don't, it's because shaders are not that big (yet!).
 
Hyp-X said:
Assembly is less clear: having more registers and fewer named variables makes things harder to read.
Of course you can write comments, but the tradeoff is even slower programming.
That's a very good point. But you could write a preprocessor with macros! ;)
For example, you can always inline functions in C, and each incarnation of the function will use whichever registers happen to be free at that point. You won't do that in assembly because it would increase the work exponentially. You end up with strict register convention rules to keep code reusable in large projects, using a lot of push/pop - causing slower code.
I am actually against the use of inline functions. They generally make code bigger, and in many cases they don't improve performance; a call is really not that expensive. And where it really matters, in the bottleneck, it's advisable to write the whole thing in assembly anyway. It might take some effort, but you can practically always beat the C compiler. For graphics programming it's a bit harder because of the dedicated and limited instruction set, though...
Most of these points apply to shaders as well (more or less).
Where they don't, it's because shaders are not that big (yet!).
Unless you're going to do ray tracing with it, I don't believe shaders will ever become real 'programs'. You have to realize that a 1024x768 screen is still almost a million pixels, and running your 'program' per pixel is always going to be slow. So most shaders will be kept short for performance reasons. Another reason is that many games simply don't need complex shaders; I haven't seen any game yet that uses ps 2.0 shaders on every surface. I agree that writing assembly can be slightly more unpleasant in most cases, but it's not like it will ever become unmanageable.

So I find that writing a high-level compiler for shaders is a bit of a waste of time (except for portability reasons). Just having some tools like a preprocessor would be more effective. But please keep in mind that my opinion is biased because I'm an assembly freak. :devilish:
 