Writing shaders in assembly vs HLSL

Hanners

This is really just following up to a comment I read in the forums over at Rage3D (discussing Half-Life 2), where somebody stated that writing shaders in DirectX HLSL compared to assembly will yield a 50% performance hit.

Now, although I might have expected HLSL to give a small performance penalty, 50% seems more than a little excessive, to put it mildly.

Anyway, my curiosity has been roused now, so can any of you clarify whether this statement is correct or incorrect, potential performance hits, etc?

Thanks in advance guys. :)
 
Seems like BS.
That is, unless you're programming on an NV3x, eh. In fact...

Even using Cg on an NV3x will result in a fairly big ( 20% I'd guesstimate ) hit compared to assembly language, if your assembly language is optimal.

I've had the chance to look at some beta shaders of a certain nVidia demo involving fire, and a few comments hadn't been removed from them yet.
These comments were very interesting: some of them were comparing the output of the Cg compiler ( version 1.1, I suppose, since it was all fairly recent ) and what some NV3x assembly guru managed to do by optimizing the result given by the Cg compiler.

Things like 25 instructions, 5 half registers and 1 full register going as low as 20 instructions and only 4 half registers ( I don't remember the exact numbers, though; I probably have them somewhere on my PC )

So, considering Cg is done with the NV3x in mind... Well...
I'd say that 50% is exaggerated; heck, HLSL is pretty much done with the R3xx in mind. But saying it's a "small performance penalty" isn't really correct either. An exact estimate is hard to give, and it greatly depends on the situation.
But the thing to keep in mind here is that making optimal code takes a lot of time, and sometimes you might not even figure out how to make it much better than the compiler. And sometimes you might not even figure out how to make it as good as the compiler!
That's precisely what HLSL & Cg are all about - saving time.


Uttar
 
HLSL wasn't done with any [vendor specific] hardware in mind; it was done with the intention of putting out the most optimal assembly reasonably possible. And it's quite good at it.

I've seen the HLSL compiler outperform many amateurs by a few instructions, and in the cases where it falls behind handcrafted assembly it is usually only by an instruction or two. Though there are, of course, cases where this can differ by a large margin.

50% sounds like a gross, worst-case scenario overestimate to me. But HLSL was created to increase the efficiency of the programmer, not the GPU.

[Edit]Clarified the hardware statement.
 
We use HLSL exclusively for our next gen stuff, and compiler-generated assembly beat hand-tweaked assembly before we did the switch to HLSL... and boy does it simplify things :)

-- Daniel, Epic Games Inc.

Hanners said:
This is really just following up to a comment I read in the forums over at Rage3D (discussing Half-Life 2), where somebody stated that writing shaders in DirectX HLSL compared to assembly will yield a 50% performance hit.

Now, although I might have expected HLSL to give a small performance penalty, 50% seems more than a little excessive, to put it mildly.

Anyway, my curiosity has been roused now, so can any of you clarify whether this statement is correct or incorrect, potential performance hits, etc?

Thanks in advance guys. :)
 
It's largely the same story as with C vs. assembly.

There will be some limited set of situations in which an assembly language expert can produce much faster code than the C compiler. In most other circumstances pure assembly will be slightly but not significantly faster.

The gap will be significantly narrower than C vs assembly, though, because the single-basic-block nature of (current) shader code is very amenable to optimisation. If the algorithm has been expressed in its most efficient form in HLSL, the output should be very close to the ideal shader.

Most importantly, there are very few competent assembly programmers left out there. The typical programmer who has no or limited assembly-level skills will get as good or better results from HLSL, because they will understand what they are doing so much better.
 
In my own tests the DX9 HLSL compiler has done a significantly better job for PS 2.0 shaders than the Cg compiler, the difference being a few instructions out of a ~30 instruction shader. DX9 HLSL got within one instruction of my own version; Cg had 4 instructions more than DX9 HLSL, mostly movs that aren't actually needed and are possibly optimized away by the driver.
 
jpaana said:
...Cg had 4 instructions more than DX9 HLSL, mostly movs that aren't actually needed and possibly optimized away by the driver.


Interesting... especially if it's "optimized away" by nVidia's driver, but not by others'...
 
Joe DeFuria said:
Interesting...especially if it's "optimized away" by nVidia's driver, but not other's....

I was thinking the same..

More of a reason to prefer the standards over Cg.
 
If the algorithm has been expressed in its most efficient form in ...

I'd say this is equally important for performance. Especially when you take into account the hardware the code will be running on.
 
Ilfirin said:
More of a reason to prefer the standards over CG.

 
jpaana said:
Cg had 4 instructions more than DX9 HLSL, mostly movs that aren't actually needed and possibly optimized away by the driver.
Were the moves used to reduce the number of registers used? If so, then this would be in line with what is expected when optimizing for the NV3x architecture.

It's also why I agree with one very specific part of Cg: it has the capability for hardware-specific compiler targets.
 
P4-Fan said:
If the algorithm has been expressed in its most efficient form in ...
I'd say this is equally important for performance. Especially when you take into account the hardware the code will be running on.
One of the problems with high-level languages is that it's easy to make something clumsy or - a particular problem for games - a 'one size fits all' shader when in many cases a simpler shader would suffice.
 
As far as I know, both ps 2.* assembly and HLSL have to be compiled and optimized by the driver. Some assembly instructions might not correspond with the micro-instructions used by the graphics chip. In the future I think this difference will become even bigger with higher clock speeds and RISC-like architectures.

So we're comparing apples and oranges here. It's not like comparing C with x86 assembly. In that case you absolutely can get a 50% or greater performance increase by writing in assembly. But most of that comes from using newer instruction sets or special characteristics of a specific processor model. Many -old- optimizations are no longer valid and can even decrease performance. On the other hand, some mathematical tricks can only be done in assembly.

I think of shader programming in mostly the same way. With a high-level shading language you get good performance and good portability. With low-level assembly you -might- very well get a performance increase, even 50% if you know all the ins and outs, but you're limited to a smaller audience and the optimization might not work on other and newer architectures.

Cg is a different story. It is an extra layer above DirectX or OpenGL assembly shaders, so there are two compilation and optimization steps, and in both steps some performance is lost. In other words, Cg should theoretically always be slower than ps 2.* or HLSL or glslang. It's doomed to extinction except as an API-independent language...

Anyway, what I actually wanted to say is, ps 2.* is more like a high-level assembly language. Some sort of intermediate language where most instructions take three arguments. Take a look at the project in my signature, almost all ps 2.0 instructions get translated to multiple SSE instructions.
 
Nick said:
Anyway, what I actually wanted to say is, ps 2.* is more like a high-level assembly language. Some sort of intermediate language where most instructions take three arguments. Take a look at the project in my signature, almost all ps 2.0 instructions get translated to multiple SSE instructions.
Actually, all pixel shader versions are closer to microcode - i.e. lower level - than classic 'assembly language'.

The reason instructions take three arguments is because of MAD. For obscure technical reasons, it costs virtually nothing to do a MAD over a MUL if you can work out the operand specification and fetch issues.

SSE takes more instructions because it doesn't have separate destination or 3-operand instructions - it's nothing to do with the PS code being 'higher level'.
 
Dio said:
SSE takes more instructions because it doesn't have separate destination or 3-operand instructions - it's nothing to do with the PS code being 'higher level'.

And it doesn't have free swizzle...
It's not unusual in SSE that half of the instructions are pack/unpack stuff. Yuck!
 
Nick said:
As far as I know, both ps 2.* assembly and HLSL have to be compiled and optimized by the driver.

No, HLSL has to be compiled into assembly first before it gets passed to the driver. (The binary representation, actually.)
 
Dio said:
The reason instructions take three arguments is because of MAD. For obscure technical reasons, it costs virtually nothing to do a MAD over a MUL if you can work out the operand specification and fetch issues.
I thought MAD took 4 arguments?
SSE takes more instructions because it doesn't have separate destination or 3-operand instructions - it's nothing to do with the PS code being 'higher level'.
There are also several commonly used shader instructions that do not have SSE counterparts, especially dot-product, exponent and logarithm instructions. The dot-product can sometimes be as much as half of all the instructions in shader programs, and it takes something like 5 SSE instructions to implement it.
 
Dio said:
Actually, all pixel shader versions are closer to microcode - i.e. lower level - than classic 'assembly language'.
I would have thought it'd be more like VLIW or something like that (at least from what I read of, say, the leaked XBox description). From what I recall of my dim-dark-Uni days, microcode is a way of implementing assembly code by interpreting it using an even simpler system, and so each asm instruction could take multiple cycles to execute.
The reason instructions take three arguments is because of MAD. For obscure technical reasons, it costs virtually nothing to do a MAD over a MUL if you can work out the operand specification and fetch issues.
By that, do you mean throughput? - it surely must cost gates to stick on the adder (although there will be some savings with them chained together over independent muls and adds).
 
arjan de lumens said:
Dio said:
The reason instructions take three arguments is because of MAD. For obscure technical reasons, it costs virtually nothing to do a MAD over a MUL if you can work out the operand specification and fetch issues.
I thought MAD took 4 arguments?
I shouldn't have used the word 'argument'. Instead I should have used 'operand', which I did (correctly) later on. This is probably a blind spot caused by switching between destination-first, source-first, and separate-destination assembly languages. In my head it looks like X = MAD X, Y, Z... I don't recommend rearranging one's brain in this manner to anyone who hasn't already got a somewhat random mix of z80, 6502, 68000, 56001, infinite varieties of x86, ARM, various PS and VS, and several microcodes on the loose in their head.

The dot-product can sometimes be as much as half of all the instructions in shader programs, and it takes something like 5 SSE instructions to implement it.
Yes and no. If you're using SoA packed data, SSE DP takes 5 instructions, but you get 4 results, so the overhead isn't that large. As with MAD, dot-product comes virtually free at the ALU level once you've implemented MUL.
 