PDA

View Full Version : Pixel Shader Optimization Challenge

DemoCoder
06-Jan-2003, 15:02
Ok, I needed to code something up, but I'm not sure my solution is optimal, so I am putting the challenge to you:

You have 3 vector values: E, F, and G. I want to compute MAX(E,F,G) and get the max value, and I also want to know which one of the three was the max.

Oh, and to make it clear: these are vectors, so everything applies to each component.

Here's an example.

E = { 2, 4, 1, 5}
F = { 8, 7, 8, 0 }
G = { 9, 0, 5, 2}

M = MAX(E,F,G) = { 9, 7, 8, 5 } (component-wise maximums)

Let Epos = 0, Fpos=1, Gpos=2

C = {2, 1, 1, 0} (records which value, E, F, or G was chosen as max for each component)

Interested in the minimal instruction count possible. Naive implementation can be done in 6 ops, but I'm wondering if it's possible to beat it.

Approximations are allowed if the error rate is less than 5%. (e.g. I've seen optimizations from MAX to a fused-multiply-add that use a change of algebraic rings at the cost of loss of precision)

Basic
06-Jan-2003, 17:59
I always seem to forget which of the conditionals are in which shader set, and the only spec I have at this computer is the NV_fragment_program for NV30. So with that set, this is the first try:
MAX M, E, F
MAX M, M, G
SNE C, M, E
SNE T, M, F
MAD C, C, T, C
C will contain the smallest index if more than one vector contains the max value.

Don't know if all those instructions are available in other instruction sets.

I want a cookie. :D

DemoCoder
06-Jan-2003, 21:30
Cool, but can C appear so many times in the MAD? Wouldn't you hit a read port limit?

Anyone think they can get this down to 4 instrs?

nAo
06-Jan-2003, 22:08
Cool, but can C appear so many times in the MAD? Wouldn't you hit a read port limit?
Each register should be tri-read-ported.

Basic
06-Jan-2003, 22:31
Couldn't find an explicit reference to read port limits in the NV30 doc. But yes I'm quite sure that a temporary variable can be read two times in one OP.

Anyone think they can get this down to 4 instrs?You inhumane... :D
I think it's rather hard, would you care to tell the "approximate MAX" trick to give some inspiration?

[EDIT]
Doh, nAo beat me to it.

DemoCoder
07-Jan-2003, 02:58
Modern superscalar processors are designed to get high performance on floating-point intensive problems. The IBM Power2, for example, can initiate two floating-point multiply-add instructions per cycle. This is the incentive behind reformulating ADD and MAX operations as floating-point multiply and add instructions.

The two algebraic structures S1 = (R, +, 0, MAX) and S2 = (R+, *, 1, +) are both semirings, where R is the set of real numbers and R+ is the positive reals. Given a real number BASE, the function encode(A) = BASE**A maps numbers from S1 into those of S2. The inverse mapping is decode(X) = log(X)/log(BASE). These transformations preserve the first semiring operation, that is,

A+B = decode(encode(A)*encode(B))

The situation is not quite as tidy for the second operation. If A is much larger (or smaller) than B, then A+B is a very good approximation to MAX(A,B). However, if A is equal to B, then A+B = 2 MAX(A,B). It isn't hard to show
MAX(A,B) &lt; decode(encode(A)+encode(B)) &lt;= MAX(A,B)+log(2)/log(BASE)

Basically, if all you need to do is addition and max ops, then you can transform to exponential space (almost like gamma correction), do your math there, and then transform back later.

This works only in certain circumstances, where you have code that looks like

X = A + B
Y = C + D
Z = MAX(X,Y)

Simon F
07-Jan-2003, 08:44
But Democoder, why are you worrying about efficiency for? Aren't you supposed to be using a high level shader language these days? :P

/me removes tongue from cheek.

DemoCoder
07-Jan-2003, 11:00
What if I were writing a compiler? :)

Simon F
07-Jan-2003, 13:23
What if I were writing a compiler? :)
Then I'd say you'd either specified a rather 'interesting' set of primitives in the language OR that you were trying to make it way too intelligent :-)

DemoCoder
07-Jan-2003, 22:24
Got me. :)

Chalnoth
12-Jan-2003, 07:41
Just out of curiosity, DemoCoder, have you tried this in an HLSL (Cg or DX9 HLSL) and examined the output?