PDA

View Full Version : Question about predicates


Mintmaster
24-Feb-2004, 05:45
What is the real advantage of predicates in ps 2.x? I can't see why they're any better than the plain old cmp function, except sometimes when combined with dynamic branching (which NV30 can't do anyway).

Can anyone give me an example of what you can do with predicates that you can't do with cmp? Or at least how it can save instructions (other than the fact that setp_comp doesn't need a subtract when comparing with a non-zero value)?

Basic
24-Feb-2004, 09:58
Think about what happens when more than one assignement is predicated with the same comparision.

Mintmaster
24-Feb-2004, 15:30
Is that it? So they're just boolean values with a fancy name?

Even in the case you mentioned, when pixel shaders have if...else statements, even multiple instruction blocks often have only one value that needs to be chosen.

Thanks. I guess it doesn't really matter anyway since everything is done by the hlsl compiler nowadays.

3dcgi
25-Feb-2004, 01:34
I don't know how it's implemented in ps 2.x, but here's how predicates work. The instruction names in the example are made up.

pred_cmp_equal_zero r0.x
(pred_true) add r1 0.5
(pred_false) sub r1 0.25

This is effectively an if/else statement.

if (r0.x = 0 )
{
add 0.5 to register r1
}
else
{
subtract 0.25 from register r1
}

You can find more details in the Itanium documentation on Intel's web site. Or elsewhere on the web I'm sure.

Simon F
25-Feb-2004, 08:07
Mintmaster: AFAIK, ARM processors have always had predicated instructions. I guess it's just been a matter of time until others have copied the idea.

DemoCoder
27-Feb-2004, 02:48
In some cases, a branch will be more efficient, and in some cases, a predicate block will win. It is up to compilers to make smart decisions based on the underlying architecture of the HW and the shader's structure.

For example, for short blocks, something like

if(x > y)
a = b * c
else
a = b * d

will run alot faster with CMP, LERP, MIN/MAX, and predicate-style branches.

But for a very long shader, with expensive operations like texture lookups in one of the branches, it may be more efficient to use real dynamic branches with some kind of prediction mechanism.

It all depends on the penalty for a real branch vs executing both branches and using predicates.

3dcgi
27-Feb-2004, 04:28
While Itanium isn't a graphics chip it appears to work like DemoCoder describes. I.e. it supports traditional dynamic branching and predication. The compiler decides which to use.

MfA
27-Feb-2004, 04:45
Branch prediction isnt needed, just find something else to do while the data needed to determine the branch target clears the pipeline ... and support split branches.

Speculative execution is EVIL, EVIL I TELL YOU.

Simon F
27-Feb-2004, 09:09
Branch prediction isnt needed, just find something else to do while the data needed to determine the branch target clears the pipeline ... and support split branches.

Speculative execution is EVIL, EVIL I TELL YOU.

That's an interesting stance Marco (and I realise that you're probably being facetious), but do you mind if I pose some hypothetical questions?
What do you do if you haven't got anything else to do while the pipeline clears?
How do you feel about breadth-first searches of a shortest path problem? In some senses much of the work you are doing is also speculative.
How do you feel about the following VS instruction...?
MUL R1.xy, R0, R2?
It's wasting 50% of the shader's computational power, but it'd be very difficult to fit in something else that used the z and w channels.
How about in a normal CPU where you're using an integer for a boolean test - that's 31 bits being unused :)

MfA
27-Feb-2004, 10:17
First of all I understand you are being facetious, but of course when I said speculative execution I meant hardware initiated speculative execution.

Branch prediction isnt needed, just find something else to do while the data needed to determine the branch target clears the pipeline ... and support split branches.

Speculative execution is EVIL, EVIL I TELL YOU.

That's an interesting stance Marco (and I realise that you're probably being facetious), but do you mind if I pose some hypothetical questions?
What do you do if you haven't got anything else to do while the pipeline clears?

Signal the profiler and let it tell the developer or compiler ... they can add predicated execution of the most likely path for instance (doesnt need to be a single path, with split branches you could do software branch prediction).

In a multithreaded processor you could use dynamic multithreading and only speculatively execute branches by spawning threads if the processor didnt have any real work to do, but I am not sure such sophisticated multithreading will really make much sense in a performance per mm2 sense ...

How do you feel about breadth-first searches of a shortest path problem? In some senses much of the work you are doing is also speculative.

In another you need much of that information to really know it is the shortest path anyway.

How do you feel about the following VS instruction...?
MUL R1.xy, R0, R2?
It's wasting 50% of the shader's computational power, but it'd be very difficult to fit in something else that used the z and w channels.

Then again, make the multiplication itself expensive enough and it becomes usefull to execute the individual multiplications with shared execution resources. Lets see what happens when we start doing double precision math in vertex shaders.

How about in a normal CPU where you're using an integer for a boolean test - that's 31 bits being unused :)

Ah, but we dont use individual registers for flags either now do we :)

It might be a question of balance, but for non mainly serial ISAs I dont think speculative execution is justified very often ... I mean, can you honestly say that you would like to see something like data speculation coming to desktop processors? Think of what you could do with the kind of area wasted on such things ...

Personally I dont think it will get that far, parallelism will conquer the excesses of speculative execution ... Ill stop calling it evil when that happens :)

Marco

Basic
28-Feb-2004, 21:56
I'm with MfA here. Remember that we're talking about GPUs. With PS2.0 you need to hide the latency of a texture fetch between two instructions. If you've already taken the hit to hide that kind of latency (through massive multi-threading), then there's no need to do speculative execution.

I'm rather surprised how often people talk about "branch misprediction penalties" here wrt PS3.0. It's a non-isssue.

Back on topic. (Branching isn't on topic.)

Even if there's just one assignement in each of the branches, it's still possible to gain some with predicates. If it's a large expression, it can often be rewritten so that operations can be reused in the two branches.

[Edit]
Just realized how strange the last part sound. :D
Actually doing branches are of topic here.
Different ways of faking branches by evaluating everything, and selecting the right result isn't off topic.

DemoCoder
29-Feb-2004, 07:41
Well, yes, you can factor common subexpressions outside of branches and perform partial redundancy elimination as well, but you still need the ability to branch over potentially very expensive operations.

Of course, the biggest example of where predicates break down is when you need to escape a loop early.

Basic
29-Feb-2004, 11:47
I'm not arguing that true branching is good to have, especially on GPUs where there's no problem with branch misprediction. I'm just saying that the original topic was about predicates vs cmp.

Btw, it doesn't have to be normal common subexpressions to reuse instructions between branches. It can be enough that it's the same operations in a larger expression.

Silly example: //HLSL
if(a<0)
x = sin(b) + log(c);
else
x = cos(d) + exp(e);


// pseudocode for "cmp" (one line per instruction)
R1 = sin(b);
R2 = log(c);
R1 = R1 + R2;
R2 = cos(d);
R3 = exp(e);
R2 = R1 + R2;
x = (a<0) ? R1 : R2;


// pseudocode for predicates (one line per instruction)
B1 = (a<0)
R1 = sin(b);
if(!B1) R1 = cos(d);
R2 = log(c):
if(!B1) R2 = exp(e);
x = R1 + R2; // The add is reused for both branches.

In that example I saved one instruction, even though there only was one assignement. With a larger expression it should be possible to find more instruction to reuse.

And again, yes, it would be better with a true branch. But that's not what the topic was about.

DemoCoder
29-Feb-2004, 13:12
It's not really redundancy elimination that's really saving you the instruction, the true magic lies in the fact that predicates allow you to combine CMP + op into a single instruction eliminating the need for extra CMPs in addition to ops.

If we write down the code in static single assignment form, it looks like


x1 = sin(b)
x2 = log(c)
x3 = cos(d)
x4 = exp(e)
x5 = x1 + x2
x6 = x3 + x4
x = phi(x5, x6)


where phi() is the conditional move function (notational abstraction) which sets X according to which branch control actually flowed through. To implement phi() with real branches, you insert a MOV instruction at the end of each branch block (e.g. x = x5 in IF clause, x = x6 in else clause). To implement with CMP, you need to insert a CMP instruction to choose x5 vs x6.

However, you can refactor the representation as such


x5 = phi(x1, x3)
x6 = phi(x2, x4)
x = x5 + x6


Then with forward propagation, you get
x5 = phi(sin(b), cos(d))
x6 = phi(log(c), exp(e))

Once you get phi(op, op), you translate phi() as a predicate

x5 =sin(b)
(!pred) x5 = cos(d)


You cannot translate phi() with only 2 instructions using CMP. Hence, the win.

The only problem is, I bet Microsoft's compiler doesn't implement these kinds of optimizations, because they're tricky. Intel has a bunch of papers out, and although I gave a simple trick for achieving it above (patterning matching for phi(op,op)), it's just the tip of the iceberg. The real deal, *really* taking full advantage of predication optimizations is uber complex for compiler writers.


Hell, FXC still doesn't generate assembly macros for some library functions, but rather expands things in its own power series, precluding the ability of the driver to substitute a more efficient evaluation.