The penalty of flow control

DemoCoder said:
DiGuru said:
In a SIMD architecture, using predicates might still require different instructions to be issued for each fragment. That will only be faster if it doesn't 'break' the quad (i.e. the chip can issue the instructions for each fragment and continue executing the whole quad afterwards, instead of dropping down to serial processing).

No, predicates do not require different instructions to be issued for each fragment. With predicates, the instructions are still executed; the results simply aren't written to registers. It's a write-disabling mechanism, that's all.

Yes, but the results still have to be calculated. All of them, with the correct ones being used.

Also, in the examples above:
Code:
if (tex2D(sampler, coords) > 0.5)
    oColor = func1();
else
    oColor = func2();

Code:
oColor = (tex2D(sampler, coords) > 0.5)? func1() : func2();

Those functions (func1, func2) have to be executed, whatever mechanism is used. And that requires different instructions to be issued.

The difference is in calculating all possibilities versus stalling parts of the pipeline. Both have a penalty.
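To put rough numbers on that trade-off, here is a toy cost model in Python. It is only a sketch: it assumes a SIMD quad that has to run both sides whenever its fragments disagree, and the instruction counts and the branch_overhead value are made-up illustrative figures, not measurements from any chip.

```python
# Toy cost model, counted in issued instruction slots per quad.
# All numbers below are illustrative assumptions, not hardware data.

def predicated_cost(len_a, len_b):
    # Every fragment runs both sides; the predicate only masks the writes.
    return len_a + len_b

def branching_cost(takes_a, len_a, len_b, branch_overhead):
    # takes_a holds one bool per fragment in the quad.
    cost = branch_overhead
    if any(takes_a):
        cost += len_a      # at least one fragment needs side A
    if not all(takes_a):
        cost += len_b      # at least one fragment needs side B
    return cost

flat = predicated_cost(20, 20)
coherent = branching_cost([True, True, True, True], 20, 20, branch_overhead=6)
divergent = branching_cost([True, False, True, True], 20, 20, branch_overhead=6)
print(flat, coherent, divergent)
```

Under these assumptions the branch only pays off when the whole quad goes the same way; a divergent quad pays for both sides plus the branch overhead.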
 
DiGuru said:
Those functions (func1, func2) have to be executed, whatever mechanism is used. And that requires different instructions to be issued.

No, they will all be executed in every fragment, in the exact same sequence. Period.

Code:
oColor = (tex2D(sampler, coords) > 0.5)? func1() : func2();

let func1() = { return a * b; }
let func2() = { return a + b; }

This will compile to something like:

Code:
texld r0, t0, s0
setp_gt p0, r0, c0.x
(p0) mul r1, r2, r3
(!p0) add r1, r2, r3
mov oC0, r1

or, more optimally:

Code:
texld r0, t0, s0
setp_gt p0, r0, c0.x
(p0) mul oC0, r2, r3
(!p0) add oC0, r2, r3

And without predicates:

Code:
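texld r0, t0, s0
sub r1, c0.x, r0       // r1 = 0.5 - tex, so r1 >= 0 means tex <= 0.5
mul r4, r2, r3         // func1 result, always computed
add r5, r2, r3         // func2 result, always computed
cmp oC0, r1, r5, r4    // r1 >= 0 ? add : mul, i.e. tex > 0.5 ? mul : add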
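For anyone who wants to check the select logic, here is a small Python sketch of the cmp semantics (dst = src0 >= 0 ? src1 : src2), modeled on scalars rather than 4-component registers. The function names are hypothetical. It checks that computing c0.x - r0 and passing the add result first reproduces the strict > of the original expression, including at exactly 0.5.

```python
def cmp(src0, src1, src2):
    # Scalar model of the shader `cmp`: src0 >= 0 selects src1, otherwise src2.
    return src1 if src0 >= 0 else src2

def shade_select(tex, a, b, threshold=0.5):
    r1 = threshold - tex       # sub r1, c0.x, r0
    r4 = a * b                 # mul: func1, always computed
    r5 = a + b                 # add: func2, always computed
    return cmp(r1, r5, r4)     # tex > threshold ? r4 : r5

def shade_reference(tex, a, b, threshold=0.5):
    # The original high-level expression.
    return a * b if tex > threshold else a + b

samples = [0.0, 0.25, 0.5, 0.75, 1.0]
results = [shade_select(t, 2.0, 3.0) for t in samples]
expected = [shade_reference(t, 2.0, 3.0) for t in samples]
print(results == expected)
```

Both the mul and the add run for every sample; only the final select depends on the texture value.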


Take the two predicated instructions above:
Code:
(p0) mul r1, r2, r3
(!p0) add r1, r2, r3

Both get executed in all pipelines, in the same order. The only difference is which write takes effect: if p0 is true, the first instruction's result is written to r1, while the second instruction, predicated on (!p0), still executes but its write to r1 is suppressed. If p0 is false, it is the other way around.

Predicates are *write masks*, like the .x in mov r0.x, r1; the only difference is that they are conditional masks. They do not change instruction order or which instructions execute. (Another way to think of it is that the other pipelines get disabled for that instruction, e.g. they execute a "NOP" that cycle, but I think it is generally implemented by disabling the register update.)
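Here is a toy Python model of that, purely as a sketch: one quad of four fragments, scalar registers, and one predicate bit per fragment (all of these are simplifying assumptions, and run_quad is a hypothetical name). Each instruction is issued once for the whole quad, every lane computes the result, and the predicate only decides whether the write to r1 happens.

```python
def run_quad(tex, a, b, threshold=0.5):
    """Run the predicated sequence for all fragments in lockstep."""
    n = len(tex)
    issued = []                  # instructions issued once for the whole quad
    r1 = [0.0] * n

    issued.append("setp_gt p0, r0, c0.x")
    p0 = [t > threshold for t in tex]

    issued.append("(p0) mul r1, r2, r3")
    for i in range(n):
        result = a[i] * b[i]     # every lane computes the product
        if p0[i]:                # the predicate only gates the write
            r1[i] = result

    issued.append("(!p0) add r1, r2, r3")
    for i in range(n):
        result = a[i] + b[i]     # every lane computes the sum
        if not p0[i]:
            r1[i] = result

    return r1, issued

colors, issued = run_quad(tex=[0.9, 0.1, 0.7, 0.3], a=[2.0] * 4, b=[3.0] * 4)
print(colors)
print(issued)
```

The issued list is the same no matter what the texture values are; only which lanes keep which result changes.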


P.S. I am aware that oC0 can only be written to with a MOV, but the HW is not generally restricted to this. I wrote it that way to show how it will look in the underlying code after the driver optimizes it (e.g. on NV40, oC0 = R0 register). I could have shown it more verbosely, but hopefully the driver compiler does copy propagation and dead code removal and ends up with what I wrote.
 
DemoCoder, we actually agree. We both say the same thing in different words.

What do you think would happen if you use branches within a predicate? Or can you only use a predicate for assignments? I don't know the shader language well enough to say whether it is possible.
 