The penalty of flow control

Frank

Would it be worth it to use flow control in the near future for more than the special cases that cannot be done otherwise? I think not, as long as we are talking games and you need a fall-back path for other hardware anyway. And I don't think it will speed things up for general purposes, as you effectively use about half your processing capacity or more to make it happen.
 
Depends what you mean by "special case". There will be common cases in some shaders where it will save performance or save a rendering pass, but it is definitely not a replacement for CMP or predication.

On the other hand, PS2.0b doesn't seem to have predication either, and predication is an example where instruction cycles can be saved in the general case without penalty.
 
DemoCoder said:
Depends what you mean by "special case". There will be common cases in some shaders where it will save performance or save a rendering pass, but it is definitely not a replacement for CMP or predication.

On the other hand, PS2.0b doesn't seem to have predication either, and predication is an example where instruction cycles can be saved in the general case without penalty.

That depends. The moment you execute an action that is pixel-dependent, your quad pipeline will stop being a quad and start being a single pixel pipeline (or possibly two reduced single pixel pipelines). So even if you take the loss of the cycles for that action, all further actions will only execute on a single pixel (or pair).

So, while the loss of cycles for the branch might be preferable to the alternatives, you are certain to end up with a narrower pipeline the moment the instructions for all four pixels no longer match.

Even in the case that the driver or hardware unrolls your loops and executes all possibilities, it will decrease the throughput for those fragments from that point on, while it takes additional cycles afterwards to shift and assign the correct results.

Edit: typo.

Btw. It is even worse than it looks, as the other pixels from the initial quad have to be processed in sequence as well.
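To make that concrete, here is a toy cost model (illustrative Python, not any real hardware; the instruction counts and the serialization rule are assumptions) of what a lockstep quad pays when its pixels disagree on a branch:

```python
# Hypothetical cost model for a SIMD quad hitting a data-dependent branch.
# Assumption: when the 4 pixels disagree, the pipe executes BOTH sides of
# the branch, masking off the inactive pixels on each side.

def quad_branch_cost(lane_conditions, then_len, else_len, shared_len):
    """Cycles for a quad of 4 pixels to finish under lockstep execution."""
    takes_then = any(lane_conditions)        # at least one pixel goes 'then'
    takes_else = not all(lane_conditions)    # at least one pixel goes 'else'
    cost = shared_len                        # instructions before the branch
    if takes_then:
        cost += then_len                     # all 4 lanes step through 'then'
    if takes_else:
        cost += else_len                     # ...and 'else' if any lane needs it
    return cost

# All four pixels agree: only one side is paid for.
print(quad_branch_cost([True] * 4, then_len=20, else_len=10, shared_len=3))   # 23
# One pixel diverges: the whole quad pays for both sides.
print(quad_branch_cost([True, True, True, False], 20, 10, 3))                 # 33
```

The quad pays the shared sequence once, but every side of the branch taken by any of its pixels gets stepped through with the others masked off.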
 
Of course, when the next batch of fragments enters the quad pipe, it will start being a quad again, until you do something that makes it reconsider.

;)
 
DiGuru said:
That depends. The moment you execute an action that is pixel-dependent, your quad pipeline will stop being a quad and start being a single pixel pipeline (or possibly two reduced single pixel pipelines). So even if you take the loss of the cycles for that action, all further actions will only execute on a single pixel (or pair).

Are you talking about predication or branching? Predication has no "quad reducing" penalty. And where do you get this information on losing pipelines? A branch should just cause a stall for synchronization while it waits for other pipes in the quad to finish.
 
Another dumb question courtesy of Ninelven:

In very heavy shader based games, could combining the shaders into fewer shaders that use dynamic flow control help alleviate the need for a z-pass?
 
DemoCoder said:
DiGuru said:
That depends. The moment you execute an action that is pixel-dependent, your quad pipeline will stop being a quad and start being a single pixel pipeline (or possibly two reduced single pixel pipelines). So even if you take the loss of the cycles for that action, all further actions will only execute on a single pixel (or pair).

Are you talking about predication or branching? Predication has no "quad reducing" penalty. And where do you get this information on losing pipelines? A branch should just cause a stall for synchronization while it waits for other pipes in the quad to finish.

Predication just means (as far as I know) that you tell the hardware what direction your branch is supposed to take most often. That works ok, as long as the same branch is taken for all pixels.

Whether a branch causes a stall for synchronization (and stores the other pixels while executing the instructions for one of them), or just drops down to a one-pixel pipeline, stores the other pixels and goes on executing the instructions for one of them: what is the difference?

And at the end, it has to do the other pixels in sequence as well, before taking on the next quad of fragments.

Correct?
 
DiGuru said:
DemoCoder said:
DiGuru said:
That depends. The moment you execute an action that is pixel-dependent, your quad pipeline will stop being a quad and start being a single pixel pipeline (or possibly two reduced single pixel pipelines). So even if you take the loss of the cycles for that action, all further actions will only execute on a single pixel (or pair).

Are you talking about predication or branching? Predication has no "quad reducing" penalty. And where do you get this information on losing pipelines? A branch should just cause a stall for synchronization while it waits for other pipes in the quad to finish.

Predication just means (as far as I know) that you tell the hardware what direction your branch is supposed to take most often. That works ok, as long as the same branch is taken for all pixels.

No, that's branch prediction. Predication is a mechanism for disabling reads/writes based on a predicate. The instruction still costs the same, it just doesn't have any effect on the shader.
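A minimal sketch of that distinction (illustrative Python; the per-pixel mask is the assumption here): every instruction is issued for every pixel, and the predicate only gates the write:

```python
# Predication sketch: both "branches" are always computed for all pixels;
# the per-pixel predicate bit only decides which result gets written.
# (Illustrative model, not any specific hardware.)

def predicated_select(pred, then_vals, else_vals):
    # Cost model: every pixel pays for both value streams regardless of pred.
    return [a if p else b for p, a, b in zip(pred, then_vals, else_vals)]

print(predicated_select([True, False, True, False],
                        [1, 1, 1, 1],
                        [2, 2, 2, 2]))   # [1, 2, 1, 2]
```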


Whether a branch causes a stall for synchronization (and stores the other pixels while executing the instructions for one of them), or just drops down to a one-pixel pipeline, stores the other pixels and goes on executing the instructions for one of them: what is the difference?

The difference is pipelining. If the pipelines within a quad can operate on different instructions (MIMD), then if one branch takes, say, 20 instructions, and the other branch takes, say, 10 instructions, all 4 pixels within the quad will finish in a maximum of 20 cycles (if, say, 3 of them took the branch that costs 10), since they'd have to wait for the fourth pixel.

If, on the other hand, it is impossible for one pipeline to be executing an instruction that is different from the other 3 (SIMD), then a portion of the pipelines get "disabled" like with predication while the others execute a portion of the branch. There doesn't seem to be a good reason for this IMHO except to save instruction fetch/decode logic.
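The two options compare like this in a toy per-quad cost model (illustrative Python; the instruction counts are made up, and distinct path lengths stand in for distinct paths):

```python
def mimd_quad_cost(path_lens):
    # Each pipe fetches its own instructions: the quad finishes when the
    # slowest of its 4 pixels does.
    return max(path_lens)

def simd_quad_cost(path_lens):
    # One shared instruction stream: every distinct path taken by any pixel
    # is stepped through once, with the other pixels masked off.
    # (Distinct lengths identify distinct paths in this toy model.)
    return sum(set(path_lens))

paths = [10, 10, 10, 20]       # three pixels take the short side, one the long
print(mimd_quad_cost(paths))   # 20
print(simd_quad_cost(paths))   # 30
```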



 
DiGuru said:
Predication just means (as far as I know) that you tell the hardware what direction your branch is supposed to take most often. That works ok, as long as the same branch is taken for all pixels.

Nope, you are wrong.
Predication avoids branching. It's conditional execution. Like CMOV.

Whether a branch causes a stall for synchronization (and stores the other pixels while executing the instructions for one of them), or just drops down to a one-pixel pipeline, stores the other pixels and goes on executing the instructions for one of them: what is the difference?

The difference shown as a 2 pipe example:

Code:
DC's way:
1  2  3  4a 5a 6a 7a
1  2  3  4b 5b -  -

Your way:
1  2  3  4a 5a 6a 7a -  -
1  2  3  -  -  -  -  4b 5b
 
DemoCoder said:
If, on the other hand, it is impossible for one pipeline to be executing an instruction that is different from the other 3 (SIMD), then a portion of the pipelines get "disabled" like with predication while the others execute a portion of the branch. There doesn't seem to be a good reason for this IMHO except to save instruction fetch/decode logic.

Instruction fetch/decode.
Constant register fetch.
Iterator/temp register addressing.

I suspect these are close to as much logic as the actual execution units (there's a reason after all to go with a SIMD architecture).

OTOH, even if it's SIMD it can be optimized to only diverge when it's really necessary. So if there's a data-dependent branch, but it has the same result for the entire quad, then the execution can continue with all four pixels.

This means that algorithms that result in the same execution path across large contiguous screen areas will not take a big performance hit.

Note that I'm not saying the NV40 is working this way.
(I have no idea.)
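That divergence-only-when-necessary idea can be sketched as follows (illustrative Python; the instruction counts are hypothetical): a uniform quad pays for one side only, and only a genuinely divergent quad pays for both.

```python
def quad_cost(conds, then_len, else_len):
    # conds: the branch condition of each of the 4 pixels in the quad.
    if all(conds):
        return then_len              # whole quad agrees: no divergence penalty
    if not any(conds):
        return else_len
    return then_len + else_len       # real divergence: both sides are executed

print(quad_cost([True] * 4, 20, 10))                 # 20
print(quad_cost([False] * 4, 20, 10))                # 10
print(quad_cost([True, False, True, True], 20, 10))  # 30
```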
 
I agree. If the branches are based off of per-vertex input registers (Vn, vFace, etc), then you'll have all 4 pipes taking the same branch and the compiler will have it easier deciding when to predicate and when to branch. If the condition is based on dependent texture fetch, then it is much harder, and the compiler or developer will have to start branch predicting. If the majority of the pixels take one branch or the other, you could still end up with a big win.

It's just difficult to analyze and Nvidia developer relations will have to provide ample training and tools to assist devs in knowing when and how to use branches.
 
DemoCoder said:
DiGuru said:
DemoCoder said:
DiGuru said:
That depends. The moment you execute an action that is pixel-dependent, your quad pipeline will stop being a quad and start being a single pixel pipeline (or possibly two reduced single pixel pipelines). So even if you take the loss of the cycles for that action, all further actions will only execute on a single pixel (or pair).

Are you talking about predication or branching? Predication has no "quad reducing" penalty. And where do you get this information on losing pipelines? A branch should just cause a stall for synchronization while it waits for other pipes in the quad to finish.

Predication just means (as far as I know) that you tell the hardware what direction your branch is supposed to take most often. That works ok, as long as the same branch is taken for all pixels.

No, that's branch prediction. Predication is a mechanism for disabling reads/writes based on a predicate. The instruction still costs the same, it just doesn't have any effect on the shader.

Ah, I missed that one. Thanks.

Whether a branch causes a stall for synchronization (and stores the other pixels while executing the instructions for one of them), or just drops down to a one-pixel pipeline, stores the other pixels and goes on executing the instructions for one of them: what is the difference?

The difference is pipelining. If the pipelines within a quad can operate on different instructions (MIMD), then if one branch takes, say, 20 instructions, and the other branch takes, say, 10 instructions, all 4 pixels within the quad will finish in a maximum of 20 cycles (if, say, 3 of them took the branch that costs 10), since they'd have to wait for the fourth pixel.

If, on the other hand, it is impossible for one pipeline to be executing an instruction that is different from the other 3 (SIMD), then a portion of the pipelines get "disabled" like with predication while the others execute a portion of the branch. There doesn't seem to be a good reason for this IMHO except to save instruction fetch/decode logic.

The ALU and datapath make a pipeline. Most transistors that go into an ALU consist of execution logic. Next to the instruction fetch/decode logic, you need a bunch of transistors to execute each opcode. And you need transistors that determine where to store what. And transistors to keep track of what each part is used for. And caches to store the state of all that when switching fragments and shaders.

The registers that do the actual work are just a small part of it. And those are the main parts you have to duplicate for SIMD, next to some limited datapaths and extra storage. And you can even save a lot of transistors around the pipe, that handle the input and output.

We should ask sireric, but I think you could make no more than two one-pixel pipelines with the transistors used to make a quad, if that.
 
I agree, I just don't know, although I have reason to suspect it's true. (of course, we could also say that NV seems to use way more transistors per pipe than ATI too :) )

However, it's only a relevant issue if a predominant number of your branches differ per quad. I suspect that's not the case. Like with Nalu, I suspect that branches will be defined by relatively infrequently changing boundaries (e.g. skin to scale) that are larger than a quad. You'd have to implement something like a "screen door" effect to have a predominant number of pixels in quads take separate branches.

A pathological example would be rendering the Mandelbrot set. :)
 
DemoCoder said:
It's just difficult to analyze and Nvidia developer relations will have to provide ample training and tools to assist devs in knowing when and how to use branches.

What I'm worrying about is: if I use HLSL (or Glslang), how do I specify whether an if should be compiled as a predication/CMP construct or as a dynamic branch?

Will compilers try to decide that for me?

Code:
if (tex2D(sampler, coords) > 0.5)
    oColor = func1();
else
    oColor = func2();

In this case whether predication or a dynamic branch is faster might depend on the actual texture!
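A back-of-the-envelope expected-cost comparison for that example (illustrative Python; all instruction counts, the branch overhead, and the 50/50 side probability are made-up assumptions):

```python
def predication_cost(len1, len2):
    # Predication executes both func1 and func2 for every pixel.
    return len1 + len2

def branch_cost(len1, len2, divergent_fraction, branch_overhead):
    # divergent_fraction: fraction of quads whose texels straddle the 0.5
    # threshold; only those quads must execute both sides.
    avg_one_side = (len1 + len2) / 2      # assume each side equally likely
    return (branch_overhead
            + (1 - divergent_fraction) * avg_one_side
            + divergent_fraction * (len1 + len2))

# A smooth texture (few quads straddle the threshold) favours the branch...
print(branch_cost(20, 20, 0.05, 2) < predication_cost(20, 20))   # True
# ...while a noisy texture can make predication the cheaper choice.
print(branch_cost(20, 20, 0.95, 2) < predication_cost(20, 20))   # False
```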
 
DemoCoder said:
I agree, I just don't know, although I have reason to suspect it's true. (of course, we could also say that NV seems to use way more transistors per pipe than ATI too :) )

However, it's only a relevant issue if a predominant number of your branches differ per quad. I suspect that's not the case. Like with Nalu, I suspect that branches will be defined by relatively infrequently changing boundaries (e.g. skin to scale) that are larger than a quad. You'd have to implement something like a "screen door" effect to have a predominant number of pixels in quads take separate branches.

A pathological example would be rendering the Mandelbrot set. :)
Yes, that's why it doesn't matter much whether multiple branches in complex shaders that take a different path for each pixel (like when calculating a Mandelbrot set) would be efficient: they wouldn't be. It would take about four times the number of instruction cycles to finish.

As long as all pixels in a quad take the same path, flow control can be effective and fast. And for very special effects, the speed does not matter very much.
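For the curious, here is why the Mandelbrot set is the pathological case (plain escape-time iteration in Python; the sample points are arbitrary):

```python
def mandelbrot_iters(c, max_iters=50):
    # Escape-time iteration: the data-dependent loop trip count is exactly
    # what would diverge between neighbouring pixels on real hardware.
    z = 0j
    for i in range(max_iters):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iters

# Inside the set the loop never exits; just outside it exits almost at once,
# so quads that straddle the boundary diverge maximally.
print(mandelbrot_iters(0j))        # 50 (never escapes)
print(mandelbrot_iters(1 + 0j))    # 2  (escapes almost immediately)

# Four neighbouring samples (one "quad") near the boundary:
quad = [complex(-0.75 + dx, 0.10 + dy) for dy in (0, 0.01) for dx in (0, 0.01)]
print([mandelbrot_iters(c) for c in quad])
```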
 
I think quad based rendering should be a developer choice myself ... I think 4*32 bit floating point arithmetic takes enough area not to make sharing control logic a necessity from an area point of view. Also it doesn't fit all that well in a unified shading architecture.

Sometimes it is a nice way to be able to get difference values .... sometimes it is a nice way to make it impossible to perform efficient rendering.

Once you get rid of quads the whole branching question simply becomes a question of overcoming a little latency (and very little compared to memory access at that). Not a big deal.
 
So, how do we explain this?

Tridam said:
DarN said:
Yes, but does it handle it Well? ;)

The answer is "No", at least with the current drivers (60.72). PS3.0 support isn't exposed until DX9.0c is installed. I've done some quick tests for our review. Branching is very costly. So costly that it looks like the drivers are buggy: 9 pipeline passes for the simplest branching I was able to think of:

Code:
ps_3_0

dcl vFace

def c0, 0, 0, 0, 0
def c1, 1, 0, 0, 0
def c2, 0, 1, 0, 0


if_ge vFace, c0.x
  mov oC0, c1
else
  mov oC0, c2
endif
 
Hyp-X said:
Will compilers try to decide that for me?

Code:
if (tex2D(sampler, coords) > 0.5)
    oColor = func1();
else
    oColor = func2();

In this case whether predication or a dynamic branch is faster might depend on the actual texture!

I agree, clearly there needs to be a hinting mechanism, e.g.

Code:
#pragma(branch_use_predicate)
if (...)
    ...
#pragma(branch_compiler_choice)
By default, if a texture is used as part of the condition, the compiler should default to predication unless told otherwise. Right now, developers may have to use an asm() directive to get around this.
 
Yes, with three different ways to perform branching in PS 3.0 hardware, there will definitely need to be some work done in optimizing compilers to execute branching in the most optimal way.

Hopefully it can be done well without the need for developer input, though in the short-term a hinting mechanism may be necessary for optimal performance in many cases.
 