Tip for better performance

vinartrulz · Jan 26, 2010

Hello all..

I am currently porting an algorithm to the GPU.
The algorithm is pretty simple, but I couldn't find an alternative implementation that suits the GPU, so I am currently not getting the desired results out of it.

Algorithm:

For each voxel of a 512 x 256 x 256 volume I need to look at the 12 adjacent voxels, and if any of them matches ( voxel == level ) I need to modify the value corresponding to that voxel.
The current implementation would look like:

if( pixel1 == level ) flag = true;
else if( pixel2 == level ) flag = true;
...............
else if( pixel12 == level ) flag = true;

As I understand it, the GPU executes all of these statements regardless of whether flag is already set by the first one. Is there some other trick to accomplish this? Looking forward to your suggestions.

AndreasL · Jan 27, 2010

The performance of dynamic branching differs depending on what GPU you are using and on the input data. Several fragments/pixels (for example 8 or 16) are processed using the same stream of instructions, so if some of the fragments take a specific branch, the instructions of that branch must be executed for the whole group. If, however, none of the fragments take the branch, it can be skipped altogether.

I can't find an easier way of explaining it, but look at pages 25-28 of this presentation:
http://s08.idav.ucdavis.edu/fatahalian-gpu-architecture.pdf
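
To make the grouping rule concrete, here is a minimal sketch in plain C++. It is a toy model and not how any particular GPU works internally: it only counts how many groups of lanes would have to execute a branch body, assuming a group pays for the body as soon as one of its lanes takes the branch. The function name and the group size parameter are made up for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Toy model, not a real GPU: given one "takes the branch" flag per lane,
// count how many groups of `groupSize` lanes must execute the branch body.
// A group pays for the body if at least one of its lanes takes the branch.
std::size_t groupsExecutingBranch(const std::vector<bool>& takesBranch,
                                  std::size_t groupSize) {
    if (groupSize == 0) {
        return 0;  // no meaningful grouping
    }
    std::size_t groups = 0;
    for (std::size_t start = 0; start < takesBranch.size(); start += groupSize) {
        std::size_t end = start + groupSize;
        if (end > takesBranch.size()) {
            end = takesBranch.size();
        }
        bool anyTaken = false;
        for (std::size_t i = start; i < end; ++i) {
            anyTaken = anyTaken || takesBranch[i];
        }
        if (anyTaken) {
            ++groups;
        }
    }
    return groups;
}
```

With 16 lanes in groups of 8, a single lane taking the branch makes one whole group execute it, while a group in which no lane takes it skips the body entirely.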

If it is possible to do it without a dynamic branch, I would at least try that.

One way would be to look at the comparison differently.

if(pixel1 == level) can be expressed as if((pixel1-level) == 0)

(pixel1-level) * (pixel2-level) * (pixel3-level) * ... * (pixel12-level)
should then equal zero exactly when at least one of the pixels equals level.
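
A related branch-free formulation, swapped in here for the product, is to OR the twelve comparison results together. The sketch below is plain C++ and assumes integer voxel values; the function name and the constant are hypothetical. The reason for preferring OR in a sketch is that the product of twelve fixed-width integer differences can overflow (twelve factors of 8 give 2^36, which wraps to 0 in 32 bits even though no factor is zero), and a product of floats can underflow to zero or overflow to infinity.

```cpp
#include <cstddef>

// Number of neighbours examined per voxel, as in the original question.
constexpr std::size_t kNeighbourCount = 12;

// Returns true if any neighbour equals `level`, without data-dependent branches:
// each comparison yields 0 or 1 and the results are combined with bitwise OR.
// Integer voxel values are an assumption of this sketch.
bool anyNeighbourAtLevel(const int (&neighbours)[kNeighbourCount], int level) {
    unsigned hit = 0;
    for (std::size_t i = 0; i < kNeighbourCount; ++i) {
        hit |= static_cast<unsigned>(neighbours[i] == level);
    }
    return hit != 0;
}
```

The loop has a fixed trip count of 12, so a compiler is free to unroll it; whether that actually beats the if/else chain on a given GPU would have to be measured.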

P.S.
You don't mention whether you are using floats or integers. If you are using floats, I should also warn you about doing equality comparisons on floating-point numbers; it's generally a bad idea.
Ds.
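
If the voxels do turn out to be floats, one common alternative to == is to test against a tolerance. This is only a sketch in plain C++: the function name is made up, and the tolerance is a hypothetical parameter that would have to be chosen to suit the data.

```cpp
#include <cmath>

// True when `value` lies within `tolerance` of `level`.
// The tolerance is a hypothetical parameter to be tuned to the data.
bool nearLevel(float value, float level, float tolerance) {
    return std::fabs(value - level) <= tolerance;
}
```

For example, nearLevel(1.0f, 1.0005f, 0.001f) accepts the value while nearLevel(1.0f, 1.01f, 0.001f) rejects it.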

/Andreas

Mintmaster · Jan 27, 2010

vinartrulz said:
As I understand it, the GPU executes all of these statements regardless of whether flag is already set by the first one. Is there some other trick to accomplish this? Looking forward to your suggestions.

It may be out of your control, with the compiler deciding what's best. I think that if you put the texture loads inside the branch then it should use dynamic branching.

Of course, you only gain with dynamic branching if all pixels in a warp/wavefront skip at least some of the elseifs.
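
As a rough illustration of keeping the loads inside the branch, the sketch below (plain C++, not shader code) fetches each neighbour only when the loop reaches it and stops at the first match. The Volume and Offset types, the flat memory layout, and the function name are all assumptions made for this example, and whether a GPU compiler keeps the early exit as a real dynamic branch is, as said above, up to the compiler.

```cpp
#include <cstddef>

// Hypothetical flat volume layout: index = x + width * (y + height * z).
struct Volume {
    const int* data;
    int width, height, depth;
};

// Relative position of one neighbour; which 12 offsets to use is up to the caller.
struct Offset {
    int dx, dy, dz;
};

// Walks the neighbour list and stops at the first match, so later voxels
// are never loaded once a hit is found. Out-of-volume neighbours are skipped.
bool anyNeighbourAtLevelEarlyOut(const Volume& volume, int x, int y, int z,
                                 const Offset* offsets, std::size_t count,
                                 int level) {
    for (std::size_t i = 0; i < count; ++i) {
        const int nx = x + offsets[i].dx;
        const int ny = y + offsets[i].dy;
        const int nz = z + offsets[i].dz;
        if (nx < 0 || ny < 0 || nz < 0 ||
            nx >= volume.width || ny >= volume.height || nz >= volume.depth) {
            continue;
        }
        // The load sits inside the loop body, next to the test that uses it.
        const int value =
            volume.data[nx + volume.width * (ny + volume.height * nz)];
        if (value == level) {
            return true;
        }
    }
    return false;
}
```

The early exit only saves work on a GPU when every pixel in the warp/wavefront leaves the loop early; otherwise the group keeps going until its slowest member is done.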

vinartrulz · Jan 28, 2010

Thanks a lot for the replies...
I tried implementing this using Cg and CUDA. I am getting better results with Cg; it could be because my CUDA implementation is far from optimal.

My question is: can somebody guarantee that an optimal CUDA implementation of an algorithm will always be better than its Cg counterpart,

or is it algorithm dependent? I am eager to get a reply on this.

ccanan · Feb 1, 2010

I think you can use the [branch] and [flatten] attributes to control the asm code generated by the compiler. Check the generated asm code, then look at the results from the profiler.
