Tip for better performance

vinartrulz

Newcomer
Hello all,

I am currently porting an algorithm to the GPU.
The algorithm is pretty simple, but I couldn't find an alternative implementation that suits the GPU, so I am currently not getting the desired results from the GPU.

Algorithm:

For each voxel of a 512 x 256 x 256 volume, I need to look at the 12 adjacent voxels, and if any of them equals level (voxel == level), I need to modify the value corresponding to that voxel.
The current implementation looks like this:

if( pixel1 == level ) flag = true;
else if( pixel2 == level ) flag = true;
...............
else if( pixel12 == level ) flag = true;

As I understand it, the GPU executes all of these statements regardless of whether flag is already set by the first one. Is there some other trick to accomplish this? Looking forward to your suggestions.
 
The performance of dynamic branching depends on which GPU you are using and on the input data. Several fragments/pixels (for example 8 or 16) are processed with the same stream of instructions, so if any of those fragments takes a specific branch, the instructions of that branch must be executed for the whole group. If, however, none of the fragments takes the branch, it can be skipped altogether.

I can't think of an easier way to explain it, but have a look at pages 25-28 of this presentation:
http://s08.idav.ucdavis.edu/fatahalian-gpu-architecture.pdf


If it is possible to do it without a dynamic branch, I would at least try that.

One way would be to look at the comparison in another way.

if(pixel1 == level) can be expressed as if((pixel1-level) == 0)

(pixel1-level) * (pixel2-level) * (pixel3-level) * ... * (pixel12-level)
should then equal zero if and only if at least one of the pixels equals level.
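
A minimal sketch of that branch-free idea, written as plain C++ so it can be compiled on the host; the same expressions could be carried over into a CUDA kernel or a Cg shader. The function names and the 12-element neighbor array are assumptions made for illustration, not something from the posts. The product is accumulated in double because a product of twelve integer differences can easily overflow a 32-bit integer. The second function is an alternative that is not from the post: it ORs the twelve comparison results together instead of multiplying.

```cpp
#include <cstddef>

// Assumed layout: the 12 adjacent voxel values are already gathered into an array.
constexpr std::size_t kNeighborCount = 12;

// Product form: the product of (neighbor - level) is zero exactly when
// at least one neighbor equals level. Accumulating in double avoids the
// integer overflow a 32-bit product of twelve differences could hit.
bool anyEqualsProduct(const int* neighbors, int level) {
    double product = 1.0;
    for (std::size_t i = 0; i < kNeighborCount; ++i) {
        product *= static_cast<double>(neighbors[i]) - static_cast<double>(level);
    }
    return product == 0.0;
}

// Alternative (not from the post): OR the comparison results together.
// There is no data-dependent branch on the comparisons; every neighbor
// is always inspected.
bool anyEqualsOr(const int* neighbors, int level) {
    bool flag = false;
    for (std::size_t i = 0; i < kNeighborCount; ++i) {
        flag |= (neighbors[i] == level);
    }
    return flag;
}
```

Both versions do the same amount of work for every voxel, which is the point: nothing depends on which neighbor matched first. Whether that is faster than the else-if chain on a given GPU is something to measure.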


Ps.
You don't mention whether you are using floats or integers. If you are using floats, I should also warn you about doing equality comparisons on floating-point numbers; it's generally a bad idea.
Ds.

/Andreas
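
To illustrate the floating-point caveat, here is a small C++ sketch of a tolerance-based comparison that could stand in for `voxel == level` when the data is float. The function name and the tolerance parameter are assumptions for illustration; a suitable tolerance depends on the range and precision of the actual volume data.

```cpp
#include <cmath>

// Treat value as "equal" to level when the two differ by at most tolerance.
// The tolerance is an assumed, data-dependent parameter.
bool nearlyEqual(float value, float level, float tolerance) {
    return std::fabs(value - level) <= tolerance;
}
```

If the voxel values are really integers stored as floats (for example, 8-bit or 16-bit data loaded into a float texture without filtering), an exact comparison may still behave as expected, but that is worth verifying rather than assuming.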
 
"As I understand it, the GPU executes all of these statements regardless of whether flag is already set by the first one. Is there some other trick to accomplish this? Looking forward to your suggestions."
It may be out of your control, with the compiler deciding what's best. I think that if you put the texture loads inside the branch, the compiler should use dynamic branching.

Of course, you only gain from dynamic branching if all pixels in a warp/wavefront skip at least some of the else-if branches.
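
A rough sketch of what "loads inside the branch" could look like, in plain C++ for the sake of a compilable example. The `fetch` callable is a hypothetical stand-in for a texture or memory load, and the function name is an assumption. On the host this clearly skips the remaining loads after the first match; on a GPU, whether the compiler emits a real dynamic branch, and whether a whole warp/wavefront gets to skip anything, is up to the compiler and the data.

```cpp
// Each neighbor is fetched only when the previous comparisons have failed,
// so a match on an early neighbor avoids the later loads.
// Fetch is any callable taking a neighbor index (0..11) and returning its value.
template <typename Fetch>
bool anyNeighborEqualsLazy(Fetch fetch, int level) {
    for (int i = 0; i < 12; ++i) {
        if (fetch(i) == level) {
            return true;  // remaining fetches are skipped
        }
    }
    return false;
}
```

Comparing this against a fully branch-free version in the profiler is probably the only reliable way to tell which one wins for a particular data set.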
 
Thanks a lot for the replies.
I tried implementing it in both Cg and CUDA. I am getting better results with Cg, which could be because my CUDA implementation is far from optimal.

My question is: can somebody guarantee that an optimal CUDA implementation of an algorithm will always be better than its Cg counterpart ;) or is it algorithm dependent? I am eager to get a reply on this.
 
I think you can use [branch] and [flatten] to control the assembly code the compiler generates. Check the generated assembly, then look at the results from the profiler.
 