N-Queen Solver for OpenCL

Discussion in 'GPGPU Technology & Programming' started by pcchen, Jan 10, 2010.

  1. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    This is cool :)
    I'm still doing experients on vectorized version for Cypress, but my computer is not very cooperative and I have to replace my PSU (my old PSU is probably not powerful enough to drive both GTX 285 and 5850). I still expect to see at least 2x speed up for Cypress in vectorized version.
     
  2. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,809
    Likes Received:
    484
    Location:
    Torquay, UK
    I can't wait for vectorized version! :smile:
     
  3. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Is he optimizing the code at all? This is not an optimized version so I don't think he should be using it to bench anything, particular ATI GPUs.

    You can get a good speedup by decreasing the CF through some simple techniques I believe.

    I do have a question, does your code auto-verify results? I'm assuming so, yes?
     
  4. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Global memory is done in 128 bit, but I'm not sure if they optimize float values. Of course, if you aren't memory bound then it won't matter. As far as the ALU ops, the compiler does do a fairly decent job of packing instructions.

    I'm interested to see the vectorized results also.
     
  5. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Also, does anyone know how many threads Ryan Smith used in his "benchmark"?
     
  6. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    It seems he is using 256k threads? Is that correct?
     
  7. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    Global memory access is not important for this program, because most of the work does not use global memory at all. Basically, for each work item, there is only 4 reads and 2 writes.

    Right now I'm focusing my vectorization experiments on the "one queen at the corner" case, which is handled by a separate, simpler function. The ALU packing ratio of this function (non-vectorized) is only 34%, which is not very good. That's why I think it's possible that a vectorized version may be able to run twice as fast.

    However, my current vectorized version, although have a ~84% ALU packing ratio, runs much slower because for some reason it has many global memory access (in the order of tens of thousands per work item). I suspect that this could be due to increased register pressure from vectorization. However, the number of work items per work group is already reduced 4 times, so it shouldn't need many more registers. And reducing the number of work items per work group further does not help either. I think if this problem can be solved, it's possible to see a good speed up from vectorization.
     
  8. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    I'm not sure because I didn't see any mention of this in the article. However, I don't think many more threads will help. By default, the number of thread is the number of cores * 256 * 2. So in the case of Radeon 5850 it's 9216 threads, and in the case of 5870 it'd be 10240 threads.
     
  9. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Can you post the vectorized code? I'd be interested to take a look.

    Yes, if you use local memory this is correct; however, the 4870 doesn't have this and currently uses local memory as global memory.

    Also, I'm curious as to why your packing ratio increased so much, this would imply that the compiler is not good at packing, which counters what I've seen from the compiler.

    Also, have you applied any control flow reduction techniques? I applied one to one if-else statement in your queen1 kernel and on cypress got a 13% improvement on 16k threads with a 16x16 board.
     
  10. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Ok, just wondering, I get different execution times for different number of threads, say 16k vs. 32k for 17x17 board (what he used).
     
  11. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    I now think it's quite possible the register pressure is responsible for the extra global memory access, because I changed all int4/uint4 into int2/uint2 and now all global memory access are gone.

    Now the non-vectorized nqueen1 function takes 0.187 second to run (the time is given by profiler rather than wall clock). Vectorized nqueen1 takes 0.137 second to run. Of course, ALU packing ratio is a little lower because of using 2D vectors instead of 4D vectors, but it's still at about 81%. Also, there is no branch instructions in the main loop in the vectorized version anymore.

    An interesting thing is, for nqueen1 alone, now GTX 285, which takes 0.158s to run, now runs slower than the vectorized 5850. Of course, it's still unknown whether the main part can be optimized for Cypress as good as this :)

    The vectorized code looks like this:

    Code:
    __kernel void nqueen1_vec(int board_size, int level, int threads, __global uint2* params, __global uint2* results, __constant uint2* forbidden)
    {
    	int idx = get_global_id(0);
    	int tid = get_local_id(0);
    
    	uint2 ms;
    	__local uint nsx[12][WORK_ITEMS];
    	__local uint nsy[12][WORK_ITEMS];
    
    	uint2 mask = params[idx];
    	uint2 left_mask = params[idx + threads];
    	uint2 right_mask = params[idx + threads * 2];
    	int2 second_row = convert_int2(params[idx + threads * 3]);
    	uint2 board_mask = (uint2) ((1 << board_size) - 1);
    	uint2 left_mask_big = (uint2) 0;
    	uint2 right_mask_big = (uint2) 0;
    	uint2 solutions = (uint2) 0;
    	int2 i = (int2) 0;
    	uint2 nsi, nsi_mask;
    		
    	ms = mask | left_mask | right_mask | (convert_uint2(i < second_row) & (uint2)2);
    	nsi = ((ms + (uint2) 1) & ~ms);
    	nsx[0][tid] = nsi.x;
    	nsy[0][tid] = nsi.y;
    
    	while(any(i >= (int2) 0)) {
    		nsi.x = nsx[max(i.x, 0)][tid];
    		nsi.y = nsy[max(i.y, 0)][tid];
    		nsi_mask = convert_uint2((nsi & board_mask) != (uint2) 0) & convert_uint2(i >= (int2) 0);
    
    		{
    			// for nsi_mask == true...
    			mask |= (nsi & nsi_mask);
    			left_mask_big = select(left_mask_big, (left_mask_big << (uint2) 1) | (left_mask >> (uint2) 31), nsi_mask);
    			left_mask = select(left_mask, (left_mask | nsi) << (uint2) 1, nsi_mask);
    			right_mask_big = select(right_mask_big, (right_mask_big >> (uint2) 1) | (right_mask << (uint2) 31), nsi_mask);
    			right_mask = select(right_mask, ((right_mask | nsi) >> (uint2) 1), nsi_mask);
    			ms = mask | left_mask | right_mask | (convert_uint2((i + 1) < second_row) & (uint2)2);
    			nsi = select(nsi, ((ms + (uint2) 1) & ~ms), nsi_mask);
    			i = select(i, i + 1, convert_int2(nsi_mask));
    			nsx[max(i.x, 0)][tid] = nsi.x;
    			nsy[max(i.y, 0)][tid] = nsi.y;
    		}
    			
    		{
    			// for nsi_mask == false
    			solutions -= (convert_uint2(i == (int2) level) & ~nsi_mask);
    			i = select(i - 1, i, convert_int2(nsi_mask));
    
    			nsi.x = nsx[max(i.x, 0)][tid];
    			nsi.y = nsy[max(i.y, 0)][tid];
    			nsi_mask = ~nsi_mask & convert_uint2(i >= (int2) 0);
    		
    			// for i >= 0
    			mask = select(mask, mask & ~nsi, nsi_mask);
    			left_mask = select(left_mask, (((left_mask >> (uint2) 1) | (left_mask_big << (uint2) 31)) & ~nsi), nsi_mask);
    			left_mask_big = select(left_mask_big, (left_mask_big >> (uint2) 1), nsi_mask);
    			right_mask = select(right_mask, (((right_mask << (uint2) 1) | (right_mask_big >> (uint2) 31)) & ~nsi), nsi_mask);
    			right_mask_big = select(right_mask_big, (right_mask_big << (uint2) 1), nsi_mask);
    			ms = mask | left_mask | right_mask | nsi | (convert_uint2(i < second_row) & (uint2)2);
    			nsi = select(nsi, ((ms + nsi) & ~ms), nsi_mask);
    			
    			nsx[max(i.x, 0)][tid] = nsi.x;
    			nsy[max(i.y, 0)][tid] = nsi.y;
    		}
    	}
    
    	results[idx] = solutions * (uint2) 8;
    	results[idx + threads] = solutions;
    		
    }
    
     
  12. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Do you mind posting the whole solution/project with the vectorized version and all? Thank you.
     
  13. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Also, I'd be interested to see if your increase in ALUPackingRatio was really attributable to the vectorization OR to your elimination of control flow.

    All that control flow creates "blocks" in the clauses and the compiler doesn't "pack" across clauses, so when you reduce control flow like you did you get better packing.

    Do you have a non-vectorized version with no control flow? What is the packing ratio for that?
     
  14. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    How many threads?

    Running 16834 threads on a 16x16 board my profiled kernel time for queen1 (non-vectorized, non-control flow) is 26.6232.
     
  15. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    I tried using

    if(n) ...

    or

    a = n ? b : c;

    neither increases ALU packing ratio.

    I put the current experimental version in the attachment. This version now runs only nqueen1.
     

    Attached Files:

  16. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    I did some further experiments on the main nqueen function.

    I found that the rotation check part (check whether the solution is 2-way or 4-way symmetry or no symmetry at all) is a major performance bottleneck for Cypress. I compared two different versions, one with full rotation, another with no rotation at all, and assume all solutions are asymmetric (which is of course incorrect). The run time of the two versions are (for 17 queen):

    Normal: 2.384s
    Skip: 2.064s

    This means the rotation part takes around 0.3s on GTX 285, which is not significant. However, for Cypress, it's quite different:

    Normal: 7.311s
    Skip: 3.885s

    Almost half of the execution time is spent on the rotation check part. I'm not sure about the reason behind this, although honestly the rotation check part is not very well optimized. There are some global memory access which are not coalesced and the loops are not unrolled.

    If the normal computation part of the main function is vectorized as the nqueen1, the skipped version is also faster, takes only 2.364s to run. However, for some reason it hangs with the rotation check enabled (the rotation check is not vectorized, so it's done one by one). If only one rotation check is done (there should be two in the vectorized version) it doesn't hang, but it takes 4.055s to run. So if both checks are done, it can be estimated that the run time should be around 5.7s.

    So, apparently, the rotation check part is an important optimization target for running on Cypress. I'll think about it when I have more free time :)
     
  17. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    pcchen,

    One last question, what board size and thread count are you running? I just want to know for comparison.
     
  18. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    I use 17 queen, and in most case the default thread count.

    I tried to take advantage of the larger local memory of Cypress to avoid global memory access in the rotation check part. However, since currently Cypress' OpenCL still doesn't support byte addressable memory in local memory, I have to "simulate" it with logic shift and bitwise operations. Right now the result looks good: it doesn't hang anymore, and it's also significantly faster. Another side effect is, in the older version, using more threads don't seem to help much, but now it improves performance up to 73728 threads (the old default is 9216 threads).

    The new version running for 17 queen takes 4.984s on my 5850 compared to previous 7.311s. I think there are still some optimizations can be done in the rotation check part for Cypress.
     
  19. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Again, if you wouldn't mind posting the code. :)

    Thank you.

    Also, I seem to get an infinite loop when attempting to profile your vec version when I use int4/uint4 instead of "2", but the without profiling the code runs.
     
  20. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    I also encountered some driver hang up too. Normally it happens when there are some problems in the kernel (such as writing to some memory address it shouldn't write). But there are also some random hang up such as the situation I described earlier (the rotation check part causes driver hang). Fortunately, it's generally able to come back after a few seconds (at least on my Windows 7... :) ).

    I did some modifications to the code. Now it automatically enables vectorized version when it detects a "vector GPU." Right now the vectorized version always use local memory, and don't use global atomics (in current Stream SDK it crashes the compiler anyway).

    The new source code is in the attachment.
     

    Attached Files:

Loading...

Share This Page

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.
Loading...