N-Queen Solver for OpenCL

Discussion in 'GPGPU Technology & Programming' started by pcchen, Jan 10, 2010.

  1. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,494
    Likes Received:
    405
    Location:
    Varna, Bulgaria
    HD5870 @ 900/5000MHz

    GPU:
    Code:
    nqueen_cl 17
    
    Platform [0]: ATI Stream
    Select platform 0
    Using GPU device
    Using 81920 threads
    Using vectorization
    17-queen has 95815104 solutions (11977939 unique)
    Time used: 3.82s
    
    CPU:
    Code:
    Cnqueen_cl -clcpu 17
    
    Platform [0]: ATI Stream
    Select platform 0
    Using CPU device
    Using 8192 threads
    17-queen has 95815104 solutions (11977939 unique)
    Time used: 7.96s
    
    The -novec switch causes program to stop executing.
     
  2. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,436
    Likes Received:
    443
    Location:
    New York
    Looks like I'm getting an incorrect result from the latest build.

    Original:
    Code:
    Platform [0]: NVIDIA CUDA
    Select platform 0
    Using GPU device
    Using 15360 threads
    Using global atomics
    17-queen has 95815104 solutions (11977939 unique)
    Time used: 3.5s
    Latest:
    Code:
    Platform [0]: NVIDIA CUDA
    Select platform 0
    Using GPU device
    Using 15360 threads
    Using global atomics
    17-queen has 10919088 solutions (1364886 unique)
    Time used: 0.32s
    
     
  3. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,773
    Likes Received:
    153
    Location:
    Taiwan
    Looks like you got the nqueen1 only version (which runs only the "one queen in the corner" case). I have updated the files in the first post and I think that should be the correct one.
     
  4. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,773
    Likes Received:
    153
    Location:
    Taiwan
    I think you'll need -noatomics for -novec on RV870 because there is a compiler related problem in current AMD Stream SDK.
     
  5. Bjorn Häske

    Newcomer

    Joined:
    Mar 30, 2010
    Messages:
    5
    Likes Received:
    0
    New SDK was released.
    With it, it's possible to activate bitalign on ATI/AMD GPUs using OpenCL.
    Oh, and some multi-gpu issues fixed.
     
  6. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Performance for this nqueen sovler also improves with latest SDK update.. now the vec4 version is the fastest. The packing on nqueen1 is ~94%.
     
  7. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,773
    Likes Received:
    153
    Location:
    Taiwan
    I think it's now possible to have a vec4 nqueen to run with the new SDK (in the older SDK it frequently hang the system). Also, the "semi-byte addressing" hack can be retired now :)
     
  8. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Yes, as I said above, the vec4 version runs great now and is the fastest version.
     
  9. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,773
    Likes Received:
    153
    Location:
    Taiwan
    Yes, the nqueen1_vec function is now faster in vec4. But the nqueen_vec function still behaves wildly. I didn't write a vec4 version of nqueen_vec, but now I modified one from the vec2 version, but it still hangs (although recoverable), probably from a very long running time when solving 17 queen. I tested it with 8 queen to 16 queen and it gives correct answers. However, it runs very slowly with 16 queen (around 10 seconds).

    Since the vec4 version of nqueen_vec has a very big symmetry check part (basically 4 times) it could be the reason why it's slow.
     
  10. Arnold Beckenbauer

    Veteran

    Joined:
    Oct 11, 2006
    Messages:
    1,415
    Likes Received:
    348
    Location:
    Germany
    n00b's question: What is the VRAM usage of nqueen_cl?
     
  11. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,773
    Likes Received:
    153
    Location:
    Taiwan
    It's very small. Basically each thread reads and writes its own data, which is around the board size * 32 bits number, and write to 2 32 bits number. So for around 70000 threads, it's at most a few MB.
     
  12. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Do you mean it's still slow in comparison to the full version? the full vec2 version? Have you looked at the ISA? Can you put up the full vec4 version (nqueen and nqueen1)?
     
  13. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,773
    Likes Received:
    153
    Location:
    Taiwan
    Yes, it's slower than the original vec2 version (which takes only 0.84 second).
    I updated the codes with my vec4 version here:

    Source
    Executable only

    I included 64 bits executables because 32 bits executables don't work on my computer when NVIDIA's video card is enabled.

    The modifications for byte addressable extensions are not in there yet.
     
  14. aak

    aak
    Newcomer

    Joined:
    Jan 14, 2010
    Messages:
    1
    Likes Received:
    0
    The reason for vec4 being slow is that you will have to use
    __kernel __attribute__((reqd_work_group_size(64, 1,1)))
    Without it the compiler will assume that you need your work group size to be 256
    (I remember reading that on an ATI forum that per opencl specs this is the default work group size the compiler will assume).
    If the compiler assumes that you need 256 threads he will start spilling registers to the memory because your kernel
    uses a lot of registers
    (you can see too many RD_SCRATCH and MEM_SCRATCH_WRITE in the ISA generated), so there are a lot of
    global memory read/write uncached, also the code size is big more than 60k . When I use
    __kernel __attribute__((reqd_work_group_size(64, 1,1)))
    in your kernel my stock 5850 took 8.95 Sec in 17 queens using 95 registers (without this option it took more than 10 sec in the 16 queens).
    E:\nqueen_cl>nqueen_cl64.exe -vec4 -local -p 17
    N-Queen solver for OpenCL
    Ping-Che Chen

    Platform [0]: ATI Stream
    Select platform 0
    Using GPU device
    Using 73728 threads
    Using vectorization
    Use 4D vectors
    Profile time: 8778111182ns
    17-queen has 95815104 solutions (11977939 unique)
    Time used: 8.95s
     
  15. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,773
    Likes Received:
    153
    Location:
    Taiwan
    Ah that's explained it, thanks. However it's still slower than the vec2 version, as vec2 version takes only about 5 seconds to run 17-queen. But this is still an important tip. :)
     
  16. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,773
    Likes Received:
    153
    Location:
    Taiwan
    Recently I decided to test a new way to make this program more balanced. In the original algorithm, every thread (or work item) may have very different running time. For example, one thread may end pretty early because that branch may be quite hopeless, while another thread may produce a lot of solutions. However, because within a warp each thread have to wait for one another, a lot of threads may sit idle for a lot of time.

    The modification using global atomics does not actually solve this problem, because its structure is two loops. That means, if some thread finishes its inner loop, it still has to wait for other threads to do its outer loop. It looks like this:

    Code:
    while(condition for outer loop) {
      initialize for next batch
      while(condition for inner loop) {
        ...
      }
    }
    One way to solve this is to make it a single loop, that is:

    Code:
    for(;;) {
      if(condition for inner loop) {
        ...
      }
      else {
        if(condition for outer loop) {
          initialize for next batch
        }
        else {
          break;
        }
      }
    }
    This way, for any thread with inner loop finished, it can continue to do another batch of inner loops without having to wait for other threads in the same warp.

    A breif test on my GeForce GTX 285 shows around 10% performance improvements. Along with other optimizations (mainly adjustments on the number of threads...), now 17 queen takes around 1.88s to run on GTX 285.

    I decided to test it on a GeForce GTX 460 (which actually performs quite well before this modification), and it does not run very well. Actually, it does not run at all. However, it runs without global atomics, but slower. After some checks here and there, I finally found that the problem is with the board_array variable, which is originally an int array:

    Code:
    #define BOARD_ARRAY_DECL int board_array[32];
    At first, I changed it to using shared memory, and it worked! (I tried it in earlier version to improve performance but GTX 285's shared memory is not big enough) Then I decided to change back to using normal private memory but with char:

    Code:
    #define BOARD_ARRAY_DECL char board_array[32];
    and it still works. Weird. Looks like a bug in NVIDIA's OpenCL implementation.

    Anyway, now it performs well, but not that well. It's only a little faster than a GTX 285 now. I suspect that in earlier version (the two loop version) it's faster because it has better warp scheduling than GTX 285 (which may mean that the threads don't have to wait as long as in the case of a GTX 285, in the original version). After the modification, the advantage is gone and now GTX 285 is up to the same league of a GTX 460.

    Also, I modified the program to choose the number of threads by the number of "cores", and multiply it by the number of threads per "core." For example, a GTX 285 has 30 "cores." So the default number of threads is 30*256 = 7680. It works rather well (changing the number does not make performance better, generally makes it worse). However, for Fermi it's bad, since Fermi has very small number of "cores." On the GeForce GTX 460, the default number is 7*256 = 1792, but tests show that the best number is 5376, that is, 7*256*3. Using more threads per group does not improve performance either.

    Unfortunately, since I don't have access to a Radeon 5850 anymore, I can't test it with the new OpenCL SDK from AMD, so I didn't make any changes to the vectorized version.

    I also write a multi-threaded CPU version for Win32, so it now takes advantage of all cores of a CPU. My Core i7 920 takes around 6.8 seconds to solve 17 queen (GTX 285 is 1.88s and GTX 460 is 1.65s).

    Another possible improvement is to make it supports multiple GPUs on a computer. Then two GTX 460 (which seems to be quite popular now) should be twice as fast as one GTX 460. But that's a bit complex and I think I'll do that when I have more free time :)

    The new version can be downloaded here:
    http://www.kimicat.com/dang-an-jia/nqueen_cl_src.zip
     
  17. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    17 queen:

    It's 6.75s on an i7-860 with HT on.
    It wouldn't run on my 5850 (couldn't recognize Device 1 and it stayed on CPU) so I'm installing the 2.2 SDK.

    installed stream sdk 2.2, GPU result:
    Vec2: 9.932 seconds (18432 threads)
    16.9 secs for vec4 (18432 threads)
    global atomics: 0.651s (4608 threads)
    using clcpu takes 31s (16384 threads)

    I was playing with the block size, shortest time I got was 0.627 with 256 blocks (errors above that)

    edit: old results

     
    #117 neliz, Aug 16, 2010
    Last edited by a moderator: Aug 16, 2010
  18. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,773
    Likes Received:
    153
    Location:
    Taiwan
    Hmm... so it seems to be quite weird. I remembered that it was much quicker on 5850 with vectored kernels (around 5 seconds). I'll try it with the new SDK 2.2 to see what's the problem when I have some time.
    The result from global atomics doesn't seem to be right. 17 queen should have 95815104 distinct solutions and 11977939 unique solutions.
    clcpu also was also faster. I remembered getting around 7.5s on my Core i7 920.
     
  19. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    I posted my old results there, cpu seems to be unchanged, cl values are out of whack.
     
  20. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,494
    Likes Received:
    405
    Location:
    Varna, Bulgaria
    Code:
    >nqueen_cl -cpu 17
    
    17-queen has 95815104 solutions (11977939 unique)
    Time used: 4.85s
    Code:
    >nqueen_cl -novec 17
    
    Platform [0]: ATI Stream
    Select platform 0
    Using GPU device
    Using 5120 threads
    Block size = 256 threads
    Using global atomics
    17-queen has 10919088 solutions (1364886 unique)
    Time used: 0.281s
    Code:
    >nqueen_cl -vec2 17
    
    Unknown option -vec2
    Platform [0]: ATI Stream
    Select platform 0
    Using GPU device
    Using 20480 threads
    Block size = 256 threads
    Using vectorization
    Use 2D vectors
    17-queen has 95815104 solutions (11977939 unique)
    Time used: 3.96s
    i7-920 @ 3996 MHz
    HD5870 @ 900/1250 MHz
    Stream SDK 2.2, Cat 10.8 beta5
    Win7 x64
     
Loading...

Share This Page

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.
Loading...