N-Queen Solver for OpenCL

Discussion in 'GPGPU Technology & Programming' started by pcchen, Jan 10, 2010.

  1. pcchen

    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,650
    Location:
    Taiwan
    I rewrote my old n-Queen solver from CUDA into OpenCL. It's not a port, because I rewrote everything from scratch (only the idea remains the same). I was hoping to use OpenCL to compare AMD and NVIDIA hardware.

    Unfortunately, for some reason, AMD's OpenCL compiler crashes when compiling my kernel for the GPU (it's fine for the CPU version, though). So right now it doesn't work on AMD's GPUs at all, but with AMD Stream SDK 2.0 it's possible to run on CPU devices.

    Both source and executable are provided in the attachment.

    The arguments are:

    nqueen_cl -cpu -clcpu -p -thread # -platform # -local -noatomics -novec N

    -cpu: use CPU implementation (single threaded)
    -clcpu: use OpenCL CPU devices
    -p: enable profiling
    -thread #: set the number of threads. (default: max work group item number * number of devices)
    -platform #: select a platform (default: select platform 0)
    -local: use local memory (shared memory) for the arrays. This runs faster on NVIDIA's GPUs because NVIDIA GPUs have no indexed registers.
    -noatomics: do not use global atomics (this is needed for now to avoid crash on AMD Stream SDK 2.0.1 when running scalar code)
    -novec: do not use vectorization
    N: the board size (1 ~ 32)

    Note: large board sizes (> 20) take forever to run (I estimate that 20 queens would take more than 2 hours on a 9800GT).

    The N-queen algorithm is a straightforward one. There are no special tricks to avoid redundant boards (i.e. most boards can be mirrored and rotated to create 8 solutions); only a simple mirror reduction is used.
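
    For reference, the basic idea is roughly the sketch below: plain bitmask backtracking plus the row-0 mirror reduction. This is not the code in the attachment, just a minimal single-threaded illustration (the function names and bit layout are made up for the example).

    Code:
    #include <stdio.h>

    /* Classic bitmask backtracking: cols/ld/rd mark the columns and the two
       diagonals attacked in the current row; full has the low n bits set. */
    static long long count_from(unsigned full, unsigned cols, unsigned ld, unsigned rd)
    {
        if (cols == full)                      /* a queen in every row */
            return 1;
        long long cnt = 0;
        unsigned avail = full & ~(cols | ld | rd);
        while (avail) {
            unsigned bit = avail & -avail;     /* lowest safe square */
            avail -= bit;
            cnt += count_from(full, cols | bit, (ld | bit) << 1, (rd | bit) >> 1);
        }
        return cnt;
    }

    /* Simple mirror reduction: only try the left half of the first row and
       double the result; for odd n the middle column is counted once. */
    long long nqueen_count(int n)              /* n <= 31 for this sketch */
    {
        unsigned full = (1u << n) - 1;
        long long cnt = 0;
        for (int c = 0; c < n / 2; c++) {
            unsigned bit = 1u << c;
            cnt += count_from(full, bit, bit << 1, bit >> 1);
        }
        cnt *= 2;
        if (n & 1) {
            unsigned bit = 1u << (n / 2);
            cnt += count_from(full, bit, bit << 1, bit >> 1);
        }
        return cnt;
    }

    int main(void)
    {
        printf("8-queen: %lld solutions\n", nqueen_count(8));   /* expects 92 */
        return 0;
    }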

    Running on my computer (Core 2 Duo E8400 3.0GHz):

    16 queen -cpu: 7.27 s
    16 queen -clcpu: 3.79 s

    On GPU (GeForce 9800GT):

    16 queen: 2.36 s
    16 queen -local: 1.23 s


    About the crash problem: the original kernel I developed (i.e. the kernel used by the clcpu path) doesn't crash the compiler. However, it's extremely slow (> 20 s for 16 queens on my 4850, and > 7 s on the 9800GT) because it uses four arrays to simulate a stack for recursion. These arrays are fine for the CPU version because they reduce the amount of computation, but on the GPU they are too hard on registers. So I developed another kernel which uses only one array, but it requires more computation to generate/restore the data at each step. However, this new kernel crashes the compiler (it works when selecting CPU devices, but it's slower there). I'm using the Cat 9.12 hotfix right now.
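
    To show what I mean by simulating the recursion with arrays, here is a rough single-threaded sketch of the multi-array variant (illustrative only, not the attached kernel): every level of the search keeps its own cols/ld/rd/avail entry, so backtracking is just decrementing the level. The one-array variant instead stores only the chosen squares and recomputes the masks when it backtracks, which trades registers/local memory for extra ALU work.

    Code:
    /* Iterative backtracking with explicit per-level arrays instead of
       recursion (illustrative; n <= 31). */
    long long count_iterative(int n)
    {
        unsigned full = (1u << n) - 1;
        unsigned cols[32], ld[32], rd[32], avail[32];   /* one entry per level */
        long long cnt = 0;
        int level = 0;

        cols[0] = ld[0] = rd[0] = 0;
        avail[0] = full;

        while (level >= 0) {
            if (avail[level] == 0) {            /* nothing left here: backtrack */
                level--;
                continue;
            }
            unsigned bit = avail[level] & -avail[level];
            avail[level] -= bit;                /* consume this candidate */

            unsigned c = cols[level] | bit;
            if (c == full) {                    /* a queen placed in every row */
                cnt++;
                continue;
            }
            unsigned l = (ld[level] | bit) << 1;
            unsigned r = (rd[level] | bit) >> 1;

            level++;                            /* "push" the next level */
            cols[level]  = c;
            ld[level]    = l;
            rd[level]    = r;
            avail[level] = full & ~(c | l | r);
        }
        return cnt;
    }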

    [EDIT] I updated the CPU program with reduced symmetry, which runs faster.
    [EDIT2] The OpenCL kernels now use reduced symmetry as well.
    [EDIT3] A vectorized version for RV870 has been added. (4/10/2010)
     

    Attached Files:

    #1 pcchen, Jan 10, 2010
    Last edited by a moderator: Jan 11, 2010
  2. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,355
    Location:
    Varna, Bulgaria
    Intel Q9450 @ 3608MHz:
    Code:
    nqueen_cl.exe -clcpu 16
    N-Queen solver for OpenCL
    Ping-Che Chen
    Platform [0]: ATI Stream
    Select platform 0
    Using CPU device
    Using 4096 threads
    16-queen has 14772512 solutions
    Time used: 1.64s
    
    Running on the GPU device target, the program gives me an error in the aticaldd.dll module.
     
  3. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    13,094
    Why does the CPU version use only one thread?
     
  4. mhouston

    mhouston A little of this and that
    Regular

    Joined:
    Oct 7, 2005
    Messages:
    344
    Location:
    Cupertino
    We'll take a look at the compiler issues.
     
  5. pcchen

    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,650
    Location:
    Taiwan
    I had a multi-threaded version in my old CUDA code, but that's not portable (it uses the Win32 API for managing threads). Also, I think with CPU support, OpenCL can reduce or remove the need to write your own multi-threaded code for the CPU :)
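
    As an aside, selecting a CPU device is all it takes to let the OpenCL runtime spread the work-groups over the cores. A minimal host-side sketch (error checking omitted, OpenCL 1.0 API; not the actual host code in the attachment):

    Code:
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        cl_uint num;

        clGetPlatformIDs(1, &platform, &num);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, &num);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        /* build the n-queen kernel and enqueue it here; the runtime handles
           the threading across all CPU cores */

        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
        return 0;
    }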

    Thanks, I'm eager to know how this will run on my 4850 :)
     
  6. trinibwoy

    trinibwoy Meh
    Legend Alpha

    Joined:
    Mar 17, 2004
    Messages:
    10,317
    Location:
    New York
    Cool, did you compare the CUDA and OpenCL versions or has the algorithm changed so much that it's not apples to apples?
     
  7. pcchen

    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,650
    Location:
    Taiwan
    The CUDA version initially used 4 arrays, so it limited the number of threads to 96 per SM (because the amount of shared memory is limited). I had written a newer version which IIRC reduced the number of arrays to 2 so it could use more threads. I don't have the newer version right now, but compared to the older 4-array version, the OpenCL version is a little faster (1.24 s vs 1.328 s on an 8800GT).

    Note that the major benefit of using more threads is hiding latency. Since this program does not access global memory frequently, it doesn't have to hide a massive amount of latency. Basically it only needs enough threads to hide ALU latency, which IIRC on NVIDIA's GPUs is 192 threads per work group. Using fewer arrays to enable more threads requires additional computation, so if the benefit of hiding ALU latency does not outweigh the extra computation, it's not going to be faster.

    The performance of this program is probably limited primarily by warp serialization due to branch divergence. This is an example of a "not-so-suitable-for-GPU" type of parallelizable workload :)
     
  8. pcchen

    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,650
    Location:
    Taiwan
    I checked my old CUDA version. It uses only one array for the stack, but the stored data is different (although there shouldn't be too much difference in computation). However, the old CUDA version also stores results in shared memory and performs a reduction when the kernel completes, so it can only have 128 threads rather than 256, and therefore it's still slower. I guess it was a bad decision to do the reduction on the GPU, because the CPU basically sits idle while the GPU is doing all the work.

    On my GTX 285, the CUDA version takes 0.686 seconds to run (16 queens), while the OpenCL version takes 0.643 seconds. Then I found that in my old CUDA version the array indexing is reversed to avoid bank conflicts, so I did the same, and it got a bit faster: 0.537 seconds.
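
    By "reversed indexing" I mean laying out the per-thread stack in local memory so that work-items at the same stack depth touch consecutive words (different banks) instead of words a whole stack apart (often the same bank). A toy illustration of the two layouts (not the actual kernel; the sizes are arbitrary):

    Code:
    #define MAX_DEPTH 12
    #define GROUP     64

    __kernel __attribute__((reqd_work_group_size(GROUP, 1, 1)))
    void layout_demo(__global int *out)
    {
        __local int stack[MAX_DEPTH * GROUP];
        const int tid = get_local_id(0);

        for (int depth = 0; depth < MAX_DEPTH; depth++) {
            /* naive layout   : stack[tid * MAX_DEPTH + depth]
               -> same-depth accesses are MAX_DEPTH words apart (bank conflicts)
               reversed layout: stack[depth * GROUP + tid]
               -> same-depth accesses are consecutive words (conflict-free) */
            stack[depth * GROUP + tid] = depth + tid;
        }
        out[get_global_id(0)] = stack[(MAX_DEPTH - 1) * GROUP + tid];
    }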

    The OpenCL CPU version takes 1.76 seconds to run on my Core i7 920. The single-threaded CPU version takes 7.58 seconds. The default number of threads for the OpenCL CPU version is a bit too high and may hurt performance, so I reduced it, but it's not really much faster (from 1.76 seconds to 1.68 seconds).

    I updated the source code and executables in the attachments.
     
  9. pcchen

    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,650
    Location:
    Taiwan
    I made a CPU version which tries to avoid redundant patterns. The original version only avoids horizontal mirroring, i.e.

    Code:
    - - - - x     x - - - -
    - x - - -     - - - x - 
    - - - x -     - x - - -
    x - - - -     - - - - x
    - - x - -     - - x - -
    
    are considered the same. This is easy to handle because a solution can never be its own mirror image.

    However, there are actually 8 possible transformations, i.e. 4 rotations and a mirror of each rotation. So if it's possible to generate only "unique" solutions, the speed could be increased by roughly a factor of 4. Unfortunately, it's not that simple, because some solutions can be symmetric. For example:

    Code:
    - - x -
    x - - -
    - - - x
    - x - -
    
    is symmetric, and only produces 2 non-unique solutions rather than 8, because rotating this solution by 90 degrees gives the same board. There are also solutions with only 180-degree symmetry:

    Code:
    - - - - - x -
    - - x - - - -
    x - - - - - -
    - - - x - - -
    - - - - - - x
    - - - - x - -
    - x - - - - -
    
    These solutions only produce 4 non-unique solutions. In order to avoid generating redundant solutions, there are two different cases. The first case is when there's a piece at the corner:

    Code:
    - - - - - - x
    - - - - - - - 
    - - - - - - - 
    - - - - - - -
    - - - - - - - 
    - - - - - - - 
    - - - - - - -
    
    In this case, there is no possibility of any kind of symmetry, because any rotation moves the corner piece to a different corner. There is also no possibility of mirror symmetry, because if you mirror along the diagonal axis, every piece is moved to an empty square (otherwise the two pieces could capture each other). However, there is a risk of generating the mirrored solution twice:

    Code:
    - - - - - - x     - - - - - - x
    - - - - x - -     - - x - - - -
    - - x - - - -     - - - - - x -
    x - - - - - -     - x - - - - -
    - - - - - x -     - - - - x - -
    - - - x - - -     x - - - - - -
    - x - - - - -     - - - x - - -
    
    This can be avoided with the "kill the second column" trick: when the position of the piece on the second row is decided, the squares in the column next to the corner column, from the third row down to the mirror image of that second-row piece, are marked as used (see the sketch after the diagram):

    Code:
    - - - - - - x
    - - x - - - -
    - - - - - # -
    - - - - - # -
    - - - - - # -
    - - - - - - -
    - - - - - - -
    
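    In code, that masking is roughly the following (illustrative only; it assumes the corner piece sits at row 0 of the last column and uses a plain 2-D "used" array rather than the bitmasks in the real solver):

    Code:
    /* After placing the row-1 queen at column c1, forbid the squares in
       column n-2 from row 2 down to the anti-diagonal mirror of (1, c1),
       which is row n-1-c1.  This keeps exactly one board of each mirror
       pair. */
    void kill_second_column(unsigned char used[32][32], int n, int c1)
    {
        for (int r = 2; r <= n - 1 - c1; r++)
            used[r][n - 2] = 1;
    }
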
    Other cases are more complicated, though. To avoid generating redundant rotations and mirrorings, when the position of the piece on the first row is decided, these positions are marked as used:

    Code:
    # # - - x # #
    # - - - - - #
    - - - - - - -
    - - - - - - -
    - - - - - - -
    # - - - - - #
    # # - - - # #
    
    However, even with these, it's still possible to generate redundant solutions. For example:

    Code:
    - - - - - x - -     - - - - - x - -
    - - x - - - - -     - x - - - - - -
    x - - - - - - -     - - - - - - x -
    - - - - - - x -     x - - - - - - -
    - - - - x - - -     - - - x - - - -
    - - - - - - - x     - - - - - - - x
    - x - - - - - -     - - - - x - - -
    - - - x - - - -     - - x - - - - -
    
    These two solutions are "the same" but would be generated separately. To detect this and avoid counting twice, every solution is rotated and mirrored, and only the "smallest" form is kept, so each family of solutions is counted exactly once. The same check also gives the symmetry: if a solution is unchanged when rotated 90 degrees, the symmetry is 2; if it's unchanged only when rotated 180 degrees, the symmetry is 4; otherwise it's 8.
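
    A rough sketch of this check (illustrative names, with board[i] holding the column of the queen in row i; not the exact code in the attachment):

    Code:
    #include <string.h>

    /* a queen at (r, c) maps to (c, n-1-r) under a 90-degree rotation */
    static void rotate90(const int *in, int *out, int n)
    {
        for (int r = 0; r < n; r++)
            out[in[r]] = n - 1 - r;
    }

    /* flip left/right */
    static void mirror(const int *in, int *out, int n)
    {
        for (int r = 0; r < n; r++)
            out[r] = n - 1 - in[r];
    }

    /* lexicographic comparison of two boards */
    static int cmp_board(const int *a, const int *b, int n)
    {
        for (int i = 0; i < n; i++)
            if (a[i] != b[i])
                return a[i] < b[i] ? -1 : 1;
        return 0;
    }

    /* Returns 2, 4 or 8 (how many distinct boards this solution's family has)
       and sets *is_canonical when 'board' is the lexicographically smallest
       form, i.e. the only member of its family that should be counted. */
    int classify(const int *board, int n, int *is_canonical)
    {
        int rot[32], mir[32], tmp[32];
        int symmetry = 8;
        *is_canonical = 1;

        memcpy(rot, board, n * sizeof(int));
        for (int k = 1; k <= 3; k++) {              /* 90, 180, 270 degrees */
            rotate90(rot, tmp, n);
            memcpy(rot, tmp, n * sizeof(int));
            int c = cmp_board(rot, board, n);
            if (c < 0) *is_canonical = 0;           /* a smaller form exists */
            if (c == 0) { symmetry = (k == 2) ? 4 : 2; break; }
        }
        mirror(board, mir, n);                      /* now the 4 mirrored forms */
        for (int k = 0; k < 4; k++) {
            if (k) { rotate90(mir, tmp, n); memcpy(mir, tmp, n * sizeof(int)); }
            if (cmp_board(mir, board, n) < 0)
                *is_canonical = 0;
        }
        return symmetry;
    }

    Summing the returned symmetry over the canonical solutions gives the total count, and the number of canonical solutions is the unique count.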

    With these "optimizations" the CPU implementation runs a bit faster. Computing 16-queen on my Core 2 3.0GHz now takes 4.64 seconds (instead of the previous 7.27 seconds, i.e. about 57% faster), and it computes both the total number of solutions and the number of unique solutions (16-queen has 1846955 unique solutions).
     
  10. Simon F

    Simon F Tea maker
    Moderator Veteran Subscriber

    Joined:
    Feb 8, 2002
    Messages:
    4,535
    Location:
    In the Island of Sodor, where the steam trains lie
    Nice bit of analysis.
     
  11. prunedtree

    Newcomer

    Joined:
    Aug 8, 2009
    Messages:
    27
    For fun, I tried to see what kind of 'performance' vs 'programming time' ratio one could typically achieve on this problem!

    Reading various info on the internet
    == time: 1h / performance: no code yet

    First attempt, single-threaded (inspired by http://www.ic-net.or.jp/home/takaken/e/queen/index.html)
    == time: 1h 10 minutes / performance: 16 queens in 16 sec

    Removing basic symmetry
    == time: 1h 15 minutes / performance: 16 queens in 8 sec

    Multi-threading (very naive, just one thread for each of the first branches)
    == time: 1h 30 minutes / performance: 16 queens in 2 sec

    Playing around, found false sharing
    == time: 1h 45 minutes / performance: 16 queens in 1.75 sec

    Some ricing, one level unroll (28% faster) and tail function (7% faster)
    == time: 2h / performance: 16 queens in 1.27 sec

    The CPU used is an i7 920 (2.66 GHz)

    It's interesting that the multi-threading speedup is superlinear (4.9x) thanks to SMT (hyperthreading).

    No attempt was made at more complex symmetry reduction, although if I understand right it should come nearly for free in terms of backtracking throughput. No attempt was made at turning the recursion into an iteration either (my 1-level unroll is just a copy/paste ^^)

    This naive version of 16 queens has a 570595151-node search tree, which it processes at a rate of 450 million nodes per second.
     
  12. pcchen

    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,650
    Location:
    Taiwan
    Ah, that's similar to the CPU version of my old CUDA code. The old CPU version also only uses mirroring and is multi-threaded using the Win32 API.

    I became interested in reducing more of the symmetry after thinking a lot about this. The ideas in my previous post are actually pretty similar to the page you linked. I think it's possible to make the GPU run faster with reduced symmetry, but the more complex symmetry reduction may bring even less of a performance gain on the GPU than on the CPU, because GPUs really don't like branchy code :)

    The reason it shows a superlinear speedup with hyperthreading is probably also related to the branchy nature of this problem, as hyperthreading may help hide some of the latency.

    However, when I did the old CUDA version I was pretty impressed by the GPU's ability to run this kind of code. I assumed GPUs would be pretty bad at this kind of problem and would lag behind a comparable CPU, but I was wrong.

    Unfortunately, my P6T motherboard is dead, and it may take some time to fix it, so I can't update the programs on my GTX 285 for a while. I can only use my home computer on weekends.
     
  13. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Location:
    In the know
    nqueen -cpu 6.64s
    nqueen -clcpu 3.85s

    GPU crashes even with Cat 10.1 beta
     
  14. Lightman

    Veteran

    Joined:
    Jun 9, 2008
    Messages:
    1,599
    Location:
    Torquay, UK
    My turn:
    -cpu 16 = 3.75s
    -clcpu 16 = 1.51s

    crash using GPU :cry:

    Phenom II 940@3455MHz
     
  15. mhouston

    mhouston A little of this and that
    Regular

    Joined:
    Oct 7, 2005
    Messages:
    344
    Location:
    Cupertino
    Yep, we have confirmed that this code causes LLVM to generate irreducible control flow, which makes our compiler unhappy. We already have this in our internal bug tracker.
     
  16. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,355
    Location:
    Varna, Bulgaria
    Q9450 @ 3608MHz

    -cpu 16: 3.88s
    -clcpu 16: 1.81s

    I can't believe PII is faster here... :runaway:
     
  17. Lightman

    Veteran

    Joined:
    Jun 9, 2008
    Messages:
    1,599
    Location:
    Torquay, UK
    Would 3 complex decoders help here?
    Or larger L1 cache?

    I was pleasantly surprised as well :lol:

    @mhouston:
    Good to know this will be sorted!
     
  18. entity279

    Regular

    Joined:
    May 12, 2008
    Messages:
    536
    Location:
    Romania
    I also had a classifier implemented on CPUs, and it ran on my X2 3800+ (@2400) as fast as it did on a stock Q6600 (single-threaded of course, with the main CPU-time-consuming loop unrolled). It mostly multiplied floats and every now and then extracted square roots.

    LE: and the number of unrolls was tested on both CPUs so it didn't favor a specific architecture.
     
    #18 entity279, Jan 18, 2010
    Last edited by a moderator: Jan 18, 2010
  19. pcchen

    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,650
    Location:
    Taiwan
    I added the symmetry reduction algorithms to the OpenCL version, so it computes unique solutions too. The CPU version is also optimized a little bit.

    On my Core i7 920 + GeForce GTX 285:

    -cpu 16: 4.3s
    -clcpu 16: 1.38s
    16 (gpu): 1.09s
    -local 16 (gpu): 0.492s (0.542s in old version)

    -cpu 17: 29.6s
    -clcpu 17: 8.96s
    17 (gpu): 4.05s
    -local 17 (gpu): 2.51s (4.36s in old version)

    -cpu 18: 216s
    -clcpu 18: 60.1s
    18 (gpu): 32.8s
    -local 18 (gpu): 24.5s (42.3s in old version)

    Unfortunately, this program still can't run on ATI's GPU.
    Another interesting thing: it seems the GPU versions scale worse than the CPU versions, so in theory, for some very large n, the CPU could be faster than the GPU.
     
  20. pcchen

    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,650
    Location:
    Taiwan
    I did some rather trivial optimizations for the GPU to make it a little faster, mainly in the symmetry check part.

    Running on GeForce GTX 285:

    16 (gpu): 1.09s
    -local 16 (gpu): 0.5s

    17 (gpu): 4.01s
    -local 17 (gpu): 2.55s

    18 (gpu): 28.3s
    -local 18 (gpu): 19.2s
     
