N-Queen Solver for OpenCL

Discussion in 'GPGPU Technology & Programming' started by pcchen, Jan 10, 2010.

  1. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    Just tested this app with a beta of the next Stream SDK release and it works. Performance could be improved if you can find a way to vectorize your algorithm :) The compiled code has lots of flow control with little work (300 clauses, 425 ALU instructions on 5870).
     
  2. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    Wow, that's great! I has been thinking about how to vectorize the code, although it's probably not easy but I think at least it's possible. :)
     
  3. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    I found a bug which may prevent it to run with -local on G8X/G9X GPUs. A workaround is to specify thread numbers to multiple of 192 (such as 5376). I'll post updates later.
     
  4. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    This works with the latest AMD Stream SDK 2.0.1 now. On my Radeon 4850 it takes 4.61 seconds to run 16 queen, and 3.79 seconds to run -local version. This is interesting because AFAIK Radeon 4850 does not support real local memory in OpenCL. However, it's possible that using local memory forces the compiler to use memory instead of registers so it reduces pressure on registers.
     
  5. mhouston

    mhouston A little of this and that
    Regular

    Joined:
    Oct 7, 2005
    Messages:
    344
    Likes Received:
    38
    Location:
    Cupertino
    On windows you can use the profiler to dump the generated ISA and you can look at the differences.
     
  6. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,969
    Likes Received:
    963
    Location:
    Torquay, UK
    On stock HD5870 it takes:
    2.27s for 16 queen
    0.85s for 16 queen -local

    Just for fun on HD5870@1130/1248:
    2.07s for 16 queen
    0.67s for 16 queen -local
    66.0s for 18 queen -local

    EDIT!
    The above was on old version!

    This is on the new one:
    On stock HD5870 it takes:
    3.40s for 16 queen
    0.77 for 16 queen -local

    Just for fun on HD5870@1130/1248:
    2.80s for 16 queen
    0.59s for 16 queen -local
    30.50s for 18 queen -local

    New version is a lot slower than old not using local memory and quite a bit faster using it? Why is that?
     
    #26 Lightman, Feb 13, 2010
    Last edited by a moderator: Feb 13, 2010
  7. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    The new version uses a more complex algorithm to remove more symmetry. I'm not sure why using local is faster, though. I used the Stream KernelAnalyzer to compare the two kernels, but the local version seems to be using more registers than normal version, which is pretty odd.

    I think the number of work items per work group is probably also an important factor. I plan to make a new option so it's possible to set the number of work items per work group (currently it's only possible to set the number of total work items).

    If a vectorized kernel is faster (I'm still thinking about it), I think it's possible that Radeon 5870/5850 to be faster than GTX 285.
     
  8. psychocoder

    Newcomer

    Joined:
    Feb 27, 2010
    Messages:
    1
    Likes Received:
    0
    Hi,

    how long your opencl code need to calculate full 19 queens and 20 queens problem??

    thank you
     
  9. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    I have been experimenting with some optimization ideas. One is vectorization, another one is to reduce imbalance. That is, since each work item may take different time to run (some work items may end pretty soon while others may take a long time to finish), it creates an imbalance where some work items have to wait for other work items to finish, which is not very efficient. Furthermore, this problem is worse as the number of work items grows larger. However, the number of work items still has to be large enough to fully utilize the GPU, so it's a problem.

    I made a simple modification to try reducing imbalance, by using a "global index" with an atomic add for each work item to get next work instead of just sit idle. Of course, this requires a GPU with support for atomic operations (i.e. cl_khr_global_int32_base_atomics).

    Here are some prelimeary results (with GeForce 8800GT) :

    New -local 17: 4.41s
    Old -local 17: 5.06s

    New -local 18: 32.8s
    Old -local 18: 38.8s

    New -local 19: 270s
    Old -local 19: 330s

    [EDIT] A test run with n = 20 takes around 2280 seconds.
     
  10. Lux_

    Newcomer

    Joined:
    Sep 22, 2005
    Messages:
    206
    Likes Received:
    1
  11. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    I took a look at this program and it looks similar to my old algorithm, i.e. only the mirror symmetry is removed. I didn't recompile the program but it takes around 10 seconds to solve 16 queen on a Core 2 1.83GHz. The "rotation symmetry" removed program (the single threaded cpu path) takes 5.78 seconds to solve 16 queen.
     
  12. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    Running on GTX 285:

    New -local 17: 2.41s
    Old -local 17: 2.44s

    New -local 18: 15.1s
    Old -local 18: 18.5s

    New -local 19: 123s
    Old -local 19: 155s

    New -local 20: ~ 1060s
     
  13. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Yes. What is your packing now? I'm pretty sure you should be able to improve your packing and reduce control flow.
     
  14. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    I got some weird problems now...
    NVIDIA has released their new driver (196.75) which is supposed to support the latest ICD which is compatible with the latest ATI Stream SDK. However, after I installed the new driver, it can't find its own OpenCL device. After a few check I found out that nvcuda.dll lacks the necessary clIcdGetPlatformIDsKHR function (it has the other two functions).

    So now it's a weird thing, which is, after installing ATI Stream SDK 2.0.1, NVIDIA's OpenCL.dll (which is actually Khronos' opencl.dll) can find only ATI's GPU on my computer. I delayed installing ATI Stream SDK 2.0.1 on my computer just because I was afraid that I might not be able to use my GTX 285 for OpenCL, but now NVIDIA's own driver just did that... Anyway, I hope they fixed that soon.

    Then there is another problem. With previous SDK, the OpenCL compiler crashes with my n-queen kernel. 2.0.1 SDK fixed that. However, after I added the "imbalance reducing" logic (which is basically another loop around the original loop) it crashes again. Fortunately I coded a "force no atomics" option and I can verify that it runs without the new loop.

    The new program is posted here instead of under the first post because it currently crashes with ATI Stream SDK 2.0.1 without the -noatomics option. It does not contain any of my vectorization experiments yet.

    New options:

    -noatomics: do not use atomics

    Fixed:

    -threads #: works with -local now


    [EDIT] I reinstalled the driver and now OpenCL works again for NVIDIA's GPU. Now any OpenCL program can see both platforms working! That's great!
     

    Attached Files:

  15. mhouston

    mhouston A little of this and that
    Regular

    Joined:
    Oct 7, 2005
    Messages:
    344
    Likes Received:
    38
    Location:
    Cupertino
    Reproduced and internal bug filed.
     
  16. XMAN26

    Banned

    Joined:
    Feb 17, 2003
    Messages:
    702
    Likes Received:
    1

    Which version ran a little quicker?

    Cuda = .686s
    OCL = .643s
    ? = .537s
     
  17. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    The OpenCL version runs faster when it uses the same technique as the older CUDA version.
    However, with newer algorithms 16 queen takes longer to run (although 17 or larger queens take less time to run).
     
  18. Farid

    Farid Artist formely known as Vysez
    Veteran Subscriber

    Joined:
    Mar 22, 2004
    Messages:
    3,844
    Likes Received:
    108
    Location:
    Paris, France
  19. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,969
    Likes Received:
    963
    Location:
    Torquay, UK
    Yeah, I noticed that as well :smile:

    Congrats PCChen :cool2:
     
  20. Arnold Beckenbauer

    Veteran Subscriber

    Joined:
    Oct 11, 2006
    Messages:
    1,756
    Likes Received:
    722
    Location:
    Germany
    hd4850 (675/993)
     
    #40 Arnold Beckenbauer, Mar 27, 2010
    Last edited by a moderator: Mar 27, 2010
Loading...

Share This Page

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.
Loading...