N-Queen Solver for OpenCL

Discussion in 'GPGPU Technology & Programming' started by pcchen, Jan 10, 2010.

  1. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    Hmm.. it looks like novec is not correct on SDK 2.2. However, vec2 looks to be faster than SDK 2.1. That's pretty nice.
    My computer with the Radeon 5850 is still using Cat 10.6. Since I'm not using CrossFire, is it ok to use 10.7 (since SDK 2.2 requires 10.7)? I don't want to interfere with the normal usage (i.e. gaming) of that computer now...
     
  2. mhouston

    mhouston A little of this and that
    Regular

    Joined:
    Oct 7, 2005
    Messages:
    344
    Likes Received:
    38
    Location:
    Cupertino
    For SDK 2.2, you'll want 10.7b, the driver linked to on the SDK download page.
     
  3. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0

    I would be careful with using SDK 2.2/10.7 as there are all kinds of problems with it. I'm not sure why those problems didn't get fixed before release but....

    ...personally, I had to go back to using 2.1/10.5. It's possible that 10.8 beta may fix the issues but I don't know.
     
  4. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    I fixed a bug in the non-local non-vectorized version. It no longer hangs now (and should run correctly on AMD's GPU too). Actually since local version is faster on both GPU, I think it's probably better to make it the default kernel (and add a -nolocal switch to turn it off for experiments).

    Unfortunately, now GTX 460 still has some weird bug. The "forbidden" array, which was declared __constant, does not work this way and has to be declared as __global. This only happens on GTX 460. GTX 285 and 8800GT are all fine by __constant.

    I also forgot to add pragma to enable cl_khr_byte_addressable_store, which is required for board_array to be char. The original kernel uses int, but apparently GTX 460 don't like that.

    The new version can be downloaded from the same place:
    http://www.kimicat.com/dang-an-jia/nqueen_cl_src.zip
     
  5. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    I added the new global atomics mechanism to the vectorized kernels. Apparently its much more effective on AMD GPU. Running 17 queen on Radeon 5850 is now faster @ 3.31s instead of previous ~5s.

    nqueen_cl 16: 0.902s
    nqueen_cl 17: 3.31s
    nqueen_cl 18: 23s

    These are with SDK 2.2 and 10.7 beta driver (for OpenCL).

    However, there is a very weird result for the non-vectorized kernel: it's now much faster on Radeon 5850:

    nqueen_cl -novec -local 16: 0.858s
    nqueen_cl -novec -local 17: 2.99s
    nqueen_cl -novec -local 18: 20.4s

    compared to noatomics version:

    nqueen_cl -novec -noatomics -local 16: 0.887s
    nqueen_cl -novec -noatomics -local 17: 5.59s
    nqueen_cl -novec -noatomics -local 18: 40s

    Very interesting. It seems to me that SDK 2.2 is much better at running non-vectorized code than SDK 2.1. Kudos to AMD for the great work :)

    Again, the latest version can be downloaded from the same location.
    http://www.kimicat.com/dang-an-jia/nqueen_cl_src.zip
     
  6. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,494
    Likes Received:
    405
    Location:
    Varna, Bulgaria
    That's weird!

    The new binary now takes much longer time to process the non-vector kernel:

    Code:
    >nqueen_cl -novec 17
    
    Platform [0]: ATI Stream
    Select platform 0
    Using GPU device
    Using 5120 threads
    Block size = 256 threads
    Using global atomics
    17-queen has 95815104 solutions (11977939 unique)
    Time used: 31.4s
    Compare this to my previous result:
    Code:
    >nqueen_cl -novec 17
    
    Platform [0]: ATI Stream
    Select platform 0
    Using GPU device
    Using 5120 threads
    Block size = 256 threads
    Using global atomics
    17-queen has 10919088 solutions (1364886 unique)
    Time used: 0.281s
    The difference is that the new binary now reports the correct number of solutions.
     
  7. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Are your timings for the non-vector kernel repeatable. The difference in time suggests that it might not be the case due to some freak driver/hw situation.
     
  8. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,494
    Likes Received:
    405
    Location:
    Varna, Bulgaria
    Yep, but there is a strange hang down of the system during the execution. The vector kernel is OK.
     
  9. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Something for the driver people to look at then.
     
  10. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    The old non-local non-vector version has a bug which may cause kernel hang, so it's not usable (you can also see the result is incorrect). This is now fixed that's why it's slower while producing correct result. :)
    I suggest always using local version (i.e. use -local argument). It's faster not only on NVIDIA's GPU but also on AMD's GPU. Actually it's now faster than vectorized version.

    I also done a few profiling and strangely the vectorized version has lower ALU busy rate than scalar version (about 30% vs 40%). The vectorized version has better ALU packing rate and lower ALU stall by LDS rate, but it has more instructions.
     
  11. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    I made a new version which is default to use local memory. This affects only scalar kernels. To disable using local memory for arrays, use -nolocal option.
    This version also contains a small optimization which makes it a little bit faster.

    Source code & Executable
    Executable only
     
  12. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,483
    Likes Received:
    505
    Location:
    New York
    Code:
    >nqueen_cl 17
     
    Platform [0]: NVIDIA CUDA
    Select platform 0
    Using GPU device
    Using 5376 threads
    Block size = 256 threads
    Using global atomics
    17-queen has 95815104 solutions (11977939 unique)
    Time used: 2.78
    GTX 460 405/1800 (3D clocks aren't getting triggered).
     
  13. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    I made a new version which supports multiple devices. This is sort of preliminary because I haven't done much testing yet. Although it's probably best to support multiple devices using multiple threads, I decided to use single thread instead because it's easier. However, the downside is that OpenCL 1.0 lacks some important functions for supporting multiple devices from a single host thread.

    Anyway, this is the test results on two GeForce GTX 460 (clocked @ 800MHz, which is nearly 20% faster than normal 460):

    Two devices:
    Code:
    N-Queen solver for OpenCL
    Ping-Che Chen
    
    Platform [0]: NVIDIA CUDA
    Select platform 0
    Using GPU device
    Device 0: GeForce GTX 460
            Using 5376 threads
            Block size = 256 threads
            Using global atomics
    Device 1: GeForce GTX 460
            Using 5376 threads
            Block size = 256 threads
            Using global atomics
    Profile time for device 0: 5417762176ns
    Profile time for device 1: 4899273472ns
    18-queen has 666090624 solutions (83263591 unique)
    Time used: 7.1s
    One device:

    Code:
    N-Queen solver for OpenCL
    Ping-Che Chen
    
    Platform [0]: NVIDIA CUDA
    Select platform 0
    Using GPU device
    Device 0: GeForce GTX 460
            Using 5376 threads
            Block size = 256 threads
            Using global atomics
    Profile time for device 0: 10313844128ns
    18-queen has 666090624 solutions (83263591 unique)
    Time used: 11.3s
    As you can see, although the time spent on the device is roughly the same (5.4+4.9 = 10.3s), but the wall clock time is not that good. Basically, on single device run, the "extra CPU time" is around 11.3 - 10.3 = 1.0s. However, on two devices run, the "extra CPU time" is around 7.1 - 5.4 = 1.7s. It varies a little from run to run but it's almost always at least 50% more.
     
  14. olao

    Newcomer

    Joined:
    Sep 5, 2010
    Messages:
    9
    Likes Received:
    0
    Hi, apologies if I missed some post. It seems that you launch fewer threads in each CTA when you use the vectorized version. Hope I'm with you so far. If this is the case, may the slower result not simply be due to the hardware not having enough independent wavefronts to cover for arithmetic latency?

    I'm not too familiar with the details of the AMD GPU, but by and large it seems pretty similar to the NVIDIA GPUs. For NVIDIA this could certainly be a problem, IIRC for the 200 series you needed at least 2 blocks at 2 warps each to keep a GPU core busy...

    Anyhow, just my 14.5öre.
     
  15. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    Yes, you're correct. It's to make sure that the size requirement of the local memory to be the same for both scalar version and the vectored version. The 2D vectorized version has 128 work items per work group so it should be enough to hide ALU latency on AMD's GPU.

    On NVIDIA's GPU the number of work items per work group is 256 (which is unfortunately the maximum number currently supported by this program). However, since the local memory usage is low enough (less than 16KB) so on 200 series GPU it can run two work groups on a single MP (the occupancy is 0.5 in this case). So it should be enough to competely hide the ALU latency.

    I think the problem of the vectorized version could be related to the fact that it has to process two (or four, in the case of 4D version) items in sync. Compared to the scalar version, it's twice as much (or four times as much). Since the execution time required for each item can be very different, the efficiency can be much worse. Especially when considering that AMD's GPU need to run 64 work items in sync (compared to NVIDIA's 32 work items).

    I'm still planning to do a "work item imbalance" analysis to verify this theory though :)
     
  16. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    Some 6870 results using the latest version of nqueen:

    Lol @ Buzzard?

    Results for the previous version of nqueen:

    and
     
    #136 neliz, Oct 31, 2010
    Last edited by a moderator: Oct 31, 2010
  17. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    Hey, Buzzards are nice birds :)

    Anyway, your test results seem to be inline with 5850. Now scalar version is faster than vectorized version, although there are probably still some tricks can be used to improve vectorized version.
     
  18. moozoo

    Newcomer

    Joined:
    Jul 23, 2010
    Messages:
    109
    Likes Received:
    1
    Did you ever release this version?
    Where can I get the latest?
    Have you thought of putting the source on sourceforge?
     
  19. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    This should always point to the latest released version.
    Since this is just a simple experiments of some ideas, I didn't think about putting this on sourceforge or other places, although the source code is always available.

    [EDIT] By the way, I tested it with the new ATI Stream SDK 2.3 on my Radeon 5850.

    17 queen with 2d vec: 2.72s
    17 queen with scalar: 2.76s

    These are with global atomics. It's much slower without global atomics.
     
  20. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,809
    Likes Received:
    484
    Location:
    Torquay, UK
    HD6970 stock, SDK 2.3

    17 queen with 2d vec: 1.8s
    17 queen with scalar: hang :sad:
    17 queen with nolocal: 1.33s [Edit: bugged score]
     
    #140 Lightman, Dec 20, 2010
    Last edited by a moderator: Dec 20, 2010
Loading...

Share This Page

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.
Loading...