Old 10-Jan-2010, 17:49   #1
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,485
Default N-Queen Solver for OpenCL

I rewrote my old n-Queen solver from CUDA into OpenCL. It's not a port, because I rewrote everything from scratch (only the idea remains the same). I was hoping to use OpenCL to compare AMD and NVIDIA hardware.

Unfortunately, for some reason, AMD's OpenCL compiler crashed when compiling my kernel for the GPU (it's OK for the CPU version, though). So right now it doesn't work on AMD's GPUs at all, but with AMD Stream SDK 2.0 it's possible to run on CPU devices.

Both source and executable are provided in the attachment.

The arguments are:

nqueen_cl -cpu -clcpu -p -thread # -platform # -local -noatomics -novec N

-cpu: use CPU implementation (single threaded)
-clcpu: use OpenCL CPU devices
-p: enable profiling
-thread #: set the number of threads (default: maximum work-group size * number of devices)
-platform #: select a platform (default: select platform 0)
-local: use local memory (shared memory) for arrays. This runs faster on NVIDIA's GPUs because NVIDIA GPUs have no indexed registers.
-noatomics: do not use global atomics (this is needed for now to avoid crash on AMD Stream SDK 2.0.1 when running scalar code)
-novec: do not use vectorization
N: the board size (1 ~ 32)

Note: large board sizes (> 20) take forever to run (I estimate that 20 queens would take more than 2 hours on a 9800GT).

The N-queen algorithm is a straightforward backtracking search. There are no special considerations to avoid redundant boards (i.e. most boards can be mirrored and rotated to create 8 solutions); only a simple mirror reduction is used.
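For illustration, here is a minimal single-threaded sketch of this kind of bitmask backtracking with the simple mirror reduction (plain C, not the attached source; it just shows the approach):

Code:
#include <stdio.h>

/* count completions of a partial board, given the occupied columns and the
   two diagonal masks for the current row */
static long long solve(unsigned cols, unsigned ldiag, unsigned rdiag, unsigned mask)
{
    if (cols == mask)                         /* a queen in every column: done */
        return 1;
    long long count = 0;
    unsigned avail = mask & ~(cols | ldiag | rdiag);
    while (avail) {
        unsigned bit = avail & (0u - avail);  /* lowest open column */
        avail -= bit;
        count += solve(cols | bit, (ldiag | bit) << 1, (rdiag | bit) >> 1, mask);
    }
    return count;
}

int main(void)
{
    int n = 16;                               /* board size */
    unsigned mask = (n >= 32) ? 0xffffffffu : (1u << n) - 1u;
    long long total = 0;

    /* simple mirror reduction: put the first-row queen only in the left half
       and double the count; for an odd board size the middle column is
       counted once, without doubling */
    for (int c = 0; c < n / 2; c++) {
        unsigned bit = 1u << c;
        total += 2 * solve(bit, bit << 1, bit >> 1, mask);
    }
    if (n & 1) {
        unsigned bit = 1u << (n / 2);
        total += solve(bit, bit << 1, bit >> 1, mask);
    }
    printf("%d-queen has %lld solutions\n", n, total);
    return 0;
}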

Running on my computer (Core 2 Duo E8400 3.0GHz):

16 queen -cpu: 7.27 s
16 queen -clcpu: 3.79 s

On GPU (GeForce 9800GT):

16 queen: 2.36 s
16 queen -local: 1.23 s


About the crash problem: the original kernel I developed (i.e. the kernel used by the -clcpu path) doesn't crash the compiler. However, it's extremely slow (> 20 s for 16 queens on my 4850, and > 7 s on the 9800GT) because it uses four arrays to simulate a stack for the recursion. These arrays are good for the CPU version because they reduce the amount of computation, but for the GPU they are too hard on registers. So I developed another kernel which uses only one array, but it requires more computation to generate/restore data at each step. However, this new kernel crashes the compiler (it works when selecting CPU devices, but it's slower there). I'm using the Cat 9.12 hotfix right now.
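For reference, a rough single-threaded C sketch of what "simulating the recursion with arrays" looks like. Like the original kernel, it keeps four small per-level arrays; the one-array kernel recomputes most of this state at every step instead:

Code:
#include <stdio.h>

/* iterative backtracking with an explicit per-level stack (four arrays) */
static long long solve_iterative(int n)
{
    unsigned mask = (n >= 32) ? 0xffffffffu : (1u << n) - 1u;
    unsigned cols[32], ldiag[32], rdiag[32], avail[32];
    long long count = 0;
    int level = 0;

    cols[0] = ldiag[0] = rdiag[0] = 0;
    avail[0] = mask;

    while (level >= 0) {
        unsigned open = avail[level] & ~(cols[level] | ldiag[level] | rdiag[level]);
        if (open == 0) {                  /* nothing left on this row: backtrack */
            level--;
            continue;
        }
        unsigned bit = open & (0u - open);
        avail[level] &= ~bit;             /* don't retry this column on return */

        if (level == n - 1) {             /* queen placed on the last row */
            count++;
            continue;
        }
        cols[level + 1]  = cols[level] | bit;          /* push the next level */
        ldiag[level + 1] = (ldiag[level] | bit) << 1;
        rdiag[level + 1] = (rdiag[level] | bit) >> 1;
        avail[level + 1] = mask;
        level++;
    }
    return count;
}

int main(void)
{
    printf("8-queen has %lld solutions\n", solve_iterative(8));
    return 0;
}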

[EDIT] I updated the CPU program with reduced symmetry which runs faster.
[EDIT2] OpenCL kernels are also with reduced symmetry now.
[EDIT3] Vectorized version for RV870 is added. (4/10/2010)
Attached Files
File Type: zip nqueen_cl.zip (18.3 KB, 200 views)
File Type: zip nqueen_cl_src.zip (28.6 KB, 135 views)

Last edited by pcchen; 11-Jan-2010 at 15:35. Reason: Update programs
pcchen is offline   Reply With Quote
Old 10-Jan-2010, 18:48   #2
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 3,008
Default

Intel Q9450 @ 3608MHz:
Code:
nqueen_cl.exe -clcpu 16
N-Queen solver for OpenCL
Ping-Che Chen
Platform [0]: ATI Stream
Select platform 0
Using CPU device
Using 4096 threads
16-queen has 14772512 solutions
Time used: 1.64s

Running on the GPU device target, the program gives me an error in the aticaldd.dll module.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
fellix is offline   Reply With Quote
Old 10-Jan-2010, 19:15   #3
Davros
Darlek ******
 
Join Date: Jun 2004
Posts: 11,041
Default

Why does the CPU version use only one thread?
__________________
Guardian of the Bodacious Three Terabytes of Gaming Goodness™
Davros is offline   Reply With Quote
Old 10-Jan-2010, 23:38   #4
mhouston
A little of this and that
 
Join Date: Oct 2005
Location: Cupertino
Posts: 342
Default

We'll take a look at the compiler issues.
mhouston is offline   Reply With Quote
Old 11-Jan-2010, 00:45   #5
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,485
Default

Quote:
Originally Posted by Davros View Post
why does the cpu version use only one thread ?
I had a multi-threaded version in my old CUDA code, but it's not portable (it uses the Win32 API for managing threads). Also, I think that with CPU support, OpenCL can reduce or remove the need to write your own multi-threaded code for the CPU.

Quote:
Originally Posted by mhouston
We'll take a look at the compiler issues.
Thanks, I'm eager to know how this will run on my 4850.
pcchen is offline   Reply With Quote
Old 11-Jan-2010, 01:01   #6
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,947
Default

Cool, did you compare the CUDA and OpenCL versions or has the algorithm changed so much that it's not apples to apples?
__________________
What the deuce!?
trinibwoy is offline   Reply With Quote
Old 11-Jan-2010, 04:21   #7
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,485
Default

Quote:
Originally Posted by trinibwoy View Post
Cool, did you compare the CUDA and OpenCL versions or has the algorithm changed so much that it's not apples to apples?
The CUDA version initially used 4 arrays, so it limited the number of threads to 96 per SM (because the amount of shared memory is limited). I had written a newer version which IIRC reduced the number of arrays to 2 so it could use more threads. I don't have that newer version right now, but compared to the older 4-array version, the OpenCL version is a little faster (1.24 s vs 1.328 s on an 8800GT).

Note that the major benefit of using more threads is hiding latency. Since this program does not access global memory frequently, it doesn't have to hide a massive amount of latency. Basically it only has to have enough threads to hide ALU latency, which IIRC on NVIDIA's GPUs is 192 threads per work group. Using fewer arrays to enable more threads requires additional computation, so if the benefit of hiding ALU latency does not outweigh the additional computation, it's not going to be faster.

The performance of this program is probably limited primarily by warp serialization due to branch divergence. This is an example of a "not-so-suitable-for-GPU" type of parallelizable workload.
pcchen is offline   Reply With Quote
Old 11-Jan-2010, 15:35   #8
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,485
Default

I checked my old CUDA version. It uses only one array for the stack, but the stored data is different (although there shouldn't be too much difference in computation). However, the old CUDA version also stores results in shared memory and performs a reduction when the kernel completes, so it can only have 128 threads rather than 256, and is therefore still slower. I guess doing the reduction on the GPU was a bad decision, because the CPU basically sits idle while the GPU is doing all the work.

On my GTX 285, the CUDA version takes 0.686 seconds to run (16 queens), while the OpenCL version takes 0.643 seconds. Then I noticed that in my old CUDA version the array indexing is reversed to avoid bank conflicts, so I did the same, and it gets a bit faster, at 0.537 seconds.
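For illustration, here is a hedged OpenCL sketch of the reversed-indexing idea (the kernel name, stack depth, and group size are made up; this is not the shipped kernel):

Code:
#define LEVELS      16     /* made-up stack depth */
#define GROUP_SIZE  64     /* assumed work-group size */

__kernel __attribute__((reqd_work_group_size(GROUP_SIZE, 1, 1)))
void reversed_indexing_demo(__global uint* out)
{
    /* [level][thread] instead of [thread][level]: work items with consecutive
       local ids touch consecutive 32-bit words, so a half-warp spreads over
       all 16 banks instead of piling onto one */
    __local uint stack[LEVELS][GROUP_SIZE];

    uint tid = (uint)get_local_id(0);
    uint acc = 0;
    for (uint level = 0; level < LEVELS; level++) {
        stack[level][tid] = level ^ tid;
        acc += stack[level][tid];
    }
    out[get_global_id(0)] = acc;
}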

The OpenCL CPU version takes 1.76 seconds to run on my Core i7 920. The single-threaded CPU version takes 7.58 seconds. The number of threads for the OpenCL CPU version is a bit too high and may reduce performance, so I reduced it, but it's not really much faster (from 1.76 seconds to 1.68 seconds).

I updated the source code and executables in the attachments.
pcchen is offline   Reply With Quote
Old 15-Jan-2010, 19:46   #9
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,485
Default

I made a CPU version which tries to avoid redundant patterns. The original version only avoids horizontal mirroring, i.e.

Code:
- - - - x     x - - - -
- x - - -     - - - x - 
- - - x -     - x - - -
x - - - -     - - - - x
- - x - -     - - x - -
are the same solution. This reduction is easy to do because no solution is its own horizontal mirror, so nothing gets counted twice.
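As a quick check, the two boards above really are a horizontal mirror pair (a small standalone C snippet; boards are stored as the queen's column per row):

Code:
#include <stdio.h>

int main(void)
{
    int n = 5;
    int a[] = {4, 1, 3, 0, 2};              /* left board above  */
    int b[] = {0, 3, 1, 4, 2};              /* right board above */
    int same = 1;
    for (int row = 0; row < n; row++)
        if (b[row] != (n - 1) - a[row])     /* the mirror maps column c to n-1-c */
            same = 0;
    printf(same ? "mirror pair\n" : "not a mirror pair\n");
    return 0;
}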

However, there are actually 8 possible transformations, i.e. 4 rotations and the mirror of each rotation. So if it were possible to generate only "unique" solutions, the speed could increase by up to a factor of 4. Unfortunately, it's not that simple, because some solutions are symmetric. For example:

Code:
- - x -
x - - -
- - - x
- x - -
is symmetric: it produces only 2 distinct solutions rather than 8, because rotating it by 90 degrees gives back the same solution. There are also solutions that are only symmetric under a 180-degree rotation:

Code:
- - - - - x -
- - x - - - -
x - - - - - -
- - - x - - -
- - - - - - x
- - - - x - -
- x - - - - -
These solutions produce only 4 distinct solutions. To avoid generating redundant solutions, there are two different cases to handle. The first case is when there's a piece in the corner:

Code:
- - - - - - x
- - - - - - - 
- - - - - - - 
- - - - - - -
- - - - - - - 
- - - - - - - 
- - - - - - -
In this case, there is no possibility of rotational symmetry, because any rotation moves the corner piece to a different corner. There is also no possibility of mirroring symmetry, because if you mirror along the diagonal axis, every piece would be moved to an empty square (otherwise the two pieces could capture each other). However, there is still a risk of generating both solutions of a mirror pair:

Code:
- - - - - - x     - - - - - - x
- - - - x - -     - - x - - - -
- - x - - - -     - - - - - x -
x - - - - - -     - x - - - - -
- - - - - x -     - - - - x - -
- - - x - - -     x - - - - - -
- x - - - - -     - - - x - - -
This can be avoided by using the "kill the second column" trick. That is, once the position of the piece on the second row is decided, all squares in the second column higher than that position are marked as used:

Code:
- - - - - - x
- - x - - - -
- - - - - # -
- - - - - # -
- - - - - # -
- - - - - - -
- - - - - - -
The other cases are more complicated, though. To avoid generating redundant rotations and mirrors, when the position of the piece on the first row is decided, these positions are marked as used:

Code:
# # - - x # #
# - - - - - #
- - - - - - -
- - - - - - -
- - - - - - -
# - - - - - #
# # - - - # #
However, even with these restrictions, it's still possible to generate redundant solutions. For example:

Code:
- - - - - x - -     - - - - - x - -
- - x - - - - -     - x - - - - - -
x - - - - - - -     - - - - - - x -
- - - - - - x -     x - - - - - - -
- - - - x - - -     - - - x - - - -
- - - - - - - x     - - - - - - - x
- x - - - - - -     - - - - x - - -
- - - x - - - -     - - x - - - - -
These two solutions are "the same", but they will be generated separately. To detect this and avoid counting twice, every solution is transformed in all 8 ways and only the "smallest" variant is counted, so each equivalence class is counted exactly once. The same check is used for counting symmetry: if a solution is unchanged when rotated 90 degrees, the symmetry is 2; if it's unchanged only when rotated 180 degrees, the symmetry is 4; otherwise, it's 8.
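Here is a standalone C illustration of the "keep only the smallest transform" check and the 2/4/8 symmetry counting (this is not the solver's code; the board used is the 180-degree symmetric 7-queen example above):

Code:
#include <stdio.h>
#include <string.h>

#define N 7   /* size of the 180-degree symmetric example above */

/* a queen at (row, col) moves to (col, N-1-row) under a 90-degree rotation */
static void rotate90(const int in[N], int out[N])
{
    for (int row = 0; row < N; row++)
        out[in[row]] = N - 1 - row;
}

/* horizontal mirror: column c maps to N-1-c */
static void mirror(const int in[N], int out[N])
{
    for (int row = 0; row < N; row++)
        out[row] = N - 1 - in[row];
}

/* lexicographic comparison of two boards */
static int board_cmp(const int a[N], const int b[N])
{
    for (int i = 0; i < N; i++)
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    return 0;
}

int main(void)
{
    int q[N] = {5, 2, 0, 3, 6, 4, 1};   /* the 180-degree symmetric solution above */

    /* build all 8 transforms: q, r90, r180, r270 and the mirror of each */
    int t[8][N];
    memcpy(t[0], q, sizeof q);
    for (int i = 1; i < 4; i++)
        rotate90(t[i - 1], t[i]);
    for (int i = 0; i < 4; i++)
        mirror(t[i], t[i + 4]);

    /* symmetry class: 2 if a 90-degree rotation maps it onto itself,
       4 if only the 180-degree rotation does, 8 otherwise */
    int symmetry = 8;
    if (board_cmp(t[2], q) == 0) symmetry = 4;
    if (board_cmp(t[1], q) == 0) symmetry = 2;

    /* count this solution only if it is the lexicographically smallest of
       its 8 transforms; the total count then increases by `symmetry` */
    int smallest = 1;
    for (int i = 1; i < 8; i++)
        if (board_cmp(t[i], q) < 0)
            smallest = 0;

    printf("symmetry = %d, counted here: %s\n", symmetry, smallest ? "yes" : "no");
    return 0;
}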

By doing these "optimizations" the CPU implementation runs a bit faster. Computing 16-queen on my Core 2 3.0GHz now takes 4.64 seconds (instead of the previous 7.27 seconds, about 57% faster), and it computes both the total number of solutions and the number of unique solutions (16-queen has 1846955 unique solutions).
pcchen is offline   Reply With Quote
Old 16-Jan-2010, 07:59   #10
Simon F
Tea maker
 
Join Date: Feb 2002
Location: In the Island of Sodor, where the steam trains lie
Posts: 4,444
Default

Nice bit of analysis.
Simon F is offline   Reply With Quote
Old 16-Jan-2010, 18:33   #11
prunedtree
Regular
 
Join Date: Aug 2009
Posts: 27
Default

For fun, I tried to see what kind of "performance" vs "programming time" ratio one could typically achieve on this problem!

Reading various info on the internet
== time: 1h / performance: no code yet

First attempt, single-threaded (inspired by http://www.ic-net.or.jp/home/takaken/e/queen/index.html)
== time: 1h 10 minutes / performance: 16 queens in 16 sec

Removing basic symmetry
== time: 1h 15 minutes / performance: 16 queens in 8 sec

Multi-threading (very naive, just one thread for each of the first branches)
== time: 1h 30 minutes / performance: 16 queens in 2 sec

Playing around, found false sharing
== time: 1h 45 minutes / performance: 16 queens in 1.75 sec

Some ricing: one-level unroll (28% faster) and tail function (7% faster)
== time: 2h / performance: 16 queens in 1.27 sec

The CPU used is an i7 920 (2.66 GHz).

It's interesting that the multi-threading speedup is superlinear (4.9x) thanks to SMT (hyperthreading).

No attempt was made at more complex symmetry reduction, although if I understand right it should come nearly free in terms of backtracking throughput. No attempt was made at turning the recursion into an iteration either (my 1-level unroll is just a copy/paste ^^).

This naive version of 16 queens has a 570595151-node search tree, which it processes at a rate of 450 million nodes per second.
prunedtree is offline   Reply With Quote
Old 16-Jan-2010, 19:16   #12
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,485
Default

Ah, it's similar to the CPU version of my old CUDA code. The old CPU version also only uses mirror reduction and is multi-threaded using the Win32 API.

I became interested in reducing more symmetry after thinking a lot about this. The ideas in my previous post are actually pretty similar to the page you linked. I think it's possible to make the GPU run faster with reduced symmetry, but the more complex symmetry reduction may bring even less performance gain on the GPU than on the CPU, because GPUs really don't like branchy code.

The superlinear speedup with hyperthreading is probably also related to the branchy nature of this problem, as hyperthreading may help hide some of the latency better.

However, when I did the old CUDA version I was pretty impressed by the GPU's ability to run this kind of code. I had assumed that a GPU would be pretty bad at this kind of problem and would lag behind a comparable CPU, but I was wrong.

Unfortunately, my P6T motherboard is dead, and it may take some time to fix it, so I can't update the programs on my GTX 285 for a while. I can use my home computer only on weekends.
pcchen is offline   Reply With Quote
Old 17-Jan-2010, 10:40   #13
neliz
MSI Man
 
Join Date: Mar 2005
Location: In the know
Posts: 4,899
Default

nqueen -cpu 6.64s
nqueen -clcpu 3.85s

GPU crashes even with the Cat 10.1 beta.
__________________
I miss you CJ, 1976 - 2010
neliz is offline   Reply With Quote
Old 17-Jan-2010, 17:02   #14
Lightman
Senior Member
 
Join Date: Jun 2008
Location: Torquay, UK
Posts: 1,150
Default

My turn:
-cpu 16 = 3.75s
-clcpu 16 = 1.51s

Crashes using the GPU.

Phenom II 940@3455MHz
Lightman is offline   Reply With Quote
Old 17-Jan-2010, 17:11   #15
mhouston
A little of this and that
 
Join Date: Oct 2005
Location: Cupertino
Posts: 342
Default

Yep, we have confirmed that this code causes LLVM to generate irreducible control flow, which makes our compiler unhappy. We already have an internal bug tracked for this.
mhouston is offline   Reply With Quote
Old 17-Jan-2010, 18:30   #16
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 3,008
Default

Q9450 @ 3608MHz

-cpu 16: 3.88s
-clcpu 16: 1.81s

I can't believe PII is faster here...
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
fellix is offline   Reply With Quote
Old 17-Jan-2010, 20:19   #17
Lightman
Senior Member
 
Join Date: Jun 2008
Location: Torquay, UK
Posts: 1,150
Default

Quote:
Originally Posted by fellix View Post
Q9450 @ 3608MHz

-cpu 16: 3.88s
-clcpu 16: 1.81s

I can't believe PII is faster here...
Would 3 complex decoders help here?
Or larger L1 cache?

I was pleasantly surprised as well

@mhouston:
Good to know this will be sorted!
Lightman is offline   Reply With Quote
Old 18-Jan-2010, 13:22   #18
entity279
Member
 
Join Date: May 2008
Location: Romania
Posts: 436
Default

Quote:
Originally Posted by fellix View Post
Q9450 @ 3608MHz
I can't believe PII is faster here...
Also had a classifier implemented on CPUs, and it ran on my X2 3800+ (@2400) as fast as it did on a stock Q6600 (single threaded of course, with the main CPU-time-consuming loop unrolled). It mostly multiplied floats and every now and then extracted square roots.

LE: the number of unrolls was tuned on both CPUs, so it didn't favor a specific architecture.

Last edited by entity279; 18-Jan-2010 at 13:26. Reason: Added details
entity279 is online now   Reply With Quote
Old 04-Feb-2010, 04:44   #19
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,485
Default

I added the symmetry reduction algorithms to the OpenCL version, so it computes unique solutions too. The CPU version is also optimized a little bit.

On my Core i7 920 + GeForce GTX 285:

-cpu 16: 4.3s
-clcpu 16: 1.38s
16 (gpu): 1.09s
-local 16 (gpu): 0.492s (0.542s in old version)

-cpu 17: 29.6s
-clcpu 17: 8.96s
17 (gpu): 4.05s
-local 17 (gpu): 2.51s (4.36s in old version)

-cpu 18: 216s
-clcpu 18: 60.1s
18 (gpu): 32.8s
-local 18 (gpu): 24.5s (42.3s in old version)

Unfortunately, this program still can't run on ATI's GPU.
Another interesting thing: the GPU versions seem to scale worse than the CPU versions, so in theory, for some very large n, the CPU could be faster than the GPU.
pcchen is offline   Reply With Quote
Old 04-Feb-2010, 18:42   #20
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,485
Default

I did some rather trivial optimizations for GPU to make it a little faster, mainly the symmetry check part.

Running on GeForce GTX 285:

16 (gpu): 1.09s
-local 16 (gpu): 0.5s

17 (gpu): 4.01s
-local 17 (gpu): 2.55s

18 (gpu): 28.3s
-local 18 (gpu): 19.2s
pcchen is offline   Reply With Quote
Old 04-Feb-2010, 23:51   #21
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,333
Default

Just tested this app with a beta of the next Stream SDK release and it works. Performance could be improved if you can find a way to vectorize your algorithm. The compiled code has lots of flow control with little work (300 clauses, 425 ALU instructions on a 5870).
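For what it's worth, here is a hedged OpenCL sketch of the kind of vectorization being suggested: hold four independent partial boards per work item in uint4 registers so the mask arithmetic runs 4-wide. The kernel and buffer names are made up, and the branchy backtracking around this remains the hard part:

Code:
__kernel void open_columns_x4(__global const uint4* cols,
                              __global const uint4* ldiag,
                              __global const uint4* rdiag,
                              __global uint4* open,
                              const uint board_mask)
{
    uint gid = (uint)get_global_id(0);
    /* columns still available on the next row, for 4 boards at once */
    open[gid] = (uint4)(board_mask) & ~(cols[gid] | ldiag[gid] | rdiag[gid]);
}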
__________________
I speak only for myself.
OpenGL guy is offline   Reply With Quote
Old 05-Feb-2010, 04:36   #22
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,485
Default

Quote:
Originally Posted by OpenGL guy View Post
Just tested this app with a beta of the next Stream SDK release and it works. Performance could be improved if you can find a way to vectorize your algorithm The compiled code has lots of flow control with little work (300 clauses, 425 ALU instructions on 5870).
Wow, that's great! I have been thinking about how to vectorize the code; it's probably not easy, but I think it's at least possible.
pcchen is offline   Reply With Quote
Old 05-Feb-2010, 05:32   #23
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,485
Default

I found a bug which may prevent it from running with -local on G8X/G9X GPUs. A workaround is to set the thread number to a multiple of 192 (such as 5376). I'll post updates later.
pcchen is offline   Reply With Quote
Old 13-Feb-2010, 00:04   #24
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,485
Default

This works with the latest AMD Stream SDK 2.0.1 now. On my Radeon 4850 it takes 4.61 seconds to run 16 queens, and 3.79 seconds with the -local version. This is interesting because AFAIK the Radeon 4850 does not support real local memory in OpenCL. However, it's possible that using local memory forces the compiler to use memory instead of registers, which reduces register pressure.
pcchen is offline   Reply With Quote
Old 13-Feb-2010, 01:16   #25
mhouston
A little of this and that
 
Join Date: Oct 2005
Location: Cupertino
Posts: 342
Default

On Windows you can use the profiler to dump the generated ISA and look at the differences.
mhouston is offline   Reply With Quote
