View Full Version : N-Queen Solver for OpenCL
I rewrote my old n-Queen solver from CUDA into OpenCL. It's not a port, because I rewrite everything from scratch (only the idea remains the same). I was hoping to use OpenCL for comparison between AMD and NVIDIA hardwares.
Unfortunately, for some reason, AMD's OpenCL compiler crashed when compiling my kernel for GPU (it's ok for CPU version though). So right now it doesn't work on AMD's GPU at all, but with AMD Stream SDK 2.0 it's possible to run on CPU devices.
Both source and executable are provided in the attachment.
The arguments are:
nqueen_cl -cpu -clcpu -p -thread # -platform # -local N
-cpu: use CPU implementation (single threaded)
-clcpu: use OpenCL CPU devices
-p: enable profiling
-thread #: set the number of threads. (default: max work group item number * number of devices)
-platform #: select a platform (default: select platform 0)
-local: use local memory (shared memory) for arrays. This runs faster on NVIDIA's GPU because NVIDIA GPU have no indexed registers.
-noatomics: do not use global atomics (this is needed for now to avoid crash on AMD Stream SDK 2.0.1 when running scalar code)
-novec: do not use vectorization
N: the board size (1 ~ 32)
Note: large board size (> 20) takes forever to run (I estimate that 20 queen would take more than 2 hours to run on 9800GT).
The N-queen algorithm is a straightforward one. There is no special considerations to avoid redundant boards (i.e. most boards can be mirrored and rotated to create 8 solutions). Only a simple mirror reduction is used.
Running on my computer (Core 2 Duo E8400 3.0GHz):
16 queen -cpu: 7.27 s
16 queen -clcpu: 3.79 s
On GPU (GeForce 9800GT):
16 queen: 2.36 s
16 queen -local: 1.23 s
About the crash problem: the original kernel I've developed (i.e. the kernel used by clcpu path) doesn't crash the compiler. However, it's extremely slow (> 20 s for 16 queen on my 4850, and > 7s on 9800GT) because it uses a four arrays to simulate a stack for recursion. These arrays are good for CPU version because they reduce the amount of computation, but for GPU they are too hard on registers. So I developed another kernel which uses only one array, but it requires more computation to generate/restore data for each steps. However, this new kernel crashes the compiler (it works when selecting CPU devices, but it's slower). I'm using Cat 9.12 hotfix right now.
[EDIT] I updated the CPU program with reduced symmetry which runs faster.
[EDIT2] OpenCL kernels are also with reduced symmetry now.
[EDIT3] Vectorized version for RV870 is added. (4/10/2010)
Intel Q9450 @ 3608MHz:
nqueen_cl.exe -clcpu 16
N-Queen solver for OpenCL
Ping-Che Chen
Platform [0]: ATI Stream
Select platform 0
Using CPU device
Using 4096 threads
16-queen has 14772512 solutions
Time used: 1.64s
Running on GPU device target, the program gives me an error in aticaldd.dll module.
why does the cpu version use only one thread ?
mhouston
10-Jan-2010, 23:38
We'll take a look at the compiler issues.
why does the cpu version use only one thread ?
I had a multi-threaded version in my old CUDA code, but that's not portable (it uses Win32 API for managing threads). Also, I think with CPU support OpenCL can reduce or remove the requirement of writing own multi-threaded codes for CPU :)
We'll take a look at the compiler issues.
Thanks, I'm eager to know how this will run on my 4850 :)
trinibwoy
11-Jan-2010, 01:01
Cool, did you compare the CUDA and OpenCL versions or has the algorithm changed so much that it's not apples to apples?
Cool, did you compare the CUDA and OpenCL versions or has the algorithm changed so much that it's not apples to apples?
The CUDA version initially use 4 arrays so it limited the number of threads to 96 per SM (because the amount of shared memory is limited). I had written a newer version which IIRC reduced the number of arrays to 2 so it can use more threads. I don't have the newer version right now, but compared to the older 4 arrays version, OpenCL version is a little faster (1.24s vs 1.328s on a 8800GT).
Note that the major benefit of using more threads is hiding latency. Since this program does not access global memory frequently, it doesn't have to hide a massive amount of latency. Basically it only has to have enough threads to hide ALU latency, which IIRC on NVIDIA's GPU is 192 threads per work group. By using less arrays to enable using more threads, additional computation is required, so if the benefit of hiding ALU latency does not out weight the additional computations, it's not going to be faster.
The performance of this program is probably limited primarily by warp serialization due to branch divergence. This is an example of "not-so-suitable-to-GPU" type parallelizable workload :)
I checked my old CUDA version. It uses only one array for stack, but the stored data is different (although there shouldn't be too much difference in computation). However, the old CUDA version also stores results in shared memory and performs reduction when the kernel completed, so it can only have 128 threads rather than 256 threads, therefore it's still slower. I guess it's a bad decision to do reduction in the GPU because CPU is basically sit idling when the GPU is doing all the works.
On my GTX 285, the CUDA version takes 0.686 second to run (16 queen), while the OpenCL version takes 0.643 second. Then I found that in my old CUDA version, the array indexing is reversed to avoid bank conflict, so I did the same, and it becomes a bit faster to 0.537 second.
The OpenCL CPU version takes 1.76 second to run on my Core i7 920. The single threaded CPU version takes 7.58 seconds. The number of threads for the OpenCL CPU version is a bit too much and may reduce performance, so I reduced the amount of threads for the CPU version, but it's not really much faster (from 1.76 second to 1.68 second).
I updated the source code and executables in the attachments.
I made a CPU version which tries to avoid redundant patterns. The original version only avoid horizontal mirroring, i.e.
- - - - x x - - - -
- x - - - - - - x -
- - - x - - x - - -
x - - - - - - - - x
- - x - - - - x - -
are the same. It's easy to do so because there is no possibility of generating the same answer after mirroring.
However, there are actually 8 possible transformations, i.e. 4 rotations and mirroring of each rotations. So if it's possible to generate only "unique" solutions, then it's possible to increase speed by a factor of 4. Unfortunately, it's not that simple, because it's possible for some solutions to be symmetric. For example:
- - x -
x - - -
- - - x
- x - -
is symmetric, which only produces 2 non-unique solutions, rather than 8. That's because if you rotate this solution by 90 degree it's actually the same. There are also symmetries which is 180 degree:
- - - - - x -
- - x - - - -
x - - - - - -
- - - x - - -
- - - - - - x
- - - - x - -
- x - - - - -
These solutions only produce 4 non-unique solutions. In order to avoid generating redundant solutions, there are two different cases. The first case is when there's a piece at the corner:
- - - - - - x
- - - - - - -
- - - - - - -
- - - - - - -
- - - - - - -
- - - - - - -
- - - - - - -
In this case, there is no possibility of any kind of symmetry, because any rotation moves the corner piece to a different corner. There is also no possibility of mirroring symmetry, because if you mirror along the diagonal axis, every piece will be moved to an empty place (otherwise the two pieces can capture each other). However, there is a risk of generating the mirrored solutions twice:
- - - - - - x - - - - - - x
- - - - x - - - - x - - - -
- - x - - - - - - - - - x -
x - - - - - - - x - - - - -
- - - - - x - - - - - x - -
- - - x - - - x - - - - - -
- x - - - - - - - - x - - -
This can be avoided by using the "kill the second column" trick. That is, when the position of the piece on the second row is decided, all places on the second column higher than that position are marked as used:
- - - - - - x
- - x - - - -
- - - - - # -
- - - - - # -
- - - - - # -
- - - - - - -
- - - - - - -
Other cases are more complicated though. To avoid generating redundant rotation and mirroring, when the position of the piece on the first row is decided, these positions are marked as used:
# # - - x # #
# - - - - - #
- - - - - - -
- - - - - - -
- - - - - - -
# - - - - - #
# # - - - # #
However, even with these, it's still possible to generate redundant solutions. For example:
- - - - - x - - - - - - - x - -
- - x - - - - - - x - - - - - -
x - - - - - - - - - - - - - x -
- - - - - - x - x - - - - - - -
- - - - x - - - - - - x - - - -
- - - - - - - x - - - - - - - x
- x - - - - - - - - - - x - - -
- - - x - - - - - - x - - - - -
These two solutions are "the same" but they will be generated separately. To know about this and avoid counting twice, every solution has to be rotated and pick only the "smallest" solution. This way, for every solutions only the smallest one will be counted. This is also used for counting symmetry: if a solution is the same when rotated 90 degrees, the symmetry is 2. If it's the same only when rotated 180 degrees, the symmetry is 4. Otherwise, it's 8.
By doing these "optimizations" the CPU implementation runs a bit faster. Computing 16-queen on my Core 2 3.0GHz now takes 4.64 seconds (instead of previous 7.27 seconds, that's 57% faster), and it computes both total number of solutions and unique solutions (16-queen has 1846955 unique solutions).
Simon F
16-Jan-2010, 07:59
Nice bit of analysis.
prunedtree
16-Jan-2010, 18:33
For fun, I tried to see what kind of `performance' vs `programming time' ratio one could typically achieve on this problem !
Reading various info on internet
== time: 1h / performance: no code yet
First attempt, single-threaded, (inspired from http://www.ic-net.or.jp/home/takaken/e/queen/index.html)
== time: 1h 10 minutes / performance: 16 queens in 16 sec
Removing basic symmetry
== time: 1h 15 minutes / performance: 16 queens in 8 sec
Multi-threading (very naive, just one thread for each of the first branches)
== time: 1h 30 minutes / performance: 16 queens in 2 sec
Playing around, found false sharing
== time: 1h 45 minutes / performance: 16 queens in 1.75 sec
Some ricing, one level unroll (28% faster) and tail function (7% faster)
== time: 2h / performance: 16 queens in 1.27 sec
The CPU used is a i7 920 (2.66 Ghz)
It's interesting that the multi-threading speedup is superlinear (4.9x) thanks to SMT (hyperthreading).
No attempt was made at more complex symmetry reduction, although if I understand right it should come nearly free in terms of the backtracking throughput. No attempt was made at turning the recursion into an iteration either (my 1-level unroll is a just copy/paste ^^)
This naive version of 16 queens has a 570595151 nodes search tree, which it processes at a rate at 450 million nodes per second.
Ah, it's similar to the CPU version of my old CUDA code. The old CPU version is also only mirrored and multi-threaded using Win32 API.
I becomes interested in reducing more symmetry after thinking a lot about this. The ideas in my previous post is actually pretty similar to the page you linked. I think it's possible to make GPU run faster with reduced symmetry, but it seems that the more complex symmetry reduction may bring even less performance gain on GPU than on CPU, because GPU really don't like branchy codes :)
The reason why it has superlinear sppedup with hyperthreading is probably also related to the branchy nature of this problem, as hyperthreading may help hiding some latency better.
However, when I did the old CUDA version I was pretty impressed by the ability of GPU on running this kind of codes. I was assuming that GPU should be pretty bad at this kind of problems and will lag behind a similar level CPU, but I was wrong.
Unfortunately, my P6T motherboard is dead, and it may take some time to fix it. So now I can't update programs on my GTX 285 for a while. I can use my home computer only on weekend.
nqueen -cpu 6.64s
nqueen -clcpu 3.85s
gpu crashes even with Cat 10.1 beta
Lightman
17-Jan-2010, 17:02
My turn:
-cpu 16 = 3.75s
-clcpu 16 = 1.51s
crash using GPU :cry:
Phenom II 940@3455MHz
mhouston
17-Jan-2010, 17:11
Yep, we have confirmed that this code causes LLVM to generate irreducible control flow which makes our compiler unhappy. We have an internal bugtrack for this already.
Q9450 @ 3608MHz
-cpu 16: 3.88s
-clcpu 16: 1.81s
I can't believe PII is faster here... :runaway:
Lightman
17-Jan-2010, 20:19
Q9450 @ 3608MHz
-cpu 16: 3.88s
-clcpu 16: 1.81s
I can't believe PII is faster here... :runaway:
Would 3 complex decoders help here?
Or larger L1 cache?
I was pleasantly surprised as well :lol:
@mhouston: (http://forum.beyond3d.com/member.php?u=6907)
Good to know this will be sorted!
entity279
18-Jan-2010, 13:22
Q9450 @ 3608MHz
I can't believe PII is faster here... :runaway:
Also had a classifier implemented on CPUs and it ran on my X2 3800+ (@2400) as fast as it did on a Q6600 stock (single threaded ofcorse, with main CPU-time-user loop unrolled). It mostly multiplied floats and every now and then it extracted square roots.
LE: and the number of unrolls was tested on both CPUs so it didn't favor a specific architecture.
I added the symmetry reduction algorithms to the OpenCL version, so it computes unique solutions too. The CPU version is also optimized a little bit.
On my Core i7 920 + GeForce GTX 285:
-cpu 16: 4.3s
-clcpu 16: 1.38s
16 (gpu): 1.09s
-local 16 (gpu): 0.492s (0.542s in old version)
-cpu 17: 29.6s
-clcpu 17: 8.96s
17 (gpu): 4.05s
-local 17 (gpu): 2.51s (4.36s in old version)
-cpu 18: 216s
-clcpu 18: 60.1s
18 (gpu): 32.8s
-local 18 (gpu): 24.5s (42.3s in old version)
Unfortunately, this program still can't run on ATI's GPU.
Another interesting thing is, it seems that the GPU versions scale worse than CPU versions. So in theory for some very big n CPU could be faster than GPU.
I did some rather trivial optimizations for GPU to make it a little faster, mainly the symmetry check part.
Running on GeForce GTX 285:
16 (gpu): 1.09s
-local 16 (gpu): 0.5s
17 (gpu): 4.01s
-local 17 (gpu): 2.55s
18 (gpu): 28.3s
-local 18 (gpu): 19.2s
OpenGL guy
04-Feb-2010, 23:51
Just tested this app with a beta of the next Stream SDK release and it works. Performance could be improved if you can find a way to vectorize your algorithm :) The compiled code has lots of flow control with little work (300 clauses, 425 ALU instructions on 5870).
Just tested this app with a beta of the next Stream SDK release and it works. Performance could be improved if you can find a way to vectorize your algorithm :) The compiled code has lots of flow control with little work (300 clauses, 425 ALU instructions on 5870).
Wow, that's great! I has been thinking about how to vectorize the code, although it's probably not easy but I think at least it's possible. :)
I found a bug which may prevent it to run with -local on G8X/G9X GPUs. A workaround is to specify thread numbers to multiple of 192 (such as 5376). I'll post updates later.
This works with the latest AMD Stream SDK 2.0.1 now. On my Radeon 4850 it takes 4.61 seconds to run 16 queen, and 3.79 seconds to run -local version. This is interesting because AFAIK Radeon 4850 does not support real local memory in OpenCL. However, it's possible that using local memory forces the compiler to use memory instead of registers so it reduces pressure on registers.
mhouston
13-Feb-2010, 01:16
On windows you can use the profiler to dump the generated ISA and you can look at the differences.
Lightman
13-Feb-2010, 14:38
This works with the latest AMD Stream SDK 2.0.1 now. On my Radeon 4850 it takes 4.61 seconds to run 16 queen, and 3.79 seconds to run -local version. This is interesting because AFAIK Radeon 4850 does not support real local memory in OpenCL. However, it's possible that using local memory forces the compiler to use memory instead of registers so it reduces pressure on registers.
On stock HD5870 it takes:
2.27s for 16 queen
0.85s for 16 queen -local
Just for fun on HD5870@1130/1248:
2.07s for 16 queen
0.67s for 16 queen -local
66.0s for 18 queen -local
EDIT!
The above was on old version!
This is on the new one:
On stock HD5870 it takes:
3.40s for 16 queen
0.77 for 16 queen -local
Just for fun on HD5870@1130/1248:
2.80s for 16 queen
0.59s for 16 queen -local
30.50s for 18 queen -local
New version is a lot slower than old not using local memory and quite a bit faster using it? Why is that?
The new version uses a more complex algorithm to remove more symmetry. I'm not sure why using local is faster, though. I used the Stream KernelAnalyzer to compare the two kernels, but the local version seems to be using more registers than normal version, which is pretty odd.
I think the number of work items per work group is probably also an important factor. I plan to make a new option so it's possible to set the number of work items per work group (currently it's only possible to set the number of total work items).
If a vectorized kernel is faster (I'm still thinking about it), I think it's possible that Radeon 5870/5850 to be faster than GTX 285.
psychocoder
27-Feb-2010, 18:57
Hi,
how long your opencl code need to calculate full 19 queens and 20 queens problem??
thank you
I have been experimenting with some optimization ideas. One is vectorization, another one is to reduce imbalance. That is, since each work item may take different time to run (some work items may end pretty soon while others may take a long time to finish), it creates an imbalance where some work items have to wait for other work items to finish, which is not very efficient. Furthermore, this problem is worse as the number of work items grows larger. However, the number of work items still has to be large enough to fully utilize the GPU, so it's a problem.
I made a simple modification to try reducing imbalance, by using a "global index" with an atomic add for each work item to get next work instead of just sit idle. Of course, this requires a GPU with support for atomic operations (i.e. cl_khr_global_int32_base_atomics).
Here are some prelimeary results (with GeForce 8800GT) :
New -local 17: 4.41s
Old -local 17: 5.06s
New -local 18: 32.8s
Old -local 18: 38.8s
New -local 19: 270s
Old -local 19: 330s
[EDIT] A test run with n = 20 takes around 2280 seconds.
http://jsomers.com/nqueen_demo/nqueens.html
On this page there is a fast CPU-solver. I have a Pentium M from 2003, but owners of a more up to date hardware can make some benchmarks :)
http://jsomers.com/nqueen_demo/nqueens.html
On this page there is a fast CPU-solver. I have a Pentium M from 2003, but owners of a more up to date hardware can make some benchmarks :)
I took a look at this program and it looks similar to my old algorithm, i.e. only the mirror symmetry is removed. I didn't recompile the program but it takes around 10 seconds to solve 16 queen on a Core 2 1.83GHz. The "rotation symmetry" removed program (the single threaded cpu path) takes 5.78 seconds to solve 16 queen.
Running on GTX 285:
New -local 17: 2.41s
Old -local 17: 2.44s
New -local 18: 15.1s
Old -local 18: 18.5s
New -local 19: 123s
Old -local 19: 155s
New -local 20: ~ 1060s
ryta1203
01-Mar-2010, 19:23
Just tested this app with a beta of the next Stream SDK release and it works. Performance could be improved if you can find a way to vectorize your algorithm :) The compiled code has lots of flow control with little work (300 clauses, 425 ALU instructions on 5870).
Yes. What is your packing now? I'm pretty sure you should be able to improve your packing and reduce control flow.
I got some weird problems now...
NVIDIA has released their new driver (196.75) which is supposed to support the latest ICD which is compatible with the latest ATI Stream SDK. However, after I installed the new driver, it can't find its own OpenCL device. After a few check I found out that nvcuda.dll lacks the necessary clIcdGetPlatformIDsKHR function (it has the other two functions).
So now it's a weird thing, which is, after installing ATI Stream SDK 2.0.1, NVIDIA's OpenCL.dll (which is actually Khronos' opencl.dll) can find only ATI's GPU on my computer. I delayed installing ATI Stream SDK 2.0.1 on my computer just because I was afraid that I might not be able to use my GTX 285 for OpenCL, but now NVIDIA's own driver just did that... Anyway, I hope they fixed that soon.
Then there is another problem. With previous SDK, the OpenCL compiler crashes with my n-queen kernel. 2.0.1 SDK fixed that. However, after I added the "imbalance reducing" logic (which is basically another loop around the original loop) it crashes again. Fortunately I coded a "force no atomics" option and I can verify that it runs without the new loop.
The new program is posted here instead of under the first post because it currently crashes with ATI Stream SDK 2.0.1 without the -noatomics option. It does not contain any of my vectorization experiments yet.
New options:
-noatomics: do not use atomics
Fixed:
-threads #: works with -local now
[EDIT] I reinstalled the driver and now OpenCL works again for NVIDIA's GPU. Now any OpenCL program can see both platforms working! That's great!
mhouston
02-Mar-2010, 18:44
Reproduced and internal bug filed.
On my GTX 285, the CUDA version takes 0.686 second to run (16 queen), while the OpenCL version takes 0.643 second. Then I found that in my old CUDA version, the array indexing is reversed to avoid bank conflict, so I did the same, and it becomes a bit faster to 0.537 second.
Which version ran a little quicker?
Cuda = .686s
OCL = .643s
? = .537s
Which version ran a little quicker?
Cuda = .686s
OCL = .643s
? = .537s
The OpenCL version runs faster when it uses the same technique as the older CUDA version.
However, with newer algorithms 16 queen takes longer to run (although 17 or larger queens take less time to run).
Quick bump to point out that Ryan Smith from Anand's is using pcchen's program in his bench suite (with full credits in the body of the article):
http://www.anandtech.com/video/showdoc.aspx?i=3783&p=6
:mrgreen:
Lightman
27-Mar-2010, 14:16
Quick bump to point out that Ryan Smith from Anand's is using pcchen's program in his bench suite (with full credits in the body of the article):
http://www.anandtech.com/video/showdoc.aspx?i=3783&p=6
:mrgreen:
Yeah, I noticed that as well :smile:
Congrats PCChen :cool2:
Arnold Beckenbauer
27-Mar-2010, 19:29
c:\users\denis x64\downloads\nqueen_cl(2)>nqueen_cl -threads 15360 -noatomics -local 17
n-queen solver for opencl
ping-che chen
platform [0]: Ati stream
select platform 0
using gpu device
using 15360 threads
17-queen has 95815104 solutions (11977939 unique)
time used: 20.7s
hd4850 (675/993)
Quick bump to point out that Ryan Smith from Anand's is using pcchen's program in his bench suite (with full credits in the body of the article):
http://www.anandtech.com/video/showdoc.aspx?i=3783&p=6
:mrgreen:
This is cool :)
I'm still doing experients on vectorized version for Cypress, but my computer is not very cooperative and I have to replace my PSU (my old PSU is probably not powerful enough to drive both GTX 285 and 5850). I still expect to see at least 2x speed up for Cypress in vectorized version.
Lightman
29-Mar-2010, 11:48
This is cool :)
I'm still doing experients on vectorized version for Cypress, but my computer is not very cooperative and I have to replace my PSU (my old PSU is probably not powerful enough to drive both GTX 285 and 5850). I still expect to see at least 2x speed up for Cypress in vectorized version.
I can't wait for vectorized version! :smile:
ryta1203
30-Mar-2010, 16:37
Quick bump to point out that Ryan Smith from Anand's is using pcchen's program in his bench suite (with full credits in the body of the article):
http://www.anandtech.com/video/showdoc.aspx?i=3783&p=6
:mrgreen:
Is he optimizing the code at all? This is not an optimized version so I don't think he should be using it to bench anything, particular ATI GPUs.
You can get a good speedup by decreasing the CF through some simple techniques I believe.
I do have a question, does your code auto-verify results? I'm assuming so, yes?
ryta1203
30-Mar-2010, 16:40
This is cool :)
I'm still doing experients on vectorized version for Cypress, but my computer is not very cooperative and I have to replace my PSU (my old PSU is probably not powerful enough to drive both GTX 285 and 5850). I still expect to see at least 2x speed up for Cypress in vectorized version.
Global memory is done in 128 bit, but I'm not sure if they optimize float values. Of course, if you aren't memory bound then it won't matter. As far as the ALU ops, the compiler does do a fairly decent job of packing instructions.
I'm interested to see the vectorized results also.
ryta1203
30-Mar-2010, 17:24
Also, does anyone know how many threads Ryan Smith used in his "benchmark"?
ryta1203
30-Mar-2010, 17:40
It seems he is using 256k threads? Is that correct?
Global memory is done in 128 bit, but I'm not sure if they optimize float values. Of course, if you aren't memory bound then it won't matter. As far as the ALU ops, the compiler does do a fairly decent job of packing instructions.
I'm interested to see the vectorized results also.
Global memory access is not important for this program, because most of the work does not use global memory at all. Basically, for each work item, there is only 4 reads and 2 writes.
Right now I'm focusing my vectorization experiments on the "one queen at the corner" case, which is handled by a separate, simpler function. The ALU packing ratio of this function (non-vectorized) is only 34%, which is not very good. That's why I think it's possible that a vectorized version may be able to run twice as fast.
However, my current vectorized version, although have a ~84% ALU packing ratio, runs much slower because for some reason it has many global memory access (in the order of tens of thousands per work item). I suspect that this could be due to increased register pressure from vectorization. However, the number of work items per work group is already reduced 4 times, so it shouldn't need many more registers. And reducing the number of work items per work group further does not help either. I think if this problem can be solved, it's possible to see a good speed up from vectorization.
It seems he is using 256k threads? Is that correct?
I'm not sure because I didn't see any mention of this in the article. However, I don't think many more threads will help. By default, the number of thread is the number of cores * 256 * 2. So in the case of Radeon 5850 it's 9216 threads, and in the case of 5870 it'd be 10240 threads.
ryta1203
30-Mar-2010, 21:15
Can you post the vectorized code? I'd be interested to take a look.
Yes, if you use local memory this is correct; however, the 4870 doesn't have this and currently uses local memory as global memory.
Also, I'm curious as to why your packing ratio increased so much, this would imply that the compiler is not good at packing, which counters what I've seen from the compiler.
Also, have you applied any control flow reduction techniques? I applied one to one if-else statement in your queen1 kernel and on cypress got a 13% improvement on 16k threads with a 16x16 board.
ryta1203
30-Mar-2010, 21:20
I'm not sure because I didn't see any mention of this in the article. However, I don't think many more threads will help. By default, the number of thread is the number of cores * 256 * 2. So in the case of Radeon 5850 it's 9216 threads, and in the case of 5870 it'd be 10240 threads.
Ok, just wondering, I get different execution times for different number of threads, say 16k vs. 32k for 17x17 board (what he used).
Can you post the vectorized code? I'd be interested to take a look.
Yes, if you use local memory this is correct; however, the 4870 doesn't have this and currently uses local memory as global memory.
Also, I'm curious as to why your packing ratio increased so much, this would imply that the compiler is not good at packing, which counters what I've seen from the compiler.
Also, have you applied any control flow reduction techniques? I applied one to one if-else statement in your queen1 kernel and on cypress got a 13% improvement on 16k threads with a 16x16 board.
I now think it's quite possible the register pressure is responsible for the extra global memory access, because I changed all int4/uint4 into int2/uint2 and now all global memory access are gone.
Now the non-vectorized nqueen1 function takes 0.187 second to run (the time is given by profiler rather than wall clock). Vectorized nqueen1 takes 0.137 second to run. Of course, ALU packing ratio is a little lower because of using 2D vectors instead of 4D vectors, but it's still at about 81%. Also, there is no branch instructions in the main loop in the vectorized version anymore.
An interesting thing is, for nqueen1 alone, now GTX 285, which takes 0.158s to run, now runs slower than the vectorized 5850. Of course, it's still unknown whether the main part can be optimized for Cypress as good as this :)
The vectorized code looks like this:
__kernel void nqueen1_vec(int board_size, int level, int threads, __global uint2* params, __global uint2* results, __constant uint2* forbidden)
{
int idx = get_global_id(0);
int tid = get_local_id(0);
uint2 ms;
__local uint nsx[12][WORK_ITEMS];
__local uint nsy[12][WORK_ITEMS];
uint2 mask = params[idx];
uint2 left_mask = params[idx + threads];
uint2 right_mask = params[idx + threads * 2];
int2 second_row = convert_int2(params[idx + threads * 3]);
uint2 board_mask = (uint2) ((1 << board_size) - 1);
uint2 left_mask_big = (uint2) 0;
uint2 right_mask_big = (uint2) 0;
uint2 solutions = (uint2) 0;
int2 i = (int2) 0;
uint2 nsi, nsi_mask;
ms = mask | left_mask | right_mask | (convert_uint2(i < second_row) & (uint2)2);
nsi = ((ms + (uint2) 1) & ~ms);
nsx[0][tid] = nsi.x;
nsy[0][tid] = nsi.y;
while(any(i >= (int2) 0)) {
nsi.x = nsx[max(i.x, 0)][tid];
nsi.y = nsy[max(i.y, 0)][tid];
nsi_mask = convert_uint2((nsi & board_mask) != (uint2) 0) & convert_uint2(i >= (int2) 0);
{
// for nsi_mask == true...
mask |= (nsi & nsi_mask);
left_mask_big = select(left_mask_big, (left_mask_big << (uint2) 1) | (left_mask >> (uint2) 31), nsi_mask);
left_mask = select(left_mask, (left_mask | nsi) << (uint2) 1, nsi_mask);
right_mask_big = select(right_mask_big, (right_mask_big >> (uint2) 1) | (right_mask << (uint2) 31), nsi_mask);
right_mask = select(right_mask, ((right_mask | nsi) >> (uint2) 1), nsi_mask);
ms = mask | left_mask | right_mask | (convert_uint2((i + 1) < second_row) & (uint2)2);
nsi = select(nsi, ((ms + (uint2) 1) & ~ms), nsi_mask);
i = select(i, i + 1, convert_int2(nsi_mask));
nsx[max(i.x, 0)][tid] = nsi.x;
nsy[max(i.y, 0)][tid] = nsi.y;
}
{
// for nsi_mask == false
solutions -= (convert_uint2(i == (int2) level) & ~nsi_mask);
i = select(i - 1, i, convert_int2(nsi_mask));
nsi.x = nsx[max(i.x, 0)][tid];
nsi.y = nsy[max(i.y, 0)][tid];
nsi_mask = ~nsi_mask & convert_uint2(i >= (int2) 0);
// for i >= 0
mask = select(mask, mask & ~nsi, nsi_mask);
left_mask = select(left_mask, (((left_mask >> (uint2) 1) | (left_mask_big << (uint2) 31)) & ~nsi), nsi_mask);
left_mask_big = select(left_mask_big, (left_mask_big >> (uint2) 1), nsi_mask);
right_mask = select(right_mask, (((right_mask << (uint2) 1) | (right_mask_big >> (uint2) 31)) & ~nsi), nsi_mask);
right_mask_big = select(right_mask_big, (right_mask_big << (uint2) 1), nsi_mask);
ms = mask | left_mask | right_mask | nsi | (convert_uint2(i < second_row) & (uint2)2);
nsi = select(nsi, ((ms + nsi) & ~ms), nsi_mask);
nsx[max(i.x, 0)][tid] = nsi.x;
nsy[max(i.y, 0)][tid] = nsi.y;
}
}
results[idx] = solutions * (uint2) 8;
results[idx + threads] = solutions;
}
ryta1203
30-Mar-2010, 22:04
Do you mind posting the whole solution/project with the vectorized version and all? Thank you.
ryta1203
30-Mar-2010, 22:08
Also, I'd be interested to see if your increase in ALUPackingRatio was really attributable to the vectorization OR to your elimination of control flow.
All that control flow creates "blocks" in the clauses and the compiler doesn't "pack" across clauses, so when you reduce control flow like you did you get better packing.
Do you have a non-vectorized version with no control flow? What is the packing ratio for that?
ryta1203
30-Mar-2010, 22:22
How many threads?
Running 16834 threads on a 16x16 board my profiled kernel time for queen1 (non-vectorized, non-control flow) is 26.6232.
Also, I'd be interested to see if your increase in ALUPackingRatio was really attributable to the vectorization OR to your elimination of control flow.
All that control flow creates "blocks" in the clauses and the compiler doesn't "pack" across clauses, so when you reduce control flow like you did you get better packing.
Do you have a non-vectorized version with no control flow? What is the packing ratio for that?
I tried using
if(n) ...
or
a = n ? b : c;
neither increases ALU packing ratio.
I put the current experimental version in the attachment. This version now runs only nqueen1.
I did some further experiments on the main nqueen function.
I found that the rotation check part (check whether the solution is 2-way or 4-way symmetry or no symmetry at all) is a major performance bottleneck for Cypress. I compared two different versions, one with full rotation, another with no rotation at all, and assume all solutions are asymmetric (which is of course incorrect). The run time of the two versions are (for 17 queen):
Normal: 2.384s
Skip: 2.064s
This means the rotation part takes around 0.3s on GTX 285, which is not significant. However, for Cypress, it's quite different:
Normal: 7.311s
Skip: 3.885s
Almost half of the execution time is spent on the rotation check part. I'm not sure about the reason behind this, although honestly the rotation check part is not very well optimized. There are some global memory access which are not coalesced and the loops are not unrolled.
If the normal computation part of the main function is vectorized as the nqueen1, the skipped version is also faster, takes only 2.364s to run. However, for some reason it hangs with the rotation check enabled (the rotation check is not vectorized, so it's done one by one). If only one rotation check is done (there should be two in the vectorized version) it doesn't hang, but it takes 4.055s to run. So if both checks are done, it can be estimated that the run time should be around 5.7s.
So, apparently, the rotation check part is an important optimization target for running on Cypress. I'll think about it when I have more free time :)
ryta1203
31-Mar-2010, 02:31
pcchen,
One last question, what board size and thread count are you running? I just want to know for comparison.
pcchen,
One last question, what board size and thread count are you running? I just want to know for comparison.
I use 17 queen, and in most case the default thread count.
I tried to take advantage of the larger local memory of Cypress to avoid global memory access in the rotation check part. However, since currently Cypress' OpenCL still doesn't support byte addressable memory in local memory, I have to "simulate" it with logic shift and bitwise operations. Right now the result looks good: it doesn't hang anymore, and it's also significantly faster. Another side effect is, in the older version, using more threads don't seem to help much, but now it improves performance up to 73728 threads (the old default is 9216 threads).
The new version running for 17 queen takes 4.984s on my 5850 compared to previous 7.311s. I think there are still some optimizations can be done in the rotation check part for Cypress.
ryta1203
31-Mar-2010, 16:33
I use 17 queen, and in most case the default thread count.
I tried to take advantage of the larger local memory of Cypress to avoid global memory access in the rotation check part. However, since currently Cypress' OpenCL still doesn't support byte addressable memory in local memory, I have to "simulate" it with logic shift and bitwise operations. Right now the result looks good: it doesn't hang anymore, and it's also significantly faster. Another side effect is, in the older version, using more threads don't seem to help much, but now it improves performance up to 73728 threads (the old default is 9216 threads).
The new version running for 17 queen takes 4.984s on my 5850 compared to previous 7.311s. I think there are still some optimizations can be done in the rotation check part for Cypress.
Again, if you wouldn't mind posting the code. :)
Thank you.
Also, I seem to get an infinite loop when attempting to profile your vec version when I use int4/uint4 instead of "2", but the without profiling the code runs.
Again, if you wouldn't mind posting the code. :)
Thank you.
Also, I seem to get an infinite loop when attempting to profile your vec version when I use int4/uint4 instead of "2", but the without profiling the code runs.
I also encountered some driver hang up too. Normally it happens when there are some problems in the kernel (such as writing to some memory address it shouldn't write). But there are also some random hang up such as the situation I described earlier (the rotation check part causes driver hang). Fortunately, it's generally able to come back after a few seconds (at least on my Windows 7... :) ).
I did some modifications to the code. Now it automatically enables vectorized version when it detects a "vector GPU." Right now the vectorized version always use local memory, and don't use global atomics (in current Stream SDK it crashes the compiler anyway).
The new source code is in the attachment.
nqueen_cl 17
Platform [0]: ATI Stream
Select platform 0
Using GPU device
Using 81920 threads
17-queen has 95815104 solutions (11977939 unique)
Time used: 3.82s
HD5870 @ 900MHz GPU.
I've been thinking about using a different algorithm for computing unique solutions. I had tried another approach, using a rotational order instead of current top-down order to completely avoid generating redundant solutions. However, it's more computing intensive so it's actually slower on CPU even though it doesn't need a final rotation check step. The idea back then is that the number of solutions is actually very small compared to the whole problem space, so it makes sense to trade the computation in normal problem space to more computation for solutions only, and the total amount of computation will be smaller.
But think about it now, it seems to be a good idea for GPU, at least for Cypress. GPU have more computing power than CPU. In the case of Cypress, it's even more obvious because despite its higher theoretical computing power it's still slower than a GT200. That suggests a large amount of computing power is not well utilized.
Of course, now it's clear that register pressure is also an important issue to consider. So for a new algorithm to be faster, it needs to not only be less branchy but also uses less memory (local memory can cover for some register pressure as evidenced here, but even that has its limit).
Arnold Beckenbauer
31-Mar-2010, 19:54
C:\Users\Denis x64\Downloads\nqueen_cl_src>nqueen_cl -threads 15360 -noatomics -
local 17
N-Queen solver for OpenCL
Ping-Che Chen
Platform [0]: ATI Stream
Select platform 0
Using GPU device
Using 15360 threads
17-queen has 16344 solutions (2043 unique)
Time used: 0.919s
Something is wrong.
(HD4850 700/993, Stream SDK 2.01)
Something is wrong.
(HD4850 700/993, Stream SDK 2.01)
The new program is optimized for Evergreen architecture. Basically, it uses local memory to relieve register pressure. Since 48x0 does not support OpenCL style local memory, it may have some problems. I think it's still possible to fix that, but that'll have to wait till I get home on the weekend, as my Radeon 4850 is in my home computer. :)
ryta1203
06-Apr-2010, 03:19
I don't understand why you are so concerned with register pressure on the ATI cards, they have a much larger register file than CUDA cards. Also, register pressure effects different features for ATI than Nvidia. It's possible to decrease performance by reducing register pressure.
Also, as far as ALU utilization, you can just look at the ALUBusy counter.
I don't understand why you are so concerned with register pressure on the ATI cards, they have a much larger register file than CUDA cards. Also, register pressure effects different features for ATI than Nvidia. It's possible to decrease performance by reducing register pressure.
Also, as far as ALU utilization, you can just look at the ALUBusy counter.
Because by eliminating register spill its performance increased many times.
CarstenS
06-Apr-2010, 10:33
nqueen_cl 17
Platform [0]: ATI Stream
Select platform 0
Using GPU device
Using 81920 threads
17-queen has 95815104 solutions (11977939 unique)
Time used: 3.82s
HD5870 @ 900MHz GPU.
With this version:
http://forum.beyond3d.com/showpost.php?p=1416167&postcount=60
I got
nqueen_cl.exe -local -noatomics 17
Platform [0]: NVIDIA CUDA
Select platform 0
Using GPU device
Using 30720 threads
17-queen has 95815104 solutions (11977939 unique)
Time used: 1.14s
On a GTX 480 (stock)
Some more stuff (with Board-Soze of 18, this time), best of three runs:
nqueen 18
-> 7,74s
nqueen 18 with local switch
-> 7,91s
nqueen 18 with noatomics switch
->10.1s
nqueen 18 with local & noatomics switches
-> 8,6s
Interestingly, GPU-z was running alongside and i read the memory controller load.
local: 1-2 %
noatomics: 45ish % (38-49 %)
plain: 55ish % (50-57 %)
ryta1203
06-Apr-2010, 17:09
Because by eliminating register spill its performance increased many times.
On the ATI Cards? How do you know you were getting register spilling? How many GPR was the kernel using? How many wavefronts were running in parallel? Like I said, the ATI cards have a much larger register file than Nvidia cards, I'd be surprised if you were getting actually spilling.
Are you sure the performance increase wasn't from something else, like cache hit performance?
On the ATI Cards? How do you know you were getting register spilling? How many GPR was the kernel using? How many wavefronts were running in parallel? Like I said, the ATI cards have a much larger register file than Nvidia cards, I'd be surprised if you were getting actually spilling.
Are you sure the performance increase wasn't from something else, like cache hit performance?
It's pretty simple though. Before using local memory to replace private memory, it has many more global memory access than what actually needed in the program. That can only indicate that the private memory are not in registers. Those global memory access are all gone after replacing private memory with local memory. I sort of expected the same as you, that's why I didn't think that using local memory is good for ATI's GPU (I did that for NVIDIA's GPU because NVIDIA's registers can't be indexed, so private arrays can't be in registers). However, I found out that using local memory is also very good for ATI's GPU, even before vectorization.
ATI does have more total amount of registers, but it seems that the number of registers available to a work item (thread) is limited. This can be problematic when you need to vectorize. Since it's rarely required to vectorize on NVIDIA's GPU, the register pressure is actually less than ATI.
Some more stuff (with Board-Soze of 18, this time), best of three runs:
nqueen 18
-> 7,74s
nqueen 18 with local switch
-> 7,91s
nqueen 18 with noatomics switch
->10.1s
nqueen 18 with local & noatomics switches
-> 8,6s
Interestingly, GPU-z was running alongside and i read the memory controller load.
local: 1-2 %
noatomics: 45ish % (38-49 %)
plain: 55ish % (50-57 %)
Thanks for these interesting results :)
It's very surprising to see that using local memory is a bit slower than not using local memory on Fermi. As your GPU-Z data has shown, using local memory eliminates almost all global memory traffic, which should be good. Although I guess that the L1 cache probably helps quite a lot. It doesn't seem to work for the no-atomics case, though.
I sort of expected the same as you, that's why I didn't think that using local memory is good for ATI's GPU (I did that for NVIDIA's GPU because NVIDIA's registers can't be indexed, so private arrays can't be in registers). However, I found out that using local memory is also very good for ATI's GPU, even before vectorization.
BTW, according some post in ATI OpenCL forum it looks like array are currently stored in global memory (!). You may still get decent performance thanks to HD5xxx cache but they can really hurt performance. This may explain why you see a performance improvement for using local memory on ATI too.
ryta1203
06-Apr-2010, 22:15
It's pretty simple though. Before using local memory to replace private memory, it has many more global memory access than what actually needed in the program. That can only indicate that the private memory are not in registers. Those global memory access are all gone after replacing private memory with local memory. I sort of expected the same as you, that's why I didn't think that using local memory is good for ATI's GPU (I did that for NVIDIA's GPU because NVIDIA's registers can't be indexed, so private arrays can't be in registers). However, I found out that using local memory is also very good for ATI's GPU, even before vectorization.
ATI does have more total amount of registers, but it seems that the number of registers available to a work item (thread) is limited. This can be problematic when you need to vectorize. Since it's rarely required to vectorize on NVIDIA's GPU, the register pressure is actually less than ATI.
I'm mostly certain this is not true.
Though I haven't looked at, nor do I hav, your old code, were you running this on a RV870? What do you mean by "private memory" (since this does not exist under this terminology)? Also, the compiler does a good job of packing scalar values into the 128-bit registers (meaning it doesn't use one to one scalar value to register).
ryta1203
06-Apr-2010, 22:16
BTW, according some post in ATI OpenCL forum it looks like array are currently stored in global memory (!). You may still get decent performance thanks to HD5xxx cache but they can really hurt performance. This may explain why you see a performance improvement for using local memory on ATI too.
Yes, this is true also, thanks for remembering this. This is probably the cause of the performance increase, not register pressure.
ryta1203
06-Apr-2010, 22:17
Thanks for these interesting results :)
It's very surprising to see that using local memory is a bit slower than not using local memory on Fermi. As your GPU-Z data has shown, using local memory eliminates almost all global memory traffic, which should be good. Although I guess that the L1 cache probably helps quite a lot. It doesn't seem to work for the no-atomics case, though.
Does the local memory become cache by default (if it's not used)?
BTW, according some post in ATI OpenCL forum it looks like array are currently stored in global memory (!). You may still get decent performance thanks to HD5xxx cache but they can really hurt performance. This may explain why you see a performance improvement for using local memory on ATI too.
Although this could be a reason, but it still does not explain why, when the non-vectorized code does not access global memory (after the local memory trick), it does after vectorized to 4D. The vectorized variables are not arrays, so they should be able to use registers for them. Only after change the vectorization to 2D the global memory access are eliminated.
Does the local memory become cache by default (if it's not used)?
Not on ATI's hardware. Fermi has a switchable L1 cache/local memory, but I don't know how it works (but it probably not automatic).
I'm mostly certain this is not true.
Though I haven't looked at, nor do I hav, your old code, were you running this on a RV870? What do you mean by "private memory" (since this does not exist under this terminology)? Also, the compiler does a good job of packing scalar values into the 128-bit registers (meaning it doesn't use one to one scalar value to register).
Yes, I use Radeon 5850. I think I already said that in my previous post.
"Private memory" is an OpenCL terminology. Since the program is written with OpenCL, I think it's appropriate to use OpenCL terms rather than some ambigious vendor specific terms.
Register packing is not an issue here, because it can't explain why 4D vectorization needs to access global memory, while 2D doesn't. This is mostly likely to be explained by register pressure.
ryta1203
07-Apr-2010, 00:39
Not on ATI's hardware. Fermi has a switchable L1 cache/local memory, but I don't know how it works (but it probably not automatic).
Yes, I know, I was asking about Fermi because I know about the switchable L1 cache/local memory.
So does it?
ryta1203
07-Apr-2010, 00:40
Yes, I use Radeon 5850. I think I already said that in my previous post.
"Private memory" is an OpenCL terminology. Since the program is written with OpenCL, I think it's appropriate to use OpenCL terms rather than some ambigious vendor specific terms.
Register packing is not an issue here, because it can't explain why 4D vectorization needs to access global memory, while 2D doesn't. This is mostly likely to be explained by register pressure.
I can't comment on that since I haven't been able to get the 4D vectorization to work with your code and the profiler.
ryta1203
07-Apr-2010, 00:46
Yes, I use Radeon 5850. I think I already said that in my previous post.
"Private memory" is an OpenCL terminology. Since the program is written with OpenCL, I think it's appropriate to use OpenCL terms rather than some ambigious vendor specific terms.
Register packing is not an issue here, because it can't explain why 4D vectorization needs to access global memory, while 2D doesn't. This is mostly likely to be explained by register pressure.
Yes, ok, I see. And Dade is correct that the ATI OpenCL compiler currently uses global memory for this.
I haven't looked at your latest posted code yet; however, is it possible that with 4D vectorization you are going over the local memory size?
I will try to run your 4D vec. code and take a look to see. I honestly still don't believe that register pressue is the issue, it doesn't appear that every other option has been exhausted.
Yes, I know, I was asking about Fermi because I know about the switchable L1 cache/local memory.
So does it?
Its design is switchable between 16KB local/48 KB cache and 48KB local/16 KB cache, so either way it has a certain amount of L1 cache available.
Since DirectCompute in DX11 requires 32KB shared memory, I guess it will always be in the 48KB local/16KB cache mode when running DX11. However, when running CUDA, it's probably possible to automatically detect which mode it should use by checking on the required local memory size of the kernel. However, I don't know whether it does this right now.
Yes, ok, I see. And Dade is correct that the ATI OpenCL compiler currently uses global memory for this.
I haven't looked at your latest posted code yet; however, is it possible that with 4D vectorization you are going over the local memory size?
I will try to run your 4D vec. code and take a look to see. I honestly still don't believe that register pressue is the issue, it doesn't appear that every other option has been exhausted.
The 4D vectorized version uses only 64 work items per work group. 2D version uses 128 work items. The amount of local memory used is the same (actually it's 12KB).
I just managed to make the OpenCL kernel analyzer to work on my computer by disabling my GTX 285 (it apparently tries to create an OpenCL context with NVIDIA's platform?). The analysis shows that the 4D version does have memory access instructions in the main loop, while the 2D version does not. I don't understand why it should have those memory access instructions though.
I uploaded the resulting files in the attachment.
CarstenS
07-Apr-2010, 06:26
Thanks for these interesting results :)
It's very surprising to see that using local memory is a bit slower than not using local memory on Fermi. As your GPU-Z data has shown, using local memory eliminates almost all global memory traffic, which should be good. Although I guess that the L1 cache probably helps quite a lot. It doesn't seem to work for the no-atomics case, though.
A note of caution: local memory only slows nqueen not konsistenly down! I'll re-check and run some more board sizes today, but I'm under the impression that it's only with smaller board sizes. With 19 and 20 it's notably faster. :)
Not on ATI's hardware. Fermi has a switchable L1 cache/local memory, but I don't know how it works (but it probably not automatic).
AFAIBT at GPU Tech Conf, you'd need a driver reload for it to switch. But that's from a time when Nvidia was confident to ship Fermi at christmas (2009!), so it may have become invalid info.
ryta1203
07-Apr-2010, 17:18
pcchen,
Can I get a full working 4D version from you? Thanks in advance. I have a your latest 2D version, with both queen1 and queen vectorized and you posted the 4D queen1 kernel, but I don't have a 4D queen kernel.
pcchen,
Can I get a full working 4D version from you? Thanks in advance. I have a your latest 2D version, with both queen1 and queen vectorized and you posted the 4D queen1 kernel, but I don't have a 4D queen kernel.
Unfortunately, there is no full 4D version, as the nqueen1 4D version is already too slow for the full version to be useful. A 4D version of the nqueen function will be much worse because the huge rotation check part will have to replicated by 4 times, which will make things worse.
I posted a working 4D nqueen1 executable in the attachment if you want to test it. It runs 64 work items per work group instead of 128 work items in the 2D case, to make all resource usages the same.
The profiling results looks like this:
4D:
Method, ExecutionOrder, GlobalWorkSize, GroupWorkSize, KernelTime, LocalMem, MemTransferSize, ALU, Fetch, Write, Wavefront, ALUBusy, ALUFetchRatio, ALUPacking, ALUStalledByLDS, LDSBankConflict, FetchUnitBusy, FetchUnitStalled, WriteUnitStalled
BufHostToDevice, 1, , , , , 48, , , , , , , , , , , ,
BufHostToDevice, 2, , , , , 10616832, , , , , , , , , , , ,
BufHostToDevice, 3, , , , , 4, , , , , , , , , , , ,
nqueen1_vec_Cypress, 4, {17344; 1; 1}, {64; 1; 1}, 828.42863, 12288,, 1196151.55, 43827.74, 87646.48, 271.00, 10.87, 27.29, 84.66, 1.00, 0.00, 1.57, 0.00, 0.00
BufDeviceToHost, 5, , , , , 589824, , , , , , , , , , , ,
2D:
Method, ExecutionOrder, GlobalWorkSize, GroupWorkSize, KernelTime, LocalMem, MemTransferSize, ALU, Fetch, Write, Wavefront, ALUBusy, ALUFetchRatio, ALUPacking, ALUStalledByLDS, LDSBankConflict, FetchUnitBusy, FetchUnitStalled, WriteUnitStalled
BufHostToDevice, 1, , , , , 48, , , , , , , , , , , ,
BufHostToDevice, 2, , , , , 10616832, , , , , , , , , , , ,
BufHostToDevice, 3, , , , , 4, , , , , , , , , , , ,
nqueen1_vec_Cypress, 4, {34688; 1; 1}, {128; 1; 1}, 131.87619, 12288,, 535959.33, 4.00, 2.00, 542.00, 60.61, 133989.83, 81.67, 11.54, 0.00, 0.00, 0.00, 0.00
BufDeviceToHost, 5, , , , , 589824, , , , , , , , , , , ,
ryta1203
07-Apr-2010, 22:30
Ok, thanks, like I said before I couldn't get your 4D vectorized version to profile without it hanging, it simply kept crashing my computer.
Ok, thanks, like I said before I couldn't get your 4D vectorized version to profile without it hanging, it simply kept crashing my computer.
I tried to make a 128 work items 4D version nqueen1, and it crashed my computer too. What's interesting is, the computer was not actually crashed, a music player running in the background kept running, for example. However, the display was frozen. In most similar cases, the display driver should recover by the system, but it didn't in this case. I have to reboot the computer.
CarstenS
07-Apr-2010, 23:21
Does the vec4-version really do the same amount of work or did i get lost somewhere in the middle? On a GTX480 its actually quite a bit faster than the normal version.
Does the vec4-version really do the same amount of work or did i get lost somewhere in the middle? On a GTX480 its actually quite a bit faster than the normal version.
It only does nqueen1, i.e. a queen in the corner case.
CarstenS
07-Apr-2010, 23:50
Ah, okay. I read the "1" but didn't know what to make of it. Since the vec4-version accepted the regular board-sizes as well, I was a bit puzzled.
ryta1203
08-Apr-2010, 15:00
I tried to make a 128 work items 4D version nqueen1, and it crashed my computer too. What's interesting is, the computer was not actually crashed, a music player running in the background kept running, for example. However, the display was frozen. In most similar cases, the display driver should recover by the system, but it didn't in this case. I have to reboot the computer.
Ah! You probably have VPU Recover turned off right? I do.
Ah! You probably have VPU Recover turned off right? I do.
To my understanding, VPU Recover is only available under Windows XP. Windows Vista and Windows 7 are supposed to be able to restart crashed video driver automatically. I also encountered several occasions of recovered restarts (I just wait for a few seconds and the video driver restarts automatically), but in this case it never restarts.
ryta1203
08-Apr-2010, 22:00
BTW, the problem with the vec4D is supposedly going to be fixed in the soon to be upcoming release. Also, I tried using __local uint4 and then accessing .x, I got an error.... this is also suppose to be fixed in the upcoming SDK release (2.02?).... via Micah.
OpenGL guy
08-Apr-2010, 23:22
To my understanding, VPU Recover is only available under Windows XP. Windows Vista and Windows 7 are supposed to be able to restart crashed video driver automatically. I also encountered several occasions of recovered restarts (I just wait for a few seconds and the video driver restarts automatically), but in this case it never restarts.
You can try changing the TdrDelay. Add a DWORD value called TdrDelay to HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDriv ers. I changed mine to 0x3c (60s), for example, so I could be certain that the GPU would stay busy while I debugged the driver while running a 10 minute kernel :)
I tried compiling your 4D vectorized kernel with an internal tool and I don't see any memory accesses generated in your main loop. Can you post the app somewhere so I can try the full version with our OpenCL tool chain to make sure everything's working as expected?
OpenGL guy
08-Apr-2010, 23:22
BTW, the problem with the vec4D is supposedly going to be fixed in the soon to be upcoming release. Also, I tried using __local uint4 and then accessing .x, I got an error.... this is also suppose to be fixed in the upcoming SDK release (2.02?).... via Micah.
Next release will be 2.1.
You can try changing the TdrDelay. Add a DWORD value called TdrDelay to HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDriv ers. I changed mine to 0x3c (60s), for example, so I could be certain that the GPU would stay busy while I debugged the driver while running a 10 minute kernel :)
This is a great tip, thanks :)
I tried compiling your 4D vectorized kernel with an internal tool and I don't see any memory accesses generated in your main loop. Can you post the app somewhere so I can try the full version with our OpenCL tool chain to make sure everything's working as expected?
I posted one here (http://forum.beyond3d.com/attachment.php?attachmentid=430&d=1270674247). I checked it with the kernel analyzer and the 2D version uses 17 GPRS while 4D version uses 31, which seems to be alright, but the resulting assembly of the 4D version has memory access instructions in the loop, I don't understand why.
I fixed a small bug in current code so that Radeon 4850 is able to run vectorized code. However, it's slower than the old scalar code, because local memory is not available, unfortunately. I also added a new option to disable vectorization.
I put the source code and executable in the first post. You can also download them here:
Source (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nqueen_cl_src.zip)
Executable (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nqueen_cl.zip)
CarstenS
10-Apr-2010, 07:22
When trying to run the CPU-device, I get an Open CL error: -46 (line:261)
In my system I had a HD 5870 with OpenCL CPU/GPU principally working, now I'v got temporarily a GTX 480. I'm using the following command-line: nqueen_cl -clcpu -platform 0 8
edit:
As was to be exptected, on the GPU-device, -novec is no different in case of GTX 480. I am looking forward to see if AMDs driver can extract as much parallelism from the code as an explicit vectorization.
The CPU OpenCL bug is fixed now :)
Currently vectorization is not used on NVIDIA's hardware, so -novec has no effect on them. There is currently no way to force vectorization on NVIDIA's hardware, although my previous experiments had shown that it's pretty bad for them.
A problem with parallelism of this program is that the loop is pretty small, so there is really not much parallelism to be found. The 2D or 4D vectorized version is, in a sense, loop unrolling. In theory, NVIDIA's hardwares could also benefit from this, but the register file is smaller, so it's not a good idea.
CarstenS
10-Apr-2010, 09:06
Thanks!
On my CPU (a 3,8-GHz E8500) I'm getting now 109 secs for a board size of 18, which makes the GTX 480 about 14 times faster. If I'm not mistaken, that makes one of the CPUs' cores about 6,3 times as fastas one of the so-called "CUDA-Cores" (per clock). ;)
HD5870 @ 900/5000MHz
GPU:
nqueen_cl 17
Platform [0]: ATI Stream
Select platform 0
Using GPU device
Using 81920 threads
Using vectorization
17-queen has 95815104 solutions (11977939 unique)
Time used: 3.82s
CPU:
Cnqueen_cl -clcpu 17
Platform [0]: ATI Stream
Select platform 0
Using CPU device
Using 8192 threads
17-queen has 95815104 solutions (11977939 unique)
Time used: 7.96s
The -novec switch causes program to stop executing.
trinibwoy
10-Apr-2010, 14:37
Looks like I'm getting an incorrect result from the latest build.
Original:
Platform [0]: NVIDIA CUDA
Select platform 0
Using GPU device
Using 15360 threads
Using global atomics
17-queen has 95815104 solutions (11977939 unique)
Time used: 3.5s
Latest:
Platform [0]: NVIDIA CUDA
Select platform 0
Using GPU device
Using 15360 threads
Using global atomics
17-queen has 10919088 solutions (1364886 unique)
Time used: 0.32s
Looks like I'm getting an incorrect result from the latest build.
Looks like you got the nqueen1 only version (which runs only the "one queen in the corner" case). I have updated the files in the first post and I think that should be the correct one.
HD5870 @ 900/5000MHz
The -novec switch causes program to stop executing.
I think you'll need -noatomics for -novec on RV870 because there is a compiler related problem in current AMD Stream SDK.
Bjorn Häske
09-May-2010, 18:11
New SDK was released.
With it, it's possible to activate bitalign on ATI/AMD GPUs using OpenCL.
Oh, and some multi-gpu issues fixed.
ryta1203
10-May-2010, 02:18
New SDK was released.
With it, it's possible to activate bitalign on ATI/AMD GPUs using OpenCL.
Oh, and some multi-gpu issues fixed.
Performance for this nqueen sovler also improves with latest SDK update.. now the vec4 version is the fastest. The packing on nqueen1 is ~94%.
I think it's now possible to have a vec4 nqueen to run with the new SDK (in the older SDK it frequently hang the system). Also, the "semi-byte addressing" hack can be retired now :)
ryta1203
10-May-2010, 16:25
I think it's now possible to have a vec4 nqueen to run with the new SDK (in the older SDK it frequently hang the system).
Yes, as I said above, the vec4 version runs great now and is the fastest version.
Yes, as I said above, the vec4 version runs great now and is the fastest version.
Yes, the nqueen1_vec function is now faster in vec4. But the nqueen_vec function still behaves wildly. I didn't write a vec4 version of nqueen_vec, but now I modified one from the vec2 version, but it still hangs (although recoverable), probably from a very long running time when solving 17 queen. I tested it with 8 queen to 16 queen and it gives correct answers. However, it runs very slowly with 16 queen (around 10 seconds).
Since the vec4 version of nqueen_vec has a very big symmetry check part (basically 4 times) it could be the reason why it's slow.
Arnold Beckenbauer
10-May-2010, 22:03
n00b's question: What is the VRAM usage of nqueen_cl?
n00b's question: What is the VRAM usage of nqueen_cl?
It's very small. Basically each thread reads and writes its own data, which is around the board size * 32 bits number, and write to 2 32 bits number. So for around 70000 threads, it's at most a few MB.
ryta1203
11-May-2010, 14:38
Yes, the nqueen1_vec function is now faster in vec4. But the nqueen_vec function still behaves wildly. I didn't write a vec4 version of nqueen_vec, but now I modified one from the vec2 version, but it still hangs (although recoverable), probably from a very long running time when solving 17 queen. I tested it with 8 queen to 16 queen and it gives correct answers. However, it runs very slowly with 16 queen (around 10 seconds).
Since the vec4 version of nqueen_vec has a very big symmetry check part (basically 4 times) it could be the reason why it's slow.
Do you mean it's still slow in comparison to the full version? the full vec2 version? Have you looked at the ISA? Can you put up the full vec4 version (nqueen and nqueen1)?
Do you mean it's still slow in comparison to the full version? the full vec2 version? Have you looked at the ISA? Can you put up the full vec4 version (nqueen and nqueen1)?
Yes, it's slower than the original vec2 version (which takes only 0.84 second).
I updated the codes with my vec4 version here:
Source (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nqueen_cl_src.zip)
Executable only (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nqueen_cl.zip)
I included 64 bits executables because 32 bits executables don't work on my computer when NVIDIA's video card is enabled.
The modifications for byte addressable extensions are not in there yet.
The reason for vec4 being slow is that you will have to use
__kernel __attribute__((reqd_work_group_size(64, 1,1)))
Without it the compiler will assume that you need your work group size to be 256
(I remember reading that on an ATI forum that per opencl specs this is the default work group size the compiler will assume).
If the compiler assumes that you need 256 threads he will start spilling registers to the memory because your kernel
uses a lot of registers
(you can see too many RD_SCRATCH and MEM_SCRATCH_WRITE in the ISA generated), so there are a lot of
global memory read/write uncached, also the code size is big more than 60k . When I use
__kernel __attribute__((reqd_work_group_size(64, 1,1)))
in your kernel my stock 5850 took 8.95 Sec in 17 queens using 95 registers (without this option it took more than 10 sec in the 16 queens).
E:\nqueen_cl>nqueen_cl64.exe -vec4 -local -p 17
N-Queen solver for OpenCL
Ping-Che Chen
Platform [0]: ATI Stream
Select platform 0
Using GPU device
Using 73728 threads
Using vectorization
Use 4D vectors
Profile time: 8778111182ns
17-queen has 95815104 solutions (11977939 unique)
Time used: 8.95s
The reason for vec4 being slow is that you will have to use
__kernel __attribute__((reqd_work_group_size(64, 1,1)))
Without it the compiler will assume that you need your work group size to be 256
(I remember reading that on an ATI forum that per opencl specs this is the default work group size the compiler will assume).
If the compiler assumes that you need 256 threads he will start spilling registers to the memory because your kernel
uses a lot of registers
(you can see too many RD_SCRATCH and MEM_SCRATCH_WRITE in the ISA generated), so there are a lot of
global memory read/write uncached, also the code size is big more than 60k . When I use
__kernel __attribute__((reqd_work_group_size(64, 1,1)))
in your kernel my stock 5850 took 8.95 Sec in 17 queens using 95 registers (without this option it took more than 10 sec in the 16 queens).
E:\nqueen_cl>nqueen_cl64.exe -vec4 -local -p 17
N-Queen solver for OpenCL
Ping-Che Chen
Platform [0]: ATI Stream
Select platform 0
Using GPU device
Using 73728 threads
Using vectorization
Use 4D vectors
Profile time: 8778111182ns
17-queen has 95815104 solutions (11977939 unique)
Time used: 8.95s
Ah that's explained it, thanks. However it's still slower than the vec2 version, as vec2 version takes only about 5 seconds to run 17-queen. But this is still an important tip. :)
Recently I decided to test a new way to make this program more balanced. In the original algorithm, every thread (or work item) may have very different running time. For example, one thread may end pretty early because that branch may be quite hopeless, while another thread may produce a lot of solutions. However, because within a warp each thread have to wait for one another, a lot of threads may sit idle for a lot of time.
The modification using global atomics does not actually solve this problem, because its structure is two loops. That means, if some thread finishes its inner loop, it still has to wait for other threads to do its outer loop. It looks like this:
while(condition for outer loop) {
initialize for next batch
while(condition for inner loop) {
...
}
}
One way to solve this is to make it a single loop, that is:
for(;;) {
if(condition for inner loop) {
...
}
else {
if(condition for outer loop) {
initialize for next batch
}
else {
break;
}
}
}
This way, for any thread with inner loop finished, it can continue to do another batch of inner loops without having to wait for other threads in the same warp.
A breif test on my GeForce GTX 285 shows around 10% performance improvements. Along with other optimizations (mainly adjustments on the number of threads...), now 17 queen takes around 1.88s to run on GTX 285.
I decided to test it on a GeForce GTX 460 (which actually performs quite well before this modification), and it does not run very well. Actually, it does not run at all. However, it runs without global atomics, but slower. After some checks here and there, I finally found that the problem is with the board_array variable, which is originally an int array:
#define BOARD_ARRAY_DECL int board_array[32];
At first, I changed it to using shared memory, and it worked! (I tried it in earlier version to improve performance but GTX 285's shared memory is not big enough) Then I decided to change back to using normal private memory but with char:
#define BOARD_ARRAY_DECL char board_array[32];
and it still works. Weird. Looks like a bug in NVIDIA's OpenCL implementation.
Anyway, now it performs well, but not that well. It's only a little faster than a GTX 285 now. I suspect that in earlier version (the two loop version) it's faster because it has better warp scheduling than GTX 285 (which may mean that the threads don't have to wait as long as in the case of a GTX 285, in the original version). After the modification, the advantage is gone and now GTX 285 is up to the same league of a GTX 460.
Also, I modified the program to choose the number of threads by the number of "cores", and multiply it by the number of threads per "core." For example, a GTX 285 has 30 "cores." So the default number of threads is 30*256 = 7680. It works rather well (changing the number does not make performance better, generally makes it worse). However, for Fermi it's bad, since Fermi has very small number of "cores." On the GeForce GTX 460, the default number is 7*256 = 1792, but tests show that the best number is 5376, that is, 7*256*3. Using more threads per group does not improve performance either.
Unfortunately, since I don't have access to a Radeon 5850 anymore, I can't test it with the new OpenCL SDK from AMD, so I didn't make any changes to the vectorized version.
I also write a multi-threaded CPU version for Win32, so it now takes advantage of all cores of a CPU. My Core i7 920 takes around 6.8 seconds to solve 17 queen (GTX 285 is 1.88s and GTX 460 is 1.65s).
Another possible improvement is to make it supports multiple GPUs on a computer. Then two GTX 460 (which seems to be quite popular now) should be twice as fast as one GTX 460. But that's a bit complex and I think I'll do that when I have more free time :)
The new version can be downloaded here:
http://www.kimicat.com/dang-an-jia/nqueen_cl_src.zip
17 queen:
It's 6.75s on an i7-860 with HT on.
It wouldn't run on my 5850 (couldn't recognize Device 1 and it stayed on CPU) so I'm installing the 2.2 SDK.
installed stream sdk 2.2, GPU result:
Vec2: 9.932 seconds (18432 threads)
16.9 secs for vec4 (18432 threads)
global atomics: 0.651s (4608 threads)
using clcpu takes 31s (16384 threads)
C:\temp\Stream\nqueen2>nqueen_cl -blocksize 248 -novec 17
N-Queen solver for OpenCL
Ping-Che Chen
Platform [0]: ATI Stream
Select platform 0
Using GPU device
Using 4464 threads
Block size = 248 threads
Using global atomics
17-queen has 10919088 solutions (1364886 unique)
Time used: 0.644s
I was playing with the block size, shortest time I got was 0.627 with 256 blocks (errors above that)
edit: old results
nqueen -cpu 6.64s
nqueen -clcpu 3.85s
17 queen:
It's 6.75s on an i7-860 with HT on.
It wouldn't run on my 5850 (couldn't recognize Device 1 and it stayed on CPU) so I'm installing the 2.2 SDK.
installed stream sdk 2.2, GPU result:
Vec2: 9.932 seconds (18432 threads)
16.9 secs for vec4 (18432 threads)
global atomics: 0.651s (4608 threads)
using clcpu takes 31s (16384 threads)
I was playing with the block size, shortest time I got was 0.627 with 256 blocks (errors above that)
Hmm... so it seems to be quite weird. I remembered that it was much quicker on 5850 with vectored kernels (around 5 seconds). I'll try it with the new SDK 2.2 to see what's the problem when I have some time.
The result from global atomics doesn't seem to be right. 17 queen should have 95815104 distinct solutions and 11977939 unique solutions.
clcpu also was also faster. I remembered getting around 7.5s on my Core i7 920.
Hmm... so it seems to be quite weird. I remembered that it was much quicker on 5850 with vectored kernels (around 5 seconds). I'll try it with the new SDK 2.2 to see what's the problem when I have some time.
The result from global atomics doesn't seem to be right. 17 queen should have 95815104 distinct solutions and 11977939 unique solutions.
clcpu also was also faster. I remembered getting around 7.5s on my Core i7 920.
I posted my old results there, cpu seems to be unchanged, cl values are out of whack.
>nqueen_cl -cpu 17
17-queen has 95815104 solutions (11977939 unique)
Time used: 4.85s
>nqueen_cl -novec 17
Platform [0]: ATI Stream
Select platform 0
Using GPU device
Using 5120 threads
Block size = 256 threads
Using global atomics
17-queen has 10919088 solutions (1364886 unique)
Time used: 0.281s
>nqueen_cl -vec2 17
Unknown option -vec2
Platform [0]: ATI Stream
Select platform 0
Using GPU device
Using 20480 threads
Block size = 256 threads
Using vectorization
Use 2D vectors
17-queen has 95815104 solutions (11977939 unique)
Time used: 3.96s
i7-920 @ 3996 MHz
HD5870 @ 900/1250 MHz
Stream SDK 2.2, Cat 10.8 beta5
Win7 x64
Hmm.. it looks like novec is not correct on SDK 2.2. However, vec2 looks to be faster than SDK 2.1. That's pretty nice.
My computer with the Radeon 5850 is still using Cat 10.6. Since I'm not using CrossFire, is it ok to use 10.7 (since SDK 2.2 requires 10.7)? I don't want to interfere with the normal usage (i.e. gaming) of that computer now...
mhouston
17-Aug-2010, 17:01
For SDK 2.2, you'll want 10.7b, the driver linked to on the SDK download page.
ryta1203
17-Aug-2010, 22:01
Hmm.. it looks like novec is not correct on SDK 2.2. However, vec2 looks to be faster than SDK 2.1. That's pretty nice.
My computer with the Radeon 5850 is still using Cat 10.6. Since I'm not using CrossFire, is it ok to use 10.7 (since SDK 2.2 requires 10.7)? I don't want to interfere with the normal usage (i.e. gaming) of that computer now...
I would be careful with using SDK 2.2/10.7 as there are all kinds of problems with it. I'm not sure why those problems didn't get fixed before release but....
...personally, I had to go back to using 2.1/10.5. It's possible that 10.8 beta may fix the issues but I don't know.
I fixed a bug in the non-local non-vectorized version. It no longer hangs now (and should run correctly on AMD's GPU too). Actually since local version is faster on both GPU, I think it's probably better to make it the default kernel (and add a -nolocal switch to turn it off for experiments).
Unfortunately, now GTX 460 still has some weird bug. The "forbidden" array, which was declared __constant, does not work this way and has to be declared as __global. This only happens on GTX 460. GTX 285 and 8800GT are all fine by __constant.
I also forgot to add pragma to enable cl_khr_byte_addressable_store, which is required for board_array to be char. The original kernel uses int, but apparently GTX 460 don't like that.
The new version can be downloaded from the same place:
http://www.kimicat.com/dang-an-jia/nqueen_cl_src.zip
I added the new global atomics mechanism to the vectorized kernels. Apparently its much more effective on AMD GPU. Running 17 queen on Radeon 5850 is now faster @ 3.31s instead of previous ~5s.
nqueen_cl 16: 0.902s
nqueen_cl 17: 3.31s
nqueen_cl 18: 23s
These are with SDK 2.2 and 10.7 beta driver (for OpenCL).
However, there is a very weird result for the non-vectorized kernel: it's now much faster on Radeon 5850:
nqueen_cl -novec -local 16: 0.858s
nqueen_cl -novec -local 17: 2.99s
nqueen_cl -novec -local 18: 20.4s
compared to noatomics version:
nqueen_cl -novec -noatomics -local 16: 0.887s
nqueen_cl -novec -noatomics -local 17: 5.59s
nqueen_cl -novec -noatomics -local 18: 40s
Very interesting. It seems to me that SDK 2.2 is much better at running non-vectorized code than SDK 2.1. Kudos to AMD for the great work :)
Again, the latest version can be downloaded from the same location.
http://www.kimicat.com/dang-an-jia/nqueen_cl_src.zip
That's weird!
The new binary now takes much longer time to process the non-vector kernel:
>nqueen_cl -novec 17
Platform [0]: ATI Stream
Select platform 0
Using GPU device
Using 5120 threads
Block size = 256 threads
Using global atomics
17-queen has 95815104 solutions (11977939 unique)
Time used: 31.4s
Compare this to my previous result:
>nqueen_cl -novec 17
Platform [0]: ATI Stream
Select platform 0
Using GPU device
Using 5120 threads
Block size = 256 threads
Using global atomics
17-queen has 10919088 solutions (1364886 unique)
Time used: 0.281s
The difference is that the new binary now reports the correct number of solutions.
rpg.314
21-Aug-2010, 11:47
That's weird!
The new binary now takes much longer time to process the non-vector kernel:
>nqueen_cl -novec 17
Platform [0]: ATI Stream
Select platform 0
Using GPU device
Using 5120 threads
Block size = 256 threads
Using global atomics
17-queen has 95815104 solutions (11977939 unique)
Time used: 31.4s
Compare this to my previous result:
>nqueen_cl -novec 17
Platform [0]: ATI Stream
Select platform 0
Using GPU device
Using 5120 threads
Block size = 256 threads
Using global atomics
17-queen has 10919088 solutions (1364886 unique)
Time used: 0.281s
The difference is that the new binary now reports the correct number of solutions.
Are your timings for the non-vector kernel repeatable. The difference in time suggests that it might not be the case due to some freak driver/hw situation.
Are your timings for the non-vector kernel repeatable.
Yep, but there is a strange hang down of the system during the execution. The vector kernel is OK.
rpg.314
21-Aug-2010, 12:47
Yep, but there is a strange hang down of the system during the execution. The vector kernel is OK.
Something for the driver people to look at then.
The old non-local non-vector version has a bug which may cause kernel hang, so it's not usable (you can also see the result is incorrect). This is now fixed that's why it's slower while producing correct result. :)
I suggest always using local version (i.e. use -local argument). It's faster not only on NVIDIA's GPU but also on AMD's GPU. Actually it's now faster than vectorized version.
I also done a few profiling and strangely the vectorized version has lower ALU busy rate than scalar version (about 30% vs 40%). The vectorized version has better ALU packing rate and lower ALU stall by LDS rate, but it has more instructions.
I made a new version which is default to use local memory. This affects only scalar kernels. To disable using local memory for arrays, use -nolocal option.
This version also contains a small optimization which makes it a little bit faster.
Source code & Executable (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nqueen_cl_src.zip)
Executable only (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nqueen_cl.zip)
trinibwoy
21-Aug-2010, 15:33
>nqueen_cl 17
Platform [0]: NVIDIA CUDA
Select platform 0
Using GPU device
Using 5376 threads
Block size = 256 threads
Using global atomics
17-queen has 95815104 solutions (11977939 unique)
Time used: 2.78
GTX 460 405/1800 (3D clocks aren't getting triggered).
I made a new version which supports multiple devices. This is sort of preliminary because I haven't done much testing yet. Although it's probably best to support multiple devices using multiple threads, I decided to use single thread instead because it's easier. However, the downside is that OpenCL 1.0 lacks some important functions for supporting multiple devices from a single host thread.
Anyway, this is the test results on two GeForce GTX 460 (clocked @ 800MHz, which is nearly 20% faster than normal 460):
Two devices:
N-Queen solver for OpenCL
Ping-Che Chen
Platform [0]: NVIDIA CUDA
Select platform 0
Using GPU device
Device 0: GeForce GTX 460
Using 5376 threads
Block size = 256 threads
Using global atomics
Device 1: GeForce GTX 460
Using 5376 threads
Block size = 256 threads
Using global atomics
Profile time for device 0: 5417762176ns
Profile time for device 1: 4899273472ns
18-queen has 666090624 solutions (83263591 unique)
Time used: 7.1s
One device:
N-Queen solver for OpenCL
Ping-Che Chen
Platform [0]: NVIDIA CUDA
Select platform 0
Using GPU device
Device 0: GeForce GTX 460
Using 5376 threads
Block size = 256 threads
Using global atomics
Profile time for device 0: 10313844128ns
18-queen has 666090624 solutions (83263591 unique)
Time used: 11.3s
As you can see, although the time spent on the device is roughly the same (5.4+4.9 = 10.3s), but the wall clock time is not that good. Basically, on single device run, the "extra CPU time" is around 11.3 - 10.3 = 1.0s. However, on two devices run, the "extra CPU time" is around 7.1 - 5.4 = 1.7s. It varies a little from run to run but it's almost always at least 50% more.
Hi, apologies if I missed some post. It seems that you launch fewer threads in each CTA when you use the vectorized version. Hope I'm with you so far. If this is the case, may the slower result not simply be due to the hardware not having enough independent wavefronts to cover for arithmetic latency?
I'm not too familiar with the details of the AMD GPU, but by and large it seems pretty similar to the NVIDIA GPUs. For NVIDIA this could certainly be a problem, IIRC for the 200 series you needed at least 2 blocks at 2 warps each to keep a GPU core busy...
Anyhow, just my 14.5öre.
Hi, apologies if I missed some post. It seems that you launch fewer threads in each CTA when you use the vectorized version. Hope I'm with you so far. If this is the case, may the slower result not simply be due to the hardware not having enough independent wavefronts to cover for arithmetic latency?
I'm not too familiar with the details of the AMD GPU, but by and large it seems pretty similar to the NVIDIA GPUs. For NVIDIA this could certainly be a problem, IIRC for the 200 series you needed at least 2 blocks at 2 warps each to keep a GPU core busy...
Anyhow, just my 14.5öre.
Yes, you're correct. It's to make sure that the size requirement of the local memory to be the same for both scalar version and the vectored version. The 2D vectorized version has 128 work items per work group so it should be enough to hide ALU latency on AMD's GPU.
On NVIDIA's GPU the number of work items per work group is 256 (which is unfortunately the maximum number currently supported by this program). However, since the local memory usage is low enough (less than 16KB) so on 200 series GPU it can run two work groups on a single MP (the occupancy is 0.5 in this case). So it should be enough to competely hide the ALU latency.
I think the problem of the vectorized version could be related to the fact that it has to process two (or four, in the case of 4D version) items in sync. Compared to the scalar version, it's twice as much (or four times as much). Since the execution time required for each item can be very different, the efficiency can be much worse. Especially when considering that AMD's GPU need to run 64 work items in sync (compared to NVIDIA's 32 work items).
I'm still planning to do a "work item imbalance" analysis to verify this theory though :)
Some 6870 results using the latest version of nqueen:
nqueen_cl.exe 17
Platform [0]: ATI Stream
Select platform 0
Using GPU device
Device 0: Buzzard
Using 3584 threads
Block size = 128 threads
Using global atomics
Using vectorization
Use 2D vectors
17-queen has 95815104 solutions (11977939 unique)
Time used: 3.71s
nqueen_cl.exe -novec 17
Platform [0]: ATI Stream
Select platform 0
Using GPU device
Device 0: Buzzard
Using 3584 threads
Block size = 256 threads
Using global atomics
17-queen has 95815104 solutions (11977939 unique)
Time used: 3.27s
Lol @ Buzzard?
Results for the previous version of nqueen:
nqueen_cl.exe 17
Using GPU device
Using 14336 threads
Block size = 256 threads
Using vectorization
Use 2D vectors
17-queen has 95815104 solutions (11977939 unique)
Time used: 6.18s
and
nqueen_cl.exe -novec 17
Using GPU device
Using 3584 threads
Block size = 256 threads
Using global atomics
17-queen has 10919088 solutions (1364886 unique)
Time used: 0.501s
Hey, Buzzards are nice birds :)
Anyway, your test results seem to be inline with 5850. Now scalar version is faster than vectorized version, although there are probably still some tricks can be used to improve vectorized version.
I made a new version which supports multiple devices.
Did you ever release this version?
Where can I get the latest?
Have you thought of putting the source on sourceforge?
Did you ever release this version?
Where can I get the latest?
Have you thought of putting the source on sourceforge?
This (http://forum.beyond3d.com/showpost.php?p=1463854&postcount=131) should always point to the latest released version.
Since this is just a simple experiments of some ideas, I didn't think about putting this on sourceforge or other places, although the source code is always available.
[EDIT] By the way, I tested it with the new ATI Stream SDK 2.3 on my Radeon 5850.
17 queen with 2d vec: 2.72s
17 queen with scalar: 2.76s
These are with global atomics. It's much slower without global atomics.
Lightman
20-Dec-2010, 18:21
HD6970 stock, SDK 2.3
17 queen with 2d vec: 1.8s
17 queen with scalar: hang :sad:
17 queen with nolocal: 1.33s [Edit: bugged score]
HD6970 stock, SDK 2.3
17 queen with 2d vec: 1.8s
17 queen with scalar: hang :sad:
17 queen with nolocal: 1.33s
nolocal is faster? Maybe the L1 cache of Cayman is very good :)
It looks like that I need to get hold of one of these beasts... I am still stuck with that 5850, unfortunately.
Lightman
20-Dec-2010, 18:59
nolocal is faster? Maybe the L1 cache of Cayman is very good :)
It looks like that I need to get hold of one of these beasts... I am still stuck with that 5850, unfortunately.
Tested again to verify and the result is bugged. Gives different amounts of solutions each time I run it.
Vec2 score is OK though - 95815104 solutions
Tested again to verify and the result is bugged. Gives different amounts of solutions each time I run it.
Vec2 score is OK though - 95815104 solutions
Ok. I think maybe a solution check should be in place now to avoid confusion. I'll add it later when I have some time :)
prunedtree
23-Dec-2010, 17:02
One might wonder if there's some way to solve this problem approximately (and thus much faster).
I found one interesting paper on the topic: `Counting Solutions for the N-queens and Latin Square Problems by Efficient Monte Carlo Simulations' by Zhang and Ma.
While the paper is quite cryptic on the details, it turns out the basic idea is pretty simple (consider the problem as a thermodynamic simulation).
I implemented something similar, using the Wang-Landau algorithm to compute the density of states of the boards (define energy by the number of conflicts). The result is then found by looking at the zero-energy state. This method is essentially a markov chain monte carlo algorithm, and thus variance depends on the amount of sampling.
Here's one run of this implementation:
(the same parameters are used for all n, set to achieve ~5% relative error)
[ 0] Q_1 ~ 1.0
[ 1] Q_2 ~ 0.0
[ 1] Q_3 ~ 0.0
[ 2] Q_4 ~ 2.0
[ 2] Q_5 ~ 10.1
[ 4] Q_6 ~ 4.0
[ 8] Q_7 ~ 41.0
[ 9] Q_8 ~ 91.1
[ 14] Q_9 ~ 358.9
[ 85] Q_10 ~ 707.4
[ 40] Q_11 ~ 2746.0
[ 76] Q_12 ~ 14239.4
[ 43] Q_13 ~ 74638.6
[ 187] Q_14 ~ 365537.6
[ 458] Q_15 ~ 2.284e+006
[ 248] Q_16 ~ 1.471e+007
[ 165] Q_17 ~ 9.684e+007
[ 285] Q_18 ~ 6.571e+008
[ 236] Q_19 ~ 4.908e+009
[ 971] Q_20 ~ 3.897e+010
[ 421] Q_21 ~ 3.034e+011
[1736] Q_22 ~ 2.675e+012
[2389] Q_23 ~ 2.412e+013
[2322] Q_24 ~ 2.253e+014
[4633] Q_25 ~ 2.212e+015
[2173] Q_26 ~ 2.194e+016
[ 974] Q_27 ~ 2.394e+017
[3863] Q_28 ~ 2.521e+018
[1730] Q_29 ~ 2.935e+019
[7324] Q_30 ~ 3.390e+020
[3289] Q_31 ~ 4.033e+021
[5312] Q_32 ~ 4.985e+022
[8782] Q_33 ~ 6.530e+023
The first number between brackets is the number of sweeps (in millions) to convergence. The pace is about 10 million sweeps per second. To compare, here is a list of the known values: http://oeis.org/A000170/list
Obviously it's quite inefficient for small sizes (it takes about 16 seconds to get an approximate answer for N=17, and the solver presented in this thread can find the exact result significantly faster). However it scales extremely well for higher dimensions, as long as one is satisfied with approximate solutions.
Can current GPUs run this `more clever' approach as efficiently as they can do the bruteforce search ? Given that most of the work is essentially sampling (the monte carlo method is a form of `brute force' approach as well, after all), it might not be that bad.
This is interesting. :) I think GPU is good for doing Monte Carlo style random algorithms, but a good source for random numbers may take some effort to do efficiently.
Triskaine
12-Sep-2011, 16:14
Unfortunately, for some reason, AMD's OpenCL compiler crashed when compiling my kernel for GPU (it's ok for CPU version though). So right now it doesn't work on AMD's GPU at all, but with AMD Stream SDK 2.0 it's possible to run on CPU devices.
Why are you using such an old version of the SDK? The most up to date one is 2.4 .
God that was stupid of me.
OpenGL guy
12-Sep-2011, 18:56
Why are you using such an old version of the SDK? The most up to date one is 2.4 .
That was a bot. Notice how they took the text from pcchen's original posting.
vBulletin® v3.8.6, Copyright ©2000-2013, Jelsoft Enterprises Ltd.