If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.
![]() |
|
|
#1 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,358
|
I rewrote my old n-Queen solver from CUDA into OpenCL. It's not a port, because I rewrite everything from scratch (only the idea remains the same). I was hoping to use OpenCL for comparison between AMD and NVIDIA hardwares.
Unfortunately, for some reason, AMD's OpenCL compiler crashed when compiling my kernel for GPU (it's ok for CPU version though). So right now it doesn't work on AMD's GPU at all, but with AMD Stream SDK 2.0 it's possible to run on CPU devices. Both source and executable are provided in the attachment. The arguments are: nqueen_cl -cpu -clcpu -p -thread # -platform # -local N -cpu: use CPU implementation (single threaded) -clcpu: use OpenCL CPU devices -p: enable profiling -thread #: set the number of threads. (default: max work group item number * number of devices) -platform #: select a platform (default: select platform 0) -local: use local memory (shared memory) for arrays. This runs faster on NVIDIA's GPU because NVIDIA GPU have no indexed registers. -noatomics: do not use global atomics (this is needed for now to avoid crash on AMD Stream SDK 2.0.1 when running scalar code) -novec: do not use vectorization N: the board size (1 ~ 32) Note: large board size (> 20) takes forever to run (I estimate that 20 queen would take more than 2 hours to run on 9800GT). The N-queen algorithm is a straightforward one. There is no special considerations to avoid redundant boards (i.e. most boards can be mirrored and rotated to create 8 solutions). Only a simple mirror reduction is used. Running on my computer (Core 2 Duo E8400 3.0GHz): 16 queen -cpu: 7.27 s 16 queen -clcpu: 3.79 s On GPU (GeForce 9800GT): 16 queen: 2.36 s 16 queen -local: 1.23 s About the crash problem: the original kernel I've developed (i.e. the kernel used by clcpu path) doesn't crash the compiler. However, it's extremely slow (> 20 s for 16 queen on my 4850, and > 7s on 9800GT) because it uses a four arrays to simulate a stack for recursion. These arrays are good for CPU version because they reduce the amount of computation, but for GPU they are too hard on registers. So I developed another kernel which uses only one array, but it requires more computation to generate/restore data for each steps. However, this new kernel crashes the compiler (it works when selecting CPU devices, but it's slower). I'm using Cat 9.12 hotfix right now. [EDIT] I updated the CPU program with reduced symmetry which runs faster. [EDIT2] OpenCL kernels are also with reduced symmetry now. [EDIT3] Vectorized version for RV870 is added. (4/10/2010) Last edited by pcchen; 11-Jan-2010 at 15:35. Reason: Update programs |
|
|
|
|
|
#2 |
|
Senior Member
|
Intel Q9450 @ 3608MHz:
Code:
nqueen_cl.exe -clcpu 16 N-Queen solver for OpenCL Ping-Che Chen Platform [0]: ATI Stream Select platform 0 Using CPU device Using 4096 threads 16-queen has 14772512 solutions Time used: 1.64s
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
#3 |
|
Darlek ******
Join Date: Jun 2004
Posts: 9,663
|
why does the cpu version use only one thread ?
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™ |
|
|
|
|
|
#4 |
|
A little of this and that
Join Date: Oct 2005
Location: Cupertino
Posts: 342
|
We'll take a look at the compiler issues.
|
|
|
|
|
|
#5 | |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,358
|
I had a multi-threaded version in my old CUDA code, but that's not portable (it uses Win32 API for managing threads). Also, I think with CPU support OpenCL can reduce or remove the requirement of writing own multi-threaded codes for CPU
Quote:
|
|
|
|
|
|
|
#6 |
|
Meh
Join Date: Mar 2004
Location: New York
Posts: 9,809
|
Cool, did you compare the CUDA and OpenCL versions or has the algorithm changed so much that it's not apples to apples?
__________________
What the deuce!? |
|
|
|
|
|
#7 | |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,358
|
Quote:
Note that the major benefit of using more threads is hiding latency. Since this program does not access global memory frequently, it doesn't have to hide a massive amount of latency. Basically it only has to have enough threads to hide ALU latency, which IIRC on NVIDIA's GPU is 192 threads per work group. By using less arrays to enable using more threads, additional computation is required, so if the benefit of hiding ALU latency does not out weight the additional computations, it's not going to be faster. The performance of this program is probably limited primarily by warp serialization due to branch divergence. This is an example of "not-so-suitable-to-GPU" type parallelizable workload |
|
|
|
|
|
|
#8 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,358
|
I checked my old CUDA version. It uses only one array for stack, but the stored data is different (although there shouldn't be too much difference in computation). However, the old CUDA version also stores results in shared memory and performs reduction when the kernel completed, so it can only have 128 threads rather than 256 threads, therefore it's still slower. I guess it's a bad decision to do reduction in the GPU because CPU is basically sit idling when the GPU is doing all the works.
On my GTX 285, the CUDA version takes 0.686 second to run (16 queen), while the OpenCL version takes 0.643 second. Then I found that in my old CUDA version, the array indexing is reversed to avoid bank conflict, so I did the same, and it becomes a bit faster to 0.537 second. The OpenCL CPU version takes 1.76 second to run on my Core i7 920. The single threaded CPU version takes 7.58 seconds. The number of threads for the OpenCL CPU version is a bit too much and may reduce performance, so I reduced the amount of threads for the CPU version, but it's not really much faster (from 1.76 second to 1.68 second). I updated the source code and executables in the attachments. |
|
|
|
|
|
#9 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,358
|
I made a CPU version which tries to avoid redundant patterns. The original version only avoid horizontal mirroring, i.e.
Code:
- - - - x x - - - - - x - - - - - - x - - - - x - - x - - - x - - - - - - - - x - - x - - - - x - - However, there are actually 8 possible transformations, i.e. 4 rotations and mirroring of each rotations. So if it's possible to generate only "unique" solutions, then it's possible to increase speed by a factor of 4. Unfortunately, it's not that simple, because it's possible for some solutions to be symmetric. For example: Code:
- - x - x - - - - - - x - x - - Code:
- - - - - x - - - x - - - - x - - - - - - - - - x - - - - - - - - - x - - - - x - - - x - - - - - Code:
- - - - - - x - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Code:
- - - - - - x - - - - - - x - - - - x - - - - x - - - - - - x - - - - - - - - - x - x - - - - - - - x - - - - - - - - - - x - - - - - x - - - - - x - - - x - - - - - - - x - - - - - - - - x - - - Code:
- - - - - - x - - x - - - - - - - - - # - - - - - - # - - - - - - # - - - - - - - - - - - - - - - Code:
# # - - x # # # - - - - - # - - - - - - - - - - - - - - - - - - - - - # - - - - - # # # - - - # # Code:
- - - - - x - - - - - - - x - - - - x - - - - - - x - - - - - - x - - - - - - - - - - - - - x - - - - - - - x - x - - - - - - - - - - - x - - - - - - x - - - - - - - - - - - x - - - - - - - x - x - - - - - - - - - - x - - - - - - x - - - - - - x - - - - - By doing these "optimizations" the CPU implementation runs a bit faster. Computing 16-queen on my Core 2 3.0GHz now takes 4.64 seconds (instead of previous 7.27 seconds, that's 57% faster), and it computes both total number of solutions and unique solutions (16-queen has 1846955 unique solutions). |
|
|
|
|
|
#10 |
|
Tea maker
Join Date: Feb 2002
Location: In the Island of Sodor, where the steam trains lie
Posts: 4,396
|
Nice bit of analysis.
|
|
|
|
|
|
#11 |
|
Regular
Join Date: Aug 2009
Posts: 27
|
For fun, I tried to see what kind of `performance' vs `programming time' ratio one could typically achieve on this problem !
Reading various info on internet == time: 1h / performance: no code yet First attempt, single-threaded, (inspired from http://www.ic-net.or.jp/home/takaken/e/queen/index.html) == time: 1h 10 minutes / performance: 16 queens in 16 sec Removing basic symmetry == time: 1h 15 minutes / performance: 16 queens in 8 sec Multi-threading (very naive, just one thread for each of the first branches) == time: 1h 30 minutes / performance: 16 queens in 2 sec Playing around, found false sharing == time: 1h 45 minutes / performance: 16 queens in 1.75 sec Some ricing, one level unroll (28% faster) and tail function (7% faster) == time: 2h / performance: 16 queens in 1.27 sec The CPU used is a i7 920 (2.66 Ghz) It's interesting that the multi-threading speedup is superlinear (4.9x) thanks to SMT (hyperthreading). No attempt was made at more complex symmetry reduction, although if I understand right it should come nearly free in terms of the backtracking throughput. No attempt was made at turning the recursion into an iteration either (my 1-level unroll is a just copy/paste ^^) This naive version of 16 queens has a 570595151 nodes search tree, which it processes at a rate at 450 million nodes per second. |
|
|
|
|
|
#12 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,358
|
Ah, it's similar to the CPU version of my old CUDA code. The old CPU version is also only mirrored and multi-threaded using Win32 API.
I becomes interested in reducing more symmetry after thinking a lot about this. The ideas in my previous post is actually pretty similar to the page you linked. I think it's possible to make GPU run faster with reduced symmetry, but it seems that the more complex symmetry reduction may bring even less performance gain on GPU than on CPU, because GPU really don't like branchy codes The reason why it has superlinear sppedup with hyperthreading is probably also related to the branchy nature of this problem, as hyperthreading may help hiding some latency better. However, when I did the old CUDA version I was pretty impressed by the ability of GPU on running this kind of codes. I was assuming that GPU should be pretty bad at this kind of problems and will lag behind a similar level CPU, but I was wrong. Unfortunately, my P6T motherboard is dead, and it may take some time to fix it. So now I can't update programs on my GTX 285 for a while. I can use my home computer only on weekend. |
|
|
|
|
|
#13 |
|
MSI Man
|
nqueen -cpu 6.64s
nqueen -clcpu 3.85s gpu crashes even with Cat 10.1 beta
__________________
I miss you CJ, 1976 - 2010 |
|
|
|
|
|
#14 |
|
Member
Join Date: Jun 2008
Location: Torquay, UK
Posts: 944
|
My turn:
-cpu 16 = 3.75s -clcpu 16 = 1.51s crash using GPU Phenom II 940@3455MHz |
|
|
|
|
|
#15 |
|
A little of this and that
Join Date: Oct 2005
Location: Cupertino
Posts: 342
|
Yep, we have confirmed that this code causes LLVM to generate irreducible control flow which makes our compiler unhappy. We have an internal bugtrack for this already.
|
|
|
|
|
|
#16 |
|
Senior Member
|
Q9450 @ 3608MHz
-cpu 16: 3.88s -clcpu 16: 1.81s I can't believe PII is faster here...
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
#17 | |
|
Member
Join Date: Jun 2008
Location: Torquay, UK
Posts: 944
|
Quote:
Or larger L1 cache? I was pleasantly surprised as well @mhouston: Good to know this will be sorted! |
|
|
|
|
|
|
#18 |
|
Member
|
Also had a classifier implemented on CPUs and it ran on my X2 3800+ (@2400) as fast as it did on a Q6600 stock (single threaded ofcorse, with main CPU-time-user loop unrolled). It mostly multiplied floats and every now and then it extracted square roots.
LE: and the number of unrolls was tested on both CPUs so it didn't favor a specific architecture. Last edited by entity279; 18-Jan-2010 at 13:26. Reason: Added details |
|
|
|
|
|
#19 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,358
|
I added the symmetry reduction algorithms to the OpenCL version, so it computes unique solutions too. The CPU version is also optimized a little bit.
On my Core i7 920 + GeForce GTX 285: -cpu 16: 4.3s -clcpu 16: 1.38s 16 (gpu): 1.09s -local 16 (gpu): 0.492s (0.542s in old version) -cpu 17: 29.6s -clcpu 17: 8.96s 17 (gpu): 4.05s -local 17 (gpu): 2.51s (4.36s in old version) -cpu 18: 216s -clcpu 18: 60.1s 18 (gpu): 32.8s -local 18 (gpu): 24.5s (42.3s in old version) Unfortunately, this program still can't run on ATI's GPU. Another interesting thing is, it seems that the GPU versions scale worse than CPU versions. So in theory for some very big n CPU could be faster than GPU. |
|
|
|
|
|
#20 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,358
|
I did some rather trivial optimizations for GPU to make it a little faster, mainly the symmetry check part.
Running on GeForce GTX 285: 16 (gpu): 1.09s -local 16 (gpu): 0.5s 17 (gpu): 4.01s -local 17 (gpu): 2.55s 18 (gpu): 28.3s -local 18 (gpu): 19.2s |
|
|
|
|
|
#21 |
|
Senior Member
|
Just tested this app with a beta of the next Stream SDK release and it works. Performance could be improved if you can find a way to vectorize your algorithm
__________________
I speak only for myself. |
|
|
|
|
|
#22 | |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,358
|
Quote:
|
|
|
|
|
|
|
#23 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,358
|
I found a bug which may prevent it to run with -local on G8X/G9X GPUs. A workaround is to specify thread numbers to multiple of 192 (such as 5376). I'll post updates later.
|
|
|
|
|
|
#24 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,358
|
This works with the latest AMD Stream SDK 2.0.1 now. On my Radeon 4850 it takes 4.61 seconds to run 16 queen, and 3.79 seconds to run -local version. This is interesting because AFAIK Radeon 4850 does not support real local memory in OpenCL. However, it's possible that using local memory forces the compiler to use memory instead of registers so it reduces pressure on registers.
|
|
|
|
|
|
#25 |
|
A little of this and that
Join Date: Oct 2005
Location: Cupertino
Posts: 342
|
On windows you can use the profiler to dump the generated ISA and you can look at the differences.
|
|
|
|
![]() |
| Tags |
| nqueen, opencl |
| Thread Tools | |
| Display Modes | |
|
|