OpenCL Mandelbrot generator
I've written an OpenCL Mandelbrot set generator which writes the resulting image to a bmp file. This program uses an algorithm similar to Voxilla's DirectCompute Mandelbrot viewer (thread here (http://forum.beyond3d.com/showthread.php?t=55330)). My program adds 16X AA to make the final image smoother, so it's 16X slower.
Using the region (-0.5, 0.5) ~ (0, 1) with a 2048x2048 image and 1024 iterations, I ran some benchmarks:
Radeon 4850 scalar: 1.194s
Radeon 4850 vector: 0.986s
Core 2 3.0GHz scalar: 56.49s
Core 2 3.0GHz vector: 19.62s
GeForce 9800GT scalar: 1.607s
(Run the program with arguments: mandelbrot -x1 -0.5 -y1 0.5 -x2 0 -y2 1.0)
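To put timings like these on a common footing, the arithmetic can be sketched as below. This is only an illustration: `msamples_per_second` is a hypothetical helper, not part of the attached source, and it assumes every pixel costs exactly 16 AA samples.

```cpp
#include <cassert>

// Hypothetical helper (not from the attached source): millions of AA samples
// produced per second for a width x height image rendered in `seconds`.
double msamples_per_second(int width, int height, int samples_per_pixel,
                           double seconds) {
    assert(width > 0 && height > 0 && samples_per_pixel > 0 && seconds > 0.0);
    const double samples =
        static_cast<double>(width) * height * samples_per_pixel;
    return samples / seconds / 1.0e6;
}
```

For the Radeon 4850 vector run (2048x2048, 16 samples, 0.986s) this works out to roughly 68 million samples per second, and about 56 million for the scalar run at 1.194s.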
Since this program does not use shared memory at all, it runs pretty well on Radeon 4850.
I put the program executable and source code in the attachments for those interested. The image generation is written as a class so it should be easy to integrate into an interactive viewer.
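For readers who don't want to open the attachment, here is a minimal CPU-side sketch of what "16X AA" means for the amount of work. It is not the code from the attachment: the function names, the 4x4 sub-sample layout and the grayscale shading are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Escape-time iteration for one sample point c = (cx, cy).
// Returns the step at which |z| exceeds 2, or max_steps if it never does.
int mandel_steps(double cx, double cy, int max_steps) {
    double zx = 0.0, zy = 0.0;
    for (int i = 0; i < max_steps; ++i) {
        const double zx2 = zx * zx, zy2 = zy * zy;
        if (zx2 + zy2 > 4.0) return i;
        zy = 2.0 * zx * zy + cy;
        zx = zx2 - zy2 + cx;
    }
    return max_steps;
}

// Hypothetical grayscale shade: points that never escape are black.
int shade(int steps, int max_steps) {
    return steps >= max_steps ? 0 : 255 - (steps * 255) / max_steps;
}

// Render the region (x1, y1)-(x2, y2) into a width x height image.
// Each pixel averages a 4x4 grid of sub-samples (16 samples per pixel),
// so the work per pixel is 16 times that of a single-sample render.
std::vector<std::uint8_t> render_aa16(int width, int height,
                                      double x1, double y1,
                                      double x2, double y2, int max_steps) {
    assert(width > 0 && height > 0 && max_steps > 0);
    std::vector<std::uint8_t> image(static_cast<std::size_t>(width) * height);
    const double dx = (x2 - x1) / width;
    const double dy = (y2 - y1) / height;
    for (int py = 0; py < height; ++py) {
        for (int px = 0; px < width; ++px) {
            int sum = 0;
            for (int sy = 0; sy < 4; ++sy) {
                for (int sx = 0; sx < 4; ++sx) {
                    // Sub-sample centres sit at 1/8, 3/8, 5/8, 7/8 of a pixel.
                    const double cx = x1 + (px + (sx + 0.5) / 4.0) * dx;
                    const double cy = y1 + (py + (sy + 0.5) / 4.0) * dy;
                    sum += shade(mandel_steps(cx, cy, max_steps), max_steps);
                }
            }
            // Only one write per pixel, after all 16 samples are averaged.
            image[static_cast<std::size_t>(py) * width + px] =
                static_cast<std::uint8_t>(sum / 16);
        }
    }
    return image;
}
```

Calling `render_aa16(2048, 2048, -0.5, 0.5, 0.0, 1.0, 1024)` would cover the benchmark region, though on a single CPU thread it will take a while.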
Program arguments:
-w width: set width of the final image (default 2048)
-h height: set height of the final image (default 2048)
-x1 -y1 -x2 -y2: set the two opposite corners of the region (default [-2.5, -1.5] - [0.5, 1.5])
-o output.bmp: set output file name
-cpu: use non-optimized CPU based code
-gpu: use GPU with OpenCL
-clcpu: use CPU with OpenCL (works only with Mac or AMD Stream SDK 2.0)
-platform index: select platform (default 0)
-p: enable profiler
-usetable: force using palette table (default: CPU uses table, GPU doesn't)
Note: currently there is no way to select scalar or vector shaders manually. The program selects the shader based on a device query (the preferred vector width for float).
EDIT: program updated
rpg.314
20-Dec-2009, 17:38
Which OpenCL implementation is/are this/these?
Radeon 4850 and CPU were run with AMD Stream SDK 2.0 (driver version is 9.12 hotfix).
GeForce 9800GT was run with NVIDIA's built-in OpenCL implementation.
CNCAddict
20-Dec-2009, 17:56
Takes 0.470s on the 5850 :smile:
Lightman
20-Dec-2009, 18:18
Takes 0.470s on the 5850 :smile:
Takes 0.412s on HD5870 @1GHz/1250 :grin:
But this app. is very sensitive to GPU mem. clock. Just lowering it to 1150MHz resulted in 0.433s compute time.
Kudos to you pcchen for writing another excellent benchmark/util. for us :smile:!
Arnold Beckenbauer
20-Dec-2009, 18:21
HD4850 (700/1000):
C:\Users\Denis\Downloads\mandelbrot>mandelbrot -x1 -0.5 -y1 0.5 -x2 0 -y2 1.0
Width: 2048 Height: 2048
(-0.5, 0.5) - (0, 1)
Using GPU OpenCL implementation
Output file: output.bmp
Time used: 940
Core 2 Q6600 (2.4 GHz & DDR2 667):
C:\Users\Denis\Downloads\mandelbrot>mandelbrot -x1 -0.5 -y1 0.5 -x2 0 -y2 1.0 -clcpu
Width: 2048 Height: 2048
(-0.5, 0.5) - (0, 1)
Using CPU OpenCL implementation
Output file: output.bmp
Time used: 13026
Thanks guys for your tests :)
Takes 0.412s on HD5870 @1GHz/1250
But this app. is very sensitive to GPU mem. clock. Just lowering it to 1150MHz resulted in 0.433s compute time.
This is interesting, as the shader does not spend much time reading or writing memory. All reads are from a constant palette (about 4KB), which should be cached. Writes go only to the final image, and that's one per pixel. It'd be interesting to find out why this happens.
When I go back to my apartment I'll be able to test on my GeForce GTX 285 and on my Mac mini.
It's also amazing to see an OpenCL kernel run so well on a CPU. The un-optimized CPU version takes about 100 seconds to run, and the OpenCL implementation takes only 20 seconds on the same CPU! Arnold's result also shows that the OpenCL implementation takes advantage of all four cores, so it runs faster on a 4-core 2.4GHz CPU than on a 2-core 3.0GHz CPU. I think that's just amazing.
Radeon HD 5870 @ 900/5000MHz:
mandelbrot.exe -x1 -0.5 -y1 0.5 -x2 0 -y2 1.0
Width: 2048 Height: 2048
(-0.5, 0.5) - (0, 1)
Using GPU OpenCL implementation
Output file: output.bmp
For test only: Expires on Sun Feb 28 00:00:00 2010
Time used: 410
Core 2 Quad Q9450 @ 3608MHz:
mandelbrot.exe -x1 -0.5 -y1 0.5 -x2 0 -y2 1.0 -clcpu
Width: 2048 Height: 2048
(-0.5, 0.5) - (0, 1)
Using CPU OpenCL implementation
Output file: output.bmp
For test only: Expires on Sun Feb 28 00:00:00 2010
Time used: 8252
CNCAddict
20-Dec-2009, 19:53
OK, my clocks seem to have been screwed up in my last test with the 5850. I've now tested at
GPU = 924MHz
MEM = 1251MHz
Time = 0.280s
Wicked fast in other words :)
Edit :: Hmm...is there some way my numbers could be bogus?? I tested at the default program settings but it appears faster than Fellix 5870 overclocked about the same as my card...
mandelbrot.exe -x1 -0.5 -y1 0.5 -x2 0 -y2 1.0
It is somewhat memory dependent.
5770@960/1200: 730
5770@960/1445: 680
and to compare with felix:
5770@900/1250: 730
So the double spec 5870 is only 78% faster
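The percentages here can be checked with a couple of one-liners. This is just a sketch of the arithmetic; both helper names are made up for illustration, and "faster" is taken to mean the ratio of run times minus one.

```cpp
#include <cassert>

// Percentage by which a run taking fast_ms is faster than one taking slow_ms,
// measured as a throughput ratio: (slow / fast - 1) * 100.
double percent_faster(double slow_ms, double fast_ms) {
    assert(slow_ms > 0.0 && fast_ms > 0.0);
    return (slow_ms / fast_ms - 1.0) * 100.0;
}

// Percentage change between two clock speeds.
double percent_increase(double from_mhz, double to_mhz) {
    assert(from_mhz > 0.0);
    return (to_mhz / from_mhz - 1.0) * 100.0;
}
```

`percent_faster(730, 410)` comes to about 78, matching the 5770 vs 5870 figure at 900/1250. For the memory overclock, `percent_increase(1200, 1445)` is about 20 while `percent_faster(730, 680)` is only about 7, which fits "somewhat memory dependent" rather than memory bound.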
And seems like the C2Q is somewhat better than the PII here even though it's an AMD implementation:
PII 965@2.4: 14580
PII 965@3.6: 9632
Silent_Buddha
21-Dec-2009, 02:53
It's also amazing to see an OpenCL kernel run so well on a CPU. The un-optimized CPU version takes about 100 seconds to run, and the OpenCL implementation takes only 20 seconds on the same CPU! Arnold's result also shows that the OpenCL implementation takes advantage of all four cores, so it runs faster on a 4-core 2.4GHz CPU than on a 2-core 3.0GHz CPU. I think that's just amazing.
So any code written in OCL to use the CPU will automatically attempt to use all available cores without the code having to be specifically written for multiple cores?
That could be a nice boost and spur adoption of multicore (4+ cores) processors...
Regards,
SB
Probably the AA sampling is the pressing factor for the memory limitation?
Anyone with Core i7 to do a bench here? HT on and off would be curious. ;)
Edit :: Hmm...is there some way my numbers could be bogus?? I tested at the default program settings but it appears faster than Fellix 5870 overclocked about the same as my card...
Note that run times for different regions ((x1, y1) ~ (x2, y2)) are not comparable, as some pixels may "jump out" pretty early while others need to run through all the iterations (i.e. the black pixels).
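A rough way to see how much the region matters is to count escape-time steps over a coarse grid. The sketch below is a stand-alone illustration, not the program's kernel; `escape_steps` and `average_steps` are hypothetical names, and the step count is only a proxy for actual run time.

```cpp
#include <cassert>

// Escape-time steps for the point c = (cx, cy).
// Returns max_steps for points that never escape (the black pixels).
int escape_steps(double cx, double cy, int max_steps) {
    double zx = 0.0, zy = 0.0;
    for (int i = 0; i < max_steps; ++i) {
        const double zx2 = zx * zx, zy2 = zy * zy;
        if (zx2 + zy2 > 4.0) return i;
        zy = 2.0 * zx * zy + cy;
        zx = zx2 - zy2 + cx;
    }
    return max_steps;
}

// Average steps per sample over a grid x grid sampling of the region
// (x1, y1)-(x2, y2), taking one sample at the centre of each cell.
double average_steps(double x1, double y1, double x2, double y2,
                     int grid, int max_steps) {
    assert(grid > 0 && max_steps > 0);
    long long total = 0;
    for (int gy = 0; gy < grid; ++gy) {
        for (int gx = 0; gx < grid; ++gx) {
            const double cx = x1 + (gx + 0.5) * (x2 - x1) / grid;
            const double cy = y1 + (gy + 0.5) * (y2 - y1) / grid;
            total += escape_steps(cx, cy, max_steps);
        }
    }
    return static_cast<double>(total) / (static_cast<double>(grid) * grid);
}
```

Comparing `average_steps(-2.5, -1.5, 0.5, 1.5, 64, 256)` with `average_steps(-0.5, 0.5, 0.0, 1.0, 64, 256)` shows the zoomed region needing more steps per sample on average, since a larger share of it lies inside or close to the set.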
As for the memory dependence, the only thing I can think of is reading the palette, which is currently read from a constant table. In theory, a 4KB table should fit into the constant cache, but since AMD's OpenCL implementation has some bugs with directly declared constants, I put the table into a memory object, which may prevent it from being cached. However, replacing the palette lookup with computation is not necessarily faster (it depends on how the computation is done). I'll look into this later.
So any code written in OCL to use the CPU will automatically attempt to use all available cores without the code having to be specifically written for multiple cores?
To my understanding, both AMD's and Apple's CPU implementations do this. I don't know how they are actually implemented (do they create many threads, or just enough threads to feed all cores?). They also both automatically use SSE/SSE2/SSE3 instructions for float4/double2, so it's best to run a vectorized kernel on the CPU. I'm not sure whether they use SSE/SSE2/SSE3 for integer vectors other than int4 (such as uchar16) though.
Probably the AA sampling is the pressing factor for the memory limitation?
Anyone with Core i7 to do a bench here? HT on and off would be curious.
The AA computation is internal; only one write per pixel is done. However, every sub-sample needs to read from the palette.
When I'm back at my apartment I'll try it on my Core i7.
On my system (Q9450), PerfMonitor indicates pretty much zero L2 cache misses. That would indicate the -clcpu code path is not limited by cache performance, at least on the Core 2 architecture.
I've written an OpenCL Mandelbrot set generator which writes the resulting image to a bmp file.
A few weeks ago I wrote an OpenCL Mandelbrot too (http://davibu.interfree.it/opencl/mandelgpu/mandelGPU.html). I wrote a vectorized kernel and it looks like the performance is comparable to yours: about 50M pixels/sec on an ATI 4870.
You may be interested in another OpenCL toy I wrote: http://davibu.interfree.it/opencl/juliagpu/juliaGPU.html It is a Julia Set ray tracer with support for supersampling, fake ambient occlusion, etc.
trinibwoy
21-Dec-2009, 12:54
GTX 285 @ 648/1476/1242: 544
GTX 285 @ 648/1476/1350: 544
GTX 285 @ 700/1585/1242: 507
GTX 285 @ 700/1585/1350: 506
Looks completely shader bound to me.
Arnold Beckenbauer
21-Dec-2009, 12:59
A few weeks ago I wrote an OpenCL Mandelbrot too (http://davibu.interfree.it/opencl/mandelgpu/mandelGPU.html). I wrote a vectorized kernel and it looks like the performance is comparable to yours: about 50M pixels/sec on an ATI 4870.
You may be interested in another OpenCL toy I wrote: http://davibu.interfree.it/opencl/juliagpu/juliaGPU.html It is a Julia Set ray tracer with support for supersampling, fake ambient occlusion, etc.
You are on FireUser:
http://fireuser.com/blog/opencl_path_tracer_ray_tracing_demo_using_the_amd_opencl_beta_sdk/
:smile:
trinibwoy
21-Dec-2009, 13:13
A few weeks ago I wrote an OpenCL Mandelbrot too (http://davibu.interfree.it/opencl/mandelgpu/mandelGPU.html). I wrote a vectorized kernel and it looks like the performance is comparable to yours: about 50M pixels/sec on an ATI 4870.
You may be interested in another OpenCL toy I wrote: http://davibu.interfree.it/opencl/juliagpu/juliaGPU.html It is a Julia Set ray tracer with support for supersampling, fake ambient occlusion, etc.
No dice :(
Usage: JuliaGPU.exe
Usage: JuliaGPU.exe <use CPU device (0 or 1)> <use GPU device (0 or 1)> <kernel file name> <window width> <window height>
OpenCL Device 0: Type = TYPE_GPU
OpenCL Device 0: Name = GeForce GTX 285
OpenCL Device 0: Compute units = 30
OpenCL Device 0: Max. work group size = 512
Reading file 'rendering_kernel.cl' (size 10690 bytes)
Failed to build OpenCL kernel: 1
OpenCL Programm Build Log: :196: error: no matching overload found for arguments of type 'float, int const'
specularity * pow(dot(N, H), specularExponent) +
^~~
NVIDIA's OpenCL C compiler is a bit odd sometimes. Try editing the rendering_kernel.cl file and changing the following line from:
const int specularExponent = 30;
to:
const float specularExponent = 30.f;
or just re-download it from http://davibu.interfree.it/opencl/juliagpu/juliaGPU.html (it is updated with the above patch).
I updated the program a bit to make it compatible with Apple's OpenCL implementation (it doesn't like mixing int with float). I also modified the kernel to compute the color palette instead of reading it from the table. It does not make much difference on NVIDIA's hardware, but maybe it will on AMD's hardware; I don't have access to one right now, so I can't test that. For CPU devices the table is still used, as it's faster to use the table on a CPU.
AMD released a new OpenCL SDK with OpenCL ICD support. Unfortunately I still can't get it to see both platforms. Anyway, I added platform selection support into the program. Also, I added an option to enable OpenCL's internal profiler for more accurate timing.
New options:
-platform index: select a particular platform (default 0)
-p: enable profiler
-usetable: force using palette table (default: CPU uses table, GPU doesn't)
On my GTX 285, using table is a bit slower (493ms vs 477ms). However, on my Mac mini it died when not using table, probably because it took too much time. When using table it takes about 13.5 seconds to run (GeForce 9400M). CPU takes about 48.4 seconds (Core 2 Duo 2.0GHz)
The new Stream SDK 2.0 wrecks the OCL compatibility for me.
When running the GPU code path, the BMP image output is entirely black with abnormally short processing times.
Arnold Beckenbauer
21-Dec-2009, 18:03
C:\Users\Denis\Downloads\mandelbrot(2)>mandelbrot -x1 -0.5 -y1 0.5 -x2 0 -y2 1.0
Width: 2048 Height: 2048
(-0.5, 0.5) - (0, 1)
Using GPU OpenCL implementation
Output file: output.bmp
Platform [0]: ATI Stream
Select platform 0
Time used: 172
Is it okay?
Is there a proper image in the output.bmp? The short run time seems to be suspicious...
Does it work with -clcpu option? Or is it black too?
The output image with the -x1 -0.5 -y1 0.5 -x2 0 -y2 1.0 options should look like the attachment.
mhouston
21-Dec-2009, 18:21
The release SDK uses the ICD model, which means you need to use the platform calls.
http://developer.amd.com/support/KnowledgeBase/Lists/KnowledgeBase/DispForm.aspx?ID=71
That details the changes, which should also make the older AMD and Nvidia drivers work okay.
I made changes to the program to use platforms, but clGetPlatformIDs still returns only one platform ("NVIDIA CUDA"). I checked the registry and all three DLLs seem to be registered (atiocl.dll, atiocl64.dll, and nvcuda.dll). I was hoping to be able to run CPU OpenCL with AMD's SDK, as I don't have an AMD GPU installed on this system, but it doesn't work for now.
By the way, I couldn't run the installer directly, so I ran the three installers in the Packages directory (ATIStreamSDK_Dev_win764a, ATIStreamSDK_Profiler, and ATIStreamSDK_Samples_win764a) manually. Maybe I missed something here?
Arnold Beckenbauer
21-Dec-2009, 18:58
Is there a proper image in the output.bmp? The short run time seems to be suspicious...
Does it work with -clcpu option? Or is it black too?
The output image with the -x1 -0.5 -y1 0.5 -x2 0 -y2 1.0 options should look like the attachment.
It's black and -clcpu has been ignored.
I forgot to mention: The new ATi Stream SDK 2.0 is installed.
mhouston
21-Dec-2009, 19:00
Which opencl.dll are you getting? Something could be borked with the Nvidia OpenCL.dll, but to be fair there have been ICD code changes in Khronos lately so either company could have missed something.
You shouldn't need to have an AMD GPU to get the AMD platform picked up.
I got the installer running, but it installed only the Stream Profiler. I can still install the OpenCL SDK manually.
By copying the opencl.dll from ATI Stream SDK to the executable file's directory, it's now possible to pick up the ATI Stream platform, but there is still only one platform, as NVIDIA CUDA platform is now gone. It's probably something with NVIDIA's driver though (I'm using a mid-November version). But at least now I can run with CPU OpenCL :)
Using CPU OpenCL, the result is correct (both using table and not using table). On my Core i7 920, it takes 6.17 seconds to run the (-0.5, 0.5) - (0, 1) region.
I also fixed a bug in the table code (not performance related). The program is updated.
Arnold Beckenbauer
21-Dec-2009, 19:58
Looks good with -clcpu.
C:\Users\Denis\Downloads\mandelbrot(3)>mandelbrot -clcpu -x1 -0.5 -y1 0.5 -x2 0 -y2 1.0
Width: 2048 Height: 2048
(-0.5, 0.5) - (0, 1)
Using CPU OpenCL implementation
Output file: output.bmp
Platform [0]: ATI Stream
Select platform 0
Time used: 12729
It's still broken with my HD4850. :sad:
mhouston
21-Dec-2009, 20:53
#define USE_TABLE in the shaders.cl file will get correctness back. We are looking into the codegen issue with compute_palette that seems to be causing issues.
mhouston
21-Dec-2009, 21:00
And an easier fix perhaps (and should improve performance on some platforms, perhaps other than AMD...):
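__constant int4 colors[5] = {
    (int4) (255, 0, 0, 0),
    (int4) (0, 255, 255, 0),
    (int4) (255, 0, 255, 0),
    (int4) (0, 0, 255, 0),
    (int4) (255, 0, 0, 0)
};
int4 compute_palette(int index)
{
    int i = index % 64;        // position within the current 64-step segment
    int j = (index / 64) % 4;  // which pair of key colors to blend
    if(index < MAX_STEPS) {
        // linear blend from colors[j] to colors[j + 1], rounded via +32
        return ((colors[j + 1] - colors[j]) * i + 32) / 64 + colors[j];
    }
    else {
        return 0;  // points that never escape are black
    }
}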
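To see what values this produces without an OpenCL device, the same arithmetic can be ported to plain C++. This is only a sketch: `Int4` stands in for OpenCL's int4, `lerp_channel` is a helper added for the port, and `kMaxSteps = 1024` is an assumption based on the 1024 iterations used in the benchmark, not a value confirmed from shaders.cl.

```cpp
#include <array>
#include <cassert>

// Assumption: matches the 1024 iterations used in the benchmark runs.
constexpr int kMaxSteps = 1024;

// Stand-in for OpenCL's int4 (component names follow OpenCL's x/y/z/w).
struct Int4 { int x, y, z, w; };

// Same five key colors as the __constant array in the kernel.
constexpr std::array<Int4, 5> kColors = {{
    {255, 0, 0, 0}, {0, 255, 255, 0}, {255, 0, 255, 0},
    {0, 0, 255, 0}, {255, 0, 0, 0}
}};

// One channel of the blend: 64 steps per segment, +32 for rounding.
int lerp_channel(int from, int to, int i) {
    return ((to - from) * i + 32) / 64 + from;
}

Int4 compute_palette(int index) {
    assert(index >= 0);
    if (index >= kMaxSteps) return {0, 0, 0, 0};  // never escaped: black
    const int i = index % 64;        // position within the segment
    const int j = (index / 64) % 4;  // which pair of key colors to blend
    const Int4& a = kColors[j];
    const Int4& b = kColors[j + 1];
    return {lerp_channel(a.x, b.x, i), lerp_channel(a.y, b.y, i),
            lerp_channel(a.z, b.z, i), lerp_channel(a.w, b.w, i)};
}
```

Index 0 gives (255, 0, 0, 0), index 32 lands halfway at (128, 128, 128, 0), and index 64 reaches the second key color (0, 255, 255, 0). The pattern repeats every 256 steps because the first and last key colors are the same.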
Arnold Beckenbauer
21-Dec-2009, 21:08
#define USE_TABLE in the shaders.cl file will get correctness back. We are looking into the codegen issue with compute_palette that seems to be causing issues.
It worked for me.
C:\Users\Denis\Downloads\mandelbrot(3)>mandelbrot -x1 -0.5 -y1 0.5 -x2 0 -y2 1.0
Width: 2048 Height: 2048
(-0.5, 0.5) - (0, 1)
Using GPU OpenCL implementation
Output file: output.bmp
Platform [0]: ATI Stream
Select platform 0
Time used: 950
And an easier fix perhaps (and should improve performance on some platforms, perhaps other than AMD...):
__constant int4 colors[5] = {
    (int4) (255, 0, 0, 0),
    (int4) (0, 255, 255, 0),
    (int4) (255, 0, 255, 0),
    (int4) (0, 0, 255, 0),
    (int4) (255, 0, 0, 0)
};
int4 compute_palette(int index)
{
    int i = index % 64;
    int j = (index / 64) % 4;
    if(index < MAX_STEPS) {
        return ((colors[j + 1] - colors[j]) * i + 32) / 64 + colors[j];
    }
    else {
        return 0;
    }
}
Ah, I have actually tried this before! :) Unfortunately, NVIDIA's OpenCL compiler has some weird bugs which make it think this initialization is wrong. But I'm glad to know that this works with AMD's OpenCL :)
codedivine
22-Dec-2009, 08:44
I tried this on Ubuntu 9.04 64-bit with the new AMD 2.0 SDK and a Mobility Radeon 4570. I had to add #include <cstring> to main.cpp. Then I compiled with "g++ *cpp -I/my/opencl/include/files -L/my/opencl/libs -lOpenCL".
Then I ran "./a.out -p".
I got 5.23s with the proper output.bmp.
I tried this on Ubuntu 9.04 64-bit with the new AMD 2.0 SDK and a Mobility Radeon 4570. I had to add #include <cstring> to main.cpp.
Oh, that's an oversight on my part. strcmp does indeed require including <cstring>. Neither Visual Studio nor Xcode complains about that, so I didn't notice it. I'll add that in a later version.
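For anyone hitting the same compile error, here is a minimal sketch of strcmp-based argument parsing with the header included explicitly. It is not the code from main.cpp: the `Options` struct and `parse_args` function are hypothetical, only a few of the flags are handled, and the defaults are the ones listed in the first post.

```cpp
#include <cassert>
#include <cstdlib>  // std::atoi, std::atof
#include <cstring>  // std::strcmp is declared here, so include it explicitly

// Hypothetical option set; defaults follow the program's documented ones.
struct Options {
    int width = 2048;
    int height = 2048;
    double x1 = -2.5, y1 = -1.5, x2 = 0.5, y2 = 1.5;
    bool use_cl_cpu = false;
};

Options parse_args(int argc, const char* const argv[]) {
    assert(argc == 0 || argv != nullptr);
    Options opt;
    for (int i = 1; i < argc; ++i) {
        const char* arg = argv[i];
        const bool has_value = i + 1 < argc;  // a value token follows
        if (std::strcmp(arg, "-clcpu") == 0) {
            opt.use_cl_cpu = true;
        } else if (has_value && std::strcmp(arg, "-w") == 0) {
            opt.width = std::atoi(argv[++i]);
        } else if (has_value && std::strcmp(arg, "-h") == 0) {
            opt.height = std::atoi(argv[++i]);
        } else if (has_value && std::strcmp(arg, "-x1") == 0) {
            opt.x1 = std::atof(argv[++i]);
        } else if (has_value && std::strcmp(arg, "-y1") == 0) {
            opt.y1 = std::atof(argv[++i]);
        } else if (has_value && std::strcmp(arg, "-x2") == 0) {
            opt.x2 = std::atof(argv[++i]);
        } else if (has_value && std::strcmp(arg, "-y2") == 0) {
            opt.y2 = std::atof(argv[++i]);
        }
        // Unknown flags are silently ignored in this sketch.
    }
    return opt;
}
```

Some compilers pull in the strcmp declaration through other standard headers, which is why the missing include can go unnoticed until the code is built with a different toolchain.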
I'm still trying to work out a version with __constant which works on both AMD's and NVIDIA's OpenCL. NVIDIA's OpenCL compiler apparently thinks { } is for initializing a vector and not an array of vectors, and that's very annoying.
trinibwoy
22-Dec-2009, 15:28
just re-download it from http://davibu.interfree.it/opencl/juliagpu/juliaGPU.html (it is updated with the above patch).
Cool, works now. Thanks.
I made a new version which changes the compute_palette function into:
int4 compute_palette(int index, __constant const int4* colors)
{
    int i = index % 64;
    int j = (index / 64) % 4;
    if(index < MAX_STEPS) {
        return ((colors[j + 1] - colors[j]) * i + 32) / 64 + colors[j];
    }
    else {
        return 0;
    }
}
instead of using a directly declared __constant array (which doesn't work with NVIDIA's compiler). I hope this works on AMD's GPUs now (it works in CPU mode). I also added a 64-bit executable named mandelbrot64.exe.
Arnold Beckenbauer
22-Dec-2009, 15:47
Everything works:
C:\Users\Denis\Downloads\mandelbrot(4)>mandelbrot -x1 -0.5 -y1 0.5 -x2 0 -y2 1.0 -clcpu
Width: 2048 Height: 2048
(-0.5, 0.5) - (0, 1)
Using CPU OpenCL implementation
Output file: output.bmp
Platform [0]: ATI Stream
Select platform 0
Time used: 12620
C:\Users\Denis\Downloads\mandelbrot(4)>mandelbrot -x1 -0.5 -y1 0.5 -x2 0 -y2 1.0
Width: 2048 Height: 2048
(-0.5, 0.5) - (0, 1)
Using GPU OpenCL implementation
Output file: output.bmp
Platform [0]: ATI Stream
Select platform 0
Time used: 952
Yup, the latest compile now works with the final Stream SDK 2.0!
Radeon HD 5870 @ 900/5000MHz:
Width: 2048 Height: 2048
(-0.5, 0.5) - (0, 1)
Using GPU OpenCL implementation
Output file: output.bmp
Platform [0]: ATI Stream
Select platform 0
Time used: 410
Core 2 Quad Q9450 @ 3608MHz:
Width: 2048 Height: 2048
(-0.5, 0.5) - (0, 1)
Using CPU OpenCL implementation
Output file: output.bmp
Platform [0]: ATI Stream
Select platform 0
Time used: 8432
Some more numbers E8500/HD4850
Width: 2048 Height: 2048
(-0.5, 0.5) - (0, 1)
Using GPU OpenCL implementation
Output file: output.bmp
Platform [0]: ATI Stream
Select platform 0
Time used: 999
Using CPU OpenCL implementation
Output file: output.bmp
Platform [0]: ATI Stream
Select platform 0
Time used: 21077
Lol, after logging off the missus, the CPU number dropped by 2 seconds to 19311!
trinibwoy
23-Dec-2009, 02:37
On my GTX 285, using table is a bit slower (493ms vs 477ms).
Yep, the new way is a bit faster. Are those your stock results? At stock I'm getting 515ms and 480ms overclocked.
The new way of using __constant for some reason makes it slower on NVIDIA's GPU when not using tables (my number was with the older program).
Are those 285 results at default settings or -x1 -0.5 -y1 0.5 -x2 0 -y2 1.0?
On my 260 it's so slow that I get a driver reset at 2048x2048, while at 512:
Width: 512 Height: 512
(-0.5, 0.5) - (0, 1)
Using GPU OpenCL implementation
Output file: output.bmp
Platform [0]: NVIDIA
Select platform 0
Time used: 1382
Are those 285 results at default settings or -x1 -0.5 -y1 0.5 -x2 0 -y2 1.0?
It's (-0.5, 0.5) - (0, 1.0). My GTX 285 takes 2xx ms to run at default setting.
260 shouldn't be that slow, as on my GeForce 8800GT it takes only 1469 ms to run the (-0.5, 0.5)-(0, 1.0) setting for 2048x2048. Are you using the latest driver (195.62)?
Width: 2048 Height: 2048
(-0.5, 0.5) - (0, 1)
Using GPU OpenCL implementation
Output file: output.bmp
Platform [0]: NVIDIA CUDA
Select platform 0
Time used: 1469
Nah, that was with the 190.89 from the opencl sdk page, couldn't get it to work with the latest before. But seems like I got the right driver/sdk installation order now, so I can run it with 195.62 (at least the prebuilt and in debug mode, not my own release builds).
Getting some 655 ms now.
trinibwoy
23-Dec-2009, 12:48
The new way of using __constant for some reason makes it slower on NVIDIA's GPU when not using tables (my number was with the older program).
That's strange. The older program was slower for me - 545ms.
I just tested this on my Mac mini and unfortunately both USE_TABLE and the __constant trick do not work with the GPU. They work with CPU though.
Apparently there are still some bugs with Apple's OpenCL compiler. :(
Albuquerque
24-Dec-2009, 04:12
So, question -- if I do not have an OpenCL-capable device (i.e., Intel CPU and NV 7300 GPU), can I simply not run this at all? I assumed the terribly unoptimized -cpu command line option would have allowed me to at least see it. However, I get an error about a missing OpenCL.DLL file, which really isn't a surprise on my rig.
I am running this on Win7Pro 32...
willardjuice
24-Dec-2009, 04:18
So, question -- if I do not have an OpenCL-capable device (i.e., Intel CPU and NV 7300 GPU), can I simply not run this at all? I assumed the terribly unoptimized -cpu command line option would have allowed me to at least see it. However, I get an error about a missing OpenCL.DLL file, which really isn't a surprise on my rig.
I am running this on Win7Pro 32...
IIRC AMD's "opencl CPU drivers" work on any x86 processor that has at least SSE3.
Albuquerque
24-Dec-2009, 04:25
IIRC AMD's "opencl CPU drivers" work on any x86 processor that has at least SSE3.
Duly noted, but I'm still curious why it's dependent on the OpenCL library. I'm looking through the code right now, but from the description, it sounded as if I shouldn't even need it. ???
OpenGL guy
24-Dec-2009, 05:08
Duly noted, but I'm still curious why it's dependent on the OpenCL library. I'm looking through the code right now, but from the description, it sounded as if I shouldn't even need it. ???
The OpenCL.dll is what loads the appropriate ICD. Install the Stream SDK to get the CPU device which is not a part of the OpenCL.dll at all.
In order to run OpenCL on a GPU, you need an OpenCL driver. The app isn't going to run on the CPU by itself... How could it handle the OpenCL API calls? :)
Core i7-920 @ 4300MHz
Windows 7 x64
Hyper-Threading ON:
mandelbrot64.exe -x1 -0.5 -y1 0.5 -x2 0 -y2 1.0 -clcpu
Width: 2048 Height: 2048
(-0.5, 0.5) - (0, 1)
Using CPU OpenCL implementation
Output file: output.bmp
Platform [0]: ATI Stream
Select platform 0
Time used: 4180
Hyper-Threading OFF:
Time used: 6670
btw, I noticed that the 64 and_64 executables have worse performance than the 32(?)bit one?
Duly noted, but I'm still curious why it's dependent on the OpenCL library. I'm looking through the code right now, but from the description, it sounded as if I shouldn't even need it. ???
Oh, it's possible to do that if you link OpenCL.dll dynamically. However, that means one has to load the library and get all the function pointers manually, and unfortunately that's not generally portable. Therefore, this program linked OpenCL.dll through opencl.lib, which makes the executable "require" opencl.dll to work.
It'd be nice if Microsoft (or someone else) made a publicly available OpenCL ICD (preferably with good CPU support) for everyone to download. Unfortunately, AMD's Stream SDK 2.0 is currently the closest thing we have (it takes only a simple registration to download the SDK).
btw, I noticed that the 64 and_64 executables have worse performance than the 32(?)bit one?
Do you mean the unoptimized CPU path or the OpenCL path? The unoptimized CPU path is probably not very meaningful performance wise. On the other hand, if running on GPU there should be very little difference between 32 bits and 64 bits. The only possible big difference should be in the CPU OpenCL path, as the OpenCL compiler will have more registers to use in 64 bits.
Out of curiosity, I did some tests on my computer (all using [-0.5, 0.5] ~ [0, 1] setting):
64 bits CPU (unopt) path: 124.292 s
32 bits CPU (unopt) path: 236.657 s
64 bits GPU OpenCL path: 0.506s
32 bits GPU OpenCL path: 0.508s
64 bits CPU OpenCL path: 6.240s
32 bits CPU OpenCL path: 6.074s
The unoptimized path is unsurprisingly weird performance-wise :) The GPU path is as expected, but the OpenCL CPU path is a bit surprising as 64 bits seems to be a bit slower than 32 bits.
By the way, my CPU is Core i7 920 (2.66GHz).
I'll check the numbers tonight, I just ran the 64 bit executables without options, I'll let you know.
On my Core 2 3.0GHz, 32 bits version is also faster, but not very much:
32 bits CPU OpenCL path: 19.611s
64 bits CPU OpenCL path: 20.544s
GPU using tables is consistently faster than without:
GPU 32 985 / tables 974
GPU-64 11035 / with table 11039 (very weird, after loading GPU-Z I get 985 with the 32 bit exe, but not the 64)
cpu unopt (single core):
32: 208690
64: 126499
cpu opt
32: 18680
64: 18464
GPU using tables is consistently faster than without:
GPU 32 985 / tables 974
GPU-64 11035 / with table 11039 (very weird, after loading GPU-Z I get 985 with the 32 bit exe, but not the 64)
I tried using the table on my 4850. It's very weird indeed. The 32-bit binary runs fast (9xx ms) both with and without tables. However, on my computer the 64-bit binary sometimes runs very slowly (11xxx ms) and sometimes as fast as the 32-bit binary. There are probably some bugs in AMD's 64-bit OpenCL implementation. I'll have to check it further.
trinibwoy
28-Dec-2009, 05:57
What Nvidia drivers/SDK are people using for development? I got the raytracer to compile under windows after fiddling a bit but can't get opencl.dll to link properly. According to Tim in this thread (http://forums.nvidia.com/index.php?showtopic=151149) you need the beta 3.0 SDK for 195.xx drivers to work but that didn't help. Linking to OpenCL.lib as instructed.
The OpenCL calling convention changed in 195.xx, so you need to upgrade to the latest SDK as well.
Edit: Nevermind, it works with the new SDK.
Could it be Macro-ops Fusion on core 2 duo's?
Do you mean the mysterious behaviour of 32 bits being faster than 64 bits?
To my understanding, Core 2 Duo doesn't do macro-op fusion in 64-bit mode, so that could explain why it's slower in 64 bits. However, Nehalem (Core i7) does not have this limitation, but it's still slower (and even by a larger amount) in 64 bits than in 32 bits. So there could be other reasons behind this.
Lightman
06-Jan-2010, 11:43
Do you mean the mysterious behaviour of 32 bits being faster than 64 bits?
To my understanding, Core 2 Duo doesn't do macro-op fusion in 64-bit mode, so that could explain why it's slower in 64 bits. However, Nehalem (Core i7) does not have this limitation, but it's still slower (and even by a larger amount) in 64 bits than in 32 bits. So there could be other reasons behind this.
Nehalem might have lower instruction issue speed in 64-bit mode due to the 16-byte fetch (depends on the instruction mix). I will have a go on Phenom II to verify that.
Also memory operations are in some cases slower in 64bit mode.
Here is short explanation from Wiki:
The main disadvantage of 64-bit architectures is that relative to 32-bit architectures, the same data occupies more space in memory (due to swollen pointers and possibly other types and alignment padding). This increases the memory requirements of a given process and can have implications for efficient processor cache utilization. Maintaining a partial 32-bit model is one way to handle this and is in general reasonably effective. For example, the z/OS operating system takes this approach currently, requiring program code to reside in 31-bit address spaces (the high order bit is not used in address calculation on the underlying hardware platform) while data objects can optionally reside in 64-bit regions.
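The "swollen pointers" point is easy to check with a small pointer-heavy struct. The sketch below is just an illustration; `ListNode` and `bytes_for_nodes` are hypothetical and have nothing to do with the Mandelbrot program itself.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical pointer-heavy record, used only to illustrate the point.
struct ListNode {
    ListNode* prev;
    ListNode* next;
    int value;
};

// Bytes needed to store `count` nodes on the platform this is compiled for.
std::size_t bytes_for_nodes(std::size_t count) {
    // Guard against overflow of the multiplication below.
    assert(count <= static_cast<std::size_t>(-1) / sizeof(ListNode));
    return count * sizeof(ListNode);
}
```

With typical x86 compilers `sizeof(ListNode)` is 12 in a 32-bit build and 24 in a 64-bit build (two 8-byte pointers plus padding), so the same list takes about twice the cache space. A kernel that works mostly on floats and ints has few pointers, so this effect alone would be expected to be small.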
We're not talking about "slower"; we're talking about a tenfold increase in processing time.
edit: E8500 with HD5850:
Width: 2048 Height: 2048
(-0.5, 0.5) - (0, 1)
Using GPU OpenCL implementation
Output file: output.bmp
Platform [0]: ATI Stream
Select platform 0
Time used: 487