NLM denoise in DX11
I've written a simple NLM denoise program with DX11 compute shader 4.0. It works on my GTX 285, but I don't know if it works on anything else...
Anyway, if anyone's interested it can be downloaded here (http://www.kimicat.com/dang-an-jia/nlm_cs.rar). On my GTX 285 it took around 420 ms to denoise the sample image in the file.
For those interested in the source code (which is pretty boring), it can be downloaded here (http://www.kimicat.com/dang-an-jia/nlm_cs_src.rar).
Note: it can only output BMP files right now.
EDIT: A slight modification improves the speed to around 280ms.
MDolenc
07-Oct-2009, 09:36
What driver are you using? I can't run any direct compute stuff on my machine and it's a properly configured Vista 64 with latest DX SDK and everything. Are you on Win7?
I use the previous 190.62 driver on Windows 7 x64. I've heard that in order to enable compute shader on Vista, a registry key has to be modified, but I don't know which one (I'll have to check NVIDIA's GPU computing SDK for that). I don't know whether the latest driver (191.07) still needs that though.
Also the latest version takes around 250ms on my GTX 285. Basically I modified the type of the colors from float4 to float3, which helps a bit for NVIDIA's scalar architecture.
I checked the release notes and they say that registry keys named "D3D_39482904" should be deleted (there are about 2 instances of them).
I did some tuning on this and now it takes a GTX 285 223 ms to denoise. I also modified the program to write the compiled shader into a file named shader.bin so it doesn't need to compile it again next time.
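The compile-once behaviour described above can be sketched as follows. This is only an illustration in Python, not the program's actual code: `load_or_compile` and `fake_compile` are hypothetical names, and the real program presumably calls the D3D shader compiler where the stand-in is used here.

```python
import os
import tempfile

def load_or_compile(cache_path, compile_fn):
    """Return compiled bytes, reusing cache_path when it already exists."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read()
    blob = compile_fn()              # slow path: compile from source
    with open(cache_path, "wb") as f:
        f.write(blob)
    return blob

# Demo with a stand-in compiler that records how often it is called.
calls = []
def fake_compile():
    calls.append(1)
    return b"compiled-shader-bytes"

cache_file = os.path.join(tempfile.mkdtemp(), "shader.bin")
first = load_or_compile(cache_file, fake_compile)    # compiles and writes
second = load_or_compile(cache_file, fake_compile)   # reads the cached file
```

A cache like this knows nothing about the source it was built from, so a stale file has to be deleted by hand whenever the shader changes.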
A friend tested it for me on a GeForce 9600GT and it takes 717 ms.
Right now, my shader relies on texture cache to reduce memory bandwidth requirement. It's possible to use shared memory to do this, but it's much more complicated.
Just tested it on my 5850. Are both images supposed to look the same?
output says
setup time: 561
Load file time: 47
Denoise and write file time:234
edit: never mind, it seems to work.
my 5850 is at 775 core and 1100 for memory.
MDolenc
08-Oct-2009, 06:27
Thanks Chen! That helped.
Setup time: 553
Load file time: 113
Denoise and write file time: 290
On a GTX 280 with Vista 64 with 191.03 drivers.
Just tested it on my 5850. Are both images supposed to look the same?
They should be quite similar because this is just a denoise program. The source image is from an iPhone 3G's camera, which is quite noisy. After denoising, it becomes a little blurry but most random noise is gone. Of course, you can use your own images from other cameras, but the image has to fit inside a texture (which on current cards should be 4096x4096 or something).
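For readers unfamiliar with the filter, here is a minimal textbook non-local means sketch in Python. It is an assumption-laden illustration, not a port of the thread's shader: it works on a grayscale image, uses unweighted patch distances (the shader additionally weights patch differences with a Gaussian), and the parameters `patch_half`, `search_half` and `h` are illustrative names with no known mapping to the program's `sigma` setting.

```python
import math
import random

def nlm_denoise(img, patch_half=1, search_half=2, h=0.3):
    """Textbook non-local means on a 2D list of floats (grayscale)."""
    height, width = len(img), len(img[0])

    def px(y, x):
        # Clamp coordinates so patches near the border stay inside the image.
        return img[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, norm = 0.0, 0.0
            for dy in range(-search_half, search_half + 1):
                for dx in range(-search_half, search_half + 1):
                    # Squared distance between the patch around (y, x)
                    # and the patch around the candidate (y+dy, x+dx).
                    d2 = 0.0
                    for oy in range(-patch_half, patch_half + 1):
                        for ox in range(-patch_half, patch_half + 1):
                            diff = px(y + oy, x + ox) - px(y + dy + oy, x + dx + ox)
                            d2 += diff * diff
                    w = math.exp(-d2 / (h * h))   # similar patches weigh more
                    total += w * px(y + dy, x + dx)
                    norm += w
            out[y][x] = total / norm
    return out

def mse(a, b):
    """Mean squared error between two equally sized 2D lists."""
    return sum((p - q) ** 2 for ra, rb in zip(a, b)
               for p, q in zip(ra, rb)) / (len(a) * len(a[0]))

rng = random.Random(0)
clean = [[0.5] * 8 for _ in range(8)]
noisy = [[v + rng.uniform(-0.05, 0.05) for v in row] for row in clean]
denoised = nlm_denoise(noisy)
```

Each output pixel is a weighted average of its neighbours, which is why the result loses random noise and also looks slightly blurred.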
output says
setup time: 561
Load file time: 47
Denoise and write file time:234
edit: never mind, it seems to work.
my 5850 is at 775 core and 1100 for memory.
That seems to be ok. The kernel is not very vectorized (as you can see in the shader code) so it's not very well suited for RV870. I'll see what I can do when I get a RV870 on my home computer, or when a compute shader enabled driver for RV770 is publicly available.
I have updated the program with support for a pixel shader code path, so now it should run on any video card with pixel shader 4.0 support (that means DX10 feature level). Although it still requires DX11 runtime to run.
From what I've seen on my GTX 285, the performance is almost the same (since the shader is, well, almost the same). The only downside of the pixel shader path is that I used D3DX to write the texture directly to a BMP file, and it decided to write it in 32 bpp mode rather than 24 bpp mode, so the resulting file is larger.
Programming-wise, the pixel shader path is much more annoying than I anticipated, partly because I've not used D3D10 doing 3D rendering for quite a while, and D3D11 is apparently more complex than D3D10. Other than that, it doesn't seem to have any nasty surprises, which is a good thing.
I've tried a few ways to utilize the shared memory to reduce the number of texture loads. However, that increases pressure on the ALUs, and at least on my GTX 285 the shader is already nearly ALU limited, so performance is always worse. Maybe on RV870 it could be a different story.
NLM denoise using DirectX 11
Ping-Che Chen
sigma: 0.02
Using pixel shader.
Setup time: 20ms
Load file time: 30ms
Denoise and write file time: 560ms
Radeon HD 4890 @ 950MHz GPU;
Win7 x86 (7600.20510), Cat 9.11b;
CarstenS
15-Oct-2009, 10:12
Can someone - please! - re-upload a working archive for pcchen's program? Every time I try downloading it from the original link, I get an error message when trying to decompress the rar archive (which is only a few kByte).
edit:
Never mind, it was only me trying to save the file via rightclick - save as... - and that did not work.
Silent_Buddha
15-Oct-2009, 23:50
Hmmm, doesn't appear my 5870 is using the compute shader version? Is there a way to force it to use the compute shader?
My results...
NLM denoise using DirectX 11
Ping-Che Chen
sigma: 0.02
Using pixel shader.
Setup time: 20ms
Load file time: 36ms
Denoise and write file time: 415ms
Regards,
SB
There's a bug in compute shader detection, it should be fixed now. If there's any problem please let me know, thanks.
Silent_Buddha
16-Oct-2009, 10:09
Tried the new version and seems to work fine now as far as I can tell. Small speedup from PS version...
NLM denoise using DirectX 11
Ping-Che Chen
sigma: 0.02
Using compute shader.
Setup time: 16ms
Load file time: 36ms
Denoise and write file time: 368ms
Seems a lot slower than the 5850 score above though. Not sure why.
Regards,
SB
CarstenS
16-Oct-2009, 11:30
Same thing with most DX11-samples from the SDK you can run in Feature-Level 10.x mode also. Maybe GDS isn't activated in the drivers yet?
I updated it with a new pixel shader, which reduces the number of texture loads by preloading texels into a local array. This is extremely slow on GTX 285 (about 1800 ms), perhaps because GTX 285 does not support indexed arrays in registers. On my Radeon HD 4850, however, it runs much faster: the original pixel shader takes about 900 ms, while the new shader takes only 520 ms. This also confirms my suspicion that RV770/RV870 is limited by texture loads in the original shader.
If anyone with a Radeon wants to test the new shader, use -p2 switch, i.e.
nlm_cs -p2 IMG_0025.JPG output.bmp
Using:
http://developer.amd.com/gpu/shader/Pages/default.aspx
Looking at shader_ps.hlsl, the PSMain shader has a really low ALU:fetch in each of the loops, 6 ALU and 3 fetch.
Looking at PSMain2, the ATI compiler is allocating 43 vec4 registers (I haven't counted how many of the effective 172 scalar registers are actually being used). That means there's only 5 threads of 64 pixels that can be in flight. The low ALU:fetch, 2.7:1, resulting from 49 vfetch instructions, 27 loads and 206 ALU instructions in the main loop, means ATI is still fetch limited.
I guess PSMain2 uses too many registers for NVidia's architecture, so it is forced to spill to main memory.
Jawed
Silent_Buddha
16-Oct-2009, 20:37
Whoa that's a dramatic speed increase for the new PS path.
NLM denoise using DirectX 11
Ping-Che Chen
sigma: 0.02
Using pixel shader.
Setup time: 32ms
Load file time: 42ms
Denoise and write file time: 261ms
Regards,
SB
Using:
http://developer.amd.com/gpu/shader/Pages/default.aspx
Looking at shader_ps.hlsl, the PSMain shader has a really low ALU:fetch in each of the loops, 6 ALU and 3 fetch.
Yeah, I have used that and that's why I decided to modify the shader to reduce texture loads. Apparently, GTX 285 has enough texture units that it's easy to use them as a cached read-only memory. In this shader, for every pixel there are 4851 texture loads. That's quite a crazy number of texture loads. :)
Looking at PSMain2, the ATI compiler is allocating 43 vec4 registers (I haven't counted how many of the effective 172 scalar registers are actually being used). That means there's only 5 threads of 64 pixels that can be in flight. The low ALU:fetch, 2.7:1, resulting from 49 vfetch instructions, 27 loads and 206 ALU instructions in the main loop, means ATI is still fetch limited.
This shader reduces the number of texture loads from the previous 4851 to 1029. That's nearly a 5 times reduction. However, as you observed, the number of registers it requires reduces the number of threads in flight, so performance suffers. Ideally, only 169 texture loads are required per pixel, but that would need a staggering 2704 bytes of memory per pixel to store those texels.
The ideal way to do this is to use shared memory, which can further reduce texture load by a great number (for example, a 32x16 threads group only needs 968 loads for all these threads, rather than 169 per thread if there's no shared memory). Unfortunately, the restrictions of cs 4.0 make this very inconvenient. If I get a RV870 I can do that with cs 5.0. Another way is to use OpenCL, which does not have this restriction.
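The load counts quoted above can be checked with a little arithmetic. The sketch below assumes a 7x7 comparison patch and a 7x7 search window (kernel_half = 3, inferred from the 7-element dist array and the 13-element row buffer in the shader snippets), and float4 texels of 16 bytes. The split of 4851 into 49 x 99 is one decomposition that reproduces the figure, not something the shader source in this thread confirms.

```python
# Assumed geometry: 7x7 patch and 7x7 search window (kernel_half = 3), so
# each output pixel touches a 13x13 neighbourhood of the source image.
kernel_half = 3
patch = (2 * kernel_half + 1) ** 2          # 49 texels per patch
search = (2 * kernel_half + 1) ** 2         # 49 candidate positions
footprint = (4 * kernel_half + 1) ** 2      # 13 * 13 distinct texels

# One decomposition that reproduces the 4851 figure: per candidate,
# load both patches (49 + 49) plus the candidate's own colour (1).
naive_loads = search * (patch + patch + 1)

cached_loads = 1029                         # figure quoted for the second shader
bytes_per_texel = 16                        # assuming float4 texels (4 x 4 bytes)

reduction = naive_loads / cached_loads      # the "nearly 5 times" reduction
ideal_bytes = footprint * bytes_per_texel   # storage to keep every texel once
print(naive_loads, footprint, ideal_bytes, round(reduction, 2))
```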
I guess PSMain2 uses too many registers for NVidia's architecture, so it is forced to spill to main memory.
I suspect that the main reason behind the poor performance on NVIDIA's architecture is that registers in NVIDIA's architecture can't be indexed, so they have to use video memory to handle arrays (unless you unroll every loop to make all array accesses non-indexed, but that is probably going to be worse).
I suspect that the main reason behind the poor performance on NVIDIA's architecture is that registers in NVIDIA's architecture can't be indexed, so they have to use video memory to handle arrays (unless you unroll every loop to make all array accesses non-indexed, but that is probably going to be worse).
Only the dist array (7 elements) actually uses indexed registers in the ATI code. Have to admit, I didn't even notice there were any indexed registers in the assembly first time around.
Bizarrely, the ATI code uses purely static indexing. Seems like a compiler bug - it should see these are static indexes and allocate fixed registers. Might be a side-effect of the indexing produced by the HLSL compiler, now I've looked at the D3D assembly.
Generally I was under the impression that NVidia GPUs do support indexed registers, as long as there's not too many and they can be statically allocated.
You could actually unroll the inner computation loop:
[unroll]
for(k = 0; k <= kernel_half*2; k++) {
[unroll]
for(l = 0; l <= kernel_half*2; l++) {
float3 cd = c2s[l] - c1s[k + l];
float weight = g_gaussian[p + l];
dist[k] += (dot(cd, cd) * weight);
}
}
The assembly this results in has an inner loop with 27 fetches and 123 ALUs, i.e. a reasonable 4.6 ALU:fetch.
This results in an allocation of 52 vec4 registers, which means only 4 threads. Slide 23:
http://gpgpu.org/wp/wp-content/uploads/2009/09/E1-OpenCL-Architecture.pdf
says 3 is minimum, but 5 is better. So 4 threads might be pushing it somewhat.
Worth a try.
On HD5870 the DOT4 instructions would become DOT3s, which could lower the ALU:fetch. Until GPUSA is updated for R800 it's just a guessing game though. I really don't understand why AMD doesn't have a revised version out there already.
Jawed
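The relationship between register allocation and threads in flight quoted in this post (43 vec4 registers giving 5 threads, 52 giving 4) is consistent with a simple integer division. The sketch below assumes a budget of 256 vec4 registers per lane; that figure is not stated in the thread and is chosen because it reproduces the quoted numbers. "Thread" here follows the ATI usage of a 64-pixel wavefront.

```python
def wavefronts_in_flight(vec4_regs_per_thread, regs_available=256):
    """Wavefronts that fit, assuming a fixed vec4 register budget per lane."""
    return regs_available // vec4_regs_per_thread

for regs in (43, 52):
    print(regs, "vec4 registers ->", wavefronts_in_flight(regs), "threads")
```

Under the same assumption, register counts mentioned elsewhere in the thread (113 and 41) give 2 and 6 threads, which also matches what is reported.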
Arnold Beckenbauer
16-Oct-2009, 22:55
sigma: 0.02
Using pixel shader.
Setup time: 78ms
Load file time: 63ms
Denoise and write file time: 483ms
HD4850 (700/993 MHz), PCIe 1.0 (16x)
You could actually unroll the inner computation loop:
[unroll]
for(k = 0; k <= kernel_half*2; k++) {
[unroll]
for(l = 0; l <= kernel_half*2; l++) {
float3 cd = c2s[l] - c1s[k + l];
float weight = g_gaussian[p + l];
dist[k] += (dot(cd, cd) * weight);
}
}
The assembly this results in has an inner loop with 27 fetches and 123 ALUs, i.e. a reasonable 4.6 ALU:fetch.
I was under the impression that the compiler should automatically determine whether to unroll loops. Anyway, in general unrolling such a long loop is not a good idea on NVIDIA's architecture, but apparently it's good for ATI's architecture, at least in this particular instance.
The unrolled shader takes very long to compile (on my computer it took almost 43 seconds). However, the result is much better, at 340 ms on my Radeon HD 4850 (previously about 520 ms). The compile time is not a very big issue because the program saves compiled shaders into binary files for later use.
If anyone wants to try this, just add [unroll] to the two loops in shader_ps.hlsl, as in the code above. Remember to delete all .bin files in the directory though, as they may contain binaries of the old shaders. I'll update the files on the server as soon as possible.
Generally I was under the impression that NVidia GPUs do support indexed registers, as long as there's not too many and they can be statically allocated.
My guess is that NVIDIA uses shared memory for indexed arrays if they are not very big. I do this in many of my CUDA programs as well. Since shared memory is quite small (16KB), it can't handle very large arrays.
Nice, that's a good speed-up, approaching 3x from the original pixel shader :grin:
Your feedback time was almost as fast as I could have done it myself if I had all the stuff required :grin:
I'm amazed by the compilation time though. Here it only takes a couple of seconds on my crappy A64 3500 (2GHz X2) in GPUSA.
Check the compilation time in GPUSA, it should be faster. I wonder what's different.
I think there are heuristics for unrolling, e.g. detecting that there's 16 or less static iterations of a loop (I've seen this behaviour in Brook+) but I've not really seen any documentation on this subject. It might be in the hands of what the HLSL compiler outputs.
e.g. I have just commented out all the loop-related pragmas and told GPUSA to avoid control flow. The resulting D3D assembly contains only a single loop, is 1398 instructions and uses no indexable registers. The assembly uses 113 vec4 registers :shock: but the loop has an ALU:fetch of 4.27. Seems unlikely it would be faster, with only 2 threads.
If I take your code from before my suggestion, and comment out the pragmas, the default HLSL compilation in GPUSA produces the same code as with your pragmas - so it is seeing the 13-long and 7-long static loops. I've just noticed that this D3D assembly contains 3 indexable registers, but the resulting hardware assembly only has 1.
Jawed
Yeah, the compile time is puzzling. But I've seen that before. It's probably a problem with D3D's HLSL compiler. GPUSA is so fast that its default behavior is to recompile after every single character you type into the source code window :P
To my understanding, the compiled binary is platform agnostic, as it's basically just the assembly form (although in SM4.0 it's no longer possible to directly write in assembly form). So basically the long compile time is actually not a big problem.
A more practical problem is that there is no obvious way to decide which shader to use other than plainly doing a benchmark (or by vendor ID... but I don't really like that because it's too restrictive).
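Picking a code path by benchmarking, as suggested above, could look roughly like this. It is a sketch only: `pick_fastest` is a hypothetical helper, and the `time.sleep` calls merely stand in for running the real compute shader and pixel shader paths on a small test image.

```python
import time

def pick_fastest(candidates, runs=3):
    """Time each callable a few times and return the name of the quickest."""
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - start)
        elapsed = min(samples)      # best-of-N damps one-off stalls
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

# Stand-ins for the real shader paths; the sleeps only fake their cost.
paths = {
    "compute": lambda: time.sleep(0.03),
    "pixel_v1": lambda: time.sleep(0.02),
    "pixel_v2": lambda: time.sleep(0.005),
}
choice = pick_fastest(paths)
```

The result could be stored next to the cached shader binaries so the measurement only runs once per machine.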
Now I'm more interested in doing an OpenCL version, maybe using shared memory. It'd be interesting to compare it with the Compute shader version.
By the way, I wrote a CPU version a year ago, using SSE instructions. It's not completely the same: the CPU version works in YUV 4:2:0 color space rather than RGB, so it's less work. It's also multi-threaded. On my Core i7 920, it takes about 1.1 seconds to denoise the same image, slower than a lowly GeForce 9600GT. I think that shows the amazing potential of GPUs for image processing work like this :)
I've had 5 minute+ compile times with Brook+. Unrolled stuff is normally what kills it. Brook+ seems to use the HLSL compiler though, so I've not split it out to identify which compiler is thrashing.
This is a good example of why developing for a single console makes for optimised code.
I do also think that being able to see the assembly and statistics relating to hardware execution is a vital part of tuning for performance. Sure you could suck it and see for each combination of unrolls in this shader, but the insights you can gain on the hardware execution model are pretty useful.
In the original pixel shader, I suspect that cache access patterns are really important. I also think that NVidia's more fluid instruction scheduling: the ability for loads to complete individually and in any order and for the dependent ALUs to execute immediately - as opposed to the strictly clause-based approach on ATI - is what makes the performance so good.
Is it possible to separate-out the computation time and the file-write time?
I'm also thinking for real benchmarking it'd be worthwhile doing more than one denoise. I suppose the alternative would be to use a monster 24MP image.
For OpenCL this might be a useful leg-up:
http://developer.amd.com/gpu/ATIStreamSDK/ImageConvolutionOpenCL/Pages/ImageConvolutionUsingOpenCL.aspx
Though the article is only written with a CPU as a target and doesn't touch local memory. It does some vectorisation and unrolling - though there's no unroll pragma in AMD's compiler it seems, whereas there is in NVidia's.
Whole pile of OpenCL related webinars coming up:
http://developer.nvidia.com/object/gpu_computing_online.html
At the very bottom of the page is a WMV and PDF for "Best Practices for OpenCL Programming". The audio quality's a bit ropey though.
Jawed
I do also think that being able to see the assembly and statistics relating to hardware execution is a vital part of tuning for performance. Sure you could suck it and see for each combination of unrolls in this shader, but the insights you can gain on the hardware execution model are pretty useful.
Yes, this is quite important. For example, there are various profile registers available in CUDA which are useful for identifying potential performance bottlenecks. The ability to see PTX is also useful.
In OpenCL, there is no standard intermediate format, so it's more tricky. For example, in Apple's implementation, on NVIDIA's hardware the intermediate format is PTX, on ATI it's IL, and on CPU it's unfortunately compiled x86 code :P
In the original pixel shader, I suspect that cache access patterns are really important. I also think that NVidia's more fluid instruction scheduling: the ability for loads to complete individually and in any order and for the dependent ALUs to execute immediately - as opposed to the strictly clause-based approach on ATI - is what makes the performance so good.
Yes, making it "blocky" is really important. On my first try I used a 512x1 block, so access pattern is worse. By changing to a 32x16 block, performance improved by more than 50%.
Is it possible to separate-out the computation time and the file-write time?
It's possible in compute shader but more difficult in pixel shader because there doesn't seem to be a concrete way to make sure all outstanding operations were completed. However, file-write time is at most 20 ~ 30 ms.
I'm also thinking for real benchmarking it'd be worthwhile doing more than one denoise. I suppose the alternative would be to use a monster 24MP image.
Yeah, actually any picture can be tested. Because it uses a texture, the texture size limit can be a problem (on DX10 level hardware it should be 8192x8192 IIRC).
I tested a 3648x2736 image and it takes about 1.98 seconds.
For OpenCL this might be a useful leg-up:
http://developer.amd.com/gpu/ATIStreamSDK/ImageConvolutionOpenCL/Pages/ImageConvolutionUsingOpenCL.aspx
Though the article is only written with a CPU as a target and doesn't touch local memory. It does some vectorisation and unrolling - though there's no unroll pragma in AMD's compiler it seems, whereas there is in NVidia's.
Whole pile of OpenCL related webinars coming up:
http://developer.nvidia.com/object/gpu_computing_online.html
At the very bottom of the page is a WMV and PDF for "Best Practices for OpenCL Programming". The audio quality's a bit ropey though.
Thanks for the links :)
I have now tested the new shader on GTX 285, and apparently unrolling the inner loops helps a lot (perhaps because it replaces the indexed arrays with non-indexed registers). It now takes only ~170 ms on a GTX 285. I've updated the files on the server with the new shader. It'd be interesting to see how a RV870 performs with this new shader :)
Now I'm very interested in the potential performance of an OpenCL version, with everything preloaded with shared memory. It could be even faster. I'll start porting it when I have some spare time.
I have now tested the new shader on GTX 285, and apparently unrolling the inner loops helps a lot (perhaps because it replaces the indexed arrays with non-indexed registers). It now takes only ~170 ms on a GTX 285. I've updated the files on the server with the new shader. It'd be interesting to see how a RV870 performs with this new shader :)
:shock: Hang on, the vectorised shader, PSMain2, is now running without running into register allocation or registers-in-memory problems? Wow, that's good, unrolling it made it more than 10x faster :shock: 170ms really is shifting, too.
I noticed you put an unroll on the total_color FOR loop, too.
Now I'm very interested in the potential performance of an OpenCL version, with everything preloaded with shared memory. It could be even faster. I'll start porting it when I have some spare time.
If it isn't faster there's something going on :razz:
Looking at the use of the g_gaussian constant in the ATI assembly I see it generates an extra load in both the scalar and vector versions of the pixel shader. This is the VFETCH instruction. EDIT: I guess this splats the same result across all strands. So I suppose it only needs 1 ALU instruction to hide this latency, instead of 4 for normal fetches. Although there might not be enough register file write bandwidth to make it that fast, hmm.
That looks like a good candidate for returning to a computed weight. The calculation is effectively free on ATI as the shader's fetch bound.
Jawed
CarstenS
18-Oct-2009, 19:50
Couldn't you, just for sake of benchmarking, omit writing an output-file altogether?
:shock: Hang on, the vectorised shader, PSMain2, is now running without running into register allocation or registers-in-memory problems? Wow, that's good, unrolling it made it more than 10x faster :shock: 170ms really is shifting, too.
I noticed you put an unroll on the total_color FOR loop, too.
Yeah, this is really interesting because by some rough counting I thought GTX 285 was near its ALU limit, and the first "cached" version performed so badly on a GTX 285 that I thought it was impossible to make it this much faster. I just wanted to test it on a Radeon to see whether it's faster or not. Well, now we know that nothing is really impossible :P
If it isn't faster there's something going on :razz:
Of course, it could be really ALU limited now on GTX 285. But I still want to try that. My plan is to write a first version which, just like the CS version, uses textures. Then a "cached" version just like the second pixel shader version. Finally a "cached" shared memory version. This is so that there is a baseline for comparing the CS and OpenCL implementations.
Looking at the use of the g_gaussian constant in the ATI assembly I see it generates an extra load in both the scalar and vector versions of the pixel shader. This is the VFETCH instruction. EDIT: I guess this splats the same result across all strands. So I suppose it only needs 1 ALU instruction to hide this latency, instead of 4 for normal fetches. Although there might not be enough register file write bandwidth to make it that fast, hmm.
That looks like a good candidate for returning to a computed weight. The calculation is effectively free on ATI as the shader's fetch bound.
Since this is a Gaussian, it's basically like
weight = exp(-r * (k * k + l * l));
I suspect that it'll be expensive. Unfortunately, since my Radeon is at home it'll have to wait XD
Couldn't you, just for sake of benchmarking, omit writing an output-file altogether?
I can try to copy the texture to a staging buffer, lock it, but not actually read it. I think that'll do.
[EDIT] A new version is uploaded with a new -b switch, which disables writing to file (for a 2M file it takes around 20 ms on my computer, so it's basically 20 ms quicker than previous times). The switch can be combined with other switches, e.g.
nlm_cs -b -p2 img_0025.jpg output.bmp
produces something like this on my computer
sigma: 0.02
Using pixel shader.
Setup time: 34ms
Load file time: 72ms
Benchmark mode, do not actually output file.
Denoise time: 151ms
Silent_Buddha
19-Oct-2009, 08:59
Tried the new version on my 5870 and it has a HUGE setup time for the first run...
NLM denoise using DirectX 11
Ping-Che Chen
sigma: 0.02
Using pixel shader.
Setup time: 25980ms
Load file time: 41ms
Denoise and write file time: 211ms
Second run setup is back down to 30-45 though. :D Also the results vary from 180-220 if I just run it a bunch of times.
Holy cow, it's FAST once I take the write out of the loop. Using -b switch.
NLM denoise using DirectX 11
Ping-Che Chen
sigma: 0.02
Using pixel shader.
Setup time: 41ms
Load file time: 36ms
Benchmark mode, do not actually output file.
Denoise time: 121ms
Takes it from being slower than your GTX 285 to faster than your GTX 285. Cuts off a good 60-100 ms. Running multiple times it's usually 120-122 ms. But every once in a while I'll get a 132-133 result (about once every 10 times run).
Oddly enough -b switch only reduces the CS version by 20-30 ms. But that one is now quite slow compared to the -p2 switch.
Regards,
SB
Tried the new version on my 5870 and it has a HUGE setup time for the first run...
Second run setup is back down to 30-45 though. :D Also the results vary from 180-220 if I just run it a bunch of times.
Yeah, on my home computer (Core 2 3.0GHz) it takes 40 seconds to compile. On my Core i7 920 it takes about 30 seconds or so. However, it saves the compiled binary into a .bin file so it doesn't have to compile again next time.
Takes it from being slower than your GTX 285 to faster than your GTX 285. Cuts off a good 60-100 ms. Running multiple times it's usually 120-122 ms. But every once in a while I'll get a 132-133 result (about once every 10 times run).
Oddly enough -b switch only reduces the CS version by 20-30 ms. But that one is now quite slow compared to the -p2 switch.
This is a bit weird though. The CS version uses my own code to write BMP file, while the pixel shader version uses D3DX to write resulting texture to a file. Both versions take about 20 ms on my computer. It's very strange that writing a texture to a BMP file (which does not require any compression) would take 100 ms.
In the benchmark mode, the texture is copied to another texture which is accessible by the CPU, then the texture is locked and unlocked to make sure all previous operations are completed.
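The reason for the lock/unlock step is that GPU work is queued asynchronously, so a timer stopped right after submission measures almost nothing. The Python sketch below imitates this with a background thread; `FakeDevice`, `submit` and `wait_idle` are made-up names for illustration and do not correspond to any D3D11 API.

```python
import threading
import time

class FakeDevice:
    """Stand-in for an asynchronous GPU queue (not a real D3D API)."""
    def submit(self, seconds):
        # Returns immediately; the "work" runs in the background.
        self._worker = threading.Thread(target=time.sleep, args=(seconds,))
        self._worker.start()

    def wait_idle(self):
        # Plays the role of locking the staging texture: blocks until done.
        self._worker.join()

def timed_submit(device, seconds, sync):
    """Time a submission, optionally waiting for completion first."""
    start = time.perf_counter()
    device.submit(seconds)
    if sync:
        device.wait_idle()
    elapsed = time.perf_counter() - start
    device.wait_idle()          # always drain before returning
    return elapsed

dev = FakeDevice()
unsynced = timed_submit(dev, 0.05, sync=False)   # far too optimistic
synced = timed_submit(dev, 0.05, sync=True)      # includes the actual work
```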
CarstenS
19-Oct-2009, 09:42
I can try to copy the texture to a staging buffer, lock it, but not actually read it. I think that'll do.
[EDIT] A new version is uploaded with a new -b switch, which disables writing to file (for a 2M file it takes around 20 ms on my computer, so it's basically 20 ms quicker than previous times). The switch can be combined with other switches, e.g.
nlm_cs -b -p2 img_0025.jpg output.bmp
produces something like this on my computer
sigma: 0.02
Using pixel shader.
Setup time: 34ms
Load file time: 72ms
Benchmark mode, do not actually output file.
Denoise time: 151ms
Great! I've had the problem that my denoise & write times varied quite a bit, between 280 and 310ish ms. A larger picture (a G80 die shot which I had previously shrunk to 8192² pixels) then produced some 20 seconds of d&w time, but only a very short spike in GPU load.
Will try the new version as soon as i get home today!
Since this is a Gaussian, it's basically like
weight = exp(-r * (k * k + l * l));
I suspect that it'll be expensive. Unfortunately, since my Radeon is at home it'll have to wait XD
I noticed snippets of code left behind:
//float3 cd = (c2 - c1) * float3(0.299, 0.587, 0.114);
//float weight = exp(-r * (k * k + l * l));
I've been fiddling with either calculating it or changing the g_gaussian fetch into static code.
I made a function to return the weight, since there's only 10 distinct weights:
float GetWeight(int k, int l)
{
k=abs(k) ;
l=abs(l) ;
if (k>l) {
int m = l ;
l = k ;
k = m ;
}
if (k==0 && l==0) return 0.0307091 ;
if (k==1 && l==1) return 0.0274797 ;
if (k==2 && l==2) return 0.0196901 ;
if (k==3 && l==3) return 0.0112973 ;
if (k==0) {
if (l==1) return 0.0290496 ;
if (l==2) return 0.0245899 ;
if (l==3) return 0.018626 ;
}
if (k==1) {
if (l==2) return 0.0232611 ;
if (l==3) return 0.0176195 ;
}
if (k==2 && l==3) return 0.0149145 ;
return -99999.f ; //erroneous return, don't want to see this ever
}
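The ten literals in GetWeight can be regenerated from the Gaussian formula quoted earlier. The Python check below assumes r = 1/18 and a normalisation over the full 7x7 kernel; neither is stated in the thread, but both follow from the table itself (for instance 0.0290496 / 0.0307091 is about exp(-1/18)).

```python
import math

KERNEL_HALF = 3
R = 1.0 / 18.0     # inferred from the table, not stated in the thread

# Normalise so that the 7x7 weights sum to 1.
norm = sum(math.exp(-R * (k * k + l * l))
           for k in range(-KERNEL_HALF, KERNEL_HALF + 1)
           for l in range(-KERNEL_HALF, KERNEL_HALF + 1))

def get_weight(k, l):
    """Python counterpart of GetWeight: symmetric in sign and in k/l order."""
    return math.exp(-R * (k * k + l * l)) / norm

# The literals from the HLSL function, keyed by (k, l) with k <= l.
table = {
    (0, 0): 0.0307091, (1, 1): 0.0274797, (2, 2): 0.0196901, (3, 3): 0.0112973,
    (0, 1): 0.0290496, (0, 2): 0.0245899, (0, 3): 0.018626,
    (1, 2): 0.0232611, (1, 3): 0.0176195, (2, 3): 0.0149145,
}
worst = max(abs(get_weight(k, l) - w) for (k, l), w in table.items())
print("largest deviation from the table:", worst)
```

Because the weight depends only on k*k + l*l, swapping or negating the indices does not change it, which is why 10 entries cover all 49 kernel positions.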
I then changed the vector code, PSMain2, as follows:
int row = 0 ;
[unroll]
for(j = -kernel_half; j <= kernel_half; j++) {
float3 c1s[kernel_half * 4 + 1];
float3 c2s[kernel_half * 2 + 1];
[loop]
for(k = -kernel_half*2; k <= kernel_half*2; k++) {
c1s[k + kernel_half*2] = Input.Load(int3(coord.x + k, i + j, 0));
}
[loop]
for(k = -kernel_half; k <= kernel_half; k++) {
c2s[k + kernel_half] = Input.Load(int3(coord.x + k, coord.y + j, 0));
}
[loop]
for(k = 0; k <= kernel_half*2; k++) {
[unroll]
for(l = 0; l <= kernel_half*2; l++) {
float3 cd = c2s[l] - c1s[k + l];
float weight = GetWeight(row-kernel_half,l-kernel_half);
dist[k] += (dot(cd, cd) * weight);
}
}
row += 1 ;
}
Note the pragmas are quite different - this is crucial. Because of the unrolling in the j and l loops, the GetWeight function is optimised down to literals in the resulting D3D assembly. Other combinations of pragmas produce horrid code as the GetWeight function isn't optimised-out.
The ATI assembly results in 41 vec4 registers (resulting in 6 threads, a decent number to hide latency), with a single loop of 725 ALUs and 147 fetches. So that's the best ALU:fetch so far, 4.9.
There's 343 DOT4s that are really DOT3s and some really classy instructions such as 431 X: MOV R20.x, R20.x, so utilisation looks like it's just under 80%.
Having reduced the count of fetches, increased ALU:fetch and increased the number of threads in flight, this version should be faster on ATI - fingers crossed. Oh, and erm, presuming I haven't broken the weighting calculation...
Jawed
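For reference, the ALU:fetch ratios quoted for the different shader variants in this thread can be tabulated as below. The instruction counts are taken from the posts above; the variant labels are informal descriptions, and the PSMain2 main-loop fetch count is taken as the sum of the 49 vfetch instructions and the 27 loads.

```python
# (ALU instructions, fetch instructions) as quoted in the thread.
variants = {
    "PSMain inner loops": (6, 3),
    "PSMain2 main loop": (206, 49 + 27),   # vfetch + texture loads
    "PSMain2 inner loop unrolled": (123, 27),
    "PSMain2 with literal weights": (725, 147),
}
ratios = {name: alu / fetch for name, (alu, fetch) in variants.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} ALU per fetch")
```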
This is interesting... does that mean in ATI's architecture there are some sort of constant slot embedded in the instruction stream?
On NVIDIA's architecture, it's more like a traditional RISC instruction set, i.e. constants have to be read from constant memory anyway. So this optimization probably won't do much on NVIDIA's architecture. But it'd be interesting to see how it works on ATI's architecture though. Unfortunately, right now I don't have access to one :(
I'm considering buying a 5750 or 5770, which I just saw some local stores selling at reasonable prices. However, I'm not sure if it's possible to have both a GTX 285 and a 5750 on my motherboard (it has 3 PCIe x16 slots, but I'm worried about driver issues).
This is interesting... does that mean in ATI's architecture there are some sort of constant slot embedded in the instruction stream?
Yes, e.g.:
t: MULADD_e R4.w, PV427.x, (0x3CFB91A7, 0.03070910089f).y, R5.w VEC_021
The VLIW instruction grows in length to accommodate such literals.
On NVIDIA's architecture, it's more like a traditional RISC instruction set, i.e. constants have to be read from constant memory anyway. So this optimization probably won't do much on NVIDIA's architecture.
4.5.2 of the PTX guide:
http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf
says that literals are fine, e.g.:
mov.f32 $f3, 0F3f800000;
so I would hope that the driver compiler can take the literals in the D3D assembly and make them literals in the PTX. Though I suppose it's still possible that the driver converts these to constants that are fetched :???:
I'm considering buying a 5750 or 5770, which I just saw some local stores selling at reasonable prices. However, I'm not sure if it's possible to have both a GTX 285 and a 5750 on my motherboard (it has 3 PCIe x16 slots, but I'm worried about driver issues).
Only HD5800 cards have double precision support - that might be an issue for you.
Jawed
CarstenS
19-Oct-2009, 19:02
Holy Cow!
With the new version I'm getting (with my G80 test image) 9720ish ms using compute shader and - brace for impact - 3820ish when using pixelshader via -p2 switch.
Not using the -b switch degrades performance to 11.6 vs 6.2 seconds (cs, ps).
With the default image, I'm getting 110 vs. 290 ms (again: ps, cs).
WTH? :)
willardjuice
19-Oct-2009, 20:12
I'm not sure if it's possible to have both GTX 285 and 5750 on my motherboard though
You need Windows 7 (well anything but Vista), but the combination works (I have both R700 and G92 running simultaneously on my machine). Be warned, PhysX no longer works in this configuration, not sure if that's a "deal breaker ladies" for you. :razz:
Yes, e.g.:
t: MULADD_e R4.w, PV427.x, (0x3CFB91A7, 0.03070910089f).y, R5.w VEC_021
The VLIW instruction grows in length to accommodate such literals.
That's interesting. This is really worth trying.
4.5.2 of the PTX guide:
http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf
says that literals are fine, e.g.:
mov.f32 $f3, 0F3f800000;
so I would hope that the driver compiler can take the literals in the D3D assembly and make them literals in the PTX. Though I suppose it's still possible that the driver converts these to constants that are fetched :???:
I'm not sure about this, but IIRC when I write CUDA programs, the constants in programs (i.e. literal constants) are all converted into a constant area in the cubin file. So even if you don't use any constant array, it still has a constant area in the cubin file.
Only HD5800 cards have double precision support - that might be an issue for you.
Yeah, that could be an issue.
Not using the -b switch degrades performance to 11.6 vs 6.2 seconds (cs, ps)
Since your test image is 8192x8192, that means the written file is pretty big, like 256MB. So it would take a few seconds to write that. Basically the speed difference between the old shader and the new shader is proportional.
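As a rough check of the 256MB figure, here is a small sketch. It assumes 4 bytes per pixel, since the thread does not say which BMP pixel format is written, and it ignores the small file header. The function name is hypothetical.

```python
def raw_image_bytes(width: int, height: int, bytes_per_pixel: int) -> int:
    """Size of the uncompressed pixel data, without any file header."""
    return width * height * bytes_per_pixel


MIB = 1024 * 1024
size_32bpp = raw_image_bytes(8192, 8192, 4) // MIB  # 4 bytes per pixel (assumed)
size_24bpp = raw_image_bytes(8192, 8192, 3) // MIB  # 3 bytes per pixel, for comparison
print(size_32bpp, size_24bpp)
```

With 4 bytes per pixel the pixel data alone comes to 256 MiB. A 24-bit BMP would be 192 MiB, which is still several seconds of disk writes on a typical drive of that era.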
You need Windows 7 (well anything but Vista), but the combination works (I have both R700 and G92 running simultaneously on my machine). Be warned, PhysX no longer works in this configuration, not sure if that's a "deal breaker ladies" for you.
Well, I just want to write GPGPU programs for both on the same machine. I don't like having multiple machines hanging around, or having to switch boards frequently :) However, if I have to buy a 5850 or 5870 there are two problems: first, they are still quite expensive; second, my power supply is probably not big enough to handle both a GTX 285 and an RV870 XD
I'll have to think about it a bit more thoroughly :)
[EDIT] I thought of another issue: OpenCL. Right now, the beta OpenCL driver from NVIDIA puts a file named OpenCL.dll into the system directory. I suspect that sometime in the future AMD is going to do the same thing. That seems to mean that if you have cards from both vendors, their OpenCL implementations may conflict on the same machine. Last I heard, the vendors were discussing a solution, but I haven't heard anything since. It'd be nice to have a way to integrate OpenCL implementations from multiple vendors on the same machine on Windows, just like in Snow Leopard.
CarstenS
20-Oct-2009, 08:39
Since your test image is 8192x8192, that means the written file is pretty big, like 256MB. So it would take a few seconds to write that. Basically the speed difference between the old shader and the new shader is proportional.
Sorry, I think I did not express it clearly enough: Both times were with the b-switch but different shader-types. The slower would run the CS, the faster the PS.
trinibwoy
21-Oct-2009, 15:14
Using:
http://developer.amd.com/gpu/shader/Pages/default.aspx
Do you need AMD hardware and Catalyst installed to use the shader analyzer or is it standalone?
I'm reasonably sure it's stand-alone as it contains historical versions of the compiler from the different Catalyst versions. I don't have a system without ATI installed, to test on, though.
It says "Requirements: Windows XP or Vista, Microsoft DirectX SDK (April 07 or later). " I don't remember installing the DirectX SDK - that might be something that any system with a reasonably up-to-date DirectX 9.0 install has. Dunno.
Jawed
I can run it fine on my system with only GTX 285 installed, so you don't need ATI's hardware installed to run it.
trinibwoy
21-Oct-2009, 15:44
Oh cool, thanks guys. Will have to check it out sometime.
I've finally made some time to port the NLM denoise algorithm to OpenCL. The major point here is to use shared memory to reduce memory reads, rather than using texture cache.
However, for some reason, the kernel can't be run with more than 16x16 work items per work group on my GTX 285. I suspect that the compiler may have used too many registers. The performance is therefore not great. On my GTX 285 it takes around 450 ms to run (including allocating buffers and copying data to/from the GPU; the actual kernel run time is about 435 ms). I think it could be better if it were possible to use more work items, such as the originally planned 32x16.
Unfortunately right now the CUDA 3.0 beta toolkit doesn't contain an OpenCL visual profiler (there are links in the Start menu but no files).
If anyone wants to try this, the binary can be downloaded here (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nlm_cl.rar). It should be compatible with AMD's OpenCL SDK.
Tim Murray
18-Nov-2009, 21:11
Unfortunately right now the CUDA 3.0 beta toolkit doesn't contain an OpenCL visual profiler (there are links in the Start menu but no files).
that seems like a mistake; I'll check on it when I'm not on vacation (Monday).
Crashes on HD5870 (900MHz GPU) and Cat 9.11 WHQL with Stream 2.0 SDK:
NLM denoise OpenCL version
Copyright(c) Ping-Che Chen
Width: 1600 Height: 1200
For test only: Expires on Sun Feb 28 00:00:00 2010
Find device: Cypress (Advanced Micro Devices, Inc.)
Time used: 480
Problem Event Name: APPCRASH
Application Name: nlm_cl.exe
Application Version: 0.0.0.0
Application Timestamp: 4b04657a
Fault Module Name: OpenCL.dll
Fault Module Version: 1.0.0.1
that seems like a mistake; I'll check on it when I'm not on vacation (Monday).
Thanks, looking forward to it :)
Crashes on HD5870 and Cat 9.11 WHQL with Stream 2.0 SDK:
That's weird. Does it produce a good output file? Since "Time used" is printed, it seems like the denoise operations are already done. It could therefore only crash when writing files or releasing OpenCL contexts. As it crashed in OpenCL.dll I suspect the latter.
I uploaded a new binary here (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nlm_cl.rar) (and source here (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nlm_cl_src.rar)). This time it has a new kernel for non-vectorized architectures (such as NVIDIA's devices). It's too bad that OpenCL does not support float3. The "de-vectorized" version runs faster on my GTX 285, around 375 ms. The new version also has a -p switch to enable profiling mode.
That's weird. Does it produce a good output file? Since "Time used" is printed, it seems like the denoise operations are already done. It could therefore only crash when writing files or releasing OpenCL contexts. As it crashed in OpenCL.dll I suspect the latter.
Yep, it does, I forgot about it, but the output BMP is quite blurry -- should it be that way?
I uploaded a new binary here (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nlm_cl.rar) (and source here (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nlm_cl_src.rar)). This time it has a new kernel for non-vectorized architectures (such as NVIDIA's devices). It's too bad that OpenCL does not support float3. The "de-vectorized" version runs faster on my GTX 285, around 375 ms. The new version also has a -p switch to enable profiling mode.
This one outputs complete stat's, but the app error msg remains:
NLM denoise OpenCL version
Copyright(c) Ping-Che Chen
Sigma: 0.02
Width: 1600 Height: 1200
For test only: Expires on Sun Feb 28 00:00:00 2010
Find device: Cypress (Advanced Micro Devices, Inc.)
Setup time: 0.28 s
Denoise time: 0.48 s
Write file time: 0.02 s
Yep, it does, I forgot about it, but the output BMP is quite blurry -- should it be that way?
If it's very blurry then something is wrong. The result image should be very similar to the original, only slightly blurrier (assuming sigma is set to 0.02). Basically, only the noise should be "blurred out."
This one outputs complete stat's, but the app error msg remains:
I guess I'll have to try it at home where I have a Radeon 4850.
I also tested this on a GeForce 8800GT, where the run time is about 735 ms.
[EDIT] I just manually unrolled the innermost loop, and the run time on the GeForce 8800GT decreased to 516 ms (kernel time ~493 ms). Unfortunately there are no unroll directives in OpenCL. Maybe I can write something to generate the unrolled program at run time.
[EDIT2] I found that there's an OpenCL visual profiler in the 32-bit CUDA toolkit. It seems that it's only absent from the 64-bit toolkit.
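Generating the unrolled program at run time can be as simple as string substitution on the kernel source before it is handed to the OpenCL compiler. The sketch below is only an illustration of that idea: the function name, the `DX` placeholder and the loop body are all hypothetical, since the real kernel's statements are not shown in the thread.

```python
def unroll(body_template: str, var: str, start: int, stop: int) -> str:
    """Emit one copy of body_template per iteration, with var replaced by its value."""
    lines = []
    for i in range(start, stop):
        lines.append(body_template.replace(var, str(i)))
    return "\n".join(lines)


# Hypothetical inner-loop body; the real kernel code is not shown in the thread.
body = "sum += diff(px, py, qx + (DX), qy) * gaussian[row * 7 + (DX)];"
source = unroll(body, "DX", 0, 7)
print(source)
```

The generated text would be spliced into the kernel string in place of the original `for` loop, so the trip count can be chosen per device without maintaining several hand-unrolled copies.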
I uploaded a new version, which contains loop-unrolling code (the program automatically generates the unrolled code). It also detects and adjusts the work group size automatically, ranging from 32x32 (1024 work items) down to 1x1. There is also a "force using CPU" option, but currently there is no kernel designed for running on a CPU. I also included an executable compiled with ATI's SDK; maybe it'll be more compatible with ATI's OpenCL implementation.
My GeForce GTX 285 runs the unrolled version with 32x16 work items and takes 236 ms (kernel time 213 ms). Now the performance is close to the compute shader version, but still not there yet. I tried unrolling the loop more, but that takes the number of work items down to 128 and the performance is actually worse that way.
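One plausible way to do the automatic adjustment is to start at 32x32 and halve one dimension at a time until the work group fits the per-kernel limit. The thread does not show how the program actually does it, so the halving order and the function name below are assumptions. In a real host program the limit would come from `clGetKernelWorkGroupInfo` with `CL_KERNEL_WORK_GROUP_SIZE`.

```python
def pick_work_group(max_items: int, start=(32, 32)):
    """Shrink a 2D work group by halving until width * height <= max_items."""
    w, h = start
    while w * h > max_items and (w > 1 or h > 1):
        if h >= w:
            h //= 2   # halve the height first when the group is square
        else:
            w //= 2
    return w, h


for limit in (1024, 512, 256, 64):
    print(limit, pick_work_group(limit))
```

With this order a limit of 512 gives 32x16 and a limit of 256 gives 16x16, which lines up with the sizes mentioned for the GTX 285 before and after the register usage went down.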
Now the vectorized version is also unrolled, and I think it'd run better on ATI's hardware.
I also tried it on my Mac mini, which has a lowly GeForce 9400M. It takes about 3990 ms to run. I suspect a CPU optimized version of NLM probably takes roughly the same time to run on the Mac mini (which has a 2.0GHz Core 2 Duo).
Executable (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nlm_cl.rar)
Source (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nlm_cl_src.rar)
It's all fine now, boss. :lol:
Sigma: 0.02
Width: 1600 Height: 1200
For test only: Expires on Sun Feb 28 00:00:00 2010
Find device [0]: Cypress (Advanced Micro Devices, Inc.)
Work group size: 256
Profile enabled. Kernel time: 422820194 ns
Setup time: 0.63 s
Denoise time: 0.48 s
Write file time: 0.05 s
Thanks. Is the ATI executable necessary now, or are both the same?
I also ran it on a GeForce 8800GT:
NLM denoise OpenCL version
Copyright(c) Ping-Che Chen
Sigma: 0.02
Width: 1600 Height: 1200
Find device [0]: GeForce 8800 GT (NVIDIA Corporation)
Work group size: 256
Profile enabled. Kernel time: 425196960 ns
Setup time: 1.64 s
Denoise time: 0.453 s
Write file time: 0.047 s
I ran it with the OpenCL visual profiler (32-bit version): there is no warp serialization (every thread reads from a different bank at the same time), there are few divergent branches (only at the "load image" stage, I believe), and all loads/stores in the denoise kernel are coalesced. The only problem is the low occupancy rate (0.333); however, that's mostly because the number of work items is limited by register usage. Also, I don't think low occupancy is a serious problem here, because the kernel does not actually spend much time reading/writing global memory, so there's very little latency to hide.
Replacing exp with native_exp makes the kernel a bit faster (416 ms vs 425 ms), but the result images differ, with many pixels differing by more than 1, and that's outside of my "comfort" zone, so I decided not to use this optimization. (Using half_exp is as fast as exp on the GeForce 8800GT, with the same result image.)
On Radeon HD 4850:
NLM denoise OpenCL version
Copyright(c) Ping-Che Chen
Sigma: 0.02
Width: 1600 Height: 1200
For test only: Expires on Sun Feb 28 00:00:00 2010
Find device [0]: ATI RV770 (Advanced Micro Devices, Inc.)
Work group size: 64
Profile enabled. Kernel time: 1226922460 ns
Setup time: 2.719 s
Denoise time: 1.505 s
Write file time: 0.028 s
It's better than I expected since it does not have real shared memory AFAIK. The result image is a bit weird though.
[EDIT] I have found a few bugs which are the reasons behind the weird results.
The first bug is related to ATI's implementation. Apparently it does not actually put the constant values in the declared array (i.e. the __constant float gaussian[49] = { ... }; is not filled with the actual values, but with zeros). That's why the resulting image is very blurry. A workaround is to pass the __constant array as a kernel parameter and allocate a memory object for it. This method works on NVIDIA's hardware too and there doesn't seem to be any performance hit.
Another bug is in my shader, which failed to fill the shared memory completely when the number of work items is too small (namely less than 16x16). This does not affect the result when the number of work items is 16x16 (256) or more.
The fixed version takes basically the same time to run on my Radeon 4850 but with correct results.
There also doesn't seem to be any need for a separate ATI executable anymore. I suspect the original crash was related to an incorrect retain/release implementation for OpenCL contexts.
CarstenS
22-Nov-2009, 10:13
Did I get that right? Denoise is faster on 8800GT than on Cypress? Or did you two use different pictures?
Did I get that right? Denoise is faster on 8800GT than on Cypress? Or did you two use different pictures?
The image is the same. I don't know the exact problem, but considering it takes a shared-memory-less 4850 about 1.2 seconds to run, it should be much faster on a 5870, instead of just about thrice as fast. I suspect that there could be some problems in the shared memory access pattern or something else. For example, the access pattern in the shader is designed to be bank conflict free on NVIDIA's GPU. However, I don't know how Cypress' shared memory is arranged.
Also I've written a CPU specific shader for CPU devices, and it runs very slowly with current AMD's OpenCL implementation. I'll have to compare it with Apple's implementation later, but I suspect that Apple's implementation should be much faster.
trinibwoy
23-Nov-2009, 16:21
For example, the access pattern in the shader is designed to be bank conflict free on NVIDIA's GPU. However, I don't know how Cypress' shared memory is arranged.
Egad, this sorta stuff is just asking for shader replacement.
Replacing exp with native_exp makes the kernel a bit faster (416 ms vs 425 ms), but the result images differ, with many pixels differing by more than 1, and that's outside of my "comfort" zone, so I decided not to use this optimization. (Using half_exp is as fast as exp on the GeForce 8800GT, with the same result image.)
Is that the difference between running it on the SFU (native_exp) and ALU?
Good work by the way :grin:
Egad, this sorta stuff is just asking for shader replacement.
I think there shouldn't be a bank conflict on shared memory on Cypress. After all, the access pattern is simple. I just make sure that different threads don't read from the same column. I guess that would suffice for Cypress too, though I'd like to know more details. I think this deserves further investigation.
Is that the difference between running it on the SFU (native_exp) and ALU?
I don't remember seeing a difference on my 4850 though (including performance). The exp operation is outside of the innermost loop, so it's only called 49 times for each pixel.
Also I ran the CPU shader on my Mac mini, and it's slow too. The shader takes like 27 seconds to run on Mac mini (Core 2 Duo 2.0GHz). It takes about the same time to run with ATI's implementation on my Core 2 Duo 3.0GHz. Although, ATI's implementation is in beta and it's pretty old. I think there are still many optimization opportunities.
Unfortunately, this also means that in current state it's not possible to get reasonable performance from CPU implementations. It could change though, although right now GPU implementations are much faster.
I did some further experiments on some of my ideas. One of them is to remove the conversions in the inner loops, since they look redundant. However, simply changing the shared memory type from uchar4 to float4 won't work because bank conflicts will definitely kill performance.
Therefore, I decided to de-vectorize everything, making three float arrays. This shouldn't affect performance much because NVIDIA's GPUs are already scalar in nature. The first attempt didn't go very well because it used more registers and the number of work items was cut in half. I used the OpenCL visual profiler and it showed that each work item uses 33 registers, just one too many.
So I changed the kernel to reduce register usage and managed to get it down to 32 registers, so the number of work items remains the same (256 for each work group). However, performance is still worse (about 480 ms in the kernel instead of the original 416 ms on an 8800GT). The number of instructions also increases. A possible reason is that reading from shared memory is not as fast as I thought.
However, some tricks used during these experiments could be useful, such as turning the two-dimensional array into a one-dimensional one. Although if the OpenCL compiler is any good it should already remove a lot of redundant computations.
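The "one register too many" effect and the 0.333 occupancy reported by the profiler both fall out of simple arithmetic, assuming the 8800GT (compute capability 1.1) has 8192 registers and at most 768 resident threads per multiprocessor. The sketch below ignores shared memory limits and register allocation granularity, and the function name is made up for illustration.

```python
def occupancy(regs_per_item, items_per_group,
              regs_per_sm=8192, max_threads_per_sm=768):
    """Fraction of the multiprocessor's thread slots filled, limited by registers."""
    by_registers = regs_per_sm // (regs_per_item * items_per_group)
    by_threads = max_threads_per_sm // items_per_group
    groups = min(by_registers, by_threads)
    return groups * items_per_group / max_threads_per_sm


print(occupancy(32, 256))  # one 256-item group fits: 256 / 768
print(occupancy(33, 256))  # 33 * 256 > 8192, so a 256-item group does not fit
print(occupancy(33, 128))  # falls back to 128 work items per group
```

At 32 registers a 256-item work group uses exactly 8192 registers, so one group is resident and occupancy is 256/768. At 33 registers the group no longer fits and the work group size has to drop to 128.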
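The two-dimensional to one-dimensional trick amounts to computing the offset by hand, so that stepping along a row becomes a plain increment of a running index. A minimal sketch, with hypothetical tile dimensions and names:

```python
TILE_W, TILE_H = 22, 22  # hypothetical tile size, not taken from the kernel


def flat_index(x: int, y: int, row_stride: int) -> int:
    """Offset of element (x, y) in a row-major one-dimensional array."""
    return y * row_stride + x


tile2d = [[y * 100 + x for x in range(TILE_W)] for y in range(TILE_H)]
tile1d = [value for row in tile2d for value in row]

# Walking a row only needs base + dx, so the multiply is done once per row.
base = flat_index(0, 5, TILE_W)
row5 = [tile1d[base + dx] for dx in range(TILE_W)]
print(row5 == tile2d[5])
```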
Since AMD updated the Stream SDK and that requires some changes in the program to make it work, I updated this NLM denoise program so it works with AMD's OpenCL now. A new option is added, "-platform" allows the user to select which OpenCL platform to use (right now most systems have only one platform).
Executable (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nlm_cl.rar)
Source (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nlm_cl_src.rar)
I tried AMD's CPU implementation on my Core i7 920, and it takes about 11.2 seconds to denoise the image. Note that it's not exactly fair because for this kind of image processing it's generally possible to use integers to do most calculations and that could be much faster on CPU.
Radeon HD 5870 @ 900/5000:
NLM denoise OpenCL version
Copyright(c) Ping-Che Chen
Sigma: 0.02
Width: 1600 Height: 1200
Platform [0]: ATI Stream
Select platform 0
Find device [0]: Cypress (Advanced Micro Devices, Inc.)
Work group size: 256
Setup time: 0.89 s
Denoise time: 0.47 s
Write file time: 0.03 s
Q9450 @ 3608MHz:
NLM denoise OpenCL version
Copyright(c) Ping-Che Chen
Sigma: 0.02
Width: 1600 Height: 1200
Platform [0]: ATI Stream
Select platform 0
Find device [0]: Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz (GenuineIntel)
Work group size: 32
Setup time: 0.22 s
Denoise time: 12.232 s
Write file time: 0.02 s
256 size of the work group -- would be nice to see a parameter switch for this to set other values, or just hard-code it to 64 anyway. ;)
The output image is looking crisp and sharp this time.
Since this program uses local (shared) memory to reduce global memory bandwidth requirement, the work group size should be as large as possible. However, this program is also pretty computation intensive, so after a certain size it doesn't matter anymore. Anyway, for the sake of scientific experiments :) I added an option for forcing the work group size (of course, it still can't exceed the maximum value allowed by the OpenCL implementation, and it has to be a power of 2).
I also modified the program to use directly initialized constant data for better compatibility with Apple's OpenCL. Previously I did not use this because ATI's old implementation had a problem with it. With ATI's new Stream SDK I believe this bug has been corrected, so I changed it back (I'm not completely sure about this, so this may need some tests).
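The two constraints on a forced size (power of 2, not above the implementation's maximum) can be checked in a couple of lines. This is only a sketch of the check; the function name is hypothetical and the program's actual option handling is not shown in the thread.

```python
def valid_forced_size(requested: int, device_max: int) -> bool:
    """True if requested is a power of two and does not exceed device_max."""
    is_power_of_two = requested > 0 and (requested & (requested - 1)) == 0
    return is_power_of_two and requested <= device_max


print(valid_forced_size(64, 256), valid_forced_size(96, 256), valid_forced_size(512, 256))
```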
A MacOS X executable is also in the new file.
On my Mac mini (GeForce 9400) it takes 4.04 seconds to denoise the image. Interestingly, on the first run, the GPU may not be clocked up yet and it would take about 7.7 seconds to run.
Yup, it's slightly faster with 64 wg size: 0.47 vs. 0.42~0.40 sec.
Arnold Beckenbauer
05-Jan-2010, 21:42
C:\Users\Denis x64\Downloads\nlm_cs(3)>nlm_cs img_0025.jpg c.bmp
NLM denoise using DirectX 11
Ping-Che Chen
sigma: 0.02
Using compute shader.
Setup time: 46ms
Load file time: 63ms
Denoise and write file time: 702ms
C:\Users\Denis x64\Downloads\nlm_cs(3)>cd C:\Users\Denis x64\Downloads\nlm_cl
C:\Users\Denis x64\Downloads\nlm_cl>nlm_cl img_0025.bmp k.bmp
NLM denoise OpenCL version
Copyright(c) Ping-Che Chen
Sigma: 0.02
Width: 1600 Height: 1200
Platform [0]: ATI Stream
Select platform 0
Find device [0]: ATI RV770 (Advanced Micro Devices, Inc.)
Work group size: 64
Setup time: 2.683 s
Denoise time: 1.545 s
Write file time: 0.031 s
Note that although RV770 now supports DirectCompute, the old nlm_cs still runs faster in the second pixel shader path (use -p2 option) as the DirectCompute path is not updated with later optimizations.
RV770 is going to be slow with the OpenCL version because the major point of the OpenCL version is to use shared memory to reduce memory bandwidth, but RV770 doesn't support shared memory in OpenCL.
I also tried a few loop unrolling on my GTX 285, but they don't fare very well (NVIDIA's OpenCL compiler, for some reason, uses a lot of registers and even private memory for the doubly unrolled code). I suspect that AMD's GPU may be better at handling these, so I'm going to make a doubly unrolled version for AMD's code path later.
Arnold Beckenbauer
06-Jan-2010, 15:57
Note that although RV770 now supports DirectCompute, the old nlm_cs still runs faster in the second pixel shader path (use -p2 option) as the DirectCompute path is not updated with later optimizations.
....
C:\Users\Denis x64\Downloads\nlm_cs(3)>nlm_cs -p2 img_0025.jpg r.bmp
NLM denoise using DirectX 11
Ping-Che Chen
sigma: 0.02
Using pixel shader.
Setup time: 45630ms
Load file time: 130ms
Denoise and write file time: 340ms
C:\Users\Denis x64\Downloads\nlm_cs(3)>nlm_cs -p img_0025.jpg r.bmp
NLM denoise using DirectX 11
Ping-Che Chen
sigma: 0.02
Using pixel shader.
Setup time: 80ms
Load file time: 70ms
Denoise and write file time: 800ms
C:\Users\Denis x64\Downloads\nlm_cs(3)>
:?:
The -p2 path is a heavily unrolled path which also uses private memory to reduce redundant texture fetch, so it's faster than the -p path. The -p path is basically just the same as the DirectCompute path.
Arnold Beckenbauer
06-Jan-2010, 17:59
The setup time confused me.
For some reason, Microsoft's HLSL compiler takes a huge amount of time to compile a heavily unrolled shader. So this program stored the compiled binary to avoid compiling it again. :)
prunedtree
10-Jan-2010, 19:22
I might be able to give some perspective to the optimizations you are experimenting with:
I'm afraid your kernel is an order of magnitude slower than it needs to be (algorithmically).
As a proof of concept, it's possible to achieve the same results using a quite simple implementation (C++, single-threaded) in mere seconds.
Why did you choose to do so many redundant computations? Did you expect the algorithm to be less ALU-bound than it is?
Why did you choose to do so many redundant computations? Did you expect the algorithm to be less ALU-bound than it is?
Oh, it's just because it came from an earlier pixel shader based kernel (not DX11), so it can't do inter-pixel operations, unfortunately.
With OpenCL or DX11 DirectCompute it's possible to do so by saving the Gaussian differences to avoid redundant computation, I think. However, the current OpenCL kernel was supposed to be a comparison to the pixel shader version to see whether using local memory (shared memory in CUDA) can be helpful :P
A GPU's local memory is not big enough to store all Gaussian differences, so they would have to go to global memory. This could be another interesting experiment, thanks for the suggestion :)
[EDIT] After some considerations I think using global memory is a bad idea. I have some ideas on how to put them into local memory, though. I'll get back to this when I have more free time.
prunedtree
11-Jan-2010, 15:37
Indeed, pixel shaders are quite limited. However, by doing lots of redundant computation, it's easy to end up making GPUs look artificially better than CPUs.
I don't see why this couldn't run fine using global memory. You can cut the amount of computation by an order of magnitude. If you end up with a memory bandwidth bottleneck, then that means you must write and read 7x7 fp32 values per pixel, about ~800 MB of data (of course you don't need that much memory, tiling is natural here) for a 2 megapixel image, thus taking less than 10 ms on HD4870 for instance.
Note that unorm8 is actually far enough (there is no significant difference) so even HD5870 can be ALU-bound on this in practice.
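The ~800 MB and "less than 10 ms" figures can be reproduced with a few lines of arithmetic. The sketch assumes a 2,000,000-pixel image, one write plus one read of each cached value, and the HD 4870's published peak memory bandwidth of 115.2 GB/s. Real transfers would not reach the peak figure, and the function name is made up.

```python
def traffic_bytes(pixels: int, window: int = 7, bytes_per_value: int = 4) -> int:
    """Bytes moved when every pixel writes and later reads window*window values."""
    return pixels * window * window * bytes_per_value * 2  # write + read


PIXELS = 2_000_000
HD4870_BYTES_PER_SECOND = 115.2e9  # peak theoretical bandwidth (assumption)

fp32_bytes = traffic_bytes(PIXELS)                       # fp32 values
unorm8_bytes = traffic_bytes(PIXELS, bytes_per_value=1)  # unorm8 values
fp32_ms = fp32_bytes / HD4870_BYTES_PER_SECOND * 1e3
print(fp32_bytes, unorm8_bytes, round(fp32_ms, 2))
```

That gives 784 MB of traffic and roughly 6.8 ms at peak bandwidth for fp32, and a quarter of that traffic for unorm8.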
Indeed, pixel shaders are quite limited. However, by doing lots of redundant computation, it's easy to end up making GPUs look artificially better than CPUs.
Well, my CPU implementation (I didn't post that) performs NLM on YUV color space, so the redundant computation is basically just a subtraction and multiplication ((a - b) * (a - b)). So it doesn't make much sense saving these (they are done using SSE int). Technically the CPU version does less work and is less accurate (the gaussian is stored as 16 bit fixed number and the exp is done by table lookup) yet it's still slower than the "redundant" GPU version.
On the GPU version it's different because the redundant computation is a dot3 of the differences of two vectors, which is more expensive than a simple subtraction and multiplication.
I don't see why this couldn't run fine using global memory. You can cut the amount of computation by an order of magnitude. If you end up with a memory bandwidth bottleneck, then that means you must write and read 7x7 fp32 values per pixel, about ~800 MB of data (of course you don't need that much memory, tiling is natural here) for a 2 megapixel image, thus taking less than 10 ms on HD4870 for instance.
The problem is that for each pixel a different Gaussian coefficient is used on the same dot3, so using global memory may cause memory access pattern problems. And it's not just 7x7 fp32, it's 7x7x7x7 for each pixel. Also, read/writing global memory is probably not much cheaper than simply redoing the dot3.
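To make the 7x7x7x7 count concrete, here is a plain reference sketch of the per-pixel work in the straightforward formulation: 49 candidate pixels, each compared over a 49-tap Gaussian-weighted patch. It is not the author's kernel. The Gaussian width, the `h` parameter and the clamp-to-edge handling are assumptions, and `h=0.02` merely borrows the program's default sigma number, since the thread does not show how sigma enters the weight.

```python
import math

R = 3  # radius 3 gives the 7x7 search window and the 7x7 patch


def make_gauss(std=1.5):
    """Normalized 7x7 Gaussian patch weights; std is a made-up value."""
    g = [[math.exp(-(dx * dx + dy * dy) / (2.0 * std * std))
          for dx in range(-R, R + 1)] for dy in range(-R, R + 1)]
    total = sum(map(sum, g))
    return [[v / total for v in row] for row in g]


def nlm_pixel(img, x, y, h, gauss):
    """Denoise one pixel of img (rows of (r, g, b) floats); also count dot3s."""
    height, width = len(img), len(img[0])

    def px(xx, yy):  # clamp-to-edge lookup
        return img[min(max(yy, 0), height - 1)][min(max(xx, 0), width - 1)]

    acc = [0.0, 0.0, 0.0]
    wsum = 0.0
    dot3_count = 0
    for sy in range(-R, R + 1):
        for sx in range(-R, R + 1):          # 7x7 candidate pixels
            dist = 0.0
            for ty in range(-R, R + 1):
                for tx in range(-R, R + 1):  # 7x7 patch taps per candidate
                    a = px(x + tx, y + ty)
                    b = px(x + sx + tx, y + sy + ty)
                    dist += gauss[ty + R][tx + R] * sum(
                        (a[c] - b[c]) ** 2 for c in range(3))
                    dot3_count += 1
            w = math.exp(-dist / (h * h))    # one exp per candidate: 49 per pixel
            q = px(x + sx, y + sy)
            for c in range(3):
                acc[c] += w * q[c]
            wsum += w
    return [v / wsum for v in acc], dot3_count


flat = [[(0.5, 0.25, 0.125)] * 6 for _ in range(6)]
color, count = nlm_pixel(flat, 2, 3, h=0.02, gauss=make_gauss())
print(color, count)
```

On a flat image every weight is 1, so the pixel comes back unchanged, and the counter confirms 7^4 = 2401 dot3 evaluations for a single output pixel.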
I think it's worth a try using local memory to store the dot3.
[EDIT] I made a preliminary version using local memory to store the dot3. It runs about twice as fast as previous version on my GTX 285 (0.109 second vs 0.236 second). I didn't modify the OpenCL CPU version yet. I'll post this version later.
[EDIT2] New version uploaded. Executable (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nlm_cl.rar) Source (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nlm_cl_src.rar)
I also tested it on my Mac mini, which takes around 1.79 second to run, also roughly twice as fast as previous version.
prunedtree
11-Jan-2010, 18:23
And it's not just 7x7 fp32, it's 7x7x7x7 for each pixel. Also, read/writing global memory is probably not much cheaper than simply redoing the dot3.
Once you invert the two outer loops, the inner kernel is just a gaussian blur. Your implementation uses 3x7x7 mads, caching dot3s would bring it to ~7x7 mads, a separated gaussian blur would take ~2x7 mads. A 10.5x difference.
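A small sketch of what inverting the loops buys, under two assumptions: the patch weights are a true (separable) Gaussian, and a single channel stands in for the RGB dot3. For one fixed shift, the patch distance at every pixel is a Gaussian blur of the squared-difference image, so it can be computed with a 7-tap horizontal pass followed by a 7-tap vertical pass instead of 49 taps. All names are hypothetical.

```python
import math
import random

R = 3  # 7-tap kernel radius


def gauss1d(std=1.5):
    k = [math.exp(-(t * t) / (2.0 * std * std)) for t in range(-R, R + 1)]
    total = sum(k)
    return [v / total for v in k]


def make_lookup(img):
    height, width = len(img), len(img[0])
    return lambda x, y: img[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]


def direct_distance(img, g1, sx, sy):
    """Patch distance for one shift, 7x7 taps per pixel."""
    px, height, width = make_lookup(img), len(img), len(img[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for ty in range(-R, R + 1):
                for tx in range(-R, R + 1):
                    e = px(x + tx, y + ty) - px(x + tx + sx, y + ty + sy)
                    out[y][x] += g1[ty + R] * g1[tx + R] * e * e
    return out


def separable_distance(img, g1, sx, sy):
    """Same distance as a blur of the squared differences, 7 + 7 taps per pixel."""
    px, height, width = make_lookup(img), len(img), len(img[0])
    # Squared differences on a grid padded by R on every side.
    sq = [[(px(x, y) - px(x + sx, y + sy)) ** 2 for x in range(-R, width + R)]
          for y in range(-R, height + R)]
    # Horizontal pass over every padded row.
    hz = [[sum(g1[t + R] * row[x + R + t] for t in range(-R, R + 1))
           for x in range(width)] for row in sq]
    # Vertical pass producing the final distances.
    return [[sum(g1[t + R] * hz[y + R + t][x] for t in range(-R, R + 1))
             for x in range(width)] for y in range(height)]


random.seed(1)
image = [[random.random() for _ in range(6)] for _ in range(5)]
g = gauss1d()
direct = direct_distance(image, g, 2, -1)
separable = separable_distance(image, g, 2, -1)
max_err = max(abs(direct[y][x] - separable[y][x]) for y in range(5) for x in range(6))
print(max_err)
```

The two results agree to rounding error. The saving comes from each intermediate row sum being shared by the seven output rows that need it.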
Of course, you need to be able to work on reasonably large blocks to achieve sufficient amortization of the computations. For instance, on ATI hardware, you have enough RF space to process 8x8 values per lane (and only have about ~27% ALU overhead)
EDIT: You might want to use bigger kernels if you want to take into account CPU<->GPU transfers, otherwise, you may have dismal speedups (I believe a quad core could do this under 100 ms, as CPUs benefit massively from such optimizations with their caches)
Once you invert the two outer loops, the inner kernel is just gaussian blurs. Your implementation uses 3x7x7 mads, a naive implementation caching dot3s would take ~7x7 mads, a separated gaussian blur would take ~2x7 mads. A 10.5x difference.
Yeah, I've just done that in the updated version. It now takes around 0.084 s to run. It doesn't show a 10.5x difference because there is some overhead from thread synchronization in the kernel. Also, the time it takes to copy data between CPU memory and GPU memory now becomes significant. Judging from the OpenCL Visual Profiler, there seems to be a 1.5 ms overhead for launching a kernel or a memory copy operation.
[EDIT] I just tried the new kernel on my Mac mini, and strangely, it runs slower than the non-separated version (2.00 seconds vs 1.79 seconds). There is no warp serialization on the GTX 285 and the occupancy rate is not that bad either. Then I thought that maybe Apple's OpenCL compiler doesn't do loop unrolling, so I manually unrolled the two loops, and it's faster now (1.192 seconds). (NVIDIA's OpenCL compiler unrolls the loops automatically.)
prunedtree
12-Jan-2010, 03:18
BTW, I noticed I forgot to take into account symmetry (it's trivial to exploit if you invert the outer loops), thus the original kernel is doing about 21x more work than needed.
I suppose a really optimized implementation could end up taking under 10 ms, and at this point the GPU<->CPU transfer (and control overheads ?) could indeed be the bottleneck
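The 10.5x and 21x figures follow directly from the per-pixel, per-shift multiply-add counts quoted in the thread. The symmetry factor assumes the distance for shift s at pixel p can be reused for shift -s at pixel p+s, so only half of the shifts need computing.

```python
TAPS = 7

original = 3 * TAPS * TAPS  # dot3 at every patch tap: 147 mads
cached_dot3 = TAPS * TAPS   # dot3 computed once and reused: 49 mads
separable = 2 * TAPS        # horizontal + vertical 7-tap passes: 14 mads

ratio = original / separable  # 10.5
with_symmetry = ratio * 2     # 21.0
print(original, cached_dot3, separable, ratio, with_symmetry)
```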
BTW, I noticed I forgot to take into account symmetry (it's trivial to exploit if you invert the outer loops), thus the original kernel is doing about 21x more work than needed.
I suppose a really optimized implementation could end up taking under 10 ms, and at this point the GPU<->CPU transfer (and control overheads ?) could indeed be the bottleneck
I thought about the symmetry, but I haven't come up with an easy enough solution yet (the easiest way is to unroll the outer loop by two, but that seems to introduce more register usage with NVIDIA's compiler, and that's bad for occupancy; I'm still looking into that).
On the transfer overhead, I'm thinking about using zero-copy, i.e. let the GPU write the output image into pinned memory directly (since the output image is written only once). This may shave a few ms further down.
Once you invert the two outer loops, the inner kernel is just a gaussian blur.
An awfully compact method of explaining the concept (works better than with math).
PS. an option for an unweighted distance calculation would be nice too if this is implemented, would give another 7x reduction in calculations on top.
prunedtree
12-Jan-2010, 15:34
PS. an option for an unweighted distance calculation would be nice too if this is implemented, would give another 7x reduction in calculations on top.
Yes, that's why I mentioned that this was in essence a 7x7 kernel. A gaussian stencil on a regular grid can be executed in O(1) time (regardless of their size). However, I'm afraid you won't get much of a speedup in this case, compared to such a short truncated gaussian blur.
Yah, it's probably not worth it. AFAICS LDS is definitely worth it though, getting this bandwidth limited shouldn't be too hard with global memory (or rather texture sampler limited).
Personally I would implement this with a circular buffer consisting of 7 lines of cached dot3 RGB results (for a fixed shift, i.e. with inverted outer loops), 1 line to store the intermediate of the vertical gaussian filter, 1 line for the original pixel colors ... so 9 lines in total. Have to work on the image in strips of course.
No OpenCL capable card at the moment though :/
PS. actually a couple more lines so you can use async copies to hide memory access latencies, relying on multithreading to hide latency when everything is this predictable doesn't make much sense ... are the async copies in opencl efficient at the moment?
Personally I would implement this with a circular buffer consisting of 7 lines of cached dot3 RGB results (for a fixed shift, ie. with inverted outer loops), 1 line to store the intermediate of the vertical gaussian filter, 1 line for the original pixel colors ... so 9 lines in total. Have to work on the image in strips of course.
I've thought about using a circular buffer before, but I decided to use a simpler block instead. Although there are some redundant reads with blocks, it's easier to manage. Also, it would obviously be better to use an 8-line (or 16-line) circular buffer so it's easier to compute the circular index.
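A small sketch of the 8-line idea, in plain Python rather than kernel code. With a power-of-two line count the ring slot is a single AND instead of a modulo by 7. The class and function names are hypothetical, and it assumes lines are pushed in increasing order, one at a time.

```python
LINES = 8            # power of two; the filter itself only needs 7 lines
MASK = LINES - 1

def slot(y):
    """Ring slot for image line y; equals y % LINES for y >= 0."""
    return y & MASK

class LineRing:
    """Keeps the most recent LINES image lines, indexed by line number."""

    def __init__(self):
        self.lines = [None] * LINES
        self.newest = -1

    def push(self, y, data):
        # Overwrites whatever line previously lived in this slot.
        self.lines[slot(y)] = data
        self.newest = y

    def get(self, y):
        if not (self.newest - LINES < y <= self.newest):
            raise IndexError("line %d is not cached" % y)
        return self.lines[slot(y)]
```

A 7-line ring would need `y % 7`, which is where the extra integer work comes from on hardware without a cheap modulo.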
PS. actually a couple more lines so you can use async copies to hide memory access latencies, relying on multithreading to hide latency when everything is this predictable doesn't make much sense ... are the async copies in opencl efficient at the moment?
I haven't tried async copy before, so I am not sure about this. But GPU still has to rely on multi-threading to hide ALU latency and branch latency, so there still has to be a large enough number of threads.
Is there any list of instruction latencies for the GPUs? I assume that for the simple stuff like MADs the latency is already covered by the 4 cycle wavefront execution at least for ATI? (ie. effectively it's single cycle from the point of view of the shader.) Hell, 4 cycles at the relatively low clocks ATI runs should be enough to do branching too.
Is there any list of instruction latencies for the GPUs? I assume that for the simple stuff like MADs the latency is already covered by the 4 cycle wavefront execution at least for ATI? (ie. effectively it's single cycle from the point of view of the shader.) Hell, 4 cycles at the relatively low clocks ATI runs should be enough to do branching too.
To my understanding, NVIDIA's GPUs need at least 192 work items (i.e. 6 warps) per work group to cover all ALU latency. For AMD's GPU, one wavefront is good enough for covering ALU latency, so that means the number of work items in a work group should be at least 64.
rpg.314
12-Jan-2010, 19:52
Is there any list of instruction latencies for the GPUs? I assume that for the simple stuff like MADs the latency is already covered by the 4 cycle wavefront execution at least for ATI? (ie. effectively it's single cycle from the point of view of the shader.) Hell, 4 cycles at the relatively low clocks ATI runs should be enough to do branching too.
Nah. 4 cycles/warp is the ALU throughput, sorta like how SSEx used to be 64 bit before conroe. The logical SIMD width is 32 while the physical SIMD width is 8 on G80 and GT200, while the physical SIMD width has been upped to 16 in GF100.
There is a 22 cycle latency for ALUs on nv hw on gt200 and g80.
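The numbers quoted in this exchange fit together with a bit of arithmetic. This is only a back-of-the-envelope model, assuming a warp of 32 threads issued over 8 physical lanes (so 4 cycles per instruction per warp), every instruction depending on the previous one, and the scheduler rotating between warps. The function name is made up.

```python
import math

WARP_SIZE = 32        # logical SIMD width
PHYSICAL_LANES = 8    # physical SIMD width on G80 / GT200, per the post above
ALU_LATENCY = 22      # cycles, per the post above

def warps_to_hide_latency(latency_cycles, issue_cycles_per_warp):
    """Warps needed so the ALU always has another warp ready to issue
    while a dependent instruction is still in flight."""
    return math.ceil(latency_cycles / issue_cycles_per_warp)

issue_cycles = WARP_SIZE // PHYSICAL_LANES            # 4 cycles per warp
warps = warps_to_hide_latency(ALU_LATENCY, issue_cycles)
work_items = warps * WARP_SIZE
```

Under these assumptions `warps` comes out at 6 and `work_items` at 192, which is the figure mentioned earlier. Code with more independent instructions per thread needs fewer warps than this worst case.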
There is a 22 cycle latency for ALUs on nv hw on gt200 and g80.
Frankly a ridiculous amount of latency.
For AMD's GPU, one wavefront is good enough for covering ALU latency
All ALU instructions? (ie. including exp.)
All ALU instructions? (ie. including exp.)
Yeah, everything is "one cycle" on ATI. But one hardware thread will leave the ALUs idle for half the cycles, so the minimum is actually 2.
Any ALU instruction that uses LDS (on Evergreen only), constant-buffer or indexed registers runs the risk of stalling. Depends on the access pattern and contention.
Clause boundaries cost 40 cycles. This means that as soon as the program contains more than one clause, it'll need more than 2 hardware threads to hide latency (even if there's no texturing). So then it's all a matter of hiding the total program latency, whatever its source: clause switches or memory fetches and stores.
Jawed
To my understanding, NVIDIA's GPUs need at least 192 work items (i.e. 6 warps) per work group to cover all ALU latency.
Technically fewer will work. 6 is required to cover worst-case register read-after-write latency. If the code has more instruction-level parallelism then fewer hardware threads will work fine.
I haven't studied this thread:
http://forums.nvidia.com/index.php?showtopic=152828
MUL on the MI ALU (SF) complicates testing.
Jawed
Do async copies simply get turned into a texture clause only kernel?
What kind of async copy?
In OpenCL, async copy is either from global to local or vice-versa. I haven't looked at the ISA for this, but I suspect what happens (on Evergreen) is that the access to global memory is performed by a Control Flow instruction (i.e. a burst read or write). Access to local memory can only be effected by ALU instructions. So there's at least one ALU clause.
I'm not sure what you mean by async copy, though, because once the kernel ends local memory is invalid - and a kernel cannot safely inherit the status of local memory when it starts. At least, not under OpenCL.
Under CAL, maybe...
Jawed
async_work_group_copy ... ideally it would run as a separate kernel (hardware wise, not OpenCL wise) but I guess that's not going to happen.
async_work_group_copy ... ideally it would run as a separate kernel (hardware wise, not OpenCL wise) but I guess that's not going to happen.
It's very difficult to make this run as a separate kernel because of the synchronization requirements. Some hardware may be able to do this asynchronously (CELL's SPE is an obvious candidate; actually, this call seems to map well to the SPE's DMA call), but I don't know about current GPUs.
Some of NVIDIA's GPU can do concurrent kernel execution and memory transfer, but only when they are independent (i.e. a memory transfer request and a kernel execution request).
mhouston
12-Jan-2010, 23:45
On all GPU implementations I'm aware of, async_work_group_copy turns into a loop of texture/memory fetch operations. That functionality was put in for current and future architectures that have DMA engines attached to each core.
Convolution code is not very elegant in float4s if you want to avoid LDS bank conflicts is it?
Now that I've got my Radeon 5850 working, I decided to check on the nlm program again, to see why it runs slower on Cypress.
Even with all these optimizations, it still takes about 170ms to run the denoise kernel, compared to ~60ms on GTX 285. So I decided to do some more loop unrolling (there are two loops which are basically just for handling the corner cases). After unrolling one loop it takes around 140ms to run, and of course GTX 285 also gets faster, to around 40ms. Then I unrolled the second loop, and now it takes around 115ms to run (mysteriously, unrolling the second loop makes GTX 285 slower, at about 45ms).
Then I used the ATI Stream Profiler to look at some profiling counters for more hints. There are two major problems with the kernel. The first is the ALU packing rate, which is only around 60%. This is probably due to the fact that the convolution part is not packed. However, this is not easy to fix. The other problem is LDS bank conflicts, and they are quite significant, at around 63%. There are no shared memory bank conflicts on GTX 285, so I looked up the bank configuration of the LDS on Cypress.
According to AMD's document, the LDS has 32 banks. Since the work group size on Cypress is 16x16, there are probably many bank conflicts on the arrays. One way to handle this is to make the work group size 32x8 instead of 16x16. However, this would create more redundant computation on the Y axis. Another way is to make sure all local memory arrays (there are three of them) have strides congruent to 16 (mod 32). So I decided to "patch" the strides by hard-coding numbers into them (e.g. the stride of the 'shared' array was originally THREAD_X + KERNEL_OFFSET, which is 28 when THREAD_X is 16, so I added 20 to make it 48). With these strides patched, there are no more LDS bank conflicts. However, the result is only a little faster, at about 110ms.
Maybe it's time to write a DirectCompute 5.0 version using shared memory...
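The stride patch can be checked with a toy bank model in plain Python. This is a simplification, not a description of how Cypress actually schedules LDS accesses: it assumes 32 banks of 32-bit words, one word read per thread, and that two adjacent rows of a 16-wide work group (32 threads) hit the LDS together. The function name is hypothetical.

```python
from collections import Counter

BANKS = 32

def worst_bank_load(stride, threads_x=16, rows=2):
    """Largest number of simultaneous accesses landing on one bank when
    thread (x, y) reads word y * stride + x of a local array."""
    load = Counter((y * stride + x) % BANKS
                   for y in range(rows)
                   for x in range(threads_x))
    return max(load.values())

original = worst_bank_load(28)   # stride THREAD_X + KERNEL_OFFSET = 28
patched = worst_bank_load(48)    # 48 is 16 (mod 32)
```

In this model the original stride of 28 puts the second row on banks 28..31 and 0..11, colliding with the first row on banks 0..11 (`original` is 2), while a stride of 48 moves the second row to banks 16..31 so every access gets its own bank (`patched` is 1).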
Lightman
29-Jan-2010, 10:46
Yes, please do!
I'm looking forward to your findings! :grin:
I had another idea: since the 5850 doesn't seem to be bothered too much by LDS bank conflicts and it has a larger LDS, maybe it's a good idea to use a float4 shared array instead of using a uchar4 array and converting to float4 every time. GTX 285 doesn't like that idea too much, probably because its convert_float4 is quick and using too much shared memory decreases occupancy, which limits its ability to hide latency.
Anyway, so I tried to use float4 directly and it's faster! But only by a little :P Now the denoise kernel takes around 100 ms instead of 110 ms, i.e. 10% faster :)
EduardoS
29-Jan-2010, 18:52
Hum... Using float4 may cause bank conflicts, how about using two float2 arrays and joining them?
Hum... Using float4 may cause bank conflicts, how about using two float2 arrays and joining them?
Right now almost every shared memory access causes an LDS bank conflict :P
However, in an earlier version, removing bank conflicts did not improve performance much, so I suspect that splitting the array into multiple ones is probably going to be slower (this is important for GTX 285, though, because it does not like bank conflicts at all, so when I did that on GTX 285 I used three arrays, one each for R, G, and B).
[EDIT] I did the three arrays method to remove all bank conflicts, and it improves performance more than I thought. Now the denoise kernel takes about 92 ms to run, which is another 8% improvement over the ~100ms time. However, the ALU packing rate is now lower, at 53%.
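One plausible way to see why splitting a float4 array into per-channel arrays removes conflicts, again using a toy model in plain Python. It assumes 32 banks of 32-bit words, 32 threads accessing the LDS together, and each thread reading one 32-bit component at a time; real hardware may handle wide reads differently, so treat this as an assumption. Function names are made up.

```python
from collections import Counter

BANKS = 32
THREADS = 32

def worst_load_float4_array(component):
    """Thread t reads one component of element t in an array of float4:
    consecutive threads are 4 words apart."""
    load = Counter((t * 4 + component) % BANKS for t in range(THREADS))
    return max(load.values())

def worst_load_channel_array():
    """Thread t reads element t of a per-channel float array:
    consecutive threads are 1 word apart."""
    load = Counter(t % BANKS for t in range(THREADS))
    return max(load.values())
```

In this model the float4 layout only ever touches 8 of the 32 banks for a given component, so each of those banks is hit 4 times, while the per-channel layout spreads the same reads over all 32 banks.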
Lightman
29-Jan-2010, 19:39
At this rate you will get that filter under 50ms by tomorrow :lol:
Ok, I basically ported the OpenCL version to a compute shader version, using my old compute shader main program. So basically only the shader is changed (the main program also needs some changes to support CS 5.0). Unfortunately, it's pretty slow for some unknown reason. Apparently the compiler tries to unroll the outer loops, because it thinks "GroupMemoryBarrierWithGroupSync()" shouldn't be inside a loop (?)
It also takes an extremely long time to compile the shader (nearly three minutes on my Core i7 920).
The bad news is, the CS 5.0 version is not fast at all. It takes 746ms to run the shader. According to the GPU shader analyzer, the crazy unrolling has apparently caused the shader to use a large amount of scratch memory (i.e. values that don't fit in the register file have to live in global memory). That may be why it's so slow.
I also identified a redundant group memory barrier in the shader. However, it shouldn't have much impact on performance.
OpenGL guy
29-Jan-2010, 21:21
I think there's a compiler hint "[nounroll]" you can put before the loop in question to prevent the HLSL compiler from unrolling it. If that doesn't help, then it might be the driver's compiler that's unrolling the loop. Can you dump the DX tokens from GSA? That should give a hint where the unrolling is taking place.
I think there's a compiler hint "[nounroll]" you can put before the loop in question to prevent the HLSL compiler from unrolling it. If that doesn't help, then it might be the driver's compiler that's unrolling the loop. Can you dump the DX tokens from GSA? That should give a hint where the unrolling is taking place.
I put [loop] before the loops but the HLSL compiler complained about something like "synchronization operations can't be used in varying flow control."
OpenGL guy
29-Jan-2010, 21:51
I put [loop] before the loops but the HLSL compiler complained about something like "synchronization operations can't be used in varying flow control."
Ok so it sounds like it's unrolling the loop so that the sync operation (a barrier?) won't be inside flow control.
Ok so it sounds like it's unrolling the loop so that the sync operation (a barrier?) won't be inside flow control.
Yeah it's a GroupMemoryBarrierWithGroupSync(). I'm surprised that DirectCompute doesn't allow this in a predictable loop that every thread in the group goes through (both OpenCL and CUDA allow this).
Andrew Lauritzen
15-May-2010, 23:08
Ok I grabbed the code you posted in the other thread, removed the "optimization level 3" flag (faster compile, and you really don't want to use these flags anyways... counter-intuitive, but the less messing with the code the HLSL compiler does the better it is for the IHV backend compilers usually :)) and ran it on two similar but not identical systems:
ATI 5870:
sigma: 0.02
Using compute shader.
Setup time: 46ms
Load file time: 41ms
Denoise and write file time: 42ms
NVIDIA GTX 480:
sigma: 0.02
Using compute shader.
Setup time: 46ms
Load file time: 47ms
Denoise and write file time: 47ms
Similar results it seems. To really benchmark the kernel you'd probably want to put just the denoise kernel in a loop and average over a pile of frames, but if the purpose is just to get this single operation as fast as possible then the overheads are indeed just as important :)
Bit buried right now but I'll see if I get a chance to play around with the kernel at all. What sorts of speeds are you expecting/going for, or just "as fast as possible"? :)
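For the "put the kernel in a loop and average over a pile of frames" suggestion, a generic timing harness could look like the sketch below. It is plain Python and times an arbitrary callable `fn`, which is a stand-in and not part of the nlm program. For a real GPU kernel, `fn` would also have to wait for the device to finish before returning, otherwise only the launch is timed.

```python
import statistics
import time

def benchmark(fn, iterations=50, warmup=3):
    """Run fn repeatedly and summarise wall-clock time in milliseconds."""
    for _ in range(warmup):
        fn()                      # let caches, clocks and JITs settle
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }
```

Reporting the median alongside the mean keeps a few slow runs from dominating the result.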
There should be a -b option I think it is, for benchmarking only. Also if you use a nice big image (12-25MP, what you'd get from a digital camera) then you obviate any host<->device latencies.
I have a suspicion this should run in <25ms. pcchen was reporting 39ms on HD5850 and I think the texturing tweak I mentioned should improve things another ~30%+. Though the effect on Fermi is questionable.
Andrew Lauritzen
16-May-2010, 01:25
There should be a -b option I think it is, for benchmarking only. Also if you use a nice big image (12-25MP, what you'd get from a digital camera) then you obviate any host<->device latencies.
-b => 33ms on 5870. Should definitely iterate a pile of times with -b though as even run-to-run there are significant outliers. Will run it a bit later on Fermi and I'll see if I can find a huge image (any good links w/ noise so I can verify that it's working properly?).
I have a suspicion this should run in <25ms. pcchen was reporting 39ms on HD5850 and I think the texturing tweak I mentioned should improve things another ~30%+. Though the effect on Fermi is questionable.
Agreed. I see no reason that this shouldn't run quite quickly on both architectures.
I back-ported the modifications from the CS 5.0 version to the OpenCL version. On Radeon 5850, it now takes 57 ms to run the kernel (previously it was ~7x ms). On GTX 285, the same modification takes 35 ms to run (previously it was 39 ms). These times are reported by the OpenCL implementation for both kernels (the "rearrangement kernel" and the denoise kernel) instead of wall clock.
Source & executables (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/nlm_cl_src.rar)
[EDIT] I did some minor optimizations and now it takes 55 ms on 5850 and 30 ms on GTX 285.
prunedtree
05-Feb-2011, 19:09
I mentioned earlier in this thread I believe this could be quite faster on a CPU. Here's an implementation:
http://www.sendspace.com/file/48m6we
Note that this is a windows 64-bit binary. It also requires SSSE3.
Please test and report your results, along with CPU used.
Andrew Lauritzen
05-Feb-2011, 21:02
Please test and report your results, along with CPU used.
Core i7 940 on the image from pcchen's zip:
27.5641273466863 ms
69.6557513267627 Megapixel/sec
Core i3 530 on the image from pcchen's zip:
52.60545212719536 ms
36.49811801555491 Megapixel/sec
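The two figures the tool prints are tied together by the image size, so multiplying them is a quick sanity check when comparing results from different posts. A tiny sketch in Python, with a made-up function name:

```python
def megapixels(time_ms, megapixels_per_sec):
    """Image size implied by a reported time and throughput."""
    return time_ms / 1000.0 * megapixels_per_sec

i7_940 = megapixels(27.5641273466863, 69.6557513267627)
i3_530 = megapixels(52.60545212719536, 36.49811801555491)
```

Both results work out to about 1.92 megapixels, so they were measured on an image of the same size and can be compared by either number.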
Intel Q9550 @ 2.8GHz, pcchen's img_0025
** Non-local means 7x7 **
45.19541825253106 ms
42.48218235910396 Megapixel/sec
Stock C2Q 9400 (2.6GHz) img_0025
50.6933305787758 ms
37.874804793431 MP/s
Betanumerical
06-Feb-2011, 06:43
Stock E8500 (3.166GHz)
** Non-local means 7x7 **
84.15211030231343 ms
22.81582711476241 Megapixel/sec
Core i7-920 @ 3995MHz
** Non-local means 7x7 **
19.64221910888496 ms
97.74862958999923 Megapixel/sec
I mentioned earlier in this thread I believe this could be quite faster on a CPU. Here's an implementation:
http://www.sendspace.com/file/48m6we
Note that this is a windows 64-bit binary. It also requires SSSE3.
Please test and report your results, along with CPU used.
Why the cap at 16 threads? (24 threads here:))
Is the source code available?
prunedtree
08-Feb-2011, 19:30
Why the cap at 16 threads? (24 threads here:))
Is the source code available?
Why ? Laziness. I could fix that if someone asks for it, though.
The source code is not available yet.
EDIT: here's a version that will spawn twice as many threads as you have logical CPUs:
http://www.sendspace.com/file/j4sszn
Here's my results with a dual Xeon at 2.67 GHz, HT on:
Original version
** Non-local means 7x7 **
15.86520222584356 ms
121.0195730674282 Megapixel/sec
"Many threads"-version
** Non-local means 7x7 **
37.51848853979317 ms
51.1747694197114 Megapixel/sec
So there's a performance bug somewhere in the second version, at least with my 12-core system. With the original version I probably would have had better results with HT disabled.
prunedtree
11-Feb-2011, 11:41
Yeah it seems not to scale all that well for 2P systems. On a system with two X5550 this doesn't seem to get past 100 MP/s (it is a bit faster with the r4 though).
I didn't tweak multi-threading much as my goal was mainly to match Chen's GTX285 with an i7 920 (my back-of-the-envelope calculation showed I had a shot).
15 ms is practically twice as fast already ;)
If you are interested, I could try to make it scale better for > 4 cores, but that wasn't my original intent.
What I'm the most curious about now is Sandy Bridge performance.
Betanumerical
05-Mar-2011, 06:29
i know this reply is heaps late, but i thought id try this out on my new laptop
CPU: i7 - 2635QM (sandy bridge)
** Non-local means 7x7 **
28.53713568481275 ms
67.28075379414514 Megapixel/sec