Old 01-Oct-2009, 03:08   #1
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
NLM denoise in DX11

I've written a simple NLM (non-local means) denoise program using a DX11 compute shader (cs 4.0). It works on my GTX 285, but I don't know if it works on anything else...

Anyway, if anyone's interested it can be downloaded here. On my GTX 285 it took around 420 ms to denoise the sample image in the file.

For those interested in the source code (which is pretty boring), it can be downloaded here.

Note: it can only output BMP files right now.


EDIT: A slight modification improves the speed to around 280ms.
Old 07-Oct-2009, 09:36   #2
MDolenc
Member
 
Join Date: May 2002
Location: Slovenia
Posts: 420
Default

What driver are you using? I can't run any DirectCompute stuff on my machine, and it's a properly configured Vista 64 with the latest DX SDK and everything. Are you on Win7?
Old 07-Oct-2009, 10:57   #3
pcchen

I use the previous 190.62 driver on Windows 7 x64. I've heard that in order to enable compute shaders on Vista, a registry key has to be modified, but I don't know which one (I'll have to check NVIDIA's GPU Computing SDK for that). I don't know whether the latest driver (191.07) still needs that, though.

Also, the latest version takes around 250 ms on my GTX 285. Basically I changed the type of the colors from float4 to float3, which helps a bit on NVIDIA's scalar architecture.
Old 07-Oct-2009, 15:08   #4
pcchen

I checked the release notes, and they say that registry keys named "D3D_39482904" should be deleted (there are about 2 instances of them).
Old 07-Oct-2009, 20:23   #5
pcchen

I did some tuning on this and it now takes 223 ms to denoise on a GTX 285. I also modified the program to write the compiled shader into a file named shader.bin so it doesn't need to compile it again next time.

A friend tested it for me on a GeForce 9600GT and it takes 717 ms.

Right now, my shader relies on the texture cache to reduce memory bandwidth requirements. It's possible to use shared memory to do this, but it's much more complicated.
Old 08-Oct-2009, 04:50   #6
sc3252
Junior Member
 
Join Date: Jun 2008
Posts: 35
Default

Just tested it on my 5850. Are both images supposed to look the same?
output says
setup time: 561
Load file time: 47
Denoise and write file time:234
edit: never mind, it seems to work.
my 5850 is at 775 core and 1100 for memory.

Last edited by sc3252; 08-Oct-2009 at 05:22.
Old 08-Oct-2009, 06:27   #7
MDolenc

Thanks Chen! That helped.
Setup time: 553
Load file time: 113
Denoise and write file time: 290
On a GTX 280 with Vista 64 with 191.03 drivers.
Old 08-Oct-2009, 09:07   #8
pcchen

Quote:
Originally Posted by sc3252 View Post
Just tested it on my 5850. Are both images supposed to look the same?
They should be quite similar because this is just a denoise program. The source image is from an iPhone 3G's camera, which is quite noisy. After denoising, it becomes a little blurry but most of the random noise is gone. Of course, you can use your own images from other cameras, but the image has to fit inside a texture (which on current cards should be 4096x4096 or something).

Quote:
output says
setup time: 561
Load file time: 47
Denoise and write file time:234
edit: never mind, it seems to work.
my 5850 is at 775 core and 1100 for memory.
That seems to be OK. The kernel is not very vectorized (as you can see in the shader code), so it's not very well suited to RV870. I'll see what I can do when I get an RV870 in my home computer, or when a compute-shader-enabled driver for RV770 is publicly available.
Old 15-Oct-2009, 01:10   #9
pcchen

I have updated the program with support for a pixel shader code path, so now it should run on any video card with pixel shader 4.0 support (that is, DX10 feature level), although it still requires the DX11 runtime to run.

From what I've seen on my GTX 285, the performance is almost the same (since the shader is, well, almost the same). The only downside of the pixel shader path is that I used D3DX to write the texture directly to a BMP file, and it decided to write it in 32 bpp mode rather than 24 bpp mode, so the resulting file is larger.

Programming-wise, the pixel shader path was much more annoying than I anticipated, partly because I haven't used D3D10 for 3D rendering for quite a while, and D3D11 is apparently more complex than D3D10. Other than that, it doesn't seem to have any nasty surprises, which is a good thing.

I've tried a few ways to utilize shared memory to reduce the number of texture loads. However, that increases pressure on the ALUs, and at least on my GTX 285 the shader is already nearly ALU-limited, so performance is always worse. Maybe on RV870 it could be a different story.
Old 15-Oct-2009, 09:36   #10
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,821
Default

Code:
NLM denoise using DirectX 11
Ping-Che Chen
sigma: 0.02
Using pixel shader.
Setup time: 20ms
Load file time: 30ms
Denoise and write file time: 560ms
Radeon HD 4890 @ 950MHz GPU;
Win7 x86 (7600.20510), Cat 9.11b;
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
Old 15-Oct-2009, 10:12   #11
CarstenS
Senior Member
 
Join Date: May 2002
Location: Germany
Posts: 2,842
Default

Can someone - please! - re-upload a working archive for pcchen's program? Every time I try downloading it from the original link, I get an error message when trying to decompress the RAR archive (which is only a few kByte).

edit:
Never mind, it was only me trying to save the file via right-click - save as... - and that did not work.
__________________
English is not my native tongue. Before flaming please consider the possiblity that I did not mean to say what you might have read from my posts.
Work| Recreation
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration!
Old 15-Oct-2009, 23:50   #12
Silent_Buddha
Regular
 
Join Date: Mar 2007
Posts: 8,992
Default

Hmmm, doesn't appear my 5870 is using the compute shader version? Is there a way to force it to use the compute shader?

My results...

Code:
NLM denoise using DirectX 11
Ping-Che Chen

sigma: 0.02
Using pixel shader.
Setup time: 20ms
Load file time: 36ms
Denoise and write file time: 415ms
Regards,
SB
Old 16-Oct-2009, 01:32   #13
pcchen

There was a bug in compute shader detection; it should be fixed now. If there's any problem, please let me know. Thanks.
Old 16-Oct-2009, 10:09   #14
Silent_Buddha

Tried the new version and seems to work fine now as far as I can tell. Small speedup from PS version...

Code:
NLM denoise using DirectX 11
Ping-Che Chen

sigma: 0.02
Using compute shader.
Setup time: 16ms
Load file time: 36ms
Denoise and write file time: 368ms
Seems a lot slower than the 5850 score above though. Not sure why.

Regards,
SB
Old 16-Oct-2009, 11:30   #15
CarstenS

Same thing with most of the DX11 samples from the SDK that can also run in feature level 10.x mode. Maybe GDS isn't activated in the drivers yet?
Old 16-Oct-2009, 15:31   #16
pcchen

I updated it with a new pixel shader, which reduces the number of texture loads by preloading texels into a local array. This is extremely slow on GTX 285 (about 1800 ms), perhaps because GTX 285 does not support indexed arrays in registers. On my Radeon HD 4850, however, it runs much faster: the original pixel shader takes about 900 ms, while the new shader takes only 520 ms. This also confirms my suspicion that RV770/RV870 is limited by texture loads in the original shader.

If anyone with a Radeon wants to test the new shader, use the -p2 switch, i.e.

nlm_cs -p2 IMG_0025.JPG output.bmp
Old 16-Oct-2009, 17:30   #17
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,864
Default

Using:

http://developer.amd.com/gpu/shader/Pages/default.aspx

Looking at shader_ps.hlsl, the PSMain shader has a really low ALU:fetch in each of the loops, 6 ALU and 3 fetch.

Looking at PSMain2, the ATI compiler is allocating 43 vec4 registers (I haven't counted how many of the effective 172 scalar registers are actually being used). That means there's only 5 threads of 64 pixels that can be in flight. The low ALU:fetch, 2.7:1, resulting from 49 vfetch instructions, 27 loads and 206 ALU instructions in the main loop, means ATI is still fetch limited.
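As a rough check on those figures, here is a small Python sketch. It is not taken from GPUSA or from the program in this thread. The budget of 256 vec4 registers per pixel is an assumption, chosen because it is consistent with the thread counts quoted in this thread, and the helper names are made up for illustration.

```python
# Assumed budget: vec4 registers available to each pixel of a 64-pixel thread.
REGISTER_BUDGET = 256

def alu_fetch_ratio(alu_instructions, fetch_instructions):
    """ALU instructions issued per fetch instruction."""
    return alu_instructions / fetch_instructions

def threads_in_flight(vec4_registers, budget=REGISTER_BUDGET):
    """How many threads fit if each one needs `vec4_registers` registers per pixel."""
    return budget // vec4_registers

print(alu_fetch_ratio(6, 3))                      # PSMain loop: 2.0
print(round(alu_fetch_ratio(206, 49 + 27), 1))    # PSMain2 main loop: 2.7
print(threads_in_flight(43))                      # PSMain2: 5
```

Under the same assumption, the division gives 4 threads for 52 registers and 2 threads for 113 registers.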

I guess PSMain2 uses too many registers for NVidia's architecture, so it is forced to spill to main memory.

Jawed
__________________
Can it play WoW?
Old 16-Oct-2009, 20:37   #18
Silent_Buddha

Whoa that's a dramatic speed increase for the new PS path.

Code:
NLM denoise using DirectX 11
Ping-Che Chen

sigma: 0.02
Using pixel shader.
Setup time: 32ms
Load file time: 42ms
Denoise and write file time: 261ms
Regards,
SB
Old 16-Oct-2009, 20:43   #19
pcchen

Quote:
Originally Posted by Jawed View Post
Using:

http://developer.amd.com/gpu/shader/Pages/default.aspx

Looking at shader_ps.hlsl, the PSMain shader has a really low ALU:fetch in each of the loops, 6 ALU and 3 fetch.
Yeah, I have used that, and that's why I decided to modify the shader to reduce texture loads. Apparently, GTX 285 has enough texture units that it's easy to use them as cached read-only memory. In this shader, there are 4851 texture loads for every pixel. That's quite a crazy number of texture loads.

Quote:
Looking at PSMain2, the ATI compiler is allocating 43 vec4 registers (I haven't counted how many of the effective 172 scalar registers are actually being used). That means there's only 5 threads of 64 pixels that can be in flight. The low ALU:fetch, 2.7:1, resulting from 49 vfetch instructions, 27 loads and 206 ALU instructions in the main loop, means ATI is still fetch limited.
This shader reduces the number of texture loads from the previous 4851 to 1029. That's a nearly 5-fold reduction. However, as you observed, the number of registers it requires reduces the number of threads in flight, so performance suffers. Ideally, only 169 texture loads are required per pixel, but that would need a staggering 2704 bytes of memory per pixel to store those texels.
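One way to account for these counts, sketched in Python. The 7x7 search window and 7x7 patch (kernel_half = 3) are inferred from the loop lengths mentioned in this thread rather than read from the shader, the 16 bytes per texel is likewise an assumption, and the breakdown is only one decomposition that happens to reproduce the quoted numbers.

```python
SEARCH = 7                    # assumed: candidate positions per axis
PATCH = 7                     # assumed: kernel_half * 2 + 1 with kernel_half = 3
SPAN = SEARCH + PATCH - 1     # 13 texels touched per axis
BYTES_PER_TEXEL = 16          # assumed: four 32-bit floats

def naive_loads():
    # Per candidate: its own centre texel, plus two texels per patch comparison.
    per_candidate = 1 + 2 * PATCH * PATCH
    return SEARCH * SEARCH * per_candidate

def row_cached_loads():
    # Per (search row, patch row) pair: one 13-texel row and one 7-texel row,
    # plus the 49 candidate centre texels.
    per_row_pair = SPAN + PATCH
    return SEARCH * PATCH * per_row_pair + SEARCH * SEARCH

def ideal_loads():
    # Every texel of the 13x13 footprint fetched exactly once.
    return SPAN * SPAN

print(naive_loads(), row_cached_loads(), ideal_loads())   # 4851 1029 169
print(ideal_loads() * BYTES_PER_TEXEL)                    # 2704
print(round(naive_loads() / row_cached_loads(), 2))       # 4.71
```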

The ideal way to do this is to use shared memory, which can reduce texture loads much further (for example, a 32x16 thread group needs only 968 loads for all of its threads, rather than 169 per thread without shared memory). Unfortunately, the restrictions of cs 4.0 make this very inconvenient. If I get an RV870 I can do that with cs 5.0. Another way is to use OpenCL, which does not have this restriction.
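The group-wide figure can be sketched the same way. The halo sizes here are a guess: 6 texels horizontally and 3 vertically is the combination that reproduces 968 (a 44 x 22 tile). If the 13x13 footprint needs 6 texels on every side, the tile would be 44 x 28 instead, which is still far below 169 loads per thread.

```python
def tile_loads(group_w, group_h, halo_x, halo_y):
    """Texels a thread group must stage: its own area plus a border on each side."""
    return (group_w + 2 * halo_x) * (group_h + 2 * halo_y)

threads = 32 * 16
quoted = tile_loads(32, 16, halo_x=6, halo_y=3)      # 44 x 22 tile
symmetric = tile_loads(32, 16, halo_x=6, halo_y=6)   # 44 x 28 tile

print(quoted, symmetric)                                           # 968 1232
print(round(quoted / threads, 2), round(symmetric / threads, 2))   # 1.89 2.41
```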

Quote:
I guess PSMain2 uses too many registers for NVidia's architecture, so it is forced to spill to main memory.
I suspect that the main reason behind the poor performance on NVIDIA's architecture is that its registers can't be indexed, so they have to use video memory to handle arrays (unless you unroll every loop to make all array accesses non-indexed, but that's probably going to be worse).
Old 16-Oct-2009, 22:27   #20
Jawed

Quote:
Originally Posted by pcchen View Post
I suspect that the main reason behind the poor performance on NVIDIA's architecture is that its registers can't be indexed, so they have to use video memory to handle arrays (unless you unroll every loop to make all array accesses non-indexed, but that's probably going to be worse).
Only the dist array (7 elements) actually uses indexed registers in the ATI code. Have to admit, I didn't even notice there were any indexed registers in the assembly first time around.

Bizarrely, the ATI code uses purely static indexing. Seems like a compiler bug - it should see these are static indexes and allocate fixed registers. Might be a side-effect of the indexing produced by the HLSL compiler, now I've looked at the D3D assembly.

Generally I was under the impression that NVidia GPUs do support indexed registers, as long as there's not too many and they can be statically allocated.

You could actually unroll the inner computation loop:

Code:
   [unroll]
   for(k = 0; k <= kernel_half*2; k++) {
    [unroll]
    for(l = 0; l <= kernel_half*2; l++) {
     float3 cd = c2s[l] - c1s[k + l];
     float weight = g_gaussian[p + l];
     dist[k] += (dot(cd, cd) * weight);
    }
   }
The assembly this results in has an inner loop with 27 fetches and 123 ALUs, i.e. a reasonable 4.6 ALU:fetch.
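For readers without the shader source at hand, here is a Python transliteration of what this loop appears to compute. The array lengths (13 texels in c1s, 7 in c2s) follow from kernel_half = 3, which is an assumption. The `weights` argument stands in for the g_gaussian[p + l] slice, and the function name is made up.

```python
def row_distances(c1s, c2s, weights):
    """Weighted squared colour distance between one patch row (c2s) and the
    same row at each of len(c2s) horizontally shifted positions in c1s."""
    n = len(c2s)                       # kernel_half * 2 + 1
    dist = [0.0] * n
    for k in range(n):                 # candidate shift
        for l in range(n):             # position inside the patch row
            cd = [a - b for a, b in zip(c2s[l], c1s[k + l])]
            dist[k] += sum(x * x for x in cd) * weights[l]
    return dist

# Toy data: red channel equals the texel index, so shift k differs by k everywhere.
c1s = [(float(i), 0.0, 0.0) for i in range(13)]
c2s = [(float(i), 0.0, 0.0) for i in range(7)]
print(row_distances(c1s, c2s, [1.0] * 7))
# [0.0, 7.0, 28.0, 63.0, 112.0, 175.0, 252.0]
```

Note that a single 13-texel row serves all 7 shifts, which is why preloading rows into local arrays cuts the number of fetches.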

This results in an allocation of 52 vec4 registers, which means only 4 threads. Slide 23:

http://gpgpu.org/wp/wp-content/uploa...chitecture.pdf

says 3 is minimum, but 5 is better. So 4 threads might be pushing it somewhat.

Worth a try.

On HD5870 the DOT4 instructions would become DOT3s, which could lower the ALU:fetch. Until GPUSA is updated for R800 it's just a guessing game though. I really don't understand why AMD doesn't have a revised version out there already.

Jawed
Old 16-Oct-2009, 22:55   #21
Arnold Beckenbauer
Senior Member
 
Join Date: Oct 2006
Location: Germany
Posts: 1,003
Default

sigma: 0.02
Using pixel shader.
Setup time: 78ms
Load file time: 63ms
Denoise and write file time: 483ms

HD4850 (700/993 MHz), PCIe 1.0 (16x)
__________________
Hail Brothers and Sisters! Coranon Silaria, Ozoo Mahoke
Eta Kooram Nah Smech!

Find Chuck Norris.
Old 16-Oct-2009, 23:14   #22
pcchen

Quote:
Originally Posted by Jawed View Post
You could actually unroll the inner computation loop:

Code:
   [unroll]
   for(k = 0; k <= kernel_half*2; k++) {
    [unroll]
    for(l = 0; l <= kernel_half*2; l++) {
     float3 cd = c2s[l] - c1s[k + l];
     float weight = g_gaussian[p + l];
     dist[k] += (dot(cd, cd) * weight);
    }
   }
The assembly this results in has an inner loop with 27 fetches and 123 ALUs, i.e. a reasonable 4.6 ALU:fetch.
I was under the impression that the compiler would automatically determine whether to unroll loops. Anyway, in general unrolling such a long loop is not a good idea on NVIDIA's architecture, but apparently it's good for ATI's architecture, at least in this particular instance.

The unrolled shader takes very long to compile (on my computer it took almost 43 seconds). However, the result is much better, at 340 ms on my Radeon HD 4850 (previously about 520 ms). The compile time is not a very big issue because the program saves compiled shaders into binary files for later use.

If anyone wants to try this, just add [unroll] to the two loops in shader_ps.hlsl, as in the code above. Remember to delete all .bin files in the directory, though, as they may contain binaries of old shaders. I'll update the files on the server as soon as possible.
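A possible way to avoid the manual cleanup, sketched in Python. This is not what the program does (it writes fixed .bin files). The idea is to key the cache file name on a hash of the shader source, so an edited shader never picks up a stale binary. `cached_compile`, `fake_compile` and the file naming are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def cached_compile(source, compile_fn, cache_dir):
    """Return compiled bytes for `source`, reusing a cached file when one exists."""
    # The file name depends on the source text, so changed shaders get a new file.
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()[:16]
    path = Path(cache_dir) / f"shader_{key}.bin"
    if path.exists():
        return path.read_bytes()
    blob = compile_fn(source)              # slow path, e.g. the HLSL compiler
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(blob)
    return blob

def demo():
    calls = []
    def fake_compile(src):                 # hypothetical stand-in for the compiler
        calls.append(src)
        return src.encode("utf-8")
    with tempfile.TemporaryDirectory() as tmp:
        cached_compile("[unroll] loop", fake_compile, tmp)
        cached_compile("[unroll] loop", fake_compile, tmp)   # served from disk
        cached_compile("plain loop", fake_compile, tmp)      # new source, recompiled
    return len(calls)

print(demo())  # 2
```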

Quote:
Generally I was under the impression that NVidia GPUs do support indexed registers, as long as there's not too many and they can be statically allocated.
My guess is that NVIDIA uses shared memory for indexed arrays if they are not very big. I do this in many of my CUDA programs as well. Since shared memory is quite small (16KB), it can't handle very large arrays.
Old 16-Oct-2009, 23:40   #23
Jawed

Nice, that's a good speed-up, approaching 3x from the original pixel shader

Your feedback time was almost as fast as I could have done it myself if I had all the stuff required

I'm amazed by the compilation time though. Here it only takes a couple of seconds on my crappy A64 3500 (2GHz X2) in GPUSA.

Check the compilation time in GPUSA, it should be faster. I wonder what's different.

I think there are heuristics for unrolling, e.g. detecting that there's 16 or less static iterations of a loop (I've seen this behaviour in Brook+) but I've not really seen any documentation on this subject. It might be in the hands of what the HLSL compiler outputs.

e.g. I have just commented out all the loop-related pragmas and told GPUSA to avoid control flow. The resulting D3D assembly contains only a single loop, is 1398 instructions and uses no indexable registers. The assembly uses 113 vec4 registers but the loop has an ALU:fetch of 4.27. Seems unlikely it would be faster, with only 2 threads.

If I take your code from before my suggestion, and comment out the pragmas, the default HLSL compilation in GPUSA produces the same code as with your pragmas - so it is seeing the 13-long and 7-long static loops. I've just noticed that this D3D assembly contains 3 indexable registers, but the resulting hardware assembly only has 1.

Jawed
Old 17-Oct-2009, 00:38   #24
pcchen

Yeah, the compile time is puzzling. But I've seen that before. It's probably a problem with D3D's HLSL compiler. GPUSA is so fast that its default behavior is to recompile after every character you type into the source code window :P

To my understanding, the compiled binary is platform agnostic, as it's basically just the assembly form (although in SM4.0 it's no longer possible to directly write in assembly form). So basically the long compile time is actually not a big problem.

A more practical problem is that there is no obvious way to decide which shader to use other than plainly doing a benchmark (or by vendor ID... but I don't really like that because it's too restrictive).
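The benchmark approach could look roughly like this in Python. It is only a sketch: the two lambdas are stand-ins that sleep instead of running the real shader paths, and `pick_fastest` is a made-up helper, not part of the program.

```python
import time

def _timed(run):
    start = time.perf_counter()
    run()
    return time.perf_counter() - start

def pick_fastest(candidates, runs=3):
    """Return (name, seconds) for the quickest callable in `candidates`."""
    best = None
    for name, run in candidates.items():
        run()                                    # warm-up, absorbs one-off setup cost
        elapsed = min(_timed(run) for _ in range(runs))
        if best is None or elapsed < best[1]:
            best = (name, elapsed)
    return best

# Stand-ins for the real code paths; the sleep durations are arbitrary.
paths = {
    "PSMain": lambda: time.sleep(0.05),
    "PSMain2": lambda: time.sleep(0.005),
}
print(pick_fastest(paths)[0])  # PSMain2
```

With a real GPU path, each callable would also have to wait for the GPU to finish before returning, otherwise only the submission time is measured.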

Now I'm more interested in doing an OpenCL version, maybe using shared memory. It'd be interesting to compare it with the Compute shader version.

By the way, I wrote a CPU version a year ago, using SSE instructions. It's not exactly the same: the CPU version works in the YUV 4:2:0 color space rather than RGB, so it does less work. It's also multi-threaded. On my Core i7 920, it takes about 1.1 seconds to denoise the same image, slower than a lowly GeForce 9600GT. I think that shows the amazing potential of GPUs for image processing work like this.
Old 17-Oct-2009, 11:29   #25
Jawed

I've had 5 minute+ compile times with Brook+. Unrolled stuff is normally what kills it. Brook+ seems to use the HLSL compiler though, so I've not split it out to identify which compiler is thrashing.

This is a good example of why developing for a single console makes for optimised code.

I do also think that being able to see the assembly and statistics relating to hardware execution is a vital part of tuning for performance. Sure you could suck it and see for each combination of unrolls in this shader, but the insights you can gain on the hardware execution model are pretty useful.

In the original pixel shader, I suspect that cache access patterns are really important. I also think that NVidia's more fluid instruction scheduling (the ability for loads to complete individually and in any order, and for the dependent ALU instructions to execute immediately, as opposed to the strictly clause-based approach on ATI) is what makes the performance so good.

Is it possible to separate-out the computation time and the file-write time?

I'm also thinking for real benchmarking it'd be worthwhile doing more than one denoise. I suppose the alternative would be to use a monster 24MP image.

For OpenCL this might be a useful leg-up:

http://developer.amd.com/gpu/ATIStre...ingOpenCL.aspx

Though the article is only written with a CPU as a target and doesn't touch local memory. It does some vectorisation and unrolling - though there's no unroll pragma in AMD's compiler it seems, whereas there is in NVidia's.

Whole pile of OpenCL related webinars coming up:

http://developer.nvidia.com/object/g...ng_online.html

At the very bottom of the page is a WMV and PDF for "Best Practices for OpenCL Programming". The audio quality's a bit ropey though.

Jawed