NLM denoise in DX11

Discussion in 'GPGPU Technology & Programming' started by pcchen, Oct 1, 2009.

  1. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
I'm reasonably sure it's stand-alone, as it contains historical versions of the compiler from the different Catalyst versions. I don't have a system without ATI drivers installed to test on, though.

    It says "Requirements: Windows XP or Vista, Microsoft DirectX SDK (April 07 or later). " I don't remember installing the DirectX SDK - that might be something that any system with a reasonably up-to-date DirectX 9.0 install has. Dunno.

    Jawed
     
  3. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    581
    Location:
    Taiwan
    I can run it fine on my system with only GTX 285 installed, so you don't need ATI's hardware installed to run it.
     
  4. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    Oh cool, thanks guys. Will have to check it out sometime.
     
  5. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    581
    Location:
    Taiwan
    I've finally made some time to port the NLM denoise algorithm to OpenCL. The major point here is to use shared memory to reduce memory reads, rather than using texture cache.

However, for some reason, the kernel can't be run with more than 16x16 work items per work group on my GTX 285. I suspect that the compiler may have used too many registers. The performance is therefore not great: on my GTX 285 it takes around 450 ms to run (including allocating buffers and copying data to/from the GPU; the actual kernel run time is about 435 ms). I think if it were possible to use more work items, such as the originally planned 32x16, it could be better.

Unfortunately, right now the CUDA 3.0 beta toolkit doesn't contain an OpenCL visual profiler (there are links in the Start menu but no files).

    If anyone wants to try this, the binary can be downloaded here. It should be compatible with AMD's OpenCL SDK.
     
  6. Tim Murray

    Tim Murray the Windom Earle of mobile SOCs
    Veteran

    Joined:
    May 25, 2003
    Messages:
    3,278
    Likes Received:
    66
    Location:
    Mountain View, CA
    that seems like a mistake; I'll check on it when I'm not on vacation (Monday).
     
  7. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    Crashes on HD5870 (900MHz GPU) and Cat 9.11 WHQL with Stream 2.0 SDK:
    Code:
    NLM denoise OpenCL version
    Copyright(c) Ping-Che Chen
    Width: 1600 Height: 1200
    For test only: Expires on Sun Feb 28 00:00:00 2010
    Find device: Cypress (Advanced Micro Devices, Inc.)
    Time used: 480
    
    Code:
    Problem Event Name:    APPCRASH
    Application Name:    nlm_cl.exe
    Application Version:    0.0.0.0
    Application Timestamp:    4b04657a
    Fault Module Name:    OpenCL.dll
    Fault Module Version:    1.0.0.1
    
     
    #47 fellix, Nov 18, 2009
    Last edited by a moderator: Nov 18, 2009
  8. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    581
    Location:
    Taiwan
    Thanks, looking forward to it :)

That's weird. Does it produce a good output file? Since "Time used" is printed, the denoise operations have apparently already completed, so it could only be crashing while writing the file or releasing the OpenCL contexts. As it crashed in OpenCL.dll, I suspect the latter.

    I uploaded a new binary here (and source here). This time it has a new kernel for non-vectorized architectures (such as NVIDIA's devices). It's too bad that OpenCL does not support float3. The "de-vectorized" version runs faster on my GTX 285, around 375 ms. The new version also has a -p switch to enable profiling mode.
     
  9. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
Yep, it does. I forgot about it, but the output BMP is quite blurry -- should it be that way?
This one outputs complete stats, but the app error message remains:
    Code:
    NLM denoise OpenCL version
    Copyright(c) Ping-Che Chen
    Sigma: 0.02
    Width: 1600 Height: 1200
    For test only: Expires on Sun Feb 28 00:00:00 2010
    Find device: Cypress (Advanced Micro Devices, Inc.)
    Setup time: 0.28 s
    Denoise time: 0.48 s
    Write file time: 0.02 s
    
     
    #49 fellix, Nov 18, 2009
    Last edited by a moderator: Nov 18, 2009
  10. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    581
    Location:
    Taiwan
If it's very blurry then something is wrong. The result image should be very similar to the original, only slightly blurrier (assuming sigma is set to 0.02). Basically, only the noise should be "blurred out."

    I guess I'll have to try it at home where I have a Radeon 4850.

I also tested this on a GeForce 8800GT, where the run time is about 735 ms.

[EDIT] I just manually unrolled the innermost loop, and the run time on the GeForce 8800GT decreases to 516 ms (kernel time ~493 ms). Unfortunately, there are no unroll directives in OpenCL. Maybe I can write something to generate the unrolled program at run time.

[EDIT2] I found that there's an OpenCL visual profiler in the 32-bit CUDA toolkit. It seems it's only absent from the 64-bit toolkit.
     
  11. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    581
    Location:
    Taiwan
I uploaded a new version, which contains loop-unrolling code (the program automatically generates the unrolled code). It also detects and adjusts the work group size automatically, ranging from 32x32 (1024 work items) down to 1x1. There is also a "force using CPU" option, but currently there is no kernel designed for running on a CPU. I also put up an executable compiled with ATI's SDK; maybe it'll be more compatible with ATI's OpenCL implementation.

My GeForce GTX 285 runs the unrolled version with 32x16 work items in 236 ms (kernel time 213 ms). Now the performance is close to the compute shader version, but still not quite there. I tried to unroll the loop further, but that takes the number of work items down to 128, and performance is actually worse.

    Now the vectorized version is also unrolled, and I think it'd run better on ATI's hardware.

    I also tried it on my Mac mini, which has a lowly GeForce 9400M. It takes about 3990 ms to run. I suspect a CPU optimized version of NLM probably takes roughly the same time to run on the Mac mini (which has a 2.0GHz Core 2 Duo).

    Executable
    Source
     
  12. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    It's all fine now, boss. :lol:
    Code:
    Sigma: 0.02
    Width: 1600 Height: 1200
    For test only: Expires on Sun Feb 28 00:00:00 2010
    Find device [0]: Cypress (Advanced Micro Devices, Inc.)
    Work group size: 256
    Profile enabled. Kernel time: 422820194 ns
    Setup time: 0.63 s
    Denoise time: 0.48 s
    Write file time: 0.05 s
    
     
  13. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    581
    Location:
    Taiwan
Thanks. Is the ATI executable still necessary, or are both the same now?

Also I ran it on a GeForce 8800GT:

    Code:
    NLM denoise OpenCL version
    Copyright(c) Ping-Che Chen
    Sigma: 0.02
    Width: 1600 Height: 1200
    Find device [0]: GeForce 8800 GT (NVIDIA Corporation)
    Work group size: 256
    Profile enabled. Kernel time: 425196960 ns
    Setup time: 1.64 s
    Denoise time: 0.453 s
    Write file time: 0.047 s
I ran it with the OpenCL visual profiler (32-bit version): there is no warp serialization (every thread reads from a different bank at the same time), divergent branches are few (only at the "load image" stage, I believe), and all loads/stores in the denoise kernel are coalesced. The only problem is the low occupancy (0.333); however, that's mostly because the number of work items is limited by register usage. Also, I don't think low occupancy is a serious problem here, because the kernel does not actually spend much time reading/writing global memory, so there's very little latency to hide.

Replacing exp with native_exp makes the kernel a bit faster (416 ms vs. 425 ms), but the result images differ, with many pixels off by more than 1, and that's outside my "comfort zone," so I decided not to use this optimization. (Using half_exp is as fast as exp on the GeForce 8800GT, with an identical result image.)
     
  14. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    581
    Location:
    Taiwan
    On Radeon HD 4850:

    Code:
    NLM denoise OpenCL version
    Copyright(c) Ping-Che Chen
    Sigma: 0.02
    Width: 1600 Height: 1200
    For test only: Expires on Sun Feb 28 00:00:00 2010
    Find device [0]: ATI RV770 (Advanced Micro Devices, Inc.)
    Work group size: 64
    Profile enabled. Kernel time: 1226922460 ns
    Setup time: 2.719 s
    Denoise time: 1.505 s
    Write file time: 0.028 s
    It's better than I expected since it does not have real shared memory AFAIK. The result image is a bit weird though.

[EDIT] I have found a few bugs which are the reason behind the weird results.
The first bug is related to ATI's implementation. Apparently it does not actually put the constant values in the declared array (i.e. __constant float gaussian[49] = { ... }; is filled with zeros instead of the actual values). That's why the resulting image is very blurry. A solution is to make the __constant array a kernel parameter and allocate a memory object for it. This method works on NVIDIA's hardware too, and there doesn't seem to be any performance hit.
Another bug is in my shader, which failed to fill the shared memory completely when the number of work items is too small (namely, less than 16x16). This does not affect the result when the number of work items is at least 16x16 (256).
The fixed version takes basically the same time to run on my Radeon 4850, but with correct results.
There also doesn't seem to be any need for a separate ATI executable anymore. I suspect the original crash was related to an incorrect retain/release implementation for OpenCL contexts.
     
  15. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Did i get that right? Denoise is faster on 8800GT than on Cypress? Or did you two use different pictures?
     
  16. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    581
    Location:
    Taiwan
The image is the same. I don't know the exact cause, but considering that a 4850, which AFAIK has no real shared memory, takes about 1.2 seconds, a 5870 should be much faster than just about three times as fast. I suspect there could be a problem in the shared memory access pattern or something else. For example, the access pattern in the shader is designed to be bank-conflict free on NVIDIA's GPUs. However, I don't know how Cypress' shared memory is arranged.

Also, I've written a CPU-specific shader for CPU devices, and it runs very slowly with AMD's current OpenCL implementation. I'll have to compare it with Apple's implementation later, but I suspect Apple's should be much faster.
     
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    Egad, this sorta stuff is just asking for shader replacement.

    Is that the difference between running it on the SFU (native_exp) and ALU?

    Good work by the way :grin:
     
  18. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    581
    Location:
    Taiwan
I think there shouldn't be a bank conflict in shared memory on Cypress. After all, the access pattern is simple: I just make sure different threads don't read from the same column. I guess that would suffice for Cypress too, though I'd like to know more details. I think this deserves further investigation.

I don't remember seeing a difference on my 4850 though (including performance). The exp operation is outside the innermost loop, so it's only called 49 times per pixel.

Also, I ran the CPU shader on my Mac mini, and it's slow too: it takes about 27 seconds (Core 2 Duo 2.0GHz), and about the same with ATI's implementation on my Core 2 Duo 3.0GHz. ATI's implementation is still in beta, though, and it's pretty old; I think there are still many optimization opportunities.

Unfortunately, this also means that in its current state it's not possible to get reasonable performance from CPU implementations. That could change, but right now GPU implementations are much faster.
     
  19. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    581
    Location:
    Taiwan
I did some further experiments on my ideas. One idea is to remove the type conversions in the inner loops, since they look redundant. However, simply changing the shared memory type from uchar4 to float4 won't work, because the bank conflicts would kill performance.

Therefore, I decided to de-vectorize everything into three float arrays. This shouldn't affect performance much, because NVIDIA's GPUs are already scalar in nature. The first attempt didn't go very well: it used more registers, and the number of work items was cut in half. The OpenCL visual profiler showed that each work item used 33 registers, just one too many.

So I changed the kernel to reduce register usage and managed to get it down to 32 registers, so the number of work items remains the same (256 per work group). However, performance is still worse (about 480 ms in the kernel instead of the original 416 ms on an 8800GT), and the instruction count also increases. A possible reason is that reading from shared memory is not as fast as I thought.

However, some tricks used during these experiments could be useful, such as flattening the two-dimensional array into one dimension. Although, if the OpenCL compiler is any good, it should already remove a lot of the redundant computations.
     
  20. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    581
    Location:
    Taiwan
Since AMD updated the Stream SDK, and that requires some changes to the program, I updated the NLM denoise program so it now works with AMD's OpenCL. A new option, "-platform", lets the user select which OpenCL platform to use (right now most systems have only one platform).

    Executable
    Source

I tried AMD's CPU implementation on my Core i7 920, and it takes about 11.2 seconds to denoise the image. Note that it's not an entirely fair comparison, because for this kind of image processing it's generally possible to do most calculations with integers, which could be much faster on a CPU.
     