NLM denoise in DX11

Discussion in 'GPGPU Technology & Programming' started by pcchen, Oct 1, 2009.

  1. pcchen

I've written a simple non-local means (NLM) denoise program with DX11 compute shader 4.0. It works on my GTX 285, but I don't know whether it works on anything else...

    Anyway, if anyone's interested it can be downloaded here. On my GTX 285 it took around 420 ms to denoise the sample image in the file.

    For those interested in the source code (which is pretty boring), it can be downloaded here.

    Note: it can only output BMP files right now.


EDIT: A slight modification improves the speed to around 280 ms.
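
For anyone curious about the algorithm itself, here is a minimal sketch of what an NLM kernel looks like as a compute shader. It's only an illustration, not my actual shader: the resource names (g_input, g_output, Params) and the window sizes are placeholders.

Code:
// Minimal NLM sketch (cs_4_0 style): for each pixel, average the colors in
// a search window, weighting each candidate by how similar its surrounding
// patch is to the patch around the target pixel. Border handling omitted.
Texture2D<float4> g_input : register(t0);
RWStructuredBuffer<float3> g_output : register(u0); // cs_4_0 allows only raw/structured UAVs

cbuffer Params : register(b0)
{
    uint  g_width;  // image width, to linearize the output index
    float g_sigma;  // noise level, e.g. 0.02
};

#define SEARCH_HALF 3   // 7x7 search window
#define PATCH_HALF  3   // 7x7 patches

[numthreads(16, 16, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    int2 c = int2(id.xy);
    float3 sum = 0;
    float wsum = 0;
    for (int sy = -SEARCH_HALF; sy <= SEARCH_HALF; sy++)
        for (int sx = -SEARCH_HALF; sx <= SEARCH_HALF; sx++)
        {
            // squared color distance between the patch here and the shifted patch
            float d = 0;
            for (int py = -PATCH_HALF; py <= PATCH_HALF; py++)
                for (int px = -PATCH_HALF; px <= PATCH_HALF; px++)
                {
                    float3 cd = g_input.Load(int3(c + int2(px, py), 0)).rgb
                              - g_input.Load(int3(c + int2(sx + px, sy + py), 0)).rgb;
                    d += dot(cd, cd);
                }
            float w = exp(-d / (g_sigma * g_sigma));
            sum += w * g_input.Load(int3(c + int2(sx, sy), 0)).rgb;
            wsum += w;
        }
    g_output[id.y * g_width + id.x] = sum / wsum;
}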
     
  2. MDolenc

What driver are you using? I can't run any DirectCompute stuff on my machine, and it's a properly configured Vista 64 box with the latest DX SDK and everything. Are you on Win7?
     
  3. pcchen

I use the previous 190.62 driver on Windows 7 x64. I've heard that in order to enable the compute shader on Vista, a registry key has to be modified, but I don't know which one (I'll have to check NVIDIA's GPU computing SDK for that). I don't know whether the latest driver (191.07) still needs that, though.

Also, the latest version takes around 250 ms on my GTX 285. Basically, I changed the type of the colors from float4 to float3, which helps a bit on NVIDIA's scalar architecture.
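
In case anyone wants to reproduce the change, it is essentially this (a sketch; g_texture, p1 and p2 are placeholder names for the texture and two int3 texel coordinates):

Code:
// Before: the color difference was carried in a float4, wasting a lane:
// float4 cd = g_texture.Load(p1) - g_texture.Load(p2);
// float  d  = dot(cd, cd);   // 4-component dot
// After: float3 holds RGB just fine, and the dot shrinks to 3 components,
// saving scalar ALU work on NVIDIA's architecture.
float3 cd = g_texture.Load(p1).rgb - g_texture.Load(p2).rgb;
float  d  = dot(cd, cd);      // 3-component dot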
     
  4. pcchen

I checked the release notes, and they say the registry keys named "D3D_39482904" should be deleted (there are about two instances of them).
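
If anyone needs to hunt them down, a recursive registry search should show where they live (check HKCU as well if HKLM turns up nothing):

nlm_cs aside, just run: reg query HKLM /f D3D_39482904 /s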
     
  5. pcchen

I did some tuning on this, and now it takes the GTX 285 223 ms to denoise. I also modified the program to write the compiled shader into a file named shader.bin, so it doesn't need to compile it again next time.

    A friend tested it for me on a GeForce 9600GT and it takes 717 ms.

Right now, my shader relies on the texture cache to reduce the memory bandwidth requirements. It's possible to use shared memory for this, but it's much more complicated.
     
  6. sc3252

    Just tested it on my 5850. Are both images supposed to look the same?
The output says:

Code:
Setup time: 561
Load file time: 47
Denoise and write file time: 234

edit: never mind, it seems to work.
My 5850 is at 775 core and 1100 memory.
     
  7. MDolenc

Thanks, Chen! That helped.

Code:
Setup time: 553
Load file time: 113
Denoise and write file time: 290

On a GTX 280 with Vista 64 and 191.03 drivers.
     
  8. pcchen

They should be quite similar, because this is just a denoise program. The source image is from an iPhone 3G's camera, which is quite noisy. After denoising, it becomes a little blurry, but most of the random noise is gone. Of course, you can use your own images from other cameras, but the size has to fit inside a texture (which on current cards should be 4096x4096 or so).

That seems to be ok. The kernel is not very vectorized (as you can see in the shader code), so it's not very well suited to RV870. I'll see what I can do when I get an RV870 in my home computer, or when a compute-shader-enabled driver for RV770 is publicly available.
     
  9. pcchen

I have updated the program with support for a pixel shader code path, so now it should run on any video card with pixel shader 4.0 support (that means DX10 feature level), although it still requires the DX11 runtime.

    From what I've seen on my GTX 285, the performance is almost the same (since the shader is, well, almost the same). The only downside of the pixel shader path is that I used D3DX to write the texture directly to a BMP file, and it decided to write it in 32 bpp mode rather than 24 bpp mode, so the resulting file is larger.

Programming-wise, the pixel shader path was much more annoying than I anticipated, partly because I haven't done 3D rendering with D3D10 for quite a while, and D3D11 is apparently more complex than D3D10. Other than that, it doesn't seem to have any nasty surprises, which is a good thing.

I've tried a few ways to utilize shared memory to reduce the number of texture loads. However, it increases pressure on the ALUs, and since my GTX 285 is already nearly ALU-limited, performance is always worse. Maybe on RV870 it could be a different story.
     
  10. fellix

    Code:
    NLM denoise using DirectX 11
    Ping-Che Chen
    sigma: 0.02
    Using pixel shader.
    Setup time: 20ms
    Load file time: 30ms
    Denoise and write file time: 560ms
    
    Radeon HD 4890 @ 950MHz GPU;
    Win7 x86 (7600.20510), Cat 9.11b;
     
  11. CarstenS

Can someone - please! - re-upload a working archive of pcchen's program? Every time I try downloading it from the original link, I get an error message when trying to decompress the RAR archive (which is only a few kB).

    edit:
Never mind, it was just me trying to save the file via right-click - Save as... - and that did not work.
     
  12. Silent_Buddha

Hmmm, it doesn't appear my 5870 is using the compute shader version. Is there a way to force it to use the compute shader?

    My results...

    Code:
    NLM denoise using DirectX 11
    Ping-Che Chen
    
    sigma: 0.02
    Using pixel shader.
    Setup time: 20ms
    Load file time: 36ms
    Denoise and write file time: 415ms
    Regards,
    SB
     
  13. pcchen

There was a bug in the compute shader detection; it should be fixed now. If there's any problem, please let me know, thanks.
     
  14. Silent_Buddha

Tried the new version, and it seems to work fine now as far as I can tell. Small speedup over the PS version...

    Code:
    NLM denoise using DirectX 11
    Ping-Che Chen
    
    sigma: 0.02
    Using compute shader.
    Setup time: 16ms
    Load file time: 36ms
    Denoise and write file time: 368ms
Seems a lot slower than the 5850 score above, though. Not sure why.

    Regards,
    SB
     
  15. CarstenS

The same thing happens with most of the DX11 samples from the SDK that you can also run in feature level 10.x mode. Maybe GDS isn't activated in the drivers yet?
     
  16. pcchen

I updated it with a new pixel shader, which reduces the number of texture loads by preloading pixels into a local array. This is extremely slow on the GTX 285 (about 1800 ms), perhaps because the GTX 285 does not support indexed arrays in registers. On my Radeon HD 4850, however, it runs much faster: the original pixel shader takes about 900 ms, while the new shader takes only 520 ms. This also confirms my suspicion that RV770/RV870 is limited by texture loads in the original shader.

If anyone with a Radeon wants to test the new shader, use the -p2 switch, i.e.

    nlm_cs -p2 IMG_0025.JPG output.bmp
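
The gist of the new shader is this staging step (a simplified sketch rather than the exact code; load_strip and the size macros are illustrative names):

Code:
// Stage a vertical strip of texels into a local, indexable array once, so
// the patch-distance loops can reuse array elements instead of re-issuing
// a texture load for every comparison.
#define SEARCH_HALF 3
#define PATCH_HALF  3
#define STRIP_LEN   (2 * (SEARCH_HALF + PATCH_HALF) + 1)   // 13 texels

Texture2D<float4> g_texture : register(t0);

void load_strip(int2 center, out float3 strip[STRIP_LEN])
{
    [unroll]
    for (int i = 0; i < STRIP_LEN; i++)
        strip[i] = g_texture.Load(
            int3(center.x, center.y + i - (SEARCH_HALF + PATCH_HALF), 0)).rgb;
}

The catch is that the array has to live somewhere indexable, which seems to be exactly where the two architectures diverge.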
     
  17. Jawed

    Using:

    http://developer.amd.com/gpu/shader/Pages/default.aspx

Looking at shader_ps.hlsl, the PSMain shader has a really low ALU:fetch ratio in each of the loops: 6 ALU and 3 fetch.

Looking at PSMain2, the ATI compiler allocates 43 vec4 registers (I haven't counted how many of the effective 172 scalar registers are actually used). That means only 5 threads of 64 pixels can be in flight. The low ALU:fetch of 2.7:1 (49 vfetch instructions plus 27 loads against 206 ALU instructions in the main loop) means ATI is still fetch-limited.
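
For reference, the thread count falls straight out of the register budget (assuming RV770's usual 16384 vec4 registers per SIMD, shared among 64-pixel wavefronts):

Code:
16384 / 64      = 256 vec4 registers available per pixel
floor(256 / 43) = 5 wavefronts in flight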

    I guess PSMain2 uses too many registers for NVidia's architecture, so it is forced to spill to main memory.

    Jawed
     
  18. Silent_Buddha

Whoa, that's a dramatic speed increase for the new PS path.

    Code:
    NLM denoise using DirectX 11
    Ping-Che Chen
    
    sigma: 0.02
    Using pixel shader.
    Setup time: 32ms
    Load file time: 42ms
    Denoise and write file time: 261ms
    Regards,
    SB
     
  19. pcchen

Yeah, I have used that, and that's why I decided to modify the shader to reduce texture loads. Apparently, the GTX 285 has enough texture units that it's easy to use them as a cached read-only memory. In this shader, there are 4851 texture loads for every pixel. That's quite a crazy number of texture loads. :)

This shader reduces the number of texture loads from the previous 4851 to 1029, a nearly five-fold reduction. However, as you observed, the number of registers it requires reduces the number of threads in flight, so performance suffers. Ideally, only 169 texture loads per pixel are required, but that would need a staggering 2704 bytes of memory per pixel to store those texels.
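
The arithmetic behind these numbers, with a 7x7 search window and 7x7 patches:

Code:
per search position: two 7x7 patch reads + 1 center texel =   99 loads
search positions:    7 x 7                                =   49
naive total:         49 x 99                              = 4851 loads/pixel
distinct texels:     (7 + 7 - 1)^2 = 13 x 13              =  169
as float4 (16 B):    169 x 16                             = 2704 bytes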

The ideal way to do this is to use shared memory, which can reduce texture loads much further (for example, a 32x16 thread group needs only 968 loads in total, rather than 169 loads per thread without shared memory). Unfortunately, the restrictions of cs 4.0 make this very inconvenient. If I get an RV870, I can do it with cs 5.0. Another way is to use OpenCL, which does not have this restriction.
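
To sketch what I mean for cs 5.0 (untested; the tile size is just an example, and cs 4.0 makes the same pattern painful because there each thread may only write to its own groupshared slot):

Code:
#define TILE      16
#define APRON     6                      // search_half + patch_half
#define CACHE_DIM (TILE + 2 * APRON)     // 28x28 staged texels, ~9.2 KB

Texture2D<float4> g_input : register(t0);
groupshared float3 s_tile[CACHE_DIM * CACHE_DIM];

[numthreads(TILE, TILE, 1)]
void CSMain(uint3 gid : SV_GroupID, uint gi : SV_GroupIndex)
{
    // Cooperative load: the 256 threads stripe-load the 28x28 region once,
    // then every patch comparison reads from groupshared instead of the
    // texture. Border clamping omitted for brevity.
    int2 base = int2(gid.xy) * TILE - APRON;
    for (uint i = gi; i < CACHE_DIM * CACHE_DIM; i += TILE * TILE)
    {
        int2 p = base + int2(i % CACHE_DIM, i / CACHE_DIM);
        s_tile[i] = g_input.Load(int3(p, 0)).rgb;
    }
    GroupMemoryBarrierWithGroupSync();
    // ...NLM loops over s_tile[] go here.
}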

I suspect that the main reason behind the poor performance on NVIDIA's architecture is that registers in NVIDIA's architecture can't be indexed, so arrays have to be handled through video memory (unless you unroll every loop to make all array accesses non-indexed, but that's probably going to be worse).
     
  20. Jawed

Only the dist array (7 elements) actually uses indexed registers in the ATI code. I have to admit I didn't even notice there were any indexed registers in the assembly the first time around.

Bizarrely, the ATI code uses purely static indexing. Seems like a compiler bug - it should see these are static indexes and allocate fixed registers. It might be a side-effect of the indexing produced by the HLSL compiler, now that I've looked at the D3D assembly.

    Generally I was under the impression that NVidia GPUs do support indexed registers, as long as there's not too many and they can be statically allocated.

    You could actually unroll the inner computation loop:

    Code:
   [unroll]
   for (k = 0; k <= kernel_half * 2; k++) {
       [unroll]
       for (l = 0; l <= kernel_half * 2; l++) {
           float3 cd = c2s[l] - c1s[k + l];
           float weight = g_gaussian[p + l];
           dist[k] += dot(cd, cd) * weight;
       }
   }
    
The resulting assembly has an inner loop with 27 fetches and 123 ALUs, i.e. a reasonable 4.6:1 ALU:fetch ratio.

    This results in an allocation of 52 vec4 registers, which means only 4 threads. Slide 23:

    http://gpgpu.org/wp/wp-content/uploads/2009/09/E1-OpenCL-Architecture.pdf

    says 3 is minimum, but 5 is better. So 4 threads might be pushing it somewhat.

    Worth a try.

    On HD5870 the DOT4 instructions would become DOT3s, which could lower the ALU:fetch. Until GPUSA is updated for R800 it's just a guessing game though. I really don't understand why AMD doesn't have a revised version out there already.

    Jawed
     