NLM denoise in DX11

Discussion in 'GPGPU Technology & Programming' started by pcchen, Oct 1, 2009.

  1. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,772
    Likes Received:
    153
    Location:
    Taiwan
    Ok, I basically copied the OpenCL version to a compute shader version, using my old compute shader main program. So basically only the shader is changed (the main program also needs some changes to support CS 5.0). Unfortunately, it's pretty slow for some unknown reason. Apparently the compiler tries to unroll the outer loops, because it thinks "GroupMemoryBarrierWithGroupSync()" shouldn't be inside a loop (?)
    It also takes an extremely long time to compile the shader (nearly three minutes on my Core i7 920).

    The bad news is, the CS 5.0 version is not fast at all. It takes 746 ms to run the shader. Using the GPU shader analyzer, apparently the crazy unrolling has caused the shader to use a large amount of scratch memory (i.e. the values can't fit in the register file, so they have to live in global memory). That may be why it's so slow.

    I also identified a redundant group memory barrier in the shader. However, it shouldn't have much impact on performance.
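
    For reference, the loop/barrier structure in question is roughly like this. It's a simplified sketch with made-up names and tile sizes, not the actual shader, but it shows the pattern fxc insists on unrolling:

    Texture2D<float4>   g_src : register(t0);
    RWTexture2D<float4> g_dst : register(u0);

    groupshared float4 gs_tile[16][16];

    [numthreads(16, 16, 1)]
    void main(uint3 gtid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
    {
        float4 sum    = 0;
        float  weight = 0;

        // Search-window loops. Because the group sync sits inside them, fxc
        // fully unrolls both loops, which is where the huge register/scratch
        // usage comes from.
        for (int dy = -3; dy <= 3; ++dy)
        {
            for (int dx = -3; dx <= 3; ++dx)
            {
                // Stage the shifted neighbourhood into shared memory.
                gs_tile[gtid.y][gtid.x] =
                    g_src.Load(int3(int2(dtid.xy) + int2(dx, dy), 0));
                GroupMemoryBarrierWithGroupSync();

                // ... patch comparison and weight accumulation go here ...

                // Second sync so nobody overwrites the tile while other
                // threads are still reading it in the next iteration.
                GroupMemoryBarrierWithGroupSync();
            }
        }

        g_dst[dtid.xy] = sum / max(weight, 1e-6f);
    }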
     
  2. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    I think there's a compiler hint "[nounroll]" you can put before the loop in question to prevent the HLSL compiler from unrolling it. If that doesn't help, then it might be the driver's compiler that's unrolling the loop. Can you dump the DX tokens from GSA? That should give a hint where the unrolling is taking place.
     
  3. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,772
    Likes Received:
    153
    Location:
    Taiwan
    I put [loop] before the loops but the HLSL compiler complained about something like "synchronization operations can't be used in varying flow control."
     
  4. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    Ok so it sounds like it's unrolling the loop so that the sync operation (a barrier?) won't be inside flow control.
     
  5. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,772
    Likes Received:
    153
    Location:
    Taiwan
    Yeah, it's a GroupMemoryBarrierWithGroupSync(). I'm surprised that DirectCompute doesn't allow this in a predictable loop that every thread in the group goes through (both OpenCL and CUDA allow this).
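
    One way to live with the restriction would be to restructure the shader so the sync never sits inside a loop at all: load the whole tile plus apron into groupshared memory once, sync once at top level, then run the search-window loops barrier-free. Just a sketch under assumed tile/apron sizes, not what the shader actually does:

    #define GROUP_DIM 16
    #define APRON     4   // assumed combined search/patch radius for the sketch
    #define TILE_DIM  (GROUP_DIM + 2 * APRON)

    Texture2D<float4>   g_src : register(t0);
    RWTexture2D<float4> g_dst : register(u0);

    groupshared float4 gs_tile[TILE_DIM][TILE_DIM];

    [numthreads(GROUP_DIM, GROUP_DIM, 1)]
    void main(uint3 gid : SV_GroupID, uint3 gtid : SV_GroupThreadID,
              uint3 dtid : SV_DispatchThreadID)
    {
        int2 tileOrigin = int2(gid.xy) * GROUP_DIM - APRON;

        // Cooperative load: the group fills the whole (GROUP_DIM + 2*APRON)^2
        // tile, each thread fetching a few texels in a strided loop.
        for (uint i = gtid.y * GROUP_DIM + gtid.x;
             i < TILE_DIM * TILE_DIM;
             i += GROUP_DIM * GROUP_DIM)
        {
            int2 p = tileOrigin + int2(i % TILE_DIM, i / TILE_DIM);
            gs_tile[i / TILE_DIM][i % TILE_DIM] = g_src.Load(int3(p, 0));
        }

        // The only barrier, at top level, so it is trivially in uniform flow
        // control and nothing needs to be unrolled around it.
        GroupMemoryBarrierWithGroupSync();

        float4 sum    = 0;
        float  weight = 0;
        int2   center = int2(gtid.xy) + APRON;

        [loop]
        for (int dy = -APRON; dy <= APRON; ++dy)
        {
            [loop]
            for (int dx = -APRON; dx <= APRON; ++dx)
            {
                // ... real patch-distance weight goes here; placeholder:
                float w = 1.0f;
                sum    += w * gs_tile[center.y + dy][center.x + dx];
                weight += w;
            }
        }

        g_dst[dtid.xy] = sum / weight;
    }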
     
  6. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Ok, I grabbed the code you posted in the other thread, removed the "optimization level 3" flag (faster compile, and you really don't want to use those flags anyway... counter-intuitive, but the less messing with the code the HLSL compiler does, the better it usually is for the IHV back-end compilers :)) and ran it on two similar but not identical systems:

    ATI 5870:
    sigma: 0.02
    Using compute shader.
    Setup time: 46ms
    Load file time: 41ms
    Denoise and write file time: 42ms

    NVIDIA GTX 480:
    sigma: 0.02
    Using compute shader.
    Setup time: 46ms
    Load file time: 47ms
    Denoise and write file time: 47ms

    Similar results it seems. To really benchmark the kernel you'd probably want to put just the denoise kernel in a loop and average over a pile of frames, but if the purpose is just to get this single operation as fast as possible then the overheads are indeed just as important :)

    I'm buried right now, but I'll see if I get a chance to play around with the kernel at all. What sorts of speeds are you expecting/going for, or is it just "as fast as possible"? :)
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    There should be a -b option, I think, for benchmarking only. Also, if you use a nice big image (12-25 MP, what you'd get from a digital camera) then you obviate any host<->device latencies.

    I have a suspicion this should run in <25ms. pcchen was reporting 39ms on HD5850 and I think the texturing tweak I mentioned should improve things another ~30%+. Though the effect on Fermi is questionable.
     
  8. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    -b => 33ms on 5870. Should definitely iterate a pile of times with -b though as even run-to-run there are significant outliers. Will run it a bit later on Fermi and I'll see if I can find a huge image (any good links w/ noise so I can verify that it's working properly?).

    Agreed. I see no reason that this shouldn't run quite quickly on both architectures.
     
  9. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,772
    Likes Received:
    153
    Location:
    Taiwan
    I back-ported the modifications made to the CS 5.0 version to the OpenCL version. On the Radeon 5850, it now takes 57 ms to run the kernel (previously it was ~7x ms). On the GTX 285, the same modification takes 35 ms (previously it was 39 ms). These times are reported by the OpenCL implementation for both kernels (the "rearrangement kernel" and the denoise kernel) rather than measured by wall clock.

    Source & executables

    [EDIT] I did some minor optimizations and now it takes 55 ms on 5850 and 30 ms on GTX 285.
     
  10. prunedtree

    Newcomer

    Joined:
    Aug 8, 2009
    Messages:
    27
    Likes Received:
    0
    I mentioned earlier in this thread that I believe this could be quite a bit faster on a CPU. Here's an implementation:

    http://www.sendspace.com/file/48m6we

    Note that this is a Windows 64-bit binary. It also requires SSSE3.
    Please test and report your results, along with the CPU used.
     
  11. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Core i7 940 on the image from pcchen's zip:
    27.5641273466863 ms
    69.6557513267627 Megapixel/sec
     
  12. BRiT

    BRiT (╯°□°)╯
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    12,785
    Likes Received:
    9,107
    Location:
    Cleveland
    Core i3 530 on the image from pcchen's zip:
    52.60545212719536 ms
    36.49811801555491 Megapixel/sec
     
  13. DeanoC

    DeanoC Trust me, I'm a renderer person!
    Veteran Subscriber

    Joined:
    Feb 6, 2003
    Messages:
    1,469
    Likes Received:
    185
    Location:
    Viking lands
    Intel Q9550 @ 2.8 GHz, pcchen's img_0025

    ** Non-local means 7x7 **
    45.19541825253106 ms
    42.48218235910396 Megapixel/sec
     
  14. doob

    Regular

    Joined:
    May 21, 2005
    Messages:
    392
    Likes Received:
    4
    Stock C2Q 9400 (2.6 GHz), img_0025
    50.6933305787758 ms
    37.874804793431 MP/s
     
  15. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,544
    Likes Received:
    10
    Location:
    In the land of the drop bears
    Stock E8500 (3.166 GHz)

    ** Non-local means 7x7 **
    84.15211030231343 ms
    22.81582711476241 Megapixel/sec
     
  16. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,494
    Likes Received:
    405
    Location:
    Varna, Bulgaria
    Core i7-920 @ 3995MHz

    ** Non-local means 7x7 **
    19.64221910888496 ms
    97.74862958999923 Megapixel/sec
     
  17. ahu

    ahu
    Newcomer

    Joined:
    Jul 19, 2008
    Messages:
    56
    Likes Received:
    2
    Why the cap at 16 threads? (24 threads here:))

    Is the source code available?
     
  18. prunedtree

    Newcomer

    Joined:
    Aug 8, 2009
    Messages:
    27
    Likes Received:
    0
    Why? Laziness. I could fix that if someone asks for it, though.
    The source code is not available yet.

    EDIT: here's a version that will spawn twice as many threads as you have logical CPUs:
    http://www.sendspace.com/file/j4sszn
     
  19. ahu

    ahu
    Newcomer

    Joined:
    Jul 19, 2008
    Messages:
    56
    Likes Received:
    2
    Here's my results with a dual Xeon at 2.67 GHz, HT on:

    Original version
    ** Non-local means 7x7 **
    15.86520222584356 ms
    121.0195730674282 Megapixel/sec

    "Many threads"-version
    ** Non-local means 7x7 **
    37.51848853979317 ms
    51.1747694197114 Megapixel/sec


    So there's a performance bug somewhere in the second version, at least with my 12-core system. With the original version I probably would have had better results with HT disabled.
     
  20. prunedtree

    Newcomer

    Joined:
    Aug 8, 2009
    Messages:
    27
    Likes Received:
    0
    Yeah, it doesn't seem to scale all that well on 2P systems. On a system with two X5550s it doesn't seem to get past 100 MP/s (it is a bit faster with the r4, though).

    I didn't tweak the multi-threading much, as my goal was mainly to match pcchen's GTX 285 with an i7 920 (my back-of-the-envelope calculation showed I had a shot).

    15 ms is practically twice as fast already ;)

    If you are interested, I could try to make it scale better for > 4 cores, but that wasn't my original intent.

    What I'm the most curious about now is Sandy Bridge performance.
     