#1 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
I've written a simple NLM denoise program with DX11 compute shader 4.0. It works on my GTX 285, but I don't know if it works on anything else...
Anyway, if anyone's interested it can be downloaded here. On my GTX 285 it took around 420 ms to denoise the sample image in the file.
For those interested in the source code (which is pretty boring), it can be downloaded here.
Note: it can only output BMP files right now.
EDIT: A slight modification improves the speed to around 280 ms. |
|
|
|
|
|
#2 |
|
Member
Join Date: May 2002
Location: Slovenia
Posts: 420
|
What driver are you using? I can't run any direct compute stuff on my machine and it's a properly configured Vista 64 with latest DX SDK and everything. Are you on Win7?
|
|
|
|
|
|
#3 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
I use the previous 190.62 driver on Windows 7 x64. I've heard that in order to enable compute shaders on Vista, a registry key has to be modified, but I don't know which one (I'll have to check NVIDIA's GPU computing SDK for that). I don't know whether the latest driver (191.07) still needs that, though.
Also, the latest version takes around 250 ms on my GTX 285. Basically I changed the type of the colors from float4 to float3, which helps a bit on NVIDIA's scalar architecture. |
|
|
|
|
|
#4 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
I checked the release notes, and they say that the registry keys named "D3D_39482904" should be deleted (there are about 2 instances of them).
|
|
|
|
|
|
#5 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
I did some tuning on this and now a GTX 285 takes 223 ms to denoise. I also modified the program to write the compiled shader into a file named shader.bin so it doesn't need to compile it again next time.
A friend tested it for me on a GeForce 9600GT and it takes 717 ms. Right now, my shader relies on texture cache to reduce memory bandwidth requirement. It's possible to use shared memory to do this, but it's much more complicated. |
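The shader.bin behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not the program's actual code: `load_or_compile` and the `compile_shader` callback are hypothetical names, and the real program would call the D3D compiler where the callback is invoked.

```python
from pathlib import Path
from typing import Callable


def load_or_compile(cache_path: Path, compile_shader: Callable[[], bytes]) -> bytes:
    """Return shader bytecode, compiling only when no cached copy exists.

    `compile_shader` is a hypothetical stand-in for the real (slow) compile step.
    """
    if cache_path.exists():
        # A cached blob is present: reuse it and skip compilation entirely.
        return cache_path.read_bytes()
    bytecode = compile_shader()
    # Store the result so the next run can take the fast path above.
    cache_path.write_bytes(bytecode)
    return bytecode
```

Note that a cache like this never notices when the shader source changes, so a stale shader.bin has to be deleted by hand.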
|
|
|
|
|
#6 |
|
Junior Member
Join Date: Jun 2008
Posts: 35
|
Just tested it on my 5850. Are both images supposed to look the same?
Output says:
Setup time: 561
Load file time: 47
Denoise and write file time: 234
Edit: never mind, it seems to work. My 5850 is at 775 core and 1100 for memory.
Last edited by sc3252; 08-Oct-2009 at 05:22. |
|
|
|
|
|
#7 |
|
Member
Join Date: May 2002
Location: Slovenia
Posts: 420
|
Thanks Chen! That helped.
Setup time: 553
Load file time: 113
Denoise and write file time: 290
On a GTX 280 with Vista 64 with 191.03 drivers. |
|
|
|
|
|
#8 | ||
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
Quote:
Quote:
|
||
|
|
|
|
|
#9 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
I have updated the program with support for a pixel shader code path, so now it should run on any video card with pixel shader 4.0 support (that means DX10 feature level), although it still requires the DX11 runtime to run.
From what I've seen on my GTX 285, the performance is almost the same (since the shader is, well, almost the same). The only downside of the pixel shader path is that I used D3DX to write the texture directly to a BMP file, and it decided to write it in 32 bpp mode rather than 24 bpp mode, so the resulting file is larger.
Programming-wise, the pixel shader path is much more annoying than I anticipated, partly because I haven't used D3D10 for 3D rendering for quite a while, and D3D11 is apparently more complex than D3D10. Other than that, it doesn't seem to have any nasty surprises, which is a good thing.
I've tried a few ways to utilize shared memory to reduce the number of texture loads. However, it increases pressure on the ALU, and at least on my GTX 285 the shader is already nearly ALU-limited, so performance is always worse. Maybe on RV870 it could be a different story. |
|
|
|
|
|
#10 |
|
Senior Member
|
Code:
NLM denoise using DirectX 11
Ping-Che Chen
sigma: 0.02
Using pixel shader.
Setup time: 20ms
Load file time: 30ms
Denoise and write file time: 560ms
Win7 x86 (7600.20510), Cat 9.11b;
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
#11 |
|
Senior Member
|
Can someone - please! - re-upload a working archive for pcchen's program? Every time I try downloading it from the original link, I get an error message when trying to decompress the rar archive (which is only a few kByte).
edit: Never mind, it was only me trying to save the file via rightclick - save as... - and that did not work.
__________________
English is not my native tongue. Before flaming please consider the possibility that I did not mean to say what you might have read from my posts.
Work | Recreation
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration! |
|
|
|
|
|
#12 |
|
Regular
Join Date: Mar 2007
Posts: 8,992
|
Hmmm, doesn't appear my 5870 is using the compute shader version? Is there a way to force it to use the compute shader?
My results...
Code:
NLM denoise using DirectX 11
Ping-Che Chen
sigma: 0.02
Using pixel shader.
Setup time: 20ms
Load file time: 36ms
Denoise and write file time: 415ms
SB |
|
|
|
|
|
#13 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
There was a bug in compute shader detection; it should be fixed now. If there's any problem please let me know, thanks.
|
|
|
|
|
|
#14 |
|
Regular
Join Date: Mar 2007
Posts: 8,992
|
Tried the new version and seems to work fine now as far as I can tell. Small speedup from PS version...
Code:
NLM denoise using DirectX 11
Ping-Che Chen
sigma: 0.02
Using compute shader.
Setup time: 16ms
Load file time: 36ms
Denoise and write file time: 368ms
Regards, SB |
|
|
|
|
|
#15 |
|
Senior Member
|
It's the same with most DX11 samples from the SDK, which you can also run in feature level 10.x mode. Maybe GDS isn't activated in the drivers yet?
__________________
English is not my native tongue. Before flaming please consider the possibility that I did not mean to say what you might have read from my posts.
Work | Recreation
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration! |
|
|
|
|
|
#16 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
I updated it with a new pixel shader, which reduces the number of texture loads by preloading the data into a local array. This is extremely slow on the GTX 285 (about 1800 ms), perhaps because the GTX 285 does not support indexed arrays in registers. On my Radeon HD 4850, however, it runs much faster: the original pixel shader takes about 900 ms, while the new shader takes only 520 ms. This also confirms my suspicion that RV770/RV870 is limited by texture loads in the original shader.
If anyone with a Radeon wants to test the new shader, use the -p2 switch, i.e.
nlm_cs -p2 IMG_0025.JPG output.bmp |
|
|
|
|
|
#17 |
|
Regular
|
Using:
http://developer.amd.com/gpu/shader/Pages/default.aspx
Looking at shader_ps.hlsl, the PSMain shader has a really low ALU:fetch ratio in each of the loops: 6 ALU and 3 fetch.
Looking at PSMain2, the ATI compiler is allocating 43 vec4 registers (I haven't counted how many of the effective 172 scalar registers are actually being used). That means only 5 threads of 64 pixels can be in flight. The low ALU:fetch ratio, 2.7:1, resulting from 49 vfetch instructions, 27 loads and 206 ALU instructions in the main loop, means ATI is still fetch limited.
I guess PSMain2 uses too many registers for NVidia's architecture, so it is forced to spill to main memory.
Jawed
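The figures in this post can be checked with a little arithmetic. The sketch below is only an illustration: the 256 vec4 registers available per lane is an assumption about the ATI register file (it is not stated in the thread) chosen because it reproduces the quoted thread counts, and both function names are made up for this example.

```python
def wavefronts_in_flight(vec4_regs_per_thread: int, regs_per_lane: int = 256) -> int:
    """Threads (wavefronts of 64 pixels) that fit in the register file.

    regs_per_lane=256 is an ASSUMPTION, not a figure taken from the thread.
    """
    if vec4_regs_per_thread <= 0:
        raise ValueError("register count must be positive")
    return regs_per_lane // vec4_regs_per_thread


def alu_fetch_ratio(alu: int, vfetch: int, loads: int) -> float:
    """ALU instructions per fetch-type instruction (vfetch plus loads)."""
    return alu / (vfetch + loads)


if __name__ == "__main__":
    print(wavefronts_in_flight(43))                # PSMain2 as quoted: 43 vec4 registers
    print(43 * 4)                                  # the same registers counted as scalars
    print(round(alu_fetch_ratio(206, 49, 27), 1))  # main-loop counts from the post
```

Under that assumption, 43 registers give 5 threads in flight, 43 vec4 registers correspond to 172 scalars, and 206 ALU instructions over 49 + 27 fetches come to roughly 2.7:1, matching the numbers quoted above.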
__________________
Can it play WoW? |
|
|
|
|
|
#18 |
|
Regular
Join Date: Mar 2007
Posts: 8,992
|
Whoa that's a dramatic speed increase for the new PS path.
Code:
NLM denoise using DirectX 11
Ping-Che Chen
sigma: 0.02
Using pixel shader.
Setup time: 32ms
Load file time: 42ms
Denoise and write file time: 261ms
SB |
|
|
|
|
|
#19 | |||
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
Quote:
Quote:
The ideal way to do this is to use shared memory, which can further reduce the number of texture loads considerably (for example, a 32x16 thread group only needs 968 loads for all its threads, rather than 169 per thread without shared memory). Unfortunately, the restrictions of cs 4.0 make this very inconvenient. If I get an RV870 I can do that with cs 5.0. Another way is to use OpenCL, which does not have this restriction.
Quote:
|
|||
|
|
|
|
|
#20 | |
|
Regular
|
Quote:
Bizarrely, the ATI code uses purely static indexing. Seems like a compiler bug: it should see these are static indexes and allocate fixed registers. It might be a side effect of the indexing produced by the HLSL compiler, now that I've looked at the D3D assembly.
Generally I was under the impression that NVidia GPUs do support indexed registers, as long as there aren't too many and they can be statically allocated.
You could actually unroll the inner computation loop:
Code:
[unroll]
for(k = 0; k <= kernel_half*2; k++) {
    [unroll]
    for(l = 0; l <= kernel_half*2; l++) {
        float3 cd = c2s[l] - c1s[k + l];
        float weight = g_gaussian[p + l];
        dist[k] += (dot(cd, cd) * weight);
    }
}
This results in an allocation of 52 vec4 registers, which means only 4 threads. Slide 23 of http://gpgpu.org/wp/wp-content/uploa...chitecture.pdf says 3 is the minimum, but 5 is better. So 4 threads might be pushing it somewhat. Worth a try.
On HD5870 the DOT4 instructions would become DOT3s, which could lower the ALU:fetch ratio. Until GPUSA is updated for R800 it's just a guessing game, though. I really don't understand why AMD doesn't have a revised version out there already.
Jawed
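For readers who want to see what the loop above accumulates, here is a plain CPU restatement in Python. It is a sketch under stated assumptions, not the shader itself: `KERNEL_HALF = 3` is inferred from the 13-long and 7-long loops mentioned elsewhere in the thread, `p` is treated as a plain offset into the Gaussian weight table because the thread does not show how it is set, and `patch_distances` is a made-up name.

```python
KERNEL_HALF = 3  # ASSUMPTION: gives 7 taps, so c2s has 7 entries and c1s has 13


def patch_distances(c1s, c2s, gaussian, p=0):
    """Gaussian-weighted squared colour distance for each of the 7 shifts k.

    c1s: 13 RGB tuples, c2s: 7 RGB tuples, gaussian: weight table.
    `p` is a hypothetical offset into `gaussian` (not defined in the thread).
    """
    taps = 2 * KERNEL_HALF + 1
    dist = [0.0] * taps
    for k in range(taps):
        for l in range(taps):
            # Per-channel difference, like float3 cd = c2s[l] - c1s[k + l]
            cd = [a - b for a, b in zip(c2s[l], c1s[k + l])]
            # dot(cd, cd) * weight, accumulated per shift k
            dist[k] += sum(x * x for x in cd) * gaussian[p + l]
    return dist
```

Both loop bounds are compile-time constants once `kernel_half` is fixed, which is what makes full unrolling possible in the first place.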
__________________
Can it play WoW? |
|
|
|
|
|
|
#21 |
|
Senior Member
Join Date: Oct 2006
Location: Germany
Posts: 1,003
|
sigma: 0.02
Using pixel shader.
Setup time: 78ms
Load file time: 63ms
Denoise and write file time: 483ms
HD4850 (700/993 MHz), PCIe 1.0 (16x)
__________________
Hail Brothers and Sisters! Coranon Silaria, Ozoo Mahoke Eta Kooram Nah Smech! Find Chuck Norris. |
|
|
|
|
|
#22 | ||
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
Quote:
The unrolled shader takes very long to compile (on my computer it took almost 43 seconds). However, the result is much better, at 340 ms on my Radeon HD 4850 (previously about 520 ms). The compile time is not a very big issue because the program saves compiled shaders into binary files for later use.
If anyone wants to try this, just add [unroll] to the two loops in shader_ps.hlsl, as in the code above. Remember to delete all .bin files in the directory though, as they may contain binaries of the old shaders. I'll update the files on the server as soon as possible.
Quote:
|
||
|
|
|
|
|
#23 |
|
Regular
|
Nice, that's a good speed-up, approaching 3x from the original pixel shader.
Your feedback time was almost as fast as I could have done it myself if I had all the stuff required.
I'm amazed by the compilation time though. Here it only takes a couple of seconds on my crappy A64 3500 (2GHz X2) in GPUSA. Check the compilation time in GPUSA, it should be faster. I wonder what's different.
I think there are heuristics for unrolling, e.g. detecting that there are 16 or fewer static iterations of a loop (I've seen this behaviour in Brook+), but I've not really seen any documentation on this subject. It might be in the hands of what the HLSL compiler outputs.
E.g. I have just commented out all the loop-related pragmas and told GPUSA to avoid control flow. The resulting D3D assembly contains only a single loop, is 1398 instructions long and uses no indexable registers. The assembly uses 113 vec4 registers.
If I take your code from before my suggestion, and comment out the pragmas, the default HLSL compilation in GPUSA produces the same code as with your pragmas, so it is seeing the 13-long and 7-long static loops. I've just noticed that this D3D assembly contains 3 indexable registers, but the resulting hardware assembly only has 1.
Jawed
__________________
Can it play WoW? |
|
|
|
|
|
#24 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
Yeah, the compile time is puzzling. But I've seen that before. It's probably a problem with D3D's HLSL compiler. GPUSA is so fast that its default behavior is to recompile after every character you type into the source code window :P
To my understanding, the compiled binary is platform agnostic, as it's basically just the assembly form (although in SM4.0 it's no longer possible to write assembly directly). So the long compile time is actually not a big problem. A more practical problem is that there is no obvious way to decide which shader to use other than plainly doing a benchmark (or by vendor ID... but I don't really like that because it's too restrictive).
Now I'm more interested in doing an OpenCL version, maybe using shared memory. It'd be interesting to compare it with the compute shader version.
By the way, I wrote a CPU version a year ago, using SSE instructions. It's not completely the same: the CPU version works in YUV 4:2:0 color space rather than RGB, so it's less work. It's also multi-threaded. On my Core i7 920, it takes about 1.1 seconds to denoise the same image, slower than a lowly GeForce 9600GT. I think that shows the amazing potential of GPUs for image processing work like this. |
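The "plainly doing a benchmark" option for choosing a shader could look something like the sketch below. It is an illustration only: `pick_fastest`, the candidate labels and the injectable `clock` are all hypothetical, and each candidate callable is assumed to block until the GPU has actually finished.

```python
import time
from typing import Callable, Dict


def pick_fastest(candidates: Dict[str, Callable[[], None]],
                 runs: int = 3,
                 clock: Callable[[], float] = time.perf_counter) -> str:
    """Time every candidate shader path and return the name of the quickest.

    Each callable is ASSUMED to run one full denoise and wait for completion.
    """
    if not candidates or runs < 1:
        raise ValueError("need at least one candidate and one run")
    best_name, best_time = "", float("inf")
    for name, run in candidates.items():
        run()  # warm-up, so one-off setup cost is not measured
        samples = []
        for _ in range(runs):
            start = clock()
            run()
            samples.append(clock() - start)
        fastest = min(samples)  # best-of-N is less sensitive to noise
        if fastest < best_time:
            best_name, best_time = name, fastest
    return best_name
```

The result could be stored next to the cached shader binaries so the measurement only has to run once per machine.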
|
|
|
|
|
#25 |
|
Regular
|
I've had 5 minute+ compile times with Brook+. Unrolled stuff is normally what kills it. Brook+ seems to use the HLSL compiler though, so I've not split it out to identify which compiler is thrashing.
This is a good example of why developing for a single console makes for optimised code.
I do also think that being able to see the assembly and statistics relating to hardware execution is a vital part of tuning for performance. Sure you could suck it and see for each combination of unrolls in this shader, but the insights you can gain on the hardware execution model are pretty useful.
In the original pixel shader, I suspect that cache access patterns are really important. I also think that NVidia's more fluid instruction scheduling (the ability for loads to complete individually and in any order and for the dependent ALUs to execute immediately, as opposed to the strictly clause-based approach on ATI) is what makes the performance so good.
Is it possible to separate out the computation time and the file-write time? I'm also thinking that for real benchmarking it'd be worthwhile doing more than one denoise. I suppose the alternative would be to use a monster 24MP image.
For OpenCL this might be a useful leg-up:
http://developer.amd.com/gpu/ATIStre...ingOpenCL.aspx
Though the article is only written with a CPU as a target and doesn't touch local memory. It does some vectorisation and unrolling, though there's no unroll pragma in AMD's compiler it seems, whereas there is in NVidia's.
Whole pile of OpenCL related webinars coming up:
http://developer.nvidia.com/object/g...ng_online.html
At the very bottom of the page is a WMV and PDF for "Best Practices for OpenCL Programming". The audio quality's a bit ropey though.
Jawed
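Separating the computation time from the file-write time, and averaging over several denoise passes, might be structured as in the sketch below. It is not taken from the program: `benchmark`, `denoise` and `write_output` are hypothetical names, and `denoise` is assumed to return only once the GPU work has really completed.

```python
import time
from typing import Any, Callable, Tuple


def benchmark(denoise: Callable[[], Any],
              write_output: Callable[[Any], None],
              repeats: int = 10,
              clock: Callable[[], float] = time.perf_counter) -> Tuple[float, float]:
    """Return (mean denoise ms, file write ms), measured separately.

    ASSUMPTION: `denoise` blocks until its result is ready to be written.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    result = denoise()  # warm-up pass, excluded from the timing
    start = clock()
    for _ in range(repeats):
        result = denoise()
    compute_ms = (clock() - start) * 1000.0 / repeats
    start = clock()
    write_output(result)  # timed on its own so disk speed does not skew the GPU figure
    write_ms = (clock() - start) * 1000.0
    return compute_ms, write_ms
```

Reporting the two numbers separately would make results from different machines easier to compare, since the combined "Denoise and write file time" also depends on the disk.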
__________________
Can it play WoW? |
|
|
|