#1 | |
Junior Member
Join Date: Jun 2007
Posts: 91
Hello,
I'm writing a Perlin noise generator kernel, and I'm trying to optimize it. The kernel uses 2 small tables (8KB total) with precomputed random values (part of the Perlin noise algorithm). Each thread needs 30 read accesses to the tables, at effectively random locations. At the end, the kernel writes a single float value to global memory as its result.
I have tried 3 different versions of the kernel, placing the tables in different locations: in global memory, in constant memory, and in textures. The execution time of the 3 methods is almost the same (less than 1% difference). I'm using the CUDA Visual Profiler, and this is the result:
[Profiler screenshots: Global Memory, Constants, Texture Fetching]
The benchmark tries all the possible <<<numBlocks, blockSize>>> combinations and selects the best:
Quote:
Global memory: 77% gld coalesced / 22% instructions. GPU Time: 2213 / Occupancy: 0.25
Constants: 68% warp serialize / 30% instructions. GPU Time: 1657 / Occupancy: 0.75
Textures: 2% gst coalesced / 97% instructions. GPU Time: 1118 / Occupancy: 0.25
I'm really confused. This code is going to be part of a personal project: http://www.coopdb.com/modules.php?name=BR2fsaa&op=Info
Please, I need advice to optimize my code. I run a quad-core Xeon 3350 @ 3.6 GHz and an eVGA GTX 285 SSC. By the way, the code runs 27x faster on the GPU than on the CPU, but I think that it could be faster. Thank you very much!
#2 |
Junior Member
Join Date: Jun 2007
Posts: 91
OK, I was totally wrong.
I was running the verification test only at <<<256, 256>>>, without doing any error checking for the rest of the kernel sizes, so I was selecting 'bad' kernel sizes. With that fixed, the texture fetch version has the best performance.
But now I have a terrible fear. The best result, with <<<1024, 64>>>, is 0.255724s, and my multi-threaded SSE3 Perlin version needs 0.65s on my quad-core to run the same test. It's only 2.5x faster than my CPU, and I've got one of the fastest GPUs at the moment.
#3 | |
Junior Member
Join Date: Jun 2007
Posts: 91
Quote:
#4 |
Member
Join Date: Apr 2004
Posts: 416
Numbers look about right. Your best bet is to reduce the number of registers in the Texture path, or reduce the number of dependent texture reads (without seeing your code, I can't tell which is the problem). If you can get it down to 20 registers, that would be best.
Alternatively, can you make your tables total 2 KB for both together? The Constant path should look better that way.
Alternative #3: Copy the two tables into shared memory, then read from there. That should give you the best performance, if you can fit your tables in there.
Hope this helps.
__________________
Vincent: G80 is designed for time to market, whereas the R600 is specialized in the rich feature.
#5 |
Senior Member
Yeah, shared memory should be better since you are doing random lookups. Constant memory is optimized for broadcast, where all threads of a half-warp read the same address.
#6 | |
Member
Quote:
What also amazes me is that there is so little difference between single-block and multi-block kernel launches. Wouldn't that mean the algorithm used is not ALU bound at all?
#7 |
Regular
3DMark06's Perlin Noise shader is 447 ALU instructions and 48 texture instructions - those are the raw D3D assembly statistics. It is most definitely ALU bound on NVidia.
Jawed
__________________
Can it play WoW?
#8 |
Member
Yeah, I knew that as well. That's why I don't know what to think of those results. Shouldn't the number of blocks mean the number of TPCs (or SMs, I don't really know, but it doesn't change much) to use in the kernel launch? If 1 is almost as good as 2, 16, etc., then the ALUs in the extra clusters aren't really needed, right?
That's what I was trying to say.
#9 | |
Senior Member
Quote:
#10 | |
Senior Member
Quote:
#11 |
Regular
Using:
http://developer.download.nvidia.com...calculator.xls
with Compute capability set to 1.2 or 1.3, his reported 2048 blocks each of 32 threads (with 21 registers and 64 bytes per thread in shared memory) implies that there are only 8 warps running on each multiprocessor. That will hide some memory latency, but not much. Since all the fetches from memory are random, what's needed is maximum latency hiding.
The "Varying Block Size" graph indicates that 192 and 384 threads per block are optimal. These both result in there being 24 warps running on each multiprocessor. Much, much healthier for hiding latency.
In summary, it seems the benchmark he chose to run is too coarse in its steppings, so he missed 192 and 384 threads per block as the best candidates.
The 3DMark06 shader only uses 8 registers, so perhaps he can get the register allocation of his code down? With 16 registers per thread and a block size of 128 he should theoretically get 32 warps (the maximum) per multiprocessor, thus maximising the latency hiding.
According to this:
http://parlab.eecs.berkeley.edu/pubs...nchmarking.pdf
each cluster in 8800GTX has 5KB of cache. So if his lookup tables could be squeezed down to that or smaller in total, he could avoid the worst memory latency entirely. Not sure what this size is in GT200 GPUs such as GTX285.
Jawed
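The warp counts quoted in this post can be roughly reproduced with a few lines of Python. This is only a sketch of the register-limited part of the occupancy calculation: the per-multiprocessor limits below (16384 registers, 32 warps, 8 blocks) are assumed values for compute capability 1.2/1.3, shared memory and register allocation granularity are ignored, and the function names are made up for illustration.

```python
import math

# Assumed per-multiprocessor limits for compute capability 1.2/1.3.
# These constants are assumptions of this sketch, not values from the thread.
REGISTERS_PER_SM = 16384
MAX_WARPS_PER_SM = 32
MAX_BLOCKS_PER_SM = 8
WARP_SIZE = 32


def resident_warps(threads_per_block, registers_per_thread):
    """Estimate resident warps per multiprocessor, ignoring shared memory."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    registers_per_block = warps_per_block * WARP_SIZE * registers_per_thread
    # The number of resident blocks is capped by the tightest of three limits.
    blocks = min(
        MAX_BLOCKS_PER_SM,
        REGISTERS_PER_SM // registers_per_block,
        MAX_WARPS_PER_SM // warps_per_block,
    )
    return blocks * warps_per_block


def occupancy(threads_per_block, registers_per_thread):
    """Resident warps as a fraction of the assumed maximum."""
    return resident_warps(threads_per_block, registers_per_thread) / MAX_WARPS_PER_SM


# The configurations discussed in the post: (threads per block, registers per thread).
for threads, regs in [(32, 21), (192, 21), (384, 21), (128, 16)]:
    print(threads, regs, resident_warps(threads, regs), occupancy(threads, regs))
```

Under these assumptions, 32-thread blocks are capped by the 8-blocks-per-multiprocessor limit, which is why they end up with only 8 warps regardless of register use.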
#12 |
Regular
Whoops, forgot that D3D registers are vec4 whereas CUDA registers are scalar.
Jawed
#13 |
Member
Hmm... thanks, I don't know where I got the idea that a single block runs on an SM.
#14 | |
Regular
Quote:
My probability theory muscles are poorly exercised, so could someone else give the average number of duplicates when you roll a 16-sided die 16 times? That's the number of extra cycles this access would take on average.
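The dice question has a short closed form, sketched here in Python. It assumes each of the 16 rolls (threads of a half-warp) picks one of 16 faces (banks) uniformly and independently; the function names are made up for illustration. The expected number of distinct faces is 16 * (1 - (15/16)^16), so the expected number of duplicates comes out at roughly 5.7. As far as I understand bank conflicts, the serialization is set by the most heavily hit bank rather than by the total number of duplicates, so the duplicate count is an upper bound on the extra accesses; the simulation tracks both figures.

```python
import random


def expected_duplicates(sides, rolls):
    """Expected number of rolls that repeat an already-seen face."""
    expected_distinct = sides * (1.0 - (1.0 - 1.0 / sides) ** rolls)
    return rolls - expected_distinct


def simulate(sides, rolls, trials, seed=0):
    """Monte Carlo estimate of (mean duplicates, mean worst-bank extra accesses)."""
    rng = random.Random(seed)
    total_duplicates = 0
    total_worst_extra = 0
    for _ in range(trials):
        counts = [0] * sides
        for _ in range(rolls):
            counts[rng.randrange(sides)] += 1
        distinct = sum(1 for c in counts if c > 0)
        total_duplicates += rolls - distinct
        # The busiest bank needs max(counts) accesses, i.e. max(counts) - 1 extra.
        total_worst_extra += max(counts) - 1
    return total_duplicates / trials, total_worst_extra / trials


print(round(expected_duplicates(16, 16), 2))
print(simulate(16, 16, trials=2000))
```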
#15 |
Junior Member
Join Date: Jun 2007
Posts: 91
The occupancy in the best case is 0.25, so I guess that the main problem is memory latency.
In my SSE3 version for the CPU, I used a lot of prefetching to hide it.
I have some questions. I have not tried shared memory yet, but I read that it should be as fast as registers. Is it possible to 'preload' the tables into shared memory before launching the kernel? My tables are read-only. If not, what if I load the 2 tables into constant memory, and then copy them to shared memory? I think that this could be a disaster.
I have a 2nd version of the algorithm that slightly lowers the memory access count (from 30 to 28), but the tables are bigger. So I guess that this won't be a good idea.
Thank you very much to everybody.
#16 |
chaos dunk
Join Date: May 2003
Location: Mountain View, CA
Posts: 3,274
"The occupancy in the best case is 0.25, so, i guess that the main problem is the memory latency."
Not necessarily true; it depends entirely on how many loads you are hiding. If you're preloading 50 registers/shared memory variables, it's certainly possible to hide memory latency with 0.25 occupancy (especially on Compute 1.2 or higher devices with the higher occupancy limit).
What you should do is look at the bandwidth you're using in your kernel: basically, remove all arithmetic and just do loads and stores. Then compute (taking into account things like uncoalesced accesses) what percentage of theoretical memory bandwidth you're getting. If it's near peak, then the obvious answer is to fix coalescing.
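The percentage-of-peak figure suggested here is simple arithmetic once the stripped-down kernel has been timed. The Python sketch below shows one way to compute it. The function names are made up, the 512-bit bus width is an assumption about the GTX 285, the 2646 MHz effective memory clock is the figure the original poster lists for his card later in the thread, and the thread count, 4-byte element size and 0.05 s timing are purely hypothetical inputs.

```python
def peak_bandwidth_gb_s(effective_memory_clock_mhz, bus_width_bits):
    """Theoretical peak bandwidth: transfers per second times bytes per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return effective_memory_clock_mhz * 1e6 * bytes_per_transfer / 1e9


def achieved_bandwidth_gb_s(bytes_read, bytes_written, seconds):
    """Bandwidth actually moved by a loads-and-stores-only kernel."""
    return (bytes_read + bytes_written) / seconds / 1e9


def fraction_of_peak(achieved, peak):
    return achieved / peak


# Assumed hardware figures: 2646 MHz effective memory clock, 512-bit bus.
peak = peak_bandwidth_gb_s(2646, 512)

# Hypothetical measurement: 2**24 threads, each doing 30 four-byte reads
# and one four-byte write, with the stripped-down kernel taking 0.05 s.
threads = 1 << 24
achieved = achieved_bandwidth_gb_s(threads * 30 * 4, threads * 4, 0.05)

print(peak, achieved, fraction_of_peak(achieved, peak))
```

Uncoalesced accesses move more bytes over the bus than the kernel asks for, so the byte counts fed in here should reflect the transactions actually issued if that information is available from the profiler.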
#17 | |
Senior Member
Quote:
#18 | ||
Member
Join Date: Mar 2008
Posts: 154
Quote:
Quote:
The best thing to do is to simply do coalesced copies from global memory. Nothing fancy required.
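The indexing behind such a coalesced copy can be modelled on the CPU. The Python sketch below is not kernel code: it only simulates which table entries each thread of a block would copy when thread t handles entries t, t + N, t + 2N, ... for a block of N threads, and checks that the threads of each half-warp touch consecutive addresses in every pass. The function names, the 192-thread block and the 2048-entry table (8 KB of 4-byte values) are assumptions for illustration. In a real kernel, this loop would be followed by a `__syncthreads()` barrier before any thread reads the shared copy.

```python
def cooperative_copy(table, threads_per_block):
    """Model a block-wide strided copy: thread t copies t, t + N, t + 2N, ..."""
    shared = [None] * len(table)
    touched_by = [[] for _ in range(threads_per_block)]
    for thread_id in range(threads_per_block):
        for index in range(thread_id, len(table), threads_per_block):
            shared[index] = table[index]
            touched_by[thread_id].append(index)
    return shared, touched_by


def is_coalesced_pass(touched_by, pass_number, half_warp=16):
    """Check that each half-warp reads consecutive addresses in a given pass."""
    for start in range(0, len(touched_by), half_warp):
        group = touched_by[start:start + half_warp]
        addresses = [t[pass_number] for t in group if pass_number < len(t)]
        if any(b - a != 1 for a, b in zip(addresses, addresses[1:])):
            return False
    return True


table = list(range(2048))  # assumed: 8 KB of 4-byte entries
shared, touched_by = cooperative_copy(table, threads_per_block=192)
passes = max(len(t) for t in touched_by)
print(shared == table, all(is_coalesced_pass(touched_by, p) for p in range(passes)))
```

An 8 KB table takes up a large share of a multiprocessor's shared memory on this generation of hardware, so this approach may limit how many blocks can be resident at once.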
#19 | |
Junior Member
Join Date: Jun 2007
Posts: 91
I have a new version.
It uses 2 methods: texture fetching / shared memory.
It's 6.5x faster than the CPU. It will be hard to make it faster.
You can download it here: http://www.speedyshare.com/455357158.html
I have problems running it on my old G80. If somebody can try it, I would like to know if it works with other cards.
#20 |
Senior Member
Are your results from a G80?
__________________
English is not my native tongue. Before flaming please consider the possibility that I did not mean to say what you might have read from my posts. Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration!
#21 |
Junior Member
Join Date: Jun 2007
Posts: 91
No, I ran the test on my GTX 285.
I can't make it work on my old 8800GTX. I get a kernel invocation error when I launch the kernel.
#22 |
Senior Member
Ah, OK. I was running it on my underclocked GTX280 and got lower scores than you. Since I thought you used a G80, I was quite surprised.
Thanks for the clarification!
#23 |
Junior Member
Join Date: Jun 2007
Posts: 91
It's one of those overclocked cards: eVGA GTX 285 SSC (702 / 1584 / 2646). I can run my favorite games at (750 / 1584 / 2800), but if I run the Perlin benchmark with these clocks, my system hangs within a few seconds. I guess that the benchmark could be used as a stability tester too.
#24 | |
Junior Member
Join Date: Jun 2007
Posts: 91
A new "beta" version:
http://www.speedyshare.com/366621969.html
It will benchmark your CPU vs your GPU. It supports multi-GPU rigs too. You can specify the number of GPUs to use from the command line. According to NVIDIA's papers, you will need to disable SLI to use multiple GPUs in CUDA.
Examples:
br2perlin 1 5 -> This will use just 1 GPU
br2perlin 2 5 -> This will use 2 GPUs
The library also supports using the CPU and GPU at the same time. In theory, when I designed it, I thought that CPU+GPU was going to be faster, but due to the asynchronous nature of CUDA, it ends up slower than the CPU or GPU alone.
My BR2 Patch is using the new CUDA code now, so the Perlin effects run on the GPU. Unfortunately, if you only have 1 graphics card, this is not a good idea, because the framerate is lower due to the resources used for the CUDA calculations. But if you have 2 graphics cards, you won't lose any fps, and the Perlin code will run faster on the GPU (bigger and more complex effects). Basically, I've written this to use my old 8800GTX to run the Perlin effects, and my GTX285 to render the shiny graphics at 1920x1200 SSAA 2x.
The results of my Xeon 3350 @ 3.6 GHz + eVGA GTX 285 SSC:
Quote:
#25 |
Junior Member
Join Date: Jun 2007
Posts: 91
A quick fix (0.46): http://www.speedyshare.com/633582280.html
It seems like some (budget) cards do not launch the kernel properly. This version adds kernel launch error detection. On my old 8800GTX it needs 0.22s, which seems OK compared to the GTX285.