21-Feb-2009, 17:05   #1
Rayne

CUDA: Global Memory vs Constant vs Texture Fetch Performance

Hello,

I'm writing a Perlin noise generator kernel, and I'm trying to optimize it.

The kernel uses 2 small tables (8 KB total) of precomputed random values (part of the Perlin noise algorithm).

Each thread makes 30 reads from the tables, at essentially random locations.

At the end, the kernel writes a single float result to global memory.

I have tried 3 versions of the kernel, placing the tables in different memory spaces: global memory, constant memory, and textures.

The execution time of the 3 methods is almost the same (less than 1% difference).
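
Roughly, the three placements look like this (simplified, with illustrative names and indexing - not my actual kernel):

__device__ int scramble(int x)                        // stand-in for the Perlin permutation hashing
{
    return (x * 1103515245 + 12345) & 2047;
}

__constant__ float c_table[2048];                     // 8 KB of floats in constant memory

texture<float, 1, cudaReadModeElementType> t_table;   // texture reference, bound with cudaBindTexture() on the host

__global__ void perlin_global(const float *g_table, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = g_table[scramble(i)];                    // plain global-memory load
}

__global__ void perlin_const(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = c_table[scramble(i)];                    // read through the constant cache
}

__global__ void perlin_tex(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = tex1Dfetch(t_table, scramble(i));        // read through the texture cache
}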

I'm using the CUDA Visual profiler, and this is the result:

[Profiler screenshots omitted: Global Memory, Constants, Texture Fetching]

The benchmark tries every possible <<<numBlocks, blockSize>>> combination and selects the best:

Quote:
BloodRayne 2 FSAA Patch - CUDA Perlin Benchmark Tool 0.1 Alpha (Global Memory)
--------------------------------------------------------------

Running Benchmarks ...
----------------------
[1, 65536] Total Time: 0.024465s
[2, 32768] Total Time: 0.024471s
[4, 16384] Total Time: 0.024478s
[8, 8192] Total Time: 0.024522s
[16, 4096] Total Time: 0.024645s
[32, 2048] Total Time: 0.024602s
[64, 1024] Total Time: 0.024712s
[128, 512] Total Time: 0.538523s
[256, 256] Total Time: 0.540375s
[512, 128] Total Time: 0.544998s
[1024, 64] Total Time: 0.537881s
[2048, 32] Total Time: 0.520117s
[4096, 16] Total Time: 0.512108s
[8192, 8] Total Time: 0.536764s
[16384, 4] Total Time: 0.638648s
[32768, 2] Total Time: 1.100332s
[65536, 1] Total Time: 0.024361s

Best Config [65536, 1]: 0.024361s

Running Verification Tests ...
------------------------------
Everything OK

Z:\code\Visual Studio Projects\BloodRayne 2\br2cudaPerlin\Debug>br2cudaperlin 200 256

BloodRayne 2 FSAA Patch - CUDA Perlin Benchmark Tool 0.1 Alpha (Constants)
--------------------------------------------------------------

Running Benchmarks ...
----------------------
[1, 65536] Total Time: 0.024708s
[2, 32768] Total Time: 0.024606s
[4, 16384] Total Time: 0.024629s
[8, 8192] Total Time: 0.024564s
[16, 4096] Total Time: 0.024349s
[32, 2048] Total Time: 0.024581s
[64, 1024] Total Time: 0.024389s
[128, 512] Total Time: 0.397907s
[256, 256] Total Time: 0.414112s
[512, 128] Total Time: 0.387517s
[1024, 64] Total Time: 0.409437s
[2048, 32] Total Time: 0.458779s
[4096, 16] Total Time: 0.570715s
[8192, 8] Total Time: 0.577919s
[16384, 4] Total Time: 0.576570s
[32768, 2] Total Time: 0.740538s
[65536, 1] Total Time: 0.024476s

Best Config [16, 4096]: 0.024349s

Running Verification Tests ...
------------------------------
Everything OK

Z:\code\Visual Studio Projects\BloodRayne 2\br2cudaPerlin\Debug>br2cudaperlin 200 256

BloodRayne 2 FSAA Patch - CUDA Perlin Benchmark Tool 0.1 Alpha (Textures)
--------------------------------------------------------------

Running Benchmarks ...
----------------------
[1, 65536] Total Time: 0.024669s
[2, 32768] Total Time: 0.024617s
[4, 16384] Total Time: 0.024719s
[8, 8192] Total Time: 0.024570s
[16, 4096] Total Time: 0.024710s
[32, 2048] Total Time: 0.024474s
[64, 1024] Total Time: 0.024700s
[128, 512] Total Time: 0.258591s
[256, 256] Total Time: 0.256918s
[512, 128] Total Time: 0.256123s
[1024, 64] Total Time: 0.255724s
[2048, 32] Total Time: 0.251797s
[4096, 16] Total Time: 0.266790s
[8192, 8] Total Time: 0.305139s
[16384, 4] Total Time: 0.470133s
[32768, 2] Total Time: 0.905707s
[65536, 1] Total Time: 0.024506s

Best Config [32, 2048]: 0.024474s

Running Verification Tests ...
------------------------------
Everything OK
As you can see, the execution times are almost the same with the 3 methods.

Global memory: 77% gld coalesced / 22% instructions. GPU Time: 2213 / Occupancy: 0.25
Constants: 68% warp serialize / 30% instructions. GPU Time: 1657 / Occupancy: 0.75
Textures: 2% gst coalesced / 97% instructions. GPU Time: 1118 / Occupancy: 0.25
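
For reference, the benchmark simply times each <<<numBlocks, blockSize>>> configuration with CUDA events - roughly like this (simplified; the kernel here is just a stand-in, not my actual code):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void perlin_kernel(float *out) { /* stand-in for the real Perlin kernel */ }

void sweep(float *d_out, int totalThreads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int blockSize = 1; blockSize <= 512; blockSize *= 2) {
        int numBlocks = totalThreads / blockSize;

        cudaEventRecord(start, 0);
        perlin_kernel<<<numBlocks, blockSize>>>(d_out);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);                // wait until the kernel has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds
        printf("[%d, %d] Total Time: %fs\n", numBlocks, blockSize, ms / 1000.0f);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}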

I'm really confused.

This code is going to be part of a personal project: http://www.coopdb.com/modules.php?name=BR2fsaa&op=Info

I'd appreciate any advice on optimizing this code.

I'm running a quad-core Xeon 3350 @ 3.6 GHz and an eVGA GTX 285 SSC.

By the way, the code runs 27x faster on the GPU than on the CPU, but I think it could be faster still.

Thank you very much!
21-Feb-2009, 18:28   #2
Rayne

OK, I was totally wrong.

I was running the verification test only at <<<256, 256>>>, and I wasn't doing error checking for the other launch configurations, so I was selecting 'bad' configurations as the best.

So the texture fetch version actually has the best performance.

But now I have a terrible fear. The best result, with <<<1024, 64>>>, is 0.255724s, while my multi-threaded SSE3 Perlin version needs 0.65s on my quad core for the same test. That's only 2.5x faster than my CPU, and I've got one of the fastest GPUs around.
21-Feb-2009, 19:29   #3
Rayne

Quote:
Z:\code\Visual Studio Projects\BloodRayne 2\br2cudaPerlin\Debug>br2cudaperlin 200 256

BloodRayne 2 FSAA Patch - CUDA Perlin Benchmark Tool 0.11 Alpha
---------------------------------------------------------------

Running Benchmarks ...
----------------------
[128, 512] Total Time: 0.263607s
[256, 256] Total Time: 0.256560s
[512, 128] Total Time: 0.256693s
[1024, 64] Total Time: 0.255465s
[2048, 32] Total Time: 0.251548s
[4096, 16] Total Time: 0.267286s
[8192, 8] Total Time: 0.306301s
[16384, 4] Total Time: 0.469901s
[32768, 2] Total Time: 0.908473s

Best Config [2048, 32]: 0.251548s

Running Verification Test at [2048, 32] ...
--------------------------------------------
Everything OK
Now it's OK, but the performance isn't as great as expected.
21-Feb-2009, 20:08   #4
Bob

Numbers look about right. Your best bet is to reduce the number of registers in the Texture path, or reduce the number of dependent texture reads (without seeing your code, I can't tell which is the problem). If you can get it down to 20 registers, that would be best.
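
If the compiler is currently allocating more than that, one way to cap it at build time (assuming nvcc here - not something from this thread) is the --maxrregcount flag, e.g. nvcc --maxrregcount=20 ..., though forcing it too low makes the compiler spill registers to local memory, which is slow.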

Alternatively, can you make your tables total 2 KB for both together? The Constant path should look better that way.

Alternative #3: copy the two tables into shared memory, then read from there. That should give you the best performance, if you can fit the tables in there.

Hope this helps.
22-Feb-2009, 02:36   #5
rpg.314

Yeah, shared memory should be better since you're doing random lookups. Constant memory is optimized for broadcast.
22-Feb-2009, 09:15   #6
entity279

Quote:
Originally Posted by rpg.314
Yeah, shared memory should be better since you're doing random lookups. Constant memory is optimized for broadcast.
If he's doing random look-ups, shouldn't problems arise when multiple threads access the same data?

Also, what amazes me is that there is so little difference between single-block and multi-block kernel launches - wouldn't that mean the algorithm used is not ALU bound at all?
22-Feb-2009, 10:19   #7
Jawed

3DMark06's Perlin Noise shader is 447 ALU instructions and 48 texture instructions - those are the raw D3D assembly statistics. It is most definitely ALU bound on NVidia.

Jawed
22-Feb-2009, 13:30   #8
entity279

Yeah, I knew that as well. That's why I don't know what to think of those results - shouldn't the number of blocks map to the number of TPCs (or SMs, I don't really know which, but it doesn't change much) used by the kernel launch? If 1 is almost just as good as 2, 16, etc., then the ALUs in the extra clusters aren't really needed, right?

That's what I was trying to say.
22-Feb-2009, 13:52   #9
rpg.314

Quote:
Originally Posted by entity279
If he's doing random look-ups, shouldn't problems arise when multiple threads access the same data?

Also, what amazes me is that there is so little difference between single-block and multi-block kernel launches - wouldn't that mean the algorithm used is not ALU bound at all?
He's only reading the data, not writing to it. It's a lookup table, so there would be no problems. Multiple blocks are executed in an embarrassingly parallel manner, so the question of ALU boundedness does not arise.
22-Feb-2009, 13:55   #10
rpg.314

Quote:
Originally Posted by entity279
Yeah, I knew that as well. That's why I don't know what to think of those results - shouldn't the number of blocks map to the number of TPCs (or SMs, I don't really know which, but it doesn't change much) used by the kernel launch? If 1 is almost just as good as 2, 16, etc., then the ALUs in the extra clusters aren't really needed, right?

That's what I was trying to say.
No, multiple blocks may run on one SM simultaneously if register and shared memory usage allow it. Of course, one block can never know whether another is running at the same time. In other words, they run simultaneously but are invisible to each other.
22-Feb-2009, 14:41   #11
Jawed

Using:

http://developer.download.nvidia.com...calculator.xls

with Compute capability set to 1.2 or 1.3, his reported 2048 blocks of 32 threads each (with 21 registers and 64 bytes per thread of shared memory) implies that there are only 8 warps running on each multiprocessor. That will hide some memory latency, but not much. Since all the fetches from memory are random, what's needed is maximum latency hiding.

The "Varying Block Size" graph indicates that 192 and 384 threads per block are optimal. Both result in 24 warps running on each multiprocessor - much healthier for hiding latency.

In summary, it seems the benchmark's block-size steps are too coarse (powers of two only), so it missed 192 and 384 threads per block as the best candidates.

The 3DMark06 shader only uses 8 registers, so perhaps he can get the register allocation of his code down? With 16 registers per thread and a block size of 128, he should theoretically get 32 warps (the maximum) per multiprocessor, thus maximising the latency hiding.
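
Roughly, taking the compute 1.2/1.3 limits (16384 registers, 8 blocks and 32 warps max per multiprocessor) and ignoring allocation granularity:

32 threads x 21 regs = 672 regs/block -> the 8-block cap bites first: 8 blocks x 1 warp = 8 warps (0.25 occupancy)
192 threads x 21 regs = 4032 regs/block -> 16384 / 4032 = 4 blocks x 6 warps = 24 warps (0.75 occupancy)
128 threads x 16 regs = 2048 regs/block -> 16384 / 2048 = 8 blocks x 4 warps = 32 warps (1.0 occupancy)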

According to this:

http://parlab.eecs.berkeley.edu/pubs...nchmarking.pdf

each cluster in the 8800 GTX has 5 KB of cache. So if his lookup tables could be squeezed down to that size or smaller in total, he could avoid the worst memory latency entirely. I'm not sure what this size is in GT200 GPUs such as the GTX 285.

Jawed
22-Feb-2009, 14:48   #12
Jawed

Whoops, forgot that D3D registers are vec4 whereas CUDA registers are scalar.

Jawed
22-Feb-2009, 17:34   #13
entity279

Quote:
Originally Posted by rpg.314
No, multiple blocks may run on one SM simultaneously if register and shared memory usage allow it. Of course, one block can never know whether another is running at the same time. In other words, they run simultaneously but are invisible to each other.
Hmm... thanks, I don't know where I got the idea that only a single block runs on an SM.
22-Feb-2009, 18:13   #14
MfA

Quote:
Originally Posted by rpg.314
He's only reading the data, not writing to it. It's a lookup table, so there would be no problems.
There will still be bank conflicts regardless ...

My probability theory muscles are poorly exercised, so could someone else give the average number of duplicates when you roll a 16-sided die 16 times? That's the number of extra cycles this access would take on average.
22-Feb-2009, 21:05   #15
Rayne

The occupancy in the best case is 0.25, so I guess the main problem is memory latency.

In my SSE3 version for the CPU, I used a lot of prefetching to hide it.

I have some questions.

I haven't tried shared memory yet, but I did read that it should be as fast as the registers.

Is it possible to 'preload' the tables into shared memory before launching the kernel? My tables are read-only.

If not, what if I load the 2 tables into constant memory and then copy them to shared memory? I think that could be a disaster.

I have a 2nd version of the algorithm that lowers the memory access count a little (from 30 to 28), but the tables are bigger, so I guess that won't be a good idea.

Thank you very much to everybody.
22-Feb-2009, 22:13   #16
Tim Murray

"The occupancy in the best case is 0.25, so, i guess that the main problem is the memory latency."

not necessarily true; depends entirely on how many loads you are hiding. if you're preloading 50 registers/shmem variables, it's certainly possible to hide memory latency with 0.25 occupancy (especially on Compute 1.2 or higher devices with the higher occupancy limit).

What you should do is look at the bandwidth you're using in your kernel; basically, remove all arithmetic and just do loads and stores. Compute (taking into account things like uncoalesced accesses) what percentage of theoretical memory bandwidth you're getting. If it's near peak, then the obvious answer is to fix coalescing.
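
Something like this, stripped down (the hashing, table size and thread count are placeholders, not the real kernel):

__global__ void loads_only(const float *table, float *out, int tableMask)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    int idx = i;
    for (int k = 0; k < 30; ++k) {
        idx = (idx * 1103515245 + 12345) & tableMask;  // address generation only; the Perlin math itself is removed
        acc += table[idx];                             // one load per iteration
    }
    out[i] = acc;                                      // single store per thread
}

Effective bandwidth is then (threads * (30 reads + 1 write) * sizeof(float)) / elapsed time, which you compare against the card's theoretical peak.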
23-Feb-2009, 05:19   #17
rpg.314

Quote:
Originally Posted by MfA
There will still be bank conflicts regardless ...

My probability theory muscles are poorly exercised, so could someone else give the average number of duplicates when you roll a 16-sided die 16 times? That's the number of extra cycles this access would take on average.
Yup, but constant memory only runs at full speed when all threads read from a single address. Shared memory allows reads from 16 different banks in a single clock at full speed, so shared should still be faster than constant.
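
(With random indices, collisions are common, by the way: the expected number of distinct banks hit by 16 independent uniform accesses is 16 * (1 - (15/16)^16) ≈ 10.3, so on average roughly 5-6 of the 16 accesses land on an already-used bank. The actual serialization cost is set by the most heavily hit bank, so treat that only as a rough ballpark for MfA's question.)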
23-Feb-2009, 09:33   #18
T.B.

Quote:
Originally Posted by Rayne
Is it possible to 'preload' the tables into shared memory before launching the kernel? My tables are read-only.
No, the contents of shared memory are undefined when your kernel starts. I guess some screen-update jobs may run in between.

Quote:
If not, what if I load the 2 tables into constant memory and then copy them to shared memory? I think that could be a disaster.
Well, at least it won't help. What you want to do is load the data into shared memory at the start of the kernel. As your entire block will collaborate on that, you want nice coalesced reads from global memory. If you use constant memory, you won't be using broadcasts, so the accesses get serialized. The actual loads will probably still be coalesced, as are the updates to the constant cache (Tim, throw rocks if this is wrong), so you may not even feel it.

The best thing to do is simply coalesced copies from global memory. Nothing fancy required.
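
Something like this at the top of the kernel (table size and names are made up, but the pattern is the point):

#define TABLE_SIZE 2048   // 8 KB of floats, illustrative

__global__ void perlin_shared(const float *g_table, float *out)
{
    __shared__ float s_table[TABLE_SIZE];

    // Cooperative, coalesced copy: consecutive threads read consecutive words.
    for (int j = threadIdx.x; j < TABLE_SIZE; j += blockDim.x)
        s_table[j] = g_table[j];
    __syncthreads();             // make sure the whole table is loaded before anyone reads it

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = (i * 1103515245 + 12345) & (TABLE_SIZE - 1);   // stand-in for the Perlin hashing
    out[i] = s_table[idx];       // the real kernel would do its 30 random lookups here
}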
26-Feb-2009, 21:24   #19
Rayne

I have a new version.

It tests 2 methods: texture fetching (TF) and shared memory (SM).

Quote:
BloodRayne 2 FSAA Patch - CUDA Perlin Benchmark Tool 0.15 Alpha
---------------------------------------------------------------

Running Benchmarks ...
----------------------
TF [128, 512] Total Time: 0.192191s
SM [128, 512] Total Time: 0.107221s
TF [256, 256] Total Time: 0.191223s
SM [256, 256] Total Time: 0.103024s
TF [512, 128] Total Time: 0.190813s
SM [512, 128] Total Time: 0.126796s
TF [1024, 64] Total Time: 0.189704s
SM [1024, 64] Total Time: 0.189470s
TF [2048, 32] Total Time: 0.189634s
SM [2048, 32] Total Time: 0.390741s
TF [4096, 16] Total Time: 0.198291s
SM [4096, 16] Total Time: 0.942939s
TF [8192, 8] Total Time: 0.255677s
SM [8192, 8] Total Time: 2.238887s
TF [16384, 4] Total Time: 0.435167s
SM [16384, 4] Total Time: 6.500768s
TF [32768, 2] Total Time: 0.856746s
SM [32768, 2] Total Time: 22.913676s

Best Config (Shared Memory) [256, 256]: 0.103024s

Running Verification Test at (Shared Memory) [256, 256] ...
------------------------------------------------------------
Everything OK

BloodRayne 2 FSAA Patch - CUDA Perlin Benchmark Tool 0.15 Alpha
---------------------------------------------------------------

Running Benchmarks ...
----------------------
TF [512, 512] Total Time: 0.697142s
SM [512, 512] Total Time: 0.370065s
TF [1024, 256] Total Time: 0.692771s
SM [1024, 256] Total Time: 0.374493s
TF [2048, 128] Total Time: 0.690623s
SM [2048, 128] Total Time: 0.464357s
TF [4096, 64] Total Time: 0.688960s
SM [4096, 64] Total Time: 0.712639s
TF [8192, 32] Total Time: 0.690871s
SM [8192, 32] Total Time: 1.504626s
TF [16384, 16] Total Time: 0.702908s
SM [16384, 16] Total Time: 3.673776s
TF [32768, 8] Total Time: 0.974903s
SM [32768, 8] Total Time: 8.863379s

Best Config (Shared Memory) [512, 512]: 0.370065s

Running Verification Test at (Shared Memory) [512, 512] ...
------------------------------------------------------------
Everything OK

BloodRayne 2 FSAA Patch - CUDA Perlin Benchmark Tool 0.15 Alpha
---------------------------------------------------------------

Running Benchmarks ...
----------------------
TF [2048, 512] Total Time: 2.361321s
SM [2048, 512] Total Time: 1.282897s
TF [4096, 256] Total Time: 2.369100s
SM [4096, 256] Total Time: 1.336673s
TF [8192, 128] Total Time: 2.361914s
SM [8192, 128] Total Time: 1.677807s
TF [16384, 64] Total Time: 2.360700s
SM [16384, 64] Total Time: 2.639611s
TF [32768, 32] Total Time: 2.360139s
SM [32768, 32] Total Time: 5.692210s

Best Config (Shared Memory) [2048, 512]: 1.282897s

Running Verification Test at (Shared Memory) [2048, 512] ...
------------------------------------------------------------
Everything OK


It's 6.5x faster than the CPU. It will be hard to make it faster.

You can leech it here: http://www.speedyshare.com/455357158.html

I have problems running it on my old G80. If somebody can try it, I would like to know whether it works on other cards.
27-Feb-2009, 19:14   #20
CarstenS

Are those your results with a G80?
27-Feb-2009, 23:24   #21
Rayne
Default

No, I ran the test on my GTX 285.

I can't make it work on my old 8800 GTX. I get a kernel invocation error when I launch the kernel.
28-Feb-2009, 09:16   #22
CarstenS

Ah - OK. I was running it on my under-clocked GTX 280 and got lower scores than you. Since I thought you used a G80, I was quite surprised.

Thanks for the clarification!
28-Feb-2009, 13:27   #23
Rayne

It's one of those overclocked cards: an eVGA GTX 285 SSC (702 / 1584 / 2646). I can run my favorite games at (750 / 1584 / 2800), but if I run the Perlin benchmark at those clocks, my system hangs within a few seconds. I guess the benchmark could be used as a stability tester too.
04-Mar-2009, 20:35   #24
Rayne

A new "beta" version:
http://www.speedyshare.com/366621969.html

It will benchmark your CPU vs your GPU.

It supports multi-GPU rigs too.

You can specify the number of GPUs to use from the command line. According to NVIDIA's documentation, you need to disable SLI to use multiple GPUs with CUDA.

Examples:

br2perlin 1 5 -> This will use just 1 GPU
br2perlin 2 5 -> This will use 2 GPUs
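
(For what it's worth, the usual CUDA 2.x pattern for this is one host thread per GPU, each calling cudaSetDevice before doing any CUDA work. A rough sketch, not the actual library code - and on Windows you'd use a Win32 thread rather than pthreads:)

#include <pthread.h>
#include <cuda_runtime.h>

__global__ void perlin_kernel(float *out) { /* stand-in for the real Perlin kernel */ }

struct Job { int device; int count; };

static void *worker(void *arg)
{
    Job *job = (Job *)arg;
    cudaSetDevice(job->device);                        // bind this host thread to one GPU
    float *d_out;
    cudaMalloc((void **)&d_out, job->count * sizeof(float));
    perlin_kernel<<<job->count / 256, 256>>>(d_out);   // each GPU works on its own slice
    cudaThreadSynchronize();
    cudaFree(d_out);
    return 0;
}

int main()
{
    int numGPUs = 0;
    cudaGetDeviceCount(&numGPUs);
    if (numGPUs > 8) numGPUs = 8;

    pthread_t threads[8];
    Job jobs[8];
    for (int d = 0; d < numGPUs; ++d) {
        jobs[d].device = d;
        jobs[d].count  = 65536;
        pthread_create(&threads[d], 0, worker, &jobs[d]);
    }
    for (int d = 0; d < numGPUs; ++d)
        pthread_join(threads[d], 0);
    return 0;
}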

The library also supports mixing the CPU & GPU at the same time. In theory, when I designed it, I thought CPU+GPU would be faster, but due to the asynchronous nature of CUDA it ends up slower than the CPU or GPU alone.

My BR2 patch is using the new CUDA code now, and the Perlin effects run on the GPU.

Unfortunately, if you only have 1 graphics card this is not a good idea, because the framerate drops due to the resources used by the CUDA calculations. But if you have 2 graphics cards, you won't lose any fps, and the Perlin code will run faster on the GPU (bigger & more complex effects).

Basically, I've written this to use my old 8800 GTX to run the Perlin effects, and my GTX 285 to render the shiny graphics at 1920x1200 with 2x SSAA.

The results of my Xeon 3350 @ 3.6 GHz + eVGA GTX 285 SSC:
Quote:
CPU SSE3 4 Threads
Total Time: 0.660127, Min: -0.699944, Max: 0.798931, Range: 1.498875
GPU
Total Time: 0.106165
In my system, the GPU is 6.5x faster than the CPU.
05-Mar-2009, 14:18   #25
Rayne

A quick fix (0.46): http://www.speedyshare.com/633582280.html

It seems like some (budget) cards do not launch the kernel properly. This version adds kernel launch error detection.

On my old 8800 GTX it needs 0.22s, which seems OK compared to the GTX 285.