CUDA: Global Memory vs Constant vs Texture Fetch Performance

Discussion in 'GPGPU Technology & Programming' started by Rayne, Feb 21, 2009.

  1. Rayne

    Newcomer

    Joined:
    Jun 23, 2007
    Messages:
    91
    Hello,

    I'm writing a Perlin noise generator kernel, and I'm trying to optimize it.

    The kernel uses 2 small tables (8KB total) with precomputed random values (part of the Perlin noise algorithm).

    Each thread makes 30 read accesses to the tables, at essentially random locations.

    At the end, the kernel writes a single float result to global memory.

    I have tried 3 different versions of the kernel, placing the tables in different memory spaces: global memory, constant memory, and textures.
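
    For reference, the three placements boil down to something like this (an illustrative sketch; the names, sizes and index math are placeholders, not the actual kernel):

    Code:
        #define TABLE_SIZE 2048   // 8 KB of floats, as described above

        // 1) Global memory: the tables are plain device pointers.
        __global__ void noiseGlobal(const float *tab, const int *idx, float *out)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            out[i] = tab[idx[i] & (TABLE_SIZE - 1)];
        }

        // 2) Constant memory: cached on chip, but the constant cache
        //    serves one address per cycle, so divergent indices serialize.
        __constant__ float cTab[TABLE_SIZE];
        __global__ void noiseConstant(const int *idx, float *out)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            out[i] = cTab[idx[i] & (TABLE_SIZE - 1)];
        }

        // 3) Texture fetch: bind the table with cudaBindTexture on the
        //    host, then read it through the texture cache.
        texture<float, 1, cudaReadModeElementType> tTab;
        __global__ void noiseTexture(const int *idx, float *out)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            out[i] = tex1Dfetch(tTab, idx[i] & (TABLE_SIZE - 1));
        }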

    The execution time of the 3 methods is almost the same (less than 1% difference).

    I'm using the CUDA Visual profiler, and this is the result:

    [profiler screenshots: Global Memory / Constants / Texture Fetching]


    The benchmark tries all the possible <<<numBlocks, blockSize>>> combinations and selects the best one:

    As you can see, the execution times are almost the same with the 3 methods.

    Global memory: 77% gld coalesced / 22% instructions. GPU Time: 2213 / Occupancy: 0.25
    Constants: 68% warp serialize / 30% instructions. GPU Time: 1657 / Occupancy: 0.75
    Textures: 2% gst coalesced / 97% instructions. GPU Time: 1118 / Occupancy: 0.25

    I'm really confused.

    This code is going to be part of a personal project: http://www.coopdb.com/modules.php?name=BR2fsaa&op=Info

    Please, I need advice on optimizing my code.

    I'm running a quad-core Xeon 3350 @ 3.6 GHz and an eVGA GTX 285 SSC.

    Btw, the code runs 27x faster on the GPU than on the CPU, but I think it could be faster.

    Thank you very much!
     
  2. Rayne

    Newcomer

    Joined:
    Jun 23, 2007
    Messages:
    91
    OK, I was totally wrong.

    I was running the verification test at <<<256, 256>>>, and I was not doing error checking for the rest of the kernel sizes, so I was selecting 'bad' kernel sizes.

    With that fixed, the texture fetch version has the best performance.

    But now I have a terrible fear. The best result with <<<1024, 64>>> is 0.255724s, and my multi-threaded SSE3 Perlin version needs 0.65s on my quad to run the same test. That's only 2.5x faster than my CPU, and I've got one of the fastest GPUs atm :(
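
    (For reference, a minimal error-checking pattern after each launch; a sketch with hypothetical names:)

    Code:
        #include <cstdio>
        #include <cuda_runtime.h>

        // Call right after a kernel launch: catches launch-time errors
        // (bad <<<grid, block>>> configurations) and execution errors.
        static void checkLaunch(const char *name)
        {
            cudaError_t err = cudaGetLastError();
            if (err == cudaSuccess)
                err = cudaThreadSynchronize();   // CUDA 2.x-era sync call
            if (err != cudaSuccess)
                fprintf(stderr, "%s failed: %s\n", name, cudaGetErrorString(err));
        }

        // usage (hypothetical): noiseKernel<<<numBlocks, blockSize>>>(dOut);
        //                       checkLaunch("noiseKernel");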
     
  3. Rayne

    Newcomer

    Joined:
    Jun 23, 2007
    Messages:
    91
    Now it is OK, but the performance isn't as great as expected.
     
  4. Bob

    Bob
    Regular Subscriber

    Joined:
    Apr 22, 2004
    Messages:
    424
    Numbers look about right. Your best bet is to reduce the number of registers in the Texture path, or reduce the number of dependent texture reads (without seeing your code, I can't tell which is the problem). If you can get it down to 20 registers, that would be best.
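
    (For the register reduction, nvcc's -maxrregcount flag caps the per-thread allocation - e.g. nvcc -maxrregcount=20 kernel.cu - though the compiler may spill the excess to local memory.)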

    Alternatively, can you make your tables total 2 KB for both together? The Constant path should look better that way.

    Alternative #3: Copy the two tables into shared memory, then read from there. That should give you the best performance, if you can fit your tables in there.
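
    Something along these lines for the shared memory version (a sketch assuming the 8 KB of tables fits alongside everything else; names and index math are illustrative):

    Code:
        #define TABLE_FLOATS 2048   // 8 KB total, as stated in the first post

        __global__ void noiseShared(const float *gTab, const int *idx, float *out)
        {
            __shared__ float sTab[TABLE_FLOATS];

            // Cooperative, coalesced copy: consecutive threads read
            // consecutive global addresses, striding by the block size.
            for (int i = threadIdx.x; i < TABLE_FLOATS; i += blockDim.x)
                sTab[i] = gTab[i];
            __syncthreads();   // whole table must be loaded before any lookup

            int t = blockIdx.x * blockDim.x + threadIdx.x;
            out[t] = sTab[idx[t] & (TABLE_FLOATS - 1)];   // random reads now hit shared mem
        }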

    Hope this helps.
     
  5. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Yeah, shared mem should be better as you are doing random lookups. Constant mem is optimized for broadcast.
     
  6. entity279

    Regular

    Joined:
    May 12, 2008
    Messages:
    536
    Location:
    Romania
    If he's doing random look-ups, shouldn't problems arise when multiple threads are accessing the same data?


    Also, what amazes me is that there is so little difference between single-block and multiple-block kernel launches - wouldn't that mean the algorithm used is not ALU bound at all?
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,662
    Location:
    London
    3DMark06's Perlin Noise shader is 447 ALU instructions and 48 texture instructions - those are the raw D3D assembly statistics. It is most definitely ALU bound on NVidia.

    Jawed
     
  8. entity279

    Regular

    Joined:
    May 12, 2008
    Messages:
    536
    Location:
    Romania
    Yeah, I knew that as well. That's why I don't know what to think of those results - shouldn't the number of blocks mean the number of TPCs (or SMs, I don't really know, but it doesn't change much) to use in the kernel launch? If 1 is almost just as good as 2, 16, etc., then the ALUs in the extra clusters aren't really needed, right?

    That's what I was trying to say.
     
  9. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    He's only reading the data and not writing to it. It is a lookup table, so there would be no problems. Multiple blocks are executed in an embarrassingly parallel manner, so the question of ALU boundedness does not arise.
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    No, multiple blocks may run on one SM simultaneously if registers and shared mem allow it. Of course, one block will never be able to know if the other is running too. IOW, they run simultaneously but are invisible to each other.
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,662
    Location:
    London
    Using:

    http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls

    with Compute capability set to 1.2 or 1.3, his reported 2048 blocks each of 32 threads (with 21 registers and 64 bytes per thread in shared memory) implies that there are only 8 warps running on each multiprocessor - the hardware allows at most 8 resident blocks per multiprocessor, and a 32-thread block is a single warp. That will hide some memory latency, but not much. Since all the fetches from memory are random, what's needed is maximum latency hiding.

    The "Varying Block Size" graph indicates that 192 and 384 threads per block are optimal. These both result in there being 24 warps running on each multiprocessor. Much much healthier for hiding latency :grin:

    In summary, it seems the benchmark he chose to run steps block sizes too coarsely, so it missed 192 and 384 threads per block as the best candidates.

    The 3DMark06 shader only uses 8 registers, so perhaps he can get the register allocation of his code down? With 16 registers per thread and a block size of 128 he should theoretically get 32 warps (the maximum) per multiprocessor, thus maximising the latency hiding.
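
    Spelling that arithmetic out (using the CC 1.2/1.3 limits of 16384 registers and 32 warps per multiprocessor): 16 registers x 128 threads = 2048 registers per block; 16384 / 2048 = 8 blocks per multiprocessor; 8 blocks x 4 warps each = 32 warps.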

    According to this:

    http://parlab.eecs.berkeley.edu/pubs/volkov-benchmarking.pdf

    each cluster in 8800GTX has 5KB of cache. So if his lookup tables could be squeezed down to that or smaller in total he could avoid the worst memory latency entirely. Not sure what this size is in GT200 GPUs such as GTX285.

    Jawed
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,662
    Location:
    London
    Whoops, forgot that D3D registers are vec4 whereas CUDA registers are scalar.

    Jawed
     
  13. entity279

    Regular

    Joined:
    May 12, 2008
    Messages:
    536
    Location:
    Romania
    Hmm... thanks, I don't know where I got the idea that a single block runs on an SM.
     
  14. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    5,857
    There will still be bank conflicts regardless ...

    My probability theory muscles are poorly exercised, so could someone else give the average number of duplicates when you roll a 16-sided die 16 times? That's the number of extra cycles this access would take on average.
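
    (Working that out, for reference: each of the 16 values is missed by all 16 rolls with probability (15/16)^16 ≈ 0.356, so the expected number of distinct values is 16 × (1 − 0.356) ≈ 10.3, leaving 16 − 10.3 ≈ 5.7 duplicates on average. Strictly, shared memory serializes by the most-hit bank, so this is only a rough proxy for the extra cycles.)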
     
  15. Rayne

    Newcomer

    Joined:
    Jun 23, 2007
    Messages:
    91
    The occupancy in the best case is 0.25, so I guess the main problem is memory latency.

    In my SSE3 version for the CPU, I used a lot of prefetching to hide it.

    I have some questions.

    I have not tried shared memory yet, but I did read that it should be as fast as registers.

    Is it possible to 'preload' the tables into shared memory before launching the kernel? My tables are read-only.

    If not, what if I load the 2 tables into constant memory, and then copy them to shared memory? I think this could be a disaster.

    I have a 2nd version of the algorithm that slightly lowers the memory access count (from 30 to 28), but the tables are bigger, so I guess that won't be a good idea.

    And thank you very much to everybody :)
     
  16. Tim Murray

    Tim Murray the Windom Earle of mobile SOCs
    Veteran

    Joined:
    May 25, 2003
    Messages:
    3,278
    Location:
    Mountain View, CA
    "The occupancy in the best case is 0.25, so, i guess that the main problem is the memory latency."

    not necessarily true; depends entirely on how many loads you are hiding. if you're preloading 50 registers/shmem variables, it's certainly possible to hide memory latency with 0.25 occupancy (especially on Compute 1.2 or higher devices with the higher occupancy limit).

    What you should do is look at the bandwidth you're using in your kernel; basically, remove all arithmetic and just do loads and stores. Compute (taking into account things like uncoalesced accesses) what percentage of theoretical memory bandwidth you're getting. If it's near peak, then the obvious answer is to fix coalescing.
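
    A stripped-down kernel along those lines might look like this (a sketch with made-up names; the hash merely stands in for the kernel's random table indices):

    Code:
        // Memory-only variant: same loads/stores as the real kernel, no
        // noise arithmetic. Time it, then compare bytes moved per second
        // against the GTX 285's theoretical peak (~159 GB/s).
        __global__ void memOnly(const float *tab, float *out, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            unsigned h = i * 2654435761u;            // per-thread hash seed
            float acc = 0.0f;
            for (int k = 0; k < 30; ++k) {           // 30 reads, like the real kernel
                h = h * 1664525u + 1013904223u;      // LCG step -> pseudo-random index
                acc += tab[h & 2047];                // index into the 2048-float table
            }
            out[i] = acc;                            // one coalesced store per thread
        }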
     
  17. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Yup, but constant mem only runs at full speed when all threads read from a single address. Shared allows 16 different bank reads in a single clock at full speed. So shared should still be faster than constant.
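
    In code terms (illustrative only, assuming the 16-bank shared memory of G80/GT200):

    Code:
        __constant__ float cTab[2048];

        __global__ void patterns(const float *gTab, const int *idx, float *out)
        {
            __shared__ float sTab[2048];
            for (int i = threadIdx.x; i < 2048; i += blockDim.x)
                sTab[i] = gTab[i];                   // stage the table first
            __syncthreads();

            float a = cTab[0];                       // broadcast: full speed from constant
            float b = sTab[idx[threadIdx.x] & 2047]; // random: one cycle per half-warp when
                                                     // the 16 indices hit 16 distinct banks
            out[threadIdx.x] = a + b;
        }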
     
  18. T.B.

    Newcomer

    Joined:
    Mar 11, 2008
    Messages:
    156
    No, the state of the shared mem is undefined when your kernel starts. I guess some screen-update jobs may run in between.

    Well, at least it won't help. What you want to do is load the data into shared mem at the start of the kernel. As your entire block will collaborate on that, you want nice coalesced reads from global memory. If you use constant mem, you won't be using broadcasts, so the accesses get serialized. The actual loads will still probably be coalesced, as are the updates to the constant cache (Tim, throw rocks if this is wrong), so you may not even feel it.

    The best thing to do is to simply do coalesced copies from global memory. Nothing fancy required.
     
  19. Rayne

    Newcomer

    Joined:
    Jun 23, 2007
    Messages:
    91
    I have a new version.

    It uses 2 methods: texture fetching / shared memory.

    [benchmark results screenshot]

    It's 6.5x faster than the CPU. It will be hard to make it faster.

    You can leech it here: http://www.speedyshare.com/455357158.html

    I have problems running it on my old G80. If somebody can try it, I would like to know if it works on other cards.
     
  20. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    3,780
    Location:
    Germany
    Is that with the G80 - your results?
     
