Maximum parallel execution?

Discussion in 'GPGPU Technology & Programming' started by PeterVDD, Jan 4, 2012.

  1. PeterVDD


    Jan 4, 2012
    Is it normal that my GPU seems to execute only very few (6, while it has 80 stream processors) parallel kernels in JOCL?

    I am trying to run a program on my ATI GPU, and to start with an easy sample I modified the first sample from the JOCL page to run the following little script:
    I execute n kernels, each of which computes 20000000/n independent tanh() evaluations. You can see my code here:

    However, it seems that only 6 stream processing units are used:

    for n=1 it takes 12.2 seconds
    for n=2 it takes 6.3 seconds
    for n=3 it takes 4.4 seconds
    for n=4 it takes 3.4 seconds
    for n=5 it takes 3.1 seconds
    for n=6 and beyond, it takes 2.7 seconds.

    So beyond n=6 (whether n=8, n=20, n=100, n=1000 or n=100000) there is no further speedup (always 2.7 seconds), which suggests only 6 kernels are computed in parallel, making the GPU slower than the CPU even for highly parallel tasks.

    It is not a matter of overhead, since increasing or decreasing the 20000000 only scales all the execution times by a linear factor. So I assume I'm not calling OpenCL in the right way. However, I just copied the example from the JOCL page, so would that mean the example there doesn't work properly? :?
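A saturation at n = 6 like this is consistent with each work-group (here of size 1) occupying a whole compute unit, so concurrency is capped by the number of units. A toy model of that cap, assuming a hypothetical count of 6 compute units (inferred from the plateau, not queried from the device):

```java
// Toy model: with local_work_size = 1, each enqueued kernel of one work-item
// occupies an entire compute unit, so at most COMPUTE_UNITS of them run at once.
public class PlateauModel {
    // Hypothetical compute-unit count, guessed from the n = 6 plateau.
    static final int COMPUTE_UNITS = 6;

    // Effective parallelism when n single-work-item kernels are enqueued.
    static int effectiveParallelism(int n) {
        return Math.min(n, COMPUTE_UNITS);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 6, 100, 100000}) {
            System.out.println("n=" + n + " -> parallelism " + effectiveParallelism(n));
        }
    }
}
```

Under this model the runtime stops improving once n reaches the unit count, matching the flat 2.7 s figures above.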
  2. sebbbi


    Nov 14, 2007
    Helsinki, Finland
    Maybe because your local_work_size = 1? I haven't programmed OpenCL myself, but I assume this setting equals the thread group size in CUDA and DirectCompute. Try increasing it to at least 64.
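One practical note when raising local_work_size: in OpenCL 1.x the global work size must be an exact multiple of the local work size, so the global size usually has to be padded up and the excess work-items masked off in the kernel (e.g. with an `if (get_global_id(0) < realSize)` guard). A small helper for the padding; the method name is illustrative, not from the original sample:

```java
public class WorkSizeUtil {
    // Round globalSize up to the next multiple of localSize, as OpenCL 1.x
    // requires when an explicit local work size is passed to the enqueue call.
    static long roundUpToMultiple(long localSize, long globalSize) {
        long remainder = globalSize % localSize;
        return remainder == 0 ? globalSize : globalSize + localSize - remainder;
    }

    public static void main(String[] args) {
        // 20000000 happens to divide evenly by 64 (312500 groups), so no padding.
        System.out.println(roundUpToMultiple(64, 20000000));
        // An odd size gets padded to the next multiple.
        System.out.println(roundUpToMultiple(64, 100));
    }
}
```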
  3. Andrew Lauritzen

    Moderator Veteran

    May 21, 2004
    British Columbia, Canada
    Yeah, don't set your group size to 1. Think of "work items" as mapping to SIMD lanes and "work groups" as mapping to "cores". Most likely the ATI GPU you're using has 6 cores, which is why you're seeing the scaling stop there.

    I think AMD actually recommends group sizes of 256 if possible, since they can run 4 "wavefronts" of 64 elements per "core". This is probably more relevant when you're using shared local memory, though.

    Anyway, play around with sizes between 64 and 1024 and you should get closer to the scaling results you're expecting, although remember that there's going to be some overhead to running this on the GPU as well.
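The lane-mapping point above is easy to quantify: with local_work_size = 1, a 64-wide wavefront carries only one active lane, so the SIMD is ~1.6% utilized. A sketch, assuming a 64-lane wavefront width (true of the AMD parts discussed here, but left as a parameter):

```java
public class LaneUtilization {
    static final int WAVEFRONT_WIDTH = 64; // AMD wavefront size assumed

    // Fraction of SIMD lanes doing useful work for a work-group of the
    // given size: the group is packed into ceil(size/64) full wavefronts.
    static double utilization(int localWorkSize) {
        int wavefronts = (localWorkSize + WAVEFRONT_WIDTH - 1) / WAVEFRONT_WIDTH;
        return localWorkSize / (double) (wavefronts * WAVEFRONT_WIDTH);
    }

    public static void main(String[] args) {
        System.out.println(utilization(1));   // one lane of 64 busy
        System.out.println(utilization(64));  // a full wavefront
        System.out.println(utilization(96));  // 1.5 wavefronts' worth of work in 2
    }
}
```

This is why multiples of 64 are the natural choices to try first.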
  4. mhouston

    A little of this and that

    Oct 7, 2005
    On the current batch of GPUs (both AMD and Nvidia), you want to run more than one wavefront/warp per compute unit (core) to get latency hiding. Running larger workgroups than the native execution width (wavefront/warp) doesn't generally help unless you are using local memory for sharing and reducing computation/communication. The reason we often recommend workgroups of 256 is to get 4 wavefronts per core which provides reasonable latency hiding on our designs and is a good default to aim for until you get to more advanced tuning tradeoffs.

    Or, if you leave the workgroup size as NULL in OpenCL, the vendor will try to figure it out. We will choose 64 if possible (i.e. if it evenly divides the global dimensions).
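The "256 gives 4 wavefronts per core" arithmetic above is just ceiling division by the wavefront width. A small check of the numbers mentioned in the thread, with the 64-lane width taken from the posts above:

```java
public class WavefrontCount {
    // Number of wavefronts needed to execute one work-group of the given size
    // on hardware with the given wavefront width (ceiling division).
    static int wavefrontsPerGroup(int groupSize, int wavefrontWidth) {
        return (groupSize + wavefrontWidth - 1) / wavefrontWidth;
    }

    public static void main(String[] args) {
        System.out.println(wavefrontsPerGroup(256, 64)); // the recommended default
        System.out.println(wavefrontsPerGroup(64, 64));  // the bare minimum width
        System.out.println(wavefrontsPerGroup(1, 64));   // still one whole wavefront
    }
}
```

Note the last case: a group of 1 still consumes a full wavefront slot, which is the other half of why tiny groups waste the machine.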
