Maximum parallel execution?

Discussion in 'GPGPU Technology & Programming' started by PeterVDD, Jan 4, 2012.

  1. PeterVDD

    Newcomer

    Joined:
    Jan 4, 2012
    Messages:
    1
    Likes Received:
    0
    Is it normal that my GPU seems to execute only a handful of kernels in parallel (6, even though it has 80 stream processors) in JOCL?

    I am trying to run a program on my ATI GPU. To start with something easy, I modified the first sample on http://www.jocl.org/samples/samples.html into the following little test:
    I execute n kernels, each of which consists of computing 20000000/n independent tanh() evaluations. You can see my code here: http://pastebin.com/DY2pdJzL
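    Roughly, the structure looks like this (a simplified sketch along the lines of the jocl.org sample, not the exact pastebin code; the kernel body and the hypothetical class/variable names are just for illustration):

    import static org.jocl.CL.*;
    import org.jocl.*;

    public class TanhBench
    {
        // Each work-item loops over its share of independent tanh() evaluations.
        private static final String KERNEL_SRC =
            "__kernel void tanhBench(__global float *out, int itersPerItem)\n" +
            "{\n" +
            "    int gid = get_global_id(0);\n" +
            "    float acc = 0.0f;\n" +
            "    for (int i = 0; i < itersPerItem; i++)\n" +
            "        acc += tanh((float)(gid + i) * 1e-6f);\n" +
            "    out[gid] = acc;\n" +
            "}";

        public static void main(String[] args)
        {
            final int n = 1024;                    // number of "kernels" (work-items)
            final int itersPerItem = 20000000 / n; // tanh() evaluations per work-item
            CL.setExceptionsEnabled(true);

            // Boilerplate: first platform, first GPU device
            cl_platform_id[] platforms = new cl_platform_id[1];
            clGetPlatformIDs(1, platforms, null);
            cl_device_id[] devices = new cl_device_id[1];
            clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 1, devices, null);

            cl_context_properties props = new cl_context_properties();
            props.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);
            cl_context context = clCreateContext(props, 1, devices, null, null, null);
            cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, null);

            cl_program program = clCreateProgramWithSource(context, 1, new String[]{ KERNEL_SRC }, null, null);
            clBuildProgram(program, 0, null, null, null, null);
            cl_kernel kernel = clCreateKernel(program, "tanhBench", null);

            cl_mem out = clCreateBuffer(context, CL_MEM_WRITE_ONLY, (long) n * Sizeof.cl_float, null, null);
            clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(out));
            clSetKernelArg(kernel, 1, Sizeof.cl_int, Pointer.to(new int[]{ itersPerItem }));

            long[] globalWorkSize = { n };
            long[] localWorkSize  = { 1 };  // kept from the sample; see the replies below

            long t0 = System.nanoTime();
            clEnqueueNDRangeKernel(queue, kernel, 1, null, globalWorkSize, localWorkSize, 0, null, null);
            clFinish(queue);
            System.out.printf("n=%d: %.1f s%n", n, (System.nanoTime() - t0) * 1e-9);

            clReleaseMemObject(out);
            clReleaseKernel(kernel);
            clReleaseProgram(program);
            clReleaseCommandQueue(queue);
            clReleaseContext(context);
        }
    }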

    However, it seems that only 6 stream processing units are used:

    for n=1 it takes 12.2 seconds
    for n=2 it takes 6.3 seconds
    for n=3 it takes 4.4 seconds
    for n=4 it takes 3.4 seconds
    for n=5 it takes 3.1 seconds
    for n=6 and beyond, it takes 2.7 seconds.

    So beyond n=6 (be it n=8, n=20, n=100, n=1000 or n=100000), there is no further performance increase (always 2.7 seconds), which suggests only 6 of these are computed in parallel, making the GPU slower than the CPU even for highly parallel tasks.

    It is not a matter of overhead, since increasing or decreasing the 20000000 only scales all the execution times by a linear factor, so I assume I'm not calling OpenCL in the right way. However, I just copied the example from http://www.jocl.org/samples/samples.html, so would that mean the example on the JOCL page doesn't work properly? :?
     
  2. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Maybe because your local_work_size = 1? Haven't programmed OpenCL myself, but I assume this setting equals thread group size on CUDA and DirectCompute. Try increasing it to at least 64.
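
    If I'm reading the JOCL API right, that would be the local_work_size array passed to clEnqueueNDRangeKernel, roughly like this (untested sketch with a made-up helper name, since as said I haven't used OpenCL myself):

    // Untested sketch: launch with a work-group (local) size of 64 instead of 1.
    // Assumes queue and kernel already exist, and that globalSize is a multiple of 64.
    static void launchWithGroupSize64(cl_command_queue queue, cl_kernel kernel, long globalSize)
    {
        clEnqueueNDRangeKernel(queue, kernel, 1, null,
                new long[]{ globalSize }, new long[]{ 64 }, 0, null, null);
    }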
     
  3. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Yeah, don't set your group size to 1. Think of "work items" as mapping to SIMD lanes and "work groups" as mapping to "cores". Most likely the ATI GPU you're using has 6 cores, which is why the scaling stops there.

    I think AMD actually recommend group sizes of 256 if possible, since they can run 4 "wavefronts" of 64 elements per "core". This is probably more relevant when you're using shared local memory though.

    Anyways, play around with settings between 64 and 1024 and you should get closer to the scaling results you're expecting, although remember that there's going to be some overhead from running this on the GPU as well.
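
    Something like this would do it (rough sketch with a hypothetical helper, reusing the queue/kernel from the setup above; each candidate size has to divide the global size evenly or the enqueue will fail):

    // Rough sketch: time the same launch with a few different work-group sizes.
    // Assumes queue/kernel are already set up and globalSize is a multiple of 1024.
    static void sweepGroupSizes(cl_command_queue queue, cl_kernel kernel, long globalSize)
    {
        for (long groupSize : new long[]{ 64, 128, 256, 512, 1024 })
        {
            long t0 = System.nanoTime();
            clEnqueueNDRangeKernel(queue, kernel, 1, null,
                    new long[]{ globalSize }, new long[]{ groupSize }, 0, null, null);
            clFinish(queue);
            System.out.printf("group size %4d: %.2f s%n", groupSize, (System.nanoTime() - t0) * 1e-9);
        }
    }

    Note that the larger sizes may exceed the device's maximum work-group size (often 256 on hardware of that class), in which case that particular enqueue will simply fail with an error.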
     
  4. mhouston

    A little of this and that
    Regular

    Joined:
    Oct 7, 2005
    Messages:
    344
    Likes Received:
    38
    Location:
    Cupertino
    On the current batch of GPUs (both AMD and Nvidia), you want to run more than one wavefront/warp per compute unit (core) to get latency hiding. Running larger workgroups than the native execution width (wavefront/warp) doesn't generally help unless you are using local memory for sharing and reducing computation/communication. The reason we often recommend workgroups of 256 is to get 4 wavefronts per core which provides reasonable latency hiding on our designs and is a good default to aim for until you get to more advanced tuning tradeoffs.

    Or if you leave the workgroup size as NULL in OpenCL, the vendor will try to figure it out. We will choose 64 if possible (i.e. if it evenly divides the global dimensions).
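
    In JOCL that just means passing null for the local work size (sketch with a hypothetical helper, same queue/kernel as in the earlier code):

    // Sketch: pass null for local_work_size and let the implementation choose.
    static void launchWithDefaultGroupSize(cl_command_queue queue, cl_kernel kernel, long globalSize)
    {
        clEnqueueNDRangeKernel(queue, kernel, 1, null,
                new long[]{ globalSize }, null, 0, null, null);
    }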
     