N-Queen Solver for OpenCL

Discussion in 'GPGPU Technology & Programming' started by pcchen, Jan 10, 2010.

  1. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,494
    Likes Received:
    405
    Location:
    Varna, Bulgaria
    Code:
    nqueen_cl 17
    
    Platform [0]: ATI Stream
    Select platform 0
    Using GPU device
    Using 81920 threads
    17-queen has 95815104 solutions (11977939 unique)
    Time used: 3.82s
    HD5870 @ 900MHz GPU.
     
  2. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    I've been thinking about using a different algorithm for computing unique solutions. I once tried another approach that uses a rotational order instead of the current top-down order, which completely avoids generating redundant solutions. However, it's more compute-intensive, so it was actually slower on the CPU even though it doesn't need a final rotation check step. The idea back then was that the number of solutions is very small compared to the whole problem space, so it makes sense to keep the work in the general search space cheap and spend the extra computation on solutions only, which makes the total amount of computation smaller.

    But thinking about it now, it seems to be a good idea for GPUs, at least for Cypress. GPUs have more computing power than CPUs. In the case of Cypress it's even more obvious: despite its higher theoretical computing power, it's still slower than a GT200, which suggests that a large amount of its computing power is not well utilized.

    Of course, now it's clear that register pressure is also an important issue to consider. So for a new algorithm to be faster, it needs not only to be less branchy but also to use less memory (local memory can cover for some register pressure, as evidenced here, but even that has its limits).
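    For readers who have not seen the program, the "top-down order plus final rotation check" idea can be sketched roughly as below. This is a minimal illustration, not the code from the thread: the names `transform`, `is_canonical`, `search`, `count_queens` and the `MAX_N` limit are all made up for this sketch. It searches row by row with bitmasks and, for each complete solution, checks whether it is the smallest of its 8 rotated and mirrored images, so each symmetry class is counted exactly once.

```c
#define MAX_N 16  /* illustrative limit, not taken from the thread */

/* Map a placement (cols[row] = column) through one of the 8 board symmetries:
   bits 0-1 of sym select a rotation, bit 2 selects an additional mirror. */
static void transform(const int *cols, int n, int sym, int *out) {
    for (int r = 0; r < n; r++) {
        int c = cols[r], rr, cc;
        switch (sym & 3) {
            case 0:  rr = r;         cc = c;         break;  /* identity */
            case 1:  rr = c;         cc = n - 1 - r; break;  /* 90 degrees */
            case 2:  rr = n - 1 - r; cc = n - 1 - c; break;  /* 180 degrees */
            default: rr = n - 1 - c; cc = r;         break;  /* 270 degrees */
        }
        if (sym & 4) cc = n - 1 - cc;                        /* mirror */
        out[rr] = cc;
    }
}

/* The "final check" step: a solution is counted as unique only if none of
   its symmetric images is lexicographically smaller. */
static int is_canonical(const int *cols, int n) {
    int img[MAX_N];
    for (int sym = 1; sym < 8; sym++) {
        transform(cols, n, sym, img);
        for (int r = 0; r < n; r++) {
            if (img[r] < cols[r]) return 0;
            if (img[r] > cols[r]) break;
        }
    }
    return 1;
}

/* Top-down bitmask search: used = occupied columns, d1/d2 = squares in the
   current row attacked along the two diagonal directions. */
static void search(int n, int row, unsigned used, unsigned d1, unsigned d2,
                   int *cols, long *total, long *unique) {
    const unsigned full = (1u << n) - 1u;
    if (row == n) {
        (*total)++;
        if (is_canonical(cols, n)) (*unique)++;
        return;
    }
    unsigned avail = full & ~(used | d1 | d2);
    while (avail) {
        unsigned bit = avail & (0u - avail);   /* lowest free column */
        int c = 0;
        avail ^= bit;
        while ((bit >> c) != 1u) c++;          /* bit position -> column index */
        cols[row] = c;
        search(n, row + 1, used | bit, ((d1 | bit) << 1) & full,
               (d2 | bit) >> 1, cols, total, unique);
    }
}

/* Returns the total number of solutions; *unique_out receives the number of
   solutions that are distinct under rotation and reflection. Needs n <= MAX_N. */
static long count_queens(int n, long *unique_out) {
    int cols[MAX_N];
    long total = 0, unique = 0;
    search(n, 0, 0u, 0u, 0u, cols, &total, &unique);
    if (unique_out) *unique_out = unique;
    return total;
}
```

    Calling `count_queens(8, &u)` returns 92 and sets `u` to 12. The symmetry check runs only once per complete solution, which is why it stays cheap relative to the search itself.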
     
  3. Arnold Beckenbauer

    Veteran

    Joined:
    Oct 11, 2006
    Messages:
    1,416
    Likes Received:
    350
    Location:
    Germany
    Something is wrong.
    (HD4850 700/993, Stream SDK 2.01)
     
  4. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    The new program is optimized for Evergreen architecture. Basically, it uses local memory to relieve register pressure. Since 48x0 does not support OpenCL style local memory, it may have some problems. I think it's still possible to fix that, but that'll have to wait till I get home on the weekend, as my Radeon 4850 is in my home computer. :)
     
  5. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    I don't understand why you are so concerned with register pressure on the ATI cards; they have a much larger register file than CUDA cards. Also, register pressure affects different features on ATI than on Nvidia. It's possible to decrease performance by reducing register pressure.

    Also, as far as ALU utilization, you can just look at the ALUBusy counter.
     
  6. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    Because by eliminating register spill its performance increased many times.
     
  7. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,831
    Likes Received:
    2,121
    Location:
    Germany
    With this version:
    http://forum.beyond3d.com/showpost.php?p=1416167&postcount=60
    I got
    Code:
    nqueen_cl.exe -local -noatomics 17 
    
    Platform [0]: NVIDIA CUDA
    Select platform 0
    Using GPU device
    Using 30720 threads
    17-queen has 95815104 solutions (11977939 unique)
    Time used: 1.14s
    On a GTX 480 (stock)

    Some more stuff (with a board size of 18 this time), best of three runs:
    nqueen 18
    -> 7.74s
    nqueen 18 with local switch
    -> 7.91s
    nqueen 18 with noatomics switch
    -> 10.1s
    nqueen 18 with local & noatomics switches
    -> 8.6s

    Interestingly, GPU-Z was running alongside and I read the memory controller load.
    local: 1-2 %
    noatomics: 45ish % (38-49 %)
    plain: 55ish % (50-57 %)
     
    #67 CarstenS, Apr 6, 2010
    Last edited by a moderator: Apr 6, 2010
  8. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    On the ATI cards? How do you know you were getting register spilling? How many GPRs was the kernel using? How many wavefronts were running in parallel? Like I said, the ATI cards have a much larger register file than Nvidia cards; I'd be surprised if you were actually getting spilling.

    Are you sure the performance increase wasn't from something else, like cache hit performance?
     
  9. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    It's pretty simple, though. Before I used local memory to replace private memory, the program made many more global memory accesses than it actually needs. That can only indicate that the private memory was not in registers. Those global memory accesses all disappeared after replacing private memory with local memory. I expected more or less the same as you, which is why I didn't think that using local memory would help on ATI's GPUs (I did it for NVIDIA's GPUs because NVIDIA's registers can't be indexed, so private arrays can't be in registers). However, I found that using local memory is also very good for ATI's GPUs, even before vectorization.

    ATI does have a larger total number of registers, but it seems that the number of registers available to a single work item (thread) is limited. This can be problematic when you need to vectorize. Since vectorization is rarely required on NVIDIA's GPUs, the register pressure there is actually lower than on ATI's.
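    To make "replacing private memory with local memory" concrete, here is a rough host-side C emulation of the idea. It is only a sketch under assumptions: the strided layout, the names `SLOT`, `count_from_first_col`, `count_all`, and the `GROUP_SIZE` value are hypothetical and not taken from the actual kernel. Each work item keeps its backtracking stack in a slice of one shared buffer (standing in for OpenCL `__local` memory) instead of in a per-work-item private array, and the search is iterative because OpenCL kernels cannot recurse.

```c
#define MAX_N      16  /* illustrative limit */
#define GROUP_SIZE 64  /* hypothetical work-group size */

enum { COLS, D1, D2, AVAIL };  /* four words of state per search level */

/* Word `field` of stack level `level` for work item `lid`. Consecutive work
   items touch consecutive addresses, one common way to lay out per-item data. */
#define SLOT(level, field) stack[(((level) * 4) + (field)) * stride + lid]

/* Count solutions whose row-0 queen sits in first_col, using only the slice
   of `stack` that belongs to work item `lid`. */
static long count_from_first_col(int n, int first_col,
                                 unsigned *stack, int stride, int lid) {
    const unsigned full = (1u << n) - 1u;
    const unsigned first = 1u << first_col;
    long count = 0;
    int level = 1;                       /* row 0 is fixed by first_col */

    if (n == 1) return 1;
    SLOT(1, COLS)  = first;
    SLOT(1, D1)    = (first << 1) & full;
    SLOT(1, D2)    = first >> 1;
    SLOT(1, AVAIL) = full & ~(SLOT(1, COLS) | SLOT(1, D1) | SLOT(1, D2));

    while (level >= 1) {
        unsigned avail = SLOT(level, AVAIL);
        if (avail == 0) { level--; continue; }     /* backtrack */
        unsigned pick = avail & (0u - avail);      /* lowest free column */
        SLOT(level, AVAIL) = avail ^ pick;
        if (level == n - 1) { count++; continue; } /* last row filled */

        /* Push the state for the next row onto this work item's stack. */
        unsigned cols = SLOT(level, COLS) | pick;
        unsigned d1 = ((SLOT(level, D1) | pick) << 1) & full;
        unsigned d2 = (SLOT(level, D2) | pick) >> 1;
        level++;
        SLOT(level, COLS)  = cols;
        SLOT(level, D1)    = d1;
        SLOT(level, D2)    = d2;
        SLOT(level, AVAIL) = full & ~(cols | d1 | d2);
    }
    return count;
}

/* Emulate one work-group sequentially: work item `lid` takes first column
   `lid`, and all items share one buffer. Needs n <= MAX_N. */
static long count_all(int n) {
    static unsigned local_mem[MAX_N * 4 * GROUP_SIZE];
    long total = 0;
    for (int lid = 0; lid < n; lid++)
        total += count_from_first_col(n, lid, local_mem, GROUP_SIZE, lid);
    return total;
}
```

    In a real kernel the work items would run concurrently and `local_mem` would be a `__local` array. The relevant point for this discussion is that the stack is indexed dynamically by `level`, which is the kind of access that may keep a private array out of registers.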
     
  10. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    Thanks for these interesting results :)
    It's very surprising to see that using local memory is a bit slower than not using local memory on Fermi. As your GPU-Z data has shown, using local memory eliminates almost all global memory traffic, which should be good. Although I guess that the L1 cache probably helps quite a lot. It doesn't seem to work for the no-atomics case, though.
     
  11. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    BTW, according to some posts in the ATI OpenCL forum, it looks like arrays are currently stored in global memory (!). You may still get decent performance thanks to the HD5xxx cache, but they can really hurt performance. This may explain why you see a performance improvement from using local memory on ATI too.
     
  12. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    I'm mostly certain this is not true.

    Though I haven't looked at, nor do I have, your old code: were you running this on an RV870? What do you mean by "private memory" (since this does not exist under this terminology)? Also, the compiler does a good job of packing scalar values into the 128-bit registers (meaning it doesn't use a one-to-one mapping of scalar values to registers).
     
  13. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Yes, this is true also, thanks for remembering this. This is probably the cause of the performance increase, not register pressure.
     
  14. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Does the local memory become cache by default (if it's not used)?
     
  15. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    This could be a reason, but it still does not explain why the non-vectorized code does not access global memory (after the local memory trick) while the code vectorized to 4D does. The vectorized variables are not arrays, so they should be able to live in registers. Only after changing the vectorization to 2D were the global memory accesses eliminated.
     
  16. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    Not on ATI's hardware. Fermi has a switchable L1 cache/local memory, but I don't know how it works (it's probably not automatic).
     
  17. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    Yes, I use a Radeon 5850. I think I already said that in my previous post.
    "Private memory" is an OpenCL term. Since the program is written in OpenCL, I think it's appropriate to use OpenCL terms rather than some ambiguous vendor-specific terms.
    Register packing is not an issue here, because it can't explain why 4D vectorization needs to access global memory while 2D doesn't. This is most likely explained by register pressure.
     
  18. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Yes, I know, I was asking about Fermi because I know about the switchable L1 cache/local memory.

    So does it?
     
  19. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    I can't comment on that since I haven't been able to get the 4D vectorization to work with your code and the profiler.
     
  20. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Yes, ok, I see. And Dade is correct that the ATI OpenCL compiler currently uses global memory for this.

    I haven't looked at your latest posted code yet; however, is it possible that with 4D vectorization you are going over the local memory size?

    I will try to run your 4D vec. code and take a look. I honestly still don't believe that register pressure is the issue; it doesn't appear that every other option has been exhausted.
     