N-Queen Solver for OpenCL

Discussion in 'GPGPU Technology & Programming' started by pcchen, Jan 10, 2010.

  1. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    Its design is switchable between 16KB local/48KB cache and 48KB local/16KB cache, so either way it has a certain amount of L1 cache available.
    Since DirectCompute in DX11 requires 32KB shared memory, I guess it will always be in the 48KB local/16KB cache mode when running DX11. When running CUDA, though, it's probably possible to detect automatically which mode to use by checking the local memory size the kernel requires. I don't know whether it does this right now.
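
    The selection rule guessed at above can be sketched as follows. This is only an illustration of the idea, not how any driver is known to behave; the function name pick_mode and the MODES table are hypothetical, and the 12KB figure is the local memory usage quoted for this kernel later in the thread.

```python
KB = 1024

# The two on-chip memory splits described above, as (local memory, L1 cache),
# ordered from most cache to least cache.
MODES = [(16 * KB, 48 * KB), (48 * KB, 16 * KB)]


def pick_mode(required_local_bytes):
    """Return the split with the most cache that still fits the kernel's local memory."""
    for local, cache in MODES:
        if required_local_bytes <= local:
            return local, cache
    raise ValueError("kernel needs more local memory than either mode provides")


# A kernel using 12KB of local memory fits the 16KB local/48KB cache mode.
print(pick_mode(12 * KB))
# A 32KB requirement (the DirectCompute case) forces 48KB local/16KB cache.
print(pick_mode(32 * KB))
```

    Preferring the larger cache whenever the kernel fits is a design choice of this sketch; a real implementation might let the application choose instead.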
     
  2. pcchen

    The 4D vectorized version uses only 64 work items per work group. The 2D version uses 128 work items. The amount of local memory used is the same (actually it's 12KB).

    I just managed to get the OpenCL kernel analyzer to work on my computer by disabling my GTX 285 (it apparently tries to create an OpenCL context with NVIDIA's platform?). The analysis shows that the 4D version does have memory access instructions in the main loop, while the 2D version does not. I don't understand why it should have those memory access instructions, though.

    I uploaded the resulting files in the attachment.
     


  3. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,831
    Likes Received:
    2,121
    Location:
    Germany
    A note of caution: local memory does not slow nqueen down consistently! I'll re-check and run some more board sizes today, but I'm under the impression that it only happens with smaller board sizes. With 19 and 20 it's notably faster. :)

    AFAIBT at GPU Tech Conf, you'd need a driver reload for it to switch. But that's from a time when Nvidia was confident of shipping Fermi at Christmas (2009!), so the info may no longer be valid.
     
  4. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    pcchen,

    Can I get a full working 4D version from you? Thanks in advance. I have your latest 2D version, with both queen1 and queen vectorized, and you posted the 4D queen1 kernel, but I don't have a 4D queen kernel.
     
  5. pcchen

    Unfortunately, there is no full 4D version, as the 4D version of nqueen1 is already too slow for a full version to be useful. A 4D version of the nqueen function would be much worse, because the huge rotation check part would have to be replicated 4 times.

    I posted a working 4D nqueen1 executable in the attachment if you want to test it. It runs 64 work items per work group instead of the 128 work items of the 2D case, so that all resource usage is the same.

    The profiling results look like this:

    4D:
    Code:
    Method, ExecutionOrder, GlobalWorkSize, GroupWorkSize, KernelTime, LocalMem, MemTransferSize, ALU, Fetch, Write, Wavefront, ALUBusy, ALUFetchRatio, ALUPacking, ALUStalledByLDS, LDSBankConflict, FetchUnitBusy, FetchUnitStalled, WriteUnitStalled
    BufHostToDevice, 1, , , , , 48, , , , , , , , , , , , 
    BufHostToDevice, 2, , , , , 10616832, , , , , , , , , , , , 
    BufHostToDevice, 3, , , , , 4, , , , , , , , , , , , 
    nqueen1_vec_Cypress, 4, {17344; 1; 1}, {64; 1; 1}, 828.42863, 12288,, 1196151.55, 43827.74, 87646.48, 271.00, 10.87, 27.29, 84.66, 1.00, 0.00, 1.57, 0.00, 0.00
    BufDeviceToHost, 5, , , , , 589824, , , , , , , , , , , , 
    2D:
    Code:
    Method, ExecutionOrder, GlobalWorkSize, GroupWorkSize, KernelTime, LocalMem, MemTransferSize, ALU, Fetch, Write, Wavefront, ALUBusy, ALUFetchRatio, ALUPacking, ALUStalledByLDS, LDSBankConflict, FetchUnitBusy, FetchUnitStalled, WriteUnitStalled
    BufHostToDevice, 1, , , , , 48, , , , , , , , , , , , 
    BufHostToDevice, 2, , , , , 10616832, , , , , , , , , , , , 
    BufHostToDevice, 3, , , , , 4, , , , , , , , , , , , 
    nqueen1_vec_Cypress, 4, {34688; 1; 1}, {128; 1; 1}, 131.87619, 12288,, 535959.33, 4.00, 2.00, 542.00, 60.61, 133989.83, 81.67, 11.54, 0.00, 0.00, 0.00, 0.00
    BufDeviceToHost, 5, , , , , 589824, , , , , , , , , , , , 
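
    A quick check of the two kernel rows above, as a sketch. The numbers are copied from the profiles; the 64-item wavefront size is the Cypress value, and treating each vector lane as one unit of work is an assumption made here for illustration.

```python
# Figures copied from the nqueen1_vec_Cypress rows of the two profiles above.
profiles = {
    "4D": {"lanes": 4, "global": 17344, "group": 64,
           "kernel_time": 828.42863, "wavefronts": 271},
    "2D": {"lanes": 2, "global": 34688, "group": 128,
           "kernel_time": 131.87619, "wavefronts": 542},
}

WAVEFRONT_SIZE = 64  # work items per wavefront on Cypress

for name, p in profiles.items():
    groups = p["global"] // p["group"]
    wavefronts = p["global"] // WAVEFRONT_SIZE
    # Assumption: each vector lane of each work item handles one unit of work.
    total_lanes = p["global"] * p["lanes"]
    print(name, "work groups:", groups,
          "wavefronts:", wavefronts, "total lanes:", total_lanes)

slowdown = profiles["4D"]["kernel_time"] / profiles["2D"]["kernel_time"]
print(f"4D kernel time / 2D kernel time = {slowdown:.2f}")
```

    Both runs launch 271 work groups and the same total number of lanes, so the 4D run has half as many wavefronts in flight and still takes roughly 6.3 times as long.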
     


  6. ryta1203

    OK, thanks. Like I said before, I couldn't get your 4D vectorized version to profile without it hanging; it simply kept crashing my computer.
     
  7. pcchen

    I tried to make a 128-work-item 4D version of nqueen1, and it crashed my computer too. What's interesting is that the computer had not actually crashed: a music player running in the background kept playing, for example. However, the display was frozen. In most similar cases the display driver would be recovered by the system, but it wasn't in this case. I had to reboot the computer.
     
  8. CarstenS

    Does the vec4 version really do the same amount of work, or did I get lost somewhere in the middle? On a GTX 480 it's actually quite a bit faster than the normal version.
     
  9. pcchen

    It only does nqueen1, i.e. the case with a queen in the corner.
     
  10. CarstenS

    Ah, okay. I read the "1" but didn't know what to make of it. Since the vec4-version accepted the regular board-sizes as well, I was a bit puzzled.
     
  11. ryta1203

    Ah! You probably have VPU Recover turned off, right? I do.
     
  12. pcchen

    To my understanding, VPU Recover is only available under Windows XP. Windows Vista and Windows 7 are supposed to be able to restart a crashed video driver automatically. I have also seen several successful recoveries (I just wait a few seconds and the video driver restarts automatically), but in this case it never restarted.
     
  13. ryta1203

    BTW, the problem with the vec4D is supposedly going to be fixed in the upcoming release. Also, I tried using __local uint4 and then accessing .x, and I got an error. This is also supposed to be fixed in the upcoming SDK release (2.02?), according to Micah.
     
  14. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    You can try changing the TdrDelay. Add a DWORD value called TdrDelay to HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers. I changed mine to 0x3c (60s), for example, so I could be certain that the GPU would stay busy while I debugged the driver while running a 10 minute kernel :)
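
    For anyone who prefers the command line to regedit, the same change can be written as a reg.exe call. This is a sketch only: the helper name tdr_delay_command is hypothetical, the function builds the command without running it, and it assumes an elevated prompt (a reboot is typically needed before the new delay applies).

```python
# Registry key named in the post above.
KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_delay_command(seconds):
    """Build (but do not run) a reg.exe command that sets TdrDelay to `seconds`."""
    if not 0 < seconds <= 0xFFFFFFFF:
        raise ValueError("TdrDelay must fit in a positive DWORD")
    return ["reg", "add", KEY,
            "/v", "TdrDelay",     # value name
            "/t", "REG_DWORD",    # value type
            "/d", str(seconds),   # data, written in decimal
            "/f"]                 # overwrite without prompting


# 0x3c is 60 in decimal, i.e. the 60 second delay mentioned above.
print(" ".join(tdr_delay_command(0x3c)))
```

    On Windows the list can be passed to subprocess.run from an administrator session; it is printed here so the sketch has no side effects.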

    I tried compiling your 4D vectorized kernel with an internal tool and I don't see any memory accesses generated in your main loop. Can you post the app somewhere so I can try the full version with our OpenCL tool chain to make sure everything's working as expected?
     
  15. OpenGL guy

    Next release will be 2.1.
     
  16. pcchen

    This is a great tip, thanks :)

    I posted one here. I checked it with the kernel analyzer: the 2D version uses 17 GPRs while the 4D version uses 31, which seems to be all right. However, the resulting assembly of the 4D version has memory access instructions in the loop, and I don't understand why.
     
  17. pcchen

    I fixed a small bug in the current code so that the Radeon 4850 is able to run the vectorized code. However, it's slower than the old scalar code, because local memory is not available, unfortunately. I also added a new option to disable vectorization.

    I put the source code and executable in the first post. You can also download them here:

    Source
    Executable
     
  18. CarstenS

    When trying to run the CPU device, I get an OpenCL error: -46 (line:261)

    In my system I had an HD 5870 with OpenCL CPU/GPU working in principle; now I've temporarily got a GTX 480. I'm using the following command line: nqueen_cl -clcpu -platform 0 8
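
    For reference, -46 is CL_INVALID_KERNEL_NAME in the Khronos cl.h header. A small lookup like the one below makes such numbers easier to read; the table is deliberately partial, the helper name cl_error_name is hypothetical, and what line 261 of the program actually calls is not known from this thread.

```python
# A partial table of OpenCL status codes, taken from the Khronos cl.h header.
CL_ERRORS = {
    -1: "CL_DEVICE_NOT_FOUND",
    -5: "CL_OUT_OF_RESOURCES",
    -11: "CL_BUILD_PROGRAM_FAILURE",
    -45: "CL_INVALID_PROGRAM_EXECUTABLE",
    -46: "CL_INVALID_KERNEL_NAME",
    -48: "CL_INVALID_KERNEL",
    -54: "CL_INVALID_WORK_GROUP_SIZE",
}


def cl_error_name(code):
    """Translate a numeric OpenCL status code into its symbolic name."""
    return CL_ERRORS.get(code, f"unknown OpenCL error {code}")


print(cl_error_name(-46))
```

    CL_INVALID_KERNEL_NAME is what clCreateKernel returns when the requested kernel name is not found in the built program.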


    edit:
    As was to be expected, on the GPU device, -novec makes no difference in the case of the GTX 480. I am looking forward to seeing whether AMD's driver can extract as much parallelism from the code as an explicit vectorization.
     
    #98 CarstenS, Apr 10, 2010
    Last edited by a moderator: Apr 10, 2010
  19. pcchen

    The CPU OpenCL bug is fixed now :)
    Currently vectorization is not used on NVIDIA's hardware, so -novec has no effect on it. There is currently no way to force vectorization on NVIDIA's hardware, although my previous experiments had shown that it's pretty bad there.

    A problem with the parallelism of this program is that the loop is pretty small, so there is really not much parallelism to be found. The 2D or 4D vectorized version is, in a sense, loop unrolling. In theory, NVIDIA's hardware could also benefit from this, but the register file is smaller, so it's not a good idea.
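
    To show how small such a loop can be, here is the textbook bitmask backtracking count for N-Queens. It is a generic sketch, not the nqueen or nqueen1 kernel from this thread, and it uses recursion for brevity where an OpenCL kernel would need an explicit stack (OpenCL C does not allow recursion).

```python
def count_queens(n):
    """Count N-Queens solutions with a bitmask backtracking search."""
    full = (1 << n) - 1  # one bit per column of the board

    def place(cols, diag_l, diag_r):
        # cols, diag_l, diag_r mark squares attacked in the current row.
        if cols == full:
            return 1  # a queen sits in every column: one complete solution
        total = 0
        free = full & ~(cols | diag_l | diag_r)
        while free:
            bit = free & -free  # lowest free square in this row
            free ^= bit
            total += place(cols | bit,
                           ((diag_l | bit) << 1) & full,
                           (diag_r | bit) >> 1)
        return total

    return place(0, 0, 0)


print(count_queens(8))
```

    The inner loop is a handful of bit operations per placed queen, which is why processing 2 or 4 independent boards per work item behaves like unrolling it.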
     
  20. CarstenS

    Thanks!
    On my CPU (a 3.8 GHz E8500) I'm now getting 109 secs for a board size of 18, which makes the GTX 480 about 14 times faster. If I'm not mistaken, that makes one of the CPU's cores about 6.3 times as fast as one of the so-called "CUDA cores" (per clock). ;)
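
    The per-core, per-clock figure can be reproduced with a few lines of arithmetic. The core counts and the GTX 480 shader clock below are assumptions not stated in the post (2 CPU cores, 480 CUDA cores at the stock 1.401 GHz), and the sketch assumes the OpenCL CPU device kept both cores busy.

```python
cpu_cores, cpu_ghz = 2, 3.8       # E8500: dual core, clock as quoted in the post
gpu_cores, gpu_ghz = 480, 1.401   # assumption: GTX 480 at stock shader clock

cpu_seconds = 109                 # board size 18 on the CPU, from the post
gpu_speedup = 14                  # "about 14 times faster", from the post

# Implied GPU run time for the same board size.
gpu_seconds = cpu_seconds / gpu_speedup

# Throughput per core per GHz, in arbitrary units (CPU total throughput = 1).
cpu_per_core_clock = 1 / (cpu_cores * cpu_ghz)
gpu_per_core_clock = gpu_speedup / (gpu_cores * gpu_ghz)

ratio = cpu_per_core_clock / gpu_per_core_clock
print(f"implied GPU time: {gpu_seconds:.1f} s")
print(f"CPU core vs CUDA core, per clock: {ratio:.1f}x")
```

    With these assumptions the ratio comes out at about 6.3, matching the figure in the post.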
     