GPU Ray-tracing for OpenCL

Discussion in 'Rendering Technology and APIs' started by fellix, Dec 27, 2009.

  1. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,430
    Likes Received:
    309
    Location:
    Varna, Bulgaria
    Grab it here!

    The result is in Ksamples/sec.
    Here's what my precious one is capable of:

    [​IMG]
    Code:
    OpenCL Platform 0: Advanced Micro Devices, Inc.
    OpenCL Device 0: Type = TYPE_GPU
    OpenCL Device 0: Name = Cypress
    OpenCL Device 0: Compute units = 20
    OpenCL Device 0: Max. work group size = 256
    Reading file 'rendering_kernel.cl' (size 2997 bytes)
    OpenCL Device 0: kernel work group size = 256
    
    Radeon HD 5870 @ 900/5000MHz
    Q9450 @ 3608MHz, 0% load! (NV users may get noticeable CPU load for some reason)
     
  2. Anteru

    Newcomer

    Joined:
    Jul 4, 2004
    Messages:
    114
    Likes Received:
    3
    Very nice! With my 8800 GT, I'm getting 430k samples/second :( Moreover, the driver locks up after 50 passes (roughly). This is under Vista x64.

    Code:
    OpenCL Platform 0: NVIDIA Corporation
    OpenCL Device 0: Type = TYPE_GPU
    OpenCL Device 0: Name = GeForce 8800 GT
    OpenCL Device 0: Compute units = 14
    OpenCL Device 0: Max. work group size = 512
    Reading file 'rendering_kernel.cl' (size 2997 bytes)
    OpenCL Device 0: kernel work group size = 192
     
  3. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,401
    Likes Received:
    394
    Location:
    New York
    ~1250 samples/sec on a GTX 285. Maxes out one CPU core. David posts here, maybe he can give us some insight into his approach.
     
  4. Forrest

    Newcomer

    Joined:
    Jul 22, 2008
    Messages:
    39
    Likes Received:
    0
    I get 6560K samples/sec on 5770 at default clocks.
     
  5. gamervivek

    Regular Newcomer

    Joined:
    Sep 13, 2008
    Messages:
    683
    Likes Received:
    206
    Location:
    india
    4500 K samples/sec on a 4850, slows down the GUI to a crawl
     
  6. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    14,367
    Likes Received:
    1,850
    OpenCL Device 0: Compute units = 14

    Where does that number come from?
     
  7. Arnold Beckenbauer

    Veteran

    Joined:
    Oct 11, 2006
    Messages:
    1,372
    Likes Received:
    313
    Location:
    Germany
    2*7 = 14.
    8800GT has 14 8-way-SIMDs (2 per cluster).
     
  8. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,779
    Likes Received:
    443
    Location:
    Torquay, UK
    Nice!

    [​IMG]

    There were some spikes up to 14480K samples/s every 4-6 seconds, so not too bad I think!

    Win7 x64 and I think still SDK Beta on this one.
     
  9. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    14,367
    Likes Received:
    1,850
    Ahh so the 8800gt has 112 shaders so it takes 14 of them to make a compute unit, is that right?
    and the hd 5870 has 1600 shaders, but it takes 80 of them to make a compute unit ?
     
  10. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,430
    Likes Received:
    309
    Location:
    Varna, Bulgaria
    More like 16, for a fair comparison, putting aside the VLIW nature of the arch.
     
  11. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    14,367
    Likes Received:
    1,850
    But then, if it only takes 16 to make a compute unit,
    why does the app report only 20 compute units and not 100?
    1600 shaders = 20 units = 80 shaders per unit, yes?
     
  12. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,430
    Likes Received:
    309
    Location:
    Varna, Bulgaria
    It's an ALU, not "shader", stream processor or any other marketing label with narrow meaning. ;)

    80 ALUs per multiprocessor or 16-wide SIMD unit -- pick one. ;)
     
  13. Arnold Beckenbauer

    Veteran

    Joined:
    Oct 11, 2006
    Messages:
    1,372
    Likes Received:
    313
    Location:
    Germany
    8800GT has (8*2)*7= 112 SPs.

    My HD4850 has 10 Compute Units (10 16-way-SIMDs). The HD5870 has 20 Compute Units.
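The arithmetic behind these numbers can be checked with a few lines of Python. This is only a sketch: the dictionary layout and the function name are made up for illustration. The compute unit counts and SIMD widths are the ones quoted in this thread, and the 5 ALUs per lane on the Radeons is an assumption that follows from "80 ALUs per 16-wide SIMD" (80 / 16 = 5, the VLIW width).

```python
# Compute units as reported by OpenCL, and ALUs per compute unit,
# using the figures quoted in this thread.
# "alus_per_lane" = 5 on ATI is inferred from 80 ALUs / 16 lanes (VLIW).
GPUS = {
    "GeForce 8800 GT": {"compute_units": 14, "simd_width": 8, "alus_per_lane": 1},
    "Radeon HD 4850": {"compute_units": 10, "simd_width": 16, "alus_per_lane": 5},
    "Radeon HD 5870": {"compute_units": 20, "simd_width": 16, "alus_per_lane": 5},
}


def total_alus(gpu):
    """Total ALU count = compute units * SIMD lanes * ALUs per lane."""
    return gpu["compute_units"] * gpu["simd_width"] * gpu["alus_per_lane"]


for name, gpu in GPUS.items():
    per_unit = gpu["simd_width"] * gpu["alus_per_lane"]
    print(f"{name}: {gpu['compute_units']} units x {per_unit} ALUs = {total_alus(gpu)}")
```

So the "compute units" figure printed by the app counts SIMDs/multiprocessors, not individual ALUs, which is why the HD 5870 reports 20 rather than 100 or 1600.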
     
  14. stevem

    Regular

    Joined:
    Feb 11, 2002
    Messages:
    632
    Likes Received:
    3
    I'm getting a glitchy ~600 samples/sec with a GTX 280, latest drivers etc... Hmmm...
     
  15. Florin

    Florin Merrily dodgy
    Veteran

    Joined:
    Aug 27, 2003
    Messages:
    1,601
    Likes Received:
    133
    Location:
    The colonies
    [​IMG]

    It seems to max out one CPU core
     
  16. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    Hi, I'm David, the author of SmallptGPU. I think I can clarify a few points:

    - About the poor performance on NVIDIA: I developed SmallptGPU on an ATI HD4870. Both the ATI and NVIDIA OpenCL drivers are at an early stage of development and have their fair share of problems and bugs. I have avoided the problematic paths on ATI because that is my card, but I have never tested SmallptGPU on NVIDIA. I assume I'm doing something that the NVIDIA OpenCL driver doesn't like at all. The high CPU usage is a good hint of this problem.

    - The sources are available on the web site, so if anyone has a fix for NVIDIA cards, I will be happy to apply it.

    - SmallptGPU uses the first GPU device available (there is a command line option to run on a CPU device). Almost all the load should be on the GPU; the CPU is nearly unused. It is not able to use multiple devices at the same time, so any SLI/CrossFire configuration will be used at only 50% of its capabilities.

    - The 5870 is horribly fast ... I'm trying not to buy one :wink:
     
  17. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,430
    Likes Received:
    309
    Location:
    Varna, Bulgaria
    Can you describe the command line switch syntax?
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    You have to provide a complete argument list for the command line, e.g.:

    smallptgpu 0 1 rendering_kernel.cl 640 480 scenes\cornell_large.scn

    You can create new arrangements of spheres:

    Code:
    camera 20 100 300 0 25 0
    size 7
    sphere 1000 0 -1000 0 0 0 0 0.75 0.75 0.75 0
    sphere 10 35 15 0 0 0 0 0.9 0 0 2
    sphere 15 -35 20 0 0 0 0 0 0.9 0 2
    sphere 20 0 25 -35 0 0 0 0 0 0.9 2
    sphere 4 35 15 0 15 15 15 0 0 0 0
    sphere 8 -35 20 0 15 15 15 0 0 0 0
    sphere 8 0 25 -35 100 100 100 0 0 0 0
    e.g. saved as file caustic7.scn. That has 3 light sources (the final 3 spheres), one inside each "caustic" sphere. The one inside the blue sphere is super-bright.
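For anyone who wants to generate or sanity-check scene files, here is a rough Python sketch of a reader for the format above. The field meanings are my reading of the example, not taken from the SmallptGPU sources: I am assuming each sphere line is radius, position (x y z), emission (r g b), colour (r g b) and a material code, that "size" is the sphere count, and that "camera" is an origin followed by a target. The names Sphere and parse_scene are hypothetical.

```python
from dataclasses import dataclass

# The caustic7.scn example from the post above.
SCENE = """\
camera 20 100 300 0 25 0
size 7
sphere 1000 0 -1000 0 0 0 0 0.75 0.75 0.75 0
sphere 10 35 15 0 0 0 0 0.9 0 0 2
sphere 15 -35 20 0 0 0 0 0 0.9 0 2
sphere 20 0 25 -35 0 0 0 0 0 0.9 2
sphere 4 35 15 0 15 15 15 0 0 0 0
sphere 8 -35 20 0 15 15 15 0 0 0 0
sphere 8 0 25 -35 100 100 100 0 0 0 0
"""


@dataclass
class Sphere:
    radius: float
    position: tuple   # assumed: x, y, z
    emission: tuple   # assumed: r, g, b (non-zero means light source)
    color: tuple      # assumed: r, g, b
    material: int     # assumed: material code (last field)


def parse_scene(text):
    """Parse the text of a .scn file into (camera, spheres)."""
    camera, count, spheres = None, None, []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        key, values = fields[0], fields[1:]
        if key == "camera":
            nums = [float(v) for v in values]
            camera = (tuple(nums[0:3]), tuple(nums[3:6]))
        elif key == "size":
            count = int(values[0])
        elif key == "sphere":
            nums = [float(v) for v in values[:10]]
            spheres.append(Sphere(nums[0], tuple(nums[1:4]), tuple(nums[4:7]),
                                  tuple(nums[7:10]), int(values[10])))
    if count is not None and count != len(spheres):
        raise ValueError(f"size says {count} spheres, found {len(spheres)}")
    return camera, spheres


camera, spheres = parse_scene(SCENE)
lights = [s for s in spheres if any(s.emission)]
print(len(spheres), "spheres,", len(lights), "light sources")
```

Under that reading, the three spheres with non-zero emission are the light sources, and each one shares its centre with one of the larger coloured spheres.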

    Jawed
     
    #18 Jawed, Jan 2, 2010
    Last edited by a moderator: Jan 2, 2010
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    I suggest changing the work group size calculation so that it does not use the "maximum". 64 on ATI would be better than 256. 64 is the minimum size on ATI HD5870 or HD4870. Some ATI GPUs (HD43xx HD45xx HD46xx) will work best with a lower size (32 or 16).

    NVidia should be happy with 32 or 64.

    Might be an idea to expose the workgroup size as a command line parameter. Or, have the program try a few different values. Or, just hard code it to 64.
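Something along these lines is what I have in mind, sketched in Python rather than in the host code itself. The function names are hypothetical and nothing here is taken from the SmallptGPU sources. It assumes the host asks for a preferred size (64 by default), never exceeds the kernel work group size the driver reports, and pads the global size up to a multiple of the work group size, as OpenCL 1.0 requires when a local size is given. The kernel would then need to skip the padding work-items.

```python
def pick_work_group_size(kernel_max, preferred=64):
    """Largest power of two that is <= min(preferred, kernel_max)."""
    limit = min(preferred, kernel_max)
    if limit < 1:
        raise ValueError("work group size must be at least 1")
    size = 1
    while size * 2 <= limit:
        size *= 2
    return size


def padded_global_size(work_items, work_group_size):
    """Round the work-item count up to a multiple of the work group size."""
    groups = (work_items + work_group_size - 1) // work_group_size
    return groups * work_group_size


# Kernel work group sizes reported earlier in this thread.
for device, kernel_max in (("Cypress", 256), ("GeForce 8800 GT", 192)):
    wg = pick_work_group_size(kernel_max)
    print(device, "->", wg, "work-items per group,",
          padded_global_size(640 * 480, wg), "work-items in total")
```

Passing a different `preferred` value is where a command line parameter would plug in.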

    The register usage appears to be 49 vec4 registers. This is reasonable on ATI, resulting in 5 hardware threads (wavefronts) per SIMD. On NVidia it is a disaster (equivalent to 196 scalar registers when entered into the occupancy calculator), meaning that only 2 hardware threads (warps) can occupy each multiprocessor. As it happens, 64 is better than 32 for the workgroup size in this scenario.

    I'm not sure how NVidia handles the situation when 512 work-items are requested but the hardware can only issue 64 - I'm not sure if the hardware is spilling registers in this situation, if it is, then that compounds the disaster. Hopefully making this change will improve things dramatically.
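The occupancy figures above can be reproduced with a back-of-the-envelope calculation. This is a sketch under stated assumptions: the register file sizes are not given anywhere in this thread. I am assuming 16384 vec4 registers per SIMD on HD4870/HD5870, 16384 scalar registers per multiprocessor on GT200-class parts and 8192 on the 8800 GT generation. It also ignores every other limit (maximum resident warps, local memory, allocation granularity), so treat it as an upper bound. The function name is hypothetical.

```python
def resident_hw_threads(register_file, regs_per_work_item, hw_thread_width):
    """How many wavefronts/warps fit in the register file of one unit."""
    return register_file // (regs_per_work_item * hw_thread_width)


VEC4_REGS = 49               # register usage quoted above
SCALAR_REGS = VEC4_REGS * 4  # 196, the NVidia equivalent

# ATI: 64 work-items per wavefront, assumed 16384 vec4 registers per SIMD.
ati = resident_hw_threads(16384, VEC4_REGS, 64)

# NVidia GT200 class: 32 work-items per warp, assumed 16384 scalar registers.
gt200 = resident_hw_threads(16384, SCALAR_REGS, 32)

# NVidia 8800 GT generation: assumed 8192 scalar registers.
g92 = resident_hw_threads(8192, SCALAR_REGS, 32)

print("ATI wavefronts per SIMD:", ati)
print("GT200 warps per multiprocessor:", gt200)
print("8800 GT warps per multiprocessor:", g92)
```

With two resident warps, a 64 work-item group fills the multiprocessor exactly, which is why 64 comes out ahead of 32 here.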

    Jawed
     
  20. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,430
    Likes Received:
    309
    Location:
    Varna, Bulgaria
    Well, let's hope Mr. David takes note of this. ;)

    But there's still the weird issue with the CPU being under load on NV hardware.
     
