GPU Ray-tracing for OpenCL

Discussion in 'Rendering Technology and APIs' started by fellix, Dec 27, 2009.

  1. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    Glad to help. :)
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    NVidia users normally won't have a CPU device available on their systems unless they install the ATI driver (this driver contains both CPU and GPU support for OpenCL, and installs on systems that don't have ATI graphics). I think Vista prevents multiple IHV drivers from being installed, but XP and W7 should be OK.

    Anyway, all the fun is going to be in multiple-GPU. Talonman should be happy then.

    Jawed
     
  3. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    You know me well!!

    Optix Libraries FTW!

    My favorite multi-GPU program is Mandelbulb. Because it uses the OptiX libraries, it does an outstanding job of using all of the GPUs installed in the system. :heart:
    (That's both GPUs operating in SLI mode, and in dedicated PhysX mode.)

    The CPU workload will also make use of multiple cores if available.

    http://forums.nvidia.com/index.php?showtop...50985&st=20
    To Download - Post 29, page 2.

    My GPU workload distribution with the app running: [image]

    I wish more CUDA libraries would just handle that as well as OptiX apparently does.

    End Thread Jack!
     
  4. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    I have read this news on NVIDIA OpenCL forum: http://forums.nvidia.com/index.php?showtopic=153438
    No idea how it works but it looks interesting for CPUs.

    I have read also this post: http://forums.nvidia.com/index.php?showtopic=154710
    It could be a good explanation of why the SmallptGPU CPU usage is so high for NVIDIA users.
     
  5. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    Is there a known way to disable blocking when alternating memory buffers are being used?

    Or is the better question...

    How can we run SmallptGPU without using alternating memory buffers altogether?
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    Not sure I rate their chances against AMD and Intel...

    So it has no meaningful impact on actual rendering performance then. Also, NVidia can change the way this OpenCL event is handled so that it's treated like an event rather than a spin.

    Jawed
     
  7. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    Sorry, I still can't edit. At least my post count is going up, and some day edit will be unlocked!

    Is there any reason to think that the 'RUN_SCENE_SIMPLE_64SIZE' scene would not need to use alternating memory buffers, due to a lot of "nothing to do" rays?
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    NVidia has coded the event triggering "badly", which uses 100% of a core. The same code on ATI doesn't do this. The OpenCL code is entirely sane as far as I can tell, and this is purely down to NVidia.

    The other side of the coin is that NVidia appears to have ~2x higher kernel invocation rate than ATI. This would be an advantage if ray tracing needed thousands of separate invocations per second.

    This invocation rate advantage might be due to the way NVidia is handling events (i.e. the high CPU usage) or it might not. I don't know.

    Actually, to be sure that invocation rate is higher on NVidia we'd need to compare an "empty kernel" on both. NVidia may have higher work-item generation rate or something else (i.e. something in the GPU) that makes "rendering infinity" faster...

    Jawed
     
  9. Tim Murray

    Tim Murray the Windom Earle of mobile SOCs
    Veteran

    Joined:
    May 25, 2003
    Messages:
    3,278
    Likes Received:
    66
    Location:
    Mountain View, CA
    I don't know if blocking sync is implemented for OpenCL on NV hardware. The "fastest" way to wait on GPU event completion is to spinlock on a memory location, but that obviously pegs a CPU core. Blocking sync allows the thread to go to sleep and be awakened by the driver later. There can be significant latency penalties (they can be orders of magnitude higher than a basic spinlock because you get into the vagaries of OS thread scheduling), but this is not a bottleneck for a number of applications.
     
  10. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    I understand the point, but it is going to be a problem for anyone more interested in CPU+GPU bandwidth (i.e. total rendering time in my case) than in GPU latency. I guess they should use some sort of adaptive strategy (i.e. spinlock for small tasks, thread suspend for larger ones).

    Anyway, the ATI beta SDK was showing exactly the same behavior (i.e. high CPU usage during communication with the GPU) and it has since been fixed in the ATI SDK 2.0 final release. I hope NVIDIA will do the same.
     
  11. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,441
    Likes Received:
    338
    You could periodically poll with clGetEventInfo and sleep in between ... that's a bit fugly, but it would prove the theory.

    Busy waits inside the OpenCL driver don't make sense ... if the developer wants a busy wait he can do it in his own program (same as above, without the sleep); what he can't do is implement driver-level code for a real blocking call.
     
  12. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    I get goosebumps just reading this conversation... :)
     
  13. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,430
    Likes Received:
    309
    Location:
    Varna, Bulgaria
    After brief trial and error I've managed to manually edit the scene files, and now I'm able to modify every parameter (colour, luminosity, position and surface type) and add more spheres to the scene. I guess it would be much easier with an editor. :p

    Anyway, my immediate observation was that adding more lights considerably decreases the sampling rate -- four light sources versus one cuts the performance nearly in half, both for GPU and CPU device selection.
     
  14. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    Fellix, SmallptGPU doesn't include any acceleration structure, which means that if you double the number of objects, the rendering time doubles too; it's about the same for light sources.

    For instance, http://davibu.interfree.it/opencl/smallluxgpu/smallluxGPU.html instead has a simple acceleration structure (i.e. a BVH) and is able to handle a lot more objects (262,000 in the case of the Luxball scene and/or 62 light sources in the case of the Loft scene).
     
  15. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,430
    Likes Received:
    309
    Location:
    Varna, Bulgaria
    Thanks, Dade. ;)
     
  16. Tim Murray

    Tim Murray the Windom Earle of mobile SOCs
    Veteran

    Joined:
    May 25, 2003
    Messages:
    3,278
    Likes Received:
    66
    Location:
    Mountain View, CA
    Er, it has no impact on PCIe bandwidth because it's not spinning across PCIe.
     
  17. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    If you guys want to read more about Nvidia performance, this might be worth keeping an eye on.

    It is ongoing performance testing by Steven Robertson (srobertson on Nvidia's forum):

    http://strobe.cc/articles/cuda_atomics/

    I realize the testing is being done in CUDA, but I figure some of the performance lessons might also apply directly to OpenCL.
     
  18. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
  19. MrGaribaldi

    Regular

    Joined:
    Nov 23, 2002
    Messages:
    611
    Likes Received:
    0
    Location:
    In transit
    OpenCL Platform 0: NVIDIA Corporation
    OpenCL Device 0: Type = TYPE_GPU
    OpenCL Device 0: Name = GeForce GTX 260
    OpenCL Device 0: Compute units = 27
    OpenCL Device 0: Max. work group size = 512
    [SELECTED] OpenCL Device 0: Type = TYPE_GPU
    [SELECTED] OpenCL Device 0: Name = GeForce GTX 260
    [SELECTED] OpenCL Device 0: Compute units = 27
    [SELECTED] OpenCL Device 0: Max. work group size = 512
    Reading file 'rendering_kernel.cl' (size 3230 bytes)
    OpenCL Device 0: kernel work group size = 128
    OpenCL Device 0: forced kernel work group size = 128

    Samples/sec 46k (fluctuates a bit up and down, but seems to stay within 46k).

    But it doesn't work for any size over 128:
    Reading file 'rendering_kernel.cl' (size 3230 bytes)
    OpenCL Device 0: kernel work group size = 128
    OpenCL Device 0: forced kernel work group size = 512
    Failed to enqueue OpenCL work: -5

    On Cornell I get about 10500/10600k.

    Ubuntu 9.10 x64 with Core i7-920
     
  20. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    Working here... ;)
     
