GPU Ray-tracing for OpenCL

Discussion in 'Rendering Technology and APIs' started by fellix, Dec 27, 2009.

  1. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,804
    Likes Received:
    475
    Location:
    Torquay, UK
    Changing OCL_CONSTANT_BUFFER to __constant has no effect on performance on the HD5870 with Cat. 8.12Beta.

    Replacing the second bit of code gave me a minimal improvement, from 14834Ks/s to 15000Ks/s.

    Tested on SmallPT 2.0 alpha 2 using only GPU accel. on HD5870 stock clocks.
     
  2. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    859
    Likes Received:
    262
    Hi, I also played around with it on my 5870.

    I was wondering whether the usual alignment and pre-calculation rules from the CPU would still apply. So I tried aligning the structures and padding them with pre-computed values, under the assumption that the cache banks will be filled by the padded sizes anyway, so we can use the padding for free:

    Vec becomes x, y, z, l (so 16 bytes; I didn't convert this to float4 though, it is still float[4])
    Sphere becomes vectors first, then rad, rade2 = rad * rad (so hopefully 64 bytes)
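
    As a rough illustration of the layout described above, here is a minimal C sketch. The field names p, e, c, refl and pad, and all helper functions, are assumptions made for the example and are not taken from the SmallptGPU sources. It also assumes 4-byte float and int. The last function shows where a precomputed rade2 saves a multiply in the usual ray/sphere discriminant.

```c
#include <assert.h>   /* static_assert (C11) */

/* Vec padded from 3 to 4 floats: x, y, z and an unused slot "l". */
typedef struct {
    float x, y, z, l;
} Vec;

/* Padded sphere: vectors first, then the scalars.
 * p, e, c, refl and pad are assumed names, not taken from the post. */
typedef struct {
    Vec   p;      /* centre */
    Vec   e;      /* emission */
    Vec   c;      /* colour */
    float rad;    /* radius */
    float rade2;  /* precomputed rad * rad */
    int   refl;   /* material type */
    int   pad;    /* explicit filler so the struct reaches 64 bytes */
} Sphere;

static_assert(sizeof(Vec) == 16, "Vec should be 16 bytes");
static_assert(sizeof(Sphere) == 64, "Sphere should be 64 bytes");

/* Build a Vec; the padding slot is left at zero. */
Vec vec3(float x, float y, float z)
{
    Vec v = { x, y, z, 0.0f };
    return v;
}

/* Build a sphere and precompute the squared radius once, on the host. */
Sphere make_sphere(Vec centre, float rad)
{
    Sphere s;
    s.p = centre;
    s.e = vec3(0.0f, 0.0f, 0.0f);
    s.c = vec3(1.0f, 1.0f, 1.0f);
    s.rad = rad;
    s.rade2 = rad * rad;
    s.refl = 0;
    s.pad = 0;
    return s;
}

/* Dot product over x, y, z only; the padding slot is ignored. */
float dot3(Vec a, Vec b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Discriminant of the ray/sphere test for a ray with origin o and
 * unit direction d. A negative value means the ray misses. */
float sphere_discriminant(Sphere s, Vec o, Vec d)
{
    Vec op = vec3(s.p.x - o.x, s.p.y - o.y, s.p.z - o.z);
    float b = dot3(op, d);
    return b * b - dot3(op, op) + s.rade2;  /* rade2 replaces rad * rad */
}
```

    The two static_assert lines make the compiler reject the build if the padding does not give the intended 16 and 64 byte sizes, which removes the "hopefully".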

    I saw no way to use l (vector-length) in an efficient way, so that's just empty.
    Interestingly, there was no negative performance impact, which could mean the structures were already fully aligned, or the algorithm is not memory-throughput bound (I added ~15% padding; it is hard to say exactly how much because Vec is inside other structures, but it's less than Vec's +33%). The Sphere-buffer grew from 33k to 45k (on "complex.scn").
    Analysing the situation with Stream Analyzer shows that ALU occupation is 20% (on "complex.scn"), so I couldn't say where it stumbles over its own feet.

    Counter           before change   after change
    ALU                   117619.04      117622.73
    Fetch                  25327.94       25329.79
    Write                      6.00           6.00
    Wavefront              12288.00       12288.00
    ALUBusy                   20.28          18.16
    ALUFetchRatio              4.64           4.64
    ALUPacking                31.61          31.61
    FetchUnitBusy             12.86          11.53
    FetchUnitStalled           0.02           0.03
    WriteUnitStalled           0.00           0.00
    ALUStalledByLDS            0.00           0.00
    LDSBankConflict            0.00           0.00

    The worksize above was 256; here it is with 64:

    Counter           before change   after change
    ALU                   117619.04      117622.73
    Fetch                  25327.94       25329.79
    Write                      6.00           6.00
    Wavefront              12288.00       12288.00
    ALUBusy                   22.69          22.17
    ALUFetchRatio              4.64           4.64
    ALUPacking                31.61          31.61
    FetchUnitBusy             14.53          14.21
    FetchUnitStalled           0.02           0.03
    WriteUnitStalled           0.00           0.00
    ALUStalledByLDS            0.00           0.00
    LDSBankConflict            0.00           0.00

    Fetch does less afterwards, even though it has more data to fetch...
    ALU has less to do, probably only because of rade2 = rad * rad.

    When applied to "complex.scn" speed grows from 542k to 552k (2%). The performance measure is pretty constant because of the number of spheres, so this can really be attributed to the changes, and not to some background task.
    When applied to "cornell.scn" the performance measure is identical, except that it no longer fluctuates wildly at the beginning; it is calmer and asymptotically approaches approximately the same samples/sec as without alignment/padding.

    Well anyway, I know nothing about GPU specifics, and I just had fun flipping the switches in various ways, finally getting into OpenCL. :)=

    Ciao
    Niels
     
  3. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    I'll take your word for it. Never heard of non-texture data getting pushed to the texture cache before.

    Oh wasn't implying that you didn't find it on your own. Just saying that profilers and debuggers (that work properly) are key to finding this sort of unexpected behavior.

    Are there 100 samples per ray? Not following the conversion :oops:
     
  4. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I'm not trying to toot my own horn or anything. Just saying that the profilers, debuggers, and even the compiler sucked, and I basically had to go with trial and error. AMD seems to be getting some flak for not having OpenCL support in the standard Catalyst, but IMO NVidia's is definitely not ready for the public either.

    Uh oh, math fail :razz:

    69,000 kS/sec * 10 r/S = 0.69 Gr/s

    Actually, I made a slight error there because I've been using a depth of 5 and the original SmallptGPU uses a depth of 6. Thus it should be 12 when comparing apples to apples.
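
    For anyone who wants to redo the conversion with their own numbers, here is a small C sketch of the arithmetic above. The function name is made up for illustration, and the rays-per-sample figure is supplied by the caller (10 in the calculation above), not derived by the code.

```c
#include <assert.h>

/* Convert a sample rate in kS/s into a ray rate in Gr/s.
 * rays_per_sample is whatever the renderer traces per sample. */
double gigarays_per_sec(double ksamples_per_sec, double rays_per_sample)
{
    assert(rays_per_sample > 0.0);
    /* kS/s -> S/s, times rays per sample, then rays/s -> Gr/s */
    return ksamples_per_sec * 1e3 * rays_per_sample / 1e9;
}
```

    Calling gigarays_per_sec(69000.0, 10.0) reproduces the 0.69 Gr/s figure, up to floating-point rounding.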
     
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Lol, yeah I realized but didn't feel like editing because the numbers still wouldn't have made sense without your clarification :)

    So it's 10 rays per sample then? Thanks.
     
  6. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    The feature to select the OpenCL platform and individual OpenCL devices was asked for some time ago and it is now available in http://davibu.interfree.it/opencl/smallluxgpu/smallluxgpu-v1.2beta3.tgz via a configuration file (check the render.cfg file for an example).

    Now if someone is brave enough to install NVIDIA/ATI cards and drivers on the same PC :?:

    There are also a few new features:
    - added support for vertex colours interpolation;
    - added support for configuration file;
    - added support for OpenCL platform and devices selection via configuration file;
    - new surface integrator architecture, it is able to generate 2 rays per step.

    The new surface integrator architecture decreases the CPU load required to keep the GPU busy, and this means more spare CPU cycles to render more samples. It is faster:

    [​IMG]

    Seeing 3.7M samples/sec on a scene with 150k triangles and 4 light sources is quite impressive. I wonder when this GPGPU thing will stop surprising me.
     
  7. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,804
    Likes Received:
    475
    Location:
    Torquay, UK
    Thanks Dade!

    Shame I'm away from home for the next couple of days :(
    I would give the new version a spin!
     
  8. ElMoIsEviL

    Newcomer

    Joined:
    Nov 3, 2003
    Messages:
    21
    Likes Received:
    0
    Location:
    Ottawa, Canada
    I have them both installed on my system. What would you like to know? (provided this works)

    PS. I extracted the files yet don't see a render.cfg file in there.
     
  9. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    The render.cfg looks like this:

    Code:
    image.width = 1280
    image.height = 720
    # Use a value > 0 to enable batch mode
    batch.halttime = 0
    scene.file = scenes/simple.scn
    scene.fieldofview = 60
    opencl.latency.mode = 1
    opencl.nativethread.count = 3
    opencl.cpu.use = 0
    opencl.gpu.use = 1
    # Select the OpenCL platform to use (0=first platform available, 1=second, etc.)
    opencl.platform.index = 0
    # This string selects the OpenCL devices to use (e.g. a "0" in the first position
    # disables the first device, a "1" in the second position enables the second).
    #opencl.devices.select = 10
    # Use a value of 0 to enable default value
    opencl.gpu.workgroup.size = 64
    screen.refresh.interval = 50
    path.maxdepth = 6
    
    You can use "opencl.platform.index" to select the platform to use and the string "opencl.devices.select" to enable/disable single devices.
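
    To make the meaning of the "opencl.devices.select" string concrete, here is a minimal C sketch of how such a string could be interpreted. This is not the SmallLuxGPU code; the function name is hypothetical, and treating devices beyond the end of the string as disabled is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if the device at position `index` is switched on by a
 * selection string such as "10" (one character per device).
 * Assumption: a device with no character in the string is disabled. */
bool device_enabled(const char *select, size_t index)
{
    assert(select != NULL);
    if (index >= strlen(select))
        return false;
    return select[index] == '1';
}
```

    With the string "10", device_enabled("10", 0) is true and device_enabled("10", 1) is false, so only the first device of the chosen platform would be used.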

    By normally running SmallLuxGPU, you should have 2 platforms listed: ATI and NVIDIA (if the 2 OpenCL drivers really work and can coexist). In that case you can select which platform to use.
     
  10. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,750
    Likes Received:
    127
    Location:
    Taiwan
    By my own tests, the OpenCL ICD in NVIDIA's latest driver (196.21 or 196.34) still can't work with ATI Stream SDK 2.0. Apparently they use different function names for getting the platform ID (it's called clIcdGetPlatformIDsKHR in nvcuda.dll and clIcdDispatchGetPlatformIDsKHR in atiocl.dll and atiocl64.dll).
     
  11. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    Today I posted the first rendering done with LuxrenderGPU on the Lux forums: http://www.luxrender.net/forum/viewtopic.php?f=13&t=3439

    [​IMG]

    This is quite an important milestone because the OpenCL code starts to move from the "toy" field (i.e. SmallptGPU, SmallLuxGPU) to the "production" field (i.e. Luxrender), even if it is still very much experimental.

    LuxrenderGPU supports some of the nice features of Luxrender Classic out of the box, including network rendering:

    [​IMG]
     
  12. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,489
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    Kewl!

    But the GPU render output on the right is somewhat different -- the blue translucent enclosure is darker!?
     
  13. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,804
    Likes Received:
    475
    Location:
    Torquay, UK
    Same time of rendering, a lot more samples taken for GPU renderer :?:
     
  14. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    Can we download that version?

    I can't find the download link.
     
  15. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    Talonman, it is still too experimental for the "public". Most of the configuration is hard-coded (buffer size, number of threads, etc.) inside the sources, so you would have to modify the code and recompile to run it on your hardware.
     
  16. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    Thanks for the post...

    I will just wait then. The boys at EVGA still have their eye on you. :wink:

    http://www.evga.com/FORUMS/tm.aspx?m=91863&mpage=4
     
  17. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
  18. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
  19. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    An update on the progress made so far: I uploaded a new video about SLG at http://www.vimeo.com/10048897

    It includes 5 different scenes with real-time (or better, interactive) rendering and 3 small animations.

    [​IMG]
    [​IMG]
    [​IMG]
     
  20. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,804
    Likes Received:
    475
    Location:
    Torquay, UK
    Anyone with a GTX480 willing to join the party?
    I'm curious about its performance in this great OpenCL raytracer :eek:
     