GPU Ray-tracing for OpenCL

Discussion in 'Rendering Technology and APIs' started by fellix, Dec 27, 2009.

  1. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    Yes, I am running at: C=621, Shaders=1512, and Memory=1152

    BTW - Where is the 'edit post' button?
     
  2. Florin

    Florin Merrily dodgy
    Veteran

    Joined:
    Aug 27, 2003
    Messages:
    1,601
    Likes Received:
    133
    Location:
    The colonies
    Err, I think that's reserved for members over a certain post count. Kind of a pain, but I guess newbie stealth edits ruined it.
     
  3. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    Thanks.... Drag!!
     
  4. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    Thanks all. Performance on NVIDIA is still quite disappointing; I will take a look at the NVIDIA OpenCL samples to see if I can find an explanation for the poor performance.
     
  5. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    If we knew what the CPU was up to, I think it would help us put our finger on the performance issue.
     
  6. Florin

    Florin Merrily dodgy
    Veteran

    Joined:
    Aug 27, 2003
    Messages:
    1,601
    Likes Received:
    133
    Location:
    The colonies
    I've tried a mild overclock (133 bus speed to 145) on the i7, which gave a single-core top speed of 3.625 GHz, and it didn't seem to affect the score.

    If you run the 6600 at stock speed, do the scores change?
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    What is the CPU load like when running simple.scn? That might be a clue.

    Jawed
     
  8. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    odd...

    Still high!

    [IMG]
     
  9. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
  10. Florin

    Florin Merrily dodgy
    Veteran

    Joined:
    Aug 27, 2003
    Messages:
    1,601
    Likes Received:
    133
    Location:
    The colonies
    12-13% of 4 cores/4 HT. About 91800k samples/sec

    The Win7 scheduler bounces the process around between the 4 real cores. If I force the affinity of smallptgpu.exe to just 1 core, it just maxes that out 100% and the score doesn't change.

    Hmm, when overclocking the CPU to a 145 bus speed (which actually gives a top speed of 3760, as the max turbo multiplier seems to be 26 on this CPU instead of 25 like I thought), the score does seem to go a bit higher, to a mean of 92400k or so. Too small to be significant though, I reckon.
     
  11. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,427
    Likes Received:
    305
    Location:
    Varna, Bulgaria
    That's strange, I'm getting just over 70000K in this scene with the 5870 -- nowhere near Talonman's numbers.

    Looks like some specific workload is hogging NV hardware or the driver in the heavier scene?!
     
  12. Sinistar

    Sinistar I LIVE
    Regular Subscriber

    Joined:
    Aug 11, 2004
    Messages:
    645
    Likes Received:
    60
    Location:
    Indiana
    Could the NV OpenCL be offloading some of the work to the CPU, which would explain why simple on one core of the GTX 295 is faster than an HD 5870?

    Simple on my HD 5870 gets 71000K.
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    The simple scene consists of a lot of "nothing to do" rays, I guess (off to infinity and beyond). This should mean the duration of the kernel is short. The variation with CPU clock seems to suggest that CPU-side stuff is some kind of bottleneck. Finally the lower ATI performance for this scene seems to suggest that kernel launch overhead is higher on ATI.

    A way to check this is simply to point the camera at "black", i.e. press cursor-up-arrow until everything disappears.

    Jawed
     
  14. Sinistar

    Sinistar I LIVE
    Regular Subscriber

    Joined:
    Aug 11, 2004
    Messages:
    645
    Likes Received:
    60
    Location:
    Indiana
    249000K doing that.
     
  15. Florin

    Florin Merrily dodgy
    Veteran

    Joined:
    Aug 27, 2003
    Messages:
    1,601
    Likes Received:
    133
    Location:
    The colonies
    ~444000k with simple in the black.
    ~470000k with bus speed overclock / process affinity set
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    So, that's effectively about 1500 frames per second (passes). If you maximise the application then the sample rate should increase as the CPU time should become less of a bottleneck.

    Jawed
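A rough check of the "about 1500" figure, as a sketch only. It assumes a 640x480 window and one sample per pixel per pass; neither value is stated in the thread, so both are assumptions. The two sample rates are the ones Florin reported for simple.scn pointed at black.

```python
# Assumption: 640x480 window, one sample per pixel per pass.
# Neither value is given in the thread.
WIDTH, HEIGHT = 640, 480


def passes_per_second(samples_per_second, width=WIDTH, height=HEIGHT):
    """Convert a sample rate into full-frame passes per second."""
    return samples_per_second / (width * height)


# Rates reported in the thread for simple.scn "in the black".
reported = [
    ("stock", 444_000_000),
    ("bus overclock + affinity", 470_000_000),
]

for label, rate in reported:
    print(f"{label}: {passes_per_second(rate):.0f} passes/s")
```

Under those assumptions both rates land within a few percent of 1500 passes per second; a different window size would shift the figure proportionally.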
     
  17. Florin

    Florin Merrily dodgy
    Veteran

    Joined:
    Aug 27, 2003
    Messages:
    1,601
    Likes Received:
    133
    Location:
    The colonies
    Again, you are correct. Simple in the black w/o OC becomes 565000k/s when maximised.

    This one I don't get. Why does the CPU become less of a bottleneck at full screen?

    Edit: oh wait I see it actually ups the resolution when maximising
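One way to picture why a higher resolution raises the sample rate: if each pass pays a fixed CPU/driver cost regardless of image size, more pixels per pass spread that cost over more samples. The sketch below is a toy model only; the overhead, the per-sample time and both resolutions are hypothetical values picked to show the shape of the effect, not measurements from this thread.

```python
# Hypothetical costs, chosen only to illustrate the trend.
FIXED_OVERHEAD_S = 500e-6  # assumed per-pass CPU/driver cost
PER_PIXEL_S = 0.5e-9       # assumed GPU time per sample


def samples_per_second(width, height,
                       fixed=FIXED_OVERHEAD_S, per_pixel=PER_PIXEL_S):
    """One sample per pixel per pass; each pass pays the fixed cost once."""
    pixels = width * height
    return pixels / (fixed + pixels * per_pixel)


# Hypothetical windowed and maximised resolutions.
for w, h in [(640, 480), (1920, 1080)]:
    rate = samples_per_second(w, h)
    print(f"{w}x{h}: {rate / 1e6:.0f}M samples/s")
```

In this model the rate climbs with pixel count but can never exceed 1 / PER_PIXEL_S, which is the point where the GPU rather than the per-pass overhead becomes the limit.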
     
  18. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,712
    Likes Received:
    83
    Location:
    Taiwan
    Do you use any private arrays in your kernel? NVIDIA does not support indexed registers right now, so it falls back to global memory for them. That is going to be slow if you index private arrays in a non-predictable manner. If this is the case, and your arrays are small enough, it may be beneficial to substitute them with shared memory (the local buffer in OpenCL). Of course, you'd need one array for each thread, and the local buffer is not very big (only 16KB on NVIDIA's hardware), so it's probably not going to work well if you need large arrays.
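To get a feel for how quickly that 16KB runs out, here is a small budget calculation. It is only a sketch: the work-group sizes and the 4-byte element size are assumptions, and it ignores anything else the kernel keeps in local memory.

```python
LOCAL_MEM_BYTES = 16 * 1024  # per work-group limit quoted above


def local_bytes_needed(work_group_size, elements, element_bytes=4):
    """Local memory used when every work-item gets its own array."""
    return work_group_size * elements * element_bytes


def max_elements_per_item(work_group_size, element_bytes=4,
                          budget=LOCAL_MEM_BYTES):
    """Largest per-work-item array that still fits in the budget."""
    return budget // (work_group_size * element_bytes)


# Hypothetical work-group sizes.
for size in (64, 128, 256):
    print(f"work-group of {size}: up to "
          f"{max_elements_per_item(size)} elements per work-item")
```

A tiny array per work-item fits easily, but the headroom halves every time the work-group size doubles, and using the whole budget for one work-group also limits how many groups can be resident at once.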
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    On ATI the register allocation has gone down by 1 and there are slightly fewer instructions. The reduction in register allocation doesn't affect the count of hardware threads, which is approximately floor(256/NUM_GPRS). The number of clause temporaries used can have a small impact, since these count towards register allocation.

    I see the new kernel function GenerateCameraRay which looks like an attempt to control some variables' lifetime. Seems to have worked on NVidia...

    Jawed
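The floor(256/NUM_GPRS) approximation shows why dropping a single register often changes nothing: the thread count only moves when the division crosses a whole number. A minimal sketch, using hypothetical GPR counts since the actual allocation isn't quoted here:

```python
def hardware_threads(num_gprs, register_budget=256):
    """Approximate hardware thread count: floor(256 / NUM_GPRS)."""
    return register_budget // num_gprs


# Hypothetical register allocations, before and after saving one GPR.
for before, after in [(30, 29), (33, 32)]:
    print(f"{before} -> {after} GPRs: "
          f"{hardware_threads(before)} -> {hardware_threads(after)} threads")
```

Going from 30 to 29 GPRs stays at 8 threads, while going from 33 to 32 steps from 7 to 8, so a one-register saving only pays off near a boundary.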
     
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
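    There's a 2-element array of integer seeds that's private per work-item. On NVIDIA this array is put into local memory (in the CUDA sense, i.e. per-thread storage backed by device memory, not OpenCL's local buffer). On ATI the compiler just uses registers.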

    Jawed
     
