GPU Ray-tracing for OpenCL

Discussion in 'Rendering Technology and APIs' started by fellix, Dec 27, 2009.

  1. chavvdarrr

    Veteran

    Joined:
    Feb 25, 2003
    Messages:
    1,165
    Likes Received:
    34
    Location:
    Sofia, BG
Can someone make a new binary?
     
  2. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,750
    Likes Received:
    127
    Location:
    Taiwan
I'm thinking that the problem is probably not the loop itself, but the loop's direction. In the original loop, data is accessed in reverse order, which can be pretty bad for a GPU, but a CPU can easily tolerate that. Just my two cents.
     
  3. Arnold Beckenbauer

    Veteran

    Joined:
    Oct 11, 2006
    Messages:
    1,415
    Likes Received:
    348
    Location:
    Germany
  4. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
Why would it be bad for a GPU to access data in the opposite direction? I mean, why would it be worse than it is for a CPU?
     
  5. chavvdarrr

    Veteran

    Joined:
    Feb 25, 2003
    Messages:
    1,165
    Likes Received:
    34
    Location:
    Sofia, BG
Because CPU compilers will recognize the loop and optimize it to use ++, while GPU compilers are not yet mature enough (NVidia's OpenCL, at least).
     
  6. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,750
    Likes Received:
    127
    Location:
    Taiwan
It probably doesn't matter if the accessed memory is constant (which is cached on the GPU), but when accessing global memory directly, reversed order can be pretty bad because there is no cache.

By the way, I checked the code and found that the major difference between the two for loops is that in the first one (where the change matters) the index variable is written to a memory location outside of the function, while in the second loop (where the change doesn't matter) there is no other use of the index variable. Maybe that's why the compiler doesn't want to optimize the first loop, for fear of possible side effects.
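A small C sketch (hypothetical function names, not SmallptGPU's actual code) of the two loop shapes being described: in the first, the loop index escapes through a pointer, giving the compiler a visible side effect it must preserve; in the second, the index is purely local, so the compiler is free to rewrite the traversal direction.

```c
#include <stddef.h>

/* Index escapes the loop: each iteration writes it to memory the
   caller can see, so the decrement order is an observable effect. */
float sum_reverse_escaping(const float *data, size_t n, size_t *last_hit)
{
    float acc = 0.0f;
    for (size_t i = n; i-- > 0; ) {  /* walks data in reverse */
        acc += data[i];
        *last_hit = i;               /* visible side effect */
    }
    return acc;
}

/* Index is purely local: only the sum is observable, so the
   compiler may legally traverse the array in either direction. */
float sum_reverse_local(const float *data, size_t n)
{
    float acc = 0.0f;
    for (size_t i = n; i-- > 0; )
        acc += data[i];
    return acc;
}
```

Both return the same sum; the difference is only in what the compiler is allowed to prove and rearrange.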
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    No constant buffer usage on ATI

    Final message in this thread:

    http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=125954&enterthread=y

     
  8. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
Sure. I guess the read-only L2s are not used in this (non-graphics) context? Fermi should behave better in this case..
     
  9. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
  10. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I think nAo was objecting to the idea that it would be bad for the hardware, not the compiler. The loop condition is based on a constant, too, so it's 100% coherent. The weird thing is that it doesn't affect the second loop whether it's -- or ++.

That sort of makes sense, but note that the index isn't written to a memory location but rather to a local variable.

    That's an interesting point. It's very different from the way CUDA works, that's for sure. It may be that NVidia isn't using constant memory either, because when I put the sphere data in an image object, performance was the same as when using a constant array declared inside geomfunc.h.

    I don't think it matters, as the problem is the same unless your hardware decides to eliminate constant registers altogether. The only solution I can think of is to have the compiler create some bytecode that can be dynamically modified to change the type of read.

    I think so, given the speedups I get by using an image object. The data set is 400 bytes, so it will fit in the tiniest of caches.
     
  11. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Doesn't that indicate that it is in fact using constant memory? We know it's not using global memory (any more) so where else could it be stored?

    Would be nice if someone could check the impact on Cypress of the image approach too.
     
  12. chavvdarrr

    Veteran

    Joined:
    Feb 25, 2003
    Messages:
    1,165
    Likes Received:
    34
    Location:
    Sofia, BG
Well, if a program accesses sequential data, I'm fairly sure that incremental access will be faster: DDR/DDR2/3/4/5 are all much faster when bursting lots of data from one address command, and they burst with an incrementing source address, no?
Maybe 10 years ago I ran some tests with Watcom, and multiplying matrices with incrementing indices was way faster than with decrementing ones. Caches may level off the difference, as may the sophisticated prediction hardware in current CPUs.
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    ATI OpenCL doesn't support images currently.

Examining the PTX (I presume that's possible) might confirm whether data declared constant ends up in a constant buffer or a texture buffer.

    Jawed
     
  14. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,804
    Likes Received:
    475
    Location:
    Torquay, UK
Sorry, but I can't run OpenCL on my HTC Diamond2 yet :wink:
I'm coming back home today, so I will play with my precious!
     
  15. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    __constant items can be accessed globally, but they're read only. I think a lot of caches in GPUs only work with read-only items, so the speedup doesn't necessarily imply that constants aren't using global memory.
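A minimal OpenCL C sketch (hypothetical kernels, not taken from SmallptGPU; SPHERE_COUNT and the float4 packing are assumptions for illustration) of the two read-only storage choices discussed in this thread:

```c
/* Hypothetical illustration: the same read-only sphere data exposed
   via the __constant address space versus an image object. */

#define SPHERE_COUNT 25

/* __constant: globally visible but read-only; may be served by the
   dedicated (size-limited) constant cache. */
__kernel void trace_constant(__constant float4 *spheres,
                             __global float4 *out)
{
    size_t gid = get_global_id(0);
    out[gid] = spheres[gid % SPHERE_COUNT];
}

/* image object: read-only sampled reads through the texture cache. */
__kernel void trace_image(__read_only image2d_t spheres,
                          __global float4 *out)
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                          CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
    size_t gid = get_global_id(0);
    out[gid] = read_imagef(spheres, smp,
                           (int2)((int)(gid % SPHERE_COUNT), 0));
}
```

Which of these the driver actually backs with the constant cache, the texture cache, or plain global memory is exactly what's being debated above.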

    Are you serious? The whole reason that aspect was put in was to take advantage of the filtering hardware on GPUs. That's pretty sad.

    I can see why they don't want to make OpenCL available in Catalyst yet.
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    It seems to me the OpenCL memory model is quite loose - much looser than seen in DirectCompute. This might be why NVidia's OpenCL performance is falling short of CUDA's in certain cases - though I also suspect that the comfort blanket of explicit warp-size aligned execution that's missing in OpenCL might be causing problems too.

    Also I think DirectCompute is a higher priority for AMD than OpenCL - the benchmarking of games is higher profile than OpenCL noodlers' experiments...

    Jawed
     
  17. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Yeah, that makes sense. Compliance has something to do with it, too, as I just found that the compiler option "-cl-fast-relaxed-math" adds another 30%.

    True. I forgot about DirectCompute, actually. For some reason I thought G80 didn't support it, but I think it does.
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I think CUDA provides similar optionality - though some of it might require explicit function calls rather than compiler options.

Yeah, limited workgroup size, owner-writes-only thread-local shared memory, and a single UAV are the main limitations there, I believe.

    Jawed
     
  19. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Can you clarify that a bit? There are dedicated read-only constant and texture caches. Of course the underlying data all resides in global memory but where else could constants get cached except for the constant cache?
     
  20. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
Probably the texture cache (hence being the same speed as when I use an image object). When NVidia compiles the shader, it has no idea how big the constant buffers are going to be, which is what Micah was talking about in the quote above. Thus they may be too big to work with the dedicated constant cache (CUDA reports constant memory as 64k on my comp). With ATI, the R700 ISA document says it can work with 16 constant buffers of 4K float4s each (i.e. 16x64k). In both cases the addressable space for constants is limited.

    I also forgot to reply to your earlier post:
Actually the profilers and debuggers did nothing for me. NVidia's OpenCL profiler wouldn't even give me a window before crashing, and their OpenCL compiler was a real pain, as it kept crashing in certain situations depending on the source code (it seems really random; one time I passed a variable to a function by pointer instead of by value and the crash stopped).

    It was just a wild stab in the dark. I knew something was wrong just by working through a rough estimate of where performance should be. I rewrote the loop with a short and perf went up, then changed it back to int and perf stayed the same, so it was changing from increment to decrement that was the difference. I still don't know why it affected one part of the code but not another similar area.

Regarding performance, the 0.69 Gr/s is equivalent to 69,000 ks/s, so OpenCL has some catching up to do to reach CUDA performance, but at least it's within a factor of two now. I was really hoping the newer cards would do over 1 Gs/s, hence the choice of units :lol:
     