GPU Ray-tracing for OpenCL

I'll refrain from commenting on a compiler that requires that kind of "optimization" ;)

I'm thinking that the problem is probably not the loop, but the loop's direction. In the original loop, data is accessed in reverse order, which can be pretty bad for a GPU, but a CPU can easily tolerate it. Just my two cents.
 
I'm thinking that the problem is probably not the loop, but the loop's direction. In the original loop, data is accessed in reverse order, which can be pretty bad for a GPU, but a CPU can easily tolerate it. Just my two cents.
Why would it be bad for a GPU to access data in the opposite direction? I mean, why would it be worse than it is for a CPU?
 
Why would it be bad for a GPU to access data in the opposite direction? I mean, why would it be worse than it is for a CPU?
Because CPU compilers will recognize the loop and optimize it to ++, while GPU ones are not mature enough (NV opencl at least)
 
Why would it be bad for a GPU to access data in the opposite direction? I mean, why would it be worse than it is for a CPU?

It probably doesn't matter if the accessed memory is constant (which is cached on GPU), but if accessing the global memory directly, reversed order can be pretty bad because there is no cache.

By the way, I checked the code and I found that the major difference between the two for loops is that in the first one (where the change matters) the index variable is written to a memory location outside of the function, while in the second loop (where the change doesn't matter) there is no other use of the index variable. Maybe that's why the compiler doesn't want to optimize the first loop for fear of possible side effects.
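For illustration, here's a minimal OpenCL C sketch of the two loop shapes being described. The names (SphereIntersect, spheres, bestDist, hitIndex, contribution) are placeholders, not the actual code from the demo:

```c
/* Loop 1: the index escapes into hitIndex, a variable that outlives the
 * loop. This is the loop where flipping the direction mattered. */
unsigned int hitIndex = 0xffffffffu;
float bestDist = MAXFLOAT;
for (unsigned int i = 0; i < sphereCount; i++) { /* originally counted down */
    const float d = SphereIntersect(&spheres[i], ray);
    if (d < bestDist) {
        bestDist = d;
        hitIndex = i; /* index written to a variable outside the loop */
    }
}

/* Loop 2: the index is only consumed inside the body, and here the
 * direction made no measurable difference. */
float sum = 0.f;
for (unsigned int i = 0; i < sphereCount; i++)
    sum += contribution[i];
```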
 
No constant buffer usage on ATI

Final message in this thread:

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=125954&enterthread=y

Micah Villmow said:
constant address space pointers do not use the constant cache since the size of the pointer is not known at compile time and the hardware constant cache sizes do not conform to the OpenCL spec. We are working on a way to allow programmers to put the data in the constant caches, but currently the data resides in global memory.
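In code terms, the distinction Micah is drawing looks roughly like this (a sketch; the Sphere struct and kernel names are made up, not taken from the demo):

```c
typedef struct { float cx, cy, cz, rad; } Sphere;

/* Pointer form: the buffer size is unknown at compile time, so on the
 * r7xx stack described above these reads come from plain global memory. */
__kernel void IntersectPtr(__constant Sphere *spheres, const uint count,
                           __global float *out) {
    const uint gid = get_global_id(0);
    out[gid] = spheres[gid % count].rad;
}

/* Program-scope array form: the size is a compile-time constant, so a
 * driver is free to place the data in the hardware constant cache. */
__constant Sphere FIXED_SPHERES[2] = {
    { 0.f, 0.f, 0.f, 1.f },
    { 2.f, 0.f, 0.f, 0.5f }
};

__kernel void IntersectArray(__global float *out) {
    const uint gid = get_global_id(0);
    out[gid] = FIXED_SPHERES[gid % 2].rad;
}
```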
 
It probably doesn't matter if the accessed memory is constant (which is cached on GPU), but if accessing the global memory directly, reversed order can be pretty bad because there is no cache.
Sure. I guess read-only L2s are not used in this (non-graphics) context? Fermi should behave better in this case...
 
Because CPU compilers will recognize the loop and optimize it to ++, while GPU ones are not mature enough (NV opencl at least)
I think nAo was objecting to the idea that it would be bad for the hardware, not the compiler. The loop condition is based on a constant, too, so it's 100% coherent. The weird thing is that it doesn't affect the second loop whether it's -- or ++.

By the way, I checked the code and I found that the major difference between the two for loops is that in the first one (where the change matters) the index variable is written to a memory location outside of the function, while in the second loop (where the change doesn't matter) there is no other use of the index variable. Maybe that's why the compiler doesn't want to optimize the first loop for fear of possible side effects.
That sort of makes sense, but note that the index isn't written to a memory location but rather a local variable.

That's an interesting point. It's very different from the way CUDA works, that's for sure. It may be that NVidia isn't using constant memory either, because when I put the sphere data in an image object, performance was the same as when using a constant array declared inside geomfunc.h.
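For reference, the image-object variant looks roughly like this. This is a sketch, assuming the sphere data is packed one float4 per texel into a one-row image; the names are illustrative:

```c
__constant sampler_t SPHERE_SAMPLER = CLK_NORMALIZED_COORDS_FALSE |
                                      CLK_ADDRESS_CLAMP_TO_EDGE |
                                      CLK_FILTER_NEAREST;

/* Fetches go through the texture cache instead of uncached global memory. */
__kernel void TraceImg(__read_only image2d_t sphereTex,
                       __global float4 *out) {
    const int gid = get_global_id(0);
    out[gid] = read_imagef(sphereTex, SPHERE_SAMPLER, (int2)(gid, 0));
}
```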

That refers to an r7xx series device. I wonder what the status is on Evergreen. After all, it was supposed to have a *full* hw implementation of the OpenCL 1.0 spec.
I don't think it matters, as the problem is the same unless your hardware decides to eliminate constant registers altogether. The only solution I can think of is to have the compiler create some bytecode that can be dynamically modified to change the type of read.

Sure. I guess read-only L2s are not used in this (non-graphics) context? Fermi should behave better in this case...
I think so, given the speedups I get by using an image object. The data set is 400 bytes, so it will fit in the tiniest of caches.
 
That's an interesting point. It's very different from the way CUDA works, that's for sure. It may be that NVidia isn't using constant memory either, because when I put the sphere data in an image object, performance was the same as when using a constant array declared inside geomfunc.h.

Doesn't that indicate that it is in fact using constant memory? We know it's not using global memory (any more) so where else could it be stored?

Would be nice if someone could check the impact on Cypress of the image approach too.
 
I think nAo was objecting to the idea that it would be bad for the hardware, not the compiler. The loop condition is based on a constant, too, so it's 100% coherent. The weird thing is that it doesn't affect the second loop whether it's -- or ++.
Well, if a program accesses sequential data, I'm fairly sure that incrementing access will be faster: DDR/DDR2/3/4/5 all are much faster when bursting lots of data off one address command, and they burst with an incrementing source address, no?
Maybe ten years ago I ran some tests with Watcom, and multiplying matrices with incrementing indices was way faster than with decrementing ones. Caches may level off the difference, as may the sophisticated prefetch hardware in current CPUs.
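That claim is easy to sanity-check on the host side. A minimal C benchmark, summing a large array forward and then backward (on a modern CPU the prefetchers usually shrink the gap to almost nothing):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24) /* 16M floats = 64 MB, well past any cache */

int main(void) {
    float *a = malloc(N * sizeof *a);
    for (size_t i = 0; i < N; i++)
        a[i] = 1.0f;

    clock_t t0 = clock();
    volatile float fwd = 0.f;
    for (size_t i = 0; i < N; i++)   /* incrementing addresses */
        fwd += a[i];

    clock_t t1 = clock();
    volatile float bwd = 0.f;
    for (size_t i = N; i-- > 0; )    /* decrementing addresses */
        bwd += a[i];

    clock_t t2 = clock();
    printf("forward:  %ld ticks\nbackward: %ld ticks\n",
           (long)(t1 - t0), (long)(t2 - t1));
    free(a);
    return 0;
}
```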
 
Would be nice if someone could check the impact on Cypress of the image approach too.
ATI OpenCL doesn't support images currently.

Examination of the PTX (I presume that's possible) might confirm whether specifying __constant actually ends up using a constant buffer or a texture buffer.
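(For the record, it is possible: NVidia's OpenCL hands back PTX text as the program "binary". A host-side sketch for a program built for a single device, error checking omitted; the output can then be grepped for ld.const versus texture/global loads:)

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Dump the program "binary" to stdout; on NVidia's implementation this
 * is PTX text. Assumes 'program' was built for exactly one device. */
static void dump_ptx(cl_program program) {
    size_t size = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                     sizeof(size), &size, NULL);

    unsigned char *ptx = malloc(size);
    clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                     sizeof(ptx), &ptx, NULL);

    fwrite(ptx, 1, size, stdout);
    free(ptx);
}
```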

Jawed
 
Doesn't that indicate that it is in fact using constant memory? We know it's not using global memory (any more) so where else could it be stored?
__constant items can be accessed globally, but they're read only. I think a lot of caches in GPUs only work with read-only items, so the speedup doesn't necessarily imply that constants aren't using global memory.

ATI OpenCL doesn't support images currently.
Are you serious? The whole reason that aspect was put in was to take advantage of the filtering hardware on GPUs. That's pretty sad.

I can see why they don't want to make OpenCL available in Catalyst yet.
 
It seems to me the OpenCL memory model is quite loose - much looser than seen in DirectCompute. This might be why NVidia's OpenCL performance is falling short of CUDA's in certain cases - though I also suspect that the comfort blanket of explicit warp-size aligned execution that's missing in OpenCL might be causing problems too.

Also I think DirectCompute is a higher priority for AMD than OpenCL - the benchmarking of games is higher profile than OpenCL noodlers' experiments...

Jawed
 
It seems to me the OpenCL memory model is quite loose - much looser than seen in DirectCompute. This might be why NVidia's OpenCL performance is falling short of CUDA's in certain cases - though I also suspect that the comfort blanket of explicit warp-size aligned execution that's missing in OpenCL might be causing problems too.
Yeah, that makes sense. Compliance has something to do with it, too, as I just found that the compiler option "-cl-fast-relaxed-math" adds another 30%.
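(For anyone following along, the flag goes into the program build step; program and device here are assumed to be your already-created cl_program and target cl_device_id:)

```c
#include <CL/cl.h>

/* Relax IEEE precision/NaN requirements for the whole program; this is
 * the standard OpenCL build option mentioned above. */
cl_int err = clBuildProgram(program, 1, &device,
                            "-cl-fast-relaxed-math", NULL, NULL);
```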

Also I think DirectCompute is a higher priority for AMD than OpenCL - the benchmarking of games is higher profile than OpenCL noodlers' experiments...
True. I forgot about DirectCompute, actually. For some reason I thought G80 didn't support it, but I think it does.
 
Yeah, that makes sense. Compliance has something to do with it, too, as I just found that the compiler option "-cl-fast-relaxed-math" adds another 30%.
I think CUDA provides similar optionality - though some of it might require explicit function calls rather than compiler options.

True. I forgot about DirectCompute, actually. For some reason I thought G80 didn't support it, but I think it does.
Yeah, limited workgroup size, owner-writes-only thread-local shared memory, and a single UAV are the main limitations there, I believe.

Jawed
 
__constant items can be accessed globally, but they're read only. I think a lot of caches in GPUs only work with read-only items, so the speedup doesn't necessarily imply that constants aren't using global memory.

Can you clarify that a bit? There are dedicated read-only constant and texture caches. Of course the underlying data all resides in global memory but where else could constants get cached except for the constant cache?
 
Can you clarify that a bit? There are dedicated read-only constant and texture caches. Of course the underlying data all resides in global memory but where else could constants get cached except for the constant cache?
Probably the texture cache (hence being the same speed as when I use an image object). When NVidia compiles the kernel, it has no idea how big the constant buffers are going to be, which is what Micah was talking about in the quote above. Thus they may be too big to work with the dedicated constant cache (CUDA reports constant memory as 64k on my machine). With ATI, the R700 ISA document says it can work with 16 constant buffers of 4K float4s each (i.e. 16x64k). In both cases the addressable space for constants is limited.
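Both limits are queryable per device, by the way; a small host-side sketch (64 KB is also the spec minimum for the buffer size):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Print the per-device limits on __constant storage. */
static void print_constant_limits(cl_device_id device) {
    cl_ulong maxConstSize = 0;
    cl_uint  maxConstArgs = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                    sizeof(maxConstSize), &maxConstSize, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_ARGS,
                    sizeof(maxConstArgs), &maxConstArgs, NULL);
    printf("max __constant buffer: %llu bytes, max __constant args: %u\n",
           (unsigned long long)maxConstSize, maxConstArgs);
}
```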

I also forgot to reply to your earlier post:
Only 30x faster? Lame :p

Nice job man. Goes to show how important profilers and debuggers are for catching this sort of thing.

GTX 285:

OpenCL Dade: 1,700 ks/s
OpenCL Mint Loop: 10,000 ks/s
OpenCL Mint Constant: 37,000 ks/s
CUDA Mint: 0.69 Gr/s
Actually the profilers and debuggers did nothing for me. NVidia's OpenCL profiler wouldn't even give me a window before crashing, and their OpenCL compiler was a real pain, as it kept crashing in certain situations with the source code (it seemed really random; one time I passed a variable to a function by pointer instead of by value and the crash stopped).

It was just a wild stab in the dark. I knew something was wrong just by working through a rough estimate of where performance should be. I rewrote the loop with a short and perf went up, then changed it back to int and perf stayed the same, so it was the switch between decrement and increment that made the difference. I still don't know why it affected one part of the code but not another similar area.

Regarding performance, the 0.69 Gr/s is equivalent to 69,000 ks/s, so OpenCL has some catching up to do for it to reach CUDA performance, but at least it's within a factor of two now. I was really hoping the newer cards would do over 1 Gs/s, hence the choice of units :LOL:
 