chavvdarrr (Veteran):
Can someone make a new binary?
I will refrain from commenting on a compiler that requires that kind of "optimization"
...
The latest version of SmallLuxGPU is available here: http://davibu.interfree.it/opencl/smallluxgpu/smallluxgpu-v1.2beta2.tgz
> I'm thinking that the problem is probably not the loop, but the loop's direction. In the original loop, data is accessed in reverse order, which can be pretty bad for a GPU, but a CPU can easily tolerate that. Just my two cents.

Why would it be bad for a GPU to access data in the opposite direction? I mean, why would it be worse than it is for a CPU?
> Why would it be bad for a GPU to access data in the opposite direction? I mean, why would it be worse than it is for a CPU?

Because CPU compilers will recognize the loop and optimize it to ++, while GPU ones are not mature enough (NV OpenCL, at least).
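For concreteness, here is a minimal sketch of the kind of "--" to "++" rewrite being discussed. This is not the actual SmallLuxGPU kernel; the kernel names and data are placeholders.

```c
// Hypothetical OpenCL C kernels meant only to illustrate the two loop shapes
// (intended to be enqueued with a single work-item).

__kernel void sum_backward(__global const float *data,
                           const unsigned int count,
                           __global float *result) {
    float acc = 0.0f;
    // Decrementing loop, in the style the thread is complaining about:
    for (unsigned int i = count; i--;)
        acc += data[i];
    *result = acc;
}

__kernel void sum_forward(__global const float *data,
                          const unsigned int count,
                          __global float *result) {
    float acc = 0.0f;
    // Incrementing form that a mature CPU compiler would be expected to
    // produce on its own, since the iteration order doesn't matter here:
    for (unsigned int i = 0; i < count; ++i)
        acc += data[i];
    *result = acc;
}
```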
Micah Villmow said:
> constant address space pointers do not use the constant cache since the size of the pointer is not known at compile time and the hardware constant cache sizes do not conform to the OpenCL spec. We are working on a way to allow programmers to put the data in the constant caches, but currently the data resides in global memory.
> It probably doesn't matter if the accessed memory is constant (which is cached on the GPU), but if accessing global memory directly, reversed order can be pretty bad because there is no cache.

Sure. I guess read-only L2s are not used in this (non-graphics) context? Fermi should behave better in this case.
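To make Micah's point concrete, here is a hedged sketch of the two ways data can end up in the __constant address space; only the second form has a size the compiler can see at compile time. The Sphere layout, values, and kernel names are made up, not SmallLuxGPU's.

```c
typedef struct { float rad; float cx, cy, cz; } Sphere;   // placeholder layout

// Form 1: __constant pointer passed as a kernel argument. The compiler
// cannot see how large the buffer will be, so (per the quote above) the
// data is serviced from global memory on this hardware.
__kernel void trace_arg(__constant Sphere *spheres,
                        const unsigned int sphereCount,
                        __global float *out) {
    const unsigned int gid = get_global_id(0);
    out[gid] = spheres[gid % sphereCount].rad;
}

// Form 2: program-scope __constant array with a compile-time size, like
// the constant array declared inside geomfunc.h mentioned further down.
__constant Sphere kSpheres[3] = {
    {1.0f, 0.0f, 0.0f, 0.0f},
    {2.0f, 1.0f, 0.0f, 0.0f},
    {3.0f, 0.0f, 1.0f, 0.0f},
};

__kernel void trace_inline(__global float *out) {
    const unsigned int gid = get_global_id(0);
    out[gid] = kSpheres[gid % 3].rad;
}
```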
Final message in this thread:
http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=125954&enterthread=y
> Because CPU compilers will recognize the loop and optimize it to ++, while GPU ones are not mature enough (NV OpenCL, at least).

I think nAo was objecting to the idea that it would be bad for the hardware, not the compiler. The loop condition is based on a constant, too, so it's 100% coherent. The weird thing is that it makes no difference in the second loop whether it's -- or ++.
> By the way, I checked the code and I found that the major difference between the two for loops is that in the first one (where the change matters) the index variable is written to a memory location outside of the function, while in the second loop (where the change doesn't matter) there is no other use of the index variable. Maybe that's why the compiler doesn't want to optimize the first loop for fear of possible side effects.

That sort of makes sense, but note that the index isn't written to a memory location but rather a local variable.
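A hedged sketch of the structural difference being described; the helper functions below are simplified stand-ins written for this note, not the actual geomfunc.h code.

```c
// Loop A: the index escapes through the 'id' pointer (which, as noted above,
// actually points at a local variable in the caller), so the compiler has to
// keep the exact value of 'i' correct at every store.
void MinElement(__global const float *vals, const unsigned int count,
                float *best, unsigned int *id) {
    *best = MAXFLOAT;
    for (unsigned int i = count; i--;) {
        if (vals[i] < *best) {
            *best = vals[i];
            *id = i;                 // index written through a pointer
        }
    }
}

// Loop B: 'i' is only used for addressing and nothing else observes its
// value, so reversing the direction changes nothing the compiler can see.
float Sum(__global const float *vals, const unsigned int count) {
    float sum = 0.0f;
    for (unsigned int i = count; i--;)
        sum += vals[i];
    return sum;
}
```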
> Final message in this thread:
> http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=125954&enterthread=y

That's an interesting point. It's very different from the way CUDA works, that's for sure. It may be that NVidia isn't using constant memory either, because when I put the sphere data in an image object, performance was the same as when using a constant array declared inside geomfunc.h.
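For reference, a rough sketch of the image-object workaround mentioned above. The packing scheme (one float4 of sphere data per texel) is an assumption for illustration, not necessarily how SmallLuxGPU lays it out.

```c
// Sampler for array-style fetches: unnormalized coordinates, no filtering.
const sampler_t kSphereSampler = CLK_NORMALIZED_COORDS_FALSE |
                                 CLK_ADDRESS_CLAMP_TO_EDGE |
                                 CLK_FILTER_NEAREST;

__kernel void read_spheres(__read_only image2d_t sphereTex,
                           __global float4 *out) {
    const int gid = get_global_id(0);
    // Texel i holds one packed float4 of sphere data, e.g.
    // (radius, center.x, center.y, center.z).
    out[gid] = read_imagef(sphereTex, kSphereSampler, (int2)(gid, 0));
}
```

On the host (OpenCL 1.0/1.1 API) the data would be uploaded with clCreateImage2D using a CL_RGBA/CL_FLOAT format, which is presumably what routes the reads through the texture-cache path instead of plain global loads.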
> That refers to an R7xx-series device. I wonder what the status is on Evergreen. After all, it was supposed to have a *full* hw implementation of the OpenCL 1.0 spec.

I don't think it matters, as the problem is the same unless your hardware decides to eliminate constant registers altogether. The only solution I can think of is to have the compiler create some bytecode that can be dynamically modified to change the type of read.
> Sure. I guess read-only L2s are not used in this (non-graphics) context? Fermi should behave better in this case.

I think so, given the speedups I get by using an image object. The data set is 400 bytes, so it will fit in the tiniest of caches.
> I think nAo was objecting to the idea that it would be bad for the hardware, not the compiler. The loop condition is based on a constant, too, so it's 100% coherent. The weird thing is that it makes no difference in the second loop whether it's -- or ++.

Well, if a program accesses sequential data, I'm fairly sure that incremental access will be faster - DDR/DDR2/3/4/5 are all much faster when bursting lots of data from one address command. And they burst with an incrementing source address, no?
> Would be nice if someone could check the impact on Cypress of the image approach too.

ATI OpenCL doesn't support images currently.
> Okay, but can you at least take 30 seconds to make those two changes and tell me what you get?

Sorry, but I can't yet run OpenCL on my HTC Diamond2.
> Doesn't that indicate that it is in fact using constant memory? We know it's not using global memory (any more), so where else could it be stored?

__constant items can be accessed globally, but they're read-only. I think a lot of caches in GPUs only work with read-only items, so the speedup doesn't necessarily imply that constants aren't using global memory.
> ATI OpenCL doesn't support images currently.

Are you serious? The whole reason that aspect was put in was to take advantage of the filtering hardware on GPUs. That's pretty sad.
> It seems to me the OpenCL memory model is quite loose - much looser than seen in DirectCompute. This might be why NVidia's OpenCL performance is falling short of CUDA's in certain cases - though I also suspect that the comfort blanket of explicit warp-size aligned execution that's missing in OpenCL might be causing problems too.

Yeah, that makes sense. Compliance has something to do with it, too, as I just found that the compiler option "-cl-fast-relaxed-math" adds another 30%.
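For anyone who wants to try the same thing: the option is just passed as the build-options string to clBuildProgram. Everything here other than the option itself (the helper name, log size, error handling) is illustrative.

```c
#include <stdio.h>
#include <CL/cl.h>

// Build an already-created program with relaxed math; print the build log
// on failure.
cl_int BuildWithRelaxedMath(cl_program program, cl_device_id device) {
    const cl_int err = clBuildProgram(program, 1, &device,
                                      "-cl-fast-relaxed-math", NULL, NULL);
    if (err != CL_SUCCESS) {
        char log[16384] = {0};
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              sizeof(log) - 1, log, NULL);
        fprintf(stderr, "clBuildProgram failed:\n%s\n", log);
    }
    return err;
}
```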
> Also I think DirectCompute is a higher priority for AMD than OpenCL - the benchmarking of games is higher profile than OpenCL noodlers' experiments...

True. I forgot about DirectCompute, actually. For some reason I thought G80 didn't support it, but I think it does.
> Yeah, that makes sense. Compliance has something to do with it, too, as I just found that the compiler option "-cl-fast-relaxed-math" adds another 30%.

I think CUDA provides similar optionality - though some of it might require explicit function calls rather than compiler options.
> True. I forgot about DirectCompute, actually. For some reason I thought G80 didn't support it, but I think it does.

Yeah, limited workgroup size, owner-writes-only thread-local shared memory, and a single UAV are the main limitations there, I believe.
> Can you clarify that a bit? There are dedicated read-only constant and texture caches. Of course the underlying data all resides in global memory, but where else could constants get cached except for the constant cache?

Probably the texture cache (hence being the same speed as when I use an image texture). When NVidia compiles the shader, it has no idea how big the constant buffers are going to be, which is what Micah was talking about in the quote above. Thus they may be too big to work with the dedicated constant cache (CUDA reports constant memory as 64k on my machine). With ATI, the R700 ISA document says that it can work with 16 constant buffers of 4K float4s each (i.e. 16x64k). In both cases the addressable space for constants is limited.
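The limits the runtime actually advertises can be queried directly; a hedged host-side sketch using standard OpenCL 1.x calls (the helper name is made up):

```c
#include <stdio.h>
#include <CL/cl.h>

// Print the device's advertised __constant limits. The OpenCL 1.0 spec only
// guarantees 64 KB and 8 __constant kernel arguments as minimums.
void PrintConstantLimits(cl_device_id device) {
    cl_ulong maxConstBytes = 0;
    cl_uint maxConstArgs = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                    sizeof(maxConstBytes), &maxConstBytes, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_ARGS,
                    sizeof(maxConstArgs), &maxConstArgs, NULL);
    printf("__constant buffer size: %llu bytes, max __constant args: %u\n",
           (unsigned long long)maxConstBytes, (unsigned)maxConstArgs);
}
```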
Nice job man. Goes to show how important profilers and debuggers are for catching this sort of thing.

> Only 30x faster? Lame

Actually, the profilers and debuggers did nothing for me. NVidia's OpenCL profiler wouldn't even give me a window before crashing, and their OpenCL compiler was a real pain, as it kept crashing in certain situations with the source code (it seems really random; one time I passed a variable to a function by pointer instead of by value and the crash stopped).
GTX 285:
OpenCL Dade: 1,700 ks/s
OpenCL Mint Loop: 10,000 ks/s
OpenCL Mint Constant: 37,000 ks/s
CUDA Mint: 0.69 Gr/s