Thanks Jawed, that's a lot of interesting information. I'm asking OpenCL directly for the suggested workgroup size for my kernel ... I guess the default answer from the driver isn't that good.
The spec says this is supposed to take account of the resource requirements of the kernel (table 5.14). ATI seems to give a reasonable answer, assuming maximal sharing of local memory would be advantageous to the kernel (pointless in this case, as local memory isn't actually being used as far as I can tell). NVidia, though, just seems to respond with a nonsense number, ignoring resource consumption entirely. So both are suffering from immaturity there. Honestly, I'm dubious this'll ever really be of much use.
I will add a command-line option to override the suggested size so we can run some tests.
I think any positive number up to the device maximum is technically valid, but I dare say the common multiples of powers of 2 are safest.
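A minimal sketch of what that override logic might look like, sticking to the powers-of-2 advice above - the function name and parameters are illustrative, not from smallptGPU itself:

```c
#include <stddef.h>

/* Round a requested work-group size down to the largest power of two
 * that does not exceed the device/kernel maximum (e.g. the value
 * reported by clGetKernelWorkGroupInfo). Names here are illustrative. */
static size_t clamp_workgroup_size(size_t requested, size_t device_max)
{
    size_t wg = 1;
    /* Grow by powers of two while still within both limits. */
    while (wg * 2 <= requested && wg * 2 <= device_max)
        wg *= 2;
    return wg;
}
```

So a user asking for 196 work-items on a device with a 256 limit would get 128, which keeps the size a clean power of two.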
I assume 196 is the maximum number of registers used during execution on NVIDIA. Do you have any suggestions on how to reduce this number? For instance, reducing the lifespan of local variables should bring it down.
Actually, this number is a total guess - I simply took the ATI allocation and multiplied by 4. It could be substantially wrong (e.g. 100 registers) - it all comes down to how the compiler treats variable lifetimes and whether it decides to statically spill to global memory. I don't know if NVidia's tools can provide a register count for an OpenCL kernel. NVidia's GPUs also have varying register file capacities, which affects the per-work-item register count on different cards - the Occupancy Calculator can help there; you just need to match the CUDA Compute Capability to the card model.
ATI's new profiler for OpenCL, with its ISA listing feature, provides the NUM_GPRs statistic, amongst other things.
Are NVIDIA registers vec4/float4 like the ATI ones? In that case I could greatly reduce register usage by switching to OpenCL vector types.
Registers are always allocated as vec4 in ATI (128-bit). The compiler will try to pack kernel variables into registers as tightly as possible, but there are plenty of foibles there - e.g. smallptGPU might only need 46 registers with perfect packing.
NVidia allocates all registers as scalars (32-bit), which is why I multiplied by 4.
I have to admit I've only noticed today that float3 is not welcome in much of OpenCL, e.g. the Geometric Functions in 6.11.5.
NVidia tends to advise against vector types (even though Direct3D and OpenGL are the primary APIs), but I have no practical experience of the situations where there's a real benefit in kernels as complex as those in smallptGPU.
So I'm unsure whether switching to OpenCL's vectors is a good idea. I dare say they'd be my starting point, but I don't have the practical experience, the compilers are immature, and the float3 gotcha in OpenCL might make things moot anyway.
I noticed that SmallLuxGPU uses quite a few OpenCL float4s padded with 0.f. In theory these should be fine: the compiler can optimise .w away in adds/muls, and with .w padded to zero, intrinsics such as the dot product give the same result as a 3-component dot. But it's just another part of the learning curve, I'm afraid...
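To illustrate why the 0.f padding is harmless for dot products, here's a plain-C analog of the float4 convention (OpenCL's dot() on a float4 does sum all four components, which is exactly why the padding must be zero):

```c
/* Plain-C analog of the float4-padded-with-0.f convention: with w = 0
 * on both operands, the 4-component dot product equals the 3-component
 * one, so vectors stored as float4 behave like float3 under dot(). */
typedef struct { float x, y, z, w; } vec4;

static float dot4(vec4 a, vec4 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

static float dot3(vec4 a, vec4 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```

The a.w * b.w term contributes exactly 0 when both vectors are padded, so the two functions agree bit-for-bit.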
P.S. SmallLuxGPU is quite a different beast from SmallptGPU. SmallptGPU is a GPU-only application, while SmallLuxGPU is a test of how a large amount of existing code (i.e. Luxrender) could be adapted to take advantage of GPGPU technology.
Yeah, I noticed the GPU is not being worked very hard as yet - it seems that various host-side tasks are the major bottleneck. Performance will also vary dramatically with the quality of the motherboard chipset, i.e. PCI Express bandwidth looks like it'll cause quite variable results.
Jawed