The idea is to use the GPGPU only for ray intersections, in order to minimize the amount of brand-new code to write and to not lose any of the functionality already available in Luxrender. In order to test this idea, I wrote a very simplified path tracer and ported Luxrender's BVH accelerator to OpenCL.
I suggest changing the work group size calculation so that it does not use the maximum: 64 would be better than 256 on ATI. 64 is the minimum size on an ATI HD5870 or HD4870, and some ATI GPUs (HD43xx, HD45xx, HD46xx) will work best with a lower size (32 or 16).
NVidia should be happy with 32 or 64.
It might be an idea to expose the workgroup size as a command line parameter. Or have the program try a few different values. Or just hard-code it to 64.
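That fallback logic could be sketched roughly as below; the function name and the override-handling are hypothetical, not taken from SmallptGPU's actual code:

```c
#include <assert.h>

/* Hypothetical helper: pick a work-group size from an optional
 * command-line override (0 = not given), falling back to a
 * hard-coded 64, clamped to the device maximum. */
static unsigned pick_workgroup_size(unsigned override_size,
                                    unsigned device_max)
{
    const unsigned preferred = 64; /* reasonable default on ATI and NVidia */
    if (override_size > 0 && override_size <= device_max)
        return override_size;      /* respect the user's choice */
    if (preferred <= device_max)
        return preferred;          /* hard-coded sweet spot */
    return device_max;             /* tiny device: take what we can get */
}
```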
The register usage appears to be 49 vec4 registers. This is reasonable on ATI, resulting in 5 hardware threads (wavefronts). On NVidia it is a disaster (the equivalent of 196 registers if fiddling with the occupancy calculator), meaning that only 2 hardware threads (warps) can occupy each multiprocessor. As it happens 64 is better than 32 for the workgroup size in this scenario.
I'm not sure how NVidia handles the situation when 512 work-items are requested but the hardware can only issue 64 - I'm not sure if the hardware is spilling registers in this situation, if it is, then that compounds the disaster. Hopefully making this change will improve things dramatically.
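The warp-count arithmetic behind that "2 warps per multiprocessor" estimate can be sketched as follows. The 16384-entry register file is an assumption for a GT200-class multiprocessor, and the function is illustrative, not taken from any real occupancy tool:

```c
#include <assert.h>

/* Back-of-envelope occupancy: warps resident on one multiprocessor,
 * limited only by the register file. Assumes a GT200-class SM with
 * 16384 32-bit registers and 32 threads per warp; real hardware has
 * further limits (max warp count, shared memory) ignored here. */
static unsigned warps_per_sm(unsigned regs_per_thread)
{
    const unsigned register_file = 16384;
    const unsigned threads_per_warp = 32;
    return register_file / (regs_per_thread * threads_per_warp);
}
```

With 196 registers per work-item this gives 2 resident warps, matching the estimate above; a kernel squeezed down to 32 registers would fit 16.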
The spec says this is supposed to take account of the resource requirements of the kernel (table 5.14). It seems that ATI is giving a reasonable answer, if maximal sharing of local memory would be advantageous to the kernel (pointless in this case, as local memory is not actually being used, as far as I can tell). But it seems NVidia is just responding with a nonsense number, ignoring resource consumption. So both are suffering from immaturity there. Honestly, I'm dubious this will ever really be of much use.

Thanks Jawed, a lot of interesting information. I'm asking OpenCL directly for the suggested workgroup size for my kernel ... I guess the default answer from the driver isn't that good.

I think any positive number less than the maximum device size is technically valid, but I dare say common multiples of powers of 2 are safe.

I will add a command line option to override the suggested size so we can do some tests.
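The "common multiples of powers of 2" heuristic can be made concrete with a small check. The hardware width parameter (64 for an ATI wavefront, 32 for an NVidia warp) is passed in rather than queried from the device, so this is only a sketch:

```c
#include <assert.h>
#include <stdbool.h>

/* A candidate work-group size is legal if positive and no larger than
 * the device maximum; it is a "safe" choice if it is also a multiple
 * of the hardware scheduling width (wavefront/warp size). */
static bool wg_size_is_safe(unsigned candidate, unsigned device_max,
                            unsigned hw_width)
{
    return candidate > 0
        && candidate <= device_max
        && candidate % hw_width == 0;
}
```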
Actually, this number is a total guess - I simply took the ATI allocation and multiplied by 4. It could be substantially wrong, e.g. 100 registers - it's all down to how the compiler treats the lifetime of variables and whether it decides to use static spillage to global memory. I don't know if NVidia's tools can provide a register count for an OpenCL kernel. NVidia's GPUs also have varying capacities of register file, which affects the count of registers per work-item for the different cards - the Occupancy Calculator can help there, you just need to match up the CUDA Compute Capability with the models of cards.

I assume 196 is the maximum number of registers used during execution on NVIDIA. Do you have any suggestions on how to reduce this number? For instance, if I reduce the life span of local variables, should that reduce this number?
Registers are always allocated as vec4 in ATI (128-bit). The compiler will try to pack kernel variables into registers as tightly as possible, but there are plenty of foibles there - e.g. smallptGPU might only need 46 registers with perfect packing.

Are NVIDIA registers vec4/float4 like the ATI ones? In that case I could greatly reduce register usage by switching to OpenCL vector types.
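The vec4 packing described above is just ceiling division: N live 32-bit scalars need at least ceil(N/4) 128-bit registers under perfect packing. The scalar counts used below are illustrative, not measured from the real kernel:

```c
#include <assert.h>

/* ATI allocates registers as 128-bit vec4 slots; with perfect packing,
 * N live 32-bit scalars occupy ceil(N / 4) registers. Imperfect
 * packing by the compiler can only push the count upward. */
static unsigned vec4_regs_needed(unsigned live_scalars)
{
    return (live_scalars + 3) / 4;
}
```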
Yeah, I noticed the GPU is not being worked very hard as yet - it seems that various host-side tasks are the major bottleneck. This performance will also vary dramatically depending on the quality of the motherboard chipset, i.e. PCI Express bandwidth looks like it will cause quite variable results.

P.S. SmallLuxGPU is quite a different beast from SmallptGPU. SmallptGPU is a GPU-only application, while SmallLuxGPU is a test of how a large amount of existing code (i.e. Luxrender) can be adapted to take advantage of GPGPU technology.
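A rough sense of why PCI Express bandwidth matters: reading the framebuffer back every pass costs real time. The buffer geometry and the 4 GB/s effective bandwidth below are assumptions chosen purely for illustration:

```c
#include <assert.h>

/* Rough host-transfer cost in milliseconds for reading one
 * framebuffer back over PCI Express. All figures are illustrative:
 * e.g. a 512x512 buffer at 16 bytes/pixel over an effective
 * 4 GB/s link takes about 1 ms per readback, capping the pass
 * rate at roughly 1000/s before the GPU does any work at all. */
static double readback_ms(unsigned width, unsigned height,
                          unsigned bytes_per_pixel, double gbytes_per_s)
{
    double bytes = (double)width * height * bytes_per_pixel;
    return bytes / (gbytes_per_s * 1e9) * 1e3;
}
```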
NVidia tends to advise against vector types (even though Direct3D and OpenGL are the primary APIs), ...
The CUDA architecture is a scalar architecture. Therefore, there is no performance benefit from using vector types and instructions. These should only be used for convenience. It is also in general better to have more work-items than fewer using large vectors.
~16700K samples/s with a 64 workgroup size on my 5870 -- a steady improvement up from 13700K with v1.4.
Chaps with NV hardware need to report, now.
1/2 of a 295:
RUN_SCENE_CORNELL_32SIZE = 2072.2K
RUN_SCENE_CORNELL_64SIZE = 2898.2K
RUN_SCENE_CORNELL_128SIZE = 2898.2K
RUN_SCENE_SIMPLE_64SIZE = 113,564.9K
Ooh, that's a nice bump in ATI performance. Nice to see a big jump in NVidia performance too!
NVidia users might want to create a BAT file with other sizes, e.g. 96, 160, 192, 224, 256, 320 and 384 - just to see where the sweetspot is...
Jawed
RUN_SCENE_CORNELL_96SIZE = 2662.0K