GPU Ray-tracing for OpenCL

Outstanding job again Dave!!

By manually adjusting the workload, I can now get 5173.9K Samples/sec. (A new record on my system.)
First 1/2 of my 295 - 96%
Second 1/2 of my 295 - 97%
280 Dedicated PhysX - 96%
Q6600 - 30% (Exactly the same as the first 2.0Alpha, without the manual GPU workload distribution feature.)

Thanks Talonman, good to see it works fine ;)

I received the parts for my new PC yesterday (wow, the 5870 is really HUGE :oops:). Once I have installed everything (i.e. 2x OS, all the tools, etc.) I'm thinking of buying a cheap NVIDIA card for my old PC. This should finally allow me to squeeze some better performance out of NVIDIA hardware.

I was thinking of buying a GTS 250 for my old PC as a cheap test platform. Any better ideas? I don't know the NVIDIA line of cards very well.
 
For low cost, I would go for that, or a GTX 260 Core 216.
http://hothardware.com/Articles/NVIDIA-GeForce-GTS-250-Mainstream-GPU/?page=5

For a while, I was trying to figure out how I was getting 114,000K Samples/sec running the Simple Scene, when only running on 1 GPU...
But then I remembered the res was smaller then, and the light came on.
[attached screenshot: highc.jpg]
 
Nice work on the manual tuning!

I was thinking of buying a GTS 250 for my old PC as a cheap test platform. Any better ideas? I don't know the NVIDIA line of cards very well.
8800GT, 9800GT, 9800GTX, GTS250 - pick the cheapest you can find, they're basically slightly different speed grades of the "same chip" (two different revisions of the same chip).

http://en.wikipedia.org/wiki/GeForce_8_Series
http://en.wikipedia.org/wiki/GeForce_9_Series
http://en.wikipedia.org/wiki/GeForce_200_Series

Those pages seem accurate on first glance.

GT240 is also an option, though slower.

Technically GT220 or 9600GT would work, too.

GT220/GT240 should have the better memory controllers of compute 1.2 devices (as opposed to compute 1.1 for the other cards). This behaviour might make a better match for the way GTX285 cards work and is slightly closer to the way the near-future generation of cards work. Though that's probably over-complicating things :???:

Jawed
 
After this day: http://www.evga.com/FORUMS/tm.aspx?m=117623

You know the prices will plunge, but if you need a GPU now, a man's gotta do what a man's gotta do. :)

Have fun with that 5870 too. She is a runner.

Kind of odd: that EVGA link points to Fudzilla parroting the suggestion that a dual-Fermi product will launch 1-2 months after the GTX285/260 replacements, but the article right after it talks about Fermi being excessively hot. The idea of not one but two very hot GPUs on a single PCB sounds like a cooling nightmare. Are they going to come with H2O/phase cooling or maybe TEC/Peltier? lol. Rumblings already say a very limited (5970-style) March launch, with volume numbers not showing up until May.
 
Getting the heat from the die into the air is not the problem, heatpipes are ridiculously efficient and even with only a dual slot solution there is enough surface area available for fins. Getting the hot air out of the case is the problem (because most people's cases suck, and the room on the backplate to exhaust air is limited ... also if it has to go out the back, the air makes inefficient use of the fins).
 
I imagine 3 Billion transistors per chip may throw some heat... ;)

If I go for the Dual Fermi, I will opt to water cool it.
 
If it produces that much heat with that many transistors, I'm shuddering to think what actual power consumption numbers will be for it.

Regards,
SB
 
Getting the heat from the die into the air is not the problem, heatpipes are ridiculously efficient and even with only a dual slot solution there is enough surface area available for fins. Getting the hot air out of the case is the problem (because most people's cases suck, and the room on the backplate to exhaust air is limited ... also if it has to go out the back, the air makes inefficient use of the fins).

Hmm, makes one wonder about the (in)validity of the supposed "certified" cases for Fermi-based SLI, and whether any such X2 product would require the use of such a case.
 
That thread misinterprets how the ICD model works. The whole point is to avoid having to compile against one vendor's implementation or another. The actual part that matters, device binary generation, is done at runtime. NVIDIA's current ICD is a little out of date, which is why the AMD and NVIDIA ICDs don't play nicely together, but a single binary compiled against one should run on the other. All of the API calls are standard.
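To make the runtime-compilation point concrete, here is a minimal host-side sketch in C against the standard OpenCL 1.0 API (not SmallptGPU's actual code; the trivial kernel is made up for illustration). The device binary only comes into existence at clBuildProgram time, for whichever driver/ICD is installed:

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    /* Trivial kernel source; SmallptGPU ships its real kernel as a .cl
       file that is read and compiled the same way at startup. */
    const char *src =
        "__kernel void scale(__global float *v) {"
        "    size_t i = get_global_id(0);"
        "    v[i] *= 2.0f;"
        "}";

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);

    /* Device binary generation happens here, at run time, for whatever
       ICD/driver is installed - not when the host executable was linked. */
    err = clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    printf("clBuildProgram returned %d\n", (int)err);

    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}
```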
 
To my understanding, there shouldn't be a difference (at least on Windows) between using AMD's or NVIDIA's SDK. Basically, cl.h is almost the same (they differ in only one line, which is a comment). The opencl.lib files are different, but they both link to OpenCL.dll, with exactly the same functions and calling conventions. So there shouldn't be any benefit from recompiling with a different SDK.
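And since every vendor's runtime registers itself behind that one OpenCL.dll, a single executable can enumerate both at once. A small sketch (the platform name strings in the comment are just examples of what the drivers might report):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_uint n = 0;
    clGetPlatformIDs(0, NULL, &n);          /* how many ICDs are installed? */

    cl_platform_id platforms[8];
    if (n > 8) n = 8;
    clGetPlatformIDs(n, platforms, NULL);

    for (cl_uint i = 0; i < n; ++i) {
        char name[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);
        /* e.g. "NVIDIA CUDA" and "ATI Stream" on a dual-vendor box */
        printf("Platform %u: %s\n", i, name);
    }
    return 0;
}
```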
 
Still would be a fun test, just to see...

Would there be much work on Dave's end?

I don't want to ask him for a major re-write.
 
Dave doesn't need to do anything; anyone with the SDK and a card should be able to do this. Obviously, knowing one's way around compiler tools helps...

Jawed
 
Still would be a fun test, just to see...

Would there be much work on Dave's end?

I don't want to ask him for a major re-write.
Not getting any change in performance compiling v1.6 with the NVIDIA 3.0 beta SDK. Trying the -cl-mad-enable flag, I did get a boost from 262K to 289K samples/sec on an 8800M GTS, though still with the same CPU load of 50%. :???:
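For anyone else who wants to try the same thing: -cl-mad-enable is just a build-option string handed to clBuildProgram, so it can be tested without touching the kernel source. A rough sketch (the program and device handles are assumed to come from the usual setup code; this is not the actual SmallptGPU source):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Rebuild the program with extra optimisation flags; print the compiler
   log on failure so a bad flag is easy to spot. */
static cl_int build_with_mad(cl_program prog, cl_device_id device)
{
    const char *options = "-cl-mad-enable -cl-fast-relaxed-math";
    cl_int err = clBuildProgram(prog, 1, &device, options, NULL, NULL);
    if (err != CL_SUCCESS) {
        char log[8192];
        clGetProgramBuildInfo(prog, device, CL_PROGRAM_BUILD_LOG,
                              sizeof(log), log, NULL);
        fprintf(stderr, "OpenCL build log:\n%s\n", log);
    }
    return err;
}
```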
 
Dave, I was doing more testing running the latest version, after getting all 3 GPUs to 33% utilization...


Size 8 -> 2721k
Size 16 -> 2967k
Size 32 -> 3054k
Size 64 -> 4915k
Size 96 -> 3602k
Size 128 -> 4451k
Size 160 -> 4915k
Size 192 -> 5041k
Size 224 -> 3978k
Size 256 -> 4321k
Size 320 -> 4802k
Size 384 -> 5173k

Size 448 -> Would not run
Size 512 -> Would not run
Size 576 -> Would not run

smallptGPU -> 5173k

I noticed that when I run the standard smallptGPU file, it says 'Suggested work group size: 384'.

That is also the work group size that I get my best performance at...

I tried to increase the work group size further, but the program would not run.

Is it a known fact that Nvidia can't allocate a larger work group size than 384? Just wondering... :)
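For what it's worth, that 384 figure can be queried directly; the "Suggested work group size" line is presumably just the per-kernel limit the runtime reports. A small sketch, assuming a device and an already-built kernel are in hand:

```c
#include <stdio.h>
#include <CL/cl.h>

/* Print the device-wide maximum work-group size and the per-kernel limit
   the runtime reports for this particular compiled kernel. */
static void print_workgroup_limits(cl_device_id device, cl_kernel kernel)
{
    size_t dev_max = 0, krn_max = 0;

    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(dev_max), &dev_max, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(krn_max), &krn_max, NULL);

    printf("Device max work-group size : %zu\n", dev_max);
    printf("Kernel max work-group size : %zu\n", krn_max);
}
```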
 
1 small typo...

Lightman ----------- 5870 ------ Sample/sec -- 19,088.2K v1.6 (GPU=1056, M=1247)

Lightman is running a 5870, not a 5850. ;)

Oh, thanks, fixed.

Size 8 -> 2721k
Size 16 -> 2967k
Size 32 -> 3054k
Size 64 -> 4915k
Size 96 -> 3602k
Size 128 -> 4451k
Size 160 -> 4915k
Size 192 -> 5041k
Size 224 -> 3978k
Size 256 -> 4321k
Size 320 -> 4802k
Size 384 -> 5173k

Did you notice the recurring pattern? The best performance comes at multiples of 64. I think 64 is the maximum number of threads that can run on one of the NVIDIA SIMT processors. Jawed can probably answer this question.

Keep in mind that larger isn't always better: the optimal workgroup size is influenced by a lot of factors (hardware, size of the kernel, register usage, etc.). At the moment, choosing the best size looks a bit like black magic. The best practice is probably to do some field tests and look for the best size.

P.S. Thanks for the NVIDIA 240 hint; it looks like a very cheap and good candidate for a testing platform. My main concern about the 250 was how old the architecture is.
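To make the "field test" advice concrete, a rough auto-tuning loop could look like the sketch below. This is not SmallptGPU code; it assumes the kernel arguments are already set, the command queue was created with CL_QUEUE_PROFILING_ENABLE, and the global size is a multiple of every candidate local size:

```c
#include <stdio.h>
#include <CL/cl.h>

/* Time the kernel at a few candidate work-group sizes (multiples of 64)
   and return the fastest one. */
static size_t pick_workgroup_size(cl_command_queue queue, cl_kernel kernel,
                                  size_t global)
{
    const size_t candidates[] = { 64, 128, 192, 256, 320, 384 };
    size_t best = 64;
    cl_ulong best_ns = (cl_ulong)-1;

    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); ++i) {
        size_t local = candidates[i];
        cl_event ev;

        if (clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                                   0, NULL, &ev) != CL_SUCCESS)
            continue;                 /* size too large for this kernel */
        clWaitForEvents(1, &ev);

        cl_ulong start = 0, end = 0;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        clReleaseEvent(ev);

        printf("local=%zu -> %llu us\n", local,
               (unsigned long long)((end - start) / 1000));

        if (end - start < best_ns) {
            best_ns = end - start;
            best = local;
        }
    }
    return best;
}
```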
 
Thanks Dave... (Yes, there is a pattern) ;)

I also asked these 2 questions in the NVIDIA OpenCL Developers area:

Question one: Is it a known fact that Nvidia can't allocate a larger work group size than 384? Just wondering...
If so, what is the limiting factor? GPU memory?

Answer posted by avidday:
"Workgroup size (the equivalent of blocksize in CUDA) is limited by the resources the OpenCL code uses. It will be different for every piece of code. The basic multiprocessor unit in NVIDIA GPUs has limits on workgroup size (512 is the current limit per workgroup, 768 or 1024 total per MP depending on hardware version), registers (128 per thread and 8192 or 16384 total per MP), and shared memory (16KB per MP). How much of each of those things the kernel uses dictates the maximum workgroup size. The only way to increase it is to make the code use fewer resources. Sometimes it helps performance, sometimes it doesn't."

Second question: If we could further increase the Work Group size past 384, do you think we might see some additional performance?
"That is totally hypothetical, and depends on the code for the reasons outlined above. It should improve up to a maxima as the workgroup size is increased, and then stay stable or even reduce after that. Whether this code has reached that point is a question I can't answer."
 