Old 27-Dec-2009, 19:25   #1
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,832
GPU Ray-tracing for OpenCL

Grab it here!

The result is in Ksamples/sec.
Here's what my precious one is capable of:


Code:
OpenCL Platform 0: Advanced Micro Devices, Inc.
OpenCL Device 0: Type = TYPE_GPU
OpenCL Device 0: Name = Cypress
OpenCL Device 0: Compute units = 20
OpenCL Device 0: Max. work group size = 256
Reading file 'rendering_kernel.cl' (size 2997 bytes)
OpenCL Device 0: kernel work group size = 256
Radeon HD 5870 @ 900/5000MHz
Q9450 @ 3608MHz, 0% load! (NV users may get noticeable CPU load for some reason)
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
Old 27-Dec-2009, 22:20   #2
Anteru
Member
 
Join Date: Jul 2004
Posts: 110

Very nice! With my 8800 GT, I'm getting 430K samples/second. Moreover, the driver locks up after roughly 50 passes. This is under Vista x64.

Code:
OpenCL Platform 0: NVIDIA Corporation
OpenCL Device 0: Type = TYPE_GPU
OpenCL Device 0: Name = GeForce 8800 GT
OpenCL Device 0: Compute units = 14
OpenCL Device 0: Max. work group size = 512
Reading file 'rendering_kernel.cl' (size 2997 bytes)
OpenCL Device 0: kernel work group size = 192
__________________
None ... really, none :)
Old 28-Dec-2009, 01:12   #3
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,809

~1250 samples/sec on a GTX 285. Maxes out one CPU core. David posts here, maybe he can give us some insight into his approach.
__________________
What the deuce!?
Old 28-Dec-2009, 04:06   #4
Forrest
Junior Member
 
Join Date: Jul 2008
Posts: 36

I get 6560K samples/sec on 5770 at default clocks.
Old 28-Dec-2009, 05:19   #5
gamervivek
Member
 
Join Date: Sep 2008
Location: india
Posts: 121

4500K samples/sec on a 4850; it slows the GUI down to a crawl.
Old 28-Dec-2009, 12:49   #6
Davros
Darlek ******
 
Join Date: Jun 2004
Posts: 9,663

OpenCL Device 0: Compute units = 14

Where does that number come from?
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™
Old 28-Dec-2009, 13:12   #7
Arnold Beckenbauer
Senior Member
 
Join Date: Oct 2006
Location: Germany
Posts: 1,003

Quote:
Originally Posted by Davros View Post
OpenCL Device 0: Compute units = 14

Where does that number come from?
2*7 = 14.
8800GT has 14 8-way-SIMDs (2 per cluster).
__________________
Hail Brothers and Sisters! Coranon Silaria, Ozoo Mahoke
Eta Kooram Nah Smech!

Find Chuck Norris.
Old 29-Dec-2009, 16:50   #8
Lightman
Member
 
Join Date: Jun 2008
Location: Torquay, UK
Posts: 944

Nice!



There were some spikes up to 14480K samples/s every 4-6 seconds, so not too bad I think!

Win7 x64 and I think still SDK Beta on this one.
Old 29-Dec-2009, 20:32   #9
Davros
Darlek ******
 
Join Date: Jun 2004
Posts: 9,663

Quote:
Originally Posted by Arnold Beckenbauer View Post
2*7 = 14.
8800GT has 14 8-way-SIMDs (2 per cluster).
Ahh so the 8800gt has 112 shaders so it takes 14 of them to make a compute unit, is that right?
and the hd 5870 has 1600 shaders, but it takes 80 of them to make a compute unit ?
Old 29-Dec-2009, 20:50   #10
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,832

Quote:
Originally Posted by Davros View Post
and the hd 5870 has 1600 shaders, but it takes 80 of them to make a compute unit ?
More like 16, for a fair comparison, putting aside the VLIW nature of the arch.
Old 30-Dec-2009, 00:16   #11
Davros
Darlek ******
 
Join Date: Jun 2004
Posts: 9,663

But then, if it only takes 16 to make a compute unit,
why does the app report only 20 compute units and not 100?
1600 shaders = 20 units = 80 shaders per unit, yes?
Old 30-Dec-2009, 08:51   #12
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,832

It's an ALU, not "shader", stream processor or any other marketing label with narrow meaning.

80 ALUs per multiprocessor or 16-wide SIMD unit -- pick one.
Old 30-Dec-2009, 11:37   #13
Arnold Beckenbauer
Senior Member
 
Join Date: Oct 2006
Location: Germany
Posts: 1,003

Quote:
Originally Posted by Davros View Post
Ahh so the 8800gt has 112 shaders so it takes 14 of them to make a compute unit, is that right?
and the hd 5870 has 1600 shaders, but it takes 80 of them to make a compute unit ?
8800GT has (8*2)*7= 112 SPs.

My HD4850 has 10 Compute Units (10 16-way-SIMDs). The HD5870 has 20 Compute Units.
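
The arithmetic in the posts above can be checked with a short Python sketch. The compute-unit and SIMD-lane counts come from this thread; the factor of 5 ALUs per lane on the ATI parts is an assumption that matches the "80 ALUs per multiprocessor or 16-wide SIMD unit" figure quoted earlier, and all names here are illustrative only.

```python
# ALUs ("shaders") per card = compute units x SIMD lanes x ALUs per lane.
# Compute units and lane counts are taken from the thread; the value of
# 5 ALUs per lane for the VLIW ATI parts is an assumption (80 / 16 = 5).
cards = {
    "8800 GT": {"compute_units": 14, "simd_lanes": 8, "alus_per_lane": 1},
    "HD 4850": {"compute_units": 10, "simd_lanes": 16, "alus_per_lane": 5},
    "HD 5870": {"compute_units": 20, "simd_lanes": 16, "alus_per_lane": 5},
}


def alus_per_compute_unit(card):
    """ALUs inside one compute unit as reported by OpenCL."""
    return card["simd_lanes"] * card["alus_per_lane"]


def total_alus(card):
    """Total ALU count for the whole card."""
    return card["compute_units"] * alus_per_compute_unit(card)


for name, card in cards.items():
    print(f"{name}: {alus_per_compute_unit(card)} ALUs per compute unit, "
          f"{total_alus(card)} in total")
```

So the 8800 GT's 112 ALUs split into 14 units of 8, while the HD 5870's 1600 split into 20 units of 80, which is why the app reports 20 compute units rather than 100.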
Old 01-Jan-2010, 08:07   #14
stevem
Member
 
Join Date: Feb 2002
Posts: 632

I'm getting a glitchy ~600 samples/sec with a GTX 280, latest drivers etc... Hmmm...
Old 01-Jan-2010, 16:09   #15
Florin
Merrily dodgy
 
Join Date: Aug 2003
Location: The colonies
Posts: 1,403



It seems to max out one CPU core
__________________
"A man generally has two reasons for doing a thing. One that sounds good, and a real one." - J.P. Morgan
Old 02-Jan-2010, 00:48   #16
Dade
Member
 
Join Date: Dec 2009
Posts: 172

Hi, I'm David, the author of SmallptGPU. I think I can clarify a few points:

- About the poor performance on NVIDIA: I developed SmallptGPU on an ATI HD4870. Both the ATI and NVIDIA OpenCL drivers are at an early stage of development and have their fair share of problems, bugs, etc. I have avoided the problematic paths on ATI because that is my card, while I have never tested SmallptGPU on NVIDIA. I assume I'm doing something that the NVIDIA OpenCL driver doesn't like at all. The high CPU usage is a good hint of this problem.

- The sources are available on the web site, so if anyone has a fix for NVIDIA cards, I will be happy to apply it.

- SmallptGPU uses the first GPU device available (there is a command line option to run on a CPU device). Nearly all the load should be on the GPU; the CPU is nearly unused. It is not able to use multiple devices at the same time, so any SLI/CrossFire configuration will be used at only 50% of its capabilities.

- The 5870 is horribly fast ... I'm trying not to buy one.
Old 02-Jan-2010, 08:43   #17
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,832

Can you describe the command line switch syntax?
Old 02-Jan-2010, 10:25   #18
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,867

You have to provide a complete argument list for the command line, e.g.:

smallptgpu 0 1 rendering_kernel.cl 640 480 scenes\cornell_large.scn

You can create new arrangements of spheres:

Code:
camera 20 100 300 0 25 0
size 7
sphere 1000 0 -1000 0 0 0 0 0.75 0.75 0.75 0
sphere 10 35 15 0 0 0 0 0.9 0 0 2
sphere 15 -35 20 0 0 0 0 0 0.9 0 2
sphere 20 0 25 -35 0 0 0 0 0 0.9 2
sphere 4 35 15 0 15 15 15 0 0 0 0
sphere 8 -35 20 0 15 15 15 0 0 0 0
sphere 8 0 25 -35 100 100 100 0 0 0 0
e.g. saved as file caustic7.scn. That has 3 light sources (the final 3 spheres), one inside each "caustic" sphere. The one inside the blue sphere is super-bright.
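
For anyone wanting to generate or check such files, here is a minimal Python sketch that parses the scene above. The field layout (camera origin and target, then per sphere: radius, position, emission, colour, material id) is an assumption inferred from the example and the description of the light sources, not taken from the SmallptGPU sources, and `parse_scene` is a hypothetical helper.

```python
# Assumed field layout, inferred from the example scene in this post:
#   camera <origin x y z> <target x y z>
#   size   <number of spheres>
#   sphere <radius> <position x y z> <emission r g b> <colour r g b> <material id>
SCENE = """\
camera 20 100 300 0 25 0
size 7
sphere 1000 0 -1000 0 0 0 0 0.75 0.75 0.75 0
sphere 10 35 15 0 0 0 0 0.9 0 0 2
sphere 15 -35 20 0 0 0 0 0 0.9 0 2
sphere 20 0 25 -35 0 0 0 0 0 0.9 2
sphere 4 35 15 0 15 15 15 0 0 0 0
sphere 8 -35 20 0 15 15 15 0 0 0 0
sphere 8 0 25 -35 100 100 100 0 0 0 0
"""


def parse_scene(text):
    """Parse a .scn-style text into a camera dict and a list of sphere dicts."""
    camera, declared, spheres = None, None, []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        kind, values = fields[0], [float(v) for v in fields[1:]]
        if kind == "camera":
            camera = {"origin": values[0:3], "target": values[3:6]}
        elif kind == "size":
            declared = int(values[0])
        elif kind == "sphere":
            spheres.append({
                "radius": values[0],
                "position": values[1:4],
                "emission": values[4:7],
                "colour": values[7:10],
                "material": int(values[10]),
            })
    if declared != len(spheres):
        raise ValueError(f"size says {declared}, found {len(spheres)} spheres")
    return camera, spheres


camera, spheres = parse_scene(SCENE)
# A sphere counts as a light source if any emission component is non-zero.
lights = [s for s in spheres if any(s["emission"])]
brightest = max(lights, key=lambda s: max(s["emission"]))
print(len(spheres), "spheres,", len(lights), "light sources")
print("brightest light at", brightest["position"])
```

Under that reading, each light shares its centre with one of the coloured spheres, and the brightest one sits at the centre of the blue sphere.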

Jawed
__________________
Can it play WoW?

Last edited by Jawed; 02-Jan-2010 at 14:22. Reason: kernel name was wrong - though I don't think it makes a difference + dimensions didn't match defaults
Old 02-Jan-2010, 14:47   #19
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,867

I suggest changing the work group size calculation so that it does not use the "maximum". 64 on ATI would be better than 256. 64 is the minimum size on ATI HD5870 or HD4870. Some ATI GPUs (HD43xx HD45xx HD46xx) will work best with a lower size (32 or 16).

NVidia should be happy with 32 or 64.

Might be an idea to expose the workgroup size as a command line parameter. Or, have the program try a few different values. Or, just hard code it to 64.
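
A rough Python sketch of what that selection logic could look like follows. It is not SmallptGPU's code: the helper names and the default of 64 are hypothetical, and the only OpenCL rule it relies on is that, in OpenCL 1.0, the global work size must be a multiple of the work-group size.

```python
def pick_work_group_size(kernel_max, requested=0, default=64):
    """Pick a work-group size: the requested value if given, else the default.

    kernel_max is the limit the driver reports for the kernel (256 and 192
    in the logs earlier in this thread). requested=0 means "use the default".
    """
    size = requested if requested > 0 else default
    if size & (size - 1):
        raise ValueError("work-group size should be a power of two")
    # Never exceed what the driver says the kernel can run with.
    return min(size, kernel_max)


def padded_global_size(work_items, group_size):
    """Round the number of work-items up to a multiple of the group size."""
    return ((work_items + group_size - 1) // group_size) * group_size


pixels = 640 * 480
for kernel_max in (256, 192):
    group = pick_work_group_size(kernel_max)
    print(kernel_max, "->", group, padded_global_size(pixels, group))
```

With a default of 64, both the Cypress log (kernel limit 256) and the 8800 GT log (kernel limit 192) end up at 64, and a 640x480 image needs no padding.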

The register usage appears to be 49 vec4 registers. This is reasonable on ATI, resulting in 5 hardware threads (wavefronts). On NVidia it is a disaster (the equivalent of 196 registers if fiddling with the occupancy calculator), meaning that only 2 hardware threads (warps) can occupy each multiprocessor. As it happens 64 is better than 32 for the workgroup size in this scenario.
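
Those numbers can be roughly reproduced in Python. The 49 vec4 registers come from this post; the hardware budgets (256 vec4 registers per work-item lane on ATI, 16384 scalar registers per multiprocessor and 32-wide warps on a GT200-class NVIDIA part) are assumptions, and the real occupancy calculator also applies allocation-granularity rules that this sketch ignores.

```python
VEC4_REGS_PER_WORK_ITEM = 49                 # figure from the post
scalar_regs = VEC4_REGS_PER_WORK_ITEM * 4    # 4 scalars per vec4 register

# Assumption: an ATI SIMD can hold 256 vec4 registers per work-item lane,
# shared between all wavefronts resident on that SIMD.
ATI_VEC4_BUDGET_PER_LANE = 256
wavefronts = ATI_VEC4_BUDGET_PER_LANE // VEC4_REGS_PER_WORK_ITEM

# Assumption: a GT200-class multiprocessor has 16384 scalar registers
# and runs warps of 32 work-items.
NV_REGS_PER_MULTIPROCESSOR = 16384
WARP_SIZE = 32
warps = NV_REGS_PER_MULTIPROCESSOR // (scalar_regs * WARP_SIZE)

print("scalar registers per work-item:", scalar_regs)
print("ATI wavefronts per SIMD:", wavefronts)
print("NVIDIA warps per multiprocessor:", warps,
      "=", warps * WARP_SIZE, "work-items")
```

Under those assumptions, two warps of 32 give 64 resident work-items per multiprocessor.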

I'm not sure how NVidia handles the situation when 512 work-items are requested but the hardware can only issue 64 - I'm not sure if the hardware is spilling registers in this situation, if it is, then that compounds the disaster. Hopefully making this change will improve things dramatically.

Jawed
Old 02-Jan-2010, 15:45   #20
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,832

Well, let's hope Mr. David takes note of this.

But there's still the weird issue with the CPU being under load on NV hardware.
Old 02-Jan-2010, 15:59   #21
Talonman
Junior Member
 
Join Date: Jan 2010
Posts: 64

Indeed...

http://www.evga.com/FORUMS/tm.aspx?m=91863

Just joined!
Old 02-Jan-2010, 16:49   #22
Florin
Merrily dodgy
 
Join Date: Aug 2003
Location: The colonies
Posts: 1,403

Quote:
Originally Posted by Talonman View Post
Hi,
It doesn't look like the app (or rather, Nvidia's OpenCL implementation) actually uses 4 CPU cores, rather that it doesn't have its affinity tied and gets bounced around to different cores. It still only amounts to 100% use of one core, so it might be some single threaded operation which may also be holding the GPU performance back. Wouldn't surprise me if the GPU score increases if you overclock the CPU here.
Old 02-Jan-2010, 17:18   #23
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,832

This one is fresh: OCL accelerated LuxRenderer

Quote:
The idea is to use the GPGPU only for ray intersections in order to minimize the amount of brand new code to write and not to lose any of the functionality already available in Luxrender. In order to test this idea, I wrote a very simplified path tracer and ported Luxrender's BVH accelerator to OpenCL.
Old 02-Jan-2010, 17:28   #24
Talonman
Junior Member
 
Join Date: Jan 2010
Posts: 64

Considering this is my flat line shot, with nothing running...



And this is with the app running:


That all 4 cores are in use...

I agree affinity could be way better.
The thing is, I don't think we are supposed to be using near that much ideally.
Especially considering Nvidia doesn't support OpenCL on the CPU, that would have to be 100% overhead.
Old 02-Jan-2010, 17:46   #25
Talonman
Junior Member
 
Join Date: Jan 2010
Posts: 64

Quote:
Originally Posted by fellix View Post
This one is fresh: OCL accelerated LuxRenderer
Thanks!

Very light GPU use, and heavy CPU use...