#1 |
|
Senior Member
|
Grab it here!
The result is in Ksamples/sec. Here's what my precious one is capable of:
Code:
OpenCL Platform 0: Advanced Micro Devices, Inc.
OpenCL Device 0: Type = TYPE_GPU
OpenCL Device 0: Name = Cypress
OpenCL Device 0: Compute units = 20
OpenCL Device 0: Max. work group size = 256
Reading file 'rendering_kernel.cl' (size 2997 bytes)
OpenCL Device 0: kernel work group size = 256
Q9450 @ 3608MHz, 0% load! (NV users may get noticeable CPU load for some reason)
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
#2 |
|
Member
Join Date: Jul 2004
Posts: 110
|
Very nice! With my 8800 GT, I'm getting 430k samples/second
Code:
OpenCL Platform 0: NVIDIA Corporation
OpenCL Device 0: Type = TYPE_GPU
OpenCL Device 0: Name = GeForce 8800 GT
OpenCL Device 0: Compute units = 14
OpenCL Device 0: Max. work group size = 512
Reading file 'rendering_kernel.cl' (size 2997 bytes)
OpenCL Device 0: kernel work group size = 192
__________________
None ... really, none :) |
|
|
|
|
|
#3 |
|
Meh
Join Date: Mar 2004
Location: New York
Posts: 9,809
|
~1250 samples/sec on a GTX 285. Maxes out one CPU core. David posts here, maybe he can give us some insight into his approach.
__________________
What the deuce!? |
|
|
|
|
|
#4 |
|
Junior Member
Join Date: Jul 2008
Posts: 36
|
I get 6560K samples/sec on 5770 at default clocks.
|
|
|
|
|
|
#5 |
|
Member
Join Date: Sep 2008
Location: india
Posts: 121
|
4500 K samples/sec on a 4850, slows down the GUI to a crawl
|
|
|
|
|
|
#6 |
|
Darlek ******
Join Date: Jun 2004
Posts: 9,663
|
OpenCL Device 0: Compute units = 14
Where does that number come from?
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™ |
|
|
|
|
|
#7 | |
|
Senior Member
Join Date: Oct 2006
Location: Germany
Posts: 1,003
|
Quote:
8800GT has 14 8-way-SIMDs (2 per cluster).
__________________
Hail Brothers and Sisters! Coranon Silaria, Ozoo Mahoke Eta Kooram Nah Smech! Find Chuck Norris. |
|
|
|
|
|
|
#8 |
|
Member
Join Date: Jun 2008
Location: Torquay, UK
Posts: 944
|
Nice!
There were some spikes up to 14480K samples/s every 4-6 seconds, so not too bad I think! Win7 x64 and I think still SDK Beta on this one. |
|
|
|
|
|
#9 |
|
Darlek ******
Join Date: Jun 2004
Posts: 9,663
|
Ahh, so the 8800 GT has 112 shaders, so it takes 14 of them to make a compute unit, is that right?
And the HD 5870 has 1600 shaders, but it takes 80 of them to make a compute unit?
|
|
|
|
|
#10 |
|
Senior Member
|
More like 16, for a fair comparison, putting aside the VLIW nature of the arch.
|
|
|
|
|
#11 |
|
Darlek ******
Join Date: Jun 2004
Posts: 9,663
|
But then, if it only took 16 to make a compute unit, why does the app report only 20 compute units and not 100?
1600 shaders / 20 units = 80 shaders per unit, yes?
|
|
|
|
|
#12 |
|
Senior Member
|
It's an ALU, not "shader", stream processor or any other marketing label with narrow meaning.
80 ALUs per multiprocessor or 16-wide SIMD unit -- pick one.
|
|
|
|
|
#13 | |
|
Senior Member
Join Date: Oct 2006
Location: Germany
Posts: 1,003
|
Quote:
My HD4850 has 10 Compute Units (10 16-way-SIMDs). The HD5870 has 20 Compute Units.
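The arithmetic behind the compute-unit counts in this discussion can be sketched as follows. This is only an illustration: the compute-unit counts are the ones reported in the thread, while the ALU totals (112, 800, 1600) and the 5-wide VLIW width on the ATI parts are assumed from the vendors' published specifications. The `GPUS` table and the `simd_layout` helper are made up for the example.

```python
# Compute-unit counts come from the thread; ALU totals and VLIW widths
# are assumptions taken from published specifications.
GPUS = {
    "GeForce 8800 GT": {"alus": 112, "compute_units": 14, "vliw_width": 1},
    "Radeon HD 4850": {"alus": 800, "compute_units": 10, "vliw_width": 5},
    "Radeon HD 5870": {"alus": 1600, "compute_units": 20, "vliw_width": 5},
}


def simd_layout(alus, compute_units, vliw_width):
    """Return (ALUs per compute unit, SIMD lanes per compute unit)."""
    alus_per_cu = alus // compute_units
    # Each SIMD lane is one scalar ALU on NVIDIA, or one VLIW bundle on ATI.
    lanes_per_cu = alus_per_cu // vliw_width
    return alus_per_cu, lanes_per_cu


for name, spec in GPUS.items():
    per_cu, lanes = simd_layout(**spec)
    print(f"{name}: {per_cu} ALUs per compute unit, {lanes}-wide SIMD")
```

Under these assumptions, "80 ALUs per multiprocessor" and "16-wide SIMD unit" describe the same hardware block, counted with and without the VLIW width.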
|
|
|
|
|
|
#14 |
|
Member
Join Date: Feb 2002
Posts: 632
|
I'm getting a glitchy ~600 samples/sec with a GTX 280, latest drivers etc... Hmmm...
|
|
|
|
|
|
#15 |
|
Merrily dodgy
Join Date: Aug 2003
Location: The colonies
Posts: 1,403
|
It seems to max out one CPU core
__________________
"A man generally has two reasons for doing a thing. One that sounds good, and a real one." - J.P. Morgan |
|
|
|
|
|
#16 |
|
Member
Join Date: Dec 2009
Posts: 172
|
Hi, I'm David, the author of SmallptGPU. I think I can clarify a few points:
- About the poor performance on NVIDIA: I developed SmallptGPU on an ATI HD4870. Both the ATI and NVIDIA OpenCL drivers are at an early stage of development and have their fair share of problems and bugs. I have avoided the problematic paths on ATI because that is my card, while I have never tested SmallptGPU on NVIDIA. I assume I'm doing something that the NVIDIA OpenCL driver doesn't like at all. The high CPU usage is a good hint of this problem.
- The sources are available on the web site, so if anyone has a fix for NVIDIA cards, I will be happy to apply it.
- SmallptGPU uses the first GPU device available (there is a command line option to run on a CPU device). Nearly all the load should be on the GPU; the CPU is nearly unused. It is not able to use multiple devices at the same time, so any SLI/CrossFire configuration will be used at only 50% of its capabilities.
- The 5870 is horribly fast ... I'm trying not to buy one |
|
|
|
|
|
#17 |
|
Senior Member
|
Can you describe the command line switch syntax?
|
|
|
|
|
#18 |
|
Regular
|
You have to provide a complete argument list for the command line, e.g.:
smallptgpu 0 1 rendering_kernel.cl 640 480 scenes\cornell_large.scn
You can create new arrangements of spheres:
Code:
camera 20 100 300 0 25 0
size 7
sphere 1000 0 -1000 0 0 0 0 0.75 0.75 0.75 0
sphere 10 35 15 0 0 0 0 0.9 0 0 2
sphere 15 -35 20 0 0 0 0 0 0.9 0 2
sphere 20 0 25 -35 0 0 0 0 0 0.9 2
sphere 4 35 15 0 15 15 15 0 0 0 0
sphere 8 -35 20 0 15 15 15 0 0 0 0
sphere 8 0 25 -35 100 100 100 0 0 0 0
Jawed
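For anyone editing scene files by hand, here is a rough sketch of a reader for the text above. The field meanings are assumptions inferred from the numbers, not taken from the SmallptGPU sources: `camera` is read as an eye position followed by a target, `size` as the sphere count, and each `sphere` as radius, position, emission, colour and a material code. The `Sphere` class and `parse_scene` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Sphere:
    radius: float
    position: tuple
    emission: tuple
    color: tuple
    material: int  # assumed to be a material code; its meaning is not given here


def parse_scene(text):
    """Parse a whitespace-separated scene description into (camera, spheres)."""
    tokens = text.split()
    pos = 0
    camera, count, spheres = None, None, []
    while pos < len(tokens):
        key = tokens[pos]
        if key == "camera":
            v = [float(t) for t in tokens[pos + 1:pos + 7]]
            camera = (tuple(v[:3]), tuple(v[3:]))  # assumed: (eye, target)
            pos += 7
        elif key == "size":
            count = int(tokens[pos + 1])
            pos += 2
        elif key == "sphere":
            v = tokens[pos + 1:pos + 12]
            if len(v) != 11:
                raise ValueError("a sphere needs 11 numbers")
            nums = [float(t) for t in v[:10]]
            spheres.append(Sphere(nums[0], tuple(nums[1:4]), tuple(nums[4:7]),
                                  tuple(nums[7:10]), int(v[10])))
            pos += 12
        else:
            raise ValueError(f"unexpected token {key!r}")
    if count is not None and count != len(spheres):
        raise ValueError(f"size says {count}, found {len(spheres)} spheres")
    return camera, spheres


SCENE = """
camera 20 100 300 0 25 0
size 7
sphere 1000 0 -1000 0 0 0 0 0.75 0.75 0.75 0
sphere 10 35 15 0 0 0 0 0.9 0 0 2
sphere 15 -35 20 0 0 0 0 0 0.9 0 2
sphere 20 0 25 -35 0 0 0 0 0 0.9 2
sphere 4 35 15 0 15 15 15 0 0 0 0
sphere 8 -35 20 0 15 15 15 0 0 0 0
sphere 8 0 25 -35 100 100 100 0 0 0 0
"""

camera, spheres = parse_scene(SCENE)
print(camera, len(spheres))
```

Parsing by token rather than by line means the reader does not care how the file is wrapped, and the `size` check catches a sphere that was added without updating the count.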
__________________
Can it play WoW?
Last edited by Jawed; 02-Jan-2010 at 14:22. Reason: kernel name was wrong - though I don't think it makes a difference + dimensions didn't match defaults |
|
|
|
|
|
#19 |
|
Regular
|
I suggest changing the work group size calculation so that it does not use the "maximum". 64 on ATI would be better than 256. 64 is the minimum size on ATI HD5870 or HD4870. Some ATI GPUs (HD43xx HD45xx HD46xx) will work best with a lower size (32 or 16).
NVidia should be happy with 32 or 64.
Might be an idea to expose the workgroup size as a command line parameter. Or, have the program try a few different values. Or, just hard code it to 64.
The register usage appears to be 49 vec4 registers. This is reasonable on ATI, resulting in 5 hardware threads (wavefronts). On NVidia it is a disaster (the equivalent of 196 registers if fiddling with the occupancy calculator), meaning that only 2 hardware threads (warps) can occupy each multiprocessor. As it happens 64 is better than 32 for the workgroup size in this scenario.
I'm not sure how NVidia handles the situation when 512 work-items are requested but the hardware can only issue 64 - I'm not sure if the hardware is spilling registers in this situation, if it is, then that compounds the disaster.
Hopefully making this change will improve things dramatically.
Jawed
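The occupancy figures above can be reproduced with a back-of-the-envelope calculation. This is a simplified sketch, not the vendors' occupancy calculators: the register-file sizes are assumptions (256 vec4 registers per work-item slot on the ATI parts, 16384 scalar registers per multiprocessor on a GT200-class NVIDIA part), and it ignores allocation granularity and per-thread register caps. The function names are hypothetical.

```python
WAVEFRONT_SIZE = 64                 # work-items per ATI wavefront
WARP_SIZE = 32                      # work-items per NVIDIA warp
ATI_VEC4_REGS_PER_LANE = 256        # assumed register budget per work-item slot
NV_REGS_PER_MULTIPROCESSOR = 16384  # assumed, GT200-class (e.g. GTX 280/285)


def ati_wavefronts(vec4_regs_per_work_item):
    """Wavefronts that fit on one SIMD, limited by registers only."""
    return ATI_VEC4_REGS_PER_LANE // vec4_regs_per_work_item


def nvidia_warps(vec4_regs_per_work_item):
    """Warps that fit on one multiprocessor, limited by registers only."""
    scalar_regs = vec4_regs_per_work_item * 4  # 49 vec4 -> 196 scalar
    return NV_REGS_PER_MULTIPROCESSOR // (scalar_regs * WARP_SIZE)


regs = 49  # vec4 registers per work-item, as quoted in the post
print("ATI wavefronts:", ati_wavefronts(regs),
      "->", ati_wavefronts(regs) * WAVEFRONT_SIZE, "work-items")
print("NVIDIA warps:", nvidia_warps(regs),
      "->", nvidia_warps(regs) * WARP_SIZE, "work-items")
```

With these assumed budgets the sketch gives 5 wavefronts on ATI and 2 warps (64 work-items) on NVIDIA, matching the numbers in the post; a card with a smaller register file would fare worse.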
|
|
|
|
|
#20 |
|
Senior Member
|
Well, let's hope Mr. David takes note of this.
But there's still the weird issue with the CPU being under load on NV hardware.
|
|
|
|
|
#21 |
|
Junior Member
Join Date: Jan 2010
Posts: 64
|
|
|
|
|
|
|
#22 | |
|
Merrily dodgy
Join Date: Aug 2003
Location: The colonies
Posts: 1,403
|
Quote:
It doesn't look like the app (or rather, Nvidia's OpenCL implementation) actually uses 4 CPU cores, rather that it doesn't have its affinity tied and gets bounced around to different cores. It still only amounts to 100% use of one core, so it might be some single threaded operation which may also be holding the GPU performance back. Wouldn't surprise me if the GPU score increases if you overclock the CPU here.
|
|
|
|
|
|
#23 | |
|
Senior Member
|
This one is fresh: OCL accelerated LuxRenderer
Quote:
|
|
|
|
|
|
#24 |
|
Junior Member
Join Date: Jan 2010
Posts: 64
|
Considering this is my flat-line shot, with nothing running...
And this is with the app running:
...all 4 cores are in use. I agree affinity could be way better. The thing is, I don't think we are supposed to be using anywhere near that much, ideally. Especially considering Nvidia doesn't support OpenCL on the CPU, that would have to be 100% overhead. |
|
|
|
|
|
#25 | |
|
Junior Member
Join Date: Jan 2010
Posts: 64
|
Quote:
Very light GPU use, and heavy CPU use...
|
|
|
|
|
| Tags |
| opencl, ray-tracing |
|