GPU Ray-tracing for OpenCL

trinibwoy
29-Jan-2010, 02:37
That's an interesting point. It's very different from the way CUDA works, that's for sure. It may be that NVidia isn't using constant memory either, because when I put the sphere data in an image object, performance was the same as when using a constant array declared inside geomfunc.h.

Doesn't that indicate that it is in fact using constant memory? We know it's not using global memory (any more) so where else could it be stored?

Would be nice if someone could check the impact on Cypress of the image approach too.

chavvdarrr
29-Jan-2010, 07:27
I think nAo was objecting to the idea that it would be bad for the hardware, not the compiler. The loop condition is based on a constant, too, so it's 100% coherent. The weird thing is that it doesn't affect the second loop whether it's -- or ++.
Well. If a program accesses sequential data, I'm fairly sure that incremental access will be faster - DDR/DDR2/3/4/5 are all much faster when bursting lots of data with one address command. And they burst with an incrementing source address, no?
Maybe 10 years ago I made some tests with Watcom, and multiplying matrices incrementally was way faster than decrementing through the data. Caches may level off the difference, as may the sophisticated prediction hardware in current CPUs.

Jawed
29-Jan-2010, 10:22
Would be nice if someone could check the impact on Cypress of the image approach too.
ATI OpenCL doesn't support images currently.

Examination of the PTX (I presume that's possible) might confirm whether the specification of constant is using a constant or a texture buffer.

Jawed

Lightman
29-Jan-2010, 11:02
Okay, but can you at least take 30 seconds to make those two changes and tell me what you get?


Sorry, but I can't yet run OpenCL on my HTC Diamond2 :wink:
I'm coming back home today so I will play with my precious!

Mintmaster
29-Jan-2010, 12:59
Doesn't that indicate that it is in fact using constant memory? We know it's not using global memory (any more) so where else could it be stored?
__constant items can be accessed globally, but they're read only. I think a lot of caches in GPUs only work with read-only items, so the speedup doesn't necessarily imply that constants aren't using global memory.

ATI OpenCL doesn't support images currently.
Are you serious? The whole reason that aspect was put in was to take advantage of the filtering hardware on GPUs. That's pretty sad.

I can see why they don't want to make OpenCL available in Catalyst yet.

Jawed
29-Jan-2010, 13:11
It seems to me the OpenCL memory model is quite loose - much looser than seen in DirectCompute. This might be why NVidia's OpenCL performance is falling short of CUDA's in certain cases - though I also suspect that the comfort blanket of explicit warp-size aligned execution that's missing in OpenCL might be causing problems too.

Also I think DirectCompute is a higher priority for AMD than OpenCL - the benchmarking of games is higher profile than OpenCL noodlers' experiments...

Jawed

Mintmaster
29-Jan-2010, 15:39
It seems to me the OpenCL memory model is quite loose - much looser than seen in DirectCompute. This might be why NVidia's OpenCL performance is falling short of CUDA's in certain cases - though I also suspect that the comfort blanket of explicit warp-size aligned execution that's missing in OpenCL might be causing problems too.
Yeah, that makes sense. Compliance has something to do with it, too, as I just found that the compiler option "-cl-fast-relaxed-math" adds another 30%.

Also I think DirectCompute is a higher priority for AMD than OpenCL - the benchmarking of games is higher profile than OpenCL noodlers' experiments...
True. I forgot about DirectCompute, actually. For some reason I thought G80 didn't support it, but I think it does.

Jawed
29-Jan-2010, 16:05
Yeah, that makes sense. Compliance has something to do with it, too, as I just found that the compiler option "-cl-fast-relaxed-math" adds another 30%.
I think CUDA provides similar optionality - though some of it might require explicit function calls rather than compiler options.

True. I forgot about DirectCompute, actually. For some reason I thought G80 didn't support it, but I think it does.
Yeah, limited workgroup size, owner-writes-only thread-local shared memory and a single UAV are the main limitations there, I believe.

Jawed

trinibwoy
29-Jan-2010, 16:11
__constant items can be accessed globally, but they're read only. I think a lot of caches in GPUs only work with read-only items, so the speedup doesn't necessarily imply that constants aren't using global memory.

Can you clarify that a bit? There are dedicated read-only constant and texture caches. Of course the underlying data all resides in global memory but where else could constants get cached except for the constant cache?

Mintmaster
29-Jan-2010, 18:12
Can you clarify that a bit? There are dedicated read-only constant and texture caches. Of course the underlying data all resides in global memory but where else could constants get cached except for the constant cache?
Probably the texture cache (hence being the same speed as when I use an image texture). When NVidia compiles the shader, it has no idea how big the constant buffers are going to be, which is what Micah was talking about in the quote above. Thus they may be too big to work with the dedicated constant cache (CUDA reports constant memory as 64k on my comp). With ATI, the R700 ISA document says that it can work with 16 constant buffers that have 4K float4s (i.e. 16x64k). In both cases the addressable space for constants is limited.
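
The sizes quoted in this post can be checked with a little arithmetic. The sketch below assumes a float4 is four 32-bit floats (16 bytes) and takes the 64k and 16 x 4K figures straight from the post; the helper fits_in_constant_memory is a made-up name for illustration, not part of any SDK.

```python
FLOAT4_BYTES = 4 * 4  # assumption: four 32-bit floats per float4

# R700 figure quoted in the post: 16 constant buffers of 4K float4s each.
r700_buffer_bytes = 4096 * FLOAT4_BYTES
r700_total_bytes = 16 * r700_buffer_bytes

# CUDA figure quoted in the post: 64k of constant memory.
cuda_constant_bytes = 64 * 1024


def fits_in_constant_memory(buffer_bytes, limit_bytes=cuda_constant_bytes):
    """Hypothetical helper: True if a buffer is no larger than the limit."""
    return buffer_bytes <= limit_bytes


print(r700_buffer_bytes // 1024, "KiB per R700 constant buffer")
print(r700_total_bytes // 1024, "KiB across all 16 buffers")
```

So each R700 constant buffer is the same size as the whole CUDA constant space reported above, which is why "16x64k" is the natural way to write it.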

I also forgot to reply to your earlier post:
Only 30x faster? Lame :p

Nice job man. Goes to show how important profilers and debuggers are for catching this sort of thing.

GTX 285:

OpenCL Dade: 1,700 ks/s
OpenCL Mint Loop: 10,000 ks/s
OpenCL Mint Constant: 37,000 ks/s
CUDA Mint: 0.69 Gr/s
Actually the profilers and debuggers did nothing for me. NVidia's OpenCL profiler wouldn't even give me a window before crashing, and their OpenCL compiler was a real pain as it kept crashing for certain situations with the source code (seems really random, like one time I passed a variable to a function with a pointer instead of by value and the crash stopped).

It was just a wild stab in the dark. I knew something was wrong just by working through a rough estimate of where performance should be. I rewrote the loop with a short and perf went up, then changed it back to int and perf stayed the same, so it was changing from increment to decrement that was the difference. I still don't know why it affected one part of the code but not another similar area.

Regarding performance, the .69 Gr/s is equivalent to 69,000 ks/s, so OpenCL has some catching up to do for it to reach CUDA performance, but at least it's within a factor of two now. I was really hoping the newer cards would do over 1 Gs/s, hence the choice of units :lol:

Lightman
29-Jan-2010, 19:28
Or a faster OpenCL implementation :wink: (I got 1200 kSamples per second with my CUDA version on a 8800GTS).

So good news: I figured out how to get the OpenCL version to speed up by a lot (30x on my GPU), and the change was so simple compared to the time I spent experimenting. In "geomfunc.h" and "rendering_kernel.cl", replace all occurrences of OCL_CONSTANT_BUFFER with __constant (I thought the former was already defined as the latter all this time). I knew constant memory wasn't being used properly. I'm curious if ATI users get the same speedup.

For my 8800GTS, there is also one other thing that has to be changed, or else the speedup is only 2x. On line 82 in "geomfunc.h", replace
    unsigned int i = sphereCount;
    for (; i--;) {
with
    for (unsigned int i = 0; i < sphereCount; i++) {
It's really strange that this would make such a difference, but it did for me. Even weirder is that doing the same on line 102 had barely any effect, and it's called almost as frequently. Probably a compiler bug. ATI users: do you see a difference?
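
To see that the two loop forms should be interchangeable as far as the result goes, here is a minimal Python sketch of a nearest-hit search run in both directions. It is not the OpenCL code from geomfunc.h: the function names, the (center, radius) tuple layout and the epsilon are assumptions made for illustration only.

```python
import math


def hit_distance(origin, direction, center, radius):
    """Distance along a unit-length ray to a sphere, or None on a miss."""
    oc = [c - o for c, o in zip(center, origin)]
    b = sum(x * d for x, d in zip(oc, direction))
    det = b * b - sum(x * x for x in oc) + radius * radius
    if det < 0.0:
        return None
    root = math.sqrt(det)
    for t in (b - root, b + root):
        if t > 1e-4:  # ignore hits behind or at the ray origin
            return t
    return None


def nearest_decrementing(spheres, origin, direction):
    """Mirrors `for (; i--;)`: visits sphereCount-1 down to 0."""
    best = math.inf
    i = len(spheres)
    while i:
        i -= 1
        t = hit_distance(origin, direction, *spheres[i])
        if t is not None and t < best:
            best = t
    return best


def nearest_incrementing(spheres, origin, direction):
    """Mirrors `for (i = 0; i < sphereCount; i++)`: visits 0 upwards."""
    best = math.inf
    for i in range(len(spheres)):
        t = hit_distance(origin, direction, *spheres[i])
        if t is not None and t < best:
            best = t
    return best


spheres = [((0, 0, 10), 1.0), ((0, 0, 5), 1.0), ((5, 5, 5), 1.0)]
print(nearest_decrementing(spheres, (0, 0, 0), (0, 0, 1)))
print(nearest_incrementing(spheres, (0, 0, 0), (0, 0, 1)))
```

Both directions find the same nearest distance, so any speed difference between them comes from the compiler or the hardware rather than from the algorithm.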

I'll try to package it all later tonight along with other changes. I wish NVidia's 64-bit SDK didn't make it such a pain to create 32-bit binaries. I might have to uninstall it.

Changing OCL_CONSTANT_BUFFER to __constant has no effect on performance on an HD5870 with Cat. 8.12Beta.

Replacing the second bit of code gave me a minimal improvement, from 14834Ks/s to 15000Ks/s.

Tested on SmallPT 2.0 alpha 2 using only GPU accel. on HD5870 stock clocks.

Ethatron
30-Jan-2010, 18:46
Hi, I also played around with it on my 5870.

I was wondering whether the usual CPU rules about alignment and pre-calculation would still apply. So I tried aligning the structures and padding them with pre-computed values, on the assumption that the cache banks will be filled up to the padded sizes anyway, so we can use the space for free:

Vec becomes x, y, z, l (so 16 byte, I didn't convert this to float4 though, still float[4])
Sphere becomes vectors first, then rad, rade2 = rad * rad (so hopefully 64bytes)

I saw no way to use l (vector-length) in an efficient way, so that's just empty.
Interestingly, there was no negative performance impact, which could mean either that the structures were already fully aligned or that the algorithm is not memory-throughput bound (I added ~15% padding; it's hard to say exactly how much because Vec is inside other structures, but it's less than Vec's +33%). The Sphere-buffer grew from 33k to 45k (on "complex.scn").
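
The byte counts in this post can be reproduced with Python's struct module. This is only a sketch: the Vec sizes follow directly from the post (three floats versus x, y, z, l), but the Sphere field list used here (three padded Vecs, rad, rade2 and one 32-bit material tag) is an assumption, and the real struct in geomfunc.h may differ.

```python
import struct

# Vec as three packed floats versus the padded x, y, z, l form.
vec3_bytes = struct.calcsize("<3f")
vec4_bytes = struct.calcsize("<4f")
vec_growth = (vec4_bytes - vec3_bytes) / vec3_bytes


def round_up(size, alignment):
    """Round size up to the next multiple of alignment."""
    return -(-size // alignment) * alignment


# ASSUMPTION: three padded Vecs (12 floats), rad and rade2 (2 floats),
# plus one 32-bit integer for the material type.
sphere_bytes = struct.calcsize("<12f2fi")
sphere_padded = round_up(sphere_bytes, 64)

print(vec3_bytes, vec4_bytes, f"{vec_growth:.0%}")
print(sphere_bytes, sphere_padded)
```

Under that assumed layout the fields occupy 60 bytes, so a compiler that aligns the struct to 64 bytes would give the "hopefully 64bytes" mentioned above.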
Analysing the situation with Stream Analyzer shows that ALU occupation is 20% (on "complex.scn"), so I couldn't say where it stumbles over its own feet.

                  before change   after change
ALU                   117619.04      117622.73
Fetch                  25327.94       25329.79
Write                      6.00           6.00
Wavefront              12288.00       12288.00
ALUBusy                   20.28          18.16
ALUFetchRatio              4.64           4.64
ALUPacking                31.61          31.61
FetchUnitBusy             12.86          11.53
FetchUnitStalled           0.02           0.03
WriteUnitStalled           0.00           0.00
ALUStalledByLDS            0.00           0.00
LDSBankConflict            0.00           0.00

That was with a worksize of 256; here it is with 64:

                  before change   after change
ALU                   117619.04      117622.73
Fetch                  25327.94       25329.79
Write                      6.00           6.00
Wavefront              12288.00       12288.00
ALUBusy                   22.69          22.17
ALUFetchRatio              4.64           4.64
ALUPacking                31.61          31.61
FetchUnitBusy             14.53          14.21
FetchUnitStalled           0.02           0.03
WriteUnitStalled           0.00           0.00
ALUStalledByLDS            0.00           0.00
LDSBankConflict            0.00           0.00

Fetch does less afterwards, even though it has more data to fetch ...
The ALU has less to do, probably only because of rade2 = rad * rad.

When applied to "complex.scn", speed grows from 542k to 552k (2%). The performance measure is pretty constant because of the number of spheres, so this really can be attributed to the changes and not to some background task.
When applied to "cornell.scn", the performance measure is identical, except that it no longer fluctuates wildly at the beginning; it is calmer and converges asymptotically to approximately the same samples/sec as without alignment/padding.

Well anyway, I know nothing about GPU specifics, and I just had fun flipping the switches in various ways, finally getting into OpenCL. :)=

Ciao
Niels

trinibwoy
30-Jan-2010, 20:46
Probably the texture cache (hence being the same speed as when I use an image texture). When NVidia compiles the shader, it has no idea how big the constant buffers are going to be, which is what Micah was talking about in the quote above. Thus they may be too big to work with the dedicated constant cache (CUDA reports constant memory as 64k on my comp). With ATI, The R700 ISA document says that it can work with 16 constant buffers that have 4K float4s (i.e. 16x64k). In both cases the addressable space for constants is limited.

I'll take your word for it. Never heard of non-texture data getting pushed to the texture cache before.

I also forgot to reply to your earlier post:
Actually the profilers and debuggers did nothing for me. NVidia's OpenCL profiler wouldn't even give me a window before crashing, and their OpenCL compiler was a real pain as it kept crashing for certain situations with the source code (seems really random, like one time I passed a variable to a function with a pointer instead of by value and the crash stopped).

Oh wasn't implying that you didn't find it on your own. Just saying that profilers and debuggers (that work properly) are key to finding this sort of unexpected behavior.

Regarding performance, the .69 Gr/s is equivalent to 69,000 ks/s, so OpenCL has some catching up to do for it to reach CUDA performance, but at least it's within a factor of two now. I was really hoping the newer cards would do over 1 Gs/s, hence the choice of units :lol:

Are there 100 samples per ray? Not following the conversion :oops:

Mintmaster
31-Jan-2010, 00:47
Oh wasn't implying that you didn't find it on your own. Just saying that profilers and debuggers (that work properly) are key to finding this sort of unexpected behavior.
I'm not trying to toot my own horn or anything. Just saying that the profilers, debuggers, and even the compiler sucked, and I basically had to go with trial and error. AMD seems to be getting some flak for not having OpenCL support in the standard Catalyst, but IMO NVidia's is definitely not ready for the public either.

Are there 100 samples per ray? Not following the conversion :oops:
Uh oh, math fail :razz:

69,000 kS/sec * 10 r/S = .69 Gr/s

Actually, I made a slight error there because I've been using a depth of 5 and the original SmallptGPU uses a depth of 6. Thus it should be 12 when comparing apples to apples.
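
The unit conversion in this post can be written out as a short Python sketch. The rule of 2 rays per unit of path depth is only inferred from the two figures given here (10 rays per sample at depth 5, 12 at depth 6); the function names are made up for illustration.

```python
def rays_per_sample(max_depth):
    """Assumption inferred from the post: 2 rays per unit of path depth."""
    return 2 * max_depth


def ksamples_to_grays(ksamples_per_sec, max_depth):
    """Convert kSamples/s to GRays/s for a given path depth."""
    rays_per_sec = ksamples_per_sec * 1_000 * rays_per_sample(max_depth)
    return rays_per_sec / 1e9


def grays_to_ksamples(grays_per_sec, max_depth):
    """Convert GRays/s back to kSamples/s for a given path depth."""
    return grays_per_sec * 1e9 / rays_per_sample(max_depth) / 1_000


print(ksamples_to_grays(69_000, 5))  # the conversion used in the post
print(grays_to_ksamples(0.69, 6))    # the same ray rate at depth 6
```

At depth 6 the same 0.69 Gr/s corresponds to fewer samples per second than at depth 5, which is the apples-to-apples correction mentioned above.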

trinibwoy
31-Jan-2010, 01:28
Uh oh, math fail :razz:

69,000 kS/sec * 10 r/S = .69 Gr/s

Actually, I made a slight error there because I've been using a depth of 5 and the original SmallptGPU uses a depth of 6. Thus it should be 12 when comparing apples to apples.

Lol, yeah I realized but didn't feel like editing because the numbers still wouldn't have made sense without your clarification :)

So it's 10 rays per sample then? Thanks.

Dade
31-Jan-2010, 22:19
The feature to select the OpenCL platform and individual OpenCL devices was requested some time ago, and it is now available in http://davibu.interfree.it/opencl/smallluxgpu/smallluxgpu-v1.2beta3.tgz via a configuration file (check the render.cfg file for an example).

Now if someone is brave enough to install NVIDIA/ATI cards and drivers on the same PC :?:

There are also a few new features:
- added support for vertex colours interpolation;
- added support for configuration file;
- added support for OpenCL platform and devices selection via configuration file;
- new surface integrator architecture, it is able to generate 2 rays per step.

The new surface integrator architecture decreases the CPU load required to keep the GPU busy, which means more spare CPU cycles to render more samples. It is faster:

http://www.luxrender.net/forum/download/file.php?id=6452&mode=view

Seeing 3.7M samples/sec on a scene with 150k triangles and 4 light sources is quite impressive. I wonder when this GPGPU thing will stop surprising me.

Lightman
01-Feb-2010, 18:59
Thanks Dade!

Shame I'm away from home for next couple of days :(
I would give new version a spin!

ElMoIsEviL
03-Feb-2010, 01:50
The feature to select the OpenCL platform and individual OpenCL devices was requested some time ago, and it is now available in http://davibu.interfree.it/opencl/smallluxgpu/smallluxgpu-v1.2beta3.tgz via a configuration file (check the render.cfg file for an example).

Now if someone is brave enough to install NVIDIA/ATI cards and drivers on the same PC :?:

There are also a few new features:
- added support for vertex colours interpolation;
- added support for configuration file;
- added support for OpenCL platform and devices selection via configuration file;
- new surface integrator architecture, it is able to generate 2 rays per step.

The new surface integrator architecture decreases the CPU load required to keep the GPU busy, which means more spare CPU cycles to render more samples. It is faster:

http://www.luxrender.net/forum/download/file.php?id=6452&mode=view

Seeing 3.7M samples/sec on a scene with 150k triangles and 4 light sources is quite impressive. I wonder when this GPGPU thing will stop surprising me.

I have them both installed on my system. What would you like to know? (provided this works)

PS. I extracted the files yet don't see a render.cfg file in there.

Dade
03-Feb-2010, 10:38
I have them both installed on my system. What would you like to know? (provided this works)

PS. I extracted the files yet don't see a render.cfg file in there.

The render.cfg looks like this:


image.width = 1280
image.height = 720
# Use a value > 0 to enable batch mode
batch.halttime = 0
scene.file = scenes/simple.scn
scene.fieldofview = 60
opencl.latency.mode = 1
opencl.nativethread.count = 3
opencl.cpu.use = 0
opencl.gpu.use = 1
# Select the OpenCL platform to use (0=first platform available, 1=second, etc.)
opencl.platform.index = 0
# The string select the OpenCL devices to use (i.e. first "0" disable the first
# device, second "1" enable the second).
#opencl.devices.select = 10
# Use a value of 0 to enable default value
opencl.gpu.workgroup.size = 64
screen.refresh.interval = 50
path.maxdepth = 6


You can use "opencl.platform.index" to select the platform to use and the string "opencl.devices.select" to enable/disable single devices.

When running SmallLuxGPU normally, you should have 2 platforms listed: ATI and NVIDIA (if the 2 OpenCL drivers really work and can coexist). In that case you can select which platform to use.
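
As an illustration of the file format, here is a minimal Python sketch that reads "key = value" lines like the ones in render.cfg. It is not SmallLuxGPU's own parser, and the reading of "opencl.devices.select" (character i is "1" when device i is enabled) is an assumption based on the comment in the file; both function names are hypothetical.

```python
def parse_cfg(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def enabled_devices(select):
    """ASSUMPTION: character i of the string is '1' if device i is enabled."""
    return [i for i, flag in enumerate(select) if flag == "1"]


example = """
# Select the OpenCL platform to use
opencl.platform.index = 0
#opencl.devices.select = 10
opencl.gpu.workgroup.size = 64
"""

cfg = parse_cfg(example)
print(int(cfg["opencl.platform.index"]))
print(enabled_devices("10"))
```

Note that the commented-out "opencl.devices.select" line is ignored, so with the file as posted every device on the chosen platform would presumably be left at its default.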

pcchen
03-Feb-2010, 11:45
When running SmallLuxGPU normally, you should have 2 platforms listed: ATI and NVIDIA (if the 2 OpenCL drivers really work and can coexist). In that case you can select which platform to use.

In my own tests, the OpenCL ICD in NVIDIA's latest driver (196.21 or 196.34) still can't work with ATI Stream SDK 2.0. Apparently they use different function names for getting the platform ID (it's called clIcdGetPlatformIDsKHR in nvcuda.dll and clIcdDispatchGetPlatformIDsKHR in atiocl.dll and atiocl64.dll).

Dade
04-Feb-2010, 12:09
I have posted today the first rendering done with LuxrenderGPU on Lux forums: http://www.luxrender.net/forum/viewtopic.php?f=13&t=3439

http://www.luxrender.net/forum/download/file.php?id=6523&mode=view

This is quite an important milestone because the OpenCL code is starting to move from the "toy" field (i.e. SmallptGPU, SmallLuxGPU) to the "production" field (i.e. Luxrender), even if it is still very much experimental.

LuxrenderGPU supports some of the nice features of Luxrender Classic out of the box, including network rendering:

http://www.luxrender.net/forum/download/file.php?id=6524&mode=view

fellix
04-Feb-2010, 13:34
Kewl!

But the GPU render output on the right is somewhat different -- the blue translucent enclosure is darker!?

Lightman
04-Feb-2010, 16:29
Kewl!

But the GPU render output on the right is somewhat different -- the blue translucent enclosure is darker!?

Same time of rendering, a lot more samples taken for GPU renderer :?:

Talonman
12-Feb-2010, 00:44
Can we download that version?

I can't find the download link.

Dade
12-Feb-2010, 10:04
Can we download that version?

I can't find the download link.

Talonman, it is still too experimental for the "public". Most of the configuration is hard-coded (buffer size, number of threads, etc.) inside the sources, so you would have to modify the code and recompile to run on your hardware.

Talonman
12-Feb-2010, 17:03
Talonman, it is still too experimental for the "public". Most of the configuration is hard-coded (buffer size, number of threads, etc.) inside the sources, so you would have to modify the code and recompile to run on your hardware.

Thanks for the post...

I will just wait then. The boys at EVGA still have their eye on you. :wink:

http://www.evga.com/FORUMS/tm.aspx?m=91863&mpage=4

Dade
12-Feb-2010, 23:17
Thanks for the post...

I will just wait then. The boys at EVGA still have their eye on you. :wink:

http://www.evga.com/FORUMS/tm.aspx?m=91863&mpage=4

Ah, thanks, could you write there to Nathe72 that the ".py" file is a ".ply" and there is an SmallLuxGPU exporter for Blender 2.5 available here: http://www.luxrender.net/forum/viewtopic.php?f=34&t=3420

LuxrenderGPU instead uses the same .lxs files as Luxrender.

Talonman
13-Feb-2010, 14:07
Done!

Dade
10-Mar-2010, 11:43
An update on the progress made so far: I uploaded a new video about SLG at http://www.vimeo.com/10048897

It includes 5 different scenes with real-time (or better interactive) rendering and 3 small animations.

http://www.luxrender.net/forum/download/file.php?id=7047&mode=view
http://www.luxrender.net/forum/download/file.php?id=7048&mode=view
http://www.luxrender.net/forum/download/file.php?id=7050&mode=view

Lightman
27-Mar-2010, 18:44
Anyone with GTX480 willing to join the party?
I'm curious about performance of it in this great OpenCL raytracer :eek:

MrGaribaldi
28-Mar-2010, 13:44
Are there any released in the wild yet? I thought they wouldn't be available until 12th of April.
But as soon as I can get access to one, I'll try to get some results. Have great hopes for the results!

Dade
28-Mar-2010, 22:31
Anyone with GTX480 willing to join the party?


Not yet. However, if you want to see some big numbers, have a look at these screenshots posted by KyungSoo in the LuxRender forum dedicated to GPU acceleration.

8 GPUs (!) at work:

http://davibu.interfree.it/tmp/SLG_8gpu.png

4 Tesla at work:

http://davibu.interfree.it/tmp/4tesla.png

I'm looking forward to the first test with Fermi too :wink:

cho
29-Mar-2010, 06:37
GTX 480:
http://we.pcinlife.com/attachments/forumid_340/100329132504635b1413d44fb3.png

GTX 285:
http://we.pcinlife.com/attachments/forumid_340/10032913256bb37194c8e04cc5.png

HD 5870:
http://we.pcinlife.com/attachments/forumid_340/10032913265d1305bc875b9fe0.png

CNCAddict
29-Mar-2010, 06:42
CAN YOU SAY WOOOHOOOO. Looks like I may trade in my 5850 afterall :shock:

rpg.314
29-Mar-2010, 07:21
GTX 480:
http://we.pcinlife.com/attachments/forumid_340/100329132504635b1413d44fb3.png

GTX 285:
http://we.pcinlife.com/attachments/forumid_340/10032913256bb37194c8e04cc5.png


Holy cow... :shock:

Is that a ~20x jump I am seeing there. With just a tiny L1. I am assuming you used 48K for L1 cache.

Come on Dave, give us some cachey goodness on radeon 6xx0.

fellix
29-Mar-2010, 08:56
GTX 480:
http://we.pcinlife.com/attachments/forumid_340/100329132504635b1413d44fb3.png

DAMN! (http://www.youtube.com/watch?v=95SYdjRVCR0)

Is that a ~20x jump I am seeing there. With just a tiny L1. I am assuming you used 48K for L1 cache.
The default LDS/L1 partitioning for GF100 (as of current) is 48/16KB.

Dade
29-Mar-2010, 09:27
Omg :!:

"Old" NVIDIA cards have always shown some problems with SmallptGPU (I wouldn't focus too much on the speedup compared with the GTX285), but running more than 2 times faster than a 5870 is eye-popping :shock:

Cho, any chance to run one of the latest SmallLuxGPU (http://davibu.interfree.it/opencl/smallluxgpu/slg-v1.4beta3.tgz) ?

Psycho
29-Mar-2010, 10:05
Ehm.. how can the 5870 do more passes in the same time (and show a *very* similar image that if anything is slightly better - like the number of passes indicate), but get a much lower samples/sec count? Looks like it's doing same/more work in the same time

Jawed
29-Mar-2010, 10:22
That's very tasty. Some nice combination of dynamic branching and cache I suppose.

A much more stressful test:

http://forum.beyond3d.com/showpost.php?p=1385754&postcount=222

Jawed

cho
29-Mar-2010, 10:24
GTX 480:
http://we.pcinlife.com/attachments/forumid_340/10032917146936034c27d775a5.png

GTX 285:
http://we.pcinlife.com/attachments/forumid_340/1003291714242af38362bbf851.png

HD 5870
http://we.pcinlife.com/attachments/forumid_340/100329171416390597782e765d.png

Jawed
29-Mar-2010, 10:25
Ehm.. how can the 5870 do more passes in the same time (and show a *very* similar image that if anything is slightly better - like the number of passes indicate), but get a much lower samples/sec count? Looks like it's doing same/more work in the same time
The time shown is the time between screen updates. The application varies workload per invocation of the OpenCL kernel in order to produce a consistent 0.5s update interval.
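
A rough Python sketch of that kind of feedback loop follows. It is not SmallptGPU's actual code: the function names, the simple proportional rule and the idea that one pass means one sample per pixel are all assumptions made for illustration.

```python
def next_pass_count(current_passes, elapsed_seconds, target_seconds=0.5):
    """Scale the passes per kernel launch so the next update takes
    roughly target_seconds (hypothetical proportional rule)."""
    if elapsed_seconds <= 0.0:
        return current_passes * 2  # too fast to measure, so try more work
    scaled = round(current_passes * target_seconds / elapsed_seconds)
    return max(1, scaled)


def samples_per_second(passes, width, height, elapsed_seconds):
    """ASSUMPTION: one pass is one sample for every pixel of the image."""
    return passes * width * height / elapsed_seconds


print(next_pass_count(10, 1.0))   # update took too long, so do less work
print(next_pass_count(10, 0.25))  # update was quick, so do more work
print(samples_per_second(2, 640, 480, 0.5))
```

With a scheme like this, the time per update looks the same on every card and only the amount of work done per update differs, so the update time on its own says nothing about relative speed.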

Jawed

jj99
29-Mar-2010, 12:01
Very good result of GTX 480 for smallptGPU, but performance in smallluxGPU is rather disappointing...

Dade
29-Mar-2010, 12:09
Thanks, Cho. However, you need to tune the configuration a bit for your hardware and for still rendering (instead of preview). You had only a 50% load on the 480.

You should edit the scenes/luxball/render-fast.cfg file and replace the content with:

image.width = 640
image.height = 480
batch.halttime = 0
scene.file = scenes/luxball/luxball.scn
scene.fieldofview = 45
opencl.latency.mode = 0
opencl.nativethread.count = 0
opencl.cpu.use = 0
opencl.gpu.use = 1
opencl.platform.index = 0
opencl.renderthread.count = 4
opencl.gpu.workgroup.size = 64
screen.refresh.interval = 2000
screen.type = 3
screen.gamma = 2.2
path.maxdepth = 6
path.russianroulette.depth = 5
path.russianroulette.prob = 0.75
path.shadowrays = 1

If you use this configuration, first of all it will use only GPU for the rendering, it will use 4 threads to feed the GPU (I assume you have a quad core) and it will disable preview mode.

For reference, this is the result of my i7 860+5870+5850:

http://davibu.interfree.it/tmp/i860+hd5870+hd5850.jpg

Indeed, tuning the configuration is very important.

Dade
29-Mar-2010, 12:12
Very good result of GTX 480 for smallptGPU, but performance in smallluxGPU is rather disappointing...

I think Cho just needs a bit of tuning for SmallLuxGPU. However, keep in mind that SmallptGPU uses a very small dataset (i.e. a few bytes), while SmallLuxGPU uses a dataset of several MBs.

Maybe the size of the Fermi cache shines in the first case while it is nearly useless in the second.

jj99
29-Mar-2010, 12:17
Thanks, Dade, I understand that. I was wondering how Fermi's cache will help in real world scenario like in case of SmallLuxGPU. Will wait to see the updated results of Cho.

Lightman
29-Mar-2010, 12:33
Indeed very good showing from GTX480 in SmallPT :shock:

It's all finally starting to go in the right direction with GPGPU. I can only hope AMD and nVidia keep up this rate of development for another 3-5 years, and real-time RT will be conquered!

cho
29-Mar-2010, 14:51
http://we.pcinlife.com/attachments/forumid_340/1003292141a5c1b8da98d996b1.png

I am using a i7-920 with HT enabled..

The thread number is set to 16 . The GPU load is about 67~78%.

fellix
29-Mar-2010, 15:18
The thread number is set to 16 . The GPU load is about 67~78%.
Wow -- 93°C for just 78% load?

Anyway, here is my HD5870 @ 900MHz GPU:

http://img69.imageshack.us/img69/9458/luxball.jpg

This is with 8 threads on Q9450. Four wouldn't saturate it enough, giving me lower sample rates.

cho
29-Mar-2010, 15:24
yes, but the fan noise is ok at this speed.

Dade
29-Mar-2010, 15:31
I am using a i7-920 with HT enabled..

The thread number is set to 16 . The GPU load is about 67~78%.

Thanks Cho, the correct value for the thread count should be 8 (4 real cores + 4 virtual for HT).
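
As a starting point for "opencl.renderthread.count", the rule of thumb above (one thread per logical CPU) can be sketched in Python. The function name is made up, and the best value may still need testing on a given machine, as other posts in this thread suggest.

```python
import os


def suggested_render_threads(logical_cpus=None):
    """Starting point only: one render thread per logical CPU."""
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1
    return max(1, logical_cpus)


# An i7-920 with Hyper-Threading: 4 real cores x 2 = 8 logical CPUs.
print(suggested_render_threads(4 * 2))
# The value for the machine running this script.
print(suggested_render_threads())
```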

Anyway, the result seems to confirm that the 480 is about 2 times faster than the 5870 on GPGPU tasks (about 8M rays/sec vs. about 4M rays/sec).

Arnold Beckenbauer
29-Mar-2010, 16:48
Dade: With SmallLuxGPU 1.3 and 1.4 beta 3 I get this error:
http://www.abload.de/img/errorp4oy.png
Older versions and SmallPTGPU work fine.

Win 7 x64, Cat. 8.712.3.1 (OpenGL 4.0&3.3 Preview Driver), Stream SDK 2.01, HD4850.

Dade
29-Mar-2010, 18:28
Dade: With SmallLuxGPU 1.3 and 1.4 beta 3 I get this error:
http://www.abload.de/img/errorp4oy.png
Older versions and SmallPTGPU work fine.

Win 7 x64, Cat. 8.712.3.1 (OpenGL 4.0&3.3 Preview Driver), Stream SDK 2.01, HD4850.

ATI OpenCL SDK 2.01 has a known problem with the HD48xx family. According to a post in the ATI forums, it will be fixed in the next SDK release.

However, for the moment, the only solution is to downgrade to SDK 2.0 :???:

Arnold Beckenbauer
29-Mar-2010, 20:55
ATI OpenCL SDK 2.01 has a known problem with the HD48xx family. According to a post in the ATI forums, it will be fixed in the next SDK release.

However, for the moment, the only solution is to downgrade to SDK 2.0 :???:

Wow, great. :sad:

Dade
29-Mar-2010, 22:41
Wow, great. :sad:

Yup, not very nice, anyway 2.0 is still available at http://developer.amd.com/Downloads/ati-stream-sdk-v2.0-xp32.exe (32bit) and at http://developer.amd.com/Downloads/ati-stream-sdk-v2.0-xp64.exe (64bit).

Jawed
30-Mar-2010, 00:38
Hasn't 2.0 expired?

Jawed

trinibwoy
30-Mar-2010, 03:33
Is that a ~20x jump I am seeing there.

I don't think it's 20x. Mintmaster found some gross problems with the code two pages back which once fixed increased perf on the 285 30x. What we're seeing could simply be the Fermi compiler automatically taking care of those.

jj99
30-Mar-2010, 04:26
Does someone know if SmallLuxGPU can use the two chips in 5970? I think there is some problem, and the program is compiled only on the first device. The second one gives black.

Mintmaster
30-Mar-2010, 07:19
I don't think it's 20x. Mintmaster found some gross problems with the code two pages back which once fixed increased perf on the 285 30x. What we're seeing could simply be the Fermi compiler automatically taking care of those.
Yeah, Fermi is still slower than my 8800 GTS on CUDA :cool:

Can anyone with Fermi try my CUDA code from a few pages back? I think it will do around 1.5 GRays per second. www.its.caltech.edu/~nandra/SmallptGPU.zip

If I find some free time, I'll try to make a DirectCompute port. Seems like ATI and NVidia are more focussed on that than OpenCL.

Dade
30-Mar-2010, 08:50
Hasn't 2.0 expired?

I think only the beta version did. The final release doesn't; I know people who are using it right now (because of the problems with HD48xx).

Does someone know if SmallLuxGPU can use the two chips in 5970? I think there is some problem, and the program is compiled only on the first device. The second one gives black.

It is a problem with crossfire configuration. I have a 5870 and a 5850, if I connect them I get the same result you are describing (and the 5850 is erroneously recognized as a 5870: 20 compute units). Everything works fine when crossfire cable is not used.

It is yet another problem with the ATI OpenCL driver; it has been reported a couple of times on their forum. It is another problem that is supposed to be fixed in the next release :???:

cho
30-Mar-2010, 09:55
the performance is not stable ... about 0.92 ~ 1.20 GRays/s .

http://we.pcinlife.com/attachments/forumid_340/1003301645c390b4c921d1e12b.png

CarstenS
30-Mar-2010, 11:46
Yeah, Fermi is still slower than my 8800 GTS on CUDA :cool:

Can anyone with Fermi try my CUDA code from a few pages back? I think it will do around 1.5 GRays per second. www.its.caltech.edu/~nandra/SmallptGPU.zip

If I find some free time, I'll try to make a DirectCompute port. Seems like ATI and NVidia are more focussed on that than OpenCL.

The Link doesn't work anymore?

pcchen
30-Mar-2010, 12:19
If I find some free time, I'll try to make a DirectCompute port. Seems like ATI and NVidia are more focussed on that than OpenCL.

To my understanding (please correct me if I'm wrong), the compiler in DirectCompute is provided by Microsoft. That is, the compiler compiles from HLSL into some intermediate assembly-like language (probably similar to how vertex shaders and pixel shaders work), then the driver compiles the assembly into hardware binary code. Therefore, the compiler quality is more consistent (not perfect, but still consistent across different vendors).

In the case of OpenCL, although the compilers are all based on LLVM (I heard from a friend that Apple requires this), they still vary in quality.

rpg.314
30-Mar-2010, 12:25
To my understanding (please correct me if I'm wrong), the compiler in DirectCompute is provided by Microsoft. That is, the compiler compiles from HLSL into some intermediate assembly-like language (probably similar to how vertex shaders and pixel shaders work), then the driver compiles the assembly into hardware binary code. Therefore, the compiler quality is more consistent (not perfect, but still consistent across different vendors).
AFAIK, it only does basic optimizations. The final optimizations and codegen is still left to IHV compiler.

The advantage for IHVs is that they can ignore the lexing/parsing/sema/dce phases, which are the most boring in a compiler anyway.

Jawed
30-Mar-2010, 13:48
It seems to me AMD's in a nightmare tussle with LLVM.

Jawed

rpg.314
30-Mar-2010, 14:49
It seems to me AMD's in a nightmare tussle with LLVM.

Jawed

why?

Jawed
30-Mar-2010, 15:05
why?
The first clue is all the talk of irreducible control flow.

Then threads like this:

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=130787&enterthread=y

Jawed

cho
30-Mar-2010, 18:23
http://we.pcinlife.com/attachments/forumid_340/100329142951bf93b243c2a27a.png

only 240 KB Global Cache ?

MfA
30-Mar-2010, 20:35
The first clue is all the talk of irreducible control flow.
On the upside, if they fix it inside their compiler backend (code explosion ahoy) they will be able to support goto for OpenCL as well.

Although in the end if they really want to they can just turn off all the optimization passes and do it all internally, I doubt the translation step introduces irreducible control flow.

Jawed
31-Mar-2010, 00:22
only 240 KB Global Cache ?
Global memory is wrong too, so I think you can assume it's broken in some way.

Jawed

Ike Turner
31-Mar-2010, 13:05
Quick bump...something isn't right on my side. I can't get it to run more than 6 renderthreads even though I've set it to 8 (or more) :
EDIT: Thanks to Tomb at the Lux forum I found my error: http://www.luxrender.net/forum/viewtopic.php?f=34&t=3643&start=30#p35294

http://img202.imageshack.us/img202/5136/luxki.jpg (http://img202.imageshack.us/i/luxki.jpg/)

I'm on an i7 920 + 5870.

image.width = 640
image.height = 480
batch.halttime = 0
scene.file = scenes/luxball/luxball.scn
scene.fieldofview = 45
opencl.latency.mode = 0
opencl.nativethread.count = 4
opencl.cpu.use = 0
opencl.gpu.use = 1
opencl.platform.index = 0
opencl.renderthread.count = 8
opencl.gpu.workgroup.size = 64
screen.refresh.interval = 2000
screen.type = 3
screen.gamma = 2.2
path.maxdepth = 6
path.russianroulette.depth = 5
path.russianroulette.prob = 0.75
path.shadowrays = 1
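
The settings above are plain `key = value` lines, so they are easy to inspect from a script. A minimal sketch in Python, assuming one property per line and no comment or quoting rules (the real loader may behave differently); `parse_properties` is a hypothetical helper, not part of the renderer:

```python
def parse_properties(text):
    """Parse 'key = value' lines into a dict of strings (hypothetical helper)."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


# A subset of the settings posted above.
CONFIG = """
image.width = 640
image.height = 480
opencl.cpu.use = 0
opencl.gpu.use = 1
opencl.renderthread.count = 8
opencl.gpu.workgroup.size = 64
path.maxdepth = 6
path.russianroulette.depth = 5
path.russianroulette.prob = 0.75
"""

props = parse_properties(CONFIG)
pixels = int(props["image.width"]) * int(props["image.height"])
print(pixels)                              # 307200
print(props["opencl.renderthread.count"])  # 8
```

Values are kept as strings and converted only where needed, since the file mixes integers, floats and paths.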


Another thing (unrelated) I noticed when running SiSoft Sandra's benchmarks is that only single precision is working in the DirectCompute benchmarks (double is emulated). What is going on? (Running CAT 10.3b with the latest DX11 runtime and ATI Stream 2.0.1 SDK on Win7 64-bit.)

http://img177.imageshack.us/img177/3581/dc11.jpg (http://img177.imageshack.us/i/dc11.jpg/)

jj99
31-Mar-2010, 15:23
About Sandra... On my machine it's the same. Double precision is too low. In OpenCL it's the same. But in the Stream test, results are normal. It's just another "mystery" of ATI's drivers...

pcchen
31-Mar-2010, 19:08
About Sandra... On my machine it's the same. Double precision is too low. In OpenCL it's the same. But in the Stream test, results are normal. It's just another "mystery" of ATI's drivers...

It could be because AMD's OpenCL implementation currently does not expose double precision support. It's still sort of supported (as in "you can still use it"), but not advertised in the extension string. Maybe Sandra didn't get that note, so they don't use double precision on OpenCL?
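
For reference, whether a device advertises double precision can be read from its extension string. A minimal sketch in Python, assuming the string has already been obtained from the `CL_DEVICE_EXTENSIONS` device query; `cl_khr_fp64` is the Khronos extension name and `cl_amd_fp64` is AMD's vendor extension. The sample strings are made up for illustration, not dumped from a real driver:

```python
FP64_EXTENSIONS = ("cl_khr_fp64", "cl_amd_fp64")


def advertised_fp64(extensions):
    """Return the fp64 extension names found in a space-separated extension string."""
    names = set(extensions.split())
    return [ext for ext in FP64_EXTENSIONS if ext in names]


# Illustrative strings only (assumed, not taken from real hardware).
without_fp64 = "cl_khr_global_int32_base_atomics cl_khr_byte_addressable_store"
with_fp64 = "cl_khr_fp64 cl_khr_byte_addressable_store"

print(advertised_fp64(without_fp64))  # []
print(advertised_fp64(with_fp64))     # ['cl_khr_fp64']
```

A benchmark that only looks at this string would skip double precision on a driver that supports it without advertising it, which would match the behaviour guessed at above.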

Mintmaster
01-Apr-2010, 19:29
The Link doesn't work anymore?
Sorry, wrong link. It was right on the earlier page.
http://www.its.caltech.edu/~nandra/SmallptCUDA.zip

I checked it, and it's an old version with a few bugs. I'll put the new one up soon...

CarstenS
06-Apr-2010, 12:34
It's quitting on my system with an error.
Something about a missing cudart64_30_8.dll.

Do I need to install some SDK for it?

pcchen
06-Apr-2010, 18:03
To my understanding, cudart64_30_8.dll comes with the older CUDA SDK 3.0 beta. The release version only comes with 30_14. So either this has to be recompiled with the latest SDK, or you'll have to get that file from some source (I don't have it, unfortunately, as I uninstalled the beta SDK and those files are gone).

CarstenS
07-Apr-2010, 08:57
Maybe Mr. Mint can help out?

willardjuice
07-Apr-2010, 16:01
You are lucky I'm lazy and take forever to upgrade things. :razz: I think this is what you want: http://www.megaupload.com/?d=PSR7QTOV (I'm assuming the download will be available eventually, atm it seems megaupload is still checking my file)

Btw, why not just download the old sdk?

Mintmaster
07-Apr-2010, 17:55
To my understanding, cudart64_30_8.dll comes with the older CUDA SDK 3.0 beta. The release version only comes with 30_14. So either this has to be recompiled with the latest SDK, or you'll have to get that file from some source (I don't have it, unfortunately, as I uninstalled the beta SDK and those files are gone).
Damnit. How can I figure out which DLLs are needed to run a CUDA program? I thought cudart.dll and cutil64.dll were the only files I needed to put in the zip.

CarstenS
07-Apr-2010, 19:02
willardjuice,
Thanks man! But your file hoster won't let me get my virtual sticky fingers on the files. :-(

Mintmaster,
You happen to have that Cuda-file at hand, don't you?

willardjuice
07-Apr-2010, 19:46
http://rapidshare.com/files/373152268/BeyondPhysX.7z.html try that

pcchen
07-Apr-2010, 21:29
Damnit. How can I figure out which DLLs are needed to run a CUDA program? I thought cudart.dll and cutil64.dll were the only files I needed to put in the zip.

It used to be only cudart.dll, but they kept changing the files and making them incompatible, so they eventually decided to put version numbers in the filename, hence the cudart64_30_8 thingy.
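
The naming scheme described above can be pulled apart mechanically. A minimal sketch in Python, assuming the pattern `cudart<bits>_<version>_<build>.dll` inferred from the two names mentioned in this thread (`cudart64_30_8.dll` and the `30_14` release build); `parse_cudart_name` is a hypothetical helper, and the 32-bit variant of the name is an assumption:

```python
import re

# Pattern inferred from the file names in this thread; the "32" variant is assumed.
CUDART_RE = re.compile(r"^cudart(32|64)_(\d+)_(\d+)\.dll$", re.IGNORECASE)


def parse_cudart_name(filename):
    """Split a versioned CUDA runtime DLL name into its parts, or return None."""
    match = CUDART_RE.match(filename)
    if match is None:
        return None  # unversioned or unrelated file name
    bits, version, build = match.groups()
    return {"bits": int(bits), "version": version, "build": int(build)}


print(parse_cudart_name("cudart64_30_8.dll"))
# {'bits': 64, 'version': '30', 'build': 8}
print(parse_cudart_name("cudart.dll"))
# None
```

Two builds with the same version field but different build numbers (8 versus 14 here) are treated as different files by the loader, which is why the beta DLL has to be shipped alongside a program built against it.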

CarstenS
07-Apr-2010, 22:37
http://rapidshare.com/files/373152268/BeyondPhysX.7z.html try that

That worked! Thanks :)
My results are the same as cho's though. Performance is alternating between 0.94 and 1.25 GRays/sec.

straaljager
08-Apr-2010, 18:46
Hello Beyond3D,

After seeing these crazy results with SmallptGPU, I was wondering if someone with a GTX480 could take the time to do a little benchmark that measures the GPU raytracing performance with the free demo of Octane render (a GPU-only raytracing renderer). The latest demo (version 1.02 beta2) can be downloaded at http://www.refractivesoftware.com/downloads.html.

I uploaded two test scenes (spaceships and chess) that came with the previous demo versions of Octane render, but for some reason were not included in the latest version. If you don't know how to load scenes into Octane and use the renderer, these tutorial videos explain how it's done:

http://vimeo.com/groups/octanerender/videos/10155587
http://vimeo.com/groups/octanerender/videos/8965151

Would greatly appreciate it!

straaljager
09-Apr-2010, 08:17
Sorry, forgot the link to the testscenes but couldn't edit my post http://www.mediafire.com/?31k1xrmmw1b (7MB zipped).

To load a scene, right-click in the graph editor (lower left panel), choose Add > Objects > Mesh and then choose the testscene file. The scene will be imported and a new button will appear in the graph editor. If you click on it, the scene will start rendering and will show samples/pixel, Msamples/s, FPS and rendertime.

Broken Hope
09-Apr-2010, 12:22
Is it normal to get a ton of errors in the command window when running LuxRays?

http://img245.imageshack.us/img245/179/errorsu.png

If not how do I fix them?

cho
09-Apr-2010, 14:48
Hello Beyond3D,

After seeing these crazy results with SmallptGPU, I was wondering if someone with a GTX480 could take the time to do a little benchmark that measures the GPU raytracing performance with the free demo of Octane render (a GPU-only raytracing renderer). The latest demo (version 1.02 beta2) can be downloaded at http://www.refractivesoftware.com/downloads.html.

I uploaded two test scenes (spaceships and chess) that came with the previous demo versions of Octane render, but for some reason were not included in the latest version. If you don't know how to load scenes into Octane and use the renderer, these tutorial videos explain how it's done:

http://vimeo.com/groups/octanerender/videos/10155587
http://vimeo.com/groups/octanerender/videos/8965151

Would greatly appreciate it!


I can import the .obj file, but when I click on the object, the program just closes.

It detected gtx480 as compute caps 1.0.

straaljager
09-Apr-2010, 16:14
I can import the .obj file, but when I click on the object, the program just closes.

It detected gtx480 as compute caps 1.0.

Thanks for trying this, cho. I'm not sure why the program closes. Did you download CUDA drivers 3.0 (needed for the latest demo version 1.02 beta2)? It may be that Octane doesn't recognize GTX 480 yet. I have uploaded an .obj file (the chess scene) to http://www.mediafire.com/?wmemmuywdxm. (To load the scene just right-click in the graph editor, choose Add>objects>mesh and find "chess.obj". The scene will be imported and a new button with the same name will appear in the graph editor. When you click on it, it should start rendering right away).

cho
09-Apr-2010, 18:04
The scene can be imported, but when I select it, the program closes too.

straaljager
09-Apr-2010, 19:43
Thanks Cho. I have contacted Octane's developer about the problem. Sorry for repeating myself, but are you sure you have installed the CUDA 3.0 driver (can be downloaded at http://developer.nvidia.com/object/cuda_3_0_downloads.html)?

If that doesn't work, there's probably a compatibility issue with Octane itself. Thanks again.

straaljager
09-Apr-2010, 20:00
Last try: maybe an older demo version (v.07a) will work. I uploaded it to http://www.mediafire.com/?2mvtiwtj4mj

Test scenes are included in this one and you don't necessarily need cuda 3.0.

Dade
09-Apr-2010, 21:10
Is it normal to get a ton of errors in the command window when running LuxRays?

If not how do I fix them?

You have to run the application with administrator rights in order to allow each thread to set its own priority. However, it is more a warning than an error: it works better if the GPU-feeding threads have higher priority than the CPU native threads, but the difference shouldn't be dramatic, especially with only one GPU.

CarstenS
10-Apr-2010, 07:58
straaljager,

With the new 1.02 Beta2, I'm getting the same error as cho, despite having the brandnew 197.41 WHQL with Cuda 3.0 installed.

Version 0.7a runs "partially" fine though.
For chess, I'm getting 12.75 MSamples/sec., for spaceship 15.96.
But I'm afraid the render output for chess is somewhat broken:
http://img222.imageshack.us/img222/6558/octanedemo07a9000gtx480.th.png (http://img222.imageshack.us/i/octanedemo07a9000gtx480.png/)
Spaceship looks normal I think.
http://img13.imageshack.us/img13/6558/octanedemo07a9000gtx480.th.png (http://img13.imageshack.us/i/octanedemo07a9000gtx480.png/)

straaljager
10-Apr-2010, 10:35
Thank you very much CarstenS!

If by broken render output, you mean the white dots in the image, that's perfectly normal for the algorithm being used (brute force pathtracing) and is commonly referred to as "fireflies". Currently they can be removed in postprocessing, but there are algorithms being developed (like Metropolis light transport) that don't exhibit these fireflies.
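
As an illustration of the postprocessing route, here is a minimal sketch in Python of one simple approach: replace any pixel that is far brighter than the median of its neighbours. This is an assumed, generic filter for a grayscale image stored as nested lists, not a description of what Octane does; `remove_fireflies` and the `factor` threshold are hypothetical:

```python
from statistics import median


def remove_fireflies(image, factor=4.0):
    """Replace pixels brighter than factor * (median of their neighbours)."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]  # work on a copy
    for y in range(height):
        for x in range(width):
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            if not neighbours:
                continue  # 1x1 image: nothing to compare against
            local = median(neighbours)
            if image[y][x] > factor * local:
                result[y][x] = local
    return result


# One bright outlier in an otherwise uniform patch.
image = [
    [0.2, 0.2, 0.2],
    [0.2, 9.0, 0.2],
    [0.2, 0.2, 0.2],
]
cleaned = remove_fireflies(image)
print(cleaned[1][1])  # 0.2
```

A filter like this also dims legitimate small highlights, so it trades some accuracy for a cleaner image.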

Thanks again for taking the time to do this bench.

straaljager
10-Apr-2010, 11:18
Forgot to mention: octane performance is about 3.20 - 3.70 MSamples/sec for the spaceship scene on a GTX260 (same demo version, same camera view). So GTX480 is about 4.5x faster than GTX260 in path tracing.
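
The ratio can be checked from the figures posted in this thread (15.96 MSamples/sec on the GTX480 against 3.20 to 3.70 on the GTX260, both for the spaceship scene). A minimal sketch in Python; the variable names are only for illustration:

```python
gtx480 = 15.96                        # MSamples/sec, spaceship scene, v0.7a
gtx260_low, gtx260_high = 3.20, 3.70  # MSamples/sec, same scene and version

speedup_min = gtx480 / gtx260_high
speedup_max = gtx480 / gtx260_low
print(f"{speedup_min:.1f}x to {speedup_max:.1f}x")  # 4.3x to 5.0x
```

The "about 4.5x" figure sits in the middle of that range.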

CarstenS
10-Apr-2010, 12:35
Ah - good to hear that my card isn't broken then! :)
And it's also interesting to see that Nvidia's claims wrt raytracing/pathtracing performance are somewhat reproducible with code that's not their own, too!

trinibwoy
10-Apr-2010, 12:42
Works fine on my 285, and I'm only on 196.34 drivers so I'm not sure why 197.13 is required. Anyway, the 1.02 version is much faster than the 0.7 on both scenes.

Spaceship
v0.7 - 7.3 MSamples/sec
v1.02 - 17.2 MSamples/sec

Chess
v0.7 - 6.0 MSamples/sec
v1.02 - 13.4 MSamples/sec

[edit] upgraded to 197.13 and now v1.02 crashes on startup, restored 196.34 and it's good again.

straaljager
10-Apr-2010, 13:07
Works fine on my 285, and I'm only on 196.34 drivers so I'm not sure why 197.13 is required. Anyway, the 1.02 version is much faster than the 0.7 on both scenes.

Spaceship
v0.7 - 7.3 MSamples/sec
v1.02 - 17.2 MSamples/sec

Chess
v0.7 - 6.0 MSamples/sec
v1.02 - 13.4 MSamples/sec

[edit] upgraded to 197.13 and now v1.02 doesn't work.

Hi, the sudden boost in performance you see in v.1.02 is because this version uses direct lighting to render instead of pathtracing by default (v.07a uses pathtracing by default).

To change to pathtracing, you must double click on the button "Preview Configuration" in the Graph editor, then click once on the button "Mesh Preview Kernel" and in the right-panel change "directlighting" to "pathtracing" in the drop-down list. You'll see that Megasamples/sec will halve.

trinibwoy
10-Apr-2010, 13:13
Hi, the sudden boost in performance you see in v.1.02 is because this version uses direct lighting to render instead of pathtracing by default (v.07a uses pathtracing by default).

To change to pathtracing, you must double click on the button "Preview Configuration" in the Graph editor, then click once on the button "Mesh Preview Kernel" and in the right-panel change "directlighting" to "pathtracing" in the drop-down list. You'll see that Megasamples/sec will halve.

Ah, thanks :) I'm still seeing higher performance though (slightly in chess but significant in spaceship). Are there any other settings that differ between versions that could affect performance?

chess - 6.13 MSamples/sec
spaceship - 11.7 MSamples/sec

straaljager
10-Apr-2010, 13:28
Ah, thanks :) I'm still seeing higher performance though (slightly in chess but significant in spaceship). Are there any other settings that differ between versions that could affect performance?

chess - 6.13 MSamples/sec
spaceship - 11.7 MSamples/sec

There have been some performance improvements in v.1.02 over v.07a and mostly extra features, but the path tracing algorithm has stayed the same afaik, so the render output of the default scene should be identical.

cho
27-Apr-2010, 19:13
new octane demo version :

http://we.pcinlife.com/attachments/forumid_206/100428015266f0a571f715dec8.jpg

straaljager
28-Apr-2010, 07:51
Wow, 24 MSamples/s :grin:. But I don't think you're using path tracing. There's a video that shows how to set up this benchmark scene here: http://vimeo.com/10699771 . You should use physical sun and change the material to glossy. When all the changes are applied and path tracing is used the samples/s on the gtx480 should be 4.1 Msamples/s.

Dade
28-Apr-2010, 09:25
new octane demo version :

Is it just me or are there a load of artefacts (i.e. black triangles) :?:

straaljager
28-Apr-2010, 13:01
You're right, Dade. Weird, it doesn't happen on my 8600M GT. The benchmark scene can be downloaded from http://www.francescolegrenzi.com/extra/60_GPGPU/---GPU_Benchmark_v0.1_Octane_v1.02b2---.rar if anyone wants to try.

straaljager
28-Apr-2010, 13:18
I've found the error: you have to check the box "Enable" at "State" to get rid of the black triangles. Watch this video around 0:15 http://www.vimeo.com/10699771

Arnold Beckenbauer
03-May-2010, 21:36
ATI OpenCL SDK 2.01 has a known problem with the HD48xx family. According to a post in the ATI forums, it will be fixed in the next SDK release.

However, for the moment, the only solution is to downgrade to SDK 2.0 :???:

Works fine with Stream SDK 2.1. :smile:

Dade
04-May-2010, 11:40
Works fine with Stream SDK 2.1. :smile:

Oh, good news :wink: