G80/CUDA for raytracing?

Very interesting paper, it confuses me wrt your earlier comment though!
We currently are talking ~19Mray/s on an X1900XTX (Conference room, shadow rays), and about the same on a G80 with DirectX and the current state of the drivers and shader compilers.
The R580 runs four-wide in your implementation, while the G80 would be at full speed with scalar code. Are you saying the state of the G80 drivers was so bad that it wasn't faster, even though its coherence requirements were 6 times lower than the R580's?! :eek:
 
Are you saying the state of the G80 drivers was so bad that it wasn't faster, even though its coherence requirements were 6 times lower than the R580's?! :eek:
Yeah, it kind of confuses me that the G80 is slower as well; even in the most R580-friendly cases that I've tested, the G80 wins, excepting CTM of course, although CUDA promises something similar.
 
The coherence isn't that much better (32 vs 48), and we have high register requirements: 31 packetized, and ~20 non-packetized. Register pressure does seem to affect performance on G80, but it is more difficult to discern where things degrade than on previous boards, and the number of available registers appears to be higher than on previous boards. We have parts of the code that are still not completely scalar, even when running non-packetized. The G80 does do a little better than the R580 when running the non-packetized version. We also have issues with compilers, as commented on in the paper.

I should also mention that since we can actually see the raw R5XX assembly hitting the R580, we know they are doing an amazing job of making use of the pre-adders on our code. Remember that if you add the pre-adders into the R580's flop rating, you approach the G80's computational power. G80 seems to have a slightly higher overhead for branching, and definitely has a high startup cost (seems to be shader compilation), but the latter is amortized out in our testing since we don't time the first frame rendered. We have other GPGPU apps where we have seen a similar trend of the G80 underperforming expectations while often still running faster than the R580, but some, like those that are purely scalar, are sometimes performing >2X the R580 (as expected, really).
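For a rough sanity check of that flop claim (the unit counts and clocks below are the commonly quoted X1900XTX / 8800GTX retail figures, i.e. my assumptions, not numbers from the paper):

[code]
/* Back-of-the-envelope check of the pre-adder claim above. Unit counts and
 * clocks are the commonly quoted retail-board figures (assumptions). */
#include <stdio.h>

int main(void) {
    /* R580: 48 pixel shader processors @ 650 MHz; a 4-wide MAD is 8 flops,
     * and counting the 4-wide pre-add adds another 4 flops per clock. */
    double r580_mad_only = 48 * 8  * 0.65;   /* ~250 GFLOPS */
    double r580_with_pre = 48 * 12 * 0.65;   /* ~374 GFLOPS */

    /* G80: 128 scalar SPs @ 1.35 GHz; MAD = 2 flops (ignoring the MUL co-issue). */
    double g80_mad = 128 * 2 * 1.35;         /* ~346 GFLOPS */

    printf("R580, MAD only:    %.0f GFLOPS\n", r580_mad_only);
    printf("R580 + pre-adders: %.0f GFLOPS\n", r580_with_pre);
    printf("G80, MAD:          %.0f GFLOPS\n", g80_mad);
    return 0;
}
[/code]

So without the pre-adders the R580 sits well below the G80's MAD rate, and with them counted it ends up in the same ballpark, which is the point above.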

Regardless, I expect that as the G80 drivers and compilers mature, we'll begin to see improvements in the quality of the shader code produced and in how the board is being scheduled. As seen from the initial GPUBench tests, driver and compiler revisions can have a large impact on things. Drivers before the 100 series perform 25-30% slower on compute-intensive tasks than the 100 series drivers we used. Those drivers give close to the expected performance for bandwidth, compute, and latency hiding based on available information. I think there is still some shaking out going on in the shader compiler and scheduler, and I expect to see improvements in the future. We are also running through the DX9 and GL paths, and not using CUDA. In theory, CUDA might give us finer control over the code hitting the board, but as hinted at in the paper, it doesn't solve our divergence issues.
 
We have other GPGPU apps where we have seen a similar trend of the G80 underperforming expectations while often still running faster than the R580, but some, like those that are purely scalar, are sometimes performing >2X the R580 (as expected, really).
Yeah, that's odd. In almost all of the examples that we've tried, the G80 outperforms the R580 by at least 2x - in one benchmark, as much as 10x. There are a few cases where the race is closer, but the G80 is always at least slightly faster.

Interesting to hear your results. I suspect things will improve with drivers, etc. as well.
 
How "big" are your shaders/apps and how many registers do you use? And, what drivers are you running? We are running the latest official public XP drivers and using DX9. Also, are your apps bandwidth bound into the shader, or can be on R580? Do you use filtering on the textures? G80 has much more bandwidth from texture that R580 unless you use R580's fetch4 features, which are tricky to use.
 
How "big" are your shaders/apps and how many registers do you use?
Don't know the precise register counts, but the different benchmarks vary widely in shader size and complexity.

And, what drivers are you running? We are running the latest official public XP drivers and using DX9.
Latest drivers, using OpenGL on Linux and Windows. Admittedly ATI doesn't have terribly good OpenGL drivers (especially on Linux), but after working around some bugs, etc., the R580 numbers are reasonably good and comparable to a 7900GTX when the latter isn't getting killed by branching, etc.

Also, are your apps bandwidth bound into the shader, or can they be on R580?
Some of them are bandwidth bound, but some are certainly not. There's a good range of tests.

Do you use filtering on the textures? G80 has much more bandwidth from texture than R580, unless you use R580's fetch4 feature, which is tricky to use.
No texture filtering in any of the current benchmarks, I don't think.
 
Yes, ATI's GL support isn't the best in the world. I'm hoping this will improve, but until GL catches up to full DX9/10 functionality, it's becoming less useful. I'm going to try to rerun the HMMer code on G80 soon and see where it stands, as that app is bandwidth bound on R580.
 
Interesting, I just reran the GPUBench numbers for DX with the 97.92 drivers, and the numbers are lower (and show a different pattern) than the GL results with the 100 series drivers: I see a ~10% decrease in ADD/MUL, and a 25% drop in MAD rate. I'll try the newest beta/leak/"whatever you want to call them" drivers and see if the results go back up. The 97.92 drivers seem to have quite poor GL performance for GPGPU. What drivers are you running exactly? This may be telling, as all our GPGPU tests thus far have used the DX path because we've been having compiler issues with Cg with large numbers of registers and complex looping/branching chains.
 
What drivers are you running exactly?
That I don't know exactly... it would have been a 9-series driver. In any case I'm quite sure that the drivers are pretty fragile right now, and in a few months they should probably be a lot better. Still, by then we'll hopefully have the R600, albeit probably with similarly messed-up drivers ;)
 
The coherence isn't that much better (32 vs 48), and we have high register requirements: 31 packetized, and ~20 non-packetized.
20 scalar registers, or 20 Vec4 ones? The former shouldn't be a problem, while the latter would be. Let us consider it this way: the G80 is divided in 16 parts, each with its own register file and parallel data cache. Since the latter is presumably 16KiB, you'd assume the former to be between 16KiB and 64KiB, so let's guesstimate 32KiB.

So, taking a total temporary storage area of 48KiB, with 20 scalar registers, you can have 600+ threads running, which is more than enough. If those were Vec4, you could only have 150 threads running, which might slightly degrade performance already. If the register file was only 16KiB, then you'd only be able to run 100 threads. If it was more, it would make it less of a problem, of course.
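Spelling out that arithmetic (a quick sketch assuming fp32, i.e. 4 bytes per scalar register and 16 bytes per Vec4, and treating the guessed register file plus the 16KiB PDC as one pool):

[code]
/* Thread-count arithmetic from above: storage pool divided by the per-thread
 * register footprint. The 48KiB/32KiB pool sizes are the guesstimates from
 * the post, not confirmed numbers. */
#include <stdio.h>

static int max_threads(int pool_bytes, int regs, int bytes_per_reg) {
    return pool_bytes / (regs * bytes_per_reg);
}

int main(void) {
    printf("48KiB pool, 20 scalar regs: %d threads\n", max_threads(48 * 1024, 20, 4));  /* 614 */
    printf("48KiB pool, 20 Vec4 regs:   %d threads\n", max_threads(48 * 1024, 20, 16)); /* 153 */
    printf("32KiB pool, 20 Vec4 regs:   %d threads\n", max_threads(32 * 1024, 20, 16)); /* 102 */
    return 0;
}
[/code]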

Anyway, this is all theoretical, and assumes the compiler does its job perfectly, which I very much doubt at this point ;) If you were indeed using 20+ Vec4 registers, and none of that were optimizable, the compiler could still help by finding independent instructions, which would increase the effective latency tolerance. I'm sure that still has some room for improvement in the future.

As for coherence - unless I am missing something, 32 rays non-packetized vs 48 packets of 4 rays (effectively 192 rays) packetized is a 6x difference. So that doesn't look to me like it'd be the primary problem anymore - although if the actual branching instructions were significantly more expensive than on R580, that might indeed be problematic...
 
18 float4's via fxc, 16 if written by hand. I was comparing non-packetized granularity on both. You are correct that running packets increases the effective branch granularity by 4, as stated in the paper. As can be seen, as you increase ray divergence and hence instruction divergence, the non-packetized ray tracer is actually faster on R580, and is on G80 as well.
 
18 float4's via fxc, 16 if written by hand.
Ouch! :) Assuming a 48KiB register file, that's 192 threads (16 Vec4 registers × 16 bytes = 256 bytes per thread; 48KiB / 256 = 192), which is borderline I guess. I'm not sure how it'd affect performance with likely very unpredictable data fetches...
R580 certainly still has a few tricks in its bag, given how nicely it seems to handle that, anyway!
As can be seen, as you increase ray divergence and hence instruction divergence, the non-packetized ray tracer is actually faster on R580, and is on G80 as well.
Shouldn't the G80 *always* be (a fair bit) faster with the non-packetized version than with the packetized version, though? Since it has no benefit from Vec4 instructions in theory. Although, in practice, it could use those Vec4 instructions to extract instruction level parallelism and improve effective latency tolerance...

Based on the limitations discussed, it seems to me that G80 should be ridiculously good here, although future parts with higher ALU ratios would obviously help, should these exist. I'm very curious as to whether this is mostly a software problem right now, or if hardware is actually also a major limitation for whatever reason. Hmmm!
 
The packetized version's loops are <4X the size of the non-packetized ones, so the packetized code is "tighter": we can reuse the math for the kd-tree traversal and triangle intersection across all the rays in the packet if they don't diverge. So, the non-packetized version isn't always better, as we can save work in the optimal cases. Also, the non-packetized code is not fully scalar, as we are still dealing with multi-component vectors, but not all of the code is 4-wide. G80 does do a little better than R580 in non-packetized mode, just not leaps and bounds - we are talking <10%. This is all vanilla DX. If we compare against the hand-optimized version we did by hand-writing assembly and using CTM for the packetized version (as mentioned in the paper), ATI is a good deal faster for eye rays and shadows (fxc is convinced that the shader requires >32 registers, and the current ATI compilers generate incorrect code if we patch the ps30 asm by hand). So FXC plus the vendor compilers aren't doing the best job in the world, but it's a big shader.
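To give a rough idea of where that saving comes from, the packet traversal amounts to something like the sketch below. This is simplified, hypothetical code, not our actual shader: the node layout, the fixed-size stack, the assumption of coherent ray directions (shared near/far children) and the omitted per-child interval clipping are all illustration-only choices.

[code]
/* Sketch of packetized kd-tree traversal: the node fetch, child selection
 * and stack bookkeeping are shared by the 4 rays of a packet as long as
 * they agree on which children to visit. */
typedef struct {
    float split;        /* split-plane position                        */
    int   axis;         /* 0/1/2 = x/y/z; -1 marks a leaf              */
    int   near_child;   /* assumes coherent ray directions, so the     */
    int   far_child;    /* near/far children are shared by the packet  */
} Node;

typedef struct {
    float org[4][3], inv_dir[4][3];   /* 4 rays per packet */
    float tmin[4], tmax[4];
} Packet;

void traverse_packet(const Node *nodes, int root, Packet *p,
                     void (*visit_leaf)(const Node *, Packet *))
{
    int stack[64], sp = 0;            /* assumed deep enough for the tree */
    int ni = root;
    for (;;) {
        const Node *n = &nodes[ni];
        if (n->axis < 0) {                     /* leaf: shared triangle setup */
            visit_leaf(n, p);
            if (sp == 0) return;
            ni = stack[--sp];
            continue;
        }
        /* One split distance per ray, but only one traversal decision per packet. */
        int want_near = 0, want_far = 0;
        for (int r = 0; r < 4; ++r) {
            float d = (n->split - p->org[r][n->axis]) * p->inv_dir[r][n->axis];
            if (d > p->tmin[r]) want_near = 1;
            if (d < p->tmax[r]) want_far = 1;
        }
        if (want_near && want_far) {           /* rays straddle or disagree:   */
            stack[sp++] = n->far_child;        /* visit both children, and the */
            ni = n->near_child;                /* per-packet saving shrinks    */
        } else {
            ni = want_near ? n->near_child : n->far_child;
        }
    }
}
[/code]

As the rays diverge, more and more nodes hit the "visit both children" path, which is exactly when the non-packetized version starts winning.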

As I said, I think things will get better as the drivers and compilers mature. Also, redesigning the code from the ground up for a scalar architecture may help, but you'd expect the compiler to do a good job unrolling optimized vector code into good scalar code. With the state of compilers for GPUs, though, I'm not sure I'm willing to bet on it.

--

On a different note, we just reran ClawHMMer on G80 with the 100 series drivers and it's getting ~2.2X over the R580 (DX9). This app is bandwidth bound on R580 without fetch4/CTM, doesn't use lots of registers, and has chunks of code that are scalar by nature. So that is the current hmmsearch performance record, I believe. Cool. This fits with Andy's results. Some of our other apps are still not up to speed, but those are massively compute bound and/or use large register files.

But, this thread has devolved into a GPGPU discussion and should probably be split off and moved to the GPGPU section.
 
Let us consider it this way: the G80 is divided in 16 parts, each with its own register file and parallel data cache. Since the latter is presumably 16KiB, you'd assume the former to be between 16KiB and 64KiB, so let's guesstimate 32KiB.
According to Buck_NVIDIA_Cuda.pdf, the PDC is 16KB per cluster, i.e. 8KB per 1/16th part.

As for coherence, surely the ideal is to issue 16-object threads in G80, i.e. as vertices, rather than pixels. Obviously that requires D3D10 or CUDA.

I wonder if there's a way of reducing the coherency problem by sacrificing some parallelism - effectively making each thread consist of fewer objects, i.e. leaving some "null". You might tweak the sacrificial threads according to tree depth or something.

Still haven't gotten round to reading the paper though...

Jawed
 
20 scalar registers, or 20 Vec4 ones? The former shouldn't be a problem, while the latter would be. Let us consider it this way: the G80 is divided in 16 parts, each with its own register file and parallel data cache. Since the latter is presumably 16KiB, you'd assume the former to be between 16KiB and 64KiB, so let's guesstimate 32KiB.

Where do you get "divided in 16 parts" from? All the diagrams I've seen show 8 clusters * 16 ALUs/cluster = 128 ALUs.

From this CUDA presentation it says the Parallel Data Cache is "As fast as registers" and confirms the 16KiB number. If you have a chunk of shared memory that's as fast as your registers, why would you make it exclusive of normal registers? For shaders that don't use the entire PDC the only thing that makes sense is that it is somehow shared with the normal register file. My guess is either you have a larger register file (like 32KiB or 64KiB) from which 16KiB can be shared, or you only have 16KiB of memory and the RF and PDC are one and the same.

Once CUDA comes out this should be fairly easy to test: write a shader with lots of registers, one with lots of shared memory, and one with lots of both. See how everything scales as either or both are increased.
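Something along these lines, maybe (a hypothetical CUDA-style sketch written from the presentation alone, so the details are assumptions; the idea is just to dial register and shared-memory pressure independently and watch how occupancy and throughput scale):

[code]
/* Hypothetical register-vs-shared-memory pressure test. REGS and SMEM_FLOATS
 * are the knobs; build variants with different values and compare. */
#define REGS        24      /* approximate live registers per thread  */
#define SMEM_FLOATS 2048    /* shared-memory floats per block (8KB)   */

__global__ void pressure_test(const float *in, float *out, int n)
{
    __shared__ float scratch[SMEM_FLOATS];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc[REGS];

    /* Fully unrolled, constant-indexed accesses so acc[] stays in registers
     * instead of spilling to local memory. */
#pragma unroll
    for (int i = 0; i < REGS; ++i)
        acc[i] = in[(tid + i) % n];

    /* Touch shared memory so the shared (PDC) allocation is real. */
    scratch[threadIdx.x % SMEM_FLOATS] = acc[0];
    __syncthreads();

    float s = scratch[(threadIdx.x + 1) % SMEM_FLOATS];
#pragma unroll
    for (int i = 0; i < REGS; ++i)      /* keep every acc[i] live to the end */
        s += acc[i] * acc[(i + 1) % REGS];

    if (tid < n) out[tid] = s;
}
[/code]

If the register file and the PDC really are one and the same 16KiB, the register-heavy and shared-memory-heavy variants should start choking each other; if they're separate pools, they shouldn't.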
 
It seems PDC acts as a cache for the register file, i.e. reads and writes against the register file have to progress via PDC. PDC appears to translate the memory formatting of the register file (which has to be able to cater for 4096 registers being defined for a pixel - though how it would physically cope in this situation is another matter :!: ) into the source operands and destination of an instruction (a maximum of 3 source operands and 1 destination).

If you take a 4xfp32 MAD for 32 pixels, that's 48 bytes of source operands per pixel (3 x 16 bytes), or 1536 bytes for 32 pixels. Add in the 4xfp32 destination for each pixel and you get 2048 bytes. That's just one thread.

Each cluster actually executes two distinct threads in parallel, so each half-cluster has access to 8KB of PDC. Which is enough for 4 such threads, per half-cluster, to have their data in PDC for a MAD instruction.

Jawed
 
mhouston said:
Cell is actually a raytracing monster, compared to other non-custom architectures, in certain situations. The Saarland folks (and others including Stanford) have Cell raytracers >60Mrays/s for primary rays.
Wow, that's really impressive. I keep wondering whether any future console will have the balls to go with a raytracing-based graphics subsystem, because the rasterization pipeline has already shown too many innate deficiencies. Besides the raytracing hardware prototype shown at a recent SIGGRAPH, it seems the Cell processor also has the potential to do it. In my opinion, Sony should have just reused their PS2 architecture to make the PS3: RSX only for rasterization, pixel shading and ROPs, and let the SPUs handle all the vertex computing, animation, physics and even raytracing. Of course that would require fast data access between the SPUs and RSX, like VU->GS.

I believe the smoke demo from the G80 launch was marketed as "raytraced".
A few years ago I wrote a similar 3D smoke demo based on the Navier-Stokes equations, on a 6800 card. :) It ran at about 50fps for simulating and rendering the smoke object only. The most expensive part is not the simulation, but the volumetric lighting. I implemented slice-based volume rendering at first, then tried to move on to raytraced volume rendering. Unfortunately the result was frustrating: hardware raytraced volume rendering tended to show more artifacts than slice-based methods, and there was almost no good way to calculate the volumetric lighting. So I guess that NVIDIA smoke demo is based on raytraced volume rendering - that is, tracing a pixel through a volume texture and accumulating sampled values along the ray.
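For reference, the accumulation I mean is basically front-to-back compositing along the ray - roughly the sketch below (CPU-style code with made-up helpers, and a nearest-neighbour lookup standing in for the filtered 3D texture fetch a shader would do):

[code]
/* Simplified raytraced volume rendering: march a ray through a density
 * volume and accumulate colour/opacity front to back. Layout and helper
 * names are made up for illustration. */
typedef struct { float r, g, b, a; } Color;

static float sample_density(const float *vol, int dim, float x, float y, float z)
{
    /* Nearest-neighbour lookup, coordinates in [0,1); a real renderer would
     * use a trilinearly filtered 3D texture. */
    int xi = (int)(x * dim), yi = (int)(y * dim), zi = (int)(z * dim);
    if (xi < 0 || yi < 0 || zi < 0 || xi >= dim || yi >= dim || zi >= dim)
        return 0.0f;
    return vol[(zi * dim + yi) * dim + xi];
}

Color march_ray(const float *vol, int dim, const float org[3], const float dir[3],
                int steps, float step_len)
{
    Color acc = { 0.0f, 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < steps && acc.a < 0.99f; ++i) {   /* early out when opaque */
        float t = i * step_len;
        float d = sample_density(vol, dim,
                                 org[0] + t * dir[0],
                                 org[1] + t * dir[1],
                                 org[2] + t * dir[2]);
        float a = d * step_len;                 /* opacity of this segment      */
        if (a > 1.0f) a = 1.0f;
        float w = (1.0f - acc.a) * a;           /* front-to-back compositing    */
        acc.r += w; acc.g += w; acc.b += w;     /* grey smoke: colour ~ density */
        acc.a += w;
    }
    return acc;
}
[/code]

The step size relative to the volume resolution is the usual source of banding artifacts with this approach, and the lighting term per sample is where it gets expensive.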
 